Data Intensive Analysis and Computing (DIAC) Lab
Faculty: Keke Chen, Amit Sheth
Recent advances in computing, communication, and digital storage technologies have made enormous volumes of data accessible remotely across geographical and administrative boundaries. For example, the number of web pages on the internet is growing at an astonishing speed; search engines receive tens of millions of requests per day, and all user clickthrough activities are recorded and analyzed; online social networks involve numerous users around the world and keep track of their interactions; scientific simulations generate gigabytes of data in seconds; credit card transactions happen everywhere at every second, and all are recorded, monitored, and analyzed for fraud detection. There is an increasing demand for summarizing, understanding, monitoring, learning from, and collaboratively mining large distributed data stores. In the DIAC lab, we study research problems related to large, distributed, and shared datasets. The following are some sample projects.
-
Visual Analytics and Cluster Analysis
Large datasets are also characterized by high complexity and uncertainty. Clustering is an effective tool for understanding this complexity and uncertainty. In the DIAC lab, we investigate novel techniques that combine visual analytics and statistical analysis to help better understand the clustering patterns in large datasets. In particular, we are interested in visually exploring and validating clustering patterns in large multi-dimensional datasets (VISTA, iVIBRATE), finding the optimal number of clusters in categorical (ACE and BestK) and transactional datasets (Weight Coverage Density and DMDI), and monitoring the change of clustering patterns in categorical data streams (CatStream).
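
As a rough illustration of how the number of clusters in categorical data can be judged, the sketch below scores a partition by its size-weighted within-cluster entropy and compares a few candidate values of k. This is a generic entropy criterion chosen for illustration, not the ACE or BestK algorithms themselves; the function names, the toy records, and the candidate labelings are all hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical column."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def expected_entropy(records, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    Lower values mean the records inside each cluster look more alike.
    """
    clusters = {}
    for record, label in zip(records, labels):
        clusters.setdefault(label, []).append(record)
    n = len(records)
    score = 0.0
    for members in clusters.values():
        # zip(*members) turns the rows of one cluster into its columns.
        per_attribute = sum(column_entropy(col) for col in zip(*members))
        score += len(members) / n * per_attribute
    return score


# Toy categorical records (hypothetical data) with two obvious groups.
records = [
    ("red", "small"), ("red", "small"), ("red", "small"),
    ("blue", "large"), ("blue", "large"), ("blue", "large"),
]
one_cluster = [0, 0, 0, 0, 0, 0]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 0, 1, 1, 2]

for name, labels in [("k=1", one_cluster), ("k=2", two_clusters), ("k=3", three_clusters)]:
    print(name, round(expected_entropy(records, labels), 3))
```

On this toy data the score drops sharply from k=1 to k=2 and does not improve at k=3, which is the kind of flattening one might look for when picking k. Real data would need an actual clustering algorithm to produce the candidate partitions.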
-
Privacy Preserving Computing, Trustworthy Computing
When large datasets are shared across boundaries, privacy and trust become major concerns. In the DIAC lab, we study privacy issues in distributed data-intensive computing, particularly in collaborative data mining. In our initial work, we proposed geometric data perturbation (GDP), which can be used to fully preserve data utility in terms of classification modeling while providing a satisfactory privacy guarantee.
-
Web Science
For large-scale, complicated learning problems, it is very expensive to collect a sufficient amount of labeled training data. Learning to rank in web search is one such problem. There are multiple ways to extend a training dataset, such as leveraging large amounts of unlabeled data (i.e., semi-supervised learning) or searching the unlabeled data for the most effective candidate examples to label (i.e., active learning). In learning to rank, we study novel strategies to enhance the training data. Concretely, we develop new algorithms to utilize pairwise preference training data mined from implicit user feedback (GBRank), to adapt a model trained with a small amount of labeled data to the pairwise preference data (ClickAdapt), and to adapt a ranking function trained on one search domain to another (Trada).
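
To make the idea of pairwise preference data concrete, the sketch below mines preference pairs from a click log and evaluates a scoring function against them. It assumes the common "skip-above" heuristic (a clicked result is preferred over unclicked results shown above it) and a plain hinge loss; both are illustrative stand-ins, not the procedures used in GBRank, ClickAdapt, or Trada. All names and data are hypothetical.

```python
def skip_above_pairs(ranked_docs, clicked):
    """Return (preferred, less_preferred) pairs for one query.

    Assumption: a clicked document is preferred over every unclicked
    document that was shown above it in the result list.
    """
    pairs = []
    for pos, doc in enumerate(ranked_docs):
        if doc in clicked:
            pairs.extend(
                (doc, above) for above in ranked_docs[:pos] if above not in clicked
            )
    return pairs


def pairwise_hinge_loss(score, pairs, margin=1.0):
    """Mean hinge loss over pairs.

    A pair costs nothing once the preferred document outscores the
    other one by at least `margin`.
    """
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (score[a] - score[b])) for a, b in pairs) / len(pairs)


# One hypothetical query: four results shown in order, the third was clicked.
ranked = ["d1", "d2", "d3", "d4"]
clicked = {"d3"}
pairs = skip_above_pairs(ranked, clicked)
print(pairs)

# Hypothetical model scores; a ranker would be trained to drive this loss down.
scores = {"d1": 0.2, "d2": 0.1, "d3": 1.5, "d4": 0.0}
print(pairwise_hinge_loss(scores, pairs))
```

Pairs of this form carry only relative judgments, so they can be collected at scale from logs without editorial labels, which is why they are attractive as additional training data.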
-
Learning from Large and Evolving Datasets
-
Cloud-supported Data Management and Mining