Data Intensive Analysis & Computing Lab

Joshi Research Center, Room 364

Recent advances in computing, communication, and digital storage technologies have made enormous volumes of data accessible remotely across geographical and administrative boundaries. There is a growing demand for summarizing, understanding, monitoring, learning from, and collaboratively mining large, evolving, and possibly private data stores. In the DIAC lab, we study the research problems and applications related to such large datasets. For more about the lab's activities, read our informational materials.


Keke Chen & Amit Sheth



  • Confidential Outsourced Data Mining
    Most data mining tasks require a good understanding of the mining techniques, time-consuming parameter tuning, algorithm tweaking, and possibly algorithm innovation. They are often resource-intensive and may need large-scale parallel processing infrastructures for large datasets. As a result, data owners who lack sufficient computing resources or data-mining expertise cannot effectively mine their data.

    The recent development of cloud computing, service computing, and crowdsourcing enables several possible outsourced data mining solutions: cloud mining, data mining as a service, and crowdsourced mining. Despite their known benefits, these outsourcing approaches all require the data owner to export the data, which causes several problems.

    • The outsourced data may contain sensitive information, such as business secrets or private information. For example, privacy concerns forced Netflix to suspend the Netflix Prize II competition, which is a typical crowd-mining example.
    • Data ownership is not protected. In crowdsourced mining, the dataset can be accessed and used by anybody, while in cloud mining and outsourced mining, compromised service providers or malicious employees of the service provider can distribute the data without permission.
    • Confidentiality and ownership of the resultant models are not protected. Similarly, adversaries can learn the data mining models or distribute them without permission. These models may also contain information that works against the data owners' interests.

    As the success of data-driven research or business heavily depends on the data, owners of sensitive data cannot use the outsourcing solutions yet. Therefore, preserving the confidentiality of data and models for outsourced data mining is an urgent task.

    The objective of this project is to explore and understand new approaches to practical confidential outsourced mining. We propose to explore two different paths to obtain high confidentiality and efficiency for learning high-quality models. (1) The first path is to use more efficient cryptographic primitives to develop confidential data mining algorithms. Well-known approaches such as fully homomorphic encryption and garbled circuits are too expensive to be practical. (2) The second path is to study confidential learning methods that work with data protection methods preserving limited data utility, with a focus on perturbation methods.

    When studying these approaches, a critical task is to understand the intricate trade-offs among multiple factors (confidentiality, costs, and model quality) and to develop methods to tune and balance these factors. This task will also consider the context of practical outsourced mining, which places unique requirements on the scalability of server-side algorithms and calls for minimal involvement of the client-side system.

    Funding: Partially supported by NSF Grant 1245847, and an Amazon AWS Research Grant.
    People: Prof. Keke Chen (PI)
    Graduate Students: Sagar Sharma, Shumin Guo, Huiqi Xu, James Powers

  • Data Analytics with the Cloud

    Data clouds, consisting of hundreds or thousands of cheap multi-core PCs and disks, are available for rent at low cost (e.g., Amazon EC2 and S3 services). Many cloud-based applications generate a large amount of data in the cloud, which in turn needs to be processed with cloud-based data analytics tools. Powered by a distributed file system (e.g., the Hadoop Distributed File System) and the MapReduce programming model, the cloud becomes an economical and scalable platform for performing large-scale data analytics. We study a visual cluster exploration framework (CloudVista) for analyzing large data hosted in the cloud, and a cost model for resource-aware cloud computing.

  • Clustering Large/Streaming Numerical/Categorical Data

    Large datasets are also characterized by high complexity and uncertainty. Clustering is an effective tool for understanding this complexity and uncertainty. In the DIAC lab, we investigate novel techniques that combine visual analytics and statistical analysis to better understand the clustering patterns in large datasets. In particular, we are interested in visually exploring and validating clustering patterns in large, multi-dimensional datasets (VISTA and iVIBRATE), finding the optimal number of clusters in categorical (ACE and BestK) and transactional datasets (weighted coverage density and DMDI), and monitoring the change of clustering patterns in categorical data streams (CatStream).

  • Privacy Preserving Computing, Trustworthy Computing

    When large datasets are shared across boundaries, privacy and trust become major concerns. In DIAC, we study the privacy issues in distributed, data-intensive computing; in particular, privacy preserving OLAP, mining on outsourced data, and privacy preserving multi-party, collaborative data mining. We have proposed geometric data perturbation (GDP), which can be used to fully preserve data utility in terms of classification and clustering modeling, while also providing a satisfactory privacy guarantee. The GDP method can also be applied to privacy preserving, multi-party, collaborative mining (Multi-party GDP). Recent developments have focused on the theoretical study of the family of geometric perturbation methods and its application to privacy preserving OLAP on outsourced data, and on privacy and trust in social networks.
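
    As a rough illustration of why a geometric perturbation can leave clustering and distance-based classification unaffected, the sketch below rotates and translates 2-D points and checks that pairwise distances are unchanged. It is a simplified toy, not the lab's GDP implementation: the function names, the 2-D restriction, and the optional noise parameter are assumptions made for this example.

```python
import math
import random


def geometric_perturb(points, theta, shift, noise_sigma=0.0, seed=0):
    """Rotate 2-D points by theta, translate by shift, optionally add noise.

    Hypothetical helper for illustration only; theta and shift play the
    role of the secret perturbation parameters kept by the data owner.
    """
    rng = random.Random(seed)
    c, s = math.cos(theta), math.sin(theta)
    perturbed = []
    for x, y in points:
        rx = c * x - s * y + shift[0]
        ry = s * x + c * y + shift[1]
        if noise_sigma > 0:
            # Noise trades a little utility for extra protection.
            rx += rng.gauss(0, noise_sigma)
            ry += rng.gauss(0, noise_sigma)
        perturbed.append((rx, ry))
    return perturbed


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]


original = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
perturbed = geometric_perturb(original, theta=1.1, shift=(5.0, -2.0))

# With no noise, rotation plus translation keeps every pairwise distance.
before = pairwise_distances(original)
after = pairwise_distances(perturbed)
distances_preserved = all(abs(a - b) < 1e-9 for a, b in zip(before, after))
```

    With noise_sigma left at zero the distances match up to floating-point error, so any model that depends only on Euclidean distances sees the same geometry. A nonzero noise_sigma would perturb those distances slightly.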

  • Web Science: Ranking and Adaptation

    For large-scale, complicated learning problems, it is very expensive to collect a sufficient amount of labeled training data. Learning to rank in web search is one such problem. There are multiple ways to extend training datasets, such as leveraging a large amount of unlabeled data (i.e., semi-supervised learning), or searching over the large amount of unlabeled data to find the most effective candidate examples for labeling (i.e., active learning). In learning to rank, we study novel strategies to enhance the training data. Concretely, we develop new algorithms to utilize pairwise preference training data mined from implicit user feedback (GBRank), to adapt a model trained with a small amount of labeled data to the pairwise preference data (ClickAdapt), and to adapt a ranking function trained from one search domain to another (Tree Adaptation or Trada). Recent developments include understanding the effectiveness of tree adaptation for ranking, and tree adaptation methods for pairwise data.
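
    To make "pairwise preference data mined from implicit user feedback" concrete, the sketch below derives preference pairs from one search result list and its clicks. It uses a commonly cited "clicked result beats unclicked results ranked above it" heuristic as an assumption; this is not necessarily how the training data for GBRank or ClickAdapt was mined, and the function name is hypothetical.

```python
def preference_pairs(ranked_docs, clicked):
    """Return (preferred, less_preferred) pairs from one result list.

    Assumed heuristic: a clicked document is preferred over every
    unclicked document that was ranked above it.
    """
    pairs = []
    for position, doc in enumerate(ranked_docs):
        if doc not in clicked:
            continue
        for skipped in ranked_docs[:position]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


# The user skipped "a" and "b" and clicked "c".
pairs = preference_pairs(["a", "b", "c", "d"], clicked={"c"})
# pairs == [("c", "a"), ("c", "b")]
```

    Pairs of this form can then serve as training examples for a pairwise ranking learner, which only needs to know which of two documents should rank higher for a query.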