%0 Conference Proceedings
%B 2015 IEEE 8th International Conference on Cloud Computing
%D 2015
%T Scalable Euclidean Embedding for Big Data
%A Zohreh Alavi
%A Sagar Sharma
%A Lu Zhou
%A Keke Chen
%K Algorithm design and analysis
%K Approximation algorithms
%K arbitrary metric space
%K Big Data
%K Big data scale
%K Complexity theory
%K data reduction
%K data visualisation
%K data visualization
%K Euclidean embedding algorithms
%K Euclidean space
%K FastMap-MR algorithm
%K LMDS-MR algorithm
%K massive data parallel infrastructure
%K Measurement
%K parallel algorithms
%K parallel processing
%K Scalability
%K scalable Euclidean embedding algorithm
%K visualization technique
%X Euclidean embedding algorithms transform data defined in an arbitrary metric space to the Euclidean space, which is critical to many visualization techniques. At big-data scale, these algorithms need to be scalable to massive data-parallel infrastructures. Designing such scalable algorithms and understanding the factors affecting them are important research problems for visually analyzing big data. We propose a framework that extends existing Euclidean embedding algorithms to scalable ones. Specifically, it decomposes an existing algorithm into naturally parallel components and non-parallelizable components. Then, data-parallel implementations such as MapReduce and data reduction techniques are applied to the two categories of components, respectively. We show that this can be done for a collection of embedding algorithms. Extensive experiments are conducted to understand the important factors in these scalable algorithms: scalability, time cost, and the effect of data reduction on result quality. The results on two sample algorithms, FastMap-MR and LMDS-MR, show that with the proposed approach the derived algorithms preserve result quality well while achieving desirable scalability.
%I IEEE
%C New York City, NY
%P 773-780
%8 07/2015
%G eng
%M 15399748
%R 10.1109/CLOUD.2015.107
%0 Journal Article
%J Journal of Data Mining and Knowledge Discovery (DMKD)
%D 2010
%T SCALE: a Scalable Framework for Efficiently Clustering Large Transactional Data
%A Hua Yan
%A Keke Chen
%A Ling Liu
%A Zhang Yi
%K Framework
%K Large Data Clusters
%X This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive, and it allows the weight of each item in a cluster to change dynamically according to the occurrences of items. Second, we develop a weighted-coverage-density-based clustering algorithm, a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capturing the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithm: (1) the candidates for the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
%G eng