Institute of Computing Technology, Chinese Academy IR
EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud | |
Zhou, Yongtao1; Deng, Yuhui1,2; Xie, Junjie1; Yang, Laurence T.3 | |
2018-07-01 | |
Journal | IEEE TRANSACTIONS ON CLOUD COMPUTING |
ISSN | 2168-7161 |
Volume | 6, Issue: 3, Pages: 720-733 |
Abstract | The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increase in latency may cause massive losses to enterprises. Similarity detection plays a very important role in data management. Many typical algorithms, such as Shingle, Simhash, Traits, and the Traditional Sampling Algorithm (TSA), are extensively used. The Shingle, Simhash, and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring heavy disk access. In addition, the overhead increases with the volume of the data set and results in long delays. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the positions of the file content, so similarity identification inevitably fails after such a modification. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) that identifies file similarity for the cloud by applying a modulo operation to the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and to bring the detection probability close to the actual probability. Furthermore, this paper describes a query algorithm that reduces the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead and CPU and memory occupation. Moreover, EPAS achieves a better tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud. |
Keywords | Similarity detection; sampling; shingle; position-aware; cloud |
DOI | 10.1109/TCC.2016.2527646 |
Indexed By | SCI |
Language | English |
Funding | National Science Foundation (NSF) of China[61572232] ; National Science Foundation (NSF) of China[61272073] ; Key Program of NSF of Guangdong Province[S2013020012865] ; Fundamental Research Funds for the Central Universities ; Open Research Fund of Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences[CARCH201401] ; Science and Technology Planning Project of Guangdong Province[2013B090200021] |
WOS Research Area | Computer Science |
WOS Category | Computer Science, Information Systems ; Computer Science, Software Engineering ; Computer Science, Theory & Methods |
WOS Accession Number | WOS:000443894000010 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Document Type | Journal article |
Item Identifier | http://119.78.100.204/handle/2XEOYT63/4954 |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
Corresponding Author | Zhou, Yongtao |
Author Affiliations | 1.Jinan Univ, Dept Comp Sci, Guangzhou 510632, Guangdong, Peoples R China 2.Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing 100080, Peoples R China 3.St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada |
Recommended Citation, GB/T 7714 | Zhou, Yongtao,Deng, Yuhui,Xie, Junjie,et al. EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud[J]. IEEE TRANSACTIONS ON CLOUD COMPUTING,2018,6(3):720-733. |
APA | Zhou, Yongtao,Deng, Yuhui,Xie, Junjie,&Yang, Laurence T..(2018).EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud.IEEE TRANSACTIONS ON CLOUD COMPUTING,6(3),720-733. |
MLA | Zhou, Yongtao,et al."EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud".IEEE TRANSACTIONS ON CLOUD COMPUTING 6.3(2018):720-733. |
Files in This Item | There are no files associated with this item. |
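The abstract contrasts sampling at fixed offsets (TSA) with sampling from both the head and the tail of a file whose length has been adjusted by a modulo operation. The record gives no algorithmic details, so the following is only a minimal sketch of that general idea, not the authors' EPAS implementation. The function names, the block size, the `modulus` parameter, the choice of SHA-256 for fingerprints, and the plain Jaccard score (standing in for the paper's improved metric, which is not defined here) are all assumptions made for illustration.

```python
import hashlib
import random


def fingerprint(block: bytes) -> str:
    """Hash one sampled block (SHA-256 is an arbitrary choice for this sketch)."""
    return hashlib.sha256(block).hexdigest()


def fixed_offset_sample(data: bytes, n_blocks: int = 8, block_len: int = 16) -> set:
    """TSA-like baseline: blocks at evenly spaced offsets measured from the file start."""
    stride = len(data) // n_blocks
    return {
        fingerprint(data[i * stride:i * stride + block_len])
        for i in range(n_blocks)
    }


def head_tail_sample(data: bytes, n_blocks: int = 8, block_len: int = 16,
                     modulus: int = 256) -> set:
    """Hypothetical position-aware sampler, assuming len(data) >> n_blocks * block_len.

    The length is rounded down to a multiple of `modulus` so that a small
    edit does not change the stride; half of the blocks are anchored at the
    head and half at the tail, so an edit in between shifts neither group.
    """
    adjusted_len = len(data) - len(data) % modulus
    stride = adjusted_len // n_blocks
    prints = set()
    for i in range(n_blocks // 2):
        head_start = i * stride
        tail_start = len(data) - block_len - i * stride
        prints.add(fingerprint(data[head_start:head_start + block_len]))
        prints.add(fingerprint(data[tail_start:tail_start + block_len]))
    return prints


def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity of two fingerprint sets (not the paper's metric)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Demo: insert 8 bytes into the middle of a pseudo-random 4096-byte "file".
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = original[:2048] + b"\x00" * 8 + original[2048:]

print("fixed offsets:", jaccard(fixed_offset_sample(original),
                                fixed_offset_sample(modified)))
print("head and tail:", jaccard(head_tail_sample(original),
                                head_tail_sample(modified)))
```

In this setup the 8-byte insertion changes the fixed-offset stride, so most of the baseline's sampled blocks no longer line up, while the head-anchored and tail-anchored blocks all fall outside the edited region and keep their fingerprints. An edit that lands inside a sampled block, or a file too short for the chosen stride, would still defeat this simplified version.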