College of Computer Science and Technology, Guizhou University
Print publication: 2017
[1] Wu Yun, Xu Kangzhen, Huang Ruizhang. A text similarity simulation detection model based on Hadoop [J], 2017, 34(03): 308-315. DOI: 10.13568/j.cnki.651094.2017.03.010.
As data volumes multiply in the information age, traditional text similarity detection methods can no longer handle large-scale text data. To address this, the paper proposes a text similarity simulation detection model based on Hadoop cluster technology. The model has three steps. First, an experimental platform is built with Hadoop, and the platform's hardware and software are optimized. Second, documents are converted into sets using an improved Shingling algorithm based on the MapReduce programming model. Third, a distributed New Minhash algorithm is proposed to compute the signature matrix, after which the Jaccard coefficient is used to calculate similarity and select similar documents. Experiments show that, for the same operation, the optimized platform reduces running time by nearly 5.65%. The simulation model not only computes text similarity more accurately but also adapts better to processing large-scale text data on a distributed platform, and it scales well.
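The pipeline named in the abstract (shingling, a MinHash signature, then Jaccard similarity) can be illustrated with a minimal single-machine sketch. This is the textbook form of each technique, not the paper's improved Shingling or its New Minhash algorithm, and it does not use Hadoop or MapReduce. All function names, the shingle length `k=5`, and the 128 hash functions are assumptions chosen for illustration.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime 2^61 - 1, used as the hash modulus (assumed choice)


def shingles(text, k=5):
    """Return the set of character k-shingles of text (k=5 is an assumed default)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stable_hash(shingle):
    """Deterministic integer hash; Python's built-in hash() is salted per process."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs for hash functions h(v) = (a * v + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """One column of the signature matrix: the minimum of each hash over the set."""
    values = [stable_hash(s) for s in shingle_set]
    if not values:
        return [PRIME] * len(params)  # sentinel signature for an empty document
    return [min((a * v + b) % PRIME for v in values) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dogs"
params = make_hash_params()
set_a, set_b = shingles(doc_a), shingles(doc_b)
exact = jaccard(set_a, set_b)
approx = estimate_similarity(minhash_signature(set_a, params),
                             minhash_signature(set_b, params))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {approx:.3f}")
```

The signature has a fixed length regardless of document size, which is what makes the comparison step cheap at scale. In a distributed setting, one plausible arrangement is to compute shingle sets and signatures per document in a map phase and compare signatures in a reduce phase, but that layout is an assumption here, not something the abstract specifies.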
