A Hadoop-Based Text Similarity Simulation Detection Model

Wu Yun; Xu Kangzhen; Huang Ruizhang

doi:10.13568/j.cnki.651094.2017.03.010


Information Science and Technology | Updated: 2026-01-21

- Journal: Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2017, Vol. 34, No. 3, pp. 308-315
- Affiliation: College of Computer Science and Technology, Guizhou University
- Funding: National Natural Science Foundation of China (61462011)
- CLC number: TP391.1
- Print publication: 2017
[1] Wu Yun, Xu Kangzhen, Huang Ruizhang. A Hadoop-based text similarity simulation detection model [J]. Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2017, 34(03): 308-315. DOI: 10.13568/j.cnki.651094.2017.03.010.

Abstract

As data volumes multiply in the information age, traditional text similarity detection methods can no longer handle large-scale text data. To address this, this paper proposes a text similarity simulation detection model based on Hadoop cluster technology. The model has three steps. First, an experimental platform is built with Hadoop, and its hardware and software are optimized. Second, documents are converted into sets using an improved Shingling algorithm based on the MapReduce programming model. Third, a distributed New Minhash algorithm is proposed to compute the signature matrix, after which the Jaccard coefficient is used to calculate similarity and select similar documents. Experiments show that, for the same operation, the optimized platform reduces running time by nearly 5.65%. The simulation model not only computes text similarity more accurately but also adapts better to processing large-scale text data on a distributed platform, and it scales well.
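The second and third steps build on three standard components: shingling, MinHash signatures, and the Jaccard coefficient. The following is a minimal single-process Python sketch of the textbook versions of those components. It is not the paper's improved Shingling or New Minhash algorithm, and it does not use MapReduce. The function names, the shingle length `k=5`, the linear hash family, and the choice of 128 hash functions are all illustrative assumptions.

```python
import random
import zlib

# Assumption: a prime just above 2**32, so every CRC32 shingle id is below it.
PRIME = 4294967311


def shingles(text, k=5):
    """Return the set of character k-shingles of text (whitespace collapsed)."""
    text = " ".join(text.split())
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """One column of the signature matrix: the minimum hash value per function."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    if not ids:
        return [PRIME] * len(params)  # sentinel signature for an empty document
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature rows."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "traditional text similarity methods cannot handle large-scale text data"
    doc_b = "traditional text similarity methods struggle with large-scale text data"
    params = make_hash_params()
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a = minhash_signature(set_a, params)
    sig_b = minhash_signature(set_b, params)
    print("exact Jaccard:    ", jaccard(set_a, set_b))
    print("MinHash estimate: ", estimate_similarity(sig_a, sig_b))
```

Each document's signature depends only on that document's own shingle set and the shared hash parameters. That independence is what would let a distributed version compute signatures for different documents in parallel. How the paper actually partitions this work across Hadoop is not stated in the abstract.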


