Sharing a small tool I use myself: a Python implementation of TF-IDF document similarity

Posted on 2016-3-11 12:03:11

Last edited by 刀心 on 2016-4-7 15:16

I'm about to be out of a job and there isn't much to do at work during the transition, so I've been tinkering with this.

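First, the approach:
1. I pulled the titles of some articles from my database.
2. The titles are segmented into words with jieba.
3. A third-party library (sklearn) computes a term-weight vector for each title (I don't know exactly how it calculates them).
4. For every pair of documents, the cosine similarity is computed from their two vectors, using the standard cosine similarity formula.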

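5. Based on eyeballing the titles against the computed scores, I set a threshold to use as the parameter for similarity recommendations.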
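
For reference, the standard cosine similarity between two term-weight vectors A and B of length n is written out here. It matches what the CosValues function in the script further down computes (Numb is the numerator, Agen and Bgen are the two sums of squares):

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

Because TF-IDF weights are never negative, the value always falls between 0.0 (no shared terms) and 1.0 (same direction).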

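Libraries you need to install:
sklearn, jieba, simplejson, plus a Traditional-to-Simplified Chinese conversion package. You can tweak the code so that this last package isn't needed.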

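Test results:
1. Two identical titles give the maximum value, 1.0.
2. Two completely different titles give the minimum value, 0.0.
3. With 150 titles, the computation takes 0.0xxx seconds, which is acceptable.
4. With 100,000 titles, the computation takes about 40 seconds, which is very slow. If I come up with an optimized version later I'll post it, since picking out similar articles from a large collection is the real requirement.
5. Anything above 0.5 feels fairly similar.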
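
For the large-collection case in point 4, one possible direction is sketched below. This is not the script from this post: it assumes each title's weights are already available as a sparse dict of term -> weight (the names `l2_normalize` and `similar_pairs` are made up for the illustration), and it uses an inverted index so that titles sharing no term are never compared. The 0.5 default comes from point 5.

```python
from collections import defaultdict
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in vec.items()}


def similar_pairs(vectors, threshold=0.5):
    """Return {(i, j): cosine} for every pair i < j scoring >= threshold.

    An inverted index (term -> documents containing it) means documents
    with no term in common are skipped; their similarity is 0 anyway.
    """
    unit = [l2_normalize(v) for v in vectors]

    # Build the inverted index: term -> list of (doc_id, weight).
    index = defaultdict(list)
    for doc_id, vec in enumerate(unit):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))

    # Accumulate dot products term by term. Vectors are unit length,
    # so the finished dot product is the cosine similarity.
    scores = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            i, wi = postings[a]
            for b in range(a + 1, len(postings)):
                j, wj = postings[b]
                scores[(i, j)] += wi * wj

    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    demo = [
        {"python": 1.0, "tfidf": 2.0},
        {"python": 1.0, "tfidf": 2.0},
        {"jieba": 3.0},
        {"python": 1.0, "jieba": 1.0},
    ]
    for pair, score in sorted(similar_pairs(demo).items()):
        print(pair, round(score, 3))
```

A term that appears in a large share of the titles still produces a long posting list, so in practice very common words would probably have to be dropped first. Whether this actually beats 40 seconds on 100,000 titles has not been measured here.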

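Notes on the script package:
1. My site is a Traditional Chinese site, and jieba couldn't segment the titles pulled from it, so I convert them to Simplified Chinese before segmenting. That requires an extra conversion package (the zhtools module imported in the code).
2. At first I only meant to use this myself, so the data gets passed around through JSON files in the middle. That now seems unnecessary, so feel free to change it.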
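
As a rough illustration of note 2, the JSON round trip could be replaced by building the space-joined word strings in memory. The helper below is hypothetical (`titles_to_corpus`, `segment` and `skip` are names invented for this sketch), takes the segmenter as an argument so it does not depend on any particular library, and filters whitespace-only tokens slightly more broadly than the original script does.

```python
def titles_to_corpus(titles, segment, skip=("_", ",", "|")):
    """Turn raw titles into space-joined word strings, one per title.

    titles  -- iterable of title strings
    segment -- callable returning the words of one title (assumed to be
               supplied by the caller, e.g. a jieba-based function)
    skip    -- tokens to throw away, mirroring the filter in the script
    """
    corpus = []
    for title in titles:
        words = [w for w in segment(title) if w.strip() and w not in skip]
        corpus.append(" ".join(words))
    return corpus


if __name__ == "__main__":
    # str.split stands in for a real Chinese segmenter in this demo.
    print(titles_to_corpus(["tf idf demo", "a | b _ c"], str.split))
```

With jieba installed, the call would presumably look like `titles_to_corpus(titles, lambda t: jieba.cut(t, cut_all=True))`, and the returned list has the same shape as the vectorList that the script feeds into CountVectorizer.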

The code is below. I think it can be extended a little and then used directly. The code highlighting here is easier on the eyes.

#!/usr/local/bin/python
#coding=utf-8
# daoxin 2016-3
import json, simplejson, sys, re
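# Python 2 only: make utf8 the default encoding so the str/unicode
# conversions further down work on Chinese text
reload(sys)
sys.setdefaultencoding('utf8')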
import jieba, time
import string
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import math
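# Import the Traditional/Simplified conversion module; download it from git
sys.path.append('/Users/movespeed/Desktop/Python/fanti_jianti')  # change this path to your own
from zhtools.langconv import *
# End of conversion module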
"""
cutWord function: saves the word segmentation results into a JSON file
"""
def cutWord():
    # Open the source file (one JSON object per line)
    f = open('/Users/movespeed/Desktop/Python/title.json')  # change this path to your own
    # The segmentation results are saved into this JSON file
    jiebafile = open('/Users/movespeed/Desktop/Python/jiebafile.json', 'w+')  # change this path to your own
    while 1:
        line = f.readline()
        if not line:
            break
        else:
            title = json.loads(line)['title']
            t_id = json.loads(line)['id']
            tf_idf = json.loads(line)['tf-idf']
            #seg_list = jieba.cut(title, cut_all=True)  # the jieba cut function
            # Convert to Simplified Chinese first, because jieba can't cut Traditional Chinese
            title = Converter('zh-hans').convert(title.decode('utf-8'))
            #print title
            seg_list = jieba.cut(title, cut_all=True)  # the jieba cut function
            seg_list = str(",".join(seg_list))
            seg_list = seg_list.split(',')
            #print seg_list
            result = []
            for seg in seg_list:
                seg = ','.join(seg.split(',')).decode('utf-8')
                # Drop empty tokens, newlines and separator characters
                if (seg != '' and seg != "\n" and seg != "\n\n" and
                        seg != "_" and seg != "," and seg != "|"):
                    result.append(seg)
            jsoninfo = json.dumps({"id": t_id,
                                   "title": title,
                                   "cut_word": result,
                                   "tf_idf": None})
            jiebafile.write(jsoninfo + '\n')
    # Close both files so the output is flushed to disk
    f.close()
    jiebafile.close()
# Testing the cut word function
#cutWord()
"""
vector-values counter function
"""
filelist = open('/Users/movespeed/Desktop/Python/jiebafile.json', 'r')  # change this path to your own
# Turn the JSON information into a list; every item holds the Chinese words of one title separated by single spaces
def ChangeJsonIntoList(filelist):
    vectorList = list()
    for doc in filelist.readlines():
        #jsoninfo = str(json.loads(doc)['cut_word']).replace(',', '')
        jsoninfo = ' '.join(json.loads(doc)['cut_word'])
        #print jsoninfo
        #print type(jsoninfo)
        vectorList.append(jsoninfo)
    return vectorList
# tf_idf function, returns the tfidf array
def Tf_Idf(vectorList):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(vectorList)  # raw term counts per title
    counts = X.toarray()  # dense matrix: memory grows with titles x vocabulary
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(counts)
    tfidf__ = tfidf.toarray()
    return tfidf__
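# Calculating cosine similarity
def CosValues(data1, data2):
    # Uses the module-level vectorList; the whole TF-IDF matrix is rebuilt on every call
    tfidf__ = Tf_Idf(vectorList)
    #print tfidf__  # prints the term-weight matrix for all texts
    #print tfidf__[16]  # the vector of one line of text
    #print tfidf__[17]  # the vector of one line of text
    Numb = 0  # dot product of the two vectors
    Agen = 0  # sum of squares of the first vector
    Bgen = 0  # sum of squares of the second vector
    for v, k in zip(tfidf__[data1], tfidf__[data2]):
        Numb += v * k
        Agen += v**2
        Bgen += k**2
    print Numb / (math.sqrt(Agen) * math.sqrt(Bgen))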
start = time.clock()
# vector-values function testing
vectorList = ChangeJsonIntoList(filelist)
print type(vectorList[0]), 'type check---!==!--!=='
print vectorList[17]  # eyeball text A
#print ' '.join(eval(vectorList[0]))
print '---------'
print vectorList[16]  # eyeball text B
#cutWord()  # run the segmentation step to generate the segmented JSON
CosValues(17, 16)  # compute the similarity of the two texts; the arguments are the (zero-based) line numbers of the texts
end = time.clock()
print end - start  # Take xxx seconds
"""
Test results:
Max value: 1.0
Min value: 0.0
Speed for 150 texts: 0.0xx seconds
Speed for 100,000 texts: 40 seconds
"""
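
On the "I don't know exactly how it calculates them" remark in step 3: to the best of my understanding, with default settings CountVectorizer plus TfidfTransformer use the raw count as tf, idf = ln((1 + n) / (1 + df)) + 1, and then scale every row to unit L2 length. The pure-Python sketch below re-implements just that weighting for already-segmented titles, so the numbers can be inspected by hand. The function names are made up for the illustration, and it does not reproduce CountVectorizer's tokenization (whose default pattern, for instance, drops one-character tokens), so check the scikit-learn docs for your version before relying on it.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)

    # Document frequency: in how many docs does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Smoothed idf, assumed to mirror sklearn's default (smooth_idf=True).
    idf = {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}

    rows = []
    for tokens in docs:
        tf = Counter(tokens)  # raw counts
        row = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in row.values()))
        # L2-normalize so each row has length 1 (empty docs stay empty).
        rows.append({t: w / norm for t, w in row.items()} if norm else {})
    return rows


def cosine(row_a, row_b):
    """Cosine similarity of two unit-length rows, i.e. their dot product."""
    return sum(w * row_b.get(t, 0.0) for t, w in row_a.items())


if __name__ == "__main__":
    docs = [
        ["similar", "article", "recommend"],
        ["similar", "article", "recommend"],
        ["weather", "forecast"],
    ]
    rows = tfidf_rows(docs)
    print(cosine(rows[0], rows[1]))  # identical titles
    print(cosine(rows[0], rows[2]))  # no shared words
```

Since the rows are unit length, the denominator in the cosine formula is 1 and the similarity reduces to a plain dot product over the shared terms.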


Zip package download. Tweak it a bit and you can test it yourself.

tf-idf-api.zip (16.87 KB, downloads: 1589)

Original poster | Posted on 2016-3-11 12:03:29

First reply. Off to eat first.

Original poster | Posted on 2016-3-14 18:53:12

What's going on? Three days and nobody has looked at this. What are all the people who downloaded it up to?

Posted on 2016-3-16 15:34:11

Impressive, I'm learning from you.
