Microblog Retrieval Results Re-ranking Using Graph Model Based Decision

YANG Zhen; ZHANG Guangyuan; FAN Kefeng

doi:10.11936/bjutxb2015090041

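YANG Zhen^1, ZHANG Guangyuan^1, FAN Kefeng^3

1. College of Computer Science, Beijing University of Technology, Beijing 100124, China
2. Beijing Key Laboratory of Trusted Computing, Beijing 100124, China
3. China Electronics Standardization Institute, Beijing 100007, China
4. Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems, Guilin University of Electronic Technology, Guilin 541004, China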

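Foundation items: Beijing Excellent Talents and Beijing Municipal Universities Young Top-notch Talents Program (CIT&TCD201404052); National Key Technology R&D Program of China (2015BAK21B04); Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (15205)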

Corresponding author: FAN Kefeng (born 1978), male, senior engineer, mainly engaged in research on information security. E-mail: fankf@cesi.cn

CLC number: TP39
Publication history
- Received: 2015-09-15
- Published online: 2022-09-09
- Issue date: 2017-01-01

Abstract

Abstract: In microblog retrieval, both the user query and the relevant documents are extremely short texts. The resulting lack of samples makes probabilistic-style models unreliable and degrades retrieval performance. To address this problem, a re-ranking algorithm for microblog retrieval results was studied and implemented: a relational graph model is built from the content similarity between microblogs, and the PageRank algorithm is applied to it to re-rank the retrieval results. Cosine similarity, the Dice coefficient, and the one-way Dice coefficient were compared as text content similarity measures. Experimental results show that re-ranking effectively improves microblog retrieval performance, and that the iterative performance of the graph model depends on the proportion of relevant topics. For this reason, a decision tree based re-ranking step is discussed as a way to remove the influence of non-relevant topics on the ranking.
- microblog retrieval /
- results re-ranking /
- graph model /
- decision tree
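
The pipeline summarized in the abstract (content similarity, graph construction, PageRank re-ranking) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The cosine and Dice measures follow their standard definitions, but `one_way_dice` is an assumed asymmetric form (|A∩B| / |A|), since the one-way Dice coefficient is not defined in this excerpt. The edge `threshold`, the `damping` factor of 0.85, the iteration count, and the toy documents are likewise illustrative assumptions.

```python
import math
from collections import Counter


def cosine_sim(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice_sim(a, b):
    """Dice coefficient on token sets: 2|A∩B| / (|A| + |B|)."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def one_way_dice(a, b):
    """ASSUMED asymmetric variant: |A∩B| / |A| (not defined in the source)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0


def pagerank(weights, damping=0.85, iters=50):
    """Power iteration; weights[i][j] is the weight of edge i -> j."""
    n = len(weights)
    if n == 0:
        return []
    rank = [1.0 / n] * n
    out = [sum(row) for row in weights]
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    new[j] += damping * rank[i] * weights[i][j] / out[i]
        rank = new
    return rank


def rerank(docs, sim=cosine_sim, threshold=0.1):
    """Return (indices sorted by PageRank score, scores) for tokenized docs."""
    n = len(docs)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s = sim(docs[i], docs[j])
                if s >= threshold:  # assumed cut-off for creating an edge
                    weights[i][j] = s
    scores = pagerank(weights)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical tokenized retrieval results for one query.
docs = [
    ["graph", "model", "rank"],
    ["graph", "rank", "tweet"],
    ["weather", "sunny"],
]
order, scores = rerank(docs)
```

With a symmetric measure such as `cosine_sim` or `dice_sim` the weight matrix is symmetric, which corresponds to an undirected graph. Passing `sim=one_way_dice` yields asymmetric weights and therefore a directed graph. In the toy input, the third document shares no terms with the others, so it receives only the teleport and dangling-node mass and ends up last in `order`.
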
The authors have declared that no competing interests exist.

Figure 1. Undirected graph model of microblogs

Figure 2. Directed graph model of microblogs

Figure 3. Framework of microblog retrieval system

Table 1. Attributes of tweet search results

tweet id	Similarity to query/%	Attention present	Relevant
0001	60	1	Y
0002	80	0	Y
0003	65	0	N
︙	︙	︙	︙
N	15	1	Y
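
To show how attributes of the kind listed in Table 1 could feed a decision tree, the following Python sketch computes the information gain of two candidate splits. It is a simplified illustration of the splitting criterion used by entropy-based decision trees, not the authors' classifier. The first four rows copy the visible rows of Table 1; the last two rows, the 50% similarity threshold, and all function names are invented for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def info_gain(rows, labels, test):
    """Entropy reduction obtained by splitting rows with a boolean test."""
    left = [y for r, y in zip(rows, labels) if test(r)]
    right = [y for r, y in zip(rows, labels) if not test(r)]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Toy rows: (similarity to query in %, attention flag).
# Rows 1-4 mirror the visible rows of Table 1; rows 5-6 are hypothetical.
rows = [(60, 1), (80, 0), (65, 0), (15, 1), (20, 0), (10, 0)]
labels = ["Y", "Y", "N", "Y", "N", "N"]

gain_attention = info_gain(rows, labels, lambda r: r[1] == 1)
gain_similarity = info_gain(rows, labels, lambda r: r[0] >= 50)  # assumed threshold
```

On this toy data the attention flag gives the larger gain, so an entropy-based tree would split on it first. With real judged results the preferred attribute and threshold may differ.
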

Table 2. Performance of microblog retrieval based on the graph model in TREC 2014

Run id	R-Prec	Bpref	P@10	P@20
OSIM	0.2207	0.2673	0.4182	0.3682
NSIM	0.2169	0.2655	0.3982	0.3536
NCOS	0.2198	0.2667	0.3673	0.3255

Table 3. Performance of microblog retrieval based on the graph model and decision tree in TREC 2014

Run id	P@10	P@15	P@20
OSIM	0.4532	0.4325	0.3962
NSIM	0.4371	0.4251	0.3834
NCOS	0.4363	0.4273	0.3875
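
For readers unfamiliar with the measures reported in the tables, the following Python sketch computes P@k and R-Prec from a ranked list and a set of relevance judgments, using their standard definitions. The function names and the toy ranking are illustrative assumptions and are unrelated to the TREC 2014 data. Bpref is omitted.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    return precision_at_k(ranked_ids, relevant_ids, r) if r else 0.0


# Hypothetical ranking and judgments for a single query.
ranked = ["t3", "t1", "t7", "t2", "t9"]
relevant = {"t1", "t2", "t4"}

p_at_2 = precision_at_k(ranked, relevant, 2)
p_at_5 = precision_at_k(ranked, relevant, 5)
r_prec = r_precision(ranked, relevant)
```

Scores for a whole run are normally obtained by averaging these per-query values over all queries.
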
