Microblog Retrieval Results Re-ranking Using Graph Model Based Decision
Abstract: In microblog retrieval, both the user query and the relevant documents are extremely short texts. This leaves too few term samples for probabilistic retrieval models to be reliable and degrades retrieval performance. To address this problem, a re-ranking algorithm for microblog retrieval results was designed and implemented: a graph model is built from the content similarity between microblogs, and the PageRank algorithm is applied to re-rank the initial retrieval results. Three content similarity measures were compared: cosine similarity, the Dice coefficient, and the one-way Dice coefficient. Experimental results show that re-ranking effectively improves microblog retrieval performance, and that the iterative performance of the graph model depends on the proportion of relevant topics among the results. A decision tree re-ranking algorithm is therefore discussed as a way to remove the influence of non-relevant topics on the ranking.
Key words:
- microblogging retrieval
- results re-ranking
- graph model
- decision tree
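The re-ranking pipeline summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the similarity threshold, the damping factor, and the iteration count are all assumptions, and the one-way Dice coefficient is assumed here to be the overlap normalized by the size of the first text only, since the abstract does not define it.

```python
import math
from collections import Counter

def tokens(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between term-frequency vectors.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dice(a, b):
    # Dice coefficient on term sets: 2|A∩B| / (|A| + |B|).
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def one_way_dice(a, b):
    # ASSUMED definition: |A∩B| / |A| (asymmetric in its arguments).
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa) if sa else 0.0

def pagerank(weights, damping=0.85, iters=50):
    # Power iteration on a weighted graph; weights[i][j] is the edge i -> j.
    n = len(weights)
    if n == 0:
        return []
    rank = [1.0 / n] * n
    out = [sum(row) for row in weights]
    for _ in range(iters):
        new = [(1 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                # Nodes without outgoing edges spread their rank evenly.
                share = weights[i][j] / out[i] if out[i] else 1.0 / n
                new[j] += damping * rank[i] * share
        rank = new
    return rank

def rerank(tweets, sim=cosine, threshold=0.1):
    # Build the similarity graph, then order tweet indices by PageRank score.
    toks = [tokens(t) for t in tweets]
    n = len(tweets)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s = sim(toks[i], toks[j])
                if s >= threshold:
                    w[i][j] = s
    scores = pagerank(w)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Calling `rerank(tweets, sim=dice)` or `rerank(tweets, sim=one_way_dice)` swaps the similarity measure, which mirrors the comparison of the three measures described in the abstract. Tweets that share no terms with the rest of the result list receive no incoming edges and drop toward the bottom of the ranking.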
Table 1. Attributes of tweets search results

| Tweet id | Similarity to query/% | Has attention (1/0) | Relevant (Y/N) |
|---|---|---|---|
| 0001 | 60 | 1 | Y |
| 0002 | 80 | 0 | Y |
| 0003 | 65 | 0 | N |
| ⋮ | ⋮ | ⋮ | ⋮ |
| N | 15 | 1 | Y |
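The attributes in Table 1 are the kind of input a decision tree learner works with. The sketch below shows information-gain split selection, the core step of ID3-style decision tree learners, applied to the four rows visible in Table 1. It is only an illustration under assumptions: the function names are hypothetical, and the features and training procedure actually used by the authors are not given in this excerpt.

```python
import math

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def best_split(rows, labels):
    # Return (gain, feature_index, threshold) for the binary split
    # rows[f] <= threshold with the highest information gain.
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds lie midway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, f, t)
    return best

# The four rows visible in Table 1: (similarity to query in %, attention flag).
rows = [(60, 1), (80, 0), (65, 0), (15, 1)]
labels = ["Y", "Y", "N", "Y"]
gain, feature, threshold = best_split(rows, labels)
```

A full tree would apply `best_split` recursively to each side of the chosen split until the labels in a node are pure or no split improves the gain.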
Table 2. Performance of microblog retrieval based on graph model in TREC 2014
| Run id | R-Prec | Bpref | P@10 | P@20 |
|---|---|---|---|---|
| OSIM | 0.2207 | 0.2673 | 0.4182 | 0.3682 |
| NSIM | 0.2169 | 0.2655 | 0.3982 | 0.3536 |
| NCOS | 0.2198 | 0.2667 | 0.3673 | 0.3255 |
Table 3. Performance of microblog retrieval based on graph model and decision tree in TREC 2014
| Run id | P@10 | P@15 | P@20 |
|---|---|---|---|
| OSIM | 0.4532 | 0.4325 | 0.3962 |
| NSIM | 0.4371 | 0.4251 | 0.3834 |
| NCOS | 0.4363 | 0.4273 | 0.3875 |
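For readers unfamiliar with the measures reported in the tables, the sketch below computes P@k and R-Prec using their standard definitions. The function names and the input convention (a list of 0/1 relevance judgments in ranked order) are assumptions made for illustration; official TREC scores are produced by the track's own evaluation tooling.

```python
def precision_at_k(ranked_relevance, k):
    # Fraction of the top-k results that are relevant.
    # The denominator stays k even if fewer than k results were returned.
    return sum(ranked_relevance[:k]) / k

def r_precision(ranked_relevance, total_relevant):
    # Precision at rank R, where R is the number of relevant
    # documents known for the query.
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:total_relevant]) / total_relevant

# Example: 1 marks a relevant tweet, 0 a non-relevant one, in ranked order.
judgments = [1, 0, 1, 1, 0]
p5 = precision_at_k(judgments, 5)
rprec = r_precision(judgments, 3)
```

In this example three of the top five results are relevant, so P@5 is 0.6, and two of the top three are relevant, so R-Prec is 2/3.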