ABSTRACT
In web search, relevance ranking of popular pages is relatively easy because strong signals such as anchor text and search log data are available. In contrast, relevance ranking of less popular pages is very challenging because such information is lacking. In this paper we refer to the former as head pages and to the latter as tail pages. We address the challenge by learning a model that extracts search-focused key n-grams from web pages and by using the key n-grams when searching the pages, particularly the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those that can compose "good queries" for searching the page. Second, key n-grams are learned in a relative sense using learning-to-rank techniques. Third, key n-grams are learned from search log data, so that the characteristics of key n-grams observed in the search log data, particularly for head pages, can be applied to other data, particularly tail pages. Fourth, the extracted key n-grams are used as features of the relevance ranking model, which is also trained with learning-to-rank techniques. Experiments on large-scale web search datasets validate the effectiveness of the proposed approach. The results show that our approach significantly improves relevance ranking performance on both heads and tails, and particularly on tails, compared with baseline approaches. Characteristics of our approach have also been investigated through comprehensive experiments.
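The idea of learning key n-grams "in a relative sense" from search log data can be illustrated with a small sketch. It is not the paper's implementation: the function names, the whitespace tokenizer, and the labeling rule (an n-gram of the page that occurs in more of the page's clicked queries is preferred over one that occurs in fewer) are assumptions made for illustration only. The resulting preference pairs are the kind of input a pairwise learning-to-rank method could be trained on.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, max_n=3):
    """Return all contiguous n-grams (as tuples) of length 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def preference_pairs(page_text, clicked_queries, max_n=3):
    """Build (preferred, other) n-gram pairs for one page.

    Assumption (hypothetical labeling rule): an n-gram of the page that
    occurs in more of the page's clicked queries is a better key n-gram
    than one that occurs in fewer. Ties produce no pair.
    """
    page_ngrams = set(ngrams(page_text.lower().split(), max_n))
    counts = Counter()
    for query in clicked_queries:
        # Count each n-gram at most once per query.
        for gram in set(ngrams(query.lower().split(), max_n)):
            if gram in page_ngrams:
                counts[gram] += 1
    pairs = []
    for a, b in combinations(sorted(page_ngrams), 2):
        if counts[a] > counts[b]:
            pairs.append((a, b))
        elif counts[b] > counts[a]:
            pairs.append((b, a))
    return pairs


if __name__ == "__main__":
    # Toy head page with two logged queries (made-up data).
    pairs = preference_pairs(
        "cheap flights to paris",
        ["cheap flights", "flights paris"],
        max_n=2,
    )
    for preferred, other in pairs:
        print(" ".join(preferred), ">", " ".join(other))
```

In this sketch the pairs come only from pages that have associated queries (head pages). A ranker trained on such pairs with features computed from the page itself could then be applied to tail pages that have no log data, which is the transfer the abstract describes.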
Index Terms
- Extracting search-focused key n-grams for relevance ranking in web search