DOI: 10.1145/2124295.2124338
research-article

Extracting search-focused key n-grams for relevance ranking in web search

Published: 08 February 2012

ABSTRACT

In web search, relevance ranking of popular pages is relatively easy because strong signals such as anchor text and search log data are available. For less popular pages, in contrast, relevance ranking is very challenging because such information is lacking. In this paper we refer to the former as head pages and the latter as tail pages. We address the challenge by learning a model that extracts search-focused key n-grams from web pages and by using the key n-grams in searches of the pages, particularly the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those that can compose "good queries" for searching the page. Second, key n-grams are learned in a relative sense using learning to rank techniques. Third, key n-grams are learned from search log data, so that the characteristics of key n-grams in the search log data, particularly in the heads, can be applied to other data, particularly the tails. Fourth, the extracted key n-grams are used as features of the relevance ranking model, which is also trained with learning to rank techniques. Experiments on large-scale web search datasets validate the effectiveness of the proposed approach. The results show that our approach can significantly improve relevance ranking performance on both heads and tails, and particularly on tails, compared with baseline approaches. Characteristics of our approach have also been fully investigated through comprehensive experiments.
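
The abstract does not spell out how training data for the key n-gram model is built, so the following is only a minimal sketch of one plausible reading: candidate n-grams are taken from the page text, each is labeled by how many logged queries for that page contain it, and the labels are turned into preference pairs of the kind a pairwise learning to rank method consumes. The function names, the whitespace tokenizer, the `max_n` parameter, and the count-based labeling rule are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations


def extract_ngrams(text, max_n=3):
    """Return a Counter of all word n-grams (n = 1..max_n) in `text`.

    Assumption: lowercased whitespace tokenization stands in for real
    web-page parsing and tokenization.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def label_ngrams(page_text, queries, max_n=3):
    """Label each page n-gram with the number of logged queries containing it.

    Assumption: a higher count means a "better" key n-gram for this page.
    """
    page_ngrams = extract_ngrams(page_text, max_n)
    query_ngram_sets = [set(extract_ngrams(q, max_n)) for q in queries]
    return {g: sum(g in qs for qs in query_ngram_sets) for g in page_ngrams}


def preference_pairs(labels):
    """Yield (better, worse) n-gram pairs; ties produce no pair."""
    for a, b in combinations(sorted(labels), 2):
        if labels[a] > labels[b]:
            yield a, b
        elif labels[b] > labels[a]:
            yield b, a


# Hypothetical head page with two logged queries.
labels = label_ngrams(
    "cheap flights to new york",
    ["new york flights", "cheap flights"],
    max_n=2,
)
pairs = set(preference_pairs(labels))
```

Here `pairs` contains relative judgments such as ("flights", "to"), meaning "flights" should rank above "to" as a key n-gram for this page. A pairwise ranker trained on such pairs from head pages, with features computed from the page alone, could then score n-grams on tail pages that have no log data.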

    • Published in

      WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
      February 2012
      792 pages
      ISBN:9781450307475
      DOI:10.1145/2124295

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 498 of 2,863 submissions, 17%
