research-article

Extracting search-focused key n-grams for relevance ranking in web search

Authors:
Chen Wang, Fudan University, Shanghai, China
Keping Bi, Peking University, Beijing, China
Yunhua Hu, Microsoft Research Asia, Beijing, China
Hang Li, Microsoft Research Asia, Beijing, China
Guihong Cao, Microsoft Corporation, Redmond, WA, USA

WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining, February 2012, Pages 343–352. https://doi.org/10.1145/2124295.2124338

Published: 08 February 2012

ABSTRACT

In web search, relevance ranking of popular pages is relatively easy because of strong signals such as anchor text and search log data. In contrast, relevance ranking of less popular pages is very challenging due to a lack of information. In this paper, we refer to the former as head pages and to the latter as tail pages. We address the challenge by learning a model that can extract search-focused key n-grams from web pages, and by using the key n-grams in searches of the pages, particularly the tail pages. To the best of our knowledge, this problem has not been previously studied. Our approach has four characteristics. First, key n-grams are search-focused in the sense that they are defined as those which can compose "good queries" for searching the page. Second, key n-grams are learned in a relative sense using learning-to-rank techniques. Third, key n-grams are learned using search log data, such that the characteristics of key n-grams in the search log data, particularly in the heads, can be applied to the other data, particularly to the tails. Fourth, the extracted key n-grams are used as features of the relevance ranking model, which is also trained with learning-to-rank techniques. Experiments with large-scale web search datasets validate the effectiveness of the proposed approach. The results show that our approach can significantly improve relevance ranking performance on both heads and tails, and particularly on tails, compared with baseline approaches. Characteristics of our approach have also been fully investigated through comprehensive experiments.
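The second and third characteristics (learning key n-grams "in a relative sense" from search log data) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how training labels are derived, so the rule used here (an n-gram contained in more of a page's logged queries is preferred over one contained in fewer) is an assumption, and the function names `candidate_ngrams`, `query_ngram_counts`, and `preference_pairs` are hypothetical. The output is the kind of pairwise preference data that a learning-to-rank method could be trained on.

```python
from collections import Counter
from itertools import combinations


def candidate_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def query_ngram_counts(queries, max_n=3):
    """Count how many logged queries contain each n-gram."""
    counts = Counter()
    for query in queries:
        # candidate_ngrams returns a set, so each query counts once per n-gram.
        counts.update(candidate_ngrams(query, max_n))
    return counts


def preference_pairs(page_text, queries, max_n=3):
    """Build (preferred, other) n-gram pairs for one page.

    Assumption (not from the paper): an n-gram that appears in more of the
    page's logged queries is treated as more "key" than one in fewer.
    Pairs with equal counts carry no preference and are skipped.
    """
    counts = query_ngram_counts(queries, max_n)
    grams = sorted(candidate_ngrams(page_text, max_n))
    pairs = []
    for a, b in combinations(grams, 2):
        if counts[a] > counts[b]:
            pairs.append((a, b))
        elif counts[b] > counts[a]:
            pairs.append((b, a))
    return pairs


# Toy data: one page and queries assumed to have led to clicks on it.
page = "cheap flights to tokyo"
queries = ["tokyo flights", "cheap flights", "flights"]
pairs = preference_pairs(page, queries, max_n=2)
```

In this toy example `pairs` ranks "flights" above "cheap", and "cheap" above "to", while "cheap" and "tokyo" are left unordered because they occur in the same number of queries. Training on relative preferences rather than absolute labels matches the abstract's "relative sense"; a model fitted on head pages, where such log data is plentiful, could then score n-grams on tail pages that have little or no log data.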

Index Terms

1. Information systems
  1. Information retrieval

Published in
WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
February 2012
792 pages
ISBN: 9781450307475
DOI: 10.1145/2124295
General Chairs: Eytan Adar (University of Michigan, USA), Jaime Teevan (Microsoft Research, USA)
Program Chairs: Eugene Agichtein (Emory University, USA), Yoelle Maarek (Yahoo! Research, Israel)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
key n-gram extraction
learning to rank
ranking
search relevance
tail page
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 498 of 2,863 submissions, 17%