Towards an Accurate, Robust, and Scalable Named Entity Disambiguation System

Guo, Zhaochen

doi:10.7939/R3QB9VN41

Knowledge bases (KBs), repositories consisting of entities, facts about entities, and relations between entities, are a vital component for many tasks in artificial intelligence and natural language processing such as semantic search and question answering. Named Entity Disambiguation (NED), the task of disambiguating mentions of named entities in a textual document by linking them to the actual entities in a KB, enables expanding or correcting the KB with facts extracted from documents – a task called Knowledge Base Population. This thesis focuses on the NED task with the goal of building an accurate, robust, and scalable NED system.

We first propose a graph-based approach that collectively disambiguates mentions of entities in a given document, under the assumption that entities mentioned in a document are semantically related under a single topic. Our approach uses a carefully curated disambiguation graph built from a KB, and applies personalized random walks on the graph to compute semantic representations of entities, which are used to measure semantic relatedness and disambiguate named entities.
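
As a rough illustration of the idea above, the following sketch runs a personalized random walk (random walk with restart) on a toy entity graph and compares the resulting score vectors. It is not the thesis implementation: the toy graph, the function names, the restart probability, and the use of cosine similarity as the relatedness measure are all assumptions made for this example.

```python
import math

# Toy entity graph (hypothetical): node -> {neighbor: edge weight}.
TOY_GRAPH = {
    "Michael_Jordan_(basketball)": {"Chicago_Bulls": 1.0, "NBA": 1.0},
    "Chicago_Bulls": {"Michael_Jordan_(basketball)": 1.0, "NBA": 1.0},
    "NBA": {"Michael_Jordan_(basketball)": 1.0, "Chicago_Bulls": 1.0},
    "Michael_I._Jordan": {"Machine_learning": 1.0},
    "Machine_learning": {"Michael_I._Jordan": 1.0},
}


def personalized_random_walk(graph, seeds, restart=0.15, iterations=50):
    """Approximate the stationary distribution of a walk that restarts at `seeds`.

    `seeds` maps nodes to non-negative restart weights (assumed non-empty).
    """
    nodes = set(graph)
    for neighbors in graph.values():
        nodes.update(neighbors)
    total = sum(seeds.values())
    restart_dist = {n: seeds.get(n, 0.0) / total for n in nodes}
    scores = dict(restart_dist)
    for _ in range(iterations):
        # With probability `restart`, jump back to the seed distribution.
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for node in nodes:
            neighbors = graph.get(node, {})
            out_weight = sum(neighbors.values())
            if out_weight == 0:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * scores[node] * restart_dist[n]
                continue
            for nbr, weight in neighbors.items():
                nxt[nbr] += (1 - restart) * scores[node] * weight / out_weight
        scores = nxt
    return scores


def cosine(u, v):
    """Cosine similarity between two sparse score vectors (assumed measure)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Document context: an unambiguous mention already linked to Chicago_Bulls.
doc_vec = personalized_random_walk(TOY_GRAPH, {"Chicago_Bulls": 1.0})
candidates = ["Michael_Jordan_(basketball)", "Michael_I._Jordan"]
relatedness = {
    c: cosine(doc_vec, personalized_random_walk(TOY_GRAPH, {c: 1.0}))
    for c in candidates
}
best = max(relatedness, key=relatedness.get)
```

In this toy setting the candidate whose walk overlaps with the document's walk gets the higher relatedness score, which mirrors the collective-disambiguation assumption that co-mentioned entities share a topic.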

We then improve the robustness of our NED approach with a supervised learning-to-rank algorithm using publicly available datasets. We find that the public benchmarks, drawn mainly from news articles, are biased towards well-known entities and are not representative enough to evaluate the robustness of an NED approach. We therefore develop a framework for deriving new benchmarks and construct two benchmarks of varying disambiguation difficulty from two large corpora (Wikipedia and ClueWeb) for evaluating robustness.

Finally, to address the scalability of our NED approach, we explore various features from entity graphs, contextual texts, and document corpora that can be efficiently precomputed offline. Instead of performing random walks on graphs constructed online, we use a set of selected landmark nodes from entity graphs to compute the semantic representations of entities. We also explore features derived from the describing documents and associated categories of entities. By precomputing all these features offline, our approach reduces the computing and memory resources required, which improves efficiency and allows the NED system to scale out. The evaluation shows that our approach is very competitive and efficient compared with previous NED approaches.
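
One way to picture a landmark-based representation is sketched below. This is only an assumption-laden illustration, not the method from the thesis: the toy graph, the choice of landmarks, the function names, and the use of inverse shortest-path distance as the proximity score are all hypothetical. The point is that each entity's vector has one entry per landmark and can be computed once, offline.

```python
from collections import deque

# Toy entity graph (hypothetical): node -> list of neighbors.
ENTITY_GRAPH = {
    "Michael_Jordan_(basketball)": ["Chicago_Bulls", "NBA"],
    "Chicago_Bulls": ["Michael_Jordan_(basketball)", "NBA"],
    "NBA": ["Michael_Jordan_(basketball)", "Chicago_Bulls"],
    "Michael_I._Jordan": ["Machine_learning"],
    "Machine_learning": ["Michael_I._Jordan"],
}


def bfs_distances(graph, source):
    """Hop counts from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def landmark_vectors(graph, landmarks):
    """Map each node to a vector of proximities, one entry per landmark.

    Proximity is 1 / (1 + hops), or 0.0 if the landmark cannot reach the node
    (an assumed scoring choice for this sketch).
    """
    nodes = set(graph)
    for neighbors in graph.values():
        nodes.update(neighbors)
    # One traversal per landmark; this is the step that can run offline.
    per_landmark = [bfs_distances(graph, lm) for lm in landmarks]
    return {
        n: [1.0 / (1 + d[n]) if n in d else 0.0 for d in per_landmark]
        for n in nodes
    }


vectors = landmark_vectors(ENTITY_GRAPH, ["NBA", "Machine_learning"])
```

With the vectors stored ahead of time, comparing two entities at query time reduces to comparing two short fixed-length vectors rather than walking a graph built per document.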
Graduation date

Fall 2018
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3QB9VN41
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Barbosa, Denilson (Computing Science)