A long-standing challenge in computer science research is understanding written text and extracting useful information from it. Distributed representations, so-called word embeddings, map the words of a vocabulary to dense vectors such that words with similar meanings are mapped to nearby points, and the similarity between words is computed from their distance in the embedding space. Traditional word embeddings, despite capturing semantics well, have some drawbacks.
They treat all words equally as terms and cannot directly represent named entities. Disregarding named entities while generating word embeddings creates several challenges for downstream tasks that use the embeddings as input.
In this work, we address the problems of term-based models by generating embeddings for named entities as well as terms from an annotated corpus, using two approaches:
- To naively include entities in our models, we train well-established word embedding models on a corpus annotated with named entities.
- To better capture entity-entity relations, we take advantage of the graph representation of the corpus and embed the nodes of co-occurrence graphs extracted from the annotated text.
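The core idea behind the second approach can be illustrated with a windowed co-occurrence graph. The sketch below is a simplification of the actual LOAD construction (which distinguishes node types and context scopes); the type-prefixed token names and the window size are illustrative assumptions, not the paper's exact settings.

```python
from collections import Counter

def cooccurrence_edges(tokens, window=3):
    """Count co-occurrence edges between tokens that appear
    within `window` positions of each other."""
    edges = Counter()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + window]:
            if source != target:
                # Canonical ordering so (a, b) and (b, a) count as one edge.
                edges[tuple(sorted((source, target)))] += 1
    return edges

# Toy annotated sentence; prefixes mimic the LOAD node types
# ((T)erm, (O)rganisation, (A)ctor, (L)ocation, (D)ate).
tokens = ["A_Obama", "T_visited", "L_Berlin", "T_visited", "A_Merkel"]
edges = cooccurrence_edges(tokens, window=3)
```

The resulting weighted edge counts form the graph whose nodes are then embedded with methods such as VERSE or DeepWalk.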
We extracted the LOAD network from a set of 127,485 English news articles. By following the RSS feeds of the news outlets CNN, LA Times, NY Times, USA Today, CBS News, The Washington Post, IBTimes, BBC, The Independent, Reuters, SkyNews, The Telegraph, The Guardian, and Sydney Morning Herald, we collected political news published between June 1, 2016 and November 30, 2016.
The network data consists of two edge lists and one node list. The "node_list.txt" file contains the label for each node along with its unique identifier. "edge_list_raw.txt" contains the edge list in which each node is referenced by its unique identifier from the LOAD network. Since the embeddings need an index file that maps words to their rows in the embedding matrix, we also created a second edge list, "edge_list.txt", that references nodes by index. Below is a description of the columns in each file:

node_list.txt:
- Index_number: consecutive index number that maps each node to a row in the embedding matrix
- Unique_entity_identifier: the unique identifier of each entity, which encodes the entity type ((T)erm, (O)rganisation, (A)ctor, (L)ocation, (D)ate)
- Label: the label of each node

edge_list_raw.txt:
- Source_Unique_entity_Identifier: the unique identifier of the source node
- Target_Unique_entity_Identifier: the unique identifier of the target node

edge_list.txt:
- Source_index: index number of the source node
- Target_index: index number of the target node
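As a minimal sketch of how the node and edge lists fit together, the snippet below parses a node list into an index-to-label map and resolves index-based edges back to labels. The column separator (tab) and the sample identifiers are assumptions for illustration; check the downloaded files for the actual format.

```python
import io

# Hypothetical file contents, following the column order described above;
# tab-separated columns are an assumption.
node_list = io.StringIO("0\tA_12\tBarack Obama\n1\tL_7\tBerlin\n")
edge_list = io.StringIO("0\t1\n")

# Map Index_number -> (Unique_entity_identifier, Label).
nodes = {}
for line in node_list:
    index, identifier, label = line.rstrip("\n").split("\t")
    nodes[int(index)] = (identifier, label)

# Read index-based edges (edge_list.txt) and resolve them to labels.
edges = []
for line in edge_list:
    source, target = map(int, line.split("\t"))
    edges.append((nodes[source][1], nodes[target][1]))
```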
[zip] Download implicit network data (1.29 GB compressed, 4.13 GB uncompressed).
The code for this approach is provided in the GitHub repository for Entity Embeddings. To test the models without training, we also provide pre-trained versions of all the models described in the paper. The models have been trained on the dataset of news articles and are compatible with the code in the repository; they include word2vec and GloVe models trained on both raw unannotated text and text annotated with named entities. VERSE and DeepWalk node embeddings have been trained on the LOAD network provided above.
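If the pre-trained models are distributed in the common word2vec text format (a header line with vocabulary size and dimensionality, followed by one word and its vector per line) — an assumption, since the exact serialization is not specified here — they can be read with a few lines of standard-library Python:

```python
import io

def load_text_embeddings(fh):
    """Parse word2vec-style text embeddings: a '<vocab_size> <dim>'
    header line, then '<word> <v1> ... <vdim>' lines."""
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim  # every row must match the header dim
        vectors[word] = values
    assert len(vectors) == vocab_size
    return vectors

# Tiny example file in the assumed format.
sample = io.StringIO("2 3\nberlin 0.1 0.2 0.3\nobama 0.4 0.5 0.6\n")
emb = load_text_embeddings(sample)
```

In practice, gensim's `KeyedVectors.load_word2vec_format` reads this format (and its binary variant) directly.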
[zip] pre-trained models (2.27 GB compressed, 2.73 GB uncompressed).
The paper that describes the above approach was presented at the 41st European Conference on Information Retrieval in Cologne, Germany:
- Satya Almasian, Andreas Spitz, and Michael Gertz.
Word Embeddings for Entity-Annotated Texts.
In: Proceedings of Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019. [pdf] [bibtex] [slides] [code]
If you have any questions or requests, be sure to let us know! Please contact Satya Almasian. [web]