WordNet: The Lexical Database

April 10, 2016

Introduction

WordNet is a lexical database created at Princeton University.  As of today, it is at version 3.1 (updated in 2012).  It groups English words into synonym sets (synsets) and provides short definitions and examples for each.  It also records the relations between those synsets and the words in them.  It is best to think of WordNet as a combined dictionary and thesaurus.  It can be used by many kinds of applications across different industries.  However, we will focus on its use in Artificial Intelligence via text analysis.

Reference: https://en.wikipedia.org/wiki/WordNet

Terminology

Before going further, it helps to work through some of the WordNet terminology, since it is important to understand what the terms mean and how the words are organized.

  • Hyperonymy Relation (Noun): The relation between two nouns where Y is a hypernym of X if every X is a (kind of) Y [canine is a hypernym of dog]
  • Hyponymy Relation (Noun): The relation between two nouns where Y is a hyponym of X if every Y is a (kind of) X [dog is a hyponym of canine]
  • Meronymy Relation (Noun): The relation between two nouns where Y is a meronym of X if Y is a part of X [window is a meronym of building]
  • Hyperonymy Relation (Verb): The relation between two verbs where Y is a hypernym of X if the activity X is a (kind of) Y [to perceive is a hypernym of to listen]
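
To make the direction of these relations concrete, here is a minimal Python sketch.  The HYPERNYM_OF table is a hand-written toy (it is not read from WordNet), and the function names are my own.  Real synsets can have more than one hypernym, which this sketch ignores.

```python
# Toy hypernym links, hand-written for illustration (not read from WordNet).
HYPERNYM_OF = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
}


def hypernym_chain(word):
    """Follow hypernym links upward until no parent is recorded."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def is_hyponym_of(x, y):
    """x is a hyponym of y exactly when y appears among x's hypernyms."""
    return y in hypernym_chain(x)


print(hypernym_chain("dog"))
print(is_hyponym_of("dog", "animal"))
```

Hypernymy and hyponymy are the same link read in opposite directions, so a single table is enough to answer both questions.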

For a full glossary of terms see https://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html

Structure

For the most part, WordNet is divided into 4 total subnets: Nouns, Verbs, Adjectives, Adverbs.  These are then divided into the corresponding files as described below:

  • Nouns
    • data.noun
    • index.noun
    • noun.exc
    • dbfiles/noun.* : nouns denoting "*"
  • Verbs
    • data.verb
    • index.verb
    • verb.exc
    • dbfiles/verb.* : verbs of "*"
  • Adjectives
    • data.adj
    • index.adj
    • adj.exc
    • dbfiles/adj.* : all adjective clusters and pertainyms
    • adj.ppl : participial adjectives
  • Adverbs
    • data.adv
    • index.adv
    • adv.exc
    • dbfiles/adv.all : all adverbs

All files are ASCII, and fields are generally separated by a single space (unless otherwise noted).  Records are separated by newline characters.  See https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html for more details on the file numbers and the meaning of the individual lexicographer files.

Each part of speech (pos) has three files of particular importance: data.pos, index.pos, and pos.exc.  The index files are alphabetized lists of all the words in WordNet for the corresponding pos, with the words in lower case.  Each line ends with a list of byte offsets into the corresponding data file.  The data files contain the synsets that were specified in the Lexicographer Files, with relational pointers resolved to synset_offsets.  The exception files list morphologically irregular inflections of words together with their base forms [e.g. cargoes and cargo].
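
Because the offsets stored in an index file are byte offsets, a reader can seek straight to a record instead of scanning the data file.  The sketch below shows the idea on a tiny in-memory file.  The record bodies are made up, and build_data_file and read_record are hypothetical helper names, not part of any WordNet tooling.

```python
import io


def build_data_file(bodies):
    """Build a tiny in-memory data file whose 8-digit prefixes are true byte offsets."""
    out = b""
    offsets = []
    for body in bodies:
        offsets.append(len(out))
        out += f"{len(out):08d} {body}\n".encode("ascii")
    return io.BytesIO(out), offsets


def read_record(data_file, synset_offset):
    """Jump straight to a record using an offset taken from an index line."""
    data_file.seek(synset_offset)
    return data_file.readline().decode("ascii").rstrip("\n")


# Made-up record bodies; only the offset prefix matters for this sketch.
data_file, offsets = build_data_file([
    "first record body | made-up gloss one",
    "second record body | made-up gloss two",
])
record = read_record(data_file, offsets[1])
print(record)
```

With a real data.pos file the same seek-and-readline pattern should apply, as long as the file is opened in binary mode so that offsets are counted in bytes.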

Index Format

lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]

  • lemma: Lower case ASCII word or collocation.  Collocations are joined by the _ character.
  • pos: Category of word (n = noun, v = verb, a = adjective, r = adverb)
  • synset_cnt: Number of synsets that lemma is in.  This is the number of senses of the word in WordNet.
  • p_cnt: Number of different pointers that lemma has in all synsets containing it.
  • ptr_symbol: Pointer symbols, p_cnt of them, separated by spaces.
  • sense_cnt: Same as synset_cnt above (redundant).
  • tagsense_cnt: Number of senses of lemma that are ranked according to their frequency of occurrence.
  • synset_offset: Byte offset in the data.pos file of a synset containing lemma.  Each offset corresponds to a different sense of lemma in WordNet.  Offsets are always 8-digit, zero-filled decimal integers.  There are synset_cnt offsets, separated by spaces.
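
The index format above can be read with a simple split on whitespace, using p_cnt and synset_cnt to know how many variable-length fields to consume.  This is a minimal sketch: IndexEntry and parse_index_line are names I made up, and the sample line is hand-written with invented offsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]


def parse_index_line(line: str) -> IndexEntry:
    """Parse one line of an index.pos file into its named fields."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]      # exactly p_cnt symbols
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    # Offsets are zero-filled decimals, one per synset.
    synset_offsets = [int(tok) for tok in rest[2:2 + synset_cnt]]
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)


# Hand-written sample line; the offsets are invented for illustration.
sample = "dog n 2 3 @ ~ #m 2 1 02084071 10114209"
entry = parse_index_line(sample)
print(entry)
```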

Data Format

synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss

  • synset_offset: Current byte offset in the file represented by the 8 digit integer.
  • lex_filenum: Integer representing lexicographer file.
  • ss_type: synset_type (n = Noun, v = Verb, a = Adjective, s = Adjective Satellite, r = Adverb)
  • w_cnt: Two-digit hexadecimal integer indicating the number of words in the synset
  • word: ASCII form of a word as entered in the synset
  • lex_id: One-digit hexadecimal integer that uniquely identifies a sense within a lexicographer file.  0 is the default, so it is omitted from the lexicographer files.
  • p_cnt: 3-digit Integer indicating the number of pointers from this synset to others.
  • ptr: Pointer from this synset to another
  • frames (data.verb only): List of numbers corresponding to the generic verb sentence frames for word in the synset
  • gloss: A definition or example sentence or both.  Each synset contains a gloss.  

ptr has the following internal format:

pointer_symbol synset_offset pos source/target

  • pointer_symbol: symbol identifying the type of relation the pointer represents
  • synset_offset: byte offset of the target synset in the data file corresponding to pos
  • pos: part of speech
  • source/target: 4-digit hex field where the first two digits give the word number in the source synset and the last two give the word number in the target synset.  A value of 0000 means the pointer_symbol is a semantic relation between the current synset and the target synset given by synset_offset.

frames has the following internal format:

f_cnt + f_num w_num [+ f_num w_num...]

  • f_cnt: two digit integer indicating the number of generic frames listed.
  • f_num: two digit integer frame number
  • w_num: two digit hex integer indicating the word in the synset that the frame applies to.  If 00, applies to all words in the synset.
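
Putting the data, ptr, and frames formats together, a record can be walked token by token, with each count field telling the parser how many items follow.  The sketch below is one way to do that.  parse_data_line is a hypothetical helper name, and the sample record is hand-written in the right shape; its offsets and numbers are invented, not copied from the database.

```python
def parse_data_line(line):
    """Parse one data.pos record into a dictionary of its fields."""
    head, _, gloss = line.partition(" | ")
    tok = head.split()
    record = {
        "synset_offset": int(tok[0]),
        "lex_filenum": int(tok[1]),
        "ss_type": tok[2],
        "words": [],
        "pointers": [],
        "frames": [],
        "gloss": gloss.strip(),
    }
    w_cnt = int(tok[3], 16)                  # hexadecimal word count
    i = 4
    for _ in range(w_cnt):
        record["words"].append((tok[i], int(tok[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(tok[i])                      # decimal pointer count
    i += 1
    for _ in range(p_cnt):
        symbol, offset, pos, src_tgt = tok[i:i + 4]
        record["pointers"].append({
            "symbol": symbol,
            "synset_offset": int(offset),
            "pos": pos,
            "source_word": int(src_tgt[:2], 16),   # 0 means whole synset
            "target_word": int(src_tgt[2:], 16),
        })
        i += 4
    if record["ss_type"] == "v" and i < len(tok):
        f_cnt = int(tok[i])
        i += 1
        for _ in range(f_cnt):
            # Each frame is written as: + f_num w_num
            record["frames"].append((int(tok[i + 1]), int(tok[i + 2], 16)))
            i += 3
    return record


# Hand-written sample record; all numbers are invented for illustration.
sample = ("00001740 29 v 02 breathe 0 respire 0 002 "
          "* 00005041 v 0000 + 00831191 n 0101 "
          "02 + 02 00 + 08 00 | draw air into and expel it from the lungs")
record = parse_data_line(sample)
print(record["words"], record["frames"])
```

A source_word and target_word of 0 together correspond to the 0000 case described above, meaning the pointer relates the synsets as a whole rather than two specific words.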

Exceptions Format

Each line lists an inflected form followed by one or more base forms, separated by spaces.  The first column is the inflected word, while the remaining columns are its base words.
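
As a small illustration, an exception file can be loaded into a dictionary that maps each inflected form to its base forms.  parse_exception_lines is a name I made up, and the sample lines are hand-written rather than copied from a real pos.exc file.

```python
def parse_exception_lines(lines):
    """Map each inflected form to its list of base forms."""
    exceptions = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        inflected, *base_forms = fields
        exceptions[inflected] = base_forms
    return exceptions


# Hand-written sample lines in the exception-file layout.
sample_lines = ["cargoes cargo", "axes ax axis", ""]
exceptions = parse_exception_lines(sample_lines)
print(exceptions["cargoes"])
```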

 

Lexicographer Files

Reference: https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html#toc4

These files hold the detailed relational analysis of lexical semantics.  In other words, they hold the relationships between words that together represent "lexical knowledge."  This may be the most difficult of the file formats to understand, so I'll try to spell out the symbols and their meanings as best I can.

These files are located in the dbfiles subfolder of dict.  Their names have the format pos.suffix, where suffix is the synset group (e.g. animal, plant, etc).

Pointers are used to represent the relations between words in one synset and another.  A relation from a source to a target synset is formed by specifying, in the source synset, a word from the target synset followed by the pointer_symbol (as described in the format section).  The full list of symbols can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc3

Verb frames are used when example sentences are not normally available; they represent the simple ways a verb can be used in a sentence.  The full integer list can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc4

Format

Synsets are entered one per line, and each line is terminated with a newline character.  The general syntax is:

  • { words pointers ( gloss ) }

Note: Everything can be separated by either a space or tab.

The syntax is valid for all synsets except verbs.  Verb syntax is:

  • { words pointers frames ( gloss ) }

Adjectives can be grouped into clusters containing one or more head synsets and optional satellite synsets.  The syntax is:

  • [ head synset [satellite synsets] [-] [additional head/satellite synsets] ]

Note: A cluster can span multiple lines.  Parts of a cluster are separated by a line containing only hyphens (one or more).  Synsets within the square brackets follow the general syntax above.

Comments are enclosed in parentheses on a separate line ().

A word can have a few different syntaxes.  These combine the word with optional markers, lexical ids, and pointers.

Word Syntax:

word[ ( marker ) ][lex_id],

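  • word: Any combination of upper/lower case unless in an adjective cluster.  In collocations, spaces are replaced with an underscore _ character.  Numbers may be entered by following them with a double-quote " character.
  • lex_id: unique sense id for a given word within a given file.  The range is 1 to 15.  The default is 0 and does not have to be specified.
  • marker: for adjectives, a syntactic marker limiting where the adjective can appear relative to the noun it modifies: (p) predicate position, (a) prenominal (attributive) position, (ip) immediately postnominal position.

When one or more pointers apply to a specific word rather than to the whole synset, brackets are used.  This is referred to as a word/pointer set.  Each set contains exactly one word, followed by one or more pointers, and a synset can contain any number of these sets.  The syntax is below: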

  • [ word[ ( marker ) ][lex_id], pointers ]

For verbs, the word syntax can be extended for frames that only correspond to one word rather than the whole synset:

  • [ word, [pointers] frames ]

Pointer Syntax:

Pointers are optional in synsets.  If a pointer is specified outside of a word/pointer set, the relation is applied to all words in the synset.  The syntax can take one of two forms:

  • [lex_filename:]word[lex_id],pointer_symbol

or

  • [lex_filename:]word[lex_id]^word[lex_id],pointer_symbol

In a pointer, word refers to the word in the other synset.  When the second form is used, the first word indicates a word in the head synset and the second is the word in a satellite of that cluster.  word may be followed by a lex_id for targeting the correct synset.  If it is in another file, lex_filename is used.
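
Both pointer forms can be picked apart with a regular expression.  The sketch below is deliberately simplified: POINTER_RE and parse_pointer are my own names, the sample pointers are hand-written, and the pattern assumes words contain no digits (so it does not handle numbers entered with a double quote, or syntactic markers).

```python
import re

# Simplified pattern for: [lex_filename:]word[lex_id][^word[lex_id]],pointer_symbol
POINTER_RE = re.compile(
    r"^(?:(?P<lex_filename>[a-z]+\.[A-Za-z]+):)?"
    r"(?P<word>[^\d,^:]+?)(?P<lex_id>\d{1,2})?"
    r"(?:\^(?P<sat_word>[^\d,^:]+?)(?P<sat_lex_id>\d{1,2})?)?"
    r",(?P<pointer_symbol>\S+)$"
)


def parse_pointer(text):
    """Return the named parts of a lexicographer-file pointer, or None if it does not match."""
    match = POINTER_RE.match(text)
    return match.groupdict() if match else None


# Hand-written sample pointers, one per form.
for sample in ["noun.animal:canine,@", "bank1,@", "hot^scorching1,!"]:
    print(sample, "->", parse_pointer(sample))
```

Fields that are absent from a pointer come back as None, which makes it easy to tell a same-file pointer from one that names a lex_filename.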

Verb Frame Syntax:

For verbs, the associated frames are delineated as follows:

  • frames: f_num[,f_num...]

where f_num represents the frame type referenced in the link above.

Gloss Syntax:

A gloss is included in all synsets and is enclosed in parentheses at the end of the synset.

Special Adjective Syntax:

There is special syntax for representing antonymous adjective synsets.  The first word must be entered in upper case and is considered the head word of the head synset.  Refer to https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc10 for more details if needed.

 

Some examples of this format are provided by Princeton here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc11

 

index.sense

This file provides an alternate method for accessing synsets and word senses in the database.  It is useful to applications that retrieve synsets, rather than all the senses of a word or collocation.

Format

sense_key synset_offset sense_number tag_cnt

  • sense_key: Encoding of the word sense
  • synset_offset: byte offset at which the synset containing the sense can be found in the data file corresponding to the part of speech.
  • sense_number: sense number of the word
  • tag_cnt: number of times the sense is tagged in various semantic concordance texts.

The sense_key has its own encoding as shown below:

lemma%ss_type:lex_filenum:lex_id:head_word:head_id

  • lemma: ASCII text of the word or collocation found in WordNet
  • ss_type: Integer representing the synset type for the sense. (1 = Noun, 2 = Verb, 3 = Adjective, 4 = Adverb, 5 = Adjective Satellite)
  • lex_filenum: Lexicographer File Number (Two Digit Number)
  • lex_id: Integer that uniquely identifies a sense within a file.  00 is the default, so it is not written in the lexicographer files.
  • head_word: Lemma for the first word of the satellite's head synset. Only present if sense is an adjective satellite synset.
  • head_id: Uniquely identifies the sense of head_word within the file.  Only present if head_word is present.

Note: This sense_key encoding is the same wherever a sense_key appears in other files, such as cntlist.
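
Since the sense_key packs several fields into one string, it can help to see it taken apart.  This is a minimal sketch: parse_sense_key and parse_sense_index_line are hypothetical helper names, and the sample index.sense line is hand-written with an invented offset and tag count.

```python
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb",
            5: "adjective satellite"}


def parse_sense_key(sense_key):
    """Split lemma%ss_type:lex_filenum:lex_id:head_word:head_id into fields."""
    lemma, _, lex_sense = sense_key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "ss_type": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # head_word and head_id are empty unless the sense is a satellite.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }


def parse_sense_index_line(line):
    """Parse: sense_key synset_offset sense_number tag_cnt"""
    sense_key, synset_offset, sense_number, tag_cnt = line.split()
    return {
        "sense": parse_sense_key(sense_key),
        "synset_offset": int(synset_offset),
        "sense_number": int(sense_number),
        "tag_cnt": int(tag_cnt),
    }


# Hand-written sample line; the offset and tag count are invented.
entry = parse_sense_index_line("dog%1:05:00:: 02084071 1 42")
print(entry)
```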

 

cntlist / cntlist.rev

Another important pair of files is cntlist / cntlist.rev.  These files record how many times each word sense was tagged in the semantic concordance texts, which works as a rough proxy for how frequently the sense is used in everyday language.  cntlist is ordered by tag_cnt, descending (highest frequency first), while cntlist.rev is ordered by sense_key.

cntlist Format

tag_cnt sense_key sense_number

cntlist.rev Format

sense_key sense_number tag_cnt

  • tag_cnt: number of times the sense is tagged in the semantic concordance texts.
  • sense_number: sense number of the word, within the part of speech encoded in sense_key.
  • sense_key: WordNet sense encoding (see above).
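
The two files carry the same three fields in a different column order and sort order, so one can be derived from the other.  The sketch below shows that relationship on a few hand-written lines.  The counts are invented and the function names are my own.

```python
def parse_cntlist_line(line):
    """cntlist column order: tag_cnt sense_key sense_number"""
    tag_cnt, sense_key, sense_number = line.split()
    return sense_key, int(sense_number), int(tag_cnt)


def to_rev_line(sense_key, sense_number, tag_cnt):
    """cntlist.rev column order: sense_key sense_number tag_cnt"""
    return f"{sense_key} {sense_number} {tag_cnt}"


# Hand-written sample lines; the counts are invented for illustration.
cntlist_lines = [
    "12 dog%1:05:00:: 1",
    "7 run%2:38:00:: 1",
    "3 bank%1:14:00:: 2",
]
records = [parse_cntlist_line(line) for line in cntlist_lines]

# cntlist order: most frequently tagged sense first.
by_count = sorted(records, key=lambda r: r[2], reverse=True)
# cntlist.rev order: sorted by sense_key.
rev_lines = [to_rev_line(*r) for r in sorted(records, key=lambda r: r[0])]

print(by_count[0])
print(rev_lines)
```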

 

Conclusion

It is my genuine hope that in a subsequent post, I can provide a .NET version for integrating with WordNet meaningfully.  There are some .NET classes and source code out there, but I feel they are neither adequate nor simple to use without a very deep knowledge of WordNet.  This post in particular is confusing (and it was confusing to write).  However, WordNet is also a very useful database that should be in the arsenal of any NLP developer.  Any questions?  Email me!
