Introduction

WordNet is a lexical database created at Princeton University. As of today, it is at version 3.1 (updated in 2012). It groups English words into synonym sets (synsets) and provides short definitions and usage examples. It also records the relations between those words and synsets. It is best to think of WordNet as a combined dictionary and thesaurus. It can be used by many kinds of applications across different industries. However, we will focus on its use in Artificial Intelligence via text analysis.

Reference: https://en.wikipedia.org/wiki/WordNet

Terminology

It's probably best to work through some of the WordNet terminology first, since it is important to understand what the words mean lexically and how they are organized.

Hyperonymy Relation (Noun): The relation between two nouns where Y is a hypernym of X if every X is a (kind of) Y [canine is a hypernym of dog]
Hyponymy Relation (Noun): The relation between two nouns where Y is a hyponym of X if every Y is a (kind of) X [dog is a hyponym of canine]
Meronymy Relation (Noun): The relation between two nouns where Y is a meronym of X if Y is a part of X [window is a meronym of building]
Hyperonymy Relation (Verb): The relation between two verbs where Y is a hypernym of X if to X is a (kind of) way to Y [to perceive is a hypernym of to listen]

For a full glossary of terms see https://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html

Structure

For the most part, WordNet is divided into 4 total subnets: Nouns, Verbs, Adjectives, Adverbs. These are then divided into the corresponding files as described below:

Nouns
  data.noun
  index.noun
  noun.exc
  dbfiles/noun.* : nouns denoting "*"
Verbs
  data.verb
  index.verb
  verb.exc
  dbfiles/verb.* : verbs of "*"
Adjectives
  data.adj
  index.adj
  adj.exc
  dbfiles/adj.* : all adjective clusters and pertainyms
  adj.ppl : participial adjectives
Adverbs
  data.adv
  index.adv
  adv.exc
  dbfiles/adv.all : all adverbs

All files are ASCII, and fields are generally separated by a single space (unless otherwise noted). Records are separated by newline characters. See https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html for more details on the file numbers and the meaning of the individual lexicographer files.

Each part of speech (pos) has three files of particular importance: data.pos, index.pos, and pos.exc. The index files are alphabetized lists of all words in WordNet for the corresponding pos. Each line ends with a list of byte offsets into the corresponding data file. The words are in lower case. The data files contain the synsets that were specified in the Lexicographer Files, with relational pointers resolved to synset_offset values. The exception files list morphologically irregular inflections alongside their base forms [e.g. cargoes and cargo].
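Because a synset_offset is a plain byte offset, looking up a synset comes down to a seek followed by reading one line. Here is a minimal sketch of that idea in Python. The helper names (build_data_file, read_record_at) and the two shortened records are made up for illustration; a real program would open data.noun (or another data file) in binary mode instead of building a file in memory.

```python
import io

def build_data_file(bodies):
    """Write records into an in-memory file, prefixing each with its own byte offset."""
    buf = io.BytesIO()
    offsets = []
    for body in bodies:
        offset = buf.tell()
        offsets.append(f"{offset:08d}")
        buf.write(f"{offset:08d} {body}\n".encode("ascii"))
    return buf, offsets

def read_record_at(data_file, synset_offset):
    """Seek to an offset taken from an index line and read exactly one record."""
    data_file.seek(int(synset_offset))
    line = data_file.readline().decode("ascii").rstrip("\n")
    # Every record repeats its own offset as the first field: a cheap sanity check.
    if not line.startswith(synset_offset + " "):
        raise ValueError("offset does not point at the start of a record")
    return line

# Hypothetical, shortened records (not copied from a WordNet release).
data_file, offsets = build_data_file([
    "03 n 01 entity 0 000 | placeholder gloss A",
    "05 n 01 dog 0 000 | placeholder gloss B",
])
record = read_record_at(data_file, offsets[1])
print(record)
```

Opening the file in binary mode matters here: the offsets count bytes, so text-mode newline translation could shift them on some platforms.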

Index Format

lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]

lemma: Lower case ASCII word or collocation. Collocations are joined by the _ character.
pos: Category of word (n = noun, v = verb, a = adjective, r = adverb)
synset_cnt: Number of synsets that lemma is in. This is the number of senses of the word in WordNet.
p_cnt: Number of different pointers that lemma has in all synsets containing it.
ptr_symbol: One pointer symbol per pointer type. There are p_cnt of them, separated by spaces.
sense_cnt: Same as synset_cnt above. Redundant.
tagsense_cnt: Number of senses of lemma that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_offset: Byte offset in the data.pos file of a synset containing lemma. Each offset corresponds to a different sense of lemma in WordNet. Offsets are always 8 digits long. There are synset_cnt offsets, separated by spaces.
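To make the field order concrete, here is a small Python sketch that splits one index line into the fields listed above. The IndexEntry class and parse_index_line function are names I made up, and the sample line is illustrative: it follows the documented layout, but its counts and offsets are placeholders rather than values from a real release.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IndexEntry:
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[str]

def parse_index_line(line: str) -> IndexEntry:
    """Parse one line of an index.pos file (space-separated fields)."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # p_cnt tells us how many pointer symbols follow before the counts resume.
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    # Offsets stay as strings so the 8-digit zero padding is preserved.
    synset_offsets = rest[2:]
    if len(synset_offsets) != synset_cnt:
        raise ValueError("synset_cnt does not match the number of offsets")
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)

# Illustrative line with placeholder offsets.
sample = "dog n 2 3 @ ~ #m 2 1 00001234 00005678"
entry = parse_index_line(sample)
print(entry)
```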

Data Format

synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss

synset_offset: Byte offset of this record in the file, written as an 8-digit decimal integer.
lex_filenum: Two-digit decimal integer identifying the lexicographer file that contains the synset.
ss_type: synset_type (n = Noun, v = Verb, a = Adjective, s = Adjective Satellite, r = Adverb)
w_cnt: Two-digit hexadecimal integer indicating the number of words in the synset.
word: ASCII form of a word as entered in the synset.
lex_id: One-digit hexadecimal integer that, together with the word, uniquely identifies a sense within a lexicographer file. 0 is the default and is therefore omitted in the lexicographer files.
p_cnt: Three-digit decimal integer indicating the number of pointers from this synset to others.
ptr: Pointer from this synset to another
frames (data.verb only): List of numbers corresponding to the generic verb sentence frames for word in the synset
gloss: A definition or example sentence or both. Each synset contains a gloss.

ptr has the following internal format:

pointer_symbol synset_offset pos source/target

pointer_symbol: Symbol identifying the type of relation the pointer represents.
synset_offset: Byte offset of the target synset in the data file corresponding to pos.
pos: Part of speech of the target synset.
source/target: 4-digit hex field where the first two digits give the word number in the source (current) synset and the last two give the word number in the target synset. A value of 0000 means the pointer_symbol is a semantic relation between the current synset as a whole and the target synset given by synset_offset.

frames has the following internal format:

f_cnt + f_num w_num [+ f_num w_num...]

f_cnt: two digit integer indicating the number of generic frames listed.
f_num: two digit integer frame number
w_num: two digit hex integer indicating the word in the synset that the frame applies to. If 00, applies to all words in the synset.
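The data line is easier to follow once you walk it field by field, since the counts (w_cnt, p_cnt, f_cnt) tell you how many groups come next. Below is a rough Python sketch of that walk. The function name parse_data_line and the dictionary layout are my own choices, and both sample lines are made up to match the documented layout; they are not copied from a WordNet release.

```python
def parse_data_line(line):
    """Parse one data.pos record into a dictionary."""
    # Everything after the first " | " is the gloss.
    body, _, gloss = line.partition(" | ")
    f = body.split()
    w_cnt = int(f[3], 16)          # hexadecimal word count
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))   # (word, lex_id)
        i += 2
    p_cnt = int(f[i])              # decimal pointer count
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, offset, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "offset": offset,
            "pos": pos,
            "source": int(src_tgt[:2], 16),   # 0 means the whole synset
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if i < len(f):                 # only data.verb records carry frames
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            # each frame is written as: + f_num w_num
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return {
        "synset_offset": f[0],
        "lex_filenum": int(f[1]),
        "ss_type": f[2],
        "words": words,
        "pointers": pointers,
        "frames": frames,
        "gloss": gloss.strip(),
    }

# Made-up sample records with placeholder offsets and glosses.
noun = parse_data_line(
    "00001234 05 n 02 dog 0 domestic_dog 0 002 @ 00005678 n 0000 "
    "~ 00009999 n 0000 | placeholder gloss")
verb = parse_data_line(
    "00002222 30 v 01 run 0 001 @ 00003333 v 0000 02 + 01 00 + 02 00 "
    "| placeholder gloss")
print(noun["words"], verb["frames"])
```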

Exceptions Format

Each line holds an inflected word followed by one or more base words, separated by spaces. The first column is the inflected word, while the remaining columns are its base words.
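Loading an exception file is mostly a matter of splitting each line, as this short Python sketch shows. The function name parse_exceptions is made up, and the sample lines are illustrative entries written in the layout described above.

```python
def parse_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = parts[1:]
    return table

# Illustrative entries; the last one has two base forms.
sample = ["cargoes cargo", "geese goose", "axes ax axis"]
exceptions = parse_exceptions(sample)
print(exceptions["cargoes"])
```

With a real file you would pass the open file object (for example, noun.exc) instead of the sample list.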

Lexicographer Files

Reference: https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html#toc4

These files hold the detailed relational analysis of lexical semantics. Basically, they hold the relationships between words that together represent "lexical knowledge." This may be the most difficult of the file formats to understand, so I'll try to spell out as many of the symbols and meanings as I can.

These files are located in the dbfiles subfolder of dict. They are named pos.suffix, where suffix is the synset group (e.g. animal, plant, etc).

Pointers are used to represent the relations between words in one synset and another. A relation from a source to a target synset is formed by specifying, in the source synset, a word from the target synset followed by the pointer_symbol (as described in the format section). The full list of symbols can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc3

Verb frames are used where example sentences are not normally available; they represent simple ways a verb can be used in a sentence. The full list of frame numbers can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc4

Format

Synsets are entered 1 per line and each line is terminated with a newline character. The general syntax is:

{ words pointers ( gloss ) }

Note: Everything can be separated by either a space or tab.

The syntax is valid for all synsets except verbs. Verb syntax is:

{ words pointers frames ( gloss ) }

Adjectives can be clustered containing 1 or more head synsets and optional satellite synsets. The syntax is:

[ head synset [satellite synsets] [-] [additional head/satellite synsets] ]

Note: These synsets can span multiple lines. A separator is a line containing nothing but one or more hyphens. Synsets within the square brackets follow the general syntax above.

Comments are enclosed in parentheses and placed outside of any synset.

A word can have a few different syntaxes. These have various markers and lexical ids along with pointers.

Word Syntax:

word[ ( marker ) ][lex_id],

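word: Any combination of upper/lower case unless in an adjective cluster. In collocations, spaces are replaced with an underscore _ character. Numbers may be entered by following them with a double-quote " character.
lex_id: Unique sense id for a given word within a given file. The range is 1 to 15. The default is 0 and does not have to be specified.
marker: Adjectives only. A syntactic marker that restricts where the adjective can appear relative to the noun it modifies: p (predicate position), a (prenominal, i.e. attributive, position), ip (immediately postnominal position).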

When 1 or more pointers apply only to a specific word rather than to the whole synset, brackets are used. This is referred to as a word/pointer set. Each set holds exactly 1 word but any number of pointers, and a synset can contain any number of these sets. The syntax is below:

[ word[ ( marker ) ][lex_id], pointers ]

For verbs, the word syntax can be extended for frames that only correspond to one word rather than the whole synset:

[ word, [pointers] frames ]

Pointer Syntax:

Pointers are optional in synsets. If a pointer is specified outside of a word/pointer set, the relation is applied to all words in the synset. The syntax can take one of two forms:

[lex_filename:]word[lex_id],pointer_symbol

[lex_filename:]word[lex_id]^word[lex_id],pointer_symbol

In a pointer, word refers to the word in the other synset. When the second form is used, the first word indicates a word in the head synset and the second is the word in a satellite of that cluster. word may be followed by a lex_id for targeting the correct synset. If it is in another file, lex_filename is used.
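To show how the pieces of a lexicographer-file pointer fit together, here is a rough Python sketch that splits a single pointer token into its parts. Treat it as an approximation under a few assumptions of mine: the token contains no commas other than the one before pointer_symbol, a colon only appears after lex_filename, and a trailing run of one or two digits is the lex_id. The function name parse_lex_pointer and the sample tokens are made up for illustration.

```python
import re

# A word optionally followed by a 1-2 digit lex_id at the very end.
_WORD = re.compile(r"^(?P<word>.*?)(?P<lex_id>\d{1,2})?$")

def _split_word(text):
    match = _WORD.match(text)
    # A missing lex_id means the default of 0.
    return match.group("word"), int(match.group("lex_id") or 0)

def parse_lex_pointer(token):
    """Split '[lex_filename:]word[lex_id][^word[lex_id]],pointer_symbol'."""
    target, comma, symbol = token.rpartition(",")
    if not comma:
        raise ValueError("pointer is missing ',pointer_symbol'")
    lex_filename = None
    if ":" in target:
        lex_filename, target = target.split(":", 1)
    # The optional ^ separates a head-synset word from a satellite word.
    head, caret, satellite = target.partition("^")
    return {
        "lex_filename": lex_filename,
        "word": _split_word(head),
        "satellite": _split_word(satellite) if caret else None,
        "pointer_symbol": symbol,
    }

# Hypothetical tokens in the two documented forms.
simple = parse_lex_pointer("noun.animal:dog1,@")
clustered = parse_lex_pointer("heavy^weighty2,&")
print(simple)
print(clustered)
```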

Verb Frame Syntax:

For verbs, the associated frames are delineated as follows:

frames: f_num[,f_num...]

where f_num represents the frame type referenced in the link above.

Gloss Syntax:

A gloss is included in all synsets and is enclosed in parentheses at the end of the synset.

Special Adjective Syntax:

There is special syntax for representing antonymous adjective synsets. The first word must be entered in upper case and is considered the head word of the head synset. Refer to https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc10 for more details if needed.

Some examples of this format are provided by Princeton here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc11

index.sense

This file provides an alternate method for accessing synsets and word senses in the database. It is useful to applications that need the synset for one specific sense, rather than all the senses of a word or collocation.

Format

sense_key synset_offset sense_number tag_cnt

sense_key: Encoding of the word sense
synset_offset: byte offset that sense can be found in the data file corresponding to the part of speech.
sense_number: sense number of the word
tag_cnt: number of times the sense is tagged in various semantic concordance texts.

The sense_key has its own encoding as shown below:

lemma%ss_type:lex_filenum:lex_id:head_word:head_id

lemma: ASCII text of the word or collocation found in WordNet
ss_type: Integer representing the synset type for the sense. (1 = Noun, 2 = Verb, 3 = Adjective, 4 = Adverb, 5 = Adjective Satellite)
lex_filenum: Lexicographer File Number (Two Digit Number)
lex_id: Two-digit integer that uniquely identifies a sense within a lexicographer file. 00 is the default and is therefore omitted in the lexicographer files.
head_word: Lemma for the first word of the satellite's head synset. Only present if sense is an adjective satellite synset.
head_id: Uniquely identifies the sense of head_word within the file. Only present if head_word is present.

Note: The same sense_key encoding is used by the other files that reference individual senses, such as cntlist.
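A sense_key can be taken apart with two splits: one on the % and one on the colons. Here is a minimal Python sketch of that. The function name parse_sense_key and the returned dictionary layout are my own, and the second sample key is purely illustrative of the satellite form.

```python
# Mapping from the ss_type digit to a readable name (from the list above).
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb",
            5: "adjective satellite"}

def parse_sense_key(sense_key):
    """Split lemma%ss_type:lex_filenum:lex_id:head_word:head_id into fields."""
    lemma, _, rest = sense_key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "ss_type": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # The last two fields are empty unless the sense is a satellite.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }

plain = parse_sense_key("dog%1:05:00::")
satellite = parse_sense_key("fast%5:00:00:hurried:00")  # illustrative key
print(plain)
print(satellite)
```

Note that the two trailing colons are still present when head_word and head_id are empty, which is why the split always yields five parts.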

cntlist / cntlist.rev

Another important pair of files is cntlist / cntlist.rev. These record how many times each word sense is tagged in the semantic concordance texts, which you can loosely think of as how frequently the sense is used in everyday language. cntlist is sorted by tag_cnt in descending order (most frequent first), while cntlist.rev is sorted by sense_key.

cntlist Format

tag_cnt sense_key sense_number

cntlist.rev Format

sense_key sense_number tag_cnt

tag_cnt: Number of times the sense is tagged in the semantic concordance texts.
sense_number: Sense number of the word, as in index.sense.
sense_key: WordNet sense encoding (see above).
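The two files carry the same three fields in a different order, so a pair of tiny parsers can normalize them into one shape. The sketch below is in Python; the function names and the sample lines (including the count of 42) are made up for illustration and are not taken from a real cntlist file.

```python
def parse_cntlist_line(line):
    """cntlist order: tag_cnt sense_key sense_number."""
    tag_cnt, sense_key, sense_number = line.split()
    return {"sense_key": sense_key,
            "sense_number": int(sense_number),
            "tag_cnt": int(tag_cnt)}

def parse_cntlist_rev_line(line):
    """cntlist.rev order: sense_key sense_number tag_cnt."""
    sense_key, sense_number, tag_cnt = line.split()
    return {"sense_key": sense_key,
            "sense_number": int(sense_number),
            "tag_cnt": int(tag_cnt)}

# Placeholder counts; both lines describe the same sense.
forward = parse_cntlist_line("42 dog%1:05:00:: 1")
reverse = parse_cntlist_rev_line("dog%1:05:00:: 1 42")
print(forward == reverse)
```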

Conclusion

It is my genuine hope that in a subsequent post, I can provide a .NET library for integrating with WordNet in a meaningful way. There are some .NET classes and source code out there, but I feel they are neither adequate nor simple to use without a very deep knowledge of WordNet. This post in particular is confusing (it was confusing to write, too). However, WordNet is also a very useful database that should be in the arsenal of any NLP developer. Any questions? Email me!
