WordNet is a lexical database created at Princeton University. As of today, it is at version 3.1 (updated in 2012). It groups English words into synonym sets (synsets) and provides short definitions and examples. It also records the relations between these synsets and their words. It is best to think of WordNet as a combined dictionary and thesaurus. It can be used by many kinds of applications across different industries. However, we will focus on its use in Artificial Intelligence via text analysis.
It's probably best to work through some of the WordNet terminology first, since it is important to understand what the terms mean and how the words are organized.
For a full glossary of terms see https://wordnet.princeton.edu/wordnet/man/wngloss.7WN.html
For the most part, WordNet is divided into four subnets: nouns, verbs, adjectives, and adverbs. These are then divided into the corresponding files as described below:
All files are ASCII, and fields are generally separated by a single space (unless otherwise noted). Records are separated by newline characters. See https://wordnet.princeton.edu/wordnet/man/lexnames.5WN.html for more details on the file numbers and the meaning of the individual files.
Each part of speech (pos) has three important files: data.pos, index.pos, and pos.exc. The index files are alphabetized lists of all the words in WordNet for the corresponding pos. The words are in lower case, and each line includes a list of byte offsets into the corresponding data file. The data files contain the synsets that were specified in the lexicographer files, with relational pointers resolved to synset_offset values. The exception files help with morphologically irregular inflections of words (e.g. cargoes and cargo).
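To make the byte-offset idea concrete, here is a minimal sketch in Python. It assumes that each record in a data file starts at the byte offset named by its own synset_offset field, so a reader can seek straight to it. The function name read_synset_line and the two sample records are made up for illustration; they are not real WordNet data.

```python
import io


def read_synset_line(data_file, synset_offset: int) -> str:
    """Seek to a byte offset in an open data file and return that record."""
    data_file.seek(synset_offset)
    return data_file.readline().decode("ascii").rstrip("\n")


# Build a tiny in-memory stand-in for a data.pos file (hypothetical records).
first = b"00000000 first made-up record\n"
second_offset = len(first)  # the second record starts right after the first
second = b"%08d second made-up record\n" % second_offset
data = io.BytesIO(first + second)

print(read_synset_line(data, second_offset))
```

With a real file you would open data.noun (or another data.pos file) in binary mode and pass in an offset taken from the matching index file.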
Index file format:
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
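As a rough illustration of how the index line layout above can be read, the following Python sketch splits one line into its fields. The names IndexEntry and parse_index_line are my own, and the sample line (including its offsets) is invented rather than copied from WordNet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    lemma: str
    pos: str
    pointer_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]


def parse_index_line(line: str) -> IndexEntry:
    """Parse one index.pos line following the field order shown above."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    i = 4
    pointer_symbols = fields[i:i + p_cnt]  # p_cnt pointer symbols follow
    i += p_cnt
    sense_cnt = int(fields[i])
    tagsense_cnt = int(fields[i + 1])
    i += 2
    offsets = [int(f) for f in fields[i:i + synset_cnt]]
    if len(offsets) != synset_cnt:
        raise ValueError("line has fewer synset offsets than synset_cnt")
    return IndexEntry(lemma, pos, pointer_symbols, sense_cnt, tagsense_cnt, offsets)


# Hypothetical line, not taken from a real index file.
entry = parse_index_line("example n 2 2 @ ~ 2 1 00001740 00002137")
print(entry.lemma, entry.pointer_symbols, entry.synset_offsets)
```

The offsets are converted to integers because they are meant to be used as seek positions in the matching data file.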
Data file format:
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
ptr has the following internal format:
pointer_symbol synset_offset pos source/target
frames has the following internal format:
f_cnt + f_num w_num [+ f_num w_num...]
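The data line layout, including the ptr and frames sub-formats, can be sketched in Python as follows. This is only an illustration under a few assumptions that come from my reading of the WordNet database documentation rather than from the formats above: w_cnt, lex_id, and w_num are hexadecimal, the other counts are decimal, and source/target is a four-character field holding two two-digit hexadecimal word numbers. The function name parse_data_line and the sample line are made up.

```python
from typing import Dict


def parse_data_line(line: str) -> Dict[str, object]:
    """Parse one data.pos line into words, pointers, frames, and gloss."""
    body, _, gloss = line.partition("|")
    f = body.split()
    synset_offset, lex_filenum, ss_type = int(f[0]), int(f[1]), f[2]
    w_cnt = int(f[3], 16)  # assumed hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((f[i], int(f[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(f[i])
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target_offset, pos, src_tgt = f[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "offset": int(target_offset),
            "pos": pos,
            "source": int(src_tgt[:2], 16),  # 0 means the whole synset
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if i < len(f):  # anything left before the gloss is the frames section
        f_cnt = int(f[i])
        i += 1
        for _ in range(f_cnt):
            # each frame is written as "+ f_num w_num"
            frames.append((int(f[i + 1]), int(f[i + 2], 16)))
            i += 3
    return {
        "synset_offset": synset_offset, "lex_filenum": lex_filenum,
        "ss_type": ss_type, "words": words, "pointers": pointers,
        "frames": frames, "gloss": gloss.strip(),
    }


# Hypothetical verb record, not real WordNet data.
sample = "00001234 29 v 02 foo 0 bar 1 01 @ 00005678 v 0000 02 + 02 00 + 08 01 | a made-up gloss"
synset = parse_data_line(sample)
print(synset["words"], synset["frames"])
```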
The exception files are simple lists of inflected forms, each followed by one or more base forms. The first column is the inflected form, while the remaining columns are base forms.
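A small Python sketch of how an exception file could be loaded into a lookup table is shown below. The function name parse_exception_lines is hypothetical, and the sample lines are illustrative entries in the layout just described.

```python
from typing import Dict, Iterable, List


def parse_exception_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each inflected form (first column) to its base forms (other columns)."""
    table: Dict[str, List[str]] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = parts[1:]
    return table


# Illustrative lines in the noun.exc layout.
exceptions = parse_exception_lines(["cargoes cargo", "geese goose", ""])
print(exceptions.get("cargoes", ["cargoes"]))
```

Falling back to the word itself when it is missing from the table is a design choice of this sketch; regular inflections are not listed in the exception files and need separate handling.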
The lexicographer files hold the detailed relational analysis of lexical semantics. Basically, they hold the relationships between words that represent "lexical knowledge." This may be the most difficult of the file formats to understand, so I'll try to spell out the symbols and their meanings as best I can.
These files are located in the dbfiles subfolder of dict. They are named pos.suffix, where suffix is the synset group (e.g. animal, plant, etc.).
Pointers are used to represent the relations between words in one synset and another. A relation from a source to a target synset is formed by specifying, in the source synset, a word from the target synset followed by the pointer_symbol (as described in the format section). The full list of symbols can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc3
Verb frames are used when example sentences are not normally available; they represent simple ways a verb can be used in a sentence. The full list of frame numbers can be found here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc4
Format:
Synsets are entered one per line, and each line is terminated with a newline character. The general syntax is:
{ words pointers ( gloss ) }
Note: Everything can be separated by either a space or tab.
The general syntax is valid for all synsets except verbs. Verb syntax is:
{ words pointers frames ( gloss ) }
Adjectives can be organized into clusters containing one or more head synsets and optional satellite synsets. The syntax is:
[
head synset
[satellite synsets]
[-]
[additional head/satellite synsets]
]
Note: A cluster can span multiple lines. The parts of a cluster are separated by a line containing only hyphens (any number of them). Synsets within the square brackets follow the general syntax above.
Comments are enclosed in parentheses on a separate line.
A word can have a few different syntaxes. These have various markers and lexical ids along with pointers.
word[ ( marker ) ][lex_id],
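When one or more pointers apply only to a specific word, rather than to every word in the synset, brackets are used. This is referred to as a word/pointer set. Each set contains exactly one word but may contain any number of pointers, and a synset may contain any number of word/pointer sets. The syntax is below:
[ word[ ( marker ) ][lex_id], pointers ]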
For verbs, the word/pointer syntax can be extended with frames that apply only to one word rather than the whole synset:
[ word, [pointers] frames ]
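Pointers are optional in synsets. If a pointer is specified outside of a word/pointer set, the relation is applied to all words in the synset. The syntax can take one of two forms:
[lex_filename:]word[lex_id], pointer_symbol
[lex_filename:]word[lex_id]^word[lex_id], pointer_symbol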
In a pointer, word refers to the word in the other synset. When the second form is used, the first word indicates a word in the head synset and the second is the word in a satellite of that cluster. word may be followed by a lex_id for targeting the correct synset. If it is in another file, lex_filename is used.
Verb Frame Syntax:
Associated frames are delineated as follows when referencing verbs:
frames: f_num [, f_num...]
where f_num represents the frame type referenced in the link above.
A gloss is included in all synsets and is enclosed in parentheses at the end of the synset.
Special Adjective Syntax:
There is special syntax for representing antonymous adjective synsets. The first word must be entered in upper case and is considered the head word of the head synset. Refer to https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc10 for more details if needed.
Some examples of this format are provided by Princeton here: https://wordnet.princeton.edu/wordnet/man/wninput.5WN.html#toc11
The sense index file (index.sense) provides an alternate method for accessing synsets and word senses in the database. It is useful to applications that retrieve synsets or other information for one specific sense, rather than all the senses of a word or collocation. Each line has the format:
sense_key synset_offset sense_number tag_cnt
The sense_key has its own encoding as shown below:
lemma%lex_sense
where lex_sense is encoded as:
ss_type:lex_filenum:lex_id:head_word:head_id
Note: This schema for sense_key applies to the other file types as well.
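For readers who want to take a sense_key apart in code, here is a hedged Python sketch. It assumes the encoding lemma%ss_type:lex_filenum:lex_id:head_word:head_id, with head_word and head_id left empty for everything except adjective satellites. The names SenseKey, parse_sense_key, and parse_sense_index_line are mine, and the sample keys and offsets are illustrative only.

```python
from typing import NamedTuple, Tuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is an adjective satellite
    head_id: str    # kept as text because it is usually empty


def parse_sense_key(key: str) -> SenseKey:
    """Split a sense_key into its lemma and lex_sense components."""
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


def parse_sense_index_line(line: str) -> Tuple[SenseKey, int, int, int]:
    """Parse 'sense_key synset_offset sense_number tag_cnt'."""
    key, synset_offset, sense_number, tag_cnt = line.split()
    return parse_sense_key(key), int(synset_offset), int(sense_number), int(tag_cnt)


# Hypothetical line, not copied from a real index.sense file.
record = parse_sense_index_line("example%1:09:00:: 00001740 1 5")
print(record)
```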
Another important pair of files is cntlist and cntlist.rev. They record how many times each sense was tagged in the semantic concordances, which is a good rough proxy for how frequently a sense is used in everyday language. cntlist is sorted in descending order of tag count (most frequent first), while cntlist.rev is sorted by sense_key. Their record formats, cntlist first and cntlist.rev second, are:
tag_cnt sense_key sense_number
sense_key sense_number tag_cnt
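Since the two count files carry the same three fields in a different order, one small helper can normalize both. The Python sketch below assumes the first layout above belongs to cntlist and the second to cntlist.rev; the function name parse_cntlist_line, its reversed_layout flag, and the sample lines are hypothetical.

```python
from typing import Tuple


def parse_cntlist_line(line: str, reversed_layout: bool = False) -> Tuple[str, int, int]:
    """Return (sense_key, sense_number, tag_cnt) for a line of either file.

    reversed_layout=False expects 'tag_cnt sense_key sense_number' (cntlist).
    reversed_layout=True expects 'sense_key sense_number tag_cnt' (cntlist.rev).
    """
    a, b, c = line.split()
    if reversed_layout:
        return a, int(b), int(c)
    return b, int(c), int(a)


# Made-up lines carrying the same information in both layouts.
from_cntlist = parse_cntlist_line("15 example%1:09:00:: 1")
from_rev = parse_cntlist_line("example%1:09:00:: 1 15", reversed_layout=True)
print(from_cntlist == from_rev)
```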
It is my genuine hope that in a subsequent post I can provide a .NET library for integrating with WordNet in a meaningful way. There are some .NET classes and source code out there, but I feel they are neither adequate nor simple to use without a very deep knowledge of WordNet. This post in particular is confusing (and it was confusing to write). However, WordNet is a very useful database that should be in the arsenal of any NLP developer. Any questions? Email me!