How do I create a dictionary of synonyms that is efficient in terms of retrieving synonyms of a word?

Basically, I want to use some sort of data storage to store groups of words, and enable an end-user to request any word and be prompted with all the other words in its group (its synonyms). Afterwards, I want to be able to add spelling corrections, suggestions, and relevancy ranking (using edit distance for example).

Also, some groups may contain the same word, so I would like to return both groups separately.

Any ideas how to get there? Any particular database, data structures, concepts, etc. that could help?

Your idea of groups looks good at first glance, but it’s actually overkill in terms of complexity. Imagine we are in a relational DB, with groups and words in a many-to-many (N:N) relationship. To fetch the synonyms of a word, you would have to fetch the word’s group links, then the linked groups, then all the word links of those groups, and finally the words themselves, all for maybe a dozen records in the end. Unless having separate groups is a required feature, I would rather keep things simple: words and links.
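To make those hops concrete, here is a minimal in-memory sketch of the group-based lookup in Python. The sample groups and the names `GROUPS`, `build_index` and `synonyms_by_group` are made up for illustration. It also shows how a word that belongs to several groups can get its groups back separately, if that turns out to be a required feature.

```python
from collections import defaultdict

# Hypothetical sample data: each group is a set of mutually synonymous words.
GROUPS = {
    1: {"big", "large", "huge"},
    2: {"big", "important", "major"},
}

def build_index(groups):
    """Map each word to the ids of the groups that contain it (the N:N link table)."""
    index = defaultdict(set)
    for group_id, words in groups.items():
        for word in words:
            index[word].add(group_id)
    return index

def synonyms_by_group(word, groups, index):
    """Return one sorted synonym list per group containing `word`."""
    return [
        sorted(groups[group_id] - {word})
        for group_id in sorted(index.get(word, ()))
    ]

INDEX = build_index(GROUPS)
```

With this data, `synonyms_by_group("big", GROUPS, INDEX)` returns two separate lists, one per group. The lookup is word, then group ids, then words, which is the same chain of hops a relational join would perform.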

Some databases are better than others at representing this. Since you have a large data set with a lot of relationships but not much structure, it’s probably advisable to go for a NoSQL database that supports lists of some kind. That would let you fetch synonyms faster, because you would avoid a painful join on a huge link table (whether or not you choose the grouped implementation).

For the rest, I don’t see any advice that wouldn’t be opinion-based, so it’s up to you: I see no particular reason to stick with a particular scheme or technology.

First ideas

The basic structure is a simple table lookup {word, synonym}. Up to you to decide if it’s a bidirectional relation (less data in your dictionary) or not (more accurate).
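As a rough illustration of that lookup, here is a small Python sketch. The class name `SynonymTable` and its methods are hypothetical. In this sketch the bidirectional option is applied at insert time, so each pair only has to be entered once, while the one-way option only answers in the direction that was entered.

```python
from collections import defaultdict

class SynonymTable:
    """In-memory {word, synonym} lookup; a sketch, not a persistent store."""

    def __init__(self, bidirectional=True):
        self.bidirectional = bidirectional
        self._links = defaultdict(set)

    def add(self, word, synonym):
        """Record that `synonym` is a synonym of `word`."""
        self._links[word].add(synonym)
        if self.bidirectional:
            # Symmetric relation: the reverse link comes for free.
            self._links[synonym].add(word)

    def lookup(self, word):
        """Return the known synonyms of `word`, sorted alphabetically."""
        return sorted(self._links.get(word, ()))
```

For example, after `table = SynonymTable()` and `table.add("buy", "purchase")`, both `table.lookup("buy")` and `table.lookup("purchase")` find the other word. With `SynonymTable(bidirectional=False)` only the first lookup does.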

However, this is often not sufficient, as the suitable synonyms depend on the context: a synonym of “particle” could be “dust” in everyday language, but “electron” or “proton” in a scientific context. So the basic structure could be enhanced to {word, synonym, context-key-word}.

The problem here is that you would have to enter synonyms for every possible form of a word (e.g. {"particle","dust"}, {"particles","dust"}, {"particle","electron"}, {"particles","electrons"}).
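The {word, synonym, context-key-word} structure could be sketched like this in Python. The rows, the "-" marker for "no specific context" and the function name `synonyms_in_context` are assumptions made for the example. The fallback rule (use context-free synonyms when nothing matches the requested context) is one possible policy among others.

```python
# Hypothetical rows in the {word, synonym, context} shape; "-" means no specific context.
ROWS = [
    ("particle", "dust", "-"),
    ("particle", "electron", "science"),
    ("particle", "proton", "science"),
]

def synonyms_in_context(word, context, rows=ROWS):
    """Prefer synonyms tagged with `context`; fall back to untagged ones."""
    specific = [syn for w, syn, ctx in rows if w == word and ctx == context]
    if specific:
        return specific
    return [syn for w, syn, ctx in rows if w == word and ctx == "-"]
```

Note that only the exact form "particle" is found here. "particles" would need its own rows, which is precisely the duplication problem described above.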

Linguistic properties of a word

Take a verb like {“buy”, “purchase”}. A verb can have several forms (buy, bought, buying): so will you manage the synonyms for each form separately, or will you group them? Same for nouns and adjectives (singular, plural). Only prepositions and adverbs have a single form.

So you could opt for a double structure: on one side the different words, grouped by primary form together with its grammatical variations, and on the other side the synonyms for the primary form.

Dictionary of all the language tokens:

{ infinitive_verb, past_simple, past_participle }
{ singular_noun, plural_noun} 
{ singular_adjective, plural_adjective }
{ preposition } 
{ adverb } 

Synonyms:

{ type_of_token, primary_form, primary_form_of_synonym, context } 

So in the context “science”, a lookup of the synonyms for “electrons” would first go through the dictionary of language tokens and find the noun {"electron","electrons"}, matching the plural form. It would then search the synonyms, find {"electron","particle","science"} and {"electron","dust","-"}, and choose the first one because its context matches. But as the input is a plural, it would then look up {"particle","particles"} and come up with “particles”.
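The walk-through above can be sketched in a few lines of Python. Everything here is a simplifying assumption: only nouns are handled, the sample entries are invented, and the names `NOUNS`, `SYNONYMS`, `find_noun` and `synonym_for` are hypothetical.

```python
# Hypothetical token dictionary: each noun entry is (singular, plural).
NOUNS = [
    ("electron", "electrons"),
    ("particle", "particles"),
    ("dust", "dust"),
]

# Hypothetical rows: (type_of_token, primary_form, primary_form_of_synonym, context)
SYNONYMS = [
    ("noun", "electron", "particle", "science"),
    ("noun", "electron", "dust", "-"),
]

def find_noun(surface):
    """Return (entry, form_index) for the first noun entry containing `surface`."""
    for entry in NOUNS:
        if surface in entry:
            return entry, entry.index(surface)
    return None

def synonym_for(surface, context):
    """Map a surface form to a synonym in the same grammatical form."""
    found = find_noun(surface)
    if found is None:
        return None
    entry, form_index = found
    primary = entry[0]
    candidates = [r for r in SYNONYMS if r[0] == "noun" and r[1] == primary]
    # Prefer a row whose context matches, otherwise a context-free row.
    chosen = next((r for r in candidates if r[3] == context), None)
    if chosen is None:
        chosen = next((r for r in candidates if r[3] == "-"), None)
    if chosen is None:
        return None
    # Re-inflect the synonym with the same form (singular/plural) as the input.
    target = next((e for e in NOUNS if e[0] == chosen[2]), None)
    return target[form_index] if target else chosen[2]
```

Calling `synonym_for("electrons", "science")` follows exactly the three steps described: token dictionary, synonym table, then back to the token dictionary for the plural.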

Up to you to see if the contextual analysis could be a benefit, or if you’d like to show all the synonyms and just mention the special contexts where there are some.

And you have to manage ambiguity: you could find nouns and verbs with the same form for example.

The database

You could very well use a traditional relational DB. The dictionary could be stored in several tables (each having the appropriate structure), and the synonyms in a single table. Advantage: with indexes on the relevant columns, you could search on any column very efficiently.
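As a sketch of what the relational variant might look like, here is a minimal example using Python's built-in `sqlite3` module with an in-memory database. The table names, column names and sample rows are assumptions, and only the noun table is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE noun (singular TEXT PRIMARY KEY, plural TEXT NOT NULL);
    CREATE TABLE synonym (
        type_of_token        TEXT NOT NULL,
        primary_form         TEXT NOT NULL,
        synonym_primary_form TEXT NOT NULL,
        context              TEXT NOT NULL DEFAULT '-'
    );
    -- Indexes on the columns used for searching.
    CREATE INDEX idx_noun_plural ON noun (plural);
    CREATE INDEX idx_synonym_primary ON synonym (primary_form, context);
""")
conn.executemany(
    "INSERT INTO noun VALUES (?, ?)",
    [("electron", "electrons"), ("particle", "particles")],
)
conn.executemany(
    "INSERT INTO synonym VALUES (?, ?, ?, ?)",
    [("noun", "electron", "particle", "science")],
)

def plural_synonyms(plural, context):
    """Resolve a plural noun to the plural forms of its synonyms in `context`."""
    rows = conn.execute(
        """
        SELECT target.plural
        FROM noun AS source
        JOIN synonym ON synonym.primary_form = source.singular
        JOIN noun AS target ON target.singular = synonym.synonym_primary_form
        WHERE source.plural = ? AND synonym.context = ?
        ORDER BY target.plural
        """,
        (plural, context),
    ).fetchall()
    return [row[0] for row in rows]
```

A single query with two joins covers the whole path from the inflected input to the inflected synonym.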

Another approach could use a NoSQL database. If you used a document DB, you could store the data almost as described above. If you opted for a key-value store, you would have to break the dictionary part into pairs, ultimately storing more data. The easiest form to manage would certainly be a graph DB, in which you could navigate across the relationships between the words.
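To give an idea of what navigating across relationships means, here is a small Python sketch that walks an in-memory adjacency map breadth-first. It only stands in for a real graph DB: the sample graph and the function name `related` are invented for the example.

```python
from collections import deque

# Hypothetical synonym graph: word -> directly linked words.
GRAPH = {
    "buy": {"purchase"},
    "purchase": {"buy", "acquire"},
    "acquire": {"purchase", "obtain"},
    "obtain": {"acquire"},
}

def related(word, max_hops, graph=GRAPH):
    """Breadth-first walk returning {word: hops} for words within `max_hops`."""
    distances = {word: 0}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    del distances[word]  # the word is not its own synonym
    return distances
```

The hop count could also serve as a crude relevancy score: direct synonyms come back with 1, synonyms of synonyms with 2, and so on.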

More thoughts

If you look at verbs, most verbs have a radical (stem), to which you could add “ing” or “ed” to find the other forms. Only a few hundred verbs need the full alternate forms, as they are not predictable (from a database point of view).

You could of course take advantage of such rules to considerably decrease the size of your database. But it would also make the search process more complex, as you would have to preprocess the words (e.g. “parking”) to see if they match some of the rules (e.g. “-ing”) and to decide in which part of the dictionary you’d have to look (e.g. the regular verb “park” and the noun “parking”).
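A minimal Python sketch of that preprocessing step could look like the following. The tiny word sets, the rule list and the function name `analyse` are hypothetical, and the rules deliberately ignore spelling changes (such as a doubled consonant or a dropped “e”), which a real implementation would have to handle.

```python
# Hypothetical miniature dictionaries.
REGULAR_VERBS = {"park", "walk", "look"}
NOUNS = {"parking", "walk"}

# Suffix rules for regular verbs: (suffix, description of the form).
SUFFIX_RULES = [("ing", "present participle"), ("ed", "past form")]

def analyse(word):
    """Return every (primary_form, description) reading of `word`."""
    readings = []
    if word in NOUNS:
        readings.append((word, "noun"))
    if word in REGULAR_VERBS:
        readings.append((word, "verb, base form"))
    for suffix, label in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in REGULAR_VERBS:
                readings.append((stem, f"verb, {label}"))
    return readings
```

For “parking” this returns both the noun reading and the verb reading of “park”, so the caller still has to resolve the ambiguity, for instance by showing synonyms for both.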