How to reuse parts of a Trie data structure (like a Trie DAG)?

  softwareengineering

I am playing around with a cross-language spell-check sort of thing, and am still in the prototyping/ideation phases. Basically I have thought of something like a Trie data structure, but I keep wanting to “reuse” parts of the trie, but can’t figure out if it’s possible. Wondering if you can just tell me if it’s possible to do a Trie DAG sort of thing or not, and if not, why not.

Basically we can have these sorts of words/fragments, which together form larger words:

u
  n
    *
e
  n
    *
r
  e
    a
      s
        o
          n
a
  w
    a
      r
        e
f
  a
    i
      r
*
  a
    b
      l
        e
*
  n
    e
      s
        s

If this is a trie, we can generate these words:

enable
reason
reasonable
unreasonable
unreasonableness
aware
awareness
fair
fairness

Now imagine there are a million fragments in there. In Sanskrit we have a million fragments, plus a thousand suffixes per fragment, so there’s like a billion possibilities at least, eek….

So I thought “what if we could reuse the parts of the trie that encodes the words/fragments? So reason or able were only in the trie once.

un-----
      |
      reason
            |
            able
           |    |
         en     ness
               |
            fair
               |
           aware

However, it seems to breakdown because, in this case at least, en-able-ness is not a word, so how can we tell the trie to not include that? And if you allow compound words (like in Sanskrit), where you might have words like “realestate” or “icecream” or breakfast”, you reuse the base words in there:

estate---

un-----
      |
      reason
            |
            able
           |    |
         en     ness
               |  |
            fair  |
               |  |
           aware  |
                  |
          real----
                  |
                  ----estate

And finally, the trie doesn’t support reuse of branches, you muddy the waters so to speak with extra words that aren’t in your desired set.

So there are at least 3 glaring problems:

  1. Non-words: Words like “enableness” would be a problem.
  2. Word reuse: Compound words and such are a problem.
  3. Not supported by the trie: The trie can’t handle circularity.

Is it possible to do this in any way, to get sort of a Trie DAG?

I tried to create a data structure that was basically a double trie, where the main trie was the glyphs in the main base words/fragments, and the secondary trie was for linking together fragments.

type TrieNode = {
  childGlyphs: Record<string, TrieNode>
  childFragments: Record<string, TrieNode>
}

And when you get to a w a r e it then links to a “child fragment” of the n e s s node. But that would link to probably just before the n in n e s s, and what if we had this?

n
  e
    s
      s
  o
    t
      e
    r
      m
        a
          l
  ....

The whole subtree would get included in our word for “awareness”, so we would also have words for “awarenormal” and “awarenote” and stuff.

So I thought about trying to keep track of the serialized words, and think about adding an array of arrays, containing all the valid paths for each trie node:

n
  e
    s
      s [[aware, ness, id], [reason, able, ness, id], [un, reason, able, ness, id]]
  o
    t
      e [[note, id]]
    r
      m
        a
          l [[normal, id]]
  ....

Or even having those IDs stored in a tertiary trie! Stored at the end of the glyph trie nodes. But what am I gaining? Is that going to be any less memory intensive than having a trie for all glyphs, duplicated for every usage of a word or fragment?

So now I keep going back to, why not just duplicate every word in every usage?

The problem I foresee with that is, there could be 1000 prefixes, 1 million words, and 1000 suffixes. That would be a lot of duplication, repeating all million words and suffixes 1000 times. Hence my desire to make things more compact, and reuse parts of the trie, like a Trie DAG.

Is it possible? Or no? If not, can you explain why it’s not possible? If so, can you explain briefly how to accomplish reusage of the trie branches? It doesn’t necessarily have to be a a trie data structure, if it’s somehow possible to do with a state machine or something else, please let me know. I have thought about FSMs, but don’t see how that would be any different at a high level than a trie DAG, and run into the same problems.

LEAVE A COMMENT