How do traditional search engines like Lucene handle tokenization and indexing, and why don’t they use integer mappings for tokens?
I’ve been learning about how traditional search engines like Lucene work, and I understand that they typically build an inverted index by tokenizing the text in the corpus. These tokens are then used directly as keys in the index, rather than being mapped to integer IDs first (see the rough sketch below of what I mean).
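To make my mental model concrete, here's a minimal toy sketch in plain Java of token strings used directly as index keys. This is just my own illustration, not Lucene's actual internals, and the class and method names (`ToyInvertedIndex`, `addDocument`, `search`) are made up for the example:

```java
import java.util.*;

public class ToyInvertedIndex {
    // Maps each token string directly to the list of doc IDs containing it.
    // No token -> integer ID mapping step in between.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Naive tokenizer: lowercase and split on non-letter characters.
    private static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                     .filter(t -> !t.isEmpty())
                     .toList();
    }

    public void addDocument(int docId, String text) {
        for (String token : tokenize(text)) {
            // Duplicates are possible if a token repeats in a document;
            // a real index would track term frequencies instead.
            postings.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
        }
    }

    public List<Integer> search(String token) {
        return postings.getOrDefault(token.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "The quick brown fox");
        index.addDocument(1, "A quick red fox");
        System.out.println(index.search("quick")); // prints [0, 1]
    }
}
```

Given that the token strings themselves serve as the lookup keys here, I'd like to understand why engines like Lucene don't instead assign each unique token a compact integer ID and index on those.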