Need raw text separated into just title and content from the English Wikipedia dump
I am working on a full-text search implementation (a sort of matching algorithm) using a tool called tantivy-py. I tried it with a small text source and it worked smoothly. Now I want to test it on a very large text source, so I went ahead and downloaded the English Wikipedia dump (an XML file). Uncompressed, it's around 92 GB.
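Since the dump is far too large to load into memory, one common approach is to stream it with `xml.etree.ElementTree.iterparse`, pulling out each `<page>`'s `<title>` and `<revision>/<text>` and discarding the element once it has been yielded. Here is a minimal sketch; the `iter_pages` helper and the inline `sample` string are my own illustrations (the real dump uses the same `<page>`/`<title>`/`<revision>`/`<text>` layout), and the export namespace version should be checked against the `<mediawiki>` root element of your actual file.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by MediaWiki export dumps; verify the version string
# against the <mediawiki> root element of your own dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(source):
    """Yield (title, text) pairs one page at a time, without ever
    holding the whole file in memory -- important for a ~92 GB dump."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            yield title, text
            elem.clear()  # drop the parsed page so memory stays flat

# Tiny inline sample standing in for the real dump file:
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Example article body.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(sample)))
print(pages)  # [('Example', 'Example article body.')]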