I have a python program which collects links from wikipedia, and stores the article names in one file, and the links between them in another.
For the first file, every article name is stored, and then padded to be a length of 32 bytes. For example, if the article name was Cat, it would be stored as:
43 61 74 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
All the article names are stored like this all in one file. The second file is similar, but each entry is 8 bytes long and represents a link from a page to another. For example:
00 00 00 00 00 00 00 01
Would mean that the the 0th article in the index would link to the 1st article in the index.
My question is, how can I work out the shortest route from article A to B, if I have all the links stored in the format specified above?
(And yes, the code is in python)
0
I would suggest using a Graph data model. You could load the data into Neo4j (a graph database) and use Cypher or py2neo to compute the shortest path, or TitanDB (another graph database) and use Goblin (still in development) or Cayley (a framework that provides graph-like tools on top of another database technology).
If your data set is small enough to fit in memory, and you don’t want to persist it in a database, you could use the NetworkX python library or the graph-tools python library or the iGraph python library.