I’ve been reading the 3rd edition of [Algorithms][1] by Cormen, Leiserson, Rivest and Stein. For DFS and BFS their algorithm loops through all the vertices first and colors them white.
1) If the color attribute were part of the node/vertex you would have to traverse the graph before you traverse the graph. And if you color every vertex white and your graph is cyclical then you would have an infinite loop.
clarification: CLRS’ algorithm for DFS is
for each vertex u 'in' G.V
u.color = WHITE
u.Parent = NIL
time = 0
for each vertex u 'in' G.V
if u.color == WHITE
DFSVISIT(G.u)
If the u.color is part of the node in the graph how do you actually do that loop? Do you have information on all the nodes stored elsewhere? And if you have an adjacencylist as a linked list with over a billion nodes wouldn’t that take too long to iterate through the linked list every time you look for an adjacent vertex?
2) When I thought about storing attributes in an adjacencymatrix or adjacencylist I have a hard time contemplating what that data structure looks like. At least in C I can only make arrays so big. And if I use a linked list then I couldn’t index into it. I would have to traverse a potentially huge list to get a vertex attribute. I’m having a hard time thinking of how Facebook would store 1 billion users or Google storing all the addresses for Maps.
3) As a follow up to the Facebook and Google thoughts, assuming that information is stored in a graph, I imagine they don’t traverse the entire graph every time somebody searches for an address or person. For example if I enter my address and a destination for directions somehow I would think they find the address and get a pointer to the vertex in the graph that represents my address. From there they would do a BFS or something to calculate my address. They wouldn’t start by traversing the whole world map to find my address.
*) I apologize if this isn’t the correct way to ask my questions. Really these are just concepts I am having a hard time understanding. Hopefully I have communicated what I am confused by and can get some additional clarification. Thanks!
edit: In reading coding examples of DFS I notice people don’t even implement the graph. They implement the adjacencymatrix or list and traverse that. In a tree I may have a struct with a “left” and “right” pointer to its children. While I always assumed the graph nodes would contain links to other nodes (edges) I suppose the nodes could be more simple (just contain the data) and the edges are ONLY stored in the adjacencymatrix or list. Is that a common way to implement graphs?
3
TL;DR
A graph is the concept you need to have in mind. How it is actually implemented depends on many factors and might not look like a graph at all.
[..] For DFS and BFS their algorithm loops through all the vertices first and colors them white. [..]
You need to differentiate between the – as I’d call it – procedural description of the algorithm and an actual implementation. The description here is actually emphasizing that a node that has not been visited yet is to be considered as ‘white’.
How this is implemented depends on the situation. For a graph that you know cannot have cycles, isn’t accessed concurrently, etc. it might be a good idea to actually store the color in the graph node.
Another option is to have a buffer storing the nodes that you consider being ‘white’. When you start your search, you add only the start node to that buffer. Then you visit the “first” node of that buffer, removing it from the buffer and adding all its direct neighbors to the ‘white’ buffer.
Depending on what kind of buffer you use you can implement different search strategies: Using a stack (adding == pushing an element on top of the stack, the first == the top most element on the stack) gives you a DFS, using a queue (adding == appending to the end, the first == the first on the front) you get BFS.
Add a buffer to store already visited nodes to get cycle detection.
If you can get hold of it the book Artificial Intelligence: A Modern Approach has a really detailed and IMO good explanation of search strategies (and of course much more).
[..] Do you have information on all the nodes stored elsewhere? [..]
I think you’re too focused on this kind of node:
struct node {
struct node * neighbors;
int data1;
char * data2;
// ...
};
But your node can also be …
typedef int node_t;
.. and information about neighbors, the data the nodes contain etc is stored in tables (as simple as int **neighbors;
, or in a SQL database, or whatever’s appropriate).
[..] As a follow up to the Facebook and Google thoughts, assuming that information is stored in a graph, I imagine they don’t traverse the entire graph every time somebody searches for an address or person. [..]
I don’t know how it is actually implemented, but wouldn’t it be kind of weird to even consider (= have in memory) information about streets in Canada when you’re looking for a route from Paris to Rome?
A possible solution is to split a huge graph into multiple levels of detail. For example: A top most level, where nodes are countries (France is a neighbor of Italy, so consider only those two, or perhaps their neighbors as well to also consider routes through Switzerland), a mid level where nodes are “regions”/cities and a lowest level where nodes are actually street crossings.
[..] At least in C I can only make arrays so big. [..]
int * buffer = malloc(10 * sizeof(int)); // space for 10 int, add error handling
// ... some code ...
// need more space
void * new_buffer = realloc(buffer, 12 * sizeof(int)); // space for 12 int
if (new_buffer) {
buffer = new_buffer;
buffer[11] = 42; // perfectly fine
} else { /* error handling */ }
Per my understanding, the color is not part of the graph, but part of the node.
Consider you have a graph of nodes.
A>B
B>C
A = {color:white, name:A}
B = {color:nil, name:B}
and etc.
You traverse the graph at the node level.
There are graph storage systems that treats everything as a node
, like semantic web / RDF. But in general, when talking about graphs, the unit is nodes, and nodes will have attributes.
Consider ‘G,V’ a abstract collection or iterator. For u in G,V is just iterating over the nodes. Don’t think too deep about the implementation yet. So, you don’t have to traverse the graph to color everything.
Over a billion of nodes? Depends, if you have an array list of 1 billion nodes, with modern computer, it take less than 1 minutes to go through all of them (assume the list fits in the memory). From my experience, this is stored as simple object, often time just node ids.

Large scale graph store and search is quite challenging. People don’t use adjacency list or matrix. Though you can still fit a simple graph of 5 million nodes under 1 GB plain text. There are compressed formats, such as sparse matrix storage format. This is a wellknown problem within CS research and scientists are developing more efficient way to store it every day.

Often time it is not stored in a graph. Graph storage is not a good choice or certain data models. And though you may interface with a graphtheory based query interface, the underlying storage may still be traditional sql relational storage. My assumption is they just query an address service database with your userID and that’s it.
To your final thoughts, depends on the exact type of node data, node relationship and graph complexity. You may have complex weighted graph and storing in edges in adjacency matrix will be soon too complicated.