What are the differences between algorithms using data structures and algorithms using databases?

The General Question

Some Context

This is a question that has been bugging me for some time, and I have not been able to come up with a convincing answer for it.

Currently, I am working on strengthening my understanding of algorithms that, of course, heavily involve data structures. These are basic structures such as Bag, Queue, Stack, Priority Queue, and Heap.

I also use databases on a daily basis to store the data that has been processed and submitted by the end-user or processed by the program. I retrieve and submit the data through a DAL, which has data structures of its own that are generated based on the tables in the database.

My questions come when I have the option to sort the data using the database to send it back to me ordered in either an ascending/descending fashion or retrieve and load the data into my logic, process this data in a priority queue, and heap sort all of it. Or another one would be to search for records using the database rather than loading a subset of the records and using something like binary search to find the record or records I am interested in.

In my mind, I would try to have as many operations take place on the database-end before sending it over because communication is expensive. This also makes me wonder when do you use algorithms and data structures strictly defined within your own logic rather to process data than that of the database’s?

So here are the questions…

Questions

What are the differences between data structures and databases?
When do we use algorithms that use data structures defined solely within your own logic and not that of the database’s?
@Harvey post: When do the methods in the database become less efficient to use than methods in your own logic?
- @mirculixx post: What makes a method efficient?
@Harvey post: How is processing data with data structures faster than doing it in the database?

Clarifications

@Grant post: The databases I normally work with are relational, and these questions are coming out of working with them. However, I do think these questions are applicable to any persistence framework (when I say framework, I mean it in the most general sense).

I know answers without a specific context are difficult. Food-for-thought, advice, or discussion points are mainly what I’m looking for and would be most appreciated!

Data Structures are, for the most part:

Memory resident,
Transient,
Limited in size,
Not re-entrant without adding concurrency mechanisms like locks or immutability,
Not ACID compliant,
Fast, if chosen carefully.

Databases are, for the most part:

Disk-bound,
Persistent,
Large,
Safely concurrent,
ACID compliant, with transactional capabilities,
Slower than data structures

Data structures are meant to be passed from one place to another, and used internally within a program. When was the last time you sent data from a web page to a web server using a database, or performed a calculation on a database that was entirely resident in memory?

Database systems use data structures as part of their internal implementation. It’s a question of size and
scope; you use data structures within your program, but a database system is a program in its own right.

What are the differences between data structures and databases?

On an abstract level, there is none – a database is a data structure.

On a specific level, databases typically have the purpose of persisting data, usually in a format that is optimized for either insertions, updates, retrieval, joining or some other purpose (or a combination).

E.g. if you compare a table in an RDBMS to say an array of data, the difference may be in the run-time of the algorithm, the amount of code you have to write, the amount of memory you need to run the algorithm, or the flexibility to work/access the data from outside of your program/algorithm.

When do we use algorithms that use data structures defined solely within your own logic and not that of the database’s?

In tendency I would argue

a) to use a database if you need to persist data in way that is accessible beyond the run-time or purpose of the specific algorithm.

b) to use your own (in-memory) data structure if run-time speed matters, or persistence is not required

E.g. if your algorithm processes customer records, you may want to store those customer records (say to find all customer’s in a particular area) for later use by some other program/algorithm and for an entirely different purpose (say to find the most valuable customers). In that case using a database to persist the data is probably a good idea.

Note, however, that there is the concept of in-memory databases that do not necessarily persist data, for performance reasons. E.g. Redis or HANA.

When do the methods in the database become less efficient to use than methods in your own logic?

The answer very much depends on the circumstances and the (type of) database in use. I would rephrase the question to “what makes a method efficient?” It then becomes an exercise of assessing the methods (=algorithm) you would use for you own data structure v.s. the methods used by the database. Also see next point.

How is processing data with data structures faster than doing it in the database?

Again, this depends on the specifics. In general, the processing of data that is in-memory, directly accessible to the process that runs your algorithm, is faster than sending a request to another process (in the same computer or across a network) and asking it to send back the results. However if the data already resides within the database, sending it a command – say an SQL statement to join two tables and calculate some aggregate function – and retrieving only a small summary or subset of the data may be much more efficient than first transferring all the data and calculating the results locally (using your own data structures).

Disk access is primarily what is most expensive in this operation, more often so than network access (http://serverfault.com/questions/238417/are-networks-now-faster-than-disks). Unless your database is not located on at least a 1 Gbps network and the same network as your webapplication server, network performance will not matter as much as disk performance for larger datasets. Or if your data happens to reside on very fast solid state disks which will be faster then typical network access. Also, databases usually provide an IPC mechanism like named pipes instead of using TCP/IP if the database resides on the same server as your application server.

If you can keep most ofthe enire data structure in memory between requests then this will generally be your fastest bet. If you can’t, then it is hard to beat a good database structure with normalized tables and proper indices for search and update performance on anything other than small sets of records, especially in a system with millions of records.

Relational Databases typically use a B+ tree or a variant thereof under the hood and have many optimizations like data alignment on disk and buffer pools for frequently accessed records. This makes them excel at processing large datasets quickly, especially if aggregation or filtering is involved.

What do you mean by a database? Do you mean a relational database like MySQL, or SQL Server? A relational database is a meta-data structure that supports some subset of the operations defined by the relational model. The theory of the relational model which was mostly worked out by Edgar Codd in the 60s.

The relational model is very general purpose and flexible, but that means it can’t take any advantage of structure in the data or patterns of access. Data structures are useful when you know something about the data and how it will be accessed. For example, if you know the last data you put into a data structure will be the first data you want out you can use a stack.

I called the relational database a meta-data structure because it is generally pretty big wad of software that uses lots of data structures like stacks, queues, trees, and lists to create the abstract data structure of a relational table.

Filed under: softwareengineering - @ 01:10

Thẻ: algorithms, data-structures, database, programming-practices

Thiết kế website giá rẻ

Danh mục

What are the differences between algorithms using data structures and algorithms using databases?

The General Question

Some Context

Questions

Clarifications