Where to store many files?

I have 500.000 files with the combined size of 350GB. So an average file has 0.7MB size. Each file has metadata: 1 to 100 keywords and optionally a short description. I need to filter the files and find the keywords of the search expression in the keywords of the files and in the descriptions. Note that some of the files are text files, so by those, I need to find the keywords in the file as well, which means full-text search.

  1. Should I store the metadata and the files in the same database, or should I store the metadata and the text files in the same database and the binary files somewhere else?
  2. What type of database can store so many files?

Note that the databases are RAID protected, but I can have a not RAID protected filesystem cache, for the case they would make streaming slow.

I am concerned only about the performance of search and file access, not consistency, convenience, security or resource utilization. I can use even the file system if that makes things faster.


Where to store the files ?

The question of whether or not to store the files in the database has to be considered under several angles:

  • Consistency: storing metadata and the files (as BLOB) together in the database ensures that what belongs together remain together. No fear of inconsistency if insertion is interupted, no separate storage location to manage with absolute or relative url in the database.

  • Convenience: you may move/backup/replicate/monitor your database if you need just using the database tools. With separate files, you have to organize all theser operations. It is not necessarily difficult, but you have to take care of it.

  • Security: most DBMS offer you some authorization mechanismsfor user access, and even encryption if needed. So having the file in the DMBMS ensures that nobody tampers with the files, and only those having the necessary DB priviledges can access it. With separate files outside the database, its much more difficult to organize this (unless you’re on a server and the clients can’t access directly to the folders).

  • Performance: this is something that you have to check carefully with the DBMS that you’ll choose: the API for accessing BLOBs might require some overhead to transfer from/to the database in smaller chunks. So you need to be careful to request this object only if necessary. Here with files in the file system, it’s faster to access the raw data when it is needed. However with so many files, you might have to distribute them across several folders, in order not to to suffer from the search performance of every filename in a huge directory.

  • Resources: If you would consider using some in-memory database for accelerating your “semantic” work on metadata, then it would be very costly to store as well all the plain data in memory. There, separate files could really be of advantage.

Not knowing what exactly your application is doing, it wouldn’t be wise to advise you firmly on one way or the other.

Real life examples

  • At my company we use a huge ERP. The records of the financial transactions in the database refer to scanned financial documents that are stored outside the DB on a distinct content server. The content server is a kind of web server, that locally stores the image files (JPG, PDF, …) in its local file system. The security of access is organized via a complex URL validation scheme.
  • Another system stores scanned documents for an activity not covered by the ERP. The images are stored directly in the database.

So in practice, both approaches will work. The first is based on standard software products. The second was developed in-house. From the point of view of performance, they both are very similar because images are accessed from the client (i.e. the potential overhead in BLOB management on DB side, are compensated by the overhead of an additional transfer with the additional webserver).

Relational or not ?

If you go relational, you may want to manage:

  • the file-records (e.g.identification, some unique metadata, and the BLOB).
  • the keywords (metadata + filtered list of plain text words)
  • the association of keywords to file-records (many to many).

There is no doubt that performance and the flexibility will be there, because searching for keys, merging several searches, etc.. is the core business of an RDBMS. But you’ll have to work out how to best structure the metadata.

You could also opt for a NoSQL database. They are more flexible on the data structure. Intuitively I’d suggest to start to have a look at the document databases. If however you prefer to keep the files out of the database, you could be more interested in a key-value store, or even a large column store if you’d manage different kind of keywords for different kind of metadata.


I am concerned only about the performance of search and file access, not consistency, convenience, security or resource utilization.

The performance of reading and writing files of that size (average 0.7Mb), a file system is probably faster. (For smaller files, e.g. 0.7Kb, a database would probably be faster. Typical file systems are bad at handling lots of tiny files.)

Searching is a different issue. The way to get fast search over a large amount of data is to construct indexes.

  • For structured data (e.g. the file metadata) you typically create database tables to hold the searchable data and then add database indexes to make the common (simple) searches go faster. And avoid searches that require the use of LIKE, regexes and so on because that requires linear scans.

  • For non-structured data (especially text), the standard solution is to use a free-text search engine. This works by constructing a reverse index for the entire corpus. This gives you orders of magnitude better performance than doing a linear scan of the files on each search. It can efficiently do searches with multiple terms or phrase search.

It sounds to me like the free-text search engine approach is what you need. A good free-text search engine will also provide a way to hold and search the file metadata.


Trả lời

Email của bạn sẽ không được hiển thị công khai. Các trường bắt buộc được đánh dấu *