I would like to write a program that essentially handles text data and metadata of files locally on a machine’s filesystem. There is no need for any network activity. I am working with large quantities of sound files in complex directory structures.

Information that the program should be able to quickly access includes:

  • File names (minimum viable product)
  • Embedded metadata from these files (further in the future)

An example usage of this program would be that a user loads the program, provides a root directory, and then that directory and all subdirectories are searched for any audio files such as wav, mp3, flac, aiff, etc…

Say the program finds 25,000 such files. The program needs to track information about these files, which can be obtained by parsing the file names and, in a later version as mentioned above, metadata from MP3s such as “artist”, “album”, etc…

I already know from experience that if I don’t store this data, the program would have to re-scan the entire directory structure every time it is opened, which can be time-consuming.

I also know that some lighter re-scan would be needed, because if the user moves or renames files, whatever data store I am using would go stale. I’ve seen that a common solution for this particular issue is to have an interface that lazily checks that files exist as the user interacts with the program, and displays a “file no longer there” error for any discrepancies.

My question is: what are some architectures and designs I should consider for such an application? A few ideas that popped into my head:

  1. On initial load, the program does the deep scan, and stores the full file paths in a “flat text file”, one path per line, and then on subsequent runs, this file is queried for the program to reason about (e.g. from there, it can parse different pieces of the file name to work with)
  2. Rather than a flat text file, something such as a SQLite database could be used and queried

There is a program called “Everything” for Windows which comes to mind because it does a particularly good job at finding and pulling up files FAST, though I do not know how this program is architected or designed beyond what is explained here.

The main concern here is performance – the program interface needs to be snappy and not bogged down by continuous file IO, because the entire purpose of it is to work with and manage very large sets of files. Not sure if it matters for this particular question, but it will be a cross-platform program capable of running on Windows and macOS.

You want it fast.
And you want it portable.

The principal thing working against you is random I/O.
The more you can read “small” records sequentially, the better.
There’s a fixed cost to sequentially read each disk block,
so the smaller each record is, the more records will fit per block.

I will assume we can {rename, remove, ignore} filenames that
contain NEWLINE or other crazy characters — if that’s
a poor assumption feel free to use find . -print0 instead.
Also, when I execute find you can assume there’s an -xdev
at the end so we don’t wander off into other filesystems.

handles text data …

I initially interpreted that as a filesystem full of text files,
but later you mentioned the files contain MP3 sound data.
So I’ll interpret “text data” as “text metadata” gleaned
from sound file tags.


MVP

Here’s the simplest thing you could possibly do.
It is reasonably portable.

$ find .  > catalog.txt

Now you have filenames, and you can scan them very rapidly,
for example with grep.
May as well run that through xz compression while you’re at it,
and maybe order the entries for later diffing.

You’re probably better off additionally recording FS metadata:

$ find . -ls | sort -k11 | xz  > catalog.txt.xz

Now you have file lengths and timestamps,
which sets you up very nicely for noticing when a file changes.
You also have inode numbers, which can help with that,
but the behavior is not super portable and you may find
it more trouble to deal with than it’s worth.
Additionally you can look for a leading “d” in the mode field
to pick out directories, for example to scan for folders whose
timestamp recently changed and therefore need to be revisited.
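If you later outgrow shelling out to find, the same kind of catalog
can be produced portably in Python. A minimal sketch; the
tab-separated size/mtime/path layout is my own choice here,
not the -ls format:

```python
import os

def scan(root):
    """Walk root, yielding (path, size, mtime) for every regular file."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished mid-scan; skip it
            yield path, st.st_size, int(st.st_mtime)

def write_catalog(root, out_path):
    # Sort by path so later runs can be compared line-by-line
    # (diff, or a two-way merge).
    rows = sorted(scan(root))
    with open(out_path, "w", encoding="utf-8") as f:
        for path, size, mtime in rows:
            f.write(f"{size}\t{mtime}\t{path}\n")
```

Same caveat as above: paths containing NEWLINE will break the
one-record-per-line format.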


artists, titles

Armed with such a text file, you are in a good position
to quickly import to a DB or update a DB.
As you examine each folder pathname and song filename,
you may already be able to pick out artist and title.
If not, fork a command or use a tag-reading library
to pick them out from the binary header of the audio file.
Record such details in an RDBMS, perhaps SQLite,
so you don’t have to (slowly) repeat that process on
subsequent scans.
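A minimal sketch of that import step, assuming SQLite and assuming
an “Artist - Title.ext” filename convention for the cheap parse
(both are assumptions; a real tag reader would replace parse_name
for files that don’t match):

```python
import os
import re
import sqlite3

def parse_name(path):
    """Guess (artist, title) from an 'Artist - Title.ext' filename.

    The naming convention is an assumption; fall back to reading
    the file's embedded tags when it doesn't match.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    m = re.match(r"(.+?)\s+-\s+(.+)", stem)
    return (m.group(1), m.group(2)) if m else (None, stem)

def import_catalog(paths, db_path):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS songs (
                     path TEXT PRIMARY KEY, artist TEXT, title TEXT)""")
    for path in paths:
        artist, title = parse_name(path)
        # INSERT OR IGNORE: already-cataloged paths are skipped,
        # so the slow parsing/tag-reading happens once per file.
        con.execute("INSERT OR IGNORE INTO songs VALUES (?, ?, ?)",
                    (path, artist, title))
    con.commit()
    return con
```

The PRIMARY KEY on path is what makes re-imports cheap:
re-feeding the whole catalog is idempotent.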

You might also query CDDB or a similar online API
for relevant details, recording them in a table.


updates

Time goes on, and the filesystem changes.
You want to learn about additions and deletions.

Simply re-run that find command,
and use diff -u to see how the pair of text files differs.
Or have a program read through both of them in a
two-way merge, noting any mismatches between them.
Notice that storing lines in sorted order is important,
and also that a SQL ORDER BY can easily match that sort order.
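That two-way merge is only a few lines in any language.
A sketch in Python, assuming both catalogs hold one path
per line and are already sorted:

```python
def diff_catalogs(old_paths, new_paths):
    """Two-way merge of two sorted path lists.

    Returns (added, removed). Both inputs must be sorted the same
    way -- matching the `sort` in the find pipeline, or a SQL
    ORDER BY.
    """
    added, removed = [], []
    i = j = 0
    while i < len(old_paths) and j < len(new_paths):
        if old_paths[i] == new_paths[j]:
            i += 1
            j += 1
        elif old_paths[i] < new_paths[j]:
            removed.append(old_paths[i])  # in old only: deleted
            i += 1
        else:
            added.append(new_paths[j])    # in new only: created
            j += 1
    removed.extend(old_paths[i:])
    added.extend(new_paths[j:])
    return added, removed
```

A rename shows up as one removal plus one addition; matching
those pairs up (by size, or by content hash) is an optional
refinement.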

If your program is able to run continuously (for hours)
while the user adds and deletes media files, there is
another very efficient technique available to you.
Just use
“$ fswatch my/music/”
to notice when file updates happen,
even deeply nested ones.
To consume such event notifications, either
listen to its stdout pipe or use its library API (libfswatch).
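A sketch of the stdout-pipe approach in Python. It assumes
fswatch is installed and on PATH, and relies on its default
output of one affected path per line:

```python
import subprocess

def consume(stream, handle):
    # fswatch writes one affected path per line to stdout;
    # feed each one to the caller's handler.
    for line in stream:
        path = line.rstrip("\n")
        if path:
            handle(path)

def watch(root, handle):
    """Spawn fswatch (-r recurses into subdirectories) and
    dispatch each change event to handle(path)."""
    proc = subprocess.Popen(["fswatch", "-r", root],
                            stdout=subprocess.PIPE, text=True)
    consume(proc.stdout, handle)
```

The handler would typically re-stat the reported path and
update the catalog row, rather than re-scanning everything.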