Technique to process a folder of files from several machines

There is a single network-shared folder of files that is to be processed by threads running on several servers. Once a thread starts processing a certain file, I don’t want any other thread to process it. What are some techniques to ensure that only one thread processes a certain file? Keep in mind that the threads are running on different servers.

2

I can think of two options right now:

  1. As already mentioned in the comments, create lock files indicating that a file is already being processed. If you don’t want to synchronize through the file system, a table in a database might be an option. The only downside:

    • If a thread crashes before it can delete the lock file it created, the related file stays marked as in process forever. That may or may not be an issue, depending on your requirements.
  2. Partition the files so that the threads never try to access the same ones. E.g. one thread is responsible for files starting with [a-m], another for files starting with [n-z], and a third handles all files starting with [0-9].
    Downsides here:

    • The load might be distributed inefficiently: one thread may be busy all the time while the other two sit idle because there are no files for them to handle. So this is only an option if you can estimate the distribution of the file names.
    • Plus, if one of the threads is down, its files are not processed, since the others are not responsible for them.
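
To make the two options above more concrete, here is a minimal Python sketch. It rests on several assumptions: the `.lock` suffix, the function names and the worker ids are made up for illustration, and it relies on exclusive file creation (`O_CREAT | O_EXCL`) being atomic on the shared filesystem, which you should verify for your network share. For option 2 it swaps the first-character ranges for a stable hash of the file name, which tends to spread the load more evenly but is not what the text above describes literally. Stale locks left behind by crashed workers are not handled here.

```python
import os
import zlib


def try_claim(path: str, worker_id: str) -> bool:
    """Option 1: claim a file by creating '<path>.lock' exclusively.

    Returns True only for the caller that created the lock file.
    """
    lock_path = path + ".lock"  # hypothetical naming convention
    try:
        # O_EXCL makes the call fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # somebody else holds the lock
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(worker_id)  # record the owner for later inspection
    return True


def release(path: str) -> None:
    """Remove the lock file once processing has finished."""
    os.remove(path + ".lock")


def owner_index(filename: str, worker_count: int) -> int:
    """Option 2: map a file name to exactly one worker slot.

    crc32 is stable across machines and runs, unlike Python's built-in hash().
    """
    return zlib.crc32(filename.encode("utf-8")) % worker_count
```

A worker with slot `i` out of `n` would then only pick up files where `owner_index(name, n) == i`, or, with option 1, only files for which `try_claim` returned `True`.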

[EDIT]

Create a monitoring instance (e.g. a web service). This instance has just two methods:

  • GetNext(requestId : int) : string
  • Finish(requestId : int, path : string) : bool.

The first method returns the path of the file that the requester should process. Each path is handed out only once, unless its processing times out. The requester’s id is stored together with the provided path and a timestamp somewhere. This is needed in case the processing of a file times out or the monitoring instance goes down for some reason.
The second method is how the requester reports completion, so that the monitoring instance knows the file has been processed.
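
A rough in-memory sketch of such a monitoring instance in Python might look like the following. The class name `Coordinator`, the `timeout_s` parameter and the injectable `clock` are assumptions for illustration; the methods are spelled `get_next` and `finish` to follow Python conventions. A real service would sit behind a network API and keep its leases in durable storage, so that they survive a restart.

```python
import threading
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class Coordinator:
    """Hands out each path to one requester at a time, with a lease timeout."""

    def __init__(self, paths: Iterable[str], timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pending: List[str] = list(paths)
        # path -> (requester id, time the path was handed out)
        self._leases: Dict[str, Tuple[int, float]] = {}
        self._timeout_s = timeout_s
        self._clock = clock
        self._lock = threading.Lock()

    def get_next(self, requester_id: int) -> Optional[str]:
        """Return a path to process, or None if nothing is available."""
        with self._lock:
            now = self._clock()
            # Put paths whose lease has timed out back into the queue.
            for path, (_, started) in list(self._leases.items()):
                if now - started > self._timeout_s:
                    del self._leases[path]
                    self._pending.append(path)
            if not self._pending:
                return None
            path = self._pending.pop(0)
            self._leases[path] = (requester_id, now)
            return path

    def finish(self, requester_id: int, path: str) -> bool:
        """Mark a path as processed; only the current lease holder may do so."""
        with self._lock:
            lease = self._leases.get(path)
            if lease is None or lease[0] != requester_id:
                return False
            del self._leases[path]
            return True
```

Rejecting `finish` from anyone but the current lease holder is a design choice of this sketch: it keeps a worker that timed out from marking a file as done after the file has been handed to someone else.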

The downside here is definitely the configuration and accessibility issues you might run into. If the central monitoring instance is down, the consumer threads have to wait until it’s available again, since a file needs to be reported as finished before a thread can proceed with the next one.

2

Some filesystems guarantee that certain operations are atomic.
One of the more common guarantees is that a rename/move either happens completely or doesn’t happen at all from the caller’s point of view.

If your filesystem has atomic rename, you can attempt to rename a file to a worker-specific name and process it if the rename succeeded. If it fails, either something went wrong or another worker beat you to it.

When you’re done with your work, you can rename it for further tasks to consume.
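
As an illustration, the claim-by-rename idea could be sketched in Python roughly as follows. The `.processing` and `.done` suffixes, the function names and the worker ids are made up for this example, and the sketch assumes that a rename within the same directory is atomic on your filesystem.

```python
import os
from typing import Optional


def claim(path: str, worker_id: str) -> Optional[str]:
    """Try to claim a file by renaming it to a worker-specific name.

    Returns the new path on success, or None if the file is already gone,
    which usually means another worker renamed it first.
    """
    claimed = f"{path}.{worker_id}.processing"  # hypothetical naming scheme
    try:
        os.rename(path, claimed)
    except FileNotFoundError:
        return None
    return claimed


def mark_done(claimed_path: str, original_path: str) -> str:
    """Rename the processed file so that follow-up tasks can pick it up."""
    done = original_path + ".done"  # hypothetical suffix
    os.rename(claimed_path, done)
    return done
```

Because the worker id is part of the claimed name, files left behind by a crashed worker can be identified and renamed back by hand or by a cleanup job.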

Note that copies are usually not atomic, nor is creating + writing to ad-hoc “lock” files. Moving between directories depends greatly on the filesystem in use.

We use the rename+process technique for large-scale import/export of files to tape storage with multiple workers. It’s very resilient against crashing and is essentially immune to corruption.
