Distribute jobs evenly among a set of servers


I need to distribute a set of long-running jobs across a cluster of containers in ECS. Each job essentially needs to open a socket connection to a remote server and begin streaming data for import into a database.

For each customer, there may be any number of socket connections required to consume different data feeds. Creating a new ECS service for each customer is not practical, as it would also require new ECS task definitions with slightly different configurations, ultimately resulting in thousands of services/task definitions. That would quickly become a maintenance and monitoring nightmare, so I am looking for a simpler solution.

The list of “feeds” is relatively static and is stored in a document database; feeds are only added as new customers sign up. My initial thought is to have a fixed number of containers responsible for fetching the feed configurations from the database. Each container would attempt to acquire a lease on a feed and, if it succeeds, start the feed and let it run until the container is killed or the connection is interrupted. Each container would also have to periodically check for new feeds available in the pool, and extend the lease while the feed is running so the same feed isn’t picked up by another container. There is a potential race condition where the lease expires before it is extended, so I’d have to be careful to renew well before expiry.
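
Roughly, the lease logic I have in mind looks like the sketch below (purely illustrative: the in-memory dict stands in for the document database, where acquire/extend would really each be one atomic conditional update, such as a compare-and-set on owner and expiry):

```python
import threading
import time
import uuid

# In-memory stand-in for the feed documents. In production this would be the
# document database, and acquire/extend would each be a single atomic
# conditional update (compare-and-set on owner + expiry), not a local lock.
_store_lock = threading.Lock()
_leases: dict[str, dict] = {}          # feed_id -> {"owner": ..., "expires_at": ...}

LEASE_SECONDS = 60
RENEW_EVERY = 20                       # renew well before expiry to dodge the race
CONTAINER_ID = str(uuid.uuid4())

def try_acquire(feed_id: str) -> bool:
    """Claim the feed if it is unleased or its lease has expired."""
    now = time.time()
    with _store_lock:
        lease = _leases.get(feed_id)
        if lease is None or lease["expires_at"] < now:
            _leases[feed_id] = {"owner": CONTAINER_ID, "expires_at": now + LEASE_SECONDS}
            return True
        return False

def extend(feed_id: str) -> bool:
    """Extend the lease only if this container still owns it."""
    now = time.time()
    with _store_lock:
        lease = _leases.get(feed_id)
        if lease and lease["owner"] == CONTAINER_ID and lease["expires_at"] >= now:
            lease["expires_at"] = now + LEASE_SECONDS
            return True
        return False

def run_feed(feed_id: str, stream_one_chunk) -> None:
    """Stream data while periodically renewing the lease; stop if the lease is lost."""
    last_renew = time.time()
    while True:
        stream_one_chunk(feed_id)              # read from the socket, write to the DB
        if time.time() - last_renew >= RENEW_EVERY:
            if not extend(feed_id):
                return                         # someone else owns the feed now
            last_renew = time.time()
```

The important behaviour is that a container which fails to extend its lease stops the feed instead of risking a duplicate consumer.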

This solution would work, but there are some obvious sticking points. If the containers all start at roughly the same time, there needs to be a way to limit how many jobs each container is allowed to start at once, so that one or two containers don’t grab all of the jobs. One approach would be to pull one job every couple of seconds until the pool is empty and all feeds are leased; this could lead to uneven job distribution, and the ramp-up time might be long before all feeds are pulled from the pool and leased. I could also have each container fetch a feed, start it, and then go grab another one, but some containers may start their feeds faster and fetch another job before a slower container finishes starting its feed, leading to container hot spots.
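
One way to pace the claiming is to cap how many feeds a container starts per pass over the pool and to jitter its polling interval. A rough sketch, where list_unleased_feeds, try_acquire and start_feed are placeholders for the database/lease calls above:

```python
import random
import time

MAX_CLAIMS_PER_CYCLE = 2     # cap how many feeds one container starts per pass
POLL_INTERVAL = 30           # seconds between passes over the pool

def claim_loop(list_unleased_feeds, try_acquire, start_feed):
    """Periodically scan the pool, claiming at most a few feeds per pass."""
    while True:
        claimed = 0
        for feed_id in list_unleased_feeds():
            if claimed >= MAX_CLAIMS_PER_CYCLE:
                break
            if try_acquire(feed_id):
                start_feed(feed_id)
                claimed += 1
        # Jitter the next poll so containers don't scan the pool in lockstep.
        time.sleep(POLL_INTERVAL + random.uniform(0, POLL_INTERVAL / 2))
```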

Another approach could be something like consistent hashing. If each container knows its own ID and the IDs of the other containers, it can hash each feed configuration ID and work out which container the feed belongs on. This would distribute the jobs more evenly, but I would still have to check periodically for orphaned feeds, for example if a container were killed and its feed leases expired.
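
A minimal consistent-hash ring might look like this (container and feed IDs are made up; discovering the set of live container IDs, for example via the ECS API or the database, is a separate problem):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Maps feed IDs to containers; only ~1/N of feeds move when a container
    joins or leaves (N = number of containers)."""

    def __init__(self, containers: list[str], vnodes: int = 64):
        self._ring: list[tuple[int, str]] = []
        for c in containers:
            for i in range(vnodes):              # virtual nodes smooth the spread
                self._ring.append((_hash(f"{c}#{i}"), c))
        self._ring.sort()

    def owner(self, feed_id: str) -> str:
        h = _hash(feed_id)
        keys = [k for k, _ in self._ring]
        idx = bisect.bisect(keys, h) % len(self._ring)
        return self._ring[idx][1]

# Each container runs only the feeds that hash to its own ID.
ring = HashRing(["container-a", "container-b", "container-c"])
my_feeds = [f for f in ("feed-1", "feed-2", "feed-3") if ring.owner(f) == "container-a"]
```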

I have also considered that an actor model like Akka might be exactly right for this problem, but it would add significant implementation complexity.

I am open to any and all suggestions or any other holes you can poke into my existing proposals!

Consensus

You have some N machines and they need to agree on how to split up K jobs. Preferably, if any of those machines goes down, or a new job is added, the jobs should be redistributed.

Master and Slaves

You can achieve this by having a master and N slaves. The master holds the list of K jobs and distributes them across the N slaves (a minimal sketch of the assignment follows the list below).

  • Simple to understand/implement
  • Easy to grow, just add more slaves/jobs
  • When the master dies the slaves cannot reschedule jobs.
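
As a toy illustration (names are made up), the assignment itself can be as simple as a round-robin over the slaves:

```python
from itertools import cycle

def assign(jobs: list[str], slaves: list[str]) -> dict[str, list[str]]:
    """Round-robin the job list across the available slaves."""
    plan: dict[str, list[str]] = {s: [] for s in slaves}
    for job, slave in zip(jobs, cycle(slaves)):
        plan[slave].append(job)
    return plan

# e.g. 5 feeds across 2 slaves:
# {'slave-1': ['f1', 'f3', 'f5'], 'slave-2': ['f2', 'f4']}
print(assign(["f1", "f2", "f3", "f4", "f5"], ["slave-1", "slave-2"]))
```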

Elected Master and Slaves

You can avoid having a single point of failure by electing the master. Each machine runs as a candidate, and then they all vote for their preferred candidate.

There are numerous election strategies out there; some use random numbers, others use metrics that might indicate a good master. I would start with a simple random/semi-random election strategy and swap it out only if a metrics-based version proves better for your use case.

You will need to implement tie-breakers for when a vote stalemates, to avoid oscillation (each side swapping to the other side). A good technique is a random delay between when a machine realises that there might be (or is) a stalemate and when it tries to break the stalemate by changing its vote. If the machine is itself one of the tied candidates it should wait a good deal longer, since it is essentially throwing the race.
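
A sketch of that back-off rule (the constants are arbitrary and would need tuning against your heartbeat/election timing):

```python
import random

BASE_DELAY = 0.15          # seconds before attempting to break a stalemate
SPREAD = 0.30              # random spread so machines don't act in lockstep
CANDIDATE_PENALTY = 1.0    # a tied candidate waits much longer ("throws the race")

def stalemate_backoff(i_am_a_tied_candidate: bool) -> float:
    """How long this machine waits before changing its vote to break a stalemate."""
    delay = BASE_DELAY + random.uniform(0, SPREAD)
    if i_am_a_tied_candidate:
        delay += CANDIDATE_PENALTY
    return delay
```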

While there is a master, communication only needs to flow between each slave and the master. But when the master disappears, all of the slaves must be able to talk to each other, which means they all need to know each other’s addresses. You will need to manage that address book somehow.

  • No Single point of failure, and self-healing
  • Has election cycles; no new or failed jobs can be (re)started during an election (as there is no master)
  • The elected slave has to shift its jobs elsewhere so that it has resources for coordination.
  • Any machine can be easily added/removed
  • Adding jobs requires finding the master
  • Address book of all machines has to be maintained.
  • Jobs may be duplicated if the network is split into two.

You can avoid the split-brain issue by requiring a minimum of at least (N+1)/2 votes to elect a master. That way only the partition with the majority of machines can elect, while the other partition is stuck in endless election cycles; you will also need to ensure that jobs are stopped on the slaves in that minority partition.
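
The quorum check itself is tiny (function names are mine); the only subtlety is that even-sized clusters still need a strict majority:

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of votes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1

def can_elect(votes_received: int, cluster_size: int) -> bool:
    # Only the partition that can still reach a majority may elect a master;
    # a minority partition stays in election and must stop its jobs.
    return votes_received >= quorum(cluster_size)

assert quorum(5) == 3 and quorum(4) == 3
```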

Distributed

Taking the election idea down to a finer grain: each machine makes motions to change the system state, the other machines vote to accept the change (after confirming that it is acceptable to them), and if enough vote to accept it the motion is ratified and the system enters the new state.

The key to this system of votes and motions is that each machine casts its next vote based on what it has already voted for. This forces it to vote against motions that would otherwise have been acceptable but no longer make sense given the votes it has already cast. Eventually one of the competing motions will be ratified.
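
As a toy illustration of that rule (all names invented): each machine remembers which assignment it has already accepted for a job and votes down any competing motion for the same job.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Motion:
    job_id: str      # the job the motion wants to (re)assign
    target: str      # the machine that would run it

class Voter:
    def __init__(self) -> None:
        self.accepted: dict[str, str] = {}   # job_id -> machine already voted for

    def vote(self, motion: Motion) -> bool:
        previously = self.accepted.get(motion.job_id)
        if previously is not None and previously != motion.target:
            # A competing motion for this job was already accepted locally;
            # accepting this one too could ratify two conflicting assignments.
            return False
        self.accepted[motion.job_id] = motion.target
        return True

# Two competing motions for the same job: whichever reaches this voter first wins its vote.
v = Voter()
print(v.vote(Motion("feed-42", "machine-a")))   # True
print(v.vote(Motion("feed-42", "machine-b")))   # False
```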

As with electing a master, every machine will need to be communicating with all of the other machines. At the very least that communication is a heartbeat: “I’m still active and running these jobs.”

Each machine is responsible for monitoring the network: if it doesn’t hear from machine X for a while, it must make a motion declaring that machine disconnected. If the motion passes, the jobs on that machine can be rescheduled elsewhere. The motion might also fail, because other machines in the network can still talk to X.
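
A sketch of the monitoring side (the timeout value and names are arbitrary): track the last heartbeat from each peer and raise a motion for any peer that has gone quiet.

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before suspecting a machine

class FailureDetector:
    """Tracks the last heartbeat from each peer and reports suspects."""

    def __init__(self) -> None:
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, machine_id: str) -> None:
        self._last_seen[machine_id] = time.monotonic()

    def suspects(self) -> list[str]:
        now = time.monotonic()
        return [m for m, t in self._last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

# A machine would raise a "machine X is disconnected" motion for each suspect;
# the motion only passes if enough other peers have also stopped hearing from X.
```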

  • No single point of failure, and self-healing
  • Every decision is effectively decided 1 + N/2 times (once on each machine in the majority)
  • All machines must be able to communicate with each other (directly/indirectly)
  • All machines must actively maintain a perspective of the network state

Thoughts

I would start by implementing a Master/Slave system, largely because it is simple, and because it lays the groundwork.

I would slowly build that master/slave system up to an Elected Master/Slave system. Organising a way to distribute contact details, providing a means to determine whether a machine is offline, and implementing even a basic vote are not trivial problems.

Only after getting to this point would I consider a distributed state change system. Hopefully you would already have a basic model of the network and system availability from electing masters. You could then extend this to support extra motions for introducing nodes, removing nodes, adding/removing jobs, jobs being run, a node being overloaded, etc.

You may want to take a look at Raft. It is a consensus algorithm with a number of existing implementations, and it (or a similar design) would help simplify the distributed address book and the election/voting problems.
