We are building a REST API which creates items in a CDN. Each item takes between 5 and 7 seconds to create, and we need to create them as quickly as possible until we have about a million of them.
One school of thought in our office is that we should make lots of small transactions. Another is that we should have fewer, larger transactions to minimise network chatter, perhaps running as described here.
Small transactions would be simpler to implement. They would involve a bit more network traffic (perhaps insignificant given the time per item). They could also be advantageous in the event of a failure to create an item.
Performance is our top priority, development cost a close second. This post suggests that either is OK in principle. How do I choose between them?
Our CDN will serve up images of properties in New Zealand. We will pre-populate the CDN with six images of each property we think is likely to be requested. The different images are used by various pages in our website.
The image generation process looks like this:
for each property {
    call remote 3rd-party API to find property details;
    for each type of image required {
        call remote 3rd-party API to get image;
    }
}
The remote API calls each take between 1 and 2 seconds. The remote API is on the opposite side of the planet to our users and our servers. The stuff that happens locally is in the millisecond range. I’m using C# async
where possible.
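That loop can be sketched with the image fetches running concurrently, since the six images per property are independent. A minimal sketch in Python asyncio for illustration (the question uses C#, whose async/await is analogous); `fetch_details`, `fetch_image`, and the `IMAGE_TYPES` names are placeholders, not the real API:

```python
import asyncio

# Hypothetical image types -- the question only says there are six per property.
IMAGE_TYPES = ["thumbnail", "search", "detail", "map", "print", "mobile"]

async def fetch_details(prop_id):
    # Stand-in for the remote 3rd-party details call (1-2 s in practice).
    await asyncio.sleep(0)  # replace with a real HTTP call
    return {"id": prop_id}

async def fetch_image(details, image_type):
    # Stand-in for the remote 3rd-party image call (1-2 s in practice).
    await asyncio.sleep(0)
    return f"{details['id']}/{image_type}.jpg"

async def generate_property(prop_id):
    details = await fetch_details(prop_id)
    # The six image fetches are independent, so run them concurrently
    # rather than one after another.
    return await asyncio.gather(*(fetch_image(details, t) for t in IMAGE_TYPES))

async def main(prop_ids):
    # Cap overall concurrency so the remote supplier isn't flooded.
    sem = asyncio.Semaphore(20)

    async def limited(pid):
        async with sem:
            return await generate_property(pid)

    return await asyncio.gather(*(limited(p) for p in prop_ids))
```

With real 1-2 s calls, fetching the six images concurrently cuts each property from roughly 7-12 s of sequential calls down to about two round trips.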
If somebody requests an image that isn’t already cached in the CDN, it generally comes back in two or three seconds, which is acceptable. We load it with AJAX too, so the user isn’t kept waiting.
5
Based on your update, what you could do is the following:
Work with queues
Initial population:

    for each property {
        insert job in queue 1 to handle property
    }

queue 1:

    call remote 3rd-party API to find property details;
    for each type of image required {
        insert job in queue 2 to get image
    }

queue 2:

    call remote 3rd-party API to get image;
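A local sketch of that two-queue pipeline, with Python's standard `queue.Queue` and threads standing in for real queue software; the details and image fetches are placeholders for the remote calls, and the image types are assumed names:

```python
import queue
import threading

IMAGE_TYPES = ["thumbnail", "search", "detail", "map", "print", "mobile"]  # hypothetical

property_queue = queue.Queue()   # queue 1: one job per property
image_queue = queue.Queue()      # queue 2: one job per image
results = []
results_lock = threading.Lock()

def property_worker():
    # queue 1: look up details, then fan out one image job per type.
    while True:
        prop_id = property_queue.get()
        if prop_id is None:      # poison pill: shut down
            break
        details = {"id": prop_id}  # stand-in for the remote details call
        for image_type in IMAGE_TYPES:
            image_queue.put((details, image_type))
        property_queue.task_done()

def image_worker():
    # queue 2: fetch a single image.
    while True:
        job = image_queue.get()
        if job is None:
            break
        details, image_type = job
        with results_lock:
            results.append(f"{details['id']}/{image_type}.jpg")  # stand-in fetch
        image_queue.task_done()

def run(prop_ids, n_prop_workers=2, n_img_workers=4):
    prop_threads = [threading.Thread(target=property_worker) for _ in range(n_prop_workers)]
    img_threads = [threading.Thread(target=image_worker) for _ in range(n_img_workers)]
    for t in prop_threads + img_threads:
        t.start()
    for pid in prop_ids:
        property_queue.put(pid)
    property_queue.join()        # all property jobs fanned out
    image_queue.join()           # all image jobs finished
    for _ in prop_threads:
        property_queue.put(None)
    for _ in img_threads:
        image_queue.put(None)
    for t in prop_threads + img_threads:
        t.join()
    return results
```

A real queue service replaces the in-process queues with durable ones, so workers can run on many machines and a crashed worker's job is simply re-delivered.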
Implementation options
You should be able to do this quite simply with some standard queue software. It will run in parallel, handle errors, etcetera. That will cut your processing time considerably and keep your code quite simple.
An example is https://aws.amazon.com/sqs/ but there are many more. One approach is described here: https://startupnextdoor.com/adding-to-sqs-queue-using-aws-lambda-and-a-serverless-api-endpoint/ where you would use Lambda, which is quite nice for this kind of job because it scales well.
Another nice thing about an existing queue package is that it helps you with retries on failure, which reduces the amount of logic you have to write.
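The retry behaviour a queue package gives you (a failed job reappears and is retried until it eventually lands in a dead-letter queue) can be approximated with a small wrapper if you roll your own; a sketch with exponential backoff, where the attempt count and delays are arbitrary choices:

```python
import time

def with_retries(fn, *args, attempts=3, base_delay=1.0):
    """Call fn(*args), retrying on any exception with exponential backoff.

    This is a local approximation of what queue services do for you:
    a failed job goes back on the queue and is retried a bounded
    number of times before being given up on.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Transient failures (a dropped connection, a 503 from the supplier) then heal themselves without any bookkeeping in the calling code.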
Performance
You can manage the number of jobs in the queue and scale the workers to find an optimal fit. You also don't want to overrun the supplier of the data with too many requests; talk with them before you start hitting them with a high volume.
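If the supplier does impose a request cap, a simple shared throttle in the workers keeps you under it. A sketch (the limit values are assumed examples, not anything the supplier has actually quoted):

```python
import threading
import time

class Throttle:
    """Allow at most max_calls per period (seconds), shared across workers."""

    def __init__(self, max_calls, period=1.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = []          # timestamps of recent calls
        self.lock = threading.Lock()

    def acquire(self):
        # Block until a call slot is free, then claim it.
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps older than one period.
                self.calls = [t for t in self.calls if now - t < self.period]
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                wait = self.period - (now - self.calls[0])
            time.sleep(max(wait, 0.001))
```

Each worker calls `throttle.acquire()` before hitting the remote API, so total request rate stays bounded no matter how many workers you scale up to.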
Performance here is practically unlimited, AWS will scale incredible likely stronger than your supplier if their requests take so long.
Alternatives
Clearly you can do this with any kind of queue software. You just need to make sure you feed in the properties list only once. If you later want a property updated, feed it in again and the same code will process it again and fetch the new images.
Delay
With the setup above, your only further improvement would be to find out why the requests take so damn long to handle. But I suppose the above will make that much less relevant.
3
A good test is worth a thousand expert opinions.
Each item takes 7 seconds to create when doing it which way?
Different ways of doing it slow you down differently. Doing batches may speed things up but until you try you don’t really know.
Test doing 10. Then 100. Then 1000. Time it and find out if this is even linear.
Now you should have enough data to make the call.
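The measurement this answer suggests is easy to script. A sketch of a timing harness, where `create_item` is a placeholder for the real 5-7 second creation call:

```python
import time

def create_item(i):
    # Placeholder for the real item-creation call (5-7 s in practice).
    return i

def time_batch(n, create=create_item):
    # Wall-clock time to create n items with the given strategy.
    start = time.perf_counter()
    for i in range(n):
        create(i)
    return time.perf_counter() - start

def measure(sizes=(10, 100, 1000), create=create_item):
    # Seconds per item at each batch size. If this stays flat the cost
    # is linear and you can extrapolate to a million items; if it
    # grows, batching behaves worse at scale than the small runs show.
    return {n: time_batch(n, create) / n for n in sizes}
```

Run it once per strategy (small transactions vs. large batches) and compare the per-item numbers rather than arguing from first principles.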
If that’s still too painful, find out why it’s taking 7 seconds, and what’s taking the most time. There is almost always a way to make it faster. Focus on whatever takes the bulk of the time. Don’t guess at this.