Handle sudden spike in CPU usage on AWS EC2


We have a setup where we’re running 2 EC2 instances (of an instance type with a lot of CPU, c7i or m5zn), each with a single ECS task running. This task runs an optimisation algorithm for vehicles and jobs, which means that it will periodically and suddenly need to process single large requests, resulting in very high CPU usage.

Due to the nature of the program running it will use all the CPU that you throw at it, jumping to 100% CPU for as long as it needs to compute the solution. If we give it more CPU, it will still go to 100% – it will just complete the task faster.

The problem occurs when we are also using this program to run lots, and lots, of much smaller requests. We are currently processing ~50 requests/second of these smaller ones. When we run the large requests, EC2 CPU hits 100% and we then get a cascade of timeouts on the systems that are attempting to process lots of smaller requests. Even if we have multiple EC2 instances, each with multiple ECS tasks on it – some requests will inevitably get routed to the ECS/EC2 task that is sat at 100% CPU, meaning the response times take a huge hit. The response times will go from ~50ms to multiple seconds.

We can’t afford to just increase timeouts on the affected systems, as their response times need to be maintained.

I haven’t found a good solution to this. Does anyone know of a clever way to limit the CPU usage of these big requests, so that other requests can still be served while the larger requests take longer to process (which is acceptable)? Or possibly a way to force AWS to route traffic away from the ECS/EC2 instance(s) that are currently at 100%? This would have to react very quickly to the sudden spike.

Note that handling these very large requests is within our control, as we determine when to trigger them. They are currently scheduled to run during the night, so they hit the system as a sudden surge in CPU demand.

Task placement constraints and task groups can be used to isolate workloads based on their resource needs, which ensures that CPU-intensive tasks don’t impact the smaller ones.

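In the ASG for your ECS instances (I suppose you are using one with a mixed group of instances), you label the instances by workload type, let’s say Type=tinyTask for some and Type=bigTask for others. Note that the constraint expressions below match ECS container-instance attributes, so the label needs to be set as a custom attribute on the container instance rather than only as an EC2 tag.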
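One way to apply the Type label as an ECS custom attribute is the PutAttributes API. Below is a minimal sketch using boto3, not a definitive implementation. The helper names `build_type_attribute` and `label_instance` are made up for illustration, and the cluster name and container-instance ARN are values you would supply yourself.

```python
def build_type_attribute(container_instance_arn: str, workload_type: str) -> dict:
    """Build one entry for the `attributes` list of the ECS PutAttributes call."""
    # Only the two labels used in this answer are accepted (an assumption of this sketch).
    if workload_type not in ("tinyTask", "bigTask"):
        raise ValueError(f"unexpected workload type: {workload_type}")
    return {
        "name": "Type",
        "value": workload_type,
        "targetType": "container-instance",
        "targetId": container_instance_arn,
    }


def label_instance(cluster: str, container_instance_arn: str, workload_type: str) -> None:
    """Attach the custom attribute to an already registered container instance."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    ecs = boto3.client("ecs")
    ecs.put_attributes(
        cluster=cluster,
        attributes=[build_type_attribute(container_instance_arn, workload_type)],
    )
```

In an ASG you would more likely set the attribute at boot time through the ECS agent configuration, so that every new instance registers with the right label. The call above is mostly useful for labelling instances that are already running.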

In your ECS task definitions you can specify placementConstraints. For the large task, use a constraint like this:

  "placementConstraints": [
    {
      "type": "memberOf",
      "expression": "attribute:Type == bigTask"
    }
  ],

And for the smaller one:

  "placementConstraints": [
    {
      "type": "memberOf",
      "expression": "attribute:Type == tinyTask"
    }
  ],

Note that placement constraints are binding and can prevent task placement. From the docs:

Task placement strategies are a best effort. Amazon ECS still attempts
to place tasks even when the most optimal placement option is
unavailable. However, task placement constraints are binding, and they
can prevent task placement.
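To make "binding" concrete, here is a toy model of what a memberOf constraint on the Type attribute does. It is not how the ECS scheduler is actually implemented, and `eligible_instances`, `place_task` and the instance IDs are invented for this illustration.

```python
from __future__ import annotations


def eligible_instances(instances: list[dict], required_type: str) -> list[str]:
    """Return the IDs of instances whose Type attribute equals required_type."""
    return [
        inst["id"]
        for inst in instances
        if inst.get("attributes", {}).get("Type") == required_type
    ]


def place_task(instances: list[dict], required_type: str) -> str:
    """Pick the first eligible instance, or fail if the constraint matches none."""
    candidates = eligible_instances(instances, required_type)
    if not candidates:
        # A binding constraint: there is no fallback to a non-matching instance.
        raise RuntimeError(f"no instance with Type == {required_type}; task stays unplaced")
    return candidates[0]


# Hypothetical fleet that only contains instances labelled for small requests.
fleet = [
    {"id": "i-small-1", "attributes": {"Type": "tinyTask"}},
    {"id": "i-small-2", "attributes": {"Type": "tinyTask"}},
]
```

With this fleet, `place_task(fleet, "tinyTask")` picks one of the small instances, while `place_task(fleet, "bigTask")` raises. So make sure at least one bigTask instance is up before the nightly run is triggered.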

One question remains: why not use ECS Fargate in this case?
