At my company, mission-critical background jobs run on a separate cron server, which is a single EC2 instance, so the whole platform is vulnerable to anything going wrong on that instance.
We want to change the design, so I need to modify the system to make it more reliable. Here are the issues and needs we have:
- there is no way to see whether a cron run has stalled
- we need to set up a cron job status tracking system
- jobs cannot be run concurrently
- deploys cannot be canceled or reverted
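For the status tracking and stall detection, here is a minimal sketch of what I have in mind. It assumes each job reports a start, periodic heartbeats, and a finish. The `JobTracker` class, the in-memory store, and the job names are all hypothetical; a real version would persist runs somewhere durable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class JobRun:
    job_name: str
    started_at: datetime
    last_heartbeat: datetime
    finished_at: Optional[datetime] = None
    succeeded: Optional[bool] = None


class JobTracker:
    """In-memory stand-in for a real status store (a database table, for example)."""

    def __init__(self, stall_after: timedelta) -> None:
        self.stall_after = stall_after
        self.runs: Dict[str, JobRun] = {}

    def start(self, job_name: str, now: datetime) -> None:
        self.runs[job_name] = JobRun(job_name, started_at=now, last_heartbeat=now)

    def heartbeat(self, job_name: str, now: datetime) -> None:
        self.runs[job_name].last_heartbeat = now

    def finish(self, job_name: str, now: datetime, succeeded: bool) -> None:
        run = self.runs[job_name]
        run.finished_at = now
        run.succeeded = succeeded

    def stalled(self, now: datetime) -> List[str]:
        """Names of unfinished runs whose last heartbeat is older than stall_after."""
        return [
            run.job_name
            for run in self.runs.values()
            if run.finished_at is None and now - run.last_heartbeat > self.stall_after
        ]


# Hypothetical jobs: one keeps sending heartbeats, the other goes silent.
tracker = JobTracker(stall_after=timedelta(minutes=10))
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
tracker.start("nightly-report", t0)
tracker.start("sync-users", t0)
tracker.heartbeat("sync-users", t0 + timedelta(minutes=9))
print(tracker.stalled(t0 + timedelta(minutes=15)))  # ['nightly-report']
```

A periodic check of `stalled()` could then feed an alert in whatever monitoring tool we end up with.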
My first thought is to move the cron jobs from EC2 to EventBridge Scheduler, because it provides retries, which should help minimize stalls. To monitor the system properly, I want to suggest Datadog.
Then use SQS for the jobs that must not run concurrently.
And Elastic Beanstalk for deployment.
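To make the scheduling part concrete, here is a rough sketch of how I picture one schedule, assuming boto3 and an SQS FIFO queue as the target. The ARNs, the schedule name, and the payload are placeholders, and the retry numbers are arbitrary. As far as I understand, the retry policy covers delivering the event to the target, not a failure inside the job itself, so the worker reading the queue would still need its own error handling.

```python
import json
from typing import Any, Dict


def build_schedule_request(
    name: str,
    cron_expression: str,
    queue_arn: str,
    role_arn: str,
    dlq_arn: str,
    payload: Dict[str, Any],
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Scheduler client's create_schedule call."""
    return {
        "Name": name,
        "ScheduleExpression": f"cron({cron_expression})",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": queue_arn,
            "RoleArn": role_arn,
            "Input": json.dumps(payload),
            # One message group per job: a FIFO queue does not hand out the next
            # message of a group while an earlier one is still in flight.
            "SqsParameters": {"MessageGroupId": name},
            # Arbitrary example values for retrying delivery to the target.
            "RetryPolicy": {
                "MaximumRetryAttempts": 5,
                "MaximumEventAgeInSeconds": 3600,
            },
            # Events that still cannot be delivered end up in this queue.
            "DeadLetterConfig": {"Arn": dlq_arn},
        },
    }


def create_schedule(request: Dict[str, Any]) -> None:
    """Send the request to AWS (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("scheduler").create_schedule(**request)


# Placeholder ARNs and a hypothetical job, for illustration only.
request = build_schedule_request(
    name="nightly-report",
    cron_expression="0 2 * * ? *",  # every day at 02:00 UTC
    queue_arn="arn:aws:sqs:us-east-1:123456789012:cron-jobs.fifo",
    role_arn="arn:aws:iam::123456789012:role/scheduler-to-sqs",
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:cron-jobs-dlq",
    payload={"job": "nightly-report"},
)
```

Calling `create_schedule(request)` would then register the schedule; I have left that call out of the example so it can be read without an AWS account.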
I usually work as a developer. At my previous job a DevOps team was in charge of the infrastructure, but my current company asks everyone to be involved in everything. I am confused because, from what I have learned about AWS, there are several ways to implement this scenario: for example, I could use Lambda, or create another EC2 instance in another AZ. This is the first time I am in charge of designing and implementing an architecture.
May I have another point of view, please?
Without knowing more about what the cron jobs do and how they depend on each other, it seems that you’d benefit from implementing a data pipeline, a saga pattern, or a combination of both.
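As a rough illustration of the saga idea, here is a minimal sketch, assuming each job step can be paired with a compensating action that undoes it. The `run_saga` helper and the step names are hypothetical, and a production version would also need to persist progress so it can survive a crash.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log


def noop() -> None:
    pass


def fail() -> None:
    raise RuntimeError("simulated failure")


# Hypothetical three-step job where the last step fails.
log = run_saga([
    ("export", noop, noop),
    ("charge", noop, noop),
    ("notify", fail, noop),
])
print(log)
# ['done:export', 'done:charge', 'failed:notify', 'compensated:charge', 'compensated:export']
```

The point is that a failure partway through leaves a record of what ran and what was rolled back, which also covers the status-tracking need.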