Scrapy – How to save logs to an AWS S3 bucket?


The logs my Scrapy spider produces keep growing over time, and they are starting to hurt my server's performance. I don't want to reduce the amount of information I log, because I find it very useful in general, and even more so when debugging.

So right now, my options are:

  1. Increase the server's disk space/configuration – this is a short-term fix and not scalable; I can't keep doing it forever.
  2. Reduce the amount of information I store in the log files – as explained above, I'd rather not do that.

Scrapy can already store scraped data (images and other files) in third-party storage services such as AWS S3, and that works well for me.
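For reference, this is roughly how I'm using that built-in behaviour: Scrapy's media pipelines accept an `s3://` URI as the store location. The bucket name and keys below are placeholders for my actual settings:

```python
# settings.py (placeholders for my real bucket and credentials)
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "s3://my-bucket/images/"

# Credentials can also come from the environment or an IAM role.
AWS_ACCESS_KEY_ID = "..."
AWS_SECRET_ACCESS_KEY = "..."
```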

Is there a way to do the same with log files? I was not able to find a solution to this. Right now what I am considering is writing a script that would be triggered once the spider is finished. This script would take the log file, copy it to the AWS S3 bucket, update the database with the path to the log file in the S3 bucket, and delete the log file generated by Scrapy on the server.
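A rough sketch of that post-crawl script, assuming `boto3` and placeholder bucket/prefix names (the database update is specific to my schema, so it's only marked with a comment):

```python
"""Upload a finished Scrapy log file to S3, then delete the local copy.
Bucket name, key prefix, and the DB update are placeholders; boto3 reads
AWS credentials from the environment or ~/.aws/credentials."""
import os
import sys

import boto3

BUCKET = "my-scrapy-logs"   # placeholder bucket name
PREFIX = "scrapy-logs/"     # placeholder key prefix


def archive_log(log_path: str) -> str:
    key = PREFIX + os.path.basename(log_path)
    boto3.client("s3").upload_file(log_path, BUCKET, key)
    # Update the database with the S3 path here, e.g. f"s3://{BUCKET}/{key}"
    os.remove(log_path)
    return f"s3://{BUCKET}/{key}"


if __name__ == "__main__":
    print(archive_log(sys.argv[1]))
```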

It's extra logic that I would need to add and maintain, so I am wondering if there's a better way to deal with this problem.
