When to ignore errors that are common and the program recovers from [closed]

I have a program that makes hundreds of cURL requests, SMTP requests, and other requests every day. Less than 1 percent of the time, a cURL or SMTP request fails. As best I can tell, the cause is external and cannot be fixed to make the requests 100% reliable. My program always recovers from these failures, and no human intervention is ever needed. I have a system in place that sends an email alert when something fails, and the vast majority of the alerts I receive are for these harmless cURL and SMTP failures.

Should I not send an email alert for common failures that the program recovers from?


Depends on your application.

The emails might be useful for statistics, but if you don't use them that way, I would avoid this spam.
What I do in similar cases: send a summary once a day, so that you know how well your program is performing (and that it is still running).

I would only send an immediate email if the error rate exceeds a preset limit that indicates human intervention is needed.
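As a rough illustration of the preset limit idea, here is a minimal Python sketch. The class name `ErrorRateMonitor`, the 5% threshold, and the minimum sample size are all assumptions made for the example, not values from the question; pick numbers that fit your own traffic.

```python
from dataclasses import dataclass


@dataclass
class ErrorRateMonitor:
    """Hypothetical helper: tracks request outcomes and decides when to alert."""
    threshold: float = 0.05   # assumed limit: alert above 5% failures
    min_requests: int = 50    # assumed: don't judge the rate on tiny samples
    total: int = 0
    failed: int = 0

    def record(self, ok: bool) -> None:
        # Count every request, and separately count the failed ones.
        self.total += 1
        if not ok:
            self.failed += 1

    def error_rate(self) -> float:
        # Fraction of failed requests; 0.0 before anything is recorded.
        return self.failed / self.total if self.total else 0.0

    def should_alert(self) -> bool:
        # Alert only with enough data and a rate above the preset limit.
        return self.total >= self.min_requests and self.error_rate() > self.threshold
```

With the usual failure rate of about 1%, `should_alert()` stays false; it only becomes true once the rate climbs past the limit. In practice you would probably reset the counters at the start of each reporting period.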


In this situation I would immediately stop sending the emails.

The error emails should act as a signal that something is wrong and that action needs to be taken.
Because you get so many of them, they become background noise, and you can easily miss a really important error email that came in for another reason.

However, if you normally get about 5 of these emails per hour, and getting one every minute would be abnormal, you should build a mechanism that sends an alert when the number of errors per hour passes a certain threshold. A single error may not mean much anymore, but the number of errors in a given period (minute/hour/day) may point to a bigger problem.
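One way to sketch such a mechanism in Python is a sliding window over error timestamps. The name `SlidingWindowAlert` and the limits used here (10 errors per hour) are assumptions for illustration only; timestamps are passed in as plain seconds so the logic is easy to test.

```python
from collections import deque


class SlidingWindowAlert:
    """Hypothetical helper: flags when too many errors fall inside a time window."""

    def __init__(self, max_errors: int = 10, window_seconds: float = 3600.0):
        self.max_errors = max_errors          # assumed limit per window
        self.window_seconds = window_seconds  # assumed window: one hour
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now` (in seconds).

        Returns True when the number of errors inside the window
        exceeds `max_errors`.
        """
        self._timestamps.append(now)
        # Discard errors that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

In a real program you would call `record_error(time.time())` from the failure handler and send one alert when it returns true. It keeps returning true while the burst lasts, so the caller would also want to limit how often the alert itself goes out.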

Email is not a good tool for keeping track of errors. Look into products such as New Relic or App Insights to record all your errors (and other information), so that you can report on them or send email alerts when certain conditions are met (e.g. when the failure rate changes from 1% to more than 10%).

With separate emails for each error you just end up ignoring the emails, and may not even notice the jump from 1% to 10%. Worse, your email provider may see the large volume of near-identical emails from one address and mark them all as spam.

In this type of situation, try to build a routine that logs error events and sends the log once a day. As Pieter said, also set up an alert for when the number of errors exceeds a limit. That gives you a systematic approach to application management and troubleshooting.
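A daily log of this kind could look roughly like the Python sketch below. The class `DailyDigest` and its method names are hypothetical, and the actual sending of the email is left out; the sketch only collects events and renders the text you would send once a day.

```python
from collections import Counter


class DailyDigest:
    """Hypothetical helper: collects request outcomes for a once-a-day summary."""

    def __init__(self) -> None:
        self.total_requests = 0
        self.errors = Counter()  # failures per request kind, e.g. "curl", "smtp"

    def record(self, kind: str, ok: bool) -> None:
        # Count every request; count failures per kind.
        self.total_requests += 1
        if not ok:
            self.errors[kind] += 1

    def render_and_reset(self) -> str:
        # Build the summary text, then clear the counters for the next day.
        failed = sum(self.errors.values())
        lines = [f"Requests: {self.total_requests}, failed: {failed}"]
        for kind, count in self.errors.most_common():
            lines.append(f"  {kind}: {count}")
        self.total_requests = 0
        self.errors.clear()
        return "\n".join(lines)
```

A scheduled job (cron, for example) could call `render_and_reset()` once a day and mail the result, which also serves as a sign that the program is still running.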
