Monitor Splunk Alerts for Errors
Zubair Rauf | Senior Splunk Consultant – Team Lead
In the past few years, Splunk has become a very powerful tool that helps teams in organizations analyze their log data and take both proactive and reactive action across a plethora of use cases. Almost every Splunker I have worked with relies on Splunk Alerts, which makes monitoring those alerts for errors just as important. Splunk Alerts use a saved search to look for events, either in real time (if enabled) or on a schedule; scheduled alerts are far more commonplace. An alert triggers when its search meets the conditions specified by the alert owner.
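To make that concrete, here is what a typical scheduled alert search might look like. This is only a sketch that assumes a hypothetical app_logs index and log_level field; the saved search would be scheduled (for example, every 15 minutes) and set to trigger when error_count is greater than zero.

index=app_logs log_level=ERROR earliest=-15m@m latest=@m
| stats count as error_count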
When an alert triggers, it calls one or more alert actions that help owners respond to it. Standard alert actions include sending an email and adding the alert to Triggered Alerts, and other Splunk TAs let users integrate with external alerting tools such as PagerDuty, create JIRA tickets, and much more. Users can also create their own custom alert actions to respond to alerts or to integrate with external alerting or MoM (Manager of Managers) tools. These alert actions can fail for a variety of reasons, and when they do, the user never receives the alert they set up. At best this is inconvenient; if the alerts monitor critical services, missed or late notifications can also prove costly.
The following two searches can help users understand if any triggered alerts are not sending emails or the alert action is failing. Alert actions can fail because of multiple reasons, and Splunk internal logs will be able to capture most of those reasons as long as proper logging is set in the alert action script.
Please note that the user running these searches needs access to the "_internal" index in Splunk.
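If you are not sure whether your role can see that index, one quick way to check is the lightweight tstats search below (a generic sketch, not specific to alerting). If it returns a list of sourcetypes with counts, you have the access the searches in this post need.

| tstats count where index=_internal by sourcetype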
The first search looks at email alerts and will tell you, by subject, which alerts did not go through. You can use the information in the results to track down why those emails are failing.
index=_internal host=<search_head> sourcetype=splunk_python ERROR
| transaction startswith="Sending email." endswith="while sending mail to"
| rex field=_raw "subject=\"(?P<subject>[^\"]+)\""
| rex field=_raw "\-\s(?P<error_message>.*)\swhile\ssending\smail\sto\:\s(?P<rec_mail>.*)"
| stats count values(host) as host by subject, rec_mail, error_message
Note: Please replace <search_head> with the name of your search head(s); wildcards will also work.
Legend:
host - The host the alert is saved/run on
subject - Subject of the email - by default it is Splunk Alert: <name_of_alert>
rec_mail - Recipients of the email alert
error_message - Message describing why the alert failed to send email
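If you want to see how these email failures trend over time rather than just totals, a small variation of the same search (same base search and subject extraction as above; adjust the span to match how often your alerts run) buckets the results by hour:

index=_internal host=<search_head> sourcetype=splunk_python ERROR
| transaction startswith="Sending email." endswith="while sending mail to"
| rex field=_raw "subject=\"(?P<subject>[^\"]+)\""
| bin _time span=1h
| stats count by _time, subject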
The second search (below) looks through the internal logs for errors raised while alert actions send alerts to external alerting tools and integrations:
index=_internal host=<search_head> sourcetype=splunkd component=sendmodalert
| transaction action date_hour date_minute startswith="Invoking" endswith="exit code"
| eval alert_status = if(code==0, "success", "failed")
| table _time search action alert_status app owner code duration event_message
| eval event_message = mvjoin(event_message, " -> ")
| bin _time span=2h
| stats values(action) as alert_action count(eval(alert_status="failed")) as failed_count count(eval(alert_status="success")) as success_count latest(event_message) as failure_reason by search, _time
| search failed_count>0
Note: Please replace <search_head> with the name of your search head(s); wildcards will also work.
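Once the second search points you at a failing alert action, you can pull the raw output that the action script wrote to the same internal logs. The sketch below assumes the action name reported in the results is pagerduty (substitute your own) and that the script logs its errors; the exact wording of those log lines depends on the alert action itself.

index=_internal host=<search_head> sourcetype=splunkd component=sendmodalert action=pagerduty (ERROR OR WARN OR STDERR)
| table _time log_level event_message
| sort - _time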
These two searches can be set up as their own alerts, but I would recommend putting them on an Alert Monitoring dashboard that Splunk administrators review periodically to see whether any alerts are failing to send emails or whether any external alerting tool integrations have stopped working. Splunk puts a variety of tools in your hand, but without proper knowledge, every tool becomes a hammer.
To learn more and have our consultants help you with your Splunk needs, please feel free to reach out to us.