Getting Rid of Unwanted Data with SEDCMDs in Splunk
By Aaron Dobrzeniecki, Senior Splunk Consultant
Do you need a safe way to lower your Splunk license ingestion and get rid of unwanted characters or text? You could go back to the application team and ask them to remove the text that you no longer wish to index, but the easiest and quickest way to achieve this goal is to use SEDCMDs in Splunk!
The SEDCMD can be used to remove unwanted characters or text from your events, or to replace certain data with other values, for example to mask sensitive data. With the increasing number of SVC-based licenses, a lot of teams will just ingest everything in their logs. When I say everything, I mean all parts of any events (extremely large events, small events, junk events, etc.). If you currently have an SVC-based license, remember that removing unwanted text or junk data, as well as creating accurate sourcetype settings, will help lower the number of SVCs Splunk uses to parse your data. That in turn lowers your SVC usage for ingestion, leaving more capacity for searching. We will dive more into SEDCMDs below!
Splunk is an effective data analytics and visualization platform that allows organizations to gain valuable insights from their machine-generated data. One of the key features of Splunk is its ability to process and transform data using numerous commands. In this blog post, we’ll dive into the world of SEDCMDs (SED-like stream editing commands) in Splunk and explore how they can be used to manipulate and transform data.
Understanding SEDCMDs
SED (stream editor) is a Unix utility that allows you to perform basic text transformations on an input stream or a file. Splunk has borrowed the concept of SED and introduced SEDCMDs, which are set in props.conf, to modify events before indexing. SEDCMDs are applied to the raw text of each event before it is indexed, enabling you to rewrite specific parts of an event or manipulate the entire event payload if necessary.
Syntax and Usage
The syntax for SEDCMDs in Splunk follows a simple pattern:
SEDCMD-<name> = s/<pattern>/<replacement>/<flags>
<name>: A unique name for the SEDCMD, allowing you to reference it in your configuration.
<pattern>: A regular expression pattern that matches the text you want to replace.
<replacement>: The text you want to substitute for the matched pattern.
<flags>: Optional flags that control the behavior of the substitution. Flags can be either “g” to replace all matches (global), or a number to replace a specified match.
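To make the effect of the flags concrete, here is a minimal Python sketch that approximates the substitution outside of Splunk. It is only an illustration: the helper name sed_substitute and the sample event are made up for this post, Python's re module is not the regex engine Splunk uses, and treating a missing flag as "replace the first match only" is an assumption borrowed from standard sed behavior.

```python
import re


def sed_substitute(event, pattern, replacement, flag=""):
    """Approximate s/<pattern>/<replacement>/<flag> against one event string."""
    if flag == "g":
        # Global: replace every match in the event.
        return re.sub(pattern, replacement, event)

    # Numeric flag: replace only the Nth match.
    # Assumption: no flag means "first match only", as in standard sed.
    target = int(flag) if flag else 1
    seen = 0

    def replace_nth(match):
        nonlocal seen
        seen += 1
        # expand() applies the same template rules re.sub uses for strings.
        return match.expand(replacement) if seen == target else match.group(0)

    return re.sub(pattern, replace_nth, event)


event = "user=alice user=bob user=carol"  # made-up event text
print(sed_substitute(event, r"user=\w+", "user=***", "g"))  # all matches
print(sed_substitute(event, r"user=\w+", "user=***", "2"))  # second match only
print(sed_substitute(event, r"user=\w+", "user=***"))       # first match only
```

Trying a pattern this way before putting it in a sourcetype is a cheap way to catch an expression that matches more (or less) than intended, but the final check should always be done against real events in a test index.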
SED versus SEDCMDs
While SEDCMDs in Splunk are inspired by the traditional SED utility, there are some differences to note. Unlike SED, which operates on a file or a stream of data, SEDCMDs in Splunk are applied to individual events during indexing. This means that each event is processed independently, and because SEDCMDs are set per stanza in props.conf, you can apply different SEDCMDs to different sourcetypes, sources, or hosts.
Let us dive into an example of getting rid of different segments of an event and, just for fun, stripping an entire event down to just one field and value. The data we are using for this post can be found here: Download Splunk Tutorial Files. We will be using the www2 access.log. Below is the sourcetype I created for the first test, with three SEDCMDs:
[unwanted_test]
LINE_BREAKER=([\r\n]+)[\d\.]+
MAX_TIMESTAMP_LOOKAHEAD=20
SHOULD_LINEMERGE=false
TIME_PREFIX=^[^\[]+\[
TRUNCATE=9999
TIME_FORMAT=%d/%b/%Y:%H:%M:%S
SEDCMD-mask=s/^\d+\.\d+\.\d+\.\d+/xxx.xxx.xxx.xxx/g
SEDCMD-reqmethod=s/\"[A-Z]+\s[^\"]+\"\s//g
SEDCMD-useragent=s/\"\w+\/[^\"]+\"\s//g
With these settings, the source IP address is masked and the user agent and request segments are removed. Since this is an all-in-one instance, we do not have the universal forwarder (UF) props settings for EVENT_BREAKER. For more information on creating efficient sourcetypes to successfully onboard your data, check out this TekStream Blog Post.
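The three expressions from the first test can also be tried outside of Splunk. The Python sketch below copies the patterns from the SEDCMD settings above and applies them to a single event. The event is made up in the style of a web access log and is not copied from the tutorial data, the helper name apply_all is hypothetical, and Python's re module only approximates how Splunk evaluates these expressions.

```python
import re

# Made-up event in the style of a web access log (not taken from the tutorial files).
sample_event = (
    '203.0.113.7 - - [18/Aug/2011:10:00:00] '
    '"GET /cart.do?action=view&JSESSIONID=SD10SL4FF4ADFF4976 HTTP 1.1" '
    '200 2252 "http://www.example.com/oldlink?itemId=EST-15" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) Safari/536.5" 506'
)

# (pattern, replacement) pairs copied from the first test's SEDCMD settings.
first_test = [
    (r'^\d+\.\d+\.\d+\.\d+', 'xxx.xxx.xxx.xxx'),  # SEDCMD-mask
    (r'\"[A-Z]+\s[^\"]+\"\s', ''),                 # SEDCMD-reqmethod
    (r'\"\w+\/[^\"]+\"\s', ''),                    # SEDCMD-useragent
]


def apply_all(event, substitutions):
    """Apply each substitution to the event, replacing every match (the g flag)."""
    for pattern, replacement in substitutions:
        event = re.sub(pattern, replacement, event)
    return event


cleaned = apply_all(sample_event, first_test)
print(cleaned)
print(len(sample_event), "->", len(cleaned), "characters")
```

On this sample the referer survives because its quoted value starts with http: rather than a word followed directly by a slash, so only the request and user agent strings are dropped.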
For the second test I will be removing every single segment except the JSESSIONID from the resource requested segment. The result will look something like this: JSESSIONID=SD10SL4FF4ADFF4976. Below is the sourcetype I created with the two SEDCMDs that will allow us to achieve our goal.
[unwanted_test]
LINE_BREAKER=([\r\n]+)[\d\.]+
MAX_TIMESTAMP_LOOKAHEAD=20
SHOULD_LINEMERGE=false
TIME_PREFIX=^[^\[]+\[
TRUNCATE=9999
TIME_FORMAT=%d/%b/%Y:%H:%M:%S
SEDCMD-sid=s/^.*JSESSIONID\=/JSESSIONID=/g
SEDCMD-remove=s/\s[A-Z]+\s.*//g
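As a rough check of the two expressions above, the Python sketch below applies them, in the order they are listed, to one made-up event in the style of a web access log. The event and the helper name apply_in_order are hypothetical, and Python's re module is only an approximation of Splunk's regex handling.

```python
import re

# Made-up event in the style of a web access log (not taken from the tutorial files).
sample_event = (
    '203.0.113.7 - - [18/Aug/2011:10:00:00] '
    '"GET /cart.do?action=view&JSESSIONID=SD10SL4FF4ADFF4976 HTTP 1.1" '
    '200 2252 "http://www.example.com/oldlink?itemId=EST-15" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) Safari/536.5" 506'
)

# (pattern, replacement) pairs copied from the second test's SEDCMD settings.
second_test = [
    (r'^.*JSESSIONID\=', 'JSESSIONID='),  # SEDCMD-sid: drop everything before the ID
    (r'\s[A-Z]+\s.*', ''),                # SEDCMD-remove: drop everything after the ID
]


def apply_in_order(event, substitutions):
    """Apply each substitution in list order, replacing every match (the g flag)."""
    for pattern, replacement in substitutions:
        event = re.sub(pattern, replacement, event)
    return event


reduced = apply_in_order(sample_event, second_test)
print(reduced)
print(len(sample_event), "->", len(reduced), "characters")
```

The second pattern relies on the protocol token (HTTP here) being the first uppercase word surrounded by whitespace after the session ID, so it is worth checking against events with other request shapes before using it on production data.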
The results from the second test are a success. The timestamp is still extracted correctly because the sourcetype has the proper settings. My first SEDCMD replaces everything from the start of the event through JSESSIONID= with just JSESSIONID=. My second SEDCMD removes every single character after the value of the JSESSIONID.
With large amounts of data being ingested into Splunk, how can you tell exactly which unwanted data should be removed? Below are some ways to spot unwanted logs in your data stream.
Identifying Unwanted Data
Unwanted data in Splunk can come in various forms, including irrelevant events, noisy logs, or redundant information. Here are some common scenarios where unwanted data can accumulate:
Redundant Data: Duplicate events or log entries that offer no additional value and only consume storage space.
Irrelevant Sources: Data sources or log files that are no longer needed for analysis or monitoring purposes.
Noise and Low-Value Logs: Log messages that are excessive, verbose, or provide limited insights into system performance or security.
Test and Development Data: Data generated during testing or development phases that may not be relevant for production analysis.
In conclusion, cleansing unwanted data from your Splunk environment is essential for optimizing performance, storage utilization, and data analysis. By implementing effective strategies and following best practices, you can streamline your data ingestion process, improve search efficiency, and ensure that your Splunk environment remains focused on delivering valuable insights. Regularly reviewing and refining your data cleansing approach will help you maintain a clean and efficient data ecosystem, enabling you to derive maximum value from your Splunk investment.