Replace Metadata Field Values Using Props & Transforms

By Zubair Rauf, Team Lead, Cybersecurity Engineering

Splunk is a powerful analytics platform that helps users uncover valuable insights about their day-to-day operations across many areas of business, including security monitoring, IT operations, and custom use cases. It helps customers make informed decisions for better outcomes.

The data onboarding phase for a new data source is critical; done correctly, it yields clean data that produces valuable insights for users. I have often seen users lose much of the potential benefit of Splunk because their data sources were not properly onboarded. Onboarding should be a well-thought-out, deliberate process in which the data owner defines how fields are extracted and how metadata is assigned for each data source.

In a previous blog post, Custom Splunk Metadata Fields with inputs.conf, I explained how to create custom metadata fields using inputs.conf.

In this blog post, I will walk you through rewriting Splunk metadata fields at index time based on a matching regex. This is very valuable if you need to rewrite sourcetypes at index time, or even update the value of the source field based on a regex match.

What Are We Changing?

Splunk has many metadata fields; the most commonly used ones are:

  • index
  • host
  • source
  • sourcetype
  • _time

There are other metadata fields as well which Splunk uses to process data, but these are the fields that are visible to users at search time and are used to classify data at index time. Because indexed data is immutable, these fields must be rewritten before indexing if they need to be changed.

There are many use cases for changing the index, sourcetype, and source fields in Splunk. One of the most common ones I have seen involves syslog sources: a Technical Add-on assigns a single sourcetype to a syslog data stream from a device, and then Splunk's props.conf and transforms.conf update the sourcetype at ingest time based on the data being ingested.

Splunk’s transforms.conf file allows Splunk admins to configure rules for transforming data. It allows users to mask data, create fields at index time, rewrite fields and configure lookups.

Splunk's props.conf file allows Splunk admins to configure how incoming data is parsed — how to break events in a log stream, parse timestamps, transform data, and create fields. The rules defined in transforms.conf are applied through the props configuration.

The props.conf and transforms.conf configurations reside on search heads, indexers, and heavy forwarders, and in some cases universal forwarders. More details about Splunk's props can be found in my colleague's blog post, Data Onboarding in Splunk.

In this blog post, I will show you how to change source and sourcetype while data is being indexed.

Please note that when we make changes to the props and transforms files, a restart is required for the settings to take effect. You can also reload the configs using the debug/refresh endpoint (not recommended in production). If you are using an indexer cluster, the changes must be pushed to the indexers through the Cluster Manager. More information is available at https://docs.splunk.com/Documentation/Splunk/9.1.2/Admin/Configurationfilechangesthatrequirerestart

The Data

In this case, we have generated sample logs using a custom script. It generates two kinds of sample logs:

20231211_093438 File number 3 I will change the sourcetype for this event  
20231211_093438 This is file 3 created in folder 5
20231211_093435 File number 2 I will change the sourcetype for this event  
20231211_093435 This is file 2 created in folder 5
20231211_093432 File number 1 I will change the sourcetype for this event  
20231211_093432 This is file 1 created in folder 5
20231211_093429 File number 3 I will change the sourcetype for this event  
20231211_093429 This is file 3 created in folder 4
20231211_093426 File number 2 I will change the sourcetype for this event  
20231211_093426 This is file 2 created in folder 4
  • The original source is /Users/zubairrauf/var/log/splunk-test/appliance_4/host_4/file_20231004_144903_1.txt
  • The original sourcetype is syslog-test

Below is a screenshot of how the data looks when it’s ingested into Splunk without any transformations.

Transforming The “source” Field

I recently worked with a customer who wanted to remove the date and timestamp from the source field in their syslog feed. They wanted the data to be ingested in the following format.

  • Original source – /var/log/splunk-test/appliance_5/host_5/file_20231211_093438_3.txt
  • New source – /var/log/splunk-test/appliance_5/host_5/file.txt

To achieve this, we will create a stanza in transforms.conf that defines a regex replacement on the source, and apply that transform in the source stanza of props.conf.

#transforms.conf

[orig_source]
INGEST_EVAL = orig_source = source

# orig source example - /Users/username/var/log/splunk-test/appliance_1/host_1/file_20230907_112754_3.txt
[rewrite_syslog_source]
SOURCE_KEY = MetaData:Source
DEST_KEY = MetaData:Source
REGEX = ^(.*)(_\d{8}_\d{6}_\d{1,})(\.txt)$
FORMAT = source::$1$3
WRITE_META = true

In the above transforms.conf, we have created two transforms: one that preserves the original source in a new field (only for the demo purposes of this blog), and one that rewrites the source. You can refer to transforms.conf.spec for more information on the above parameters.

The REGEX parameter in rewrite_syslog_source creates three capture groups, and with the FORMAT parameter we define which of those three groups to use in the new source. See the screenshot below from https://regex101.com, where I tested the regex.
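As a quick sanity check outside Splunk, the same regex can be exercised in Python. This is only a sketch — Splunk uses PCRE rather than Python's re module, but this simple pattern behaves the same in both engines:

```python
import re

# The regex from the rewrite_syslog_source transform, in its
# single-backslash form as written in transforms.conf.
SOURCE_REGEX = re.compile(r"^(.*)(_\d{8}_\d{6}_\d{1,})(\.txt)$")

original = "/var/log/splunk-test/appliance_5/host_5/file_20231211_093438_3.txt"
match = SOURCE_REGEX.match(original)

# FORMAT = source::$1$3 keeps groups 1 and 3 and drops the
# timestamp portion captured by group 2.
new_source = match.group(1) + match.group(3)
print(new_source)  # /var/log/splunk-test/appliance_5/host_5/file.txt
```

Group 1 captures everything up to the timestamp, group 2 captures the `_YYYYmmdd_HHMMSS_N` portion, and group 3 captures the `.txt` extension, so concatenating groups 1 and 3 yields the cleaned-up source.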

Another thing to note in transforms.conf is that when the SOURCE_KEY and DEST_KEY settings refer to "MetaData" values, the field names (e.g. Source) are case sensitive.

#props.conf
#/var/log/splunk-test/....
[source::.../var/log/splunk-test/...]
TRANSFORMS-1 = orig_source
TRANSFORMS-2 = rewrite_syslog_source

In props.conf, I apply the transforms to a source stanza that matches the path to my files. This ensures the source is rewritten and orig_source is also preserved, as shown in the screenshots below.

Transforming The “sourcetype” Field

Just as we transformed the source field above, we can use transforms.conf and props.conf to rewrite the sourcetype when an event matches a certain regex. In this example, we will change the sourcetype of events that include the text "I will change the sourcetype for this event".

In this case, we will create a new transform that changes the sourcetype based on a regex match, and apply it to the sourcetype stanza in props.conf.

#transforms.conf

[orig_sourcetype]
INGEST_EVAL = orig_sourcetype = sourcetype

# sample event for sourcetype change - 
# 20230913_192022 File number  I will change the sourcetype for this event
[rewrite_syslog_sourcetype]
REGEX = (\d{8}_\d{6}\sFile\snumber\s\d{1,2}\sI)
FORMAT = sourcetype::updated_sourcetype
DEST_KEY = MetaData:Sourcetype
WRITE_META = true

In the transforms above, the first preserves the original sourcetype (again, for demo purposes), and the second, rewrite_syslog_sourcetype, rewrites the sourcetype to the value updated_sourcetype when the regex matches.
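The matching logic can be checked outside Splunk with a short Python sketch against two of the sample events (again, Splunk uses PCRE, but this pattern behaves identically in Python's re module):

```python
import re

# The pattern from the rewrite_syslog_sourcetype transform,
# in its single-backslash form.
EVENT_REGEX = re.compile(r"\d{8}_\d{6}\sFile\snumber\s\d{1,2}\sI")

events = [
    "20231211_093438 File number 3 I will change the sourcetype for this event",
    "20231211_093438 This is file 3 created in folder 5",
]

# Events matching the regex get the new sourcetype; everything
# else keeps the original syslog-test sourcetype.
sourcetypes = [
    "updated_sourcetype" if EVENT_REGEX.search(e) else "syslog-test"
    for e in events
]
print(sourcetypes)
```

Only the first event matches, so only it would be reassigned to updated_sourcetype at ingest time.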

The highlighted events will have a sourcetype value of "updated_sourcetype", while the rest of the events will keep the original sourcetype value of "syslog-test".

#props.conf

[syslog-test]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y%m%d_%H%M%S
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 15
TRANSFORMS-3 = orig_sourcetype 
TRANSFORMS-4 = rewrite_syslog_sourcetype

In props.conf, we apply the transforms to the syslog-test sourcetype stanza.
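The timestamp settings in this stanza can also be sanity-checked: TIME_PREFIX = ^ and MAX_TIMESTAMP_LOOKAHEAD = 15 tell Splunk to parse the first 15 characters of each event with the TIME_FORMAT strptime pattern. A minimal Python sketch of the same parse:

```python
from datetime import datetime

# MAX_TIMESTAMP_LOOKAHEAD = 15 covers exactly the YYYYmmdd_HHMMSS
# prefix of each sample event (8 + 1 + 6 = 15 characters).
raw = "20231211_093438 This is file 3 created in folder 5"
parsed = datetime.strptime(raw[:15], "%Y%m%d_%H%M%S")
print(parsed.isoformat())  # 2023-12-11T09:34:38
```

If the parse here fails or returns the wrong datetime, the TIME_FORMAT in props.conf would mis-stamp events the same way.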

Conclusion

As I have shown in this blog post, props and transforms are powerful tools for parsing data ingested into Splunk. Different customers may have different use cases for masking and transforming data at ingest time. In a future blog post, I will demonstrate how to send a subset of data to different indexes based on the content of the events, and also how to mask data before ingestion.