Routing PII Data to Multiple Indexes
By Eric Levy, Splunk Consultant & Zubair Rauf, Team Lead, Cybersecurity Engineering
As Splunk Consultants, we often come across customers whose systems log personally identifiable information (PII) that needs to be indexed into Splunk for various use cases. This data can include, but is not limited to, sales logs, health records, and many other types of log files. Various teams, which may or may not require access to the PII, use this data to build analytics for their daily business needs.
Oftentimes, multiple teams interact with the same data, but some teams need the sensitive portions masked or removed. In this blog post we will walk through a use case that deals with logs from a sales system that records customers’ credit card information when a sale is processed and generates error logs if something doesn’t go as planned.
We used a custom-built script to generate some sample logs which are shown below. The script uses random words from the dictionary to record error messages:
2024-01-11 11:22:38: [TransactionID=1259] [CreditCardNumber=1516-3643-1843-5285] [TransactionAmount=265.81] [TransactionStatus=0]
2024-01-11 11:22:41: [TransactionID=1606] [CreditCardNumber=3301-5491-1216-2042] [TransactionAmount=49.29] [TransactionStatus=0]
2024-01-11 11:22:44: [TransactionID=1814] [CreditCardNumber=8894-10012-6514-9377] [TransactionAmount=878.50] [TransactionStatus=1]
2024-01-11 11:22:48: [ErrorMessage=piscian iodine tussive intenability alody trypanosomatous hypabyssal supervast splashing japishness ] [ErrorStackTrace=piscian iodine tussive intenability alody trypanosomatous ]
2024-01-11 11:22:51: [TransactionID=1735] [CreditCardNumber=2525-2750-4478-8785] [TransactionAmount=275.86] [TransactionStatus=1]
2024-01-11 11:22:55: [ErrorMessage=monadelphous triakistetrahedral slippery belage pungapung pustule leucophanite mantispid spiritualty misdetermine ] [ErrorStackTrace=monadelphous triakistetrahedral slippery belage pungapung pustule leucophanite mantispid spiritualty ]
2024-01-11 11:22:58: [TransactionID=1470] [CreditCardNumber=9384-8198-3585-2396] [TransactionAmount=370.37] [TransactionStatus=0]
2024-01-11 11:23:02: [ErrorMessage=Sunil Gaditan piaster pudicity tarkeean Juan ] [ErrorStackTrace=Sunil Gaditan piaster pudicity tarkeean Juan Sulafat trowelful ]
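For illustration, a minimal Python sketch of such a log generator (hypothetical; not the actual script we used) could look like this:

```python
import random
from datetime import datetime

# Small stand-in word list; the real script pulled random words from a dictionary.
WORDS = ["piscian", "iodine", "tussive", "alody", "pustule", "mantispid"]

def random_card():
    """Return a fake credit card number in the 4x4-digit format shown above."""
    return "-".join(f"{random.randint(1000, 9999)}" for _ in range(4))

def sample_event():
    """Emit either a transaction line or an error line, mimicking the sample logs."""
    ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    if random.random() < 0.7:
        return (f"{ts}: [TransactionID={random.randint(1000, 2000)}] "
                f"[CreditCardNumber={random_card()}] "
                f"[TransactionAmount={random.uniform(1, 1000):.2f}] "
                f"[TransactionStatus={random.randint(0, 1)}]")
    msg = " ".join(random.choices(WORDS, k=6))
    return f"{ts}: [ErrorMessage={msg} ] [ErrorStackTrace={msg} ]"

if __name__ == "__main__":
    for _ in range(5):
        print(sample_event())
```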
We will explore two methods to route and filter the same data and index it into two different indexes. We will use the Splunk props.conf and transforms.conf configuration files to transform our data and achieve the desired results.
Splunk’s props.conf file allows Splunk admins to configure how incoming data is parsed: how to break a log stream into events, parse timestamps, apply transforms to the data, and create fields. The rules defined in transforms.conf are applied through the props configuration.
Splunk’s transforms.conf file allows Splunk admins to configure rules for transforming data. It allows users to mask data, create fields at index time, rewrite fields, and configure lookups.
The props.conf and transforms.conf configurations reside on search heads, indexers, and heavy forwarders, as well as on universal forwarders in some cases. More details about Splunk props can be found in my colleague’s blog post – Data Onboarding in Splunk.
Method 1 – Mask PII and send all data to a separate index
The first method demonstrates how to send all sales log data (transactions and error logs) to a separate index where teams who need access to masked data for their job can use the data.
The unmasked sales data is ingested under the sourcetype named private_sourcetype (for ease of understanding) into the private_index. The configs can be deployed in a new TA used for parsing, or in the $SPLUNK_HOME/etc/system/local/props.conf and transforms.conf files.
With this method, we will first clone the private_sourcetype into a new sourcetype called public_sourcetype, where PII will be masked, and then route it to a public_index.
Parsing for both private_sourcetype and public_sourcetype should be done using the props.conf settings shown further down this post.
To start, in our transforms.conf, we define the transform stanza that will clone our sourcetype. All remaining transformations will be performed on the cloned sourcetype.
###transforms.conf
[clone_sourcetype]
CLONE_SOURCETYPE = public_sourcetype
REGEX = .
DEST_KEY = _MetaData:Index
FORMAT = public_index
WRITE_META = true
This stanza uses the CLONE_SOURCETYPE setting, which is the “secret sauce” of this trick. CLONE_SOURCETYPE allows you to duplicate data into another sourcetype. We then set the DEST_KEY to modify the index: the REGEX value of . matches every event, so the entire cloned stream is rerouted, and FORMAT is set to the name of the new index – in this case, “public_index” (since it will house our masked data).
The props.conf below shows how the transform is applied to the sourcetype. These props for the private_sourcetype use the Super Six recommended settings to parse the logs while ingesting into Splunk.
###props.conf
[private_sourcetype]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20
TRUNCATE = 200
TRANSFORMS-clone_sourcetype = clone_sourcetype
The second transform, included in the same transforms.conf file, masks the data before it is indexed into the new public_index so that it can be consumed by the right people without exposing PII.
###transforms.conf
[mask_credit_card_info]
REGEX = ^(.*?)CreditCardNumber=\d{4,5}-\d{4,5}-\d{4,5}-\d{4,5}(.*)$
FORMAT = $1CreditCardNumber=XXXX-XXXX-XXXX-XXXX$2
DEST_KEY = _raw
The syntax is very similar to how we replaced the index name for the cloned data: we specify a REGEX for the events we want to target (in this case, anything with a credit card number), apply the FORMAT to the specific portion of the event we want to replace (hence the masked value being surrounded by the captured groups $1 and $2), and finally set our DEST_KEY to _raw so the masked event overwrites the raw data.
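To sanity-check the masking regex outside Splunk, we can replay it in Python. (Note that Python’s re module uses \1 and \2 for backreferences where Splunk’s FORMAT uses $1 and $2; the re.sub call here only stands in for what Splunk does internally.)

```python
import re

# Same pattern as the mask_credit_card_info transform above.
MASK_RE = re.compile(r"^(.*?)CreditCardNumber=\d{4,5}-\d{4,5}-\d{4,5}-\d{4,5}(.*)$")
MASK_FMT = r"\1CreditCardNumber=XXXX-XXXX-XXXX-XXXX\2"

event = ("2024-01-11 11:22:38: [TransactionID=1259] "
         "[CreditCardNumber=1516-3643-1843-5285] "
         "[TransactionAmount=265.81] [TransactionStatus=0]")

masked = MASK_RE.sub(MASK_FMT, event)
print(masked)
# The card number is replaced with XXXX-XXXX-XXXX-XXXX; everything else is untouched.
```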
To apply this transform, we add the props.conf settings shown below to the public_sourcetype stanza:
###props.conf
[public_sourcetype]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20
TRUNCATE = 200
TRANSFORMS-mask_credit_card_info = mask_credit_card_info
When all these props and transforms are applied, the private_index should show the unmasked data and the public_index should show the masked data.
Method 2 – Filter and route error logs to a separate index
With this method, we will explore how to filter and route only the error logs to the public_index. Since no PII is routed to the public_index, no masking is needed.
To start, like we did in method 1, we will clone the sourcetype to sales_error_log (to make it easy to identify the data) using the transforms.conf config file.
###transforms.conf
[clone_sourcetype]
CLONE_SOURCETYPE = sales_error_log
REGEX = ^.*ErrorMessage=
DEST_KEY = _MetaData:Index
FORMAT = public_index
WRITE_META = true
In the above clone_sourcetype transforms stanza, we use the CLONE_SOURCETYPE parameter to specify a name for the new sourcetype. As we are only cloning the error message events, we specify a REGEX that matches the relevant event format. We also set DEST_KEY and FORMAT so that these new duplicated events are routed to the public_index.
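To confirm that the transforms.conf REGEX above clones only the error events, we can test it against the two event shapes in Python:

```python
import re

# Same pattern as the clone_sourcetype transform's REGEX above.
ERROR_RE = re.compile(r"^.*ErrorMessage=")

txn = ("2024-01-11 11:22:38: [TransactionID=1259] "
       "[CreditCardNumber=1516-3643-1843-5285] "
       "[TransactionAmount=265.81] [TransactionStatus=0]")
err = ("2024-01-11 11:22:48: [ErrorMessage=piscian iodine tussive ] "
       "[ErrorStackTrace=piscian iodine ]")

print(bool(ERROR_RE.search(txn)))  # False - transaction events are not cloned
print(bool(ERROR_RE.search(err)))  # True  - error events are cloned and routed
```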
We apply this transform in the props.conf to the private_sourcetype, and also create a new stanza for the sales_error_log so that Splunk parses the duplicated logs correctly as well.
###props.conf
[private_sourcetype]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20
TRUNCATE = 200
TRANSFORMS-clone_private_sourcetype = clone_sourcetype
[sales_error_log]
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 20
TRUNCATE = 200
With these props and transforms settings, we should be able to duplicate the error logs and send the duplicated copy to the public_index for consumption.
Here we’ve outlined two ways you can route sensitive PII data to multiple indexes using props and transforms without duplicating it at the source. The first puts the same data in both indexes while redacting sensitive information (in this case, a credit card number); the second separates regular events from error messages. Both approaches show the flexibility of Splunk’s data management configurations and how they can provide tailored solutions for a wide array of use cases.
Find other Splunk blogs here, Splunk offerings here, and always feel free to reach out if you have any questions. Happy Splunking!