Summary Indexing with Collect
By: Karl Cepull | Senior Director, Operational Intelligence
Splunk can be a valuable tool in cybersecurity. Attacks from outside forces, along with questionable activity within a network, put sensitive data and corporate assets at tremendous risk. Using Splunk to find bad actors or malicious events can help an organization protect itself and discover breaches in time to act. However, larger corporate environments may see millions of events from their firewalls in a single hour. When looking at traffic trends over the past seven days, the number of events may make the search inefficient and expensive to run. This is when a summary index can be used to reduce the volume of data being searched, and return meaningful results quickly.
Summary indexes can be tremendously helpful to find a needle in the haystack of data. If the need is to determine the most frequent source IP for inbound traffic, fields such as vendor, product, and product version may not be helpful. Including these fields adds more work to the search process, multiplied by millions (and sometimes billions) of events. Summarizing the data into source IP, destination IP, ports used, and action (just to name a few) helps to ease the strain of the search. When done correctly, summary indexes won’t have an impact on your license usage.
A summary index starts off as a normal index. The specifications of the index need to be defined in indexes.conf. Data cannot be summarized to an index that does not exist. Adding data to the index can be done by adding a “collect” statement at the end of the search. The structure of the command is:
collect index=<index name> <additional arguments>
The collect command only requires an index to be specified. Other arguments can be added to the collect command:
Argument | Description | Default |
addtime | True/False – This determines whether to add a time field to each event. | True |
file | String – when specified, this is the name of the file where the events will be written. A timestamp (epoch) or a random number can be used by specifying file=$timestamp$ or file=$random$. | <random-number>_events.stash |
host | String – The name of the host you want to specify for the events. | n/a |
marker | String – one or more key-value pairs to append to each event, separated by a comma or a space. Spaces or commas in a value must be escape-quoted: field=value A becomes field=\"value A\". | n/a |
output_format | raw or hec – specifies the output format. | raw |
run_in_preview | True/False – Controls whether the collect command is enabled during preview generation. Change to True to make sure the correct summary previews are generated. | False |
spool | True/False – The default of True writes the data to the spool directory, where it’s indexed automatically. If set to False, the data is written to $SPLUNK_HOME/var/run/splunk. The file will remain there unless moved by other automation or administrative action. This can be helpful when troubleshooting so summary data doesn’t get ingested. | True |
source | String – Name or value for the source. | n/a |
sourcetype | String – Name or value for the sourcetype. The default summary sourcetype is “stash.” | stash |
testmode | True/False – If set to True, the results are not written to the summary index, but the search results are made to appear as they would be sent to the index. | False |
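As a rough illustration of how a few of these arguments combine, the sketch below previews what would be written without indexing anything. The index names, sourcetype, field list, and the marker value report=firewall_summary are assumptions chosen for illustration only.

```spl
index=firewall_index sourcetype=network_traffic
| table action, src_ip, dest_ip, dest_port
| collect index=summary_firewall testmode=true marker="report=firewall_summary"
```

With testmode=true, the results should appear as they would be sent to the summary index. Removing that argument (or setting it to false) would let the same search write the events.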
When using the collect command, there are two important details to remember:
- Changing the sourcetype to something other than “stash” will cause the summary data to count against your license usage.
- Unless specific fields are selected with a table command, the collect command will grab all returned fields, including _raw. Adding the raw data to the summary index reduces its effectiveness.
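Another way to keep _raw out of the summary is to aggregate before collecting, since the output of a stats command has no _raw field. The following is a minimal sketch, not the only approach; the index names, sourcetype, field names, and the 10-minute span are assumptions for illustration.

```spl
index=firewall_index sourcetype=network_traffic
| bin _time span=10m
| stats count AS event_count, sum(bytes) AS total_bytes BY _time, src_ip, dest_ip, dest_port, action
| collect index=summary_firewall source=summary_search
```

This trades per-event detail for one row per combination of time bucket, addresses, port, and action, which may or may not suit a given investigation.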
With the collect command set and data ready to be sent to a summary index, the next step is to create a scheduled search. The search should run frequently enough to surface threats while there is still time to act, but be spaced far enough apart to limit resource utilization. A summary index is more helpful for historical insights and trends than for real-time or near-real-time searches.
Going back to our original example of summarizing firewall logs, here is an example of a scheduled search:
index=firewall_index sourcetype=network_traffic
| fields action, application, bytes, dest_ip, dest_port, src_ip, packets
| table action, application, bytes, dest_ip, dest_port, src_ip, packets
| collect index=summary_firewall source=summary_search
Set this search to run every 10 minutes, looking back 10 minutes. The summary index will get a list of events with these fields specified, and continue to receive new events every 10 minutes. Historical searches for trends can now run without having to dig through unnecessary data and provide faster analysis on traffic patterns and trends.
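To show the payoff, here is a sketch of a historical search against the summary index that looks for the most frequent source IPs over the past seven days. It assumes the summary_firewall index and summary_search source used in the example above, and that src_ip is extracted from the summarized events at search time; the result field name connections is hypothetical.

```spl
index=summary_firewall source=summary_search earliest=-7d@d latest=now
| stats count AS connections BY src_ip
| sort - connections
| head 10
```

Because the summarized events carry only the selected fields, a search like this should scan far less data than the same question asked of the original firewall index.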