Accelerating Data Modeling: Expert Tips
By: Jon Walthour | Senior Splunk Consultant, Team Lead
- Common Information Model (CIM)
- DBX Health Dashboards
- Palo Alto app
- Splunk Global Monitoring Console
- Infosec
- CIM Validator
- CIM Usage Dashboards
- ArcSight CEF data models add-on
- SA-Investigator
- Threat hunting
All these Splunk apps and add-ons, and many others, use data models to power their searches. For a data model-powered search to perform at its peak, the underlying data model is often accelerated. This means that at regular, frequent intervals, Splunk runs the searches that define these data models, then summarizes the results and stores them on the indexers. And, because of the design of data models and data model accelerations, this summarized data stored on the indexers is tied to the search head or search head cluster that created it.
So, imagine it: You’re employing many different apps and add-ons in your Splunk deployment that all require these data models. Often you need the same data models accelerated on several different search heads for different purposes. All these data models on all these search heads run search jobs to keep their summarized data current. All this summarized data is stored again and again on the indexers, each copy of a bucket’s summary data identical, but tied to a different search head.
In a large distributed deployment with separate search heads or search head clusters for Enterprise Security, IT Service Intelligence, ad hoc searching, and so on, you end up accelerating these data models everywhere you want to use them: on each search head or search head cluster, on your Monitoring Console instance, on one or more of your heavy forwarders running DB Connect, and more. That’s a lot of duplicate searches consuming CPU and memory on both your search heads and your indexers, and a lot of duplicate accelerated data consuming storage on those indexers.
There is a better way, though. Beginning with version 8.0, you can now share data models across instances—run once, use everywhere in your deployment that uses the same indexers. You accelerate the data models as usual on Search Head 1. Then, on Search Head 2, you direct Splunk to use the accelerated data created by the searches run on Search Head 1. You do this in datamodel.conf on Search Head 2 under the stanzas for each of the data models you want to share by adding the setting “acceleration.source_guid” like this:
[<data model name>]
acceleration.source_guid = <GUID of Search Head 1>
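For example, pointing the CIM Authentication data model at accelerations built on Search Head 1 would look like this (the GUID here is made up for illustration):

[Authentication]
acceleration.source_guid = 3A6C2741-F38A-4E7C-B8E2-5D9C11A40B7F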
You get the GUID from one of two places. If a standalone search head created the accelerated data, the GUID is in $SPLUNK_HOME/etc/instance.cfg. If the accelerated data was created by data model searches run on a search head cluster, you will find the GUID for the cluster in server.conf on any cluster member in the [shclustering] stanza.
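For reference, the relevant snippets look roughly like this (GUIDs again made up; on a cluster member the GUID is, as I recall, the id attribute):

In $SPLUNK_HOME/etc/instance.cfg on a standalone search head:

[general]
guid = 3A6C2741-F38A-4E7C-B8E2-5D9C11A40B7F

In server.conf on a search head cluster member:

[shclustering]
id = 9B1E57D0-22C4-4F93-A1D6-8E0F4C72A519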
That’s it, but there are a few “gotchas” to keep in mind.
First, keep in mind that everything in Splunk exists in the context of an app, also known as a namespace. So, the data models you’re accelerating are defined in the context of an app. Thus, the datamodel.conf you’re going to have on the other search heads with the “acceleration.source_guid” setting must be defined in the same namespace (the same app) as the one in which the data model accelerations are generated on the originating search head.
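For instance, if the accelerations are defined in the Splunk_SA_CIM app on the originating search head, the sharing search heads need the override in that same app (path shown purely as an illustration):

$SPLUNK_HOME/etc/apps/Splunk_SA_CIM/local/datamodel.conf

[Authentication]
acceleration.source_guid = <GUID of Search Head 1>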
Second, once you set up this sharing, you cannot edit the data models on the search heads sharing the accelerated data (Search Head 2, in our example above) via Splunk Web. You have to set up this sharing by editing datamodel.conf directly, and you can only change it the same way. You will also not be able to rebuild the accelerated data on the sharing search heads, for the obvious reason that they did not build it in the first place.
Third, as with all other things in multisite indexer clusters, sharing data model accelerations gets more complicated in a multisite deployment. The summary data hitches a ride with the primary buckets, which end up spread across the sites, while search heads get “assigned” to particular sites. To keep searches of summary data complete, set “summary_replication” to “true” in the [clustering] stanza in server.conf; this ensures that every searchable copy of a bucket, not just the primary, has a copy of the accelerated data. There are other ways to deal with this issue, but I’ve found that simply replicating the accelerated data to all searchable copies is the best way to ensure no missing data and no duplicate data.
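The setting itself is a one-liner in server.conf; to the best of my knowledge you set it on the cluster manager, which distributes it to the peers:

[clustering]
summary_replication = true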
Finally, when you’re running a tstats search against a shared data model, always use summariesonly=true. Again, this ensures a consistent view of the data as unsummarized data could introduce differing sources and thus incorrect results. One way to address this is to ensure the definition of the indexes that comprise the sources for the data models in the CIM (Common Information Model) add-on are consistent across all the search heads and search head clusters.
And this leads us to the pièce de résistance, the way to take this feature to a whole new level: install a separate data model acceleration search head built entirely for the purpose of running the data model accelerations. It does nothing else; in a large deployment, accelerating all the data models will keep it quite busy. This means the search head will need plenty of memory and plenty of CPU cores to ensure the acceleration search jobs run smoothly and quickly and do not queue up waiting for CPU resources or, worse yet, get skipped altogether. The data models for the entire deployment are managed on this job server. They are all accelerated by this instance, and every other search head and search head cluster has a datamodel.conf in which every data model stanza has an “acceleration.source_guid” setting pointing to this data model job search head.
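On each consuming search head or cluster, that datamodel.conf becomes a simple list of pointers back to the DMA instance, one stanza per shared model (a sketch using CIM model names; substitute the real GUID):

[Authentication]
acceleration.source_guid = <GUID of the DMA search head>

[Network_Traffic]
acceleration.source_guid = <GUID of the DMA search head>

[Web]
acceleration.source_guid = <GUID of the DMA search head>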
This gives you two big advantages. First, all the other search heads and clusters are freed up to use the accelerated data models without having to expend the resources to maintain them. It separates the maintenance of the data model accelerations from the use of them. Even in an environment where only one search head or search head cluster is utilizing these accelerated data models, this advantage alone can be significant.
So often in busy Enterprise Security implementations, you can encounter significant skipped search ratios because regularly run correlation searches collide with regularly run acceleration jobs and there just aren’t enough host resources to go around. Offloading the acceleration jobs to a separate search head greatly diminishes the risk of data model content loss from skipped accelerations and of missed notable events from skipped correlation searches.
Second, since only one instance creates all the data models, there is only one copy of the summary data on the indexers, not multiple duplicate copies for various search heads, saving potentially gigabytes of disk space. And, since the accelerations are only run once on those indexers, indexer resources are freed up to handle more search load.
In the world of medium and large distributed Splunk deployments, Splunk instances get specialized: indexers do indexing, search heads do searching. We also often have specialized instances for the Monitoring Console, the Cluster Manager, the Search Head Cluster Deployer, and complex modular inputs like DB Connect, Splunk Connect for Syslog, and the AWS add-ons. The introduction of Splunk Cloud has brought us the “Inputs Data Manager,” or IDM, instance for these modular inputs. I submit that we should add another instance type to this repertoire: the DMA (data model acceleration) instance, to handle all the data model accelerations. No decently sized Splunk deployment should be without one.
Want to learn more about data model accelerations? Contact us today!