In my last few posts I talked about hunting for anomalies in network data. I wanted to expand on that a bit and specifically talk about a way we can create metadata around detectable events and use those additional data points for hunting or anomaly detection. The hope is that the metadata will point us to areas of investigation that we may not normally pursue.
For this post I'm again using the BOTS data from Splunk, and I've created several saved searches based on behaviors we may see during an intrusion. Once the saved searches run, the output results are logged to a summary index. More on that topic can be found here: http://findingbad.blogspot.com/2017/02/hunting-for-chains.html. The goal is to get all of our detection data into a queryable location, as well as into a form that we can count and score.
For our saved searches, we want to ensure the following when creating detections based on behaviors:
- Focus on accuracy regardless of fidelity.
- A field that will signify an intrusion phase where this detection would normally be seen.
- A field where a weight can be assigned based on criticality.
- A common field that can be found in each detection output that will identify the asset or user (src_ip, hostname, username...).
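To make the requirements above concrete, the record each saved search writes to the summary index might look something like the minimal sketch below. The field names follow the post; the saved-search name and all values are invented for illustration.

```python
# A minimal sketch of one detection result written to the summary index.
# The saved-search name and values here are hypothetical examples.
detection_event = {
    "hostname": "host-01",                 # common identifier (could also be src_ip or username)
    "source": "detect_lateral_movement",   # name of the saved search (hypothetical)
    "phase": "lateral_movement",           # intrusion phase where this detection is normally seen
    "weight": 5,                           # number representing criticality of the event
}

# Every saved search should emit these same fields so the summary
# index can be aggregated and scored consistently across detections.
required_fields = {"hostname", "source", "phase", "weight"}
assert required_fields <= set(detection_event.keys())
```

The key design point is consistency: because every detection carries the same identifier, phase, and weight fields, the later scoring and clustering steps can treat all detections uniformly.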
Once the output of our saved searches begins to populate the summary index we would like to have results similar to the screenshot below:
The following is the definition of the fields:
(Note: the events in the screenshot have been deduped. All calculations have taken place, but I am limiting the number of rows. Much of what is identified in the output is data from the last detection before the dedup occurred.)
- hostname: Self-explanatory, but I am also using the src_ip where the hostname can't be determined.
- source: The name of the saved search.
- weight: Number assigned that represents criticality of event.
- phase: Identifier assigned for phase of intrusion.
- tweight: The sum weight of all detected events.
- dscount: The distinct count of unique detection names (source field).
- pcount: The number of unique phases identified.
- scount: Total number of detections identified.
- phasemult: An additional value given for the number of unique phases identified, where that number is > 1.
- sourcemult: An additional value given for the number of unique sources identified, where that number is > 1.
- weighted: The sum score of all values from above.
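Putting the definitions above together, the per-host calculations could be sketched in Python as follows. This is a hedged reconstruction, not the post's actual search logic: the bonus values used for phasemult and sourcemult, and the exact terms summed into weighted, are assumptions for illustration.

```python
def score_host(events):
    """Compute the per-host fields described above.

    events: list of detection dicts for one host, each with
    'source', 'phase', and 'weight' keys (as written by the saved searches).
    """
    tweight = sum(e["weight"] for e in events)        # sum weight of all detected events
    scount = len(events)                              # total number of detections
    dscount = len({e["source"] for e in events})      # distinct count of detection names
    pcount = len({e["phase"] for e in events})        # number of unique phases identified
    phasemult = 2 if pcount > 1 else 0                # bonus when phases > 1 (assumed value)
    sourcemult = 2 if dscount > 1 else 0              # bonus when sources > 1 (assumed value)
    # "weighted" is interpreted here as the sum of all values above.
    weighted = tweight + dscount + pcount + scount + phasemult + sourcemult
    return {
        "tweight": tweight, "dscount": dscount, "pcount": pcount,
        "scount": scount, "phasemult": phasemult, "sourcemult": sourcemult,
        "weighted": weighted,
    }
```

A host with three detections across two sources and two phases, for example, picks up both bonus values, which pushes its weighted score above a host that tripped the same single detection three times.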
There are a few points that I want to discuss around the additional fields that I've assigned and the reasons behind them.
- Phases (phase, pcount, phasemult): Actors or insiders will need to step through multiple phases of activity before data theft occurs. Identifying multiple phases in a given period of time may be an indicator of malicious activity.
- Sources (source, scount, dscount, sourcemult): A large number of detections may be less concerning if they are all finding the same activity over and over. Actors or insiders need to perform multiple steps before data theft occurs, and therefore a smaller number of detections, where those detections surround different actions, would be more concerning.
- Weight: Weight is based on criticality. If I see a large weight with few detections, I can assume the behavior may have a higher likelihood of being malicious.
- Weighted: High scores tend to reflect a larger number of identified behaviors, where those behaviors span multiple phases.
Now that we've performed all of these calculations and have a good understanding of what they are, we can run k-means and cluster the results. I downloaded a CSV from the Splunk output and named it cluster.csv. Using the below code, you can see I chose 3 clusters using the tweight, phasemult and scount fields. I believe that the combination of these fields can be a good representation of anomalous behavior (I could also plug in other combinations to potentially surface other behaviors).
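The clustering step might look something like the sketch below. This is a reconstruction rather than the post's exact code: in the post the data came from cluster.csv (exported from Splunk), so the few rows built inline here are invented stand-ins to keep the sketch self-contained.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in rows; in the post this would be: df = pd.read_csv("cluster.csv")
df = pd.DataFrame({
    "hostname": ["host-01", "host-02", "host-03", "host-04", "host-05"],
    "tweight": [22, 4, 5, 18, 3],
    "phasemult": [2, 0, 0, 2, 0],
    "scount": [6, 1, 2, 5, 1],
})

# Cluster on the same three fields chosen in the post: tweight,
# phasemult, and scount, with k=3.
features = df[["tweight", "phasemult", "scount"]]
km = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = km.fit_predict(features)

print(df.sort_values("cluster"))
```

Because k-means groups hosts by distance in this three-dimensional feature space, hosts with unusually high combined scores separate out into small clusters that make natural starting points for investigation.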
The following is the contents of those clusters.
Based on the output, the machine in cluster 1 definitely should be investigated, and I would investigate the machines in cluster 2 as well.
Granted, this is a fairly small data set, but it's a great representation of what can be done in much larger environments. This method could also be scheduled and automated, with the results actioned, correlated, alerted on, etc.
Again I would like to thank the Splunk team for producing and releasing BOTS. It's a great set of data to test with and learn from.