A few years ago I wrote a post about ways that you can correlate different characteristics of backdoor beaconing. By identifying and combining these characteristics, you may be able to identify unknown backdoors and possibly generate higher-fidelity alerting. The post can be found here: http://findingbad.blogspot.com/2018/03/c2-hunting.html
What I didn't talk about was using flow data to identify C2. When traffic is SSL/TLS-encrypted, you may lack the data needed to correlate those characteristics and have to rely on other sources of information. So how do we go hunting for C2 in network flows? First we need to define what that may look like.
- Beacons generally create uniform byte patterns
- Active C2 generates non-uniform byte patterns
- There are far more flows that are uniform than non-uniform
- Active C2 happens in spurts
- These patterns will be anomalous when compared to normal traffic
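As a rough illustration of the characteristics above, here is a minimal sketch that measures how uniform the byte counts are in consecutive groups of flows. The function names, the window size, and the byte values are all made up for the example; they are not taken from any real capture.

```python
from collections import Counter

def dominant_share(byte_counts):
    """Fraction of flows that share the most common byte count (non-empty input)."""
    counts = Counter(byte_counts)
    return counts.most_common(1)[0][1] / len(byte_counts)

def windowed_shares(byte_counts, window=10):
    """Dominant share for each consecutive group of `window` flows."""
    return [
        dominant_share(byte_counts[i:i + window])
        for i in range(0, len(byte_counts), window)
    ]

# Hypothetical flow sizes: idle beaconing, a spurt of active use, idle again.
flows = (
    [312] * 20
    + [312, 4096, 958, 312, 17520, 2210, 640, 312, 8033, 1290]
    + [312] * 10
)
shares = windowed_shares(flows)
print(shares)  # [1.0, 1.0, 0.3, 1.0]
```

The idle windows are perfectly uniform, while the window containing the spurt drops to 0.3, which is the kind of drastic change worth alerting on.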
I've said for a long time that one way to find malicious beaconing in network flow data is to look for patterns of beacons (uniform byte patterns) and alert when the patterns drastically change (non-uniform byte patterns). The problem I had was figuring out how to do just that with the tools I had. I think we (or maybe just me) often get stuck on a single idea. When we hit a roadblock we lose momentum and can eventually let the idea go, though it may remain in the back of our heads.
Last week I downloaded the latest Splunk BOTS data source and loaded it into a Splunk instance I have running on a local VM. I wanted to use this to explore some ideas I had using Jupyter Notebook. That's when the light came on. Below is what I came up with.
This Splunk search performs the following:
- Collects all flows that are greater than 0 bytes
- Counts the number of flows by each unique byte count by src_ip, dest_ip, and dest_port (i_bytecount)
- Counts the total number of flows between src_ip, dest_ip (t_bytecount)
- Counts the unique number of byte counts by src_ip, dest_ip (distinct_byte_count)
- Generates a percentage of traffic by unique byte count between src_ip, dest_ip (avgcount)
The thought is that a beacon's byte count will account for a high percentage of the overall traffic between two endpoints (a high avgcount). Active C2 will vary in byte counts, which shows up in distinct_byte_count.
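For readers without Splunk handy, the same fields can be approximated in pandas. This is a sketch of the steps listed above, not the original search: it assumes a flow table with src_ip, dest_ip, dest_port and bytes_out columns (the byte field name is a guess based on the column used later in the notebook), and the sample rows are invented.

```python
import pandas as pd

PAIR = ["src_ip", "dest_ip"]

def beacon_features(flows: pd.DataFrame) -> pd.DataFrame:
    """Per-byte-count beacon features for each src_ip/dest_ip pair."""
    # Keep only flows that actually carried data.
    flows = flows[flows["bytes_out"] > 0]

    # i_bytecount: flows sharing one byte count for a src/dest/port triple.
    per_size = (
        flows.groupby(PAIR + ["dest_port", "bytes_out"])
        .size()
        .reset_index(name="i_bytecount")
    )

    # t_bytecount: all flows for the pair; distinct_byte_count: unique sizes.
    totals = (
        flows.groupby(PAIR)["bytes_out"]
        .agg(t_bytecount="size", distinct_byte_count="nunique")
        .reset_index()
    )

    out = per_size.merge(totals, on=PAIR)
    # avgcount: percentage of the pair's flows that use this byte count.
    out["avgcount"] = (100 * out["i_bytecount"] / out["t_bytecount"]).round(2)
    return out

# Invented sample: nine identical beacons and one larger flow.
sample = pd.DataFrame({
    "src_ip": ["10.0.0.5"] * 10,
    "dest_ip": ["203.0.113.7"] * 10,
    "dest_port": [443] * 10,
    "bytes_out": [312] * 9 + [4096],
})
features = beacon_features(sample)
print(features)
```

In the sample, the 312-byte row comes out with an avgcount of 90 and a distinct_byte_count of 2, which is the beacon-like shape described above.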
I then wanted to identify anomalous patterns (if any) within this data. For this I used K-Means clustering, since I wanted to see whether any patterns fell outside the norm, using the following Python code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
# Load the CSV of search results
df = pd.read_csv("ByteAvgs1.csv")
# Coerce the fields to numbers; anything unparseable becomes NaN
df['t_bytecount'] = pd.to_numeric(df['t_bytecount'], errors='coerce')
df['i_bytecount'] = pd.to_numeric(df['i_bytecount'], errors='coerce')
df['avgcount'] = pd.to_numeric(df['avgcount'], errors='coerce')
df['distinct_byte_count'] = pd.to_numeric(df['distinct_byte_count'], errors='coerce')
df['bytes_out'] = pd.to_numeric(df['bytes_out'], errors='coerce')
# Build the feature matrix, dropping rows K-Means cannot handle (NaN)
X = df[['avgcount', 't_bytecount', 'distinct_byte_count']].dropna()
X = X.reset_index(drop=True)
# Fit two clusters and read back the label assigned to each row
km = KMeans(n_clusters=2)
km.fit(X)
labels = km.labels_
# Plot the three features in 3D, colored by cluster
fig = plt.figure(1, figsize=(7,7))
ax = fig.add_axes([0, 0, 0.95, 1], projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(X['avgcount'], X['t_bytecount'], X['distinct_byte_count'], c=labels)
ax.set_xlabel("Beacon Percentage")
ax.set_ylabel("Total Count")
ax.set_zlabel("Distinct Byte Count")
plt.title("K Means", fontsize=14)
plt.show()
I was able to visualize the following clusters:
While the majority of the traffic looks normal, there are definitely a few outliers. The biggest outlier based on the Beacon Percentage and Total Count is:
There were 3,865 flows, 97% of which had the same byte count. There were also 19 unique byte counts between these two IPs.
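Rather than reading outliers off the plot, one option is to list the members of the least-populated cluster. The sketch below is a variation, not the code used for the plot: it assumes the features are scaled first with StandardScaler (so the large t_bytecount values do not dominate), it treats the smallest cluster as the suspicious one, and it runs on synthetic stand-in rows plus a single row shaped like the outlier described above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["avgcount", "t_bytecount", "distinct_byte_count"]

def smallest_cluster(df: pd.DataFrame, n_clusters: int = 2, seed: int = 0) -> pd.DataFrame:
    """Return the rows that land in the least-populated K-Means cluster."""
    # Scale so t_bytecount (thousands) does not swamp avgcount (0-100).
    X = StandardScaler().fit_transform(df[FEATURES])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    # Assumption: the rarest label marks the anomalous group.
    rare = np.bincount(labels, minlength=n_clusters).argmin()
    return df[labels == rare]

# Synthetic data: 200 unremarkable pairs plus one outlier-shaped row.
rng = np.random.default_rng(0)
normal = pd.DataFrame({
    "avgcount": rng.normal(10, 2, 200),
    "t_bytecount": rng.normal(50, 10, 200),
    "distinct_byte_count": rng.normal(5, 1, 200),
})
outlier = pd.DataFrame({
    "avgcount": [97.0],
    "t_bytecount": [3865.0],
    "distinct_byte_count": [19.0],
})
data = pd.concat([normal, outlier], ignore_index=True)
suspects = smallest_cluster(data)
print(suspects)
```

On real data the smallest cluster may hold more than one pair, or nothing interesting at all, so treat the output as a short list to investigate rather than a verdict.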
Taking a quick look into the IP, we can assume that this machine was compromised based on the command for the netcat relay (it will take more analysis to confirm):
Obviously this is a quick look into a limited data set, and it needs more runtime to prove out. It does, though, speak to exploring new ideas and new methods (or in this case, old ideas and new methods). You never know what you may surface.
I'd also like to thank the Splunk team for making the data available to everyone. If you would like to download it, you can find it here: https://www.splunk.com/en_us/blog/security/botsv3-dataset-released.html.