Anomalous Network Traffic


About The Data

This puzzle comes from Dr. Chiara Sabatti at Stanford University. The dataset has 21K rows and covers 10 local workstation IPs over a three-month period. Half of these local IPs were compromised at some point during this period and became members of various botnets. Each row consists of four columns:

  • date: yyyy-mm-dd (from 2006-07-01 through 2006-09-30)
  • l_ipn: local IP (coded as an integer from 0-9)
  • r_asn: remote ASN (an integer which identifies the remote ISP)
  • f: flows (count of connections for that day)

How Does It Work?

Each circle represents a local machine, identified by the IP number above it. A circle changes in size according to the relative number of connections that machine made on the selected day. The outer circle marks the 95th percentile of that machine's daily connection counts over the whole period. The slider steps through the dates in order, showing how each machine's network traffic changes over time. The red lines above the slider indicate, for each day, how many machines had activity above their 95th percentile.
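The two quantities the display depends on can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the visualization: the function names relative_size and machines_over_threshold are hypothetical, the totals and thresholds are made-up toy values, and whether the ratio drives a circle's radius or its area is an assumption left open here.

```python
from collections import Counter


def relative_size(total, threshold):
    """Ratio of a machine's daily connections to its own 95th-percentile value.

    A ratio above 1 means the inner circle would extend past the outer ring.
    """
    return total / threshold if threshold > 0 else 0.0


def machines_over_threshold(totals, thresholds):
    """Count, per date, the machines whose daily total exceeds their threshold.

    totals:     {(date, l_ipn): connections that day}
    thresholds: {l_ipn: 95th-percentile value for that machine}
    """
    counts = Counter()
    for (date, l_ipn), total in totals.items():
        if total > thresholds[l_ipn]:
            counts[date] += 1
    return counts


# Toy values for illustration only (not taken from the dataset).
totals = {
    ("2006-07-01", 0): 5,
    ("2006-07-01", 1): 3,
    ("2006-07-02", 0): 40,
    ("2006-07-02", 1): 12,
}
thresholds = {0: 20.0, 1: 10.0}

over = machines_over_threshold(totals, thresholds)
```

Here over maps each date to the height of the corresponding red line; dates on which no machine crossed its threshold simply read as zero.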

Data Preparation

To compute the number of connections made by each IP, I first grouped the rows by date and IP, then summed the flows within each group. I then grouped these daily totals by IP and computed the 95th percentile for each machine to get the bounding values for the outer circles. Activity levels varied widely between machines, so it was important to standardize the visual encoding: each circle’s size is relative to that machine’s own activity. Under the assumption that anomalous behavior occurs when a machine exceeds the 95th percentile of its regular traffic, this visualization shows an abundance of anomalous events. With a different definition of an anomaly, the display could change drastically.
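The aggregation described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the helper names (daily_totals, percentile, thresholds_by_ip) are hypothetical, the sample rows are invented, and the percentile uses linear interpolation between ranks, which may differ from whatever method the original analysis used.

```python
import math
from collections import defaultdict


def daily_totals(rows):
    """Sum flows per (date, local IP); rows are (date, l_ipn, r_asn, f) tuples."""
    totals = defaultdict(int)
    for date, l_ipn, _r_asn, f in rows:
        totals[(date, l_ipn)] += f
    return dict(totals)


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted ranks.

    The interpolation rule is an assumption; the source does not specify one.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def thresholds_by_ip(totals, q=95):
    """Per-machine percentile of the daily totals (the outer-circle value)."""
    per_ip = defaultdict(list)
    for (_date, l_ipn), total in totals.items():
        per_ip[l_ipn].append(total)
    return {l_ipn: percentile(vals, q) for l_ipn, vals in per_ip.items()}


# Invented sample rows in the dataset's column order: date, l_ipn, r_asn, f.
rows = [
    ("2006-07-01", 0, 701, 3),
    ("2006-07-01", 0, 714, 2),
    ("2006-07-02", 0, 701, 10),
    ("2006-07-01", 1, 1239, 4),
]

totals = daily_totals(rows)
thresholds = thresholds_by_ip(totals)
```

Changing q in thresholds_by_ip is one way to explore how a different definition of an anomaly would alter the display.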