I propose to build a tool for the visualization of computer network intrusion data.
In security-sensitive computer networks, it is not uncommon to log information about each packet traversing the network. In the event that a computer on the network is compromised ("cracked"), it is crucial that the security team quickly determine what weakness was exploited, how, and by whom. The logged packet data often contains enough information to determine the nature of the attack, but it is in a form that is difficult to make sense of, even for knowledgeable network administrators. There is a large volume of data, much of which is irrelevant.
Once the security team has determined the nature of the attack, the new "attack signature" can be quite easily incorporated into the existing Network Intrusion Detection System (NIDS). The NIDS will then be able to protect the system from this type of attack in the future.
This is a useful domain for infovis since:
- Human analysis is required
  - Automated intrusion detection systems can easily prevent known types of attacks, but new attack types must be analyzed to determine their "signatures".
- Visualization may be useful
  - Large volumes of data are available, but most is benign. The raw data is generally in the form of cryptic text log files. Detecting patterns in the data when presented in this form will be very challenging, but when presented visually patterns will be more easily discernible.
- Rapid analysis is important
  - Since the network remains vulnerable until the attack signature has been determined, it is crucial that the analysis be conducted quickly. Therefore, tools that speed the analysis process could be very useful.
I intend to use the data sets available at http://ivpr.cs.uml.edu/shootout/about.html. These data sets are typical of what might reasonably be logged in a security-conscious network. Only a few pieces of information about each packet are available, since the sheer volume of data sent across the network precludes logging everything.
The tool I intend to build could feasibly be used for real-time monitoring of a network (using a tool such as tcpdump), but it is unclear whether this is necessary or even useful. A reasonable NIDS will be able to prevent any known types of attacks, and given the rate of packets traversing the network, it seems unlikely that a person monitoring the network will be able to detect a novel form of attack in real time. For this reason, I will focus on logged data. The logged data sets above are useful since they contain a "baseline" scenario and four different "attack" scenarios.
For concreteness, below is an example of the contents of the data sets. Each data set contains several hundred thousand packets, one packet per line. (The following was copy-pasted verbatim from the web page describing the data sets, copyright The Institute for Visualization and Perception Research.)
time,src_addr,src_port,dest_addr,dest_port,flag,seq1,seq2,ack,win,buf,ulen,op
38141.504694,1,7000,2,7001,U,,,,,,148,""
38141.510076,3,20,2,3421,.,1811081902,1811082414,366784001,9216,512,,""
38141.515159,3,20,2,3421,.,512,1024,1,9216,512,,""
38141.516172,4,80,2,2609,.,,,438528422,9112,,," (DF)"
38141.516647,5,25,2,1362,F,266688477,266688477,580609140,4096,0,,""
The data items are discrete events in time, with many properties. Groups of events are closely related, and often whole sequences of events should be considered as a single unit. While each packet has a "source" and "destination", there is no intrinsic spatial arrangement.
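As a rough illustration of how lines in this format might be read into typed records, here is a minimal sketch in Python (used only for brevity). The header and rows are the sample shown above; the `Packet` class, the `parse_packets` helper, and the choice to treat addresses and ports as integers and empty fields as `None` are assumptions made for this example, not part of the data set's documentation.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional

# Header and rows copied from the sample shown in the text.
SAMPLE = '''\
time,src_addr,src_port,dest_addr,dest_port,flag,seq1,seq2,ack,win,buf,ulen,op
38141.504694,1,7000,2,7001,U,,,,,,148,""
38141.510076,3,20,2,3421,.,1811081902,1811082414,366784001,9216,512,,""
38141.515159,3,20,2,3421,.,512,1024,1,9216,512,,""
38141.516172,4,80,2,2609,.,,,438528422,9112,,," (DF)"
38141.516647,5,25,2,1362,F,266688477,266688477,580609140,4096,0,,""
'''


def _opt_int(text: str) -> Optional[int]:
    """Map an empty log field to None instead of guessing a value."""
    return int(text) if text else None


@dataclass(frozen=True)
class Packet:
    # Hypothetical record type; field names follow the header line.
    time: float
    src_addr: int
    src_port: int
    dest_addr: int
    dest_port: int
    flag: str
    seq1: Optional[int]
    seq2: Optional[int]
    ack: Optional[int]
    win: Optional[int]
    buf: Optional[int]
    ulen: Optional[int]
    op: str


def parse_packets(text: str) -> Iterator[Packet]:
    """Yield one Packet per data line; the first line must be the header."""
    for row in csv.DictReader(io.StringIO(text)):
        yield Packet(
            time=float(row["time"]),
            src_addr=int(row["src_addr"]),
            src_port=int(row["src_port"]),
            dest_addr=int(row["dest_addr"]),
            dest_port=int(row["dest_port"]),
            flag=row["flag"],
            seq1=_opt_int(row["seq1"]),
            seq2=_opt_int(row["seq2"]),
            ack=_opt_int(row["ack"]),
            win=_opt_int(row["win"]),
            buf=_opt_int(row["buf"]),
            ulen=_opt_int(row["ulen"]),
            op=row["op"],
        )


packets = list(parse_packets(SAMPLE))
print(len(packets), "packets parsed")
```

Keeping missing fields as `None` rather than zero matters here, since a window or length of 0 is a legitimate value in some rows.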
The security analyst must determine which packets caused the system to be "cracked". Once this has been determined, the identifying properties of these packets must be determined so that the "attack signature" can be described. In cases where the intruder gained access to the system, it would be useful to determine what was done. It would also be useful to determine the identity of the intruder. Further, if the phases of the attack can be identified, other areas of weakness of the system may be identified. For example, if the intruder carried out a careful information-gathering phase before launching a pinpoint attack, this could prompt an audit of the areas of the system that were used to gather information, even though these were not directly compromised by the attack.
I propose to make several iterations through the design-test-evaluate loop, since it is quite unclear (to me, at least) how exactly the final tool should operate. I have a fair amount of networking experience (I have written hardware drivers and a UDP/IP/Ethernet protocol stack, and have administered small home networks) but little experience in network security. This will be a learning experience for me. I will therefore only outline the general requirements of the tool, as I currently see them.
There is a large quantity of data and the identifying traits of the "attack" packets are not known a priori. Therefore, the tool must support filtering of packets based on possibly complex criteria. Furthermore, aggregation of similar packets will likely be required, and derived values (e.g., statistics) will be useful.
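The three requirements above (filtering, aggregation, derived values) could be prototyped along the following lines. This is only a sketch in Python, used for brevity; the helper names `select`, `group_by`, and `summarize` are hypothetical, and grouping by the source/destination address and port is just one plausible notion of "similar packets". The rows are the first six fields of the sample packets shown earlier.

```python
from collections import defaultdict, namedtuple

Packet = namedtuple("Packet", "time src_addr src_port dest_addr dest_port flag")

# First six fields of the sample rows from the data set description.
PACKETS = [
    Packet(38141.504694, 1, 7000, 2, 7001, "U"),
    Packet(38141.510076, 3, 20, 2, 3421, "."),
    Packet(38141.515159, 3, 20, 2, 3421, "."),
    Packet(38141.516172, 4, 80, 2, 2609, "."),
    Packet(38141.516647, 5, 25, 2, 1362, "F"),
]


def select(packets, *predicates):
    """Filtering: keep packets that satisfy every predicate."""
    return [p for p in packets if all(pred(p) for pred in predicates)]


def group_by(packets, key):
    """Aggregation: collect packets that share the same key value."""
    groups = defaultdict(list)
    for p in packets:
        groups[key(p)].append(p)
    return dict(groups)


def summarize(groups):
    """Derived values: packet count and time span for each group."""
    summary = {}
    for key, members in groups.items():
        times = [p.time for p in members]
        summary[key] = {
            "count": len(members),
            "first": min(times),
            "duration": max(times) - min(times),
        }
    return summary


# Example criteria: packets sent to host 2 whose flag field is not "U".
chosen = select(PACKETS, lambda p: p.dest_addr == 2, lambda p: p.flag != "U")
conversations = group_by(
    chosen, key=lambda p: (p.src_addr, p.src_port, p.dest_addr, p.dest_port)
)
stats = summarize(conversations)
for key, info in stats.items():
    print(key, info["count"], round(info["duration"], 6))
```

Passing predicates as ordinary functions keeps arbitrarily complex criteria possible without committing to a particular query syntax.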
Since the volume of data is fairly large, and since complex filtering criteria may be necessary, it may be useful to use a backend database system to store the data sets. This should allow for rapid filtering and retrieval of data, and a database query language (such as SQL) could be used. The users of this tool are expected to be very competent computer users, so they will likely be comfortable with, and may even demand, the sophisticated search abilities of a "real" query language, rather than a GUI-moderated, limited search engine.
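To give a feel for what such a query might look like, here is a small sketch that loads a few packets into a database and runs a filtered, grouped SQL query. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for whatever backend is eventually chosen; the table name `packets`, the index, and the `load_packets` helper are assumptions for this example. The rows are the first six fields of the sample packets from the data set description.

```python
import sqlite3

# First six fields of the sample rows from the data set description.
ROWS = [
    (38141.504694, 1, 7000, 2, 7001, "U"),
    (38141.510076, 3, 20, 2, 3421, "."),
    (38141.515159, 3, 20, 2, 3421, "."),
    (38141.516172, 4, 80, 2, 2609, "."),
    (38141.516647, 5, 25, 2, 1362, "F"),
]


def load_packets(rows):
    """Create an in-memory database holding one row per packet."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE packets (
               time REAL, src_addr INTEGER, src_port INTEGER,
               dest_addr INTEGER, dest_port INTEGER, flag TEXT)"""
    )
    conn.executemany("INSERT INTO packets VALUES (?, ?, ?, ?, ?, ?)", rows)
    # An index on time should help range queries over large logs.
    conn.execute("CREATE INDEX idx_packets_time ON packets(time)")
    return conn


# Packets per (source, destination port) sent to one host in a time window.
QUERY = """
    SELECT src_addr, dest_port, COUNT(*) AS n, MIN(time), MAX(time)
    FROM packets
    WHERE dest_addr = ? AND time BETWEEN ? AND ?
    GROUP BY src_addr, dest_port
    ORDER BY n DESC, src_addr
"""

conn = load_packets(ROWS)
result = conn.execute(QUERY, (2, 38141.5, 38141.6)).fetchall()
for row in result:
    print(row)
```

An analyst comfortable with SQL could type the `SELECT` statement directly, while the tool supplies only the connection and the rendering of the result.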
The "baseline" network activity data will be useful in determining how the "attack" scenario differs from standard operating conditions. Therefore, the ability to show several data sets simultaneously would be beneficial.
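One simple way to contrast an "attack" data set with the "baseline", before any visual comparison, is to compare how traffic is distributed over destination ports in each. The sketch below (Python, for brevity) does this; the function names, the `floor` parameter used for ports never seen in the baseline, and the tiny port lists are all made up for illustration and are not taken from the real data sets.

```python
from collections import Counter


def port_profile(dest_ports):
    """Return each destination port's share of the total packet count."""
    counts = Counter(dest_ports)
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}


def compare_profiles(baseline_ports, attack_ports, floor=0.001):
    """Rank ports by how much larger their share is in the attack data.

    Ports absent from the baseline are given the small share `floor`
    (an arbitrary assumption) so that the ratio stays finite.
    """
    base = port_profile(baseline_ports)
    attack = port_profile(attack_ports)
    ranked = []
    for port, share in attack.items():
        base_share = base.get(port, floor)
        ranked.append((port, base_share, share, share / base_share))
    return sorted(ranked, key=lambda item: item[3], reverse=True)


# Made-up destination ports, for illustration only.
baseline = [80] * 8 + [25] * 2
attack = [80] * 4 + [25] * 1 + [23] * 5

ranking = compare_profiles(baseline, attack)
for port, base_share, attack_share, ratio in ranking:
    print(port, base_share, attack_share, round(ratio, 2))
```

In this toy input, port 23 rises to the top because it accounts for half of the "attack" traffic but never appears in the "baseline"; a side-by-side display of the two data sets would aim to make that kind of difference visible at a glance.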
Since there are many properties of packets (and groups of packets) that the analyst may be interested in, it will likely be necessary to provide several types of visualisations. The data fields are primarily nominal and ordinal, and each is timestamped. Scatterplots and time series plots may be the most useful visuals, but a spatial layout of network nodes and the connections between them may be useful. Since packets are in essence discrete events in time, animations may be useful.
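A time series plot of discrete packet events usually needs the events binned into fixed intervals first. A minimal sketch of that step, in Python for brevity, is shown here; the `bin_counts` helper and the 10 ms bin width are arbitrary choices for illustration, and the timestamps are those of the five sample packets in the data set description.

```python
def bin_counts(times, width):
    """Count events per fixed-width interval, starting at the earliest event."""
    if not times:
        return []
    start = min(times)
    n_bins = int((max(times) - start) / width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int((t - start) / width)] += 1
    return counts


# Timestamps of the five sample packets from the data set description.
TIMES = [38141.504694, 38141.510076, 38141.515159, 38141.516172, 38141.516647]

counts = bin_counts(TIMES, width=0.01)
for i, n in enumerate(counts):
    # Crude text rendering; a real tool would draw this graphically.
    print(f"bin {i}: {'#' * n}")
```

The same binned counts could drive an animation by stepping through the bins in order.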
Given the variety of visualisations that the analyst may wish to use, the tool must be flexible and provide a simple way to choose among the various visualisation types.
For simplicity and for the ability to leverage existing toolkits, I will probably create my tool in Java. The Jazz/Piccolo zoomable interface toolkits may be useful. Good database access libraries exist, and open-source databases such as MySQL are available. I see little reason to use 3-D graphics rather than 2-D, so the builtin Java graphics libraries (and other 2-D graphics libraries) will likely suffice for my needs.
I can take some inspiration from MovieFinder and HomeFinder; both of these systems allow the user to selectively view different attributes of discrete objects (movies and houses, respectively).
I may be able to leverage work by others in the visualisation of networks. This domain is different from the often-attempted task of visualising the Internet; the topology of the network is far less important than the nature of the packets on it. Since a network packet sniffer can only "see" packets on the local network segment, the raw packet data has no hierarchical or graph arrangement. Most of the work that treats the Internet as a large graph to be visualised will likely be of little use for this project.
Several visual tools are available for performing somewhat similar tasks, however. "VisuSniff: A Tool For The Visualization of Network Traffic" may be useful, though I find its visualisation rather uninspiring. The jpcap project is more oriented toward real-time network sniffing, though its graphic toolkits may be useful and its visualisations are more appealing. Finally, the NIUNet: "Now I Understand Networking!" project, intended to aid in "Understanding Networking Principles through Visualization, Simulation, Emulation, and Application", has toolkits (the Java Network Simulator (JNS) and Java Visual Animator (Javis)) which may be helpful.
This tool could potentially be very sophisticated. In order to make it tractable in the time available, I intend to take a "layered" approach. By beginning with very basic database filtering and one or two visualisation styles, I will be able to refine the visualisation (perhaps even using informal user studies), and add more tools and more sophisticated interaction styles as time permits, using user feedback as a guide.