The following thesis assignment scopes represent the research areas of our team. These areas specify the background of the final thesis assignment (bachelor’s, master’s, dissertation). We always choose the scope and particular focus of the final thesis individually for the mutual benefit of students and us, to let the students enjoy their work. Therefore, please contact us if you are interested in any of the areas, and we will create a “tailor-made” assignment.

If you are interested in cooperation on any of the topics, contact: “Ing. Tomáš Čejka, Ph.D.” cejkato2@fit.cvut.cz

1. Encrypted network traffic analysis and detection

Encrypted traffic nowadays represents the majority of network traffic, which significantly improves security and privacy for users. On the other hand, attackers and malware authors can also exploit the benefits of private communication to deliberately avoid detection. Thus, the analysis of encrypted communication poses a research challenge for security systems based on network traffic monitoring because it is no longer possible to extract and analyze the content of network packets directly. This topic area focuses on analyzing the characteristics of packets and communication flows to estimate the content of the communication (the type of activity), or at least to infer the communication protocol used. Both are useful for detecting security threats and maintaining situational awareness for infrastructure administrators/operators or security teams.

We already have a lot of experience in the area of encrypted network analysis, especially with the application of machine learning methods. Our research can serve as a possible foundation for further thesis work and research around various protocols (e.g., TLS, SSH, HTTPS, OpenVPN, WireGuard, Tor).

2. Exploration of the characteristics of network traffic

Data representing network traffic are the basis of any network security analysis and network threat detection. The network traffic data contain characteristics/statistics (features), which serve as input to classification and detection algorithms. Currently, IP flow data (specifically in IPFIX or NetFlow format) are successfully used for network traffic monitoring. They represent aggregated information about the observed communication, i.e., “when who communicated with whom and how much”. While applying machine learning methods to classification and detection tasks, we learned that enriching these flow data with additional packet-level information is advantageous. Examples include various aggregated statistics of packet lengths and inter-packet time gaps. This thesis assignment area specifically targets research and experimentation around appropriate augmentation information for IP flow data that will improve the capabilities of classifying the likely content of the communication (application) and detecting security threats. An important aspect, however, is the ease of obtaining this new information, so that it can be computed even on high-speed networks (>= 200 Gb/s).
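As an illustration of the kind of packet-level enrichment described above, the following minimal sketch computes a few aggregated statistics from packet lengths and inter-packet gaps of a single flow. The function name `packet_level_features` and the selected statistics are assumptions made for this example; they are not the feature set used by our team.

```python
from statistics import mean, pstdev


def packet_level_features(timestamps, lengths):
    """Aggregate per-flow statistics from packet timestamps (seconds)
    and packet lengths (bytes). Hypothetical helper for illustration."""
    if len(timestamps) != len(lengths) or not lengths:
        raise ValueError("need equally long, non-empty packet lists")
    # Inter-packet gaps are the differences between consecutive timestamps.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        "packets": len(lengths),
        "bytes": sum(lengths),
        "len_min": min(lengths),
        "len_max": max(lengths),
        "len_mean": mean(lengths),
        "len_std": pstdev(lengths),
        # A single-packet flow has no gaps, so fall back to 0.0.
        "gap_mean": mean(gaps) if gaps else 0.0,
        "gap_max": max(gaps) if gaps else 0.0,
    }


features = packet_level_features([0.0, 0.5, 1.5], [100, 200, 300])
```

All of these statistics can be updated incrementally per packet (running sums, minima, and maxima), which is what makes them candidates for computation on high-speed links.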

Recently, we have been able to find and use a set of unique statistics to improve the accuracy of machine learning models over network data. These experiences can be built upon for follow-up thesis topics.

3. Automatic analysis of security events

Monitoring and detection systems produce a large number of events every minute. Each of these events can present a more or less significant risk of a security threat. Due to the amount of received data, it is usually necessary to process the detected events automatically. Therefore, so-called SIEM (Security Information and Event Management) systems are being deployed in both academic and commercial environments. A SIEM specializes in examining the correlation between events to detect suspicious clusters of activity that may reveal, for example, advanced persistent threats (APTs). In addition to this type of processing, event data have also been successfully used to predict events and/or the behavior of network entities. For example, for sources of network attacks, it is possible to estimate the probability of future malicious activity and model their reputation. However, information about the long-term behavior of network entities is also worth monitoring for legitimate devices. This normal behavior can be learned automatically and later used for anomaly detection.
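To make the idea of modeling the reputation of an attack source more concrete, here is a minimal sketch under simple assumptions: each past alert contributes a weight that decays exponentially with age, and the summed weight is squashed into the interval [0, 1). The function name `reputation_score`, the half-life parameter, and the scoring formula are all hypothetical choices for illustration, not the model used in our research.

```python
import math


def reputation_score(alert_times, now, half_life=86400.0):
    """Return a score in [0, 1) for one network entity.

    alert_times: timestamps (seconds) of past alerts for the entity.
    half_life:   assumed time after which an alert counts half as much.
    """
    decay = math.log(2) / half_life
    # Older alerts contribute less; alerts from the future are ignored.
    weight = sum(math.exp(-decay * (now - t)) for t in alert_times if t <= now)
    # Squash the unbounded weight so that more alerts approach 1.0.
    return 1.0 - math.exp(-weight)


fresh = reputation_score([100.0], now=100.0)
stale = reputation_score([100.0], now=100.0 + 86400.0)
```

In this sketch, `fresh` is higher than `stale` because the same alert is one half-life older in the second call. A real model would likely also weigh alert type and severity.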

This thesis assignment area focuses on processing security reports/alerts and deriving “labels” of various types for network entities. By classifying and inferring new information about network entities, it is then possible to better target defense mechanisms in the network infrastructure.

4. Automatic capture of network traffic to create high-quality datasets

High-quality, up-to-date, and public datasets are needed both to improve classification algorithms and to compare results between models reliably, regardless of model type (decision trees, neural networks, or other statistical methods). Creating such datasets from network communication is, therefore, an essential task. This topic area focuses on automating the creation of datasets by capturing and annotating real network traffic. For example, DPI (deep packet inspection) or information from other sources (end devices and their logs, OSINT information databases, scanners, etc.) can be used to annotate the traffic.

We have had to create a dataset for every research/development activity focused on network traffic analysis and security threat detection. Therefore, we already have a set of tools and an emerging infrastructure for traffic capture. A final thesis based on this topic area can thus improve and extend the capability to produce high-quality, large-scale annotated datasets.

5. Network communication datasets and their evaluation

The question of whether a dataset is suitable for a particular classification or detection problem always arises, regardless of the data's origin. It applies both to publicly available data published by other researchers and to datasets created locally. Due to the large amount of real network traffic, it is relatively easy to create a large dataset. However, at the moment, there are no sufficient methods to evaluate the quality of a dataset, whether the amount of data is sufficient, or, conversely, the degree of unnecessary redundancy that causes excessive storage requirements.

This thesis topic area aims to find ways to evaluate datasets and their quality by analyzing the structure and different properties of the dataset or by searching for specific deficiencies. Addressing these problems can significantly help to automate data analysis, which is currently done mainly by hand. An automated system could predict the appropriate time to retrain models, detect dataset obsolescence, compare datasets from different sources, and merge datasets in some “smart” way to obtain better training data for machine learning models.
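Two very simple indicators of the dataset properties discussed above can be sketched as follows: the share of exact duplicate samples (a crude proxy for redundancy) and the normalized entropy of the label distribution (a crude proxy for class balance). The function name `dataset_summary` and both metrics are assumptions chosen for illustration; finding metrics that are actually sufficient is the open research question of this topic.

```python
import math
from collections import Counter


def dataset_summary(samples):
    """samples: list of (feature_tuple, label) pairs; hypothetical helper."""
    if not samples:
        raise ValueError("dataset is empty")
    total = len(samples)
    # Exact duplicates add storage cost without adding information.
    duplicate_ratio = 1 - len(set(samples)) / total
    counts = Counter(label for _, label in samples)
    if len(counts) < 2:
        label_balance = 0.0  # a single class carries no label information
    else:
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        # Divide by the maximum entropy so 1.0 means perfectly balanced.
        label_balance = entropy / math.log2(len(counts))
    return {"duplicate_ratio": duplicate_ratio, "label_balance": label_balance}


summary = dataset_summary(
    [((1, 2), "a"), ((1, 2), "a"), ((3, 4), "b"), ((5, 6), "b")]
)
```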

6. Hardware-accelerated network traffic monitoring

Processing network traffic at high speeds (400Gb/s) is one of the exciting challenges of computer engineering. The systems must operate at high clock frequencies to process all packets without loss. Sometimes, it is even necessary to process more than one packet in a single clock cycle. This area is crucial because the so-called monitoring probes that have to process packets on such fast links are essential data sources for security (detection) systems. A malfunctioning probe can compromise the whole security infrastructure, since the infrastructure can then no longer detect all threats.

However, monitoring probes are not just hardware and firmware. Currently, the most appropriate approach is proving to be the so-called hardware-software co-design: a carefully designed and developed interplay between the hardware card and software threads running in parallel, which work together to make optimal use of system resources. This assignment area builds on our team’s long experience and collaboration with the CESNET research department. The partnership gives access to unique technical resources and rich experience in developing high-speed network traffic processing applications. The aim is to participate in the research/development of monitoring probes for 400Gb/s, for example, packet parsing, flow caching, and the computation of characteristics and statistics that enrich traditional IP flow data.
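The flow caching step mentioned above can be illustrated conceptually in a few lines. This is only a sketch of the principle: real probes implement the cache in hardware or heavily optimized native code, and the class names, the 5-tuple key layout, and the 30-second inactive timeout used here are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    first_seen: float
    last_seen: float
    packets: int = 0
    bytes: int = 0


class FlowCache:
    """Aggregates packets into flow records keyed by a 5-tuple."""

    def __init__(self, inactive_timeout=30.0):
        self.inactive_timeout = inactive_timeout
        self.flows = {}  # (src_ip, dst_ip, src_port, dst_port, proto) -> FlowRecord

    def update(self, key, timestamp, length):
        record = self.flows.get(key)
        if record is None:
            # First packet of a new flow creates the record.
            record = self.flows[key] = FlowRecord(timestamp, timestamp)
        record.last_seen = timestamp
        record.packets += 1
        record.bytes += length

    def expire(self, now):
        """Remove and return flows that have been idle longer than the timeout."""
        expired = [
            (key, record)
            for key, record in self.flows.items()
            if now - record.last_seen > self.inactive_timeout
        ]
        for key, _ in expired:
            del self.flows[key]
        return expired


cache = FlowCache()
key = ("10.0.0.1", "10.0.0.2", 12345, 443, "TCP")
cache.update(key, timestamp=0.0, length=100)
cache.update(key, timestamp=1.0, length=50)
exported = cache.expire(now=100.0)
```

The records returned by `expire` correspond to what a probe would export as IP flow data; the enriching statistics would be accumulated in the same record during `update`.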

7. Network traffic classification using time series analysis

The basis of security analysis or detection of network threats is data representing network traffic. Together with time, these data can be represented as a time series whose behavior can be described by a set of characteristics/features, which subsequently serve as input to classification and detection algorithms. Currently, IP flow data (specifically in IPFIX or NetFlow format) are successfully used to monitor network traffic; they represent information about the communication that took place, i.e., “who communicated when with whom and how much”. The analysis of time series from IP flow data for detection and classification represents a research challenge for security systems based on network traffic monitoring, as it is no longer possible to directly extract and analyze the content of network packets to the original extent. This topic focuses on research and experiments in the analysis of time series created from network traffic, with the aim of estimating what kind of communication the time series represents (activity, protocol, device type). This is beneficial for detecting security threats and maintaining situational awareness of infrastructure managers/operators or security teams.

An example of significant time series behavior is periodicity, which has already proven suitable for classifying network traffic (applications, services, and operating systems). Periodicity detection can thus be used in further experiments and research related to final thesis topics. Another important research area within this topic is the detection of anomalous behavior, which is currently in great demand and widely used by monitoring systems. Anomaly detection is especially relevant for detecting DDoS attacks or suspicious behavior of a compromised device in the network.
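One common way to look for periodicity in a time series is the autocorrelation function. The sketch below, written under the assumption that the series holds per-interval counters such as bytes per second, returns the lag with the highest autocorrelation above a threshold. The function names and the threshold of 0.5 are hypothetical, and this is not necessarily the periodicity detection method used in our work.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mu = sum(series) / n
    variance = sum((x - mu) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series has no periodic structure
    covariance = sum(
        (series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag=None, threshold=0.5):
    """Return the lag with the strongest autocorrelation above the
    threshold, or None if no lag qualifies."""
    max_lag = max_lag or len(series) // 2
    best_lag, best_value = None, threshold
    for lag in range(1, max_lag + 1):
        value = autocorrelation(series, lag)
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# A burst every fifth interval, e.g. a device polling a server.
bytes_per_second = [10, 0, 0, 0, 0] * 8
period = dominant_period(bytes_per_second)  # 5
```

The detected period, or simply whether one exists, can then serve as one feature among others for a classifier.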