To support the investigation, the project calls for the application of several state-of-the-art methodological solutions addressing the issues raised by the activities planned in the project: data collection, labeling, and augmentation; data representation; enhanced traffic classification and prediction; explainable AI; and incremental learning.
In the following, the main activities of the XInternet project are listed, reporting the state-of-the-art methodological solutions the RUs will cooperate to investigate, implement, and evaluate.
Data Collection, Labeling, and Augmentation
Data availability is a main requirement for data-hungry ML/DL methods. More generally, provisioning accurately labeled datasets and carefully documenting the workflows used to obtain them are critical to foster replicability and reproducibility, respectively, with both fueling research dissemination. Since data collection and augmentation may be time- and resource-consuming, we aim to drive these activities with the knowledge derived from other phases of the project: we plan to leverage results on the obsolescence of model effectiveness to conveniently instrument the collection and labeling process. In particular, the XInternet project intends to address this need by considering two scenarios: (1) the collection of benign traffic generated by mobile apps, leveraging the MIRAGE architecture [Aceto2019], and (2) the collection of malicious traffic via a state-of-the-art honeypot infrastructure [dpipot, Rescio2021].
MIRAGE is a reproducible architecture for capturing mobile-app traffic and building the related accurate ground truth. It allows the capture and labeling of traffic generated by human experimenters running the required apps in a controlled environment. DPIPot is a flexible infrastructure for spotting the origin of anomalous traffic by shedding light on new exploits being abused in the wild. It consists of four /24 networks that can be configured as darknets or as vertical and smart honeypots. As opposed to darknets, honeypots are active monitoring probes that record traffic and generate rich sets of unstructured logs while also interacting with possible attackers. In addition to these data sources, data augmentation strategies will also be investigated, such as GAN-based augmentation, adversarial training, and meta-learning.
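As a concrete illustration of how labeled flows could be multiplied before training, the sketch below applies a much simpler augmentation than the GAN-based strategies mentioned above: random jittering of packet sizes and inter-arrival times. All names, jitter magnitudes, and the flow encoding are illustrative assumptions, not part of the project's actual pipeline.

```python
import random

def augment_flow(pkt_sizes, iats, size_jitter=0.05, iat_jitter=0.10, seed=None):
    """Create a synthetic variant of a labeled flow by jittering packet
    sizes (within +/- size_jitter) and inter-arrival times (within
    +/- iat_jitter). A lightweight stand-in for GAN-based augmentation."""
    rng = random.Random(seed)
    new_sizes = [max(1, round(s * (1 + rng.uniform(-size_jitter, size_jitter))))
                 for s in pkt_sizes]
    new_iats = [max(0.0, t * (1 + rng.uniform(-iat_jitter, iat_jitter)))
                for t in iats]
    return new_sizes, new_iats

# Example: turn one labeled flow into three synthetic variants
# (the flow values below are made up for illustration).
flow_sizes = [60, 1500, 1500, 52]
flow_iats = [0.0, 0.012, 0.001, 0.030]
variants = [augment_flow(flow_sizes, flow_iats, seed=i) for i in range(3)]
```

Each variant keeps the original label, so a small MIRAGE- or DPIPot-derived dataset can be expanded without extra capture sessions.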
Enhanced traffic characterization/classification/prediction
Advanced methods for traffic analysis (i.e., characterization, classification, and prediction) are paramount for gaining the full visibility required for both informed network management and security. Unfortunately, the evolution of the nature of traffic (e.g., the use of common application protocols as transport sublayers, encryption, dynamic ports, and network address translation) has progressively reduced the effectiveness of the traffic-analysis techniques proposed in the past. Moreover, to respond to the real-time needs dictated by rapidly changing traffic and network conditions, the strategies and information elements on which these techniques are based should adapt rapidly, reducing their dependence on human intervention.
XInternet plans to design and explore innovative DL-based traffic analysis tools capable of bringing out the distinctive “fingerprint” of network traffic starting from the observed “raw” traffic (i.e., reducing the presence of human experts in the loop). In particular, through the joint use of hybrid DL architectures based on “multi-modal” and “multi-task” learning, the project intends to improve on previous techniques, which only partially capitalize on the information derived from the analyzed traffic and are limited to a single inference (viz. visibility) task, while also being orthogonal to different granularities of packet aggregation [Aceto2020]. XInternet intends to leverage parameters that allow traffic analysis techniques to be both robust against encryption mechanisms and “privacy-preserving” (i.e., not tracing sensitive user information).
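To make the multi-modal/multi-task idea concrete, the sketch below shows the shape of such an architecture as a bare NumPy forward pass: two modality-specific encoders (payload bytes and packet-level sequence features) are fused into a shared representation that feeds two task heads (app classification and a traffic-prediction regressor). All dimensions and weights are illustrative placeholders, not the project's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: 784 payload bytes, 20 packet-level features.
PAYLOAD_DIM, SEQ_DIM, HIDDEN, N_APPS, N_PRED = 784, 20, 64, 10, 1

# Modality-specific encoders (random weights stand in for trained ones).
W_payload = rng.normal(0, 0.1, (PAYLOAD_DIM, HIDDEN))
W_seq = rng.normal(0, 0.1, (SEQ_DIM, HIDDEN))
# Shared layer fusing the two modalities ("multi-modal").
W_shared = rng.normal(0, 0.1, (2 * HIDDEN, HIDDEN))
# Task-specific heads trained jointly ("multi-task").
W_cls = rng.normal(0, 0.1, (HIDDEN, N_APPS))
W_reg = rng.normal(0, 0.1, (HIDDEN, N_PRED))

def forward(payload_bytes, seq_feats):
    h_payload = relu(payload_bytes @ W_payload)    # payload modality
    h_seq = relu(seq_feats @ W_seq)                # sequence modality
    shared = relu(np.concatenate([h_payload, h_seq], axis=-1) @ W_shared)
    return softmax(shared @ W_cls), shared @ W_reg

batch = 4
app_probs, traffic_pred = forward(rng.random((batch, PAYLOAD_DIM)),
                                  rng.random((batch, SEQ_DIM)))
```

The key point is structural: both heads share the fused representation, so each training signal regularizes the other, which is what lets such models exploit more of the available traffic information than single-task, single-input designs.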
Data Representation
Feature engineering and data representation constitute a critical process, also impacting the characteristics and the effectiveness of the resulting model. Moreover, these activities may result in a painstaking process requiring specialized domain knowledge. However, even with expert domain knowledge, feature exploration and engineering remain largely an imperfect process: the choice of features and how to represent them can greatly affect model accuracy, and manual extraction may omit features involving complex (e.g., non-linear) relationships. This situation is further exacerbated by the nature of Internet data, as traffic is subject to constant change (“concept drift,” in machine-learning terms), which obsoletes the validity of both models and handcrafted features [Holland2021].
In the XInternet project, we plan to explore Internet-data representation, aiming to identify a suitable representation of network data that tackles the most representative issues, such as the identification of unbiased inputs for DL (e.g., by exploring bit-wise, byte-wise, and word-wise representations of the payload), the identification of expert-driven inputs guided by XAI techniques, and the exploration of privacy-preserving representations. The effectiveness of spatio-temporal embeddings will also be investigated.
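As a minimal sketch of one of the representations mentioned above, the function below builds a byte-wise input: the first n payload bytes scaled to [0, 1] and zero-padded to a fixed length. The length n=784 is only an illustrative choice, not a project parameter.

```python
def payload_to_input(payload: bytes, n: int = 784) -> list[float]:
    """Byte-wise representation of a packet payload: take the first n
    bytes, scale each to [0, 1], and zero-pad (or truncate) to length n
    so every packet yields a fixed-size DL input vector."""
    window = payload[:n]
    vec = [b / 255.0 for b in window]
    vec += [0.0] * (n - len(vec))
    return vec

# Example: a 1000-byte payload is truncated, a 1-byte payload is padded.
long_vec = payload_to_input(b"\x00\xff" * 500)
short_vec = payload_to_input(b"\x80", n=8)
```

Bit-wise and word-wise variants would follow the same pattern, changing only the unit into which the payload is sliced before normalization.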
Incremental Learning
Traffic traversing today’s networks is a constantly moving target, and the relation between the inputs and outcomes of traffic analysis tools drifts with the rapid evolution of network traffic. Effectively planning the updates of ML/DL models is of utmost importance to keep traffic analysis tools accurate in the face of the dynamicity of mobile and anomalous traffic: models must acquire new knowledge while avoiding a performance drop on previously learned information (i.e., catastrophic forgetting). Moreover, to complete the update of traffic analysis models on time, total retraining is infeasible, and techniques based on a constant increment of knowledge should be applied instead.
XInternet aims to attain automated model updates by exploiting incremental learning, which develops artificially intelligent systems that continuously learn to address new tasks from new data while preserving the knowledge learned from previous tasks. After each training session, the learner should be capable of performing all previously seen tasks on unseen data, integrating the new knowledge without forgetting the old [Masana2020]. To cope with the shortage of network-traffic data related to new applications or network attacks, in XInternet we plan to integrate class-incremental learning with few-shot learning approaches, starting from state-of-the-art proposals in well-established fields (e.g., image classification and natural language processing).
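One common class-incremental ingredient for mitigating catastrophic forgetting is exemplar rehearsal: keep a small per-class memory of old samples and replay it alongside each new task's data. The sketch below illustrates only this buffering logic (the class names, buffer size, and sampling policy are illustrative assumptions, and the actual model training is omitted).

```python
import random

class RehearsalBuffer:
    """Minimal exemplar-rehearsal sketch for class-incremental learning:
    a bounded per-class memory of old samples is replayed with each new
    task so the model keeps revisiting previously learned classes."""

    def __init__(self, per_class=20, seed=0):
        self.per_class = per_class
        self.memory = {}              # class label -> stored exemplars
        self.rng = random.Random(seed)

    def add_task(self, task_data):
        """After training on a task, store a few exemplars per new class.
        task_data maps each new class label to its list of samples."""
        for label, samples in task_data.items():
            self.memory[label] = self.rng.sample(
                samples, min(self.per_class, len(samples)))

    def training_set(self, new_task_data):
        """New-task samples plus replayed exemplars of all old classes."""
        batch = [(lbl, s) for lbl, xs in new_task_data.items() for s in xs]
        for lbl, xs in self.memory.items():
            if lbl not in new_task_data:
                batch.extend((lbl, s) for s in xs)
        self.rng.shuffle(batch)
        return batch

# Example: learn class "app_A" first, then receive class "app_B".
buf = RehearsalBuffer(per_class=5, seed=0)
buf.add_task({"app_A": list(range(50))})
mixed_batch = buf.training_set({"app_B": list(range(10))})
```

The few-shot integration the project envisions would plug in at `training_set`, where the scarce samples of a new traffic class are combined with the replayed exemplars.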
Explainable AI
Artificial intelligence (AI) techniques have proven to deliver high-quality solutions and thus have increasingly been adopted to tackle a number of problems in the networking domain. However, these solutions are highly intricate and opaque to human cognition, and this lack of interpretability strongly hinders the commercial success of AI-based solutions in practice. Accordingly, networking researchers are starting to explore Explainable AI (XAI) techniques to make AI models interpretable, manageable, and trustworthy [Zhang2022]. More specifically, the concept of explainability sits at the intersection of several areas of active research in AI, including (1) transparency (AI decisions must be explained in terms, formats, and languages humans can understand); (2) causality (the model learned from data should provide humans not only with correct inferences but also with some explanation of the underlying phenomena); (3) bias (the model must not learn a biased view of the world due to shortcomings of the data or objective functions); (4) fairness (decisions based on AI systems must be fair); and (5) safety (humans must gain confidence in the reliability of AI systems even without an explanation of how they reach conclusions) [Hagras2018].
XAI techniques will be investigated to cope with the limitations of AI solutions. Post-hoc explanation techniques (e.g., Deep SHAP and Integrated Gradients) will be used to interpret the behavior of state-of-the-art DL models for classification, detection, and prediction. Additionally, we plan to devise models explainable by design, so as to directly provide explanations to the stakeholders involved. Complementarily, XInternet aims to ensure the trustworthiness of traffic analysis for its “safe” usage by end-users, namely by evaluating to what extent the confidence associated with a given decision by an opaque solution can be deemed reliable.
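To ground the post-hoc attribution idea, the sketch below approximates Integrated Gradients with a Riemann sum: the attribution of feature i is (x_i - baseline_i) times the average gradient along the straight path from baseline to input. The toy linear model and its analytic gradient are illustrative assumptions; on real DL models the gradient would come from the framework's autodiff.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Riemann-sum approximation of Integrated Gradients:
    attribution_i = (x_i - b_i) * mean over alpha in (0, 1] of
    df/dx_i evaluated at b + alpha * (x - b). By the completeness
    property, the attributions sum to f(x) - f(baseline)."""
    acc = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            acc[i] += g[i]
    return [(xi - b) * a / steps for xi, b, a in zip(x, baseline, acc)]

# Toy model: f(x) = w . x, whose gradient is constant and equal to w,
# so the attributions reduce exactly to w_i * (x_i - baseline_i).
w = [2.0, -1.0, 0.5]
f = lambda x: sum(wi * xi for wi, xi in zip(w, x))
grad_f = lambda x: w
attr = integrated_gradients(grad_f, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

For traffic models, x would be an input such as a byte-wise payload vector, and high-magnitude attributions would point to the bytes or features driving a classification decision.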
References:
- [Aceto2019] Aceto et al. MIRAGE: Mobile-app traffic capture and ground-truth creation. IEEE ICCCS, 2019.
- [dpipot] dpipot, https://github.com/SmartData-Polito/dpipot, 2022.
- [Rescio2021] Rescio et al. DPI Solutions in Practice: Benchmark and Comparison. IEEE SPW, 2021.
- [Aceto2020] Aceto et al., Toward effective mobile encrypted traffic classification through deep learning, Neurocomputing, 409, 2020.
- [Holland2021] Holland et al. New directions in automated traffic analysis. ACM SIGSAC CCS, 2021.
- [Masana2020] Masana et al. Class-incremental learning: survey and performance evaluation on image classification, arXiv, 2020.
- [Zhang2022] Zhang et al. Interpreting AI for Networking: Where We Are and Where We Are Going. IEEE Communications Magazine 60 (2), 2022.
- [Hagras2018] Hagras. Toward human-understandable, explainable AI. Computer, 2018.