Deflect Labs Monitoring

The monitoring component of Deflect Labs is an ongoing effort to make our identification and sharing of malicious web traffic more effective by using Artificial Intelligence (AI) to distinguish different types of browsing behaviour. The premise is that we can extract features from sets of web logs and use them to compare the behaviour of different IPs and detect anomalies.
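To make this premise concrete, below is a minimal sketch of the idea, assuming a toy per-IP feature table and using scikit-learn's IsolationForest as a stand-in anomaly detector; the actual Deflect Labs features and models differ.

```python
# A minimal sketch of the premise, not the actual Deflect Labs feature set
# or model: build per-IP features from web logs, then flag anomalous IPs.
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical per-IP features extracted from a batch of web logs.
features = pd.DataFrame(
    {
        "request_count": [12, 9, 540, 11],
        "mean_interval_s": [4.2, 6.1, 0.05, 5.0],
        "unique_paths": [8, 7, 2, 9],
    },
    index=["203.0.113.5", "198.51.100.7", "192.0.2.99", "203.0.113.8"],
)

# Fit an unsupervised detector and flag IPs whose behaviour looks anomalous.
detector = IsolationForest(contamination=0.25, random_state=0)
features["anomalous"] = detector.fit_predict(features) == -1
print(features[features["anomalous"]])
```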

AI Explanation

Past: BotnetDBP

The idea of grouping web logs by IP-host pair into request sets, in order to extract browsing features and predict whether the IPs are malicious or benign, originally came from the Deflect project BotnetDBP. Its documentation can be found here.

Present: Baskerville

Baskerville is a complete pipeline that implements the theory behind BotnetDBP. It receives incoming web logs as input, either from a Kafka topic, from a locally saved raw log file, or from log files stored in an Elasticsearch instance. It processes these logs in batches, forming request sets by grouping them by requested host and requesting IP. It then extracts features for these request sets and predicts whether they are malicious or benign, using a model trained offline on previously observed and labelled data. Baskerville additionally cross-references each IP with MISP to determine whether it is already known to be malicious. Finally, it saves all the data and results to a Postgres database, and it publishes processing metrics (e.g. number of logs processed, percentage predicted malicious or benign, processing speed) that can be consumed by Prometheus and visualised on a Grafana dashboard.
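The following is an illustrative sketch of that batch flow, not Baskerville's actual code: the function, column, and file names are assumptions used only to show how request sets are grouped, featurised, and scored.

```python
# Illustrative pipeline skeleton only: the function and column names here
# are assumptions, not Baskerville's actual API. It shows the stages
# described above: batch in, group into request sets, featurise, predict.
import joblib
import pandas as pd

# Model trained offline on previously observed, labelled request sets.
MODEL = joblib.load("model.joblib")

def extract_features(request_set: pd.DataFrame) -> dict:
    """Summarise one (host, IP) request set as a feature vector."""
    ts = request_set["timestamp"].sort_values()  # assumes a datetime column
    mean_interval = ts.diff().dt.total_seconds().mean()
    return {
        "request_count": len(request_set),
        "unique_paths": request_set["path"].nunique(),
        "mean_interval_s": 0.0 if pd.isna(mean_interval) else mean_interval,
    }

def process_batch(logs: pd.DataFrame) -> pd.DataFrame:
    """Process one batch of web logs (e.g. consumed from a Kafka topic)."""
    rows = []
    for (host, ip), request_set in logs.groupby(["host", "client_ip"]):
        rows.append({"host": host, "ip": ip, **extract_features(request_set)})
    result = pd.DataFrame(rows)
    # Predict malicious (1) vs benign (0) with the offline-trained model.
    feature_cols = ["request_count", "unique_paths", "mean_interval_s"]
    result["prediction"] = MODEL.predict(result[feature_cols])
    # In Baskerville itself, results are also cross-referenced with MISP
    # and written to a Postgres database; this sketch just returns them.
    return result
```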

As well as the engine that consumes and processes web logs, a set of offline analysis tools has been developed for use in conjunction with Baskerville. These tools may be accessed directly, or via two Jupyter notebooks that walk the user through the machine learning tools and the investigations tools, respectively. The machine learning notebook comprises tools for training, evaluating, and updating the model used in the Baskerville engine. The investigations notebook comprises tools for processing, analysing, and visualising past attacks for reporting purposes.
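As an illustration of the kind of offline training and evaluation workflow the machine learning notebook wraps (the real toolkit has its own interfaces; the feature columns, labels, and file names here are hypothetical):

```python
# A minimal sketch of offline model training and evaluation, assuming a
# hypothetical export of labelled request-set features. Not the actual
# Baskerville machine learning toolkit.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Previously observed request-set features with malicious/benign labels.
data = pd.read_csv("labelled_request_sets.csv")  # hypothetical export
X = data[["request_count", "unique_paths", "mean_interval_s"]]
y = data["label"]  # 1 = malicious, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Evaluate on held-out data before handing the model to the online engine.
print(classification_report(y_test, model.predict(X_test)))

# Persist the trained model for use by the online pipeline.
joblib.dump(model, "model.joblib")
```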

A brief overview of the current state of the Baskerville project is here, and the full in-depth documentation is available here.

Baskerville Schematic

Future: Deflect Labs ISAC

The proposed future of Deflect monitoring involves splitting the Baskerville engine into separate User Module and Clearinghouse components. The User Module will be run by users to extract feature vectors of browsing behaviour from batches of their incoming web logs. These feature vectors will be sent to the DL Clearinghouse, where they will be processed and stored by the DL Prediction Engine, and a prediction (with an associated degree of certainty) will be returned. The user can then take whatever mitigation action they see fit (banning, restricting access, imposing a captcha challenge…) based on the prediction. In addition, the Clearinghouse will contain a DL Analysis Center, where Deflect data scientists and technicians will work to improve the trained classifier used in the DL Prediction Engine. We will need to develop a feedback framework for iteratively improving and assessing this model. We plan to provide a web interface through which users can easily log attacks they have seen, and to initially implement the DL-ISAC with a selection of partners who are willing to provide this feedback.
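A minimal sketch of how the User Module to Clearinghouse exchange could look follows; the endpoint URL, payload schema, and field names are assumptions for illustration, not a finalised DL-ISAC protocol.

```python
# Sketch of the proposed User Module -> Clearinghouse exchange. The URL,
# payload schema, and thresholds below are illustrative assumptions only.
import requests

# Feature vector extracted locally from one request set; note that it
# carries no IP or host information.
payload = {
    "features": {
        "request_count": 540,
        "unique_paths": 2,
        "mean_interval_s": 0.05,
    }
}

response = requests.post(
    "https://clearinghouse.example.org/predict",  # hypothetical endpoint
    json=payload,
    timeout=10,
)
result = response.json()  # e.g. {"prediction": "malicious", "certainty": 0.93}

# The user decides on mitigation locally, based on the returned prediction.
if result["prediction"] == "malicious" and result["certainty"] > 0.9:
    print("Apply a challenge or ban for the offending request set")
```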

By dividing Baskerville into the log-processing User Module and the Prediction Engine described above, we enable a complete separation of personal data from the central Clearinghouse. Users process their own web logs locally and send off feature vectors (devoid of IP and host information) to receive a prediction. This allows threat-sharing without compromising data privacy. In addition, this separation will enable the adoption of the DL-ISAC by a much broader range of clients than the Deflect-hosted websites currently served. Increasing the user base of this software will also increase the amount of browsing data we are able to collect, and thus the strength of the models we are able to train.

The Analysis Center component of the DL Clearinghouse is intended as an extension of what is currently the Baskerville offline analysis toolkit. Because the feature vectors used by the DL Analysis Center contain no sensitive user IP data, this work can be completely opened up to external partners collaborating on model development. Similarly, all the results of the analysis can be kept open source.

DL-ISAC Schematic