Deflect Labs Monitoring

The monitoring component of Deflect Labs is an ongoing effort to make our identification and sharing of malicious web traffic more effective by using Artificial Intelligence (AI) to distinguish different types of browsing behaviour. The premise is that we can extract features from the set of web logs and use these features to compare different IPs' behaviour and detect anomalies.

Today: Monitoring & Mitigation

At its core, the Deflect network is capable of logging information about any and all aspects of web traffic destined for our clients' websites (this includes traffic over SSL). This means that for each visitor accessing the Deflect network it is possible to record or otherwise ascertain the following (a sketch of such a record follows the list):

  • Site accessed
  • Browser user agent
  • Deflect server queried
  • Time of request
  • Response code to the request
  • Cache status of the request
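
To make the shape of this data concrete, here is a minimal sketch of one such log entry modelled as a Python dataclass. The field names are assumptions chosen to mirror the list above, not the exact schema used in production.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogRecord:
    """Illustrative shape of a single Deflect access-log entry."""
    host: str            # site accessed
    user_agent: str      # browser user agent
    edge: str            # Deflect server (edge) that answered the request
    timestamp: datetime  # time of the request
    status: int          # HTTP response code
    cache_status: str    # e.g. "HIT" or "MISS"
    client_ip: str       # requesting IP, used for per-IP aggregation

record = LogRecord(
    host="example.org",
    user_agent="Mozilla/5.0",
    edge="edge-tor1",
    timestamp=datetime(2020, 1, 1, 12, 0, 0),
    status=200,
    cache_status="HIT",
    client_ip="203.0.113.7",
)
```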

The key elements of monitoring and mitigation technology are:

  • Deflect itself, by its nature a rich target for DDoS attacks.
  • Banjax is responsible for early-stage filtering, challenging and banning of bots identified via regular expression (regex) matching, in conjunction with the Swabber module (an illustrative rule is sketched after this list).
  • Swabber is a daemon for banning and unbanning IP addresses.
  • Edgemanage selects and rotates Deflect edges.
  • Opsdash is an Elasticsearch cluster where the majority of collected data is stored and queried.
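
By way of illustration, the sketch below shows the kind of regex-based rule matching Banjax performs before handing an offending IP over to a banning daemon. The rule format, threshold, and function names are hypothetical; this is not Banjax's actual configuration or code.

```python
import re
from typing import Optional

# Illustrative only: a simplified rule in the spirit of Banjax's
# early-stage filtering.
RULES = [
    {
        "name": "wp-login flood",
        "pattern": re.compile(r"POST /wp-login\.php"),
        "threshold": 20,   # hits by one IP before acting
        "action": "ban",   # would be handed to a Swabber-like banning daemon
    },
]

def check_request(ip: str, request_line: str, hit_counts: dict) -> Optional[str]:
    """Return an action ("ban", "challenge", ...) if a rule is triggered."""
    for rule in RULES:
        if rule["pattern"].search(request_line):
            key = (ip, rule["name"])
            hit_counts[key] = hit_counts.get(key, 0) + 1
            if hit_counts[key] >= rule["threshold"]:
                return rule["action"]
    return None
```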

We use these tools to gather, store and analyze information for attack diagnostics and user-facing statistics, as well as to study series of attacks and historical behaviours. What we can observe when analyzing bots through the components of BotnetDBP (the toolkit outlined above) and open third-party resources includes:

  • The geographic location of the bot (GeoIP database lookups): This information can help determine whether the bot is part of a malware-based botnet or a voluntary botnet using tools such as LOIC or other packaged denial-of-service tools used in participatory DDoS attacks. Bots' proximity to one another will be noted and can indicate whether a large number of attackers are clustered geographically (see the lookup sketch after this list).
  • Whether the visitor has used the site regularly: Hits on the aggregation system will be used to indicate whether the IP address has been seen before.
  • How much traffic a particular user has incurred
  • Whether the host has been seen as part of a botnet in the past
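
As a small illustration of the first point, the snippet below performs a local GeoIP lookup with the geoip2 Python library. The database path and the returned fields are assumptions for the sake of example; any MaxMind-compatible City database would work.

```python
import geoip2.database  # pip install geoip2; requires a local GeoIP City database file

def locate(ip: str, db_path: str = "GeoLite2-City.mmdb") -> dict:
    """Look up a bot's approximate location in a local GeoIP database."""
    with geoip2.database.Reader(db_path) as reader:
        resp = reader.city(ip)
        return {
            "country": resp.country.iso_code,
            "city": resp.city.name,
            "lat": resp.location.latitude,
            "lon": resp.location.longitude,
        }

# Clustering many attacking IPs on these coordinates can hint at a
# geographically concentrated (possibly voluntary, participatory) botnet.
print(locate("8.8.8.8"))  # example lookup with a routable address
```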

Tomorrow: Baskerville

Baskerville is a complete pipeline that implements the theory behind BotnetDBP. It receives incoming web logs as input, either from a Kafka topic, from a locally saved raw log file, or from log files saved to an Elasticsearch instance. It processes these logs in batches, forming request sets by grouping them by requested host and requesting IP. It then extracts features for these request sets and predicts whether they are malicious or benign using a model that was trained offline on previously observed and labelled data. Baskerville additionally cross-references each IP with MISP to determine whether it is already known to be malicious. Finally, it saves all the data and results to a Postgres database and publishes metrics on its processing (e.g. number of logs processed, percentage predicted malicious or benign, processing speed, etc.) that can be consumed by Prometheus and visualised in a Grafana dashboard.
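
To make the batch-processing flow concrete, here is a heavily simplified sketch of the grouping, feature-extraction, prediction, and cross-referencing steps. It is not Baskerville's actual implementation; the field names, the toy features, and the representation of the MISP check as a plain set of IPs are all assumptions.

```python
from collections import defaultdict

def form_request_sets(batch):
    """Group a batch of parsed log lines into request sets keyed by
    (requested host, requesting IP)."""
    request_sets = defaultdict(list)
    for log in batch:
        request_sets[(log["host"], log["client_ip"])].append(log)
    return request_sets

def extract_features(requests):
    """Toy feature vector; Baskerville's real feature set is much richer."""
    n = len(requests)
    return {
        "request_count": n,
        "error_rate": sum(r["status"] >= 400 for r in requests) / n,
        "unique_paths": len({r.get("path") for r in requests}),
    }

def process_batch(batch, model, known_malicious_ips):
    """Predict each request set and cross-reference the requesting IP against
    a set of IPs already reported as malicious (standing in for the MISP lookup)."""
    results = []
    for (host, ip), requests in form_request_sets(batch).items():
        features = extract_features(requests)
        prediction = model.predict([list(features.values())])[0]
        results.append({
            "host": host,
            "ip": ip,
            "features": features,
            "prediction": int(prediction),   # e.g. 1 = malicious, 0 = benign
            "known_malicious": ip in known_malicious_ips,
        })
    return results
```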

In addition to the engine that consumes and processes web logs, a set of offline analysis tools has been developed for use in conjunction with Baskerville. These tools may be accessed directly, or via two Jupyter notebooks, which walk the user through the machine learning tools and the investigations tools, respectively. The machine learning notebook comprises tools for training, evaluating, and updating the model used in the Baskerville engine. The investigations notebook comprises tools for processing, analysing, and visualising past attacks, for reporting purposes.
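
As a rough sketch of what the offline training step looks like, the example below fits a generic classifier on synthetic, labelled request-set feature vectors and evaluates it on a held-out split. The data, the three toy features, and the choice of scikit-learn's RandomForestClassifier are illustrative assumptions, not Baskerville's actual training code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic, labelled feature vectors: [request_count, error_rate, unique_paths]
rng = np.random.default_rng(0)
benign = rng.normal(loc=[20, 0.05, 8], scale=[5, 0.02, 3], size=(1000, 3))
malicious = rng.normal(loc=[400, 0.70, 2], scale=[80, 0.10, 1], size=(100, 3))

X = np.vstack([benign, malicious])
y = np.array([0] * len(benign) + [1] * len(malicious))  # 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out split before shipping the model to the engine.
print(classification_report(y_test, model.predict(X_test)))
```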

A brief overview of the current state of the Baskerville project is here, and the full in-depth documentation is available here.

Baskerville Schematic

Soon: Deflect Labs ISAC

The proposed future of Deflect monitoring involves splitting the Baskerville engine into separate User Module and Clearinghouse components. The User Module will be run by users to extract feature vectors of browsing behaviour from batches of their incoming web logs. These feature vectors will be sent to the DL Clearinghouse, where they will be processed and stored by the DL Prediction Engine, and a prediction (with a degree of certainty) will be returned. The user can then take the mitigation action they see fit (banning, restricting access, imposing a captcha challenge…) based on the prediction. In addition, the Clearinghouse will contain a DL Analysis Center, where Deflect data scientists and technicians will work to improve the trained classifier used in the DL Prediction Engine. We will need to develop a feedback framework for iteratively improving and assessing this model. We plan to provide a web interface through which users can easily log attacks they have seen, and to initially implement the DL-ISAC with a selection of partners who are willing to provide this feedback.
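
As a sketch of how the User Module side of this exchange might look, the snippet below sends a feature vector (with no IP or host name attached) to a Clearinghouse endpoint and acts on the returned prediction. The endpoint URL, payload layout, and response fields are all hypothetical; no such API has been published.

```python
import requests  # pip install requests

# Hypothetical Clearinghouse endpoint; the URL and JSON layout are assumptions.
CLEARINGHOUSE_URL = "https://clearinghouse.example.org/api/v1/predict"

feature_vector = {
    "request_count": 412,
    "error_rate": 0.83,
    "unique_paths": 2,
    # Deliberately no IP address or host name: identifiers stay on the
    # user's own infrastructure.
}

resp = requests.post(CLEARINGHOUSE_URL, json={"features": feature_vector}, timeout=10)
result = resp.json()  # e.g. {"prediction": "malicious", "confidence": 0.91}

if result["prediction"] == "malicious" and result["confidence"] > 0.8:
    # Apply whatever mitigation fits locally: ban, rate-limit, captcha challenge...
    pass
```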

By dividing Baskerville into the log-processing User Module and the Prediction Engine described above, we enable a complete separation of personal data from the central Clearinghouse. Users process their own web logs locally, and send off feature vectors (devoid of IP/host site) to receive a prediction. This allows threat-sharing without compromising data privacy. In addition, this separation will enable the adoption of the DL-ISAC by a much broader range of clients than the Deflect-hosted websites currently served. Increasing the user base of this software will also increase the amount of browsing data we are able to collect, and thus the strength of the models we are able to train.

The Analysis Center component of the DL Clearinghouse is intended as an extension of what is currently the Baskerville offline analysis toolkit. As the feature vectors used by the DL Analysis Center contain no sensitive user IP data, this work can be completely opened up to external partners collaborating on model development. Similarly, all the results of the analysis can be kept open source.

DL-ISAC Schematic