Baskerville

Baskerville is a network traffic anomaly detector for identifying and characterising malicious IP behaviour. It additionally comprises a selection of offline tools for investigating and learning from past web server logs.
The in-depth Baskerville documentation can be found here.

Overview

Baskerville is the component of the Deflect analysis engine used to decide whether the IPs connecting to Deflect hosts are normal, authentic connections or malicious bots. To make this assessment, Baskerville groups incoming requests into request sets by requested host and requesting IP.

For each request set, a selection of features is computed. These are properties of the requests within the request set (e.g. average path depth, number of unique queries, HTML-to-image ratio…) that are intended to help differentiate normal request sets from bot request sets. A supervised novelty detector, trained offline on the feature vectors of a set of normal request sets, is used to predict whether new request sets are normal or suspicious. Additionally, a set of offline analysis tools exists to cluster, compare, and visualise groups of request sets based on their feature values. The request sets, their features, the trained models, and details of suspected attacks and their attributes are all saved to a Baskerville database.
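To make this concrete, here is a minimal sketch of grouping requests into request sets and computing a single feature. This is an illustration only, not Baskerville's actual implementation, and the record field names ('client_ip', 'host', 'path') are assumptions.

from collections import defaultdict

# Toy parsed log records; the field names are hypothetical.
parsed_logs = [
    {'client_ip': '203.0.113.7', 'host': 'example.org', 'path': '/a/b/c.html'},
    {'client_ip': '203.0.113.7', 'host': 'example.org', 'path': '/index.html'},
]

def path_depth(path):
    # Depth of a URL path, e.g. '/a/b/c.html' -> 3.
    stripped = path.strip('/')
    return stripped.count('/') + 1 if stripped else 0

# Group records into request sets keyed by (requesting IP, requested host).
request_sets = defaultdict(list)
for record in parsed_logs:
    request_sets[(record['client_ip'], record['host'])].append(record)

# One illustrative feature per request set: average path depth.
features = {
    pair: sum(path_depth(r['path']) for r in records) / len(records)
    for pair, records in request_sets.items()
}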

Put simply, the Baskerville engine is the workhorse that consumes incoming web logs, processes them, and saves the results. This engine can be run as Baskerville live, which enables the real-time identification and banning of suspicious IPs, or as Baskerville manual, which conducts the same analysis for log files saved locally or in an elasticsearch database. There is additionally an offline analysis library for use with Baskerville, intended for a) developing the supervised model used in the Baskerville engine, and b) reporting on attacks and botnets. Both utilise the Baskerville storage, which is the database referenced above.

Baskerville Engine

In-depth documentation here.

The main Baskerville engine consumes web logs and uses these to compute request sets (i.e. the groups of requests made by each IP-host pair) and extract the request set features. It applies a trained novelty detection algorithm to predict whether each request set is normal or anomalous. It saves the request set features and predictions to the Baskerville storage database. It additionally cross-references incoming IP addresses with attacks logged in the database, to determine if a label (known malicious or known benign) can be applied to the request set.

As we would like to be able to make predictions about whether a request set is benign or malicious while the runtime (and thus potentially each request set) is ongoing, we divide each request set into subsets. Subsets have a fixed two-minute length, and the request set features (and prediction) are updated at the end of each subset using a feature-specific update method (discussed here). For nonlinear features, the feature value depends on the subset length, so logs are processed in two-minute subsets even when they are not being consumed live. This is also discussed in depth in the feature documentation above.
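As an illustration, a linear feature such as a mean can be updated exactly at each subset boundary from running totals; a nonlinear feature generally cannot, which is why its value depends on the window it is computed over. The following is a sketch of this kind of update, not Baskerville's actual update methods.

def update_mean(prev_mean, prev_count, subset_values):
    # Exact incremental update of a mean-valued feature at a subset boundary.
    total = prev_mean * prev_count + sum(subset_values)
    count = prev_count + len(subset_values)
    return total / count, count

# A request set with mean path depth 2.0 over 10 requests, updated with a
# new two-minute subset containing requests of depths 1, 3, and 5:
mean, count = update_mean(2.0, 10, [1, 3, 5])  # -> (~2.23, 13)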

The Baskerville engine utilises Apache Spark, an analytics engine designed for large-scale data processing. The decision to use Spark was made to ensure that the engine is efficient enough to consume and process web logs in real time, and can thus run continuously as part of the Deflect ecosystem.

Summary of Components

- Initialize: Set up the necessary components for the engine to run.
- Create Runtime: Create a record in the Runtimes table of the Baskerville storage, to indicate that Baskerville has been run.
- Get Data: Receive web logs and load them into a Spark dataframe.
- Get Window: Select data from the current time bucket window.
- Preprocessing: Handle missing values, filter columns, add calculation columns, etc.
- Group-by: Form request sets from the log data by grouping on host-IP pairs.
- Feature Calculation: Add additional calculation columns, and extract the features of each request set.
- Label or Predict: Apply a trained model to classify each request set as suspicious or benign, and cross-reference the IP with known malicious IPs to see if a label can be applied.
- Save: Save the analysis results to the Baskerville storage.

Live

In-depth documentation here.

The live version of Baskerville is designed to consume (ATS) logs from a Kafka topic at predefined intervals (the time bucket is set to 120 seconds by default) while a runtime is ongoing. It will be integrated into the online Deflect analysis engine, and will receive logs directly from ATS.

As logs are supplied to the Baskerville engine and processed, various metrics are produced, e.g. the number of incoming request sets, the average feature values for these request sets, and the predictions and/or labels (normal/anomalous) associated with these request sets. These metrics are exported to Prometheus, which publishes them for consumption by Grafana and other subscribers.
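As a sketch of how such an export might look with the prometheus_client library (the metric name is hypothetical; port 8998 matches the exporter target in the Prometheus config later in this document):

from prometheus_client import Gauge, start_http_server

# Serve metrics over HTTP for Prometheus to scrape.
start_http_server(8998)

# A hypothetical gauge tracking the number of request sets in the current window.
request_sets_gauge = Gauge('baskerville_request_sets', 'Request sets in the current window')
request_sets_gauge.set(42)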

Grafana is a metrics visualisation web application that can be configured to display several dashboards with charts, raise alerts when a metric crosses a user-defined threshold, and send notifications by mail or other means. Within Baskerville, under data/metrics, there is a dashboard that can be imported into Grafana, which presents the statistics of the Baskerville engine in a customisable manner. It is intended to be the principal visualisation and alerting tool for incoming Deflect traffic, displaying metrics in graphical form. Prometheus is the metric storage and aggregator that provides Grafana with its chart data.

Manual

In-depth documentation here.

The manual version of Baskerville consumes logs from locally saved raw log files or from an elasticsearch database, and conducts the processing steps enumerated in the Baskerville Engine section above.

The processing of old logs can be carried out either by supplying raw log files, or by providing a time period, batch length, and (optionally) host, in which case the logs will be pulled from elasticsearch in batches of the specified length. These details are provided in the Baskerville engine configuration file, and the type of run is determined by the command-line argument ‘rawlog’ or ‘es’ when calling the Baskerville main function. To label the request sets as benign or malicious, the Attacks and Attributes tables in the Baskerville storage must be filled, either by syncing with a MISP database of attacks, or by directly inputting records of past attacks and the attributes associated with them.

Baskerville Storage

In-depth documentation here.

The Baskerville storage is a database containing all the data output by the Baskerville engine, as well as the trained models and records of attacks utilised by the Baskerville engine for prediction and labelling, respectively.

Summary of Components

- Runtimes: Details of the occasions on which Baskerville has been run.
- Request sets: Data on the requests made by an IP-host pair (features, prediction, label, etc.); see the sketch after this list.
- Subsets: The host-IP pair request data for each subset time bucket.
- Models: Different versions of the trained novelty detectors (when they were created, which features they use, their accuracy, etc.).
- Attacks: Details of known attacks, optionally synced with a MISP database using the offline tools.
- Attributes: IPs implicated in the incidents listed in the Attacks table.
- Model Training Sets Link: Table linking models with the request sets they were trained on.
- Requestset Attack Link: Table linking request sets with the attacks they were involved in.
- Attribute Attack Link: Table linking attributes with the attacks they were involved in.
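To give a feel for the schema, a hypothetical SQLAlchemy model for the request sets table might look as follows; the column names are illustrative assumptions, not Baskerville's actual schema.

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class RequestSet(Base):
    # Illustrative sketch: one row per IP-host pair per runtime.
    __tablename__ = 'request_sets'
    id = Column(Integer, primary_key=True)
    ip = Column(String)           # requesting IP
    target = Column(String)       # requested host
    start = Column(DateTime)      # time of the first request
    stop = Column(DateTime)       # time of the last request
    features = Column(String)     # serialised feature vector
    prediction = Column(Integer)  # model output: normal or anomalous
    label = Column(Integer)       # known benign or known malicious, if available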

Offline Analysis

In-depth documentation here.

The offline component of Baskerville comprises a selection of tools for model development (for use in the Baskerville engine) and analysis (for use in investigations). A supervised binary classifier may be trained on the labelled request sets; newly trained models may be used to make predictions for existing request sets; clustering based on request set feature values may be conducted to identify similar requesting IP behaviour and characterise botnets; and the results of all of the above may be visualised.

These offline tools can be accessed by running the offline main script. Alternatively, two notebooks (“investigations” and “machine learning”) guide the user through the process of using these tools to investigate network traffic, or train a new classifier for use in Baskerville, respectively.

Summary of Components

- Investigations
  - Misp Sync: Copy the attack data stored in a MISP database to the Baskerville storage Attacks and Attributes tables.
  - Labelling: Label already-processed request sets as malicious or benign, based on e.g. cross-referencing with the Attacks table.
  - Data Exploration: Visualise mean feature statistics over time and across different groups.
  - Clustering: Group request sets based on their features to investigate botnets (a sketch follows this list).
  - Visualisation: Produce figures to aid model development and investigations.
- Machine Learning
  - The Investigations tools above, and…
  - Training: Train a novelty detection classifier on labelled request sets.
  - Model Evaluation: Assess the model accuracy with training size, and across different attacks / reference periods.
  - Prediction: Classify request sets as malicious/benign using newly trained models.
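As an example of the clustering step, request set feature vectors could be grouped with scikit-learn; this is a generic sketch, assuming the features have already been loaded into a matrix, and is not the actual offline pipeline.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Rows are request sets, columns are feature values (toy numbers for illustration).
X = np.array([[2.0, 5.0], [2.1, 4.8], [9.0, 0.5]])

# Standardise the features, then cluster; request sets labelled -1 are outliers.
X_scaled = StandardScaler().fit_transform(X)
cluster_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_scaled)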

Requirements

  • Python >= 3.6
  • Postgres 10
  • Java 8 needs to be in place (and in PATH) for Spark (PySpark version 2.3+) to work
  • The required packages in requirements.txt
  • Tests need pytest, mock, and spark-testing-base
  • Access to the esretriever repository (to get logs from elasticsearch)
  • Access to the Deflect analytics ecosystem repository (to run Baskerville online services)

Installation

In the root Baskerville directory, run:

pip install -e . --process-dependency-links

Note that Baskerville uses Python 3.6. The above command should be modified to pip3.6 install -e . --process-dependency-links if this is not the default Python version in your environment.

To use Baskerville live, you will need to have Postgres, Kafka, and Zookeeper running. There is a docker-compose file for these services here. The prometheus.yml should have the following job listed for the Baskerville exporter to run:

- job_name: 'Baskerville_exporter'
  static_configs:
  - targets:
    - 'my-local-ip:8998'

To use Baskerville manually or the Baskerville offline analysis tools, only a Postgres database is required.

For both the live and manual use of Baskerville, it is possible to export metrics to Prometheus and visualise them in a Grafana dashboard. Both Prometheus and Grafana are also included in the docker-compose file here.

Configuration

The run settings for the Baskerville engine are contained in the configuration file baskerville/conf/baskerville.yaml. The example configuration file should be renamed to baskerville.yaml, and edited as detailed here. The main components of the configuration file are listed below; an illustrative skeleton follows the list.

- DatabaseConfig: mysql or postgres config
- ElasticConfig: elasticsearch config
- MispConfig: misp database config
- EngineConfig: baskerville-specific config, including the following:
  - ManualEsConfig: details for pulling logs from elasticsearch
  - ManualRawlogConfig: details for getting logs from a local file
  - SimulationConfig: the config for running in online simulation mode
  - MetricsConfig: the config for metrics exporting
  - DataConfig: details of the format of the incoming web logs
- KafkaConfig: kafka and zookeeper configs
- SparkConfig: common spark configurations, such as log level, master, etc.
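A skeleton of the file with these sections might look something like the following; the key names and values here are illustrative assumptions, and should be checked against the example configuration file.

# Illustrative skeleton only; consult the example configuration file for the real keys.
database:
  name: baskerville
  user: postgres
  host: localhost
  port: 5432          # Postgres default
engine:
  time_bucket: 120    # seconds, as described in the Live section
kafka:
  bootstrap_servers: localhost:9092   # Kafka default
spark:
  log_level: ERROR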

The run settings for the offline tools are contained in the configuration file offline_analysis/conf/baskerville.yaml. In addition to the database, elasticsearch, and misp configurations described above, this includes an offline_features section (for specifying the desired features when training a model), and a process_reference section (for specifying the settings when running the process_reference offline script). Details of how to modify the offline config are given in the relevant offline tools sections (here).

Running

Baskerville Live

The first step is to launch the online services - Kafka, Zookeeper, Prometheus, and Grafana. To do this, replace ${DOCKER_KAFKA_HOST} (in the docker-compose file) and my-local-ip (in the prometheus.yml) with your local IP address. Then run

[sudo] docker-compose up

in the directory with the docker file.

Next, the configuration file should be filled out as explained above, to mirror the appropriate ports/hosts/users/passwords for the services being used. Once the configuration file has been set, change directory to baskerville/src/baskerville and run

python3 main.py kafka -e

to launch the version of the Baskerville engine that works by receiving logs from Kafka.

The -e flag enables metrics exporting: while the run is ongoing, you will be able to view the metrics at the specified Prometheus host/port, and you will be able to add Prometheus as a data source in Grafana to visualise the Baskerville metrics. A sample Baskerville dashboard is saved in the data directory, and can be imported to visualise metrics in Grafana.

Baskerville Manual

To run the Baskerville engine in manual mode, first edit the main Baskerville configuration file as detailed above. In the engine section of the configuration, either the manual_es or the manual_rawlog details should be set, depending on whether you will be processing logs from elasticsearch or locally saved log files. Note that if processing logs from elasticsearch, the host field is optional (if omitted, logs from all hosts will be processed). The batch_length specifies the number of minutes of logs to pull from elasticsearch at a time, before processing.
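For example, the manual_es details might look something like the following; the key names are illustrative assumptions, and should be checked against the example configuration file.

manual_es:
  start: '2018-01-01 00:00:00'   # beginning of the time period to process
  stop: '2018-01-02 00:00:00'    # end of the time period to process
  batch_length: 30               # minutes of logs to pull per batch
  host: example.org              # optional: omit to process all hosts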

Next, change directory to baskerville/src/baskerville and run

python3 main.py es

or

python3 main.py rawlog

again, depending on whether you wish to pull logs from elasticsearch or process locally saved log files. The engine will then consume the specified logs, process them, and save the results to the Baskerville storage database detailed in the configuration file.

Offline Analysis

To run the offline analysis tools, the offline configuration file (detailed above) should be edited. The analytical tools can then be launched by changing directory to baskerville/src/offline_analysis/src/, running

python3 main.py

and following the command-line instructions. Further details are given here.

Alternatively, to work through the offline analysis notebooks, navigate to baskerville/src/offline_analysis/src/notebooks/ and execute

jupyter notebook

You can then access the notebooks and follow the instructions therein at localhost:8888.

If instead you wish to launch the notebook on a remote machine, call the notebook-launch.sh script in baskerville/src/offline_analysis/src/notebooks/, and in a local terminal run

ssh -N -f -L localhost:8888:localhost:8889 user@remote-ip

The notebook will then be accessible at localhost:8888.

Testing

Unit Tests

Basic unit tests for the features and the pipelines have been implemented. You can run them in a Python 3 virtual environment with:

python -m pytest tests/

Functional Tests

No functional tests exist yet. They will be added as the structure stabilises.

To Do

  • Implement full suite of unit and functional tests.
  • Allow log queuing for when the engine does not keep pace with incoming logs.
  • Create request set IDs locally, so the database does not need to be queried for these.
  • Export full suite of statistics to Prometheus.
  • Conduct model tuning and feature selection / engineering.
  • Consolidate offline tools into an investigations pipeline.