Running¶

The Baskerville engine is triggered by changing directory to baskerville/src/baskerville and running

python3 main.py [triggering-type]


where the positional argument triggering-type is es, rawlog, or kafka. Additional options that may be specified are -s for simulation, -c for the config file, -v for verbose, and -e for export metrics. All other arguments are specified in the config file, which is located in baskerville/conf/baskerville.yaml.
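As a rough illustration of the command line described above, the following sketch builds an equivalent parser with argparse. Only the positional argument and the short flags (-s, -c, -v, -e) come from this documentation; the long option names and the function name build_parser are assumptions made for the example, and the real main.py may be organised differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command line (long names are assumed)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Positional triggering type: selects which pipeline is run.
    parser.add_argument("triggering_type", choices=["es", "rawlog", "kafka"])
    # Optional flags; only the short forms are documented.
    parser.add_argument("-s", "--simulate", action="store_true")
    # The docs place the default config at baskerville/conf/baskerville.yaml.
    parser.add_argument("-c", "--conf", default=None)
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("-e", "--export-metrics", action="store_true")
    return parser


args = build_parser().parse_args(["kafka", "-s", "-e"])
print(args.triggering_type, args.simulate, args.export_metrics)
```

Restricting the positional argument with choices means an unknown triggering type is rejected before any pipeline is constructed.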

Configuration¶

The configuration file is located in baskerville/conf/baskerville.yaml, and has the following structure:

database:
  user: user                              -- username for database
  type: 'postgres'                        -- type of database (postgres or mysql)
  host: 127.0.0.1                         -- IP hosting database
  port: 5432
  data_partition_trigger_file: '/path/to/data/data_partitioning_by_{month || host}.sql'  -- creates data partitioned (by month or by host) request_sets and subsets tables.

elastic_db:
  user: 'elastic'                         -- username for elasticsearch
  host: 'https://opsdash.deflect.ca'      -- URL for elasticsearch
  base_index: 'deflect.log'
  index_type: 'deflect_access'

misp_db:
  misp_url: 'https://misp.ie/'            -- URL of MISP attacks database to sync with
  misp_key: 'private-key'                 -- key to connect to MISP database
  misp_verifycert: True

engine:
  manual_es:
    host: somehost                        -- criteria when pulling logs from es
    start: 2018-01-01 00:00:00               (optional)
    stop: 2018-01-02 00:00:00
    batch_length: 15                      -- batch in mins to pull logs from es in
    save_logs_dir: path/to/save/logs/dir  -- directory to save logs pulled from es (optional)
  manual_rawlog:
    raw_logs_paths:
      - 'path1/log1.json'                 -- paths to local logs to process (optional)
      - 'path2/log2.json'
  simulation:
    sleep: True
    verbose: False
  datetime_format: '%Y-%m-%d %H:%M:%S'    -- format of timestamps in logs
  cache_expire_time: 604800               -- seconds after which request sets in cache expire
  cross_reference: False                  -- whether to label request sets using attacks table
  model_path: path/to/saved/model         -- path to locally saved trained model (optional)
  model_version_id: n                     -- id of model saved to Baskerville storage (optional)
  extra_features:
    - 'example_feature_average'           -- model features will automatically be calculated;
  verbose: False                             specify additional features here
  metrics:
    port: 8998                            -- options for metrics exported to Prometheus
    performance:                          -- times the methods listed under pipeline and request_set_cache
      pipeline:
        - '_preprocessing'
        - '_group_by'
        - '_feature_calculation'
        - '_label_or_predict'
        - '_save'
      request_set_cache:
        - 'instantiate_cache'
        - '__getitem__'
        - '__contains__'
        - 'clean'
      features: True                      -- times the feature computation
    progress: True                        -- tracks the progress of the engine
  data_config:                            -- options for parsing web logs
    parser: JSONLogSparkParser            -- the name of the parser class
    schema: '/path/to/data/samples/sample_log_schema.json'  -- the path to the json schema for the logs
    group_by_cols:
      - 'client_request_host'             -- fields to group by when forming
      - 'client_ip'                          request sets
    timestamp_column: '@timestamp'        -- name of column that contains log timestamp
  logpath: path/to/where/to/save/logs/    -- baskerville.log output here
  log_level: 'ERROR'

kafka:                                    -- kafka config
  url: '0.0.0.0:9092'
  zookeeper: 'localhost:2181'
  consume_topic: 'ats.logs'
  publish_logs: 'baskerville.logs'        -- currently not used
  publish_stats: 'baskerville.stats'              -- currently not used
  publish_predictions: 'baskerville.predictions'  -- currently not used

spark:                                    -- spark config
  master: 'local'
  parallelism: -1  # control the number of tasks, -1 means use everything the machine has
  log_conf: 'true'
  log_level: 'INFO'
  jars: '/path/to/jars'                   -- including postgres jar, es jar etc
  session_timezone: 'UTC'
  shuffle_partitions: 14
  executor_instances: 4
  executor_cores: 4
  spark_driver_memory: '6G'
  db_driver: 'org.postgresql.Driver'
  metrics_conf: /path/to/data/spark.metrics
  jar_packages: 'com.banzaicloud:spark-metrics:2.3-1.1.0,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2'
  jar_repositories: 'https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases'
  event_log: True
  serializer: 'org.apache.spark.serializer.KryoSerializer'
  kryoserializer_buffer_max: '2024m'
  kryoserializer_buffer: '1024k'
  executor_extra_java_options: '-verbose:gc'


Note: The configuration can be parsed using environment variables like this:

database:
  user: !ENV ${DB_USER}
  password: !ENV ${DB_PASSWORD}
  host: !ENV ${DB_HOST}
  port: !ENV ${DB_PORT}

...
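To make the substitution idea concrete, here is a minimal standard-library sketch that replaces ${NAME} placeholders in an already-loaded configuration dictionary. It is not the parser from helpers.py: the function names resolve_env and resolve_config are hypothetical, and the handling of the !ENV tag during YAML loading is left out.

```python
import os
import re

# Matches ${NAME} placeholders such as ${DB_USER}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def resolve_env(value, env=None):
    """Replace every ${NAME} in a string with the matching environment value."""
    env = os.environ if env is None else env

    def lookup(match):
        name = match.group(1)
        if name not in env:
            raise KeyError("environment variable {} is not set".format(name))
        return env[name]

    return PLACEHOLDER.sub(lookup, value)


def resolve_config(node, env=None):
    """Walk nested dicts and lists, resolving placeholders in every string."""
    if isinstance(node, dict):
        return {key: resolve_config(val, env) for key, val in node.items()}
    if isinstance(node, list):
        return [resolve_config(item, env) for item in node]
    if isinstance(node, str):
        return resolve_env(node, env)
    return node


fake_env = {"DB_USER": "baskerville", "DB_HOST": "127.0.0.1"}
config = {"database": {"user": "${DB_USER}", "host": "${DB_HOST}", "port": 5432}}
print(resolve_config(config, fake_env))
```

Raising on an unset variable, rather than silently substituting an empty string, keeps a missing password or host from surfacing later as a confusing connection error.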


The configuration file is parsed in main.py, and read and verified in engine.py. The configuration parser is defined in helpers.py, and the configuration loader is defined in config.py. BaskervilleConfig is called hierarchically and verifies the whole configuration.

The hierarchy in BaskervilleConfig is as follows:

- DatabaseConfig: mysql or postgres config
- ElasticConfig: elasticsearch config
- MispConfig: misp database config
- EngineConfig: baskerville-specific config, including the following:
  - ManualEsConfig: details for pulling logs from elasticsearch
  - ManualRawlogConfig: details for getting logs from a local file
  - SimulationConfig: the config for running in online simulation mode
  - MetricsConfig: the config for metrics exporting
  - DataConfig: details of the format of the incoming web logs
- KafkaConfig: kafka and zookeeper configs
- SparkConfig: common spark configurations, such as log level, master etc.

Each configuration object inherits from the base class Config.
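The hierarchical verification can be pictured with the sketch below, in which each section validates its own keys and then delegates to its sub-sections. The class names follow the hierarchy above, but the required and children attributes, the validate method, and the choice of which keys are mandatory are all assumptions made for illustration, not the contents of config.py.

```python
class Config:
    """Hypothetical base class: wraps one config section and collects errors."""

    required = ()   # keys that must be present in this section (assumed)
    children = {}   # sub-section name -> Config subclass

    def __init__(self, section):
        self.section = section or {}
        self.errors = []

    def validate(self, path=""):
        for key in self.required:
            if key not in self.section:
                self.errors.append("{}{} is missing".format(path, key))
        # Recurse so that one call on the root checks the whole tree.
        for name, cls in self.children.items():
            child = cls(self.section.get(name))
            self.errors.extend(child.validate(path + name + ".").errors)
        return self


class DatabaseConfig(Config):
    required = ("user", "host", "port")


class ManualEsConfig(Config):
    required = ("host", "batch_length")


class EngineConfig(Config):
    required = ("datetime_format",)
    children = {"manual_es": ManualEsConfig}


class BaskervilleConfig(Config):
    children = {"database": DatabaseConfig, "engine": EngineConfig}


raw = {"database": {"user": "u", "host": "h"},
       "engine": {"manual_es": {"host": "somehost"}}}
print(BaskervilleConfig(raw).validate().errors)
```

Collecting every error with its dotted path, instead of stopping at the first one, lets a user fix a configuration file in a single pass.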

Features¶

Features are the properties of a request set (i.e. the set of requests made over time by one IP-host pair), which are used to classify that request set as malicious or benign. A feature may depend on other features, which must be calculated first. These are specified in the feature.dependencies property. Every feature has two methods, compute and update, which specify how the feature is initially calculated for a subset, and how its value should be updated if the request set already exists. A full discussion of feature updating is given here. A feature must specify the columns it needs from the logs. A feature may also specify:

- pre_group_by_calcs: any calculations that should be performed on the raw logs instead of the grouped logs
- group_by_aggs: what to keep after the grouping of the logs.
- post_group_by_calcs: any calculations that should be performed after the grouping of the logs, when certain columns are available after aggregations.
- columns_renamed: renamings for the columns, in the form self.columns_renamed = {'column name to be renamed': 'new name'}

These are in the form:

# F is the usual alias for the Spark SQL functions module
from pyspark.sql import functions as F

# new column name and the respective calculation, e.g. divide a column by 10
self.pre_group_by_calcs = {'new column name': F.col('existing column') / 10}
# new column name and the respective aggregation function
self.group_by_aggs = {'new agg column name': F.min(F.col('new column name'))}
# new column name and the respective calculation, e.g. divide an aggregated column by 1000
self.post_group_by_calcs = {'new post group by column name': F.col('new agg column name') / 1000}


The currently implemented features are as follows:

- feature_js_to_html_ratio
- feature_image_total
- feature_response4xx_rate
- feature_payload_size_average
- feature_unique_path_rate
- feature_html_total
- feature_unique_ua_to_request_ratio
- feature_response5xx_rate
- feature_unique_query_to_unique_path_ratio
- feature_image_to_html_ratio
- feature_geo_time_average
- feature_payload_size_log_average
- feature_response5xx_to_request_ratio
- feature_path_depth_average
- feature_css_to_html_ratio
- feature_unique_ua_total
- feature_response5xx_total
- feature_response4xx_total
- feature_unique_path_to_request_ratio
- feature_js_total
- feature_top_page_total
- feature_unique_query_total
- feature_request_interval_variance
- feature_unique_ua_rate
- feature_unique_path_total
- feature_unique_query_rate
- feature_minutes_total
- feature_request_interval_average
- feature_request_total
- feature_response4xx_to_request_ratio
- feature_request_rate
- feature_css_total
- feature_top_page_to_request_ratio
- feature_path_depth_variance

Pipelines¶

The class Step in util/enum contains all the possible pipeline steps. They are:

initialize = 'Start sessions, initialize cache/features/model/dfs.'
create_runtime = 'Create a Runtime in the Baskerville database.'
get_data = 'Get dataframe of log data.'
get_window = 'Select data from current time bucket window.'
preprocessing = 'Fill missing values, add calculation cols, and filter.'
group_by = 'Group logs by IP/host.'
feature_calculation = 'Add calculation cols, extract features, and update.'
label_or_predict = 'Apply label from MISP or predict label.'
save = 'Update dataframe, save to database, and update cache.'
finish_up = 'Disconnect from db, unpersist dataframes, and empty cache.'


The Step class is used in the pipelines in the step_to_action dictionary. It allows us to know where the execution is at any time and if there is a need to clean up after we stop the engine, e.g. to finish up saving everything in the database.

Each pipeline implements the run method, in which it iterates over the steps and calls the respective action functions.

def run(self):
    self._initialize()
    self._create_runtime()
    self._get_data()

    if self.logs_df.count() == 0:
        self.logger.info('No data in to process.')
    else:
        for window_df in self.get_window():
            self.logs_df = window_df
            remaining_steps = list(self.step_to_action.keys())
            for step, action in self.step_to_action.items():
                self.logger.info('Starting step {}'.format(step))
                action()
                self.logger.info('Completed step {}'.format(step))
                remaining_steps.remove(step)
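The bookkeeping around step_to_action and remaining_steps can be tried out without Spark using a toy pipeline. In this sketch the class ToyPipeline, its fail_at argument, and the _do helper are hypothetical; only the Step names and descriptions are taken from the list above. It shows how the list of remaining steps records where execution stopped.

```python
from enum import Enum
from functools import partial


class Step(Enum):
    # A subset of the documented steps, with the same descriptions.
    preprocessing = 'Fill missing values, add calculation cols, and filter.'
    group_by = 'Group logs by IP/host.'
    save = 'Update dataframe, save to database, and update cache.'


class ToyPipeline:
    """Hypothetical pipeline that only records which steps completed."""

    def __init__(self, fail_at=None):
        self.fail_at = fail_at
        self.completed = []
        # Dicts keep insertion order, so steps run in definition order.
        self.step_to_action = {step: partial(self._do, step) for step in Step}
        self.remaining_steps = list(self.step_to_action)

    def _do(self, step):
        if step is self.fail_at:
            raise RuntimeError('failed during {}'.format(step.name))
        self.completed.append(step.name)

    def run(self):
        self.remaining_steps = list(self.step_to_action)
        try:
            for step, action in self.step_to_action.items():
                action()
                self.remaining_steps.remove(step)
        except RuntimeError:
            # remaining_steps now tells cleanup code what is outstanding.
            pass
        return [step.name for step in self.remaining_steps]


print(ToyPipeline().run())
print(ToyPipeline(fail_at=Step.group_by).run())
```

A step is removed from the list only after its action returns, so a step that fails part-way is still reported as outstanding.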


There are three different pipelines, depending on the use case:

- ElasticsearchPipeline
- RawLogPipeline
- KafkaPipeline

Which pipeline to use is determined by the triggering flag specified when main is run (es, rawlog, or kafka).

The following sections will detail each of the pipeline steps in turn.

Initialize¶

In this step, the following initial processes take place, depending on the configuration:

- Start necessary sessions (Spark, elasticsearch etc),
- Connect to the Baskerville storage database,
- Define the active features, active columns, cache columns,
- Load the model,
- Register the metrics.

Create Runtime¶

A runtime is created in the Baskerville storage database to indicate that Baskerville has been run, and the details of the run.

Get Data¶

The log data to be processed is ingested - from elasticsearch, a raw log file, or from kafka.

Get Window¶

Filter out just those logs whose timestamps correspond to the current time bucket chunk being processed.
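The windowing idea can be sketched in plain Python as a generator over consecutive time buckets. The function name get_windows, its arguments, and the use of a list of dicts instead of a Spark dataframe are assumptions for the example; the 120-second default matches the time bucket mentioned elsewhere in this documentation.

```python
from datetime import datetime, timedelta


def get_windows(logs, start, stop, time_bucket_seconds=120):
    """Yield (window_start, rows) for consecutive, non-overlapping buckets.

    `logs` is a list of dicts holding a datetime under '@timestamp'.
    """
    step = timedelta(seconds=time_bucket_seconds)
    window_start = start
    while window_start < stop:
        window_stop = window_start + step
        # Half-open interval [window_start, window_stop): a log line on a
        # bucket boundary is counted once, in the later bucket.
        rows = [row for row in logs
                if window_start <= row['@timestamp'] < window_stop]
        yield window_start, rows
        window_start = window_stop


t0 = datetime(2018, 4, 17, 8, 12, 0)
logs = [{'@timestamp': t0 + timedelta(seconds=s)} for s in (10, 130, 239)]
for window_start, rows in get_windows(logs, t0, t0 + timedelta(seconds=240)):
    print(window_start, len(rows))
```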

Preprocessing¶

Preprocessing comprises the following actions:

- handle_missing_columns,
- rename_columns,
- filter_columns,
- handle_missing_values,
- add_calc_columns.

Handle Missing Columns

The log columns currently required are the following:

+-------------------+--------------+--------------------+--------------------+--------------------+--------------------+------------------+-----------+------------------+
|client_request_host|     client_ip|          @timestamp|           client_ua|          client_url|        content_type|http_response_code|querystring|reply_length_bytes|
+-------------------+--------------+--------------------+--------------------+--------------------+--------------------+------------------+-----------+------------------+
|       testhost.net|25.204.184.124|2018-04-17T08:12:...|Mozilla/5.0 (Wind...|/wp-content/theme...|application/javas...|               200| ?ver=2.2.3|             25204|
|       testhost.net|  8.157.89.174|2018-04-17T08:12:...|Mozilla/5.0 (Wind...|/wp-content/plugi...|application/javas...|               200|     ?ver=1|              2825|
|       testhost.net|  37.151.22.36|2018-04-17T08:12:...|Mozilla/5.0 (Wind...|/wp-content/theme...|application/javas...|               200|     ?ver=1|               267|
|       testhost.net|202.165.110.43|2018-04-17T08:12:...|Mozilla/5.0 (Wind...|/wp-content/theme...|application/javas...|               200|     ?ver=1|               341|
|       testhost.net| 174.201.44.32|2018-04-17T08:12:...|Mozilla/5.0 (Wind...|/wp-content/plugi...|application/javas...|               200|     ?ver=1|               302|
+-------------------+--------------+--------------------+--------------------+--------------------+--------------------+------------------+-----------+------------------+


These are defined in a JSON schema the user must provide. An example of the JSON schema can be found here and the required attribute dictates which columns must be present in the dataframe. Any missing columns are filled in during this stage.

Rename Columns

Spark cannot process column names that include . because . indicates column hierarchy, e.g. a.b.c is interpreted by Spark roughly as:

|      a     | <-- top column a
-------------|
| .. | b   | | -
|  --------- | | ---> a single cell that contains sub columns
||...| c |...| |
-------------- -


but it is actually in this form:

|   a.b.c    | <-- top column a.b.c
-------------|
| some data  |
--------------


So, if the incoming logs have such a column because of its naming rather than its structure, each . must be replaced with a character that Spark accepts, such as _, before proceeding, e.g. a.b.c becomes a_b_c.
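A minimal sketch of the renaming is given below. The helper name spark_safe_names is hypothetical, and the collision check is an extra safeguard assumed for the example rather than something this documentation describes.

```python
def spark_safe_names(columns, replacement='_'):
    """Map each column name to a dot-free version, rejecting collisions."""
    renamed = {}
    for name in columns:
        new_name = name.replace('.', replacement)
        # 'a.b' and 'a_b' would both become 'a_b'; fail loudly instead of
        # silently merging two different columns.
        if new_name in renamed.values():
            raise ValueError('{} collides after renaming'.format(name))
        renamed[name] = new_name
    return renamed


print(spark_safe_names(['a.b.c', 'client_ip']))
```

With a Spark dataframe, a mapping like this could then be applied one column at a time, for example with DataFrame.withColumnRenamed.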

Filter Columns

Any columns that are not listed in the active_cols (provided by the features) and groupby_cols are discarded. The timestamp_column, if not defined in active_cols, is included too, since it is a necessary element for many calculations and actions, or just to be saved in the database.

Handle Missing Values

Any missing values are filled with what is defined in the user-provided JSON schema. If no default value is defined, then None is used as the default value.
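The two schema-driven steps, adding missing columns and filling missing values, can be sketched for a single log row as follows. The function apply_schema is hypothetical, and the schema layout is assumed to follow standard JSON Schema keywords (required, properties, default); the project's sample schema may be structured differently.

```python
def apply_schema(row, schema):
    """Return a copy of `row` with missing columns added and None values filled."""
    filled = dict(row)
    properties = schema.get('properties', {})
    # Missing columns: every required column must exist in the row.
    for column in schema.get('required', []):
        filled.setdefault(column, None)
    # Missing values: use the schema default; None stays when there is none.
    for column, value in filled.items():
        if value is None:
            filled[column] = properties.get(column, {}).get('default')
    return filled


schema = {
    'required': ['client_ip', 'client_ua'],
    'properties': {
        'client_ua': {'type': 'string', 'default': ''},
        'reply_length_bytes': {'type': 'integer', 'default': 0},
    },
}
print(apply_schema({'client_ip': '25.204.184.124', 'reply_length_bytes': None}, schema))
```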

Add Pre-group by Calculation Columns

Some columns are added before the grouping of the data, to facilitate the feature computation further on. These calculations are defined in the feature’s pre_group_by_calcs dictionary. For example, the PathDepthAverage feature is calculated as client_url_html_type_slash_count / number of requests. It is easier to calculate the client_url slash count on the ungrouped logs with Spark functions than on the grouped dataframe, where we would have to use a udf (user defined function, which is slower than using Spark functions). Each feature also provides the aggregation functions to be applied when the data is grouped, which gather the columns needed to compute the feature value.

Group-by¶

During this step, the Spark dataframe is grouped by client_request_host and client_ip to form request sets, each comprising all the requests from one IP-host pair. During the grouping of the data, the feature-defined aggregations are also applied, so that the information the features need to compute their values will be available in the grouped dataframe later on. All these aggregations are defined in the group_by_aggs attribute of the features.
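The interplay of the three phases (a calculation on the raw logs, aggregations applied while grouping, and a calculation on the grouped result) can be illustrated without Spark. The sketch below computes an average path depth per request set in plain Python; the function name is hypothetical, and as a simplification it counts slashes in every URL rather than only in html-type requests.

```python
from collections import defaultdict


def path_depth_average(logs):
    """Average number of slashes per request for each (host, IP) pair."""
    # Pre-group-by calculation: one slash count per raw log line.
    counted = [dict(row, slash_count=row['client_url'].count('/'))
               for row in logs]

    # Group-by aggregations: sum of slash counts and number of requests.
    aggs = defaultdict(lambda: {'slash_sum': 0, 'num_requests': 0})
    for row in counted:
        key = (row['client_request_host'], row['client_ip'])
        aggs[key]['slash_sum'] += row['slash_count']
        aggs[key]['num_requests'] += 1

    # Post-group-by calculation: combine the aggregated columns.
    return {key: agg['slash_sum'] / agg['num_requests']
            for key, agg in aggs.items()}


logs = [
    {'client_request_host': 'testhost.net', 'client_ip': '8.157.89.174',
     'client_url': '/a/b'},
    {'client_request_host': 'testhost.net', 'client_ip': '8.157.89.174',
     'client_url': '/a/b/c/d'},
    {'client_request_host': 'testhost.net', 'client_ip': '37.151.22.36',
     'client_url': '/'},
]
print(path_depth_average(logs))
```

Doing the per-line work before grouping keeps the grouped stage down to simple sums and counts, which is the same reasoning the text gives for preferring Spark functions over a udf.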

Feature Calculation¶

Feature calculation includes the following actions:

- add_post_groupby_columns,
- feature_extraction,
- feature_update.

Add Post-Groupby Columns

Some additional calculation columns must be added after the group-by step. These are defined in the post_group_by_calcs attribute of the features. This step is performed before the actual computation of the features so that we can take advantage of any common column computations.

Feature Extraction

Each of the features is calculated, and added as an additional column to the Spark dataframe.

Feature Update

If the request set exists already in the request set cache, its value is updated. This is achieved via the update method defined in the feature class.
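As an illustration of what an update might look like, the sketch below merges a cached value with the value computed for the current subset, for a total-type feature and for an average-type feature. Both functions and the weighting by request count are assumptions for the example; the actual update methods in the feature classes may use different rules.

```python
def update_total(old_total, subset_total):
    """A total over the request set is the sum of the totals of its subsets."""
    return old_total + subset_total


def update_average(old_avg, old_count, subset_avg, subset_count):
    """Combine two averages, weighting each by its number of requests."""
    total = old_count + subset_count
    if total == 0:
        return 0.0
    return (old_avg * old_count + subset_avg * subset_count) / total


# Cached request set: 4 requests averaging 2.0; new subset: 2 requests at 5.0.
print(update_total(4, 2))
print(update_average(2.0, 4, 5.0, 2))
```

Weighting by request count gives the same result as recomputing the average over all requests, without having to keep the raw logs of earlier subsets.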

Label or Predict¶

Label

If the cross_reference flag is set to True in the configuration file, the request set IP is checked against the list of known malicious IPs in the Attributes table. If it is present there, the request set is marked as malicious. If it is not present, and the Attacks table has been synced more recently than the request set timestamp, it is marked as benign. The labelling is done using the cross_reference udf.
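The labelling rule can be written as a small pure function. This is a sketch of the decision logic only, not the cross_reference udf itself: the argument names are hypothetical, and the label values (+1 benign, -1 malicious, None unknown) follow the label column described in the storage section.

```python
from datetime import datetime

MALICIOUS, BENIGN = -1, 1


def cross_reference_label(ip, request_set_stop, malicious_ips, last_sync):
    """Return -1, +1, or None for a request set, following the rule above."""
    if ip in malicious_ips:
        return MALICIOUS
    # Absence from the list is only meaningful if the list was synced
    # after the request set was observed.
    if last_sync is not None and last_sync > request_set_stop:
        return BENIGN
    return None


known_bad = {'25.204.184.124'}
synced = datetime(2018, 4, 18)
seen = datetime(2018, 4, 17, 8, 12)
print(cross_reference_label('25.204.184.124', seen, known_bad, synced))
print(cross_reference_label('8.157.89.174', seen, known_bad, synced))
```

Returning None when the sync is older than the request set avoids labelling traffic as benign merely because the attack data has not caught up yet.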

Predict

The prediction is done using a udf that takes as input the feature values and returns 1 for a normal row and -1 for an anomalous row. The machine learning module additionally outputs a metric r that indicates our degree of certainty in the prediction, with a larger value indicating higher certainty. (r represents the distance to the separating hyperplane in the One Class SVM.) The model and scaler are loaded from the Baskerville storage database.
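One way to read the relationship between the prediction and r is sketched below. It assumes the model exposes a signed distance to the hyperplane that is positive on the normal side (as scikit-learn's OneClassSVM.decision_function does), and that r is the magnitude of that distance; both the function name and the use of the absolute value are assumptions, not a description of the engine's code.

```python
def interpret_score(score):
    """Map a signed hyperplane distance to (prediction, r)."""
    # Non-negative distance: the row falls on the normal side.
    prediction = 1 if score >= 0 else -1
    # Assumed: certainty grows with the distance from the boundary.
    r = abs(score)
    return prediction, r


print(interpret_score(0.7))
print(interpret_score(-2.5))
```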

Save¶

The request set cache is checked to determine whether request sets already exist. Those request sets that are present have their values updated in the Baskerville storage database, and those that are not present are saved as new request sets in the Request Sets table.

Request-Set Cache¶

The request set cache was created to avoid querying the database for every request set after every time bucket. Necessary information for updating is loaded into the cache when the pipeline is initialized (if the load_past flag is set to True in the config file), and again at the end of the save step of the pipeline.
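A much simplified, in-memory picture of such a cache is given below. The method names __contains__, __getitem__ and clean echo the names listed under request_set_cache in the configuration, and the default expiry matches cache_expire_time; everything else (the class layout, the update method, the injectable clock) is an assumption for the sketch and not the engine's implementation.

```python
import time


class RequestSetCache:
    """Hypothetical in-memory cache keyed by (target, ip)."""

    def __init__(self, expire_seconds=604800, clock=time.time):
        self.expire_seconds = expire_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # (target, ip) -> (stored_at, values)

    def __contains__(self, key):
        return key in self._entries

    def __getitem__(self, key):
        return self._entries[key][1]

    def update(self, key, values):
        self._entries[key] = (self.clock(), values)

    def clean(self):
        """Drop entries older than expire_seconds; return how many were removed."""
        now = self.clock()
        expired = [key for key, (stored_at, _) in self._entries.items()
                   if now - stored_at > self.expire_seconds]
        for key in expired:
            del self._entries[key]
        return len(expired)


cache = RequestSetCache()
cache.update(('testhost.net', '8.157.89.174'), {'num_requests': 3})
print(('testhost.net', '8.157.89.174') in cache, cache.clean())
```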

The live version of Baskerville is a version where we continuously consume logs, in a predefined interval (the time_bucket, which is currently set to 120 seconds), from a (kafka) queue. The main steps after every time bucket are the preprocessing of the logs, the grouping, the feature extraction, the prediction, and the save to the database.

|     |                                      | Online      |
|     |   consume every time_bucket seconds| - process,  |
|     |  <---------------------------------- | - predict   |
|     |                                      | - save      |
---------------


Input¶

The current input of the system is the ATS logs (the web logs from all of the ATS’s edges). These reach Baskerville through kafka, where a Logstash instance gathers them and publishes them in a specified format (this is currently WIP).

Output¶

The outputs of the online Baskerville are:

- The processed request sets saved in the database with their respective predictions
- Metrics about the efficiency and the progress of the process:
  - performance: metrics about how long it takes to complete a specific task.
  - progress: how many lines of logs have been processed so far, how many request sets etc.

The metrics are exported to Prometheus and visualized through a Grafana Dashboard that can be found under the metrics folder.

Simulation¶

For testing purposes, there is a script that simulates the log publishing behaviour. This can be found under baskerville/src/baskerville/simulation. It can be run on its own with python3 real_timeish_simulation.py, or through python main.py kafka -s. In short, the behaviour of the script can be summed up as follows: given an input file of raw logs, a kafka connection with all the details such as where to publish, and a time bucket (120 seconds by default), go through the logs, group them in 120-second batches, and start publishing those batches (one line at a time) to the specified topic, e.g. baskerville.logs. After all the lines of a batch have been published, if there is still time (the time bucket duration has not passed), sleep for the remaining seconds.
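The publish-then-sleep loop can be sketched as follows. This is not the code of real_timeish_simulation.py: the function simulate and its parameters are hypothetical, the publish callable stands in for a kafka producer, and the batches are assumed to have been grouped by time bucket already.

```python
import time


def simulate(batches, publish, time_bucket=120, clock=time.time, sleep=time.sleep):
    """Publish each batch line by line, then sleep out the rest of the bucket."""
    for batch in batches:
        started = clock()
        for line in batch:
            publish(line)
        # Only sleep if publishing took less than one time bucket.
        remaining = time_bucket - (clock() - started)
        if remaining > 0:
            sleep(remaining)


# Demo with a no-op sleep so the example finishes immediately.
simulate([['log line 1', 'log line 2'], ['log line 3']],
         publish=print, sleep=lambda seconds: None)
```

Passing the clock and sleep functions as arguments makes it possible to test the pacing logic without waiting for real time to pass.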

Monitoring¶

Prometheus, in combination with Grafana, is used for monitoring. There are various components that need monitoring:

- Kafka: using a kafka-exporter. Metrics and dashboards:
  - https://grafana.com/dashboards/721
  - https://grafana.com/dashboards/5484
- Baskerville: using the exporter embedded in the project (python main.py {functionality} -e; the -e flag registers the metrics and runs the exporter on the specified port, default 8998).
- Spark: using the spark-metrics project, which exports Spark-related metrics (info about the cluster and the workers) to the Prometheus Push Gateway.
- Postgres: there are two ways to monitor Postgres:
  - Directly connect to a Postgres database and set up a dashboard with queries to a specific table
  - Use a postgres exporter to monitor requests, RAM etc.
- Prometheus: Prometheus itself is monitored by the default exporter, the node exporter, which provides insights about the Prometheus instance itself.
- Grafana: Grafana exports metrics to Prometheus.
- ATS: One way to monitor ATS is to consume from the default stats_over_http plugin that exposes metrics like these. Ideally the load plugin should be enabled too, so that we get information about the load on the edges.

Response¶

There are several ideas for responding to the alerts (malicious traffic predictions) that Baskerville outputs. For example, a simple ATS plugin could accept the predicted output (the IP considered malicious for a set of requests) and act by bypassing the specific requests and responding with a simple message and a custom code, e.g. “You shall not pass” 418 I’m a teapot. Another example is to update MISP with the events Baskerville considers malicious and have someone provide feedback about them.

This is something to be investigated.

The output of Baskerville is saved to a Postgres database specified in the configuration file. Interactions between the Baskerville engine and Postgres are managed using SQLAlchemy. The database models are contained in the db folder, and are as follows:

• Runtimes:
• id (Integer): primary key
• id_encryption (Integer): foreign key - encryption id
• start (DateTime, UTC): start time of logs being processed
• stop (DateTime, UTC): stop time of logs being processed
• target (Text): host website that logs were requesting (optional)
• dt_bucket (Float): length in seconds of the time bucket logs were processed in
• file_name (Text): path to raw log file that was processed (optional)
• processed (Boolean): flag indicating whether runtime analysis completed or not
• n_request_sets (Integer): count of request sets associated with the runtime
• comment (Text): field for optional additional information
• created_at (DateTime, UTC): timestamp when runtime was created
• config (JSON): a record of the engine config associated with the runtime
• Request sets:
• id (Integer): primary key
• id_runtime (Integer): foreign key - runtime id
• target (Text): host website that IP was requesting
• ip (String45): requesting IP address
• ip_encrypted (Text): (not implemented)
• ip_iv (Text): (not implemented)
• ip_tag (Text): (not implemented)
• start (DateTime, UTC): timestamp of first request in request set
• stop (DateTime, UTC): timestamp of last request in request set
• total_seconds (Float): length of request set, in seconds
• subset_count (Integer): number of subsets requests have been made in
• num_requests (Integer): total requests made in the request set
• time_bucket (Integer): time bucket length used for runtime (seconds)
• label (Integer): known benign (+1) or known malicious (-1) or null
• id_attribute (Integer): foreign key - attribute associated with IP
• process_flag (Boolean): whether request set needs to be processed
• prediction (Integer): predicted benign (+1) or malicious (-1) or null
• r (Float): degree of certainty in the prediction outcome
• row_num (Integer):
• features (JSON): feature values associated with the request set
• created_at (DateTime, UTC): time request set was created
• updated_at (DateTime, UTC): time request set was last updated
• model_version (Integer): foreign key - model used for prediction
• Subsets:
• id (Integer): primary key
• target (Text): host website that IP was requesting
• ip (String45): requesting IP address
• start (DateTime, UTC): timestamp of start of subset
• stop (DateTime, UTC): timestamp of end of subset
• num_requests (Integer): total requests made in the subset
• features (JSON): feature values associated with the subset
• prediction (Integer): predicted benign (+1) or malicious (-1) or null
• r (Float): degree of certainty in the prediction outcome
• row_num (Integer):
• time_bucket (Integer): length of subset (seconds)
• created_at (DateTime, UTC): time subset was created
• Models:
• id (Integer): primary key
• created_at (DateTime, UTC): time model was trained
• features (JSON): features used in model
• algorithm (Text): algorithm model employs (e.g. OneClassSVM)
• parameters (Text): parameter values used in model
• recall (Float): model recall score
• precision (Float): model precision score
• f1_score (Float): model f1 score
• classifier (LargeBinary): pickled trained model
• scaler (LargeBinary): pickled scaler, fitted to normal training data
• notes (Text): optional additional notes on model
• Attacks:
• id (Integer): primary key
• id_misp (Integer): event id from the misp database
• date (DateTime, UTC): day on which the attack occurred
• target (Text): target the attack focussed on
• attack_type (Text): type of attack (DDoS, web, etc)
• attack_tool (Text): tool used for attack
• attack_source (Text): e.g. Deflect
• ip_count (Integer): number of IPs involved in attack
• sync_start (DateTime, UTC): the oldest date synced from misp
• sync_stop (DateTime, UTC): the newest date synced from misp
• processed (Integer): whether the attack period has been processed
• notes (Text): optional additional notes on attack
• Attributes:
• id (Integer): primary key
• value (Text): IP address known to be malicious
• id_request_set (Integer): foreign key - request_sets
• id_model (Integer): foreign key - models
• id_request_set (Integer): foreign key - request_sets
• id_attack (Integer): foreign key - attacks
• id_attack (Integer): foreign key - attacks
• id_attribute (Integer): foreign key - attributes
• Encryption: not currently implemented

Data partitioning is defined and used for Request Sets and Subsets because these tables will get quite big.

Offline Analysis¶

Components¶

Investigations¶

The investigations tools can be run by working through the investigations jupyter notebook, baskerville/src/offline_analysis/src/notebooks/baskerville_investigations.ipynb, or alternatively, they can be run individually as described below.

Misp Sync¶

Purpose

Sync the attacks and attributes listed in a MISP database to the corresponding tables in the Baskerville storage, so that incoming IPs can be cross-referenced with these incidents, and their request sets can be appropriately labelled for model training and validation.

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the misp_db section to reflect the MISP database you wish to sync with, and edit the database section to point to the Baskerville storage database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


At the commandline prompt, choose option 0 to run the misp_sync function, and specify how many days you wish the sync to look back over.

Labelling¶

Purpose

To train and test a supervised model, request sets must be labelled as benign or malicious. If attacks have been synced from a MISP database, and cross_reference is set to True when the Baskerville engine is run, this labelling will automatically occur (provided the request set is not newer than the last sync date). If request sets need to be labelled after the Baskerville engine has been run, this can be achieved using the offline labelling tool.

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


At the commandline prompt, choose the labelling method from:

- cross-referencing with attributes table,
- via runtime ID,
- via time period and host,
- from IP list,

and follow the instructions to complete the labelling process.

Data Exploration¶

Purpose

Aggregate and visualise summary statistics for groups of request sets. This is used to compare attacks with one another, and with reference periods. There are options to calculate mean feature values for different groups of request sets, or to calculate the model accuracy in predicting request sets involved in different attacks.

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database with the data. The request sets you wish to analyse must already be saved to this database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


Follow the commandline prompt to decide how to specify the request sets you wish to analyse, which model version to use, and how to name the output visualisations. These will be saved in /baskerville/offline_analysis/data/visualisations.

Clustering¶

Purpose

The script clustering.py contains tools for clustering request sets by their feature vectors, and comparing the overlap between runtimes and clusters, for use in identifying and characterising botnets. It is intended as an exploratory investigative tool, rather than for use with online Baskerville.

Details

The available clustering algorithms are KMeans, DBSCAN, and HDBSCAN. In KMeans, the number of clusters is chosen by finding the ‘elbow’ of the BIC curve. However, this is subject to noise, and is biased towards a small number of clusters in the current implementation. In DBSCAN, the number of clusters is automatically determined by the algorithm. However, the parameter epsilon (representing the maximum distance between two samples for them to be considered as in the same neighborhood) still needs to be tuned. HDBSCAN is a hierarchical implementation of DBSCAN that allows the automatic selection of epsilon. This is consequently the recommended clustering algorithm to use here.

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database with the data. The request sets you wish to cluster must already be saved to this database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


Follow the commandline prompt to decide how to specify the request sets you wish to cluster, which clustering algorithm you would like to use, and whether to produce and save visualisations. The cluster IDs json will be saved locally in the /baskerville/offline_analysis/data/model_output directory. Any visualisations will be saved in /baskerville/offline_analysis/data/visualisations.

Visualisation¶

Purpose

The script visualisation.py contains tools for producing analysis visualisations, both for novelty detection and for botnet clustering.

Details

The supported figures are:

- Feature importances: Shows the relative importance of different features in determining request set classifications, as estimated using an ExtraTreesClassifier.
- Feature correlations: Correlation matrix of the correlation between different features. This is for use in model development, as highly correlated features should be combined.
- Feature pairplot full, colour by runtime / prediction / cluster: Pairplot (n_features x n_features) showing the pairwise correlations between all the request set feature values, with the feature distributions along the diagonal. Labels to colour by can be either runtime id, prediction value, or cluster id.
- Feature pairplot subset, colour by runtime / prediction / cluster: Pairplot, as above, but for a restricted set of features, specified by user input.
- Parallel coordinates: Parallel coordinates plots of the feature values across each of the active features, for every request set. Strands can be coloured by runtime OR prediction OR cluster.
- Feature boxplot, split by runtime / prediction / cluster: Box plots of the scaled feature distributions for each of the input runtimes OR prediction values OR cluster ids.
- Runtime / prediction / cluster visualisation using PCA: Principal Component Analysis visualization to plot the data in 3D feature space. Data points can be coloured by runtime OR prediction OR cluster.
- Runtime / prediction / cluster visualisation along 3 feature dims: 3D scatter plot of request set data in feature space along three specified feature axes, coloured by runtime OR prediction OR cluster.
- Cluster-runtime overlap (stacked barchart): Stacked bar chart visualising overlap between runtimes and clusters.

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database with the data. The request sets you wish to visualise must already be saved to this database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


Follow the command-line prompt to choose which figure to plot from the following categories:

FEATURE SELECTION:
- Feature importances
- Feature correlations
INSPECTION OF INCIDENT OR MODEL IN FEATURE SPACE:
(Color by attack id, runtime, prediction, label, or cluster id)
- Feature pairplot full
- Feature pairplot subset
- Parallel coordinates
- Feature boxplot
- Visualisation using PCA
- Visualisation along 3 feature dimensions
INCIDENT-MODEL COMPARISON:
- Stacked barchart showing overlaps with predictions
- Stacked barchart showing overlaps with clusters


Additionally, specify which request sets to visualise, whether to scale the data, and where to save the visualisations.

Machine Learning¶

The machine learning tools can be run by working through the machine learning Jupyter notebook, baskerville/src/offline_analysis/src/notebooks/baskerville_machine_learning.ipynb. Alternatively, they can be run individually as described below.

Training¶

Purpose

The script training.py contains tools for training a novelty detection classifier for use in the online version of Baskerville. Trained on labelled request sets, this model can then predict whether new request sets are benign or suspicious, based on their feature values.

Details

Novelty detection is distinct from outlier detection in that it assumes the training data does not contain outliers, and we are interested in detecting anomalies in new observations. A OneClassSVM is appropriate for this use case. For outlier detection, where the training set is contaminated, tools include IsolationForest, LocalOutlierFactor (LOF), and EllipticEnvelope (see the sklearn documentation).
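A minimal sketch of the novelty detection setting, assuming scikit-learn and NumPy are installed: a OneClassSVM is fitted on clean data only and then asked about new observations. The synthetic data and the gamma and nu values are illustrative assumptions, not the settings used by Baskerville.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Clean training data: assumed to contain no outliers (synthetic stand-in).
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu is an upper bound on the fraction of training points treated as errors;
# the values here are illustrative and have not been tuned.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

# New observations: one close to the training distribution, one far from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
labels = detector.predict(X_new)  # +1 means "normal", -1 means "novel"
print(labels)
```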

A novelty detector is an example of a binary classifier, as it classifies each new sample as either “normal” (+1) or “novel” (-1). The steps taken in training the classifier are as follows:

1. Load feature vectors for normal request sets. These comprise the training set.
2. Scale the features to have mean = 0 and variance = 1. Save this scaler so that future samples can be scaled in the same way.
3. Load feature vectors for known malicious request sets to test the classifier with. Partition out a portion of the normal training data into this test set, so that the two categories (benign / malicious) are evenly represented in the test set.
4. Tune the model hyperparameters by iterating through a parameter grid, training the classifier on the training set of normal request sets, and testing its performance on the labelled test data (comprising known malicious request sets and the held-out normal request sets). The performance can be assessed by calculating recall, precision, and their harmonic mean, the f1 score.
5. Fit the model to the training data with the parameters set to their optimal values. Save the trained classifier, and write the model details (version, created_at, trained_on, tested_on, classifier, parameters, recall, precision, f1_score) to the model database.
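The training steps can be sketched roughly as follows, assuming scikit-learn and NumPy. This is not the code in training.py: the synthetic feature vectors, the parameter grid, and the choice to treat “novel” (-1) as the positive class when scoring are all assumptions made for illustration. Saving the scaler and model, and writing to the model database, are left out.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-ins for labelled request set feature vectors.
normal = rng.normal(loc=0.0, scale=1.0, size=(600, 4))
malicious = rng.normal(loc=5.0, scale=1.0, size=(100, 4))

# Steps 1 and 3: hold out as many normal samples as there are malicious
# ones, so that both categories are evenly represented in the test set.
X_train, held_out = normal[:-100], normal[-100:]
X_test = np.vstack([held_out, malicious])
y_test = np.array([1] * len(held_out) + [-1] * len(malicious))

# Step 2: fit the scaler on the training data and reuse it for test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 4: iterate through a (hypothetical) parameter grid, keeping the
# parameters with the best f1 score on the labelled test data.
grid = {"nu": [0.01, 0.05, 0.1], "gamma": [0.01, 0.1, 1.0]}
best = None
for params in ParameterGrid(grid):
    clf = OneClassSVM(kernel="rbf", **params).fit(X_train_s)
    y_pred = clf.predict(X_test_s)
    result = {
        "parameters": params,
        "recall": recall_score(y_test, y_pred, pos_label=-1),
        "precision": precision_score(y_test, y_pred, pos_label=-1),
        "f1_score": f1_score(y_test, y_pred, pos_label=-1),
    }
    if best is None or result["f1_score"] > best["f1_score"]:
        best = result

# Step 5: refit on the training data with the best parameters found.
final_model = OneClassSVM(kernel="rbf", **best["parameters"]).fit(X_train_s)
print(best)
```

Fitting the scaler on the training data alone keeps information from the test set out of the model, which makes the reported scores a fairer estimate of performance on unseen request sets.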

Running

In order to train a novelty detector, request set data must already have been produced by running the Baskerville engine. Additionally, some of these request sets must have been labelled as known benign (these are the training data), and some must have been labelled as known malicious (these are the testing data).

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database with the data. Also edit the offline_features to include those features you wish to use in your model.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


At the command-line prompt, specify the training data and the testing data using the options provided, and whether or not to produce and save visualisations. The trained model will be saved to the Baskerville storage database, and also locally in the /baskerville/offline_analysis/data/model_output directory. Any visualisations will be saved in /baskerville/offline_analysis/data/visualisations.

Model Evaluation¶

Purpose

Assess how a model's performance varies with training set size, and across different attacks.

Details

Train a classifier on chunks of normal data of increasing length, test it against a known attack, and plot the model accuracy (precision, recall, f1 score) as a function of training set size. Increase the training set size until the model accuracy begins to converge. If the accuracy converges to a value that is not sufficiently high, further work such as feature engineering will be needed to improve it.
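The procedure can be sketched as a simple loop over training set sizes, assuming scikit-learn and NumPy. The synthetic data, the list of sizes, and the fixed OneClassSVM parameters are assumptions for illustration; the scores are printed here rather than plotted.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Synthetic stand-ins for normal traffic and one known attack.
normal = rng.normal(loc=0.0, scale=1.0, size=(900, 4))
attack = rng.normal(loc=4.0, scale=1.0, size=(100, 4))

# Fixed test set: held-out normal data plus the attack.
normal_train, normal_test = normal[:800], normal[800:]
X_test = np.vstack([normal_test, attack])
y_test = np.array([1] * len(normal_test) + [-1] * len(attack))

curve = []
for n_train in [50, 100, 200, 400, 800]:
    # Train on the first n_train normal samples only.
    chunk = normal_train[:n_train]
    scaler = StandardScaler().fit(chunk)
    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    clf.fit(scaler.transform(chunk))

    # Score against the test set, treating "novel" (-1) as the positive class.
    y_pred = clf.predict(scaler.transform(X_test))
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, pos_label=-1, average="binary"
    )
    curve.append(
        {
            "n_train": n_train,
            "precision": float(precision),
            "recall": float(recall),
            "f1_score": float(f1),
        }
    )

for row in curve:
    print(row)
```

If the scores stop changing appreciably between the last few sizes, the model can be regarded as having converged for that attack.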

Running

The online configuration file baskerville/conf/baskerville.yaml should be edited to reflect the engine settings you wish to employ. The offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml should be edited to list the features you wish to focus on in the model training.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py

Prediction¶

Purpose

If a scaler and model have previously been trained, predictions can be made for whether new request sets are malicious or benign using the tools in prediction.py. This is of use when comparing different models, or in order to classify existing request sets using a new model version. The degree of certainty in the prediction is quantified by the decision function, with a larger absolute value indicating more certainty. These predictions are automatically saved as a JSON file, and can optionally be written to the Baskerville database.
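A minimal sketch of how a prediction and its decision-function score relate, assuming scikit-learn and NumPy. The scaler and model are fitted inline on synthetic data as stand-ins for previously trained ones, and the request set ids and JSON field names are hypothetical rather than the format written by prediction.py.

```python
import json

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)

# Stand-ins for a previously trained scaler and model.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
scaler = StandardScaler().fit(X_normal)
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(scaler.transform(X_normal))

# New request sets to classify, keyed by hypothetical ids.
new_request_sets = {101: [0.1, -0.2, 0.0], 102: [6.0, 6.5, 7.0]}
ids = list(new_request_sets)
X_new = scaler.transform(np.array([new_request_sets[i] for i in ids]))

labels = model.predict(X_new)            # +1 benign, -1 suspicious
scores = model.decision_function(X_new)  # signed distance to the boundary

predictions = [
    {"request_set_id": i, "prediction": int(label), "score": float(score)}
    for i, label, score in zip(ids, labels, scores)
]
predictions_json = json.dumps(predictions, indent=2)
print(predictions_json)
```

The sign of the score matches the predicted label, and its magnitude indicates how far the sample lies from the decision boundary.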

Running

In the offline configuration file baskerville/src/offline_analysis/conf/baskerville.yaml, edit the database section to point to the Baskerville storage database with the data. The request sets you wish to classify and the model you wish to use must already be saved to this database.

Change directory to baskerville/src/offline_analysis/src and run

python3 main.py


Follow the command-line prompt to decide which model version to use, whether to write the predictions to the database, and whether to produce and save visualisations. The predictions will be saved locally as a JSON file in the /baskerville/offline_analysis/data/model_output directory. Any visualisations will be saved in /baskerville/offline_analysis/data/visualisations.