NPBAD Basics

Analyzing Patterns and Anomalies with NPBAD

This tutorial is based on the main takeaways from the research paper Pattern-Based Anomaly Detection in a Network of Multivariate Time Series.

To explore the basic concepts, we’ll use the ipbad function to find interesting patterns and anomalies, demonstrated on two different time series datasets:

  1. The NYC taxi passengers dataset
  2. The server machine dataset

ipbad is described in detail in the paper and contains a variety of optimisations to minimise runtime and improve comprehension. Its steps are: (i) transform a time series into a set of discrete sequences; (ii) find frequent and cohesive patterns that best compress the data; (iii) create an anomaly score based on pattern occurrences.

Getting Started

Let’s import the packages that we’ll need to load, analyze, and plot the data:

In [1]:
!git clone https://len_feremans@bitbucket.org/len_feremans/pbad_network_private.git
Cloning into 'pbad_network_private'...
remote: Enumerating objects: 573, done.
remote: Counting objects: 100% (573/573), done.
remote: Compressing objects: 100% (517/517), done.
remote: Total 573 (delta 262), reused 146 (delta 38), pack-reused 0
Receiving objects: 100% (573/573), 314.46 MiB | 6.44 MiB/s, done.
Resolving deltas: 100% (262/262), done.
In [2]:
!cd pbad_network_private && pip install . -q
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
     |████████████████████████████████| 2.5 MB 5.3 MB/s 
  Building wheel for npbad (setup.py) ... done
In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
from tqdm import tqdm
import os

from npbad.utils import load_numenta_data, cart_product

from npbad.visualisation import plot_data, plot_discrete, plot_anomalies, plot_patterns
from npbad.preprocess_timeseries import min_max_norm
from npbad.symbolic_representation import create_segments, create_windows
from npbad.symbolic_representation import discretise_segments_equal_distance_bins_global

from npbad.ipbad.main import save_transactions
from npbad.ipbad.pattern_mining_petsc import mine_patterns_and_create_embedding
from npbad.ipbad.minimum_description_length import post_process_mdl

from npbad.ipbad.main import run_ipbad
from npbad.eval_methods import eval_best_f1, eval_recall_at_k

What is a pattern?

Time series patterns are repeated subsequences found within a longer time series. In contrast to motifs, we compute exact occurrences after discretisation, allow patterns of varying length, and allow a number of gaps relative to the length of the pattern.
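
For example, the pattern [0, 1, 2] occurs in the discrete sequence [0, 3, 1, 2] with one gap (the symbol 3). A small hypothetical check, not part of npbad, could look like this:

def occurs_with_gaps(pattern, sequence, max_gaps):
    # greedy subsequence match that counts skipped symbols inside the occurrence as gaps
    gaps, i = 0, 0
    for symbol in sequence:
        if i == len(pattern):
            break
        if symbol == pattern[i]:
            i += 1
        elif i > 0:
            gaps += 1
    return i == len(pattern) and gaps <= max_gaps

print(occurs_with_gaps([0, 1, 2], [0, 3, 1, 2], max_gaps=1))  # True
print(occurs_with_gaps([0, 1, 2], [0, 3, 1, 2], max_gaps=0))  # False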

Loading the NYC Taxi Passengers Dataset

First, we’ll download historical data that represents the half-hourly average number of NYC taxi passengers from July 2014 through January 2015, available in The Numenta Anomaly Benchmark. We also download the labelled timestamps for the taxi dataset, saved as JSON.

In [4]:
!wget -nv https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv
!wget -nv https://raw.githubusercontent.com/numenta/NAB/master/labels/combined_labels.json
2022-05-23 12:06:26 URL:https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv [265771/265771] -> "nyc_taxi.csv" [1]
2022-05-23 12:06:26 URL:https://raw.githubusercontent.com/numenta/NAB/master/labels/combined_labels.json [7015/7015] -> "combined_labels.json" [1]

We extract the data and insert it into a pandas dataframe, making sure the timestamps are stored as datetime objects and the values are of type float64. The labels, or anomalies, are simply a list of numpy datetime objects. Here we use the utility function npbad.utils.load_numenta_data to do this.
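
A minimal sketch of what such loading could look like without the utility function, assuming the standard NAB file layout (load_numenta_data additionally fills in missing timestamps, as its log below shows):

import json
import pandas as pd

df = pd.read_csv('nyc_taxi.csv', parse_dates=['timestamp'])
df = df.rename(columns={'timestamp': 'time', 'value': 'average_value'})
df['average_value'] = df['average_value'].astype('float64')

with open('combined_labels.json') as f:
    labels = json.load(f)
anomalies = [pd.Timestamp(ts).to_datetime64() for ts in labels['realKnownCause/nyc_taxi.csv']]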

In [5]:
df, anomalies = load_numenta_data('nyc_taxi.csv', 'combined_labels.json', key='realKnownCause/nyc_taxi.csv')
display(df)
fill_missing: adding (relative) +0.0% more values (from 10320 to 10320)
loading data. start: 2014-07-01 00:00:00. end: 2015-01-31 23:30:00. interval: 30.0m
anomalies: #5. timestamps: [numpy.datetime64('2014-11-01T19:00:00'), numpy.datetime64('2014-11-27T15:30:00'), numpy.datetime64('2014-12-25T15:00:00'), numpy.datetime64('2015-01-01T01:00:00'), numpy.datetime64('2015-01-27T00:00:00')]
average_value time
0 10844 2014-07-01 00:00:00
1 8127 2014-07-01 00:30:00
2 6210 2014-07-01 01:00:00
3 4656 2014-07-01 01:30:00
4 3820 2014-07-01 02:00:00
... ... ...
10315 24670 2015-01-31 21:30:00
10316 25721 2015-01-31 22:00:00
10317 27309 2015-01-31 22:30:00
10318 26591 2015-01-31 23:00:00
10319 26288 2015-01-31 23:30:00

10320 rows × 2 columns

Visualizing the Taxi Dataset

We can visualise the time series and labelled anomalies using matplotlib, or use the utility function npbad.visualisation.plot_data, which shows both.

In [6]:
fig, axs = plt.subplots(1, figsize=(20,3))
df_sample = df[df.time > np.datetime64('2014-10-15')]
plot_data(axs, df_sample, anomalies)

Symbolic representation

Before we find frequent patterns we require a symbolic representation, or sequence database. Therefore, we first normalise the time series and apply a sliding window.

In [7]:
ts_i = min_max_norm(df)
ts_i_sample = ts_i[ts_i.time > np.datetime64('2014-12-20')]
#symbolic representation
windows = create_windows(ts_i_sample, interval=24, stride=24) 
segments = create_segments(ts_i_sample, windows)
print(f"Total windows: {len(windows)}. First: {windows[0]}. Last: {windows[-1]}. Length segment: {len(segments[0])}")
Total windows: 42. First: (Timestamp('2014-12-20 00:30:00'), Timestamp('2014-12-21 00:30:00')). Last: (Timestamp('2015-01-30 00:30:00'), Timestamp('2015-01-31 00:30:00')). Length segment: 48

Next we discretise each window by taking average values every x steps, i.e. we apply the Piecewise Aggregate Approximation (PAA), and then discretise the average values into k equal-width bins.
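
Conceptually, PAA followed by binning with fixed bin edges could look like the sketch below (illustrative only, assuming min-max normalised values in [0, 1]; the library function discretise_segments_equal_distance_bins_global derives the bin edges globally over all segments):

import numpy as np

def paa(segment, no_symbols):
    # average the normalised values in no_symbols equally sized chunks
    chunks = np.array_split(np.asarray(segment, dtype=float), no_symbols)
    return np.array([chunk.mean() for chunk in chunks])

def discretise(avg_values, no_bins, lo=0.0, hi=1.0):
    # map each average onto one of no_bins equal-width bins over [lo, hi]
    edges = np.linspace(lo, hi, no_bins + 1)[1:-1]
    return np.digitize(avg_values, edges)

example_segment = np.random.rand(48)   # stand-in for one window of normalised values
symbols = discretise(paa(example_segment, no_symbols=10), no_bins=5)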

In [8]:
segments_d = discretise_segments_equal_distance_bins_global(segments, no_symbols=10, no_bins=5)
print(segments_d[0])
[2 1 0 1 2 2 2 3 2 3]

We can visualise the discrete representation directly, since the binning is global.

In [9]:
fig, axs = plt.subplots(2, figsize=(30,6), sharex=True)
anomalies = [an for an in anomalies if an > np.datetime64('2014-12-01')]
plot_data(axs[0], ts_i_sample, anomalies)
plot_discrete(axs[1], segments_d, windows, interval_width=24, stride=24, no_symbols=10, no_bins=5)
plot_anomalies(axs[1], ts_i_sample, anomalies)
discrete plot len w 420, len s420

Manually finding a pattern-based outlier

Given a set of discrete sequences, we can report an anomaly as a sequence that occurs infrequently or, vice versa, one that is not covered by frequent patterns. First, we identify a pattern manually.

In [10]:
def matches(pattern, segment):
  '''return True if all symbols of the pattern occur in the same order in the segment'''
  for x in pattern:
    if x in segment:
      # keep the remainder of the segment, starting at the matched symbol
      segment = segment[segment.index(x):]
    else:
      return False
  return True

frequent_pattern = [0,1,2,2,2,2,2]
occurrences1 = [1 if matches(frequent_pattern,segment.tolist()) else 0 for segment in segments_d]
infrequent_pattern = [3,2,0,0,1]
occurrences2 = [1 if matches(infrequent_pattern,segment.tolist()) else 0 for segment in segments_d]

fig, axs = plt.subplots(3, figsize=(30,6), sharex=True)
plot_discrete(axs[0], segments_d, windows, interval_width=24, stride=24, no_symbols=10, no_bins=5)
plot_anomalies(axs[0], ts_i_sample, anomalies)
axs[1].bar([window[0] + np.timedelta64(6,'h') for window in windows], occurrences1)
axs[1].set_ylabel(f'{frequent_pattern}')
axs[2].bar([window[0] + np.timedelta64(6,'h') for window in windows], occurrences2, color='green')
axs[2].set_ylabel(f'{infrequent_pattern}')
discrete plot len w 420, len s420
Out[10]:
Text(0, 0.5, '[3, 2, 0, 0, 1]')

From the above plot we see that the first, frequent pattern matches 40/42 windows and that the two remaining windows are in fact anomalous. The second, infrequent pattern matches only a single window, which is also anomalous. If we were to create an embedding for each window and store whether either pattern occurs, it would not be hard to decide whether a window is anomalous. In the following section we will automatically discover patterns and report anomalies.
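
For instance, a tiny two-dimensional embedding built from these two hand-picked patterns already separates the anomalous windows (illustrative only; ipbad builds its embedding from all mined patterns, as shown in the next section):

embedding_2d = list(zip(occurrences1, occurrences2))
# flag windows not covered by the frequent pattern, or matched by the infrequent one
suspect_windows = [windows[i][0] for i, (freq, infreq) in enumerate(embedding_2d)
                   if freq == 0 or infreq == 1]
print(suspect_windows)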

Finding patterns using NPBAD

First we find the top-k most frequent sequential patterns (here k = 10,000) with a constraint on the relative duration. For instance, with a relative duration of 1.0 we allow no gaps, and with a relative duration of 1.2 an occurrence of a pattern of length 5 can have at most 1 gap (5*1.2 - 5 = 1). We also filter out sequential patterns with a length smaller than 4.
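
The maximum number of gaps allowed for an occurrence follows directly from the pattern length and the relative duration, for example (illustrative; the exact rounding used by the miner may differ):

import math

def max_gaps(pattern_length, relative_duration=1.2):
    # an occurrence may span at most floor(pattern_length * relative_duration) symbols;
    # the small epsilon guards against floating-point rounding
    return math.floor(pattern_length * relative_duration + 1e-9) - pattern_length

print(max_gaps(5))   # 1
print(max_gaps(10))  # 2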

In [11]:
save_transactions('./temp/sequences.txt', segments_d)
patterns_df, embedding = mine_patterns_and_create_embedding('./temp/sequences.txt', no_transactions = len(segments_d), 
                                                            top_k=10000, min_size=4, max_size=10, 
                                                            duration=1.2, 
                                                            patterns_fname='./temp/patterns.txt', embedding_fname='./temp/embedding.txt', verbose=True)
print(f"Total patterns before MDL: {patterns_df.shape[0]}")
display(patterns_df)
['java', '-Xmx16G', '-cp', '/usr/local/lib/python3.7/dist-packages/npbad/ipbad/../../lib/petsc-miner-pbad-0.1.0-with-dependencies.jar', 'be.uantwerpen.mining.MineTopKSequentialPatternsNoDiscretisation', '-input_transactions', './temp/sequences.txt', '-min_size', '4', '-max_size', '10', '-duration', '1.2', '-top_k', '10000', '-output_patterns', './temp/patterns.txt', '-output_occurrences', './temp/embedding.txt']
saved embedding to ./temp/embedding.txt elapsed 1.2886779308319092s
loading #749 patterns. Min/Max rsupport :0.024/0.833
loading embedding #42. Min/Max patterns in segment :25/79. First: [(0, 0.8333333333333334), (1, 0.8095238095238095), (2, 0.7142857142857143), (3, 0.7142857142857143), (4, 0.6904761904761905), (18, 0.30952380952380953), (21, 0.2857142857142857), (22, 0.2857142857142857), (23, 0.2857142857142857), (24, 0.2619047619047619), (25, 0.2619047619047619), (26, 0.2619047619047619), (28, 0.23809523809523808), (29, 0.23809523809523808), (30, 0.23809523809523808), (32, 0.21428571428571427), (33, 0.21428571428571427), (36, 0.21428571428571427), (38, 0.21428571428571427), (39, 0.21428571428571427), (40, 0.21428571428571427), (46, 0.19047619047619047), (48, 0.19047619047619047), (56, 0.16666666666666666), (61, 0.16666666666666666), (70, 0.14285714285714285), (74, 0.14285714285714285), (75, 0.14285714285714285), (76, 0.14285714285714285), (77, 0.14285714285714285), (79, 0.14285714285714285), (82, 0.14285714285714285), (83, 0.14285714285714285), (89, 0.11904761904761904), (90, 0.11904761904761904), (115, 0.09523809523809523), (116, 0.09523809523809523), (147, 0.09523809523809523), (148, 0.09523809523809523), (170, 0.07142857142857142), (171, 0.07142857142857142), (172, 0.07142857142857142), (173, 0.07142857142857142), (190, 0.07142857142857142), (206, 0.07142857142857142), (208, 0.07142857142857142), (210, 0.07142857142857142), (211, 0.07142857142857142), (216, 0.07142857142857142), (218, 0.07142857142857142), (231, 0.07142857142857142), (233, 0.07142857142857142), (234, 0.07142857142857142), (236, 0.07142857142857142), (258, 0.047619047619047616), (262, 0.047619047619047616), (270, 0.047619047619047616), (276, 0.047619047619047616), (290, 0.047619047619047616), (292, 0.047619047619047616), (293, 0.047619047619047616), (314, 0.047619047619047616), (316, 0.047619047619047616), (325, 0.047619047619047616), (326, 0.047619047619047616), (351, 0.047619047619047616), (353, 0.047619047619047616), (354, 0.047619047619047616), (355, 0.047619047619047616), (358, 0.047619047619047616), (359, 0.047619047619047616), (377, 0.047619047619047616), (381, 0.047619047619047616), (383, 0.047619047619047616), (388, 0.047619047619047616), (389, 0.047619047619047616), (390, 0.047619047619047616), (391, 0.047619047619047616), (400, 0.047619047619047616)]. Last: [(0, 0.8333333333333334), (1, 0.8095238095238095), (2, 0.7142857142857143), (3, 0.7142857142857143), (4, 0.6904761904761905), (5, 0.6666666666666666), (6, 0.6428571428571429), (7, 0.6428571428571429), (8, 0.5952380952380952), (9, 0.5476190476190477), (10, 0.5238095238095238), (11, 0.5238095238095238), (12, 0.40476190476190477), (21, 0.2857142857142857), (22, 0.2857142857142857), (23, 0.2857142857142857), (24, 0.2619047619047619), (25, 0.2619047619047619), (26, 0.2619047619047619), (28, 0.23809523809523808), (29, 0.23809523809523808), (30, 0.23809523809523808), (35, 0.21428571428571427), (51, 0.19047619047619047), (55, 0.16666666666666666), (57, 0.16666666666666666), (64, 0.16666666666666666), (69, 0.14285714285714285), (94, 0.11904761904761904), (95, 0.11904761904761904), (99, 0.11904761904761904), (100, 0.11904761904761904), (101, 0.11904761904761904), (102, 0.11904761904761904), (103, 0.11904761904761904), (108, 0.11904761904761904), (112, 0.11904761904761904), (149, 0.09523809523809523), (150, 0.09523809523809523), (154, 0.09523809523809523), (184, 0.07142857142857142), (225, 0.07142857142857142), (229, 0.07142857142857142), (235, 0.07142857142857142), (298, 0.047619047619047616), (401, 0.047619047619047616)]
Total patterns before MDL: 749
pattern support rsupport id
0 [1, 2, 2, 2] 35 0.833333 0
1 [0, 1, 2, 2, 2] 34 0.809524 1
2 [1, 2, 2, 2, 2] 30 0.714286 2
3 [0, 1, 2, 2, 2, 2] 30 0.714286 3
4 [0, 1, 2, 2] 29 0.690476 4
... ... ... ... ...
516 [0, 0, 1, 2, 1, 1, 1, 2, 2, 1] 1 0.023810 744
515 [0, 0, 1, 2, 1, 1, 1, 2, 1] 1 0.023810 745
514 [2, 0, 0, 1, 2, 2, 3, 2] 1 0.023810 746
513 [2, 0, 0, 1, 2, 2, 3, 2, 3] 1 0.023810 747
748 [0, 0, 2, 2, 2, 2, 2, 2, 1] 1 0.023810 748

749 rows × 4 columns

To improve comprehension, we post-process the patterns and keep only those that compress the data according to the minimum description length (MDL) principle, which dramatically decreases the number of patterns.
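
As a rough intuition (this is not the exact encoding used by post_process_mdl), a pattern saves bits when storing the pattern once plus a reference per occurrence is cheaper than spelling out its symbols in every occurrence:

import math

def naive_bits_saved(pattern_length, support, no_symbols, no_patterns):
    bits_per_symbol = math.log2(no_symbols)
    # encode every occurrence symbol by symbol
    cost_without = support * pattern_length * bits_per_symbol
    # store the pattern once, plus one reference per occurrence
    cost_with = pattern_length * bits_per_symbol + support * math.log2(max(no_patterns, 2))
    return cost_without - cost_with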

In [12]:
patterns_df, embedding = post_process_mdl(patterns_df, embedding, segments_d, no_symbols=10, duration=1.2, verbose=False)
print(f"Total patterns after MDL: {patterns_df.shape[0]}")
display(patterns_df)
print(embedding[0])
Total patterns after MDL: 50
pattern support rsupport id instances bits_saved
3 [0, 1, 2, 2, 2, 2] 30 0.714286 3 [0, 1, 2, 3, 4, 7, 9, 10, 11, 13, 14, 18, 19, ... 205
7 [0, 0, 1, 2, 2, 2] 27 0.642857 7 [2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 18, 19... 192
2 [1, 2, 2, 2, 2] 30 0.714286 2 [0, 1, 2, 3, 4, 7, 9, 10, 11, 13, 14, 18, 19, ... 187
9 [0, 0, 1, 2, 2, 2, 2] 23 0.547619 9 [2, 3, 4, 7, 9, 10, 11, 13, 14, 18, 19, 23, 24... 182
1 [0, 1, 2, 2, 2] 34 0.809524 1 [0, 1, 2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, ... 178
6 [0, 0, 1, 2, 2] 27 0.642857 6 [2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, 18, 19... 169
5 [0, 0, 1, 2] 28 0.666667 5 [2, 3, 4, 7, 10, 11, 12, 14, 16, 17, 18, 19, 2... 152
0 [1, 2, 2, 2] 35 0.833333 0 [0, 1, 2, 3, 4, 7, 9, 10, 11, 12, 13, 14, 15, ... 147
8 [2, 2, 2, 2] 25 0.595238 8 [1, 4, 7, 9, 10, 11, 13, 14, 18, 19, 22, 23, 2... 141
4 [0, 1, 2, 2] 29 0.690476 4 [0, 1, 2, 3, 4, 7, 10, 11, 12, 14, 18, 19, 21,... 126
16 [0, 0, 1, 2, 2, 2, 2, 2] 15 0.357143 16 [2, 3, 7, 10, 11, 13, 14, 18, 23, 24, 25, 26, ... 124
14 [0, 1, 2, 2, 2, 2, 2] 16 0.380952 14 [1, 2, 3, 7, 10, 11, 13, 14, 18, 23, 24, 25, 2... 122
15 [1, 2, 2, 2, 2, 2] 16 0.380952 15 [1, 2, 3, 7, 10, 11, 13, 14, 18, 23, 24, 25, 2... 111
13 [2, 2, 2, 2, 2] 16 0.380952 13 [1, 2, 3, 7, 10, 11, 13, 14, 18, 23, 24, 25, 2... 106
12 [0, 0, 2, 2, 2, 2] 17 0.404762 12 [4, 7, 10, 11, 14, 18, 19, 23, 24, 25, 26, 27,... 92
24 [0, 1, 2, 2, 2, 2, 3] 11 0.261905 24 [0, 11, 18, 19, 26, 27, 28, 33, 34, 35, 41] 91
28 [0, 1, 2, 2, 2, 3, 3] 10 0.238095 28 [0, 19, 20, 21, 27, 28, 33, 34, 35, 41] 87
25 [1, 2, 2, 2, 2, 3] 11 0.261905 25 [0, 11, 18, 19, 26, 27, 28, 33, 34, 35, 41] 84
11 [0, 0, 2, 2, 2] 22 0.523810 10 [1, 4, 7, 10, 11, 14, 18, 19, 22, 23, 24, 25, ... 82
23 [0, 1, 2, 2, 2, 3] 12 0.285714 21 [0, 11, 18, 19, 20, 21, 27, 28, 33, 34, 35, 41] 81
29 [1, 2, 2, 2, 3, 3] 10 0.238095 29 [0, 19, 20, 21, 27, 28, 33, 34, 35, 41] 81
19 [1, 2, 2, 2, 1] 13 0.309524 19 [2, 3, 4, 9, 12, 15, 16, 17, 22, 29, 30, 31, 36] 78
30 [2, 2, 2, 3, 3] 10 0.238095 30 [0, 19, 20, 21, 27, 28, 33, 34, 35, 41] 65
18 [1, 1, 2, 2, 2] 13 0.309524 18 [0, 1, 9, 13, 15, 16, 17, 22, 28, 29, 31, 34, 36] 63
26 [2, 2, 2, 2, 3] 11 0.261905 26 [0, 11, 18, 19, 26, 27, 28, 33, 34, 35, 41] 62
21 [1, 2, 2, 2, 3] 12 0.285714 22 [0, 11, 18, 19, 21, 26, 27, 28, 33, 34, 35, 41] 61
10 [0, 2, 2, 2, 2] 22 0.523810 11 [2, 3, 4, 7, 10, 11, 12, 14, 18, 19, 21, 23, 2... 56
17 [2, 2, 2, 1] 15 0.357143 17 [1, 2, 3, 4, 9, 12, 15, 16, 17, 22, 23, 29, 30... 54
108 [2, 2, 2, 3, 3, 3] 5 0.119048 94 [20, 21, 27, 34, 41] 40
101 [0, 1, 2, 2, 2, 2, 3, 3] 5 0.119048 99 [19, 27, 33, 34, 41] 37
27 [0, 2, 2, 2, 2, 2] 10 0.238095 27 [1, 7, 10, 14, 23, 24, 25, 26, 32, 40] 36
75 [2, 1, 0, 1, 2, 2, 2, 2] 6 0.142857 77 [0, 1, 22, 28, 29, 36] 33
110 [1, 2, 2, 2, 2, 3, 2] 5 0.119048 96 [11, 18, 19, 26, 33] 32
111 [2, 2, 2, 2, 3, 2] 5 0.119048 97 [11, 18, 19, 26, 33] 30
99 [0, 2, 2, 2, 2, 3, 3] 5 0.119048 100 [19, 27, 33, 34, 41] 29
32 [1, 0, 1, 2, 2, 2] 9 0.214286 39 [0, 1, 12, 21, 22, 28, 29, 35, 36] 18
77 [2, 1, 0, 1, 2, 2] 6 0.142857 75 [0, 1, 22, 28, 29, 36] 18
105 [2, 2, 2, 2, 1, 1] 5 0.119048 91 [4, 22, 29, 30, 36] 17
33 [1, 0, 1, 2, 2] 9 0.214286 38 [0, 1, 12, 21, 22, 28, 29, 35, 36] 16
39 [2, 0, 1, 2, 2, 2] 9 0.214286 33 [0, 1, 11, 14, 22, 28, 29, 34, 36] 12
153 [2, 1, 0, 1, 2, 2, 2, 2, 1] 4 0.095238 124 [1, 22, 29, 36] 11
22 [2, 2, 2, 3] 12 0.285714 20 [1, 2, 3, 4, 9, 18, 22, 23, 29, 30, 31, 36] 11
130 [2, 1, 0, 2, 2, 2, 2] 4 0.095238 153 [1, 22, 29, 36] 10
76 [2, 1, 0, 2, 2] 6 0.142857 74 [0, 1, 22, 28, 29, 36] 10
115 [0, 1, 2, 2, 3] 4 0.095238 148 [0, 21, 28, 35] 9
40 [2, 1, 1, 2, 2] 9 0.214286 32 [0, 1, 11, 14, 22, 28, 29, 34, 36] 8
134 [1, 0, 0, 1] 4 0.095238 136 [7, 11, 14, 15] 6
133 [1, 1, 1, 2, 2] 4 0.095238 140 [16, 17, 37, 39] 3
136 [1, 0, 0, 1, 2, 2] 4 0.095238 134 [8, 11, 14, 15] 3
186 [0, 0, 1, 2, 1, 2, 2, 2, 1] 3 0.071429 230 [16, 17, 31] 2
[(0, 0.8333333333333334), (1, 0.8095238095238095), (2, 0.7142857142857143), (3, 0.7142857142857143), (4, 0.6904761904761905), (18, 0.30952380952380953), (21, 0.2857142857142857), (22, 0.2857142857142857), (24, 0.2619047619047619), (25, 0.2619047619047619), (26, 0.2619047619047619), (28, 0.23809523809523808), (29, 0.23809523809523808), (30, 0.23809523809523808), (32, 0.21428571428571427), (33, 0.21428571428571427), (38, 0.21428571428571427), (39, 0.21428571428571427), (74, 0.14285714285714285), (75, 0.14285714285714285), (77, 0.14285714285714285), (148, 0.09523809523809523)]

We can visualise each pattern using npbad.visualisation.plot_patterns, which also shows the number of patterns in the pattern-based embedding for each window.

In [13]:
plot_patterns(ts_i_sample, anomalies, segments_d, windows, patterns_df, embedding, interval_width=24,stride=24,no_symbols=10)
discrete plot len w 420, len s420

Finding anomalies using NPBAD

After we have discovered patterns and created a pattern-based embedding for each window, we compute an anomaly score using an isolation forest. Alternatively, we use the frequent pattern outlier factor (FPOF), which is the sum of the relative supports of all patterns matching a window divided by the total number of patterns.
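
As a sketch of this definition (not necessarily the exact implementation of npbad.anomaly_detection.fpof), the FPOF of a single window can be computed from its embedding, i.e. the list of (pattern id, relative support) pairs shown earlier:

def fpof_window(window_embedding, no_patterns):
    # window_embedding: list of (pattern_id, relative_support) pairs for patterns matching the window
    return sum(rsup for _, rsup in window_embedding) / no_patterns

A window covered by few, or only infrequent, patterns gets a low FPOF and is therefore more likely to be anomalous.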

In [14]:
from npbad.anomaly_detection import iso_forest, fpof
#fpof results:
fpof_scores = fpof(embedding, patterns_df.shape[0])
fpof_scores = list(zip(windows, fpof_scores))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(fpof_scores, anomalies)
print(f"IPBAD-FPOF: F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")

#plot
def plot_predicted_anomalies(df, anomalies, an_scores1, t):
    fig, axs = plt.subplots(2, figsize=(30,3), sharex=True)
    plot_data(axs[0], df, anomalies)
    #plot fpof
    timestamps = [window[0] + np.timedelta64(12,'h') for window, score in an_scores1]
    scores1 = [score for window, score in an_scores1]
    axs[1].plot(timestamps, scores1, color='red',alpha=0.5)
    axs[1].axhline(y=t, color='b', linestyle='--')
    axs[1].set_ylabel('FPOF')

plot_predicted_anomalies(ts_i_sample, anomalies, fpof_scores, t)
eval_best_f1: #an_scores: 42, min score:  0.760-max score:  1.000 #anomalies: 3
IPBAD-FPOF: F1@best: 0.4444, precision: 0.3333, recall: 0.6667, threshold: 0.9519, TP: 2, FP: 4, FN:1, TN:35

The previous steps (discretisation, pattern mining and embedding, anomaly detection) can be applied to the full taxi dataset in a single call to npbad.ipbad.main.run_ipbad. In this case we set the stride to 1.

In [15]:
df, anomalies = load_numenta_data('nyc_taxi.csv', 'combined_labels.json', key='realKnownCause/nyc_taxi.csv')
prep_par = {'interval': 96, 'stride':1, 'no_symbols': 5, 'no_bins': 5, 'remove_outliers':False} 
parameter = {'use_iso': True, 'use_MDL': True}

#run anomaly detection
an_scores, ts_i, segments_discretised, windows = run_ipbad(
                df=df, interval=prep_par["interval"], stride=prep_par["stride"], 
                no_symbols=prep_par["no_symbols"], no_bins=prep_par["no_bins"], 
                preprocess_outlier=prep_par["remove_outliers"], 
                binning_method='normal',
                use_MDL=parameter['use_MDL'], use_iso=parameter['use_iso'],
                verbose=False)   

#evaluate                                                                          
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_scores, anomalies)
print(f"F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")

#plot
plot_predicted_anomalies(df, anomalies, an_scores, t)
fill_missing: adding (relative) +0.0% more values (from 10320 to 10320)
loading data. start: 2014-07-01 00:00:00. end: 2015-01-31 23:30:00. interval: 30.0m
anomalies: #5. timestamps: [numpy.datetime64('2014-11-01T19:00:00'), numpy.datetime64('2014-11-27T15:30:00'), numpy.datetime64('2014-12-25T15:00:00'), numpy.datetime64('2015-01-01T01:00:00'), numpy.datetime64('2015-01-27T00:00:00')]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.4463487956354264), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4463487956354264), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4339457410193874), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4421260164701807), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4421260164701807)]
eval_best_f1: #an_scores: 5064, min score:  0.434-max score:  0.551 #anomalies: 5
F1@best: 0.8389, precision: 0.8798, recall: 0.8017, threshold: 0.4807, TP: 388, FP: 53, FN:96, TN:4527

Optimising parameters

The performance of anomaly detection depends on a good symbolic representation. Therefore we should optimise the hyper-parameters. An example using grid search is shown below.

In [16]:
results_ipbad = []
grid = cart_product(interval=[110,96,72], no_symbols=[5,6,7], no_bins=[5,6,7])   
for preprocess_parameter in tqdm(grid):
  try:
    an_scores, ts_i, segments_discretised, windows = run_ipbad(df=df, interval=preprocess_parameter["interval"], stride=1, 
                                                               no_symbols=preprocess_parameter["no_symbols"], no_bins=preprocess_parameter["no_bins"], 
                                                               preprocess_outlier=False, binning_method='normal',verbose=False,
                                                               use_MDL=True, use_iso=True)
    precAt10, recAt10 = eval_recall_at_k(an_scores, anomalies, threshold_top_k=50, K=10)
    results_ipbad.append((recAt10, precAt10, preprocess_parameter))
  except Exception as e:
    print(e)
df_results = pd.DataFrame.from_records(results_ipbad, columns=["recall@10", "precision@10", "parameters"])
df_results = df_results.sort_values(by=["precision@10","recall@10"], ascending=False)
display(df_results)
  4%|▎         | 1/27 [00:12<05:21, 12.38s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.42616524372456305), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4434734481791104), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4434734481791104), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4434734481791104), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4446216854351137)]
  7%|▋         | 2/27 [00:23<04:57, 11.91s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4525372950073978), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4525372950073978), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4525372950073978), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4525372950073978), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4525372950073978)]
 11%|█         | 3/27 [00:35<04:45, 11.88s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.44800609398808533), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.44800609398808533), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.44800609398808533), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.44800609398808533), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.45394088377986963)]
 15%|█▍        | 4/27 [00:49<04:45, 12.42s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4356586872495356), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4356586872495356), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4356586872495356), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4512666884361035), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4512666884361035)]
 19%|█▊        | 5/27 [01:03<04:50, 13.19s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4384550040007865), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4384550040007865), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4384550040007865), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4384550040007865), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4384550040007865)]
 22%|██▏       | 6/27 [01:17<04:43, 13.49s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4488260144691304), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4488260144691304), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4488260144691304), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.43472247198116454), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.43366573647480133)]
 26%|██▌       | 7/27 [01:34<04:49, 14.47s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.43656184426414035), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.43656184426414035), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.43656184426414035), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.43656184426414035), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4468493083525255)]
 30%|██▉       | 8/27 [01:52<05:00, 15.83s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4460606977854477), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4361290626640682), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4301848572228986), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4272361604328966), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4272361604328966)]
 33%|███▎      | 9/27 [02:14<05:18, 17.69s/it]
create anomaly scores (isolation_forest). Number of scores: 5050. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 14:00:00')), 0.4392489243124804), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 15:00:00')), 0.4392489243124804), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 16:00:00')), 0.4392489243124804), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 17:00:00')), 0.4392489243124804), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 18:00:00')), 0.4392489243124804)]
 37%|███▋      | 10/27 [02:26<04:30, 15.92s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.4466264968419477), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4466264968419477), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.43425694752672583), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.44059435855515694), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.44059435855515694)]
 41%|████      | 11/27 [02:38<03:57, 14.82s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.43144829955473174), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.43144829955473174), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.43144829955473174), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.41400053546368687), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.41400053546368687)]
 44%|████▍     | 12/27 [02:51<03:33, 14.21s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.41818142374178846), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.41818142374178846), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.41818142374178846), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.41818142374178846), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4290821957478095)]
 48%|████▊     | 13/27 [03:05<03:17, 14.09s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.4420975494329572), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4420975494329572), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4420975494329572), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4420975494329572), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.45072146313974154)]
 52%|█████▏    | 14/27 [03:20<03:05, 14.23s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.4338708111020696), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4398760066262446), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4281394696661027), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4281394696661027), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4281394696661027)]
 56%|█████▌    | 15/27 [03:35<02:56, 14.67s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.41893429290118167), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.41893429290118167), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.41893429290118167), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.41893429290118167), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4393609890483229)]
 59%|█████▉    | 16/27 [03:53<02:49, 15.42s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.46073427365802466), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4460096692920554), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4460096692920554), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4460096692920554), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4460096692920554)]
 63%|██████▎   | 17/27 [04:12<02:46, 16.63s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.4258565446359143), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.4258565446359143), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4279429057058123), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4279429057058123), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4279429057058123)]
 67%|██████▋   | 18/27 [04:34<02:45, 18.34s/it]
create anomaly scores (isolation_forest). Number of scores: 5064. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-05 00:00:00')), 0.43940117699650816), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-05 01:00:00')), 0.43752465700684195), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-05 02:00:00')), 0.4466255955411036), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-05 03:00:00')), 0.4466255955411036), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-05 04:00:00')), 0.4466255955411036)]
 70%|███████   | 19/27 [04:47<02:13, 16.73s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.45334777293701833), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.4250638592139693), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4250638592139693), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4250638592139693), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4250638592139693)]
 74%|███████▍  | 20/27 [05:00<01:48, 15.46s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.4344744103558306), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.4344744103558306), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4232932222614637), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.44765298866788117), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4232932222614637)]
 78%|███████▊  | 21/27 [05:12<01:27, 14.64s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.42057512451717505), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.42057512451717505), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.42057512451717505), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.42057512451717505), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.42057512451717505)]
 81%|████████▏ | 22/27 [05:26<01:11, 14.36s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.4219630897960936), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.4219630897960936), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4219630897960936), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4219630897960936), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4219630897960936)]
 85%|████████▌ | 23/27 [05:41<00:57, 14.50s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.4529256257097837), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.4409042125674256), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4196662422665574), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4276203953315386), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4101258571933938)]
 89%|████████▉ | 24/27 [05:57<00:45, 15.03s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.43070524561655615), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.4359067015035589), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4194502969248387), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4194502969248387), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.41136464454434124)]
 93%|█████████▎| 25/27 [06:18<00:33, 16.62s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.43180270745884475), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.44375367807657096), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.43172674581968584), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.43172674581968584), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.43172674581968584)]
 96%|█████████▋| 26/27 [06:42<00:18, 18.96s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.448582958014484), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.43080564349464634), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4470246789370688), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4470246789370688), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4383292027388669)]
100%|██████████| 27/27 [07:09<00:00, 15.90s/it]
create anomaly scores (isolation_forest). Number of scores: 5088. First 5: [((Timestamp('2014-07-01 00:00:00'), Timestamp('2014-07-04 00:00:00')), 0.43916756063615414), ((Timestamp('2014-07-01 01:00:00'), Timestamp('2014-07-04 01:00:00')), 0.43916756063615414), ((Timestamp('2014-07-01 02:00:00'), Timestamp('2014-07-04 02:00:00')), 0.4369888347898041), ((Timestamp('2014-07-01 03:00:00'), Timestamp('2014-07-04 03:00:00')), 0.4442076248399467), ((Timestamp('2014-07-01 04:00:00'), Timestamp('2014-07-04 04:00:00')), 0.4332083500723546)]

recall@10 precision@10 parameters
9 1.2 1.000000 {'interval': 96, 'no_symbols': 5, 'no_bins': 5}
3 1.0 1.000000 {'interval': 110, 'no_symbols': 6, 'no_bins': 5}
11 0.8 1.000000 {'interval': 96, 'no_symbols': 5, 'no_bins': 7}
12 0.6 1.000000 {'interval': 96, 'no_symbols': 6, 'no_bins': 5}
0 0.2 1.000000 {'interval': 110, 'no_symbols': 5, 'no_bins': 5}
2 0.2 1.000000 {'interval': 110, 'no_symbols': 5, 'no_bins': 7}
18 1.0 0.454545 {'interval': 72, 'no_symbols': 5, 'no_bins': 5}
1 0.6 0.375000 {'interval': 110, 'no_symbols': 5, 'no_bins': 6}
19 0.8 0.363636 {'interval': 72, 'no_symbols': 5, 'no_bins': 6}
4 0.4 0.200000 {'interval': 110, 'no_symbols': 6, 'no_bins': 6}
10 0.4 0.200000 {'interval': 96, 'no_symbols': 5, 'no_bins': 6}
5 0.4 0.181818 {'interval': 110, 'no_symbols': 6, 'no_bins': 7}
15 0.4 0.181818 {'interval': 96, 'no_symbols': 7, 'no_bins': 5}
21 0.4 0.181818 {'interval': 72, 'no_symbols': 6, 'no_bins': 5}
6 0.2 0.090909 {'interval': 110, 'no_symbols': 7, 'no_bins': 5}
8 0.2 0.090909 {'interval': 110, 'no_symbols': 7, 'no_bins': 7}
13 0.2 0.090909 {'interval': 96, 'no_symbols': 6, 'no_bins': 6}
14 0.2 0.090909 {'interval': 96, 'no_symbols': 6, 'no_bins': 7}
16 0.2 0.090909 {'interval': 96, 'no_symbols': 7, 'no_bins': 6}
23 0.2 0.090909 {'interval': 72, 'no_symbols': 6, 'no_bins': 7}
26 0.2 0.090909 {'interval': 72, 'no_symbols': 7, 'no_bins': 7}
7 0.0 0.000000 {'interval': 110, 'no_symbols': 7, 'no_bins': 6}
17 0.0 0.000000 {'interval': 96, 'no_symbols': 7, 'no_bins': 7}
20 0.0 0.000000 {'interval': 72, 'no_symbols': 5, 'no_bins': 7}
22 0.0 0.000000 {'interval': 72, 'no_symbols': 6, 'no_bins': 6}
24 0.0 0.000000 {'interval': 72, 'no_symbols': 7, 'no_bins': 5}
25 0.0 0.000000 {'interval': 72, 'no_symbols': 7, 'no_bins': 6}

Multivariate or entity-based anomaly detection

As a second use case we detect anomalies in a multivariate dataset containing multiple time series, or sensors, related to a server, such as CPU and RAM usage.

Loading the SMD dataset

The Server Machine Dataset (SMD) is split into a train and a test part, where the test part consists of the last 50% of sensor values. For model training we concatenate both parts and use all data for frequent pattern mining, ignoring the labels until evaluation. We load the first machine from SMD; the complete dataset is available here. In SMD each machine consists of 38 different sensors.

In [17]:
data_folder = './pbad_network_private/data/SMD_sample/'
machine = 'machine-1-1'
machine_1_test = f'{machine}_test.csv'
machine_1_train = f'{machine}_train.csv'
machine_1_test_labels = f'{machine}_test_label.csv'
machine_1_interpretation_label = f'{machine}_interpretation_label.csv'

#Load training and test data. The original data has only 'time_id', so we add actual timestamps.
#We concatenate both datasets, since the test data comes after the training data.
#We only have labels for the test part and thus only predict anomalies during the test period.
df_train = pd.read_csv(data_folder + machine_1_train) #i.e. contains metric-1,...,metric-38,time_id,time
df_train['time'] = pd.to_datetime(df_train['time'])
df_train = df_train.drop(columns=['time_id'])
df_test = pd.read_csv(data_folder + machine_1_test) #i.e. contains metric-1,...,metric-38,time_id,time
df_test['time'] = pd.to_datetime(df_test['time'])
df_test = df_test.drop(columns=['time_id'])
#Test data comes right after the training data, so shift its timestamps accordingly
diff_seconds = int((df_train['time'].max() - df_test['time'].min()) / np.timedelta64(1, 's'))
df_test['time'] = df_test['time'].apply(lambda x: x + np.timedelta64(diff_seconds,'s'))
df = pd.concat([df_train, df_test])    
labels_df = pd.read_csv(data_folder + machine_1_test_labels) #i.e. contains label,time_id,time where label==1
labels_df['time'] = pd.to_datetime(labels_df['time'])
#also for labels
labels_df['time'] = labels_df['time'].apply(lambda x: x + np.timedelta64(diff_seconds,'s'))
anomalies = [row['time'] for idx, row in labels_df.iterrows()]
print(f'No anomalies: {len(anomalies)}. First: {anomalies[0]}')
pd.options.display.float_format = '{:,.3f}'.format
display(df)
No anomalies: 2694. First: 2022-01-31 18:47:00
metric-1 metric-2 metric-3 metric-4 metric-5 metric-6 metric-7 metric-8 metric-9 metric-10 ... metric-30 metric-31 metric-32 metric-33 metric-34 metric-35 metric-36 metric-37 metric-38 time
0 0.032 0.039 0.028 0.024 0.000 0.915 0.344 0.000 0.020 0.000 ... 0.004 0.030 0.022 0.000 0.000 0.035 0.035 0.000 0.000 2022-01-01 00:00:00
1 0.043 0.049 0.033 0.026 0.000 0.915 0.345 0.000 0.019 0.002 ... 0.004 0.030 0.029 0.000 0.000 0.036 0.036 0.000 0.000 2022-01-01 00:01:00
2 0.043 0.035 0.032 0.026 0.000 0.915 0.345 0.000 0.020 0.000 ... 0.004 0.026 0.021 0.000 0.000 0.033 0.033 0.000 0.000 2022-01-01 00:02:00
3 0.032 0.029 0.030 0.024 0.000 0.913 0.343 0.000 0.021 0.000 ... 0.004 0.030 0.026 0.000 0.000 0.035 0.035 0.000 0.000 2022-01-01 00:03:00
4 0.032 0.019 0.027 0.023 0.000 0.913 0.343 0.000 0.019 0.000 ... 0.004 0.027 0.023 0.000 0.000 0.033 0.034 0.000 0.000 2022-01-01 00:04:00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28474 0.075 0.052 0.048 0.048 0.000 0.908 0.257 0.000 0.044 0.000 ... 0.032 0.047 0.040 0.000 0.000 0.043 0.043 0.000 0.000 2022-02-09 13:12:00
28475 0.065 0.025 0.039 0.044 0.000 0.905 0.257 0.000 0.033 0.000 ... 0.032 0.047 0.049 0.000 0.000 0.047 0.047 0.000 0.000 2022-02-09 13:13:00
28476 0.065 0.081 0.050 0.048 0.000 0.908 0.258 0.000 0.026 0.001 ... 0.032 0.047 0.040 0.000 0.000 0.043 0.043 0.000 0.000 2022-02-09 13:14:00
28477 0.065 0.056 0.048 0.046 0.000 0.903 0.257 0.000 0.033 0.000 ... 0.032 0.042 0.043 0.000 0.000 0.040 0.040 0.000 0.000 2022-02-09 13:15:00
28478 0.011 0.030 0.041 0.044 0.000 0.892 0.256 0.000 0.031 0.002 ... 0.034 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 2022-02-09 13:16:00

56958 rows × 39 columns

Visualising the SMD dataset

In [18]:
from npbad.visualisation import plot_mv_data
plot_mv_data('machine-1-1', df, anomalies)

Pre-processing the multivariate dataset

In the raw data we find many correlated time series. For anomaly detection we pre-process the data and remove correlated time series. The function pre_process_mv_df does the following (a rough sketch of the correlation filter is shown below):

  1. Removes time series that are almost perfectly correlated with another series (Pearson correlation coefficient > 0.95)
  2. Removes straight lines, i.e. constant time series
  3. Clusters the remaining time series that are correlated with a Pearson correlation coefficient > 0.85

After pre-processing we keep only 16 of the 38 time series for the first device.
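
As a rough sketch of the first step (a hypothetical helper, not the library implementation), assuming a dataframe with a 'time' column and one column per sensor:

import pandas as pd

def drop_correlated(df, threshold=0.95):
    # greedily keep a sensor only if it is not strongly correlated with an already kept one
    value_cols = [c for c in df.columns if c != 'time']
    corr = df[value_cols].corr(method='pearson').abs()
    keep = []
    for col in value_cols:
        if all(corr.loc[col, kept] <= threshold for kept in keep):
            keep.append(col)
    return df[['time'] + keep]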

In [19]:
from npbad.preprocess_timeseries import pre_process_mv_df

df2, components = pre_process_mv_df(df)
plot_mv_data('machine-1-1', df2, anomalies)
100%|██████████| 703/703 [00:01<00:00, 649.06it/s]
find_similarities_pearson_simple: elapsed: 1.0972445011138916s
found 4 components: [{1, 2, 3, 34, 35, 18, 19, 20, 21, 24, 27, 30, 31}, {10, 15}, {25, 23}, {32, 33, 28}]
Removing: Before: 38->16 straight: ['metric-5', 'metric-8', 'metric-17', 'metric-18', 'metric-37', 'metric-38'] and perfect correlated: ['metric-3', 'metric-4', 'metric-19', 'metric-20', 'metric-21', 'metric-22', 'metric-25', 'metric-28', 'metric-31', 'metric-32', 'metric-35', 'metric-36', 'metric-16', 'metric-26', 'metric-33', 'metric-34']
100%|██████████| 120/120 [00:00<00:00, 702.37it/s]
find_similarities_pearson_simple: elapsed: 0.1827538013458252s
found 3 components: [{9, 4}, {8, 7}, {12, 13, 14}]
[['metric-14', 'metric-9'], ['metric-12', 'metric-13'], ['metric-24', 'metric-27', 'metric-29']]

Finding anomalies using an isolation forest

In [20]:
from npbad.baselines.iso_forest import run_iso_forest_mv
preprocess_parameter = {'interval': 72, 'stride':1, 'no_symbols': 4, 'no_bins': 5,  'preprocess_outlier': False}

#run iso forest
an_scores = run_iso_forest_mv(df2, interval=preprocess_parameter['interval'], 
                                   stride=preprocess_parameter['stride'], 
                              preprocess_outlier=preprocess_parameter['preprocess_outlier'])

#evaluate (using only test data which has labelled anomalies)
an_score_second_half = []
for window, score in an_scores: 
  if window[0] > df_test['time'].min():
        an_score_second_half.append((window,score))

(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies)
print(f"ISO F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")

#without point-adjust:
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"ISO W/O F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
run_iso_forest_mv
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (878). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
eval_best_f1: #an_scores: 403, min score:  0.423-max score:  0.475 #anomalies: 2694
ISO F1@best: 1.0000, precision: 1.0000, recall: 1.0000, threshold: 0.4439, TP: 211, FP: 0, FN:0, TN:192
eval_best_f1: #an_scores: 403, min score:  0.423-max score:  0.475 #anomalies: 2694
ISO W/O F1@best: 0.8138, precision: 0.9273, recall: 0.7251, threshold: 0.4387, TP: 153, FP: 12, FN:58, TN:180

Finding anomalies using NPBAD

In [22]:
from npbad.ipbad.main import run_ipbad_mv
an_scores = run_ipbad_mv(df2,  interval=preprocess_parameter['interval'], stride=preprocess_parameter['stride'], 
                              no_symbols=preprocess_parameter['no_symbols'], no_bins=preprocess_parameter['no_bins'], 
                              preprocess_outlier=preprocess_parameter['preprocess_outlier'], 
                              use_MDL=True, binning_method='normal', use_iso=True)
#eval after time
an_score_second_half = []
for window, score in an_scores:
  if window[0] > df_test['time'].min():
      an_score_second_half.append((window,score))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies)
print(f"IPBAD F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")

#without point-adjust:
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"IPBAD W/O F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
Warning: ipbad: fewer than 5 patterns
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (878). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 31.217990159988403s
eval_best_f1: #an_scores: 403, min score:  0.421-max score:  0.549 #anomalies: 2694
IPBAD F1@best: 1.0000, precision: 1.0000, recall: 1.0000, threshold: 0.4980, TP: 211, FP: 0, FN:0, TN:192
eval_best_f1: #an_scores: 403, min score:  0.421-max score:  0.549 #anomalies: 2694
IPBAD W/O F1@best: 0.7121, precision: 0.6040, recall: 0.8673, threshold: 0.4342, TP: 183, FP: 120, FN:28, TN:72
In [23]:
plot_mv_data('machine-1-1', df2, anomalies, an_scores_computed=an_score_second_half, threshold=t)

Optimising parameters

We can use random search to find optimal parameters.

Note that for parameter optimisation we prefer to set point-adjust to False. In that case an anomaly counts as detected only if a window contains the anomaly and that window's anomaly score is above the threshold. With point-adjust we also consider other (overlapping) windows and check whether one of those has a high anomaly score: we report a true positive if a neighbouring window with a higher anomaly score covers that anomaly, even when the current window scores below the threshold. This is, however, an (overly) optimistic evaluation protocol.
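
The difference can be illustrated with a small hypothetical helper (not the eval_best_f1 implementation): with point-adjust, an anomaly is credited as soon as any window covering it exceeds the threshold.

def point_adjust_tp(an_scores, anomalies, threshold):
    # an_scores: list of ((window_start, window_end), score); anomalies: list of timestamps
    tp = 0
    for anomaly in anomalies:
        covering = [score for (start, end), score in an_scores if start <= anomaly <= end]
        if covering and max(covering) > threshold:
            tp += 1  # counted even if some covering windows score below the threshold
    return tp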

In [24]:
import random
grid={"interval": [24,48,72,96],
      "stride": [6],
      "no_symbols": list(range(4,20,2)), 
      "no_bins" : list(range(3,20,2)) + [50,100],
      "preprocess_outlier": [True, False], 
      "use_MDL": [True],
      "binning_method": ['normal'],
      "use_iso": [True]
}

no_iter = 5
best_parameters = None
best_metrics = None
best_an_scores = None
while no_iter > 0:
    random_parameter = {}
    for key, options in grid.items():
        idx = random.randint(0,len(options)-1)
        random_parameter[key] = options[idx]
    try:
        an_scores = run_ipbad_mv(df2,  **random_parameter)
        an_score_second_half = []
        for window, score in an_scores:
            if window[0] > df_test['time'].min():
                an_score_second_half.append((window,score))
        (f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
        print(f"Parameters: {random_parameter},\n F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}")
        no_iter = no_iter - 1
        if best_metrics is None or f1 > best_metrics[0]:
            best_metrics =   (f1,prec,rec,TP,FP,FN,TN, t)
            best_parameters = random_parameter
            best_an_scores = an_scores
    except Exception as e:
        print(e)
print('#### Best result ####')
(f1,prec,rec,TP,FP,FN,TN, t) = best_metrics
print(f"Parameters: {best_parameters}")
print(f"F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN: {TN}")
#best parameters
#Parameters: {'interval': 96, 'stride': 6, 'no_symbols': 12, 'no_bins': 50, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
#F1@best: 0.8451, precision: 0.8333, recall: 0.8571, threshold: 0.4618, TP: 30, FP: 6, FN:5
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (147). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 39.26453971862793s
eval_best_f1: #an_scores: 67, min score:  0.439-max score:  0.508 #anomalies: 2694
Parameters: {'interval': 72, 'stride': 6, 'no_symbols': 10, 'no_bins': 11, 'preprocess_outlier': True, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
 F1@best: 0.6863, precision: 0.5224, recall: 1.0000, threshold: 0.4393, TP: 35, FP: 32, FN:0
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (155). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 16.383702993392944s
eval_best_f1: #an_scores: 75, min score:  0.434-max score:  0.484 #anomalies: 2694
Parameters: {'interval': 24, 'stride': 6, 'no_symbols': 6, 'no_bins': 13, 'preprocess_outlier': True, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
 F1@best: 0.5577, precision: 0.3867, recall: 1.0000, threshold: 0.4341, TP: 29, FP: 46, FN:0
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (147). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 36.10798454284668s
eval_best_f1: #an_scores: 67, min score:  0.461-max score:  0.519 #anomalies: 2694
Parameters: {'interval': 72, 'stride': 6, 'no_symbols': 14, 'no_bins': 7, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
 F1@best: 0.7292, precision: 0.5738, recall: 1.0000, threshold: 0.4729, TP: 35, FP: 26, FN:0
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (147). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 48.92760396003723s
eval_best_f1: #an_scores: 67, min score:  0.431-max score:  0.558 #anomalies: 2694
Parameters: {'interval': 72, 'stride': 6, 'no_symbols': 16, 'no_bins': 5, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
 F1@best: 0.8571, precision: 0.7857, recall: 0.9429, threshold: 0.4945, TP: 33, FP: 9, FN:2
/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_iforest.py:293: UserWarning: max_samples (1500) is greater than the total number of samples (147). max_samples will be set to n_samples for estimation.
  % (self.max_samples, n_samples)
run_mv_hetero elapsed 45.18245053291321s
eval_best_f1: #an_scores: 67, min score:  0.463-max score:  0.502 #anomalies: 2694
Parameters: {'interval': 72, 'stride': 6, 'no_symbols': 12, 'no_bins': 5, 'preprocess_outlier': True, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
 F1@best: 0.6863, precision: 0.5224, recall: 1.0000, threshold: 0.4632, TP: 35, FP: 32, FN:0
#### Best result ####
Parameters: {'interval': 72, 'stride': 6, 'no_symbols': 16, 'no_bins': 5, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True}
F1@best: 0.8571, precision: 0.7857, recall: 0.9429, threshold: 0.4945, TP: 33, FP: 9, FN:2, TN: 23

Interpretable multivariate anomaly detection based on the frequent-pattern outlier factor (FPOF)

In [39]:
par_fpof = {'interval': 6, 'stride': 6, 'no_symbols': 10, 'no_bins': 10, 'preprocess_outlier': False,
            'use_MDL': True, 'binning_method': 'normal', 'use_iso': False}
df2_sample = df2  # alternative: df2[df2['time'] > np.datetime64('2022-02-01')]
an_scores = run_ipbad_mv(df2_sample, **par_fpof)
# keep only the scores for windows after the start of df_test
an_score_second_half = []
for window, score in an_scores:
    if window[0] > df_test['time'].min():
        an_score_second_half.append((window, score))
(f1, prec, rec, TP, FP, FN, TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"Parameters: {par_fpof},\n F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}")
# previously: F1@best: 0.4528, precision: 0.3077, recall: 0.8571, threshold: 0.9876, TP: 12, FP: 27, FN:2
# previously: F1@best: 0.5000, precision: 0.6000, recall: 0.4286, threshold: 0.9964, TP: 6, FP: 4, FN:8
run_mv_hetero elapsed 20.057478666305542s
eval_best_f1: #an_scores: 78, min score:  0.978-max score:  1.000 #anomalies: 2694
Parameters: {'interval': 6, 'stride': 6, 'no_symbols': 10, 'no_bins': 10, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': False},
 F1@best: 0.5217, precision: 0.6667, recall: 0.4286, threshold: 0.9933, TP: 6, FP: 3, FN:8

Plot of discrete representation and known anomalies per sensor

In SMD, additional interpretation labels indicate which sensors caused each anomaly. First, we load this data.

In [40]:
# SMD contains an interpretation label, which shows which sensors caused each anomaly
df_int_labels = pd.read_csv(data_folder + machine_1_interpretation_label)
df_int_labels = pd.merge(df_int_labels, labels_df, how='left', left_on='from_time_id', right_on='time_id')
df_int_labels = df_int_labels.rename(columns={"time": "from_time"})
# convert to_time_id (a minute index) to a timestamp; time_id 15849 corresponds to 2022-01-31 18:47:00
df_int_labels['to_time'] = df_int_labels['to_time_id'].apply(lambda i: np.datetime64('2022-01-31 18:47:00') + np.timedelta64(i - 15849, 'm'))

display(df_int_labels)

def get_anomalies_col(anomalies, col):
    """Return the anomalies that, according to the interpretation labels, involve the given sensor."""
    matching_anomalies = []
    for anomaly in anomalies:
        found = False
        filter_df = df_int_labels[(df_int_labels['from_time'] <= anomaly) & (df_int_labels['to_time'] >= anomaly)]
        for idx, row in filter_df.iterrows():
            if '-C' in col:
                col = col[0:col.index('-C')]  # strip the cluster suffix, e.g. 'metric-14-C1' -> 'metric-14'
            if col in row['dimensions']:
                found = True
                break
        if found:
            matching_anomalies.append(anomaly)
    return matching_anomalies
   from_time_id  to_time_id  dimensions                                          label  time_id  from_time            to_time
0  15849         16368       ['metric-1', 'metric-9', 'metric-10', 'metric-...  1.000  15849    2022-01-31 18:47:00  2022-02-01 03:26:00
1  16963         17517       ['metric-1', 'metric-2', 'metric-3', 'metric-4...  1.000  16963    2022-02-01 13:21:00  2022-02-01 22:35:00
2  18071         18528       ['metric-1', 'metric-2', 'metric-9', 'metric-1...  1.000  18071    2022-02-02 07:49:00  2022-02-02 15:26:00
3  19367         20088       ['metric-1', 'metric-2', 'metric-3', 'metric-4...  1.000  19367    2022-02-03 05:25:00  2022-02-03 17:26:00
4  20786         21195       ['metric-1', 'metric-9', 'metric-10', 'metric-...  1.000  20786    2022-02-04 05:04:00  2022-02-04 11:53:00
5  24679         24682       ['metric-9', 'metric-13', 'metric-14', 'metric...  1.000  24679    2022-02-06 21:57:00  2022-02-06 22:00:00
6  26114         26116       ['metric-9', 'metric-13', 'metric-14', 'metric...  1.000  26114    2022-02-07 21:52:00  2022-02-07 21:54:00
7  27554         27556       ['metric-9', 'metric-13', 'metric-14', 'metric...  1.000  27554    2022-02-08 21:52:00  2022-02-08 21:54:00

Next, we show each time series, the anomalies associated with it, and the corresponding discretised time series.

In [41]:
no_dim = len(df2.columns)
fig, axs = plt.subplots(no_dim, figsize=(30,30), sharex=True)
for i, col in enumerate(df2.columns):
  if col == 'time':
    continue
  axs[i].plot(df2_sample['time'], df2_sample[col], color='blue',alpha=0.5)
  #show anomalies for known sensors
  anomalies_dim = get_anomalies_col(anomalies, col)
  axs[i].scatter(anomalies_dim, [1.0 for anomaly in anomalies_dim], color='red', marker='X', alpha=0.5, s=5)
  #discretise
  ts_i = df2_sample[['time',col]].copy()
  ts_i = ts_i.rename(columns={col:'average_value'})
  windows = create_windows(ts_i, interval=par_fpof['interval'], stride=par_fpof['stride']) 
  segments = create_segments(ts_i, windows)
  segments_d = discretise_segments_equal_distance_bins_global(segments, par_fpof['no_bins'], par_fpof['no_symbols'])
  print(windows[0],segments_d[0],len(windows))
  #plot discretise
  axs[i].text(df2_sample['time'].min() - np.timedelta64(1,'D'), 0, f'{col}', fontsize=12)
  plot_discrete(axs[i], segments_d, windows, interval_width=par_fpof['interval'], stride=par_fpof['stride'], no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'])
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [9 9 9 9 9 9 9 9 9 9] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [3 3 3 3 3 3 3 3 3 3] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [1 1 1 1 1 1 1 1 1 1] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [4 4 3 2 1 2 1 1 1 1] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580
(Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')) [0 0 0 0 0 0 0 0 0 0] 158
discrete plot len w 1580, len s1580

FPOF score for each sensor

Next, we compute the frequent-pattern outlier factor (FPOF) for each time series individually. The FPOF anomaly score is often less accurate than using an isolation forest, but it is easy to interpret: roughly, it is the ratio of the number of patterns matching each window to the total number of patterns. To be fully technically correct, the frequent-pattern outlier factor is the weighted sum of the relative supports of the matching patterns, divided by the total number of patterns.
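To make this definition concrete, here is a minimal sketch of an FPOF-style score computed on a hypothetical binary pattern-occurrence matrix. This illustrates the idea only and is not the npbad implementation; the occurrence values are made up for the example.

import numpy as np

# Hypothetical occurrence matrix: rows are windows, columns are patterns,
# entry [w, p] = 1 if pattern p occurs in window w (made-up values for illustration).
occurrences = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],   # window that matches only one, rare, pattern
    [1, 1, 1, 1],
])

# Relative support of each pattern: the fraction of windows in which it occurs.
support = occurrences.mean(axis=0)

# FPOF-style score per window: weighted sum of the supports of the matching patterns,
# divided by the total number of patterns. In the classical FPOF formulation, windows
# with a low score contain few frequent patterns and are considered the most anomalous.
fpof = (occurrences * support).sum(axis=1) / occurrences.shape[1]
print(fpof)  # the third window receives the lowest score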

In the first plot, we render each time series of the first device in the SMD dataset, the corresponding discretised time series, and the anomaly score computed with FPOF. We skip time series without patterns, i.e. time series that are mostly flat after discretisation.

In [42]:
from npbad.visualisation import *

# show the FPOF score for each dimension
columns = [col for col in df2_sample.columns if col != 'time']

fig, axs = plt.subplots(len(columns), figsize=(30, 30), sharex=True)

records = []
for i, col in enumerate(columns):
    ts_i = df2_sample[['time', col]].copy()
    ts_i = ts_i.rename(columns={col: 'average_value'})
    print(f'run-single(col={col})')
    result = run_ipbad(ts_i, interval=par_fpof['interval'], stride=par_fpof['stride'],
                       no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'],
                       preprocess_outlier=par_fpof['preprocess_outlier'],
                       use_MDL=par_fpof['use_MDL'], binning_method=par_fpof['binning_method'],
                       use_iso=False, verbose=False, return_patterns=True)  # return_patterns=True also returns the mined patterns and embedding
    if len(result) == 4:
        an_scores, ts_i, segments_discretised, windows = result
        records.append([an_scores, ts_i, segments_discretised, windows, None, None])
    else:
        an_scores, ts_i, segments_discretised, windows, patterns_df, embedding = result
        records.append([an_scores, ts_i, segments_discretised, windows, patterns_df, embedding])
    # plot the raw time series
    axs[i].plot(df2_sample['time'], df2_sample[col], color='blue', alpha=0.5)
    # show anomalies for known sensors
    anomalies_dim = get_anomalies_col(anomalies, col)
    axs[i].scatter(anomalies_dim, [1.0 for anomaly in anomalies_dim], color='red', marker='X', alpha=0.5, s=5)
    # plot the discretised segments
    plot_discrete(axs[i], segments_discretised, windows,
                  interval_width=par_fpof['interval'], stride=par_fpof['stride'], no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'])
    # plot the (min-max normalised) anomaly scores
    if len(result) != 6:
        continue
    timestamps = [average_timestamps(window[0], window[1]) for window, score in an_scores]
    scores = [score for window, score in an_scores]
    if (max(scores) - min(scores)) != 0:
        scores = [(score - min(scores)) / (max(scores) - min(scores)) for score in scores]
    axs[i].plot(timestamps, scores, color='orange', marker='X', alpha=0.5)
    axs[i].text(timestamps[0] - np.timedelta64(2, 'D'), 0, f'{col}', fontsize=12)
run-single(col=metric-1)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9228256431196407), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.9228256431196407), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.9228256431196407), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.9228256431196407), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9228256431196407)]
discrete plot len w 1580, len s1580
run-single(col=metric-2)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9145569620253164), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.9145569620253164), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.9145569620253164), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.9145569620253164), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9145569620253164)]
discrete plot len w 1580, len s1580
run-single(col=metric-6)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.8349433710859427), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.8349433710859427), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.8349433710859427), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.8349433710859427), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.8349433710859427)]
discrete plot len w 1580, len s1580
run-single(col=metric-7)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9525316455696202), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.9525316455696202), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.9525316455696202), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.9525316455696202), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9525316455696202)]
discrete plot len w 1580, len s1580
run-single(col=metric-10)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0)]
discrete plot len w 1580, len s1580
run-single(col=metric-11)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.36867088607594933), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.36867088607594933), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.36867088607594933), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.36867088607594933), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.36867088607594933)]
discrete plot len w 1580, len s1580
run-single(col=metric-15)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9617754830113258), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.9191372418387742), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.95236508994004), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.9372085276482345), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9617754830113258)]
discrete plot len w 1580, len s1580
run-single(col=metric-23)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9843328491388255), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 1.0), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 1.0), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 1.0), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9860967005602822)]
discrete plot len w 1580, len s1580
run-single(col=metric-30)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0007911392405063333)]
discrete plot len w 1580, len s1580
run-single(col=metric-14-C1)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0015822784810126667)]
discrete plot len w 1580, len s1580
run-single(col=metric-9-C1)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0)]
discrete plot len w 1580, len s1580
run-single(col=metric-12-C2)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.9539922103213243), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.9028724440116845), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.9253489126906849), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.8952450503083415), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.9539922103213243)]
discrete plot len w 1580, len s1580
run-single(col=metric-13-C2)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0015822784810126667), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0015822784810126667)]
discrete plot len w 1580, len s1580
run-single(col=metric-24-C3)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0007911392405063333)]
discrete plot len w 1580, len s1580
run-single(col=metric-27-C3)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0007911392405063333)]
discrete plot len w 1580, len s1580
run-single(col=metric-29-C3)
create anomaly scores (fpof). Number of scores: 158. First 5: [((Timestamp('2022-01-01 00:00:00'), Timestamp('2022-01-01 06:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 06:00:00'), Timestamp('2022-01-01 12:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 12:00:00'), Timestamp('2022-01-01 18:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-01 18:00:00'), Timestamp('2022-01-02 00:00:00')), 0.0007911392405063333), ((Timestamp('2022-01-02 00:00:00'), Timestamp('2022-01-02 06:00:00')), 0.0007911392405063333)]
discrete plot len w 1580, len s1580

Showing pattern occurrences for each sensor

In this plot we show the pattern-based embedding, that is, all patterns and every pattern occurrence for each sensor, using the results from the previous cell.
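Before plotting, it can help to print a short textual summary of the mined patterns per sensor. The sketch below is an optional aid, not part of the library; it only relies on the patterns_df columns already used in this notebook ('instances' and 'bits_saved') and assumes 'instances' holds the list of occurrences of a pattern.

# Optional: summarise the mined patterns per sensor before plotting.
for i, col in enumerate(columns):
    an_scores, ts_i, segments_discretised, windows, patterns_df, embedding = records[i]
    if patterns_df is None or patterns_df.shape[0] == 0:
        print(f'{col}: no patterns (mostly flat after discretisation)')
        continue
    n_occurrences = patterns_df['instances'].apply(len).sum()  # assumes 'instances' is a list of occurrences
    top = patterns_df.sort_values('bits_saved', ascending=False).head(3)
    print(f"{col}: {patterns_df.shape[0]} patterns, {n_occurrences} occurrences, "
          f"top bits_saved: {top['bits_saved'].tolist()}")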

In [43]:
def remove_redundant_patterns(patterns_df):
    # drop patterns whose occurrences are identical to those of another pattern that saves more bits
    redundant = set()
    for idx, pattern_row in patterns_df.iterrows():
        for idx2, pattern_row2 in patterns_df.iterrows():
            if pattern_row['instances'] == pattern_row2['instances'] and pattern_row2['bits_saved'] > pattern_row['bits_saved']:
                redundant.add(pattern_row['id'])
    return patterns_df[~patterns_df['id'].isin(redundant)]

for i, col in enumerate(columns):
    an_scores, ts_i, segments_discretised, windows, patterns_df, embedding = records[i]
    if patterns_df is None or patterns_df.shape[0] == 0:
        continue
    #display(patterns_df)
    #display(remove_redundant_patterns(patterns_df))
    anomalies_dim = get_anomalies_col(anomalies, col)
    patterns_df = remove_redundant_patterns(patterns_df)
    plot_patterns(ts_i, anomalies_dim, segments_discretised, windows, patterns_df, embedding,
                  interval_width=par_fpof['interval'], stride=par_fpof['stride'], no_symbols=par_fpof['no_symbols'])
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
discrete plot len w 1580, len s1580
In [46]:
!jupyter nbconvert --to html tutorial_npbad.ipynb
[NbConvertApp] Converting notebook tutorial_npbad.ipynb to html