This tutorial is based on the main takeaways from the research paper Pattern-Based Anomaly Detection in a Network of Multivariate Time Series.
To explore the basic concepts, we'll use the ipbad function to find interesting patterns and anomalies, and demonstrate these concepts with two different time series datasets: the NYC taxi dataset and the Server Machine Dataset (SMD).
ipbad is described in detail in the paper and contains a variety of optimisations to minimise runtime and improve comprehension. The steps are: (i) transform a time series into a set of discrete sequences; (ii) find frequent and cohesive patterns that best compress the data; (iii) create an anomaly score based on pattern occurrences.
Let’s import the packages that we’ll need to load, analyze, and plot the data:
!git clone https://len_feremans@bitbucket.org/len_feremans/pbad_network_private.git
!cd pbad_network_private && pip install . -q
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
from tqdm import tqdm
import os
from npbad.utils import load_numenta_data, cart_product
from npbad.visualisation import plot_data, plot_discrete, plot_anomalies, plot_patterns
from npbad.preprocess_timeseries import min_max_norm
from npbad.symbolic_representation import create_segments, create_windows
from npbad.symbolic_representation import discretise_segments_equal_distance_bins_global
from npbad.ipbad.main import save_transactions
from npbad.ipbad.pattern_mining_petsc import mine_patterns_and_create_embedding
from npbad.ipbad.minimum_description_length import post_process_mdl
from npbad.ipbad.main import run_ipbad
from npbad.eval_methods import eval_best_f1, eval_recall_at_k
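Before walking through each step in detail, here is a bird's-eye sketch of the three ipbad steps described above, wired together with the helpers we just imported. This is illustrative only; the convenience function run_ipbad, used later in this tutorial, orchestrates the same steps internally.
def ipbad_sketch(ts, anomalies, interval=24, stride=24, no_symbols=10, no_bins=5):
    #(i) symbolic representation: sliding windows -> segments -> discrete sequences
    windows = create_windows(ts, interval=interval, stride=stride)
    segments = create_segments(ts, windows)
    segments_d = discretise_segments_equal_distance_bins_global(segments, no_symbols=no_symbols, no_bins=no_bins)
    #(ii) mine frequent, cohesive sequential patterns and keep only those that compress the data (MDL)
    os.makedirs('./temp', exist_ok=True)
    save_transactions('./temp/sequences.txt', segments_d)
    patterns_df, embedding = mine_patterns_and_create_embedding(
        './temp/sequences.txt', no_transactions=len(segments_d),
        top_k=10000, min_size=4, max_size=10, duration=1.2,
        patterns_fname='./temp/patterns.txt', embedding_fname='./temp/embedding.txt', verbose=False)
    patterns_df, embedding = post_process_mdl(patterns_df, embedding, segments_d,
                                              no_symbols=no_symbols, duration=1.2, verbose=False)
    #(iii) anomaly score per window from the pattern-based embedding (here: FPOF)
    from npbad.anomaly_detection import fpof
    an_scores = list(zip(windows, fpof(embedding, patterns_df.shape[0])))
    return eval_best_f1(an_scores, anomalies)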
Time series patterns are repeated subsequences found within a longer time series. In contrast to motifs, we compute exact occurrences after discretisation, allow patterns of varying length, and allow a number of gaps relative to the length of the pattern.
First, we'll download historical data that represents the half-hourly average of the number of NYC taxi passengers over 75 days in the Fall of 2014, available in the Numenta Anomaly Benchmark (NAB). We also download the labelled timestamps for the taxi dataset, stored in JSON.
!wget -nv https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv
!wget -nv https://raw.githubusercontent.com/numenta/NAB/master/labels/combined_labels.json
We extract the data and insert it into a pandas dataframe, making sure the timestamps are stored as datetime objects and the values are of type float64. The labels, or anomalies, are just a list of numpy datetime objects. Here we use the utility function npbad.utils.load_numenta_data to do this.
df, anomalies = load_numenta_data('nyc_taxi.csv', 'combined_labels.json', key='realKnownCause/nyc_taxi.csv')
display(df)
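For reference, the load is roughly equivalent to reading the raw files directly with pandas. The column names 'timestamp' and 'value' come from the NAB repository; the internal column name 'time' is assumed from how the dataframe is used later.
#roughly equivalent manual load (illustrative only)
df_raw = pd.read_csv('nyc_taxi.csv', parse_dates=['timestamp'])
df_raw = df_raw.rename(columns={'timestamp': 'time'}).astype({'value': 'float64'})
with open('combined_labels.json') as f:
    labels = json.load(f)
anomalies_raw = list(pd.to_datetime(labels['realKnownCause/nyc_taxi.csv']).values)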
We can visualise the time series and anomalies using matplotlib, or use the utility function npbad.visualisation.plot_data to show both the time series and the labelled anomalies.
fig, axs = plt.subplots(1, figsize=(20,3))
df_sample = df[df.time > np.datetime64('2014-10-15')]
plot_data(axs, df_sample, anomalies)
Before we can find frequent patterns, we require a symbolic representation, or sequence database. Therefore, we first normalise the time series and apply a sliding window.
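The function min_max_norm rescales the values to the unit interval [0, 1]; conceptually (a minimal sketch, not the library's exact implementation):
def min_max_sketch(values):
    #values: a pandas Series or numpy array of raw sensor readings
    return (values - values.min()) / (values.max() - values.min())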
ts_i = min_max_norm(df)
ts_i_sample = ts_i[ts_i.time > np.datetime64('2014-12-20')]
#symbolic representation
windows = create_windows(ts_i_sample, interval=24, stride=24)
segments = create_segments(ts_i_sample, windows)
print(f"Total windows: {len(windows)}. First: {windows[0]}. Last: {windows[-1]}. Length segment: {len(segments[0])}")
Next we discretise each window by taking average values every x steps, i.e. we apply the Piecewise Aggregate Approximation (PAA), and discretise the average values into k equal-width (equal-distance) bins.
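Conceptually, the PAA and global equal-distance binning work as in the minimal sketch below, which assumes min-max normalised values in [0, 1]; discretise_segments_equal_distance_bins_global is the reference implementation.
def paa(segment, no_symbols):
    #Piecewise Aggregate Approximation: average the segment over no_symbols equal-sized chunks
    chunks = np.array_split(np.asarray(segment, dtype=float), no_symbols)
    return np.array([chunk.mean() for chunk in chunks])

def discretise_equal_distance(avg_values, no_bins):
    #assign each average to one of no_bins equal-width bins over [0, 1]
    edges = np.linspace(0.0, 1.0, no_bins + 1)
    return np.clip(np.digitize(avg_values, edges) - 1, 0, no_bins - 1)

#example: one daily segment of 48 half-hourly values -> 10 symbols drawn from 5 bins
example = paa(np.random.rand(48), no_symbols=10)
print(discretise_equal_distance(example, no_bins=5))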
segments_d = discretise_segments_equal_distance_bins_global(segments, no_symbols=10, no_bins=5)
print(segments_d[0])
We can visualise the discrete representation directly, since the binning is global.
fig, axs = plt.subplots(2, figsize=(30,6), sharex=True)
anomalies = [an for an in anomalies if an > np.datetime64('2014-12-01')]
plot_data(axs[0], ts_i_sample, anomalies)
plot_discrete(axs[1], segments_d, windows, interval_width=24, stride=24, no_symbols=10, no_bins=5)
plot_anomalies(axs[1], ts_i_sample, anomalies)
Given a set of discrete sequences, we can report an anomaly as a sequence that occurs infrequently or, equivalently, a sequence that is not covered by frequent patterns. First we identify a pattern manually.
def matches(pattern, segment):
    '''Return True if all symbols of the pattern occur in the segment in the same order (subsequence match).'''
    for x in pattern:
        if x in segment:
            #advance to the first occurrence of the current symbol
            #(note: the segment is advanced to, not past, the matched symbol, so a repeated pattern symbol may re-match the same position)
            segment = segment[segment.index(x):]
        else:
            return False
    return True
frequent_pattern = [0,1,2,2,2,2,2]
occurrences1 = [1 if matches(frequent_pattern,segment.tolist()) else 0 for segment in segments_d]
infrequent_pattern = [3,2,0,0,1]
occurrences2 = [1 if matches(infrequent_pattern,segment.tolist()) else 0 for segment in segments_d]
fig, axs = plt.subplots(3, figsize=(30,6), sharex=True)
plot_discrete(axs[0], segments_d, windows, interval_width=24, stride=24, no_symbols=10, no_bins=5)
plot_anomalies(axs[0], ts_i_sample, anomalies)
axs[1].bar([window[0] + np.timedelta64(6,'h') for window in windows], occurrences1)
axs[1].set_ylabel(f'{frequent_pattern}')
axs[2].bar([window[0] + np.timedelta64(6,'h') for window in windows], occurrences2, color='green')
axs[2].set_ylabel(f'{infrequent_pattern}')
From the above plot we see that the first, frequent, pattern matches 40 of the 42 windows and that the two remaining windows are in fact anomalous. The second, infrequent, pattern matches only a single window, which is also anomalous. If we created an embedding for each window that stores whether either pattern occurs, it would not be hard to decide whether a window is anomalous. Building such an embedding by hand is illustrated below; in the following section we will automatically discover patterns and report anomalies.
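A toy two-column embedding built from the two occurrence lists above, flagging windows not covered by the frequent pattern or covered by the infrequent one (for illustration only):
#toy pattern-based embedding: one row per window, one column per pattern (1 = the pattern occurs)
toy_embedding = list(zip(occurrences1, occurrences2))
candidate_anomalies = [windows[i][0] for i, (freq, infreq) in enumerate(toy_embedding)
                       if freq == 0 or infreq == 1]
print(f"Candidate anomalous windows: {candidate_anomalies}")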
First we find the top-10,000 frequent sequential patterns with a constraint on relative duration. For instance, with a relative duration of 1.0 we allow no gaps, and with a relative duration of 1.2 an occurrence of a pattern of length 5 can span at most 6 positions, i.e. contain at most 1 gap (5 × 1.2 − 5 = 1). We also filter out sequential patterns with a length smaller than 4.
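To make the duration constraint concrete, the illustrative helper below (not the miner's actual code) checks whether a pattern occurs as a subsequence within the allowed span:
import math

def max_span(pattern_length, rel_duration):
    #an occurrence may span at most floor(pattern_length * rel_duration) consecutive symbols
    return math.floor(pattern_length * rel_duration)

def occurs_with_max_gaps(pattern, segment, rel_duration=1.2):
    #try every start position and check for a subsequence match within the allowed span
    span = max_span(len(pattern), rel_duration)
    for start in range(len(segment) - len(pattern) + 1):
        window = list(segment[start:start + span])
        it = iter(window)
        if all(any(sym == x for sym in it) for x in pattern):
            return True
    return False

print(max_span(5, 1.2) - 5)  #at most 1 gap for a pattern of length 5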
save_transactions('./temp/sequences.txt', segments_d)
patterns_df, embedding = mine_patterns_and_create_embedding('./temp/sequences.txt', no_transactions = len(segments_d),
top_k=10000, min_size=4, max_size=10,
duration=1.2,
patterns_fname='./temp/patterns.txt', embedding_fname='./temp/embedding.txt', verbose=True)
print(f"Total patterns before MDL: {patterns_df.shape[0]}")
display(patterns_df)
To improve compression, we post-process the patterns and keep only those that compress the data, as defined using the minimum description length (MDL) principle. This dramatically decreases the number of patterns and improves comprehension.
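The intuition can be sketched as follows: a pattern is kept only if encoding its occurrences with a single extra codeword costs fewer bits than spelling out the raw symbols, after paying for the pattern definition itself. The sketch below uses a simplified uniform code over the symbols and only gapless, non-overlapping occurrences; the actual encoding used by post_process_mdl (described in the paper) is more involved.
import math

def naive_description_length(segments, pattern, alphabet_size):
    #simplified two-part MDL score: bits for the pattern definition + bits for the data,
    #where each gapless, non-overlapping occurrence of the pattern is replaced by one codeword
    symbol_cost = math.log2(alphabet_size + 1)   #+1: the extra codeword for the pattern
    model_cost = len(pattern) * symbol_cost
    data_cost = 0.0
    p = list(pattern)
    for seg in segments:
        s = list(seg)
        i, covered = 0, 0
        while i <= len(s) - len(p):
            if s[i:i + len(p)] == p:
                covered += len(p)
                i += len(p)
            else:
                i += 1
        data_cost += (len(s) - covered) * symbol_cost + (covered / len(p)) * symbol_cost
    return model_cost + data_cost

baseline = sum(len(s) for s in segments_d) * math.log2(5)   #encode every symbol literally (5 bins -> 5 symbols)
print(naive_description_length(segments_d, [0, 1, 2, 2, 2], alphabet_size=5), baseline)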
patterns_df, embedding = post_process_mdl(patterns_df, embedding, segments_d, no_symbols=10, duration=1.2, verbose=False)
print(f"Total patterns after MDL: {patterns_df.shape[0]}")
display(patterns_df)
print(embedding[0])
We can visualise each pattern using npbad.visualisation.plot_patterns, which also shows the number of patterns in the pattern-based embedding for each window.
plot_patterns(ts_i_sample, anomalies, segments_d, windows, patterns_df, embedding, interval_width=24,stride=24,no_symbols=10)
After we have discovered patterns and created a pattern-based embedding for each window, we compute an anomaly score using an isolation forest. Alternatively, we use the frequent-pattern outlier factor (FPOF), which is the sum of the relative supports of all matching patterns divided by the total number of patterns.
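A minimal sketch of the FPOF idea is shown below; the reference implementation is npbad.anomaly_detection.fpof, which may normalise or orient the score differently (e.g. so that higher means more anomalous).
def fpof_sketch(window_pattern_ids, supports, total_patterns):
    #window_pattern_ids: ids of the patterns that occur in this window
    #supports: dict mapping pattern id -> relative support of that pattern
    #classic FPOF: windows covered by few or only rare patterns get a low value (more anomalous)
    return sum(supports[p] for p in window_pattern_ids) / max(total_patterns, 1)

#example: 3 patterns in total, the window matches patterns 0 and 2
print(fpof_sketch([0, 2], {0: 0.9, 1: 0.1, 2: 0.5}, total_patterns=3))  #~0.47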
from npbad.anomaly_detection import iso_forest, fpof
#fpof results:
fpof_scores = fpof(embedding, patterns_df.shape[0])
fpof_scores = list(zip(windows, fpof_scores))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(fpof_scores, anomalies)
print(f"IPBAD-FPOF: F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
#plot
def plot_predicted_anomalies(df, anomalies, an_scores1, t):
fig, axs = plt.subplots(2, figsize=(30,3), sharex=True)
plot_data(axs[0], df, anomalies)
#plot fpof
timestamps = [window[0] + np.timedelta64(12,'h') for window, score in an_scores1]
scores1 = [score for window, score in an_scores1]
axs[1].plot(timestamps, scores1, color='red',alpha=0.5)
axs[1].axhline(y=t, color='b', linestyle='--')
axs[1].set_ylabel('FPOF')
plot_predicted_anomalies(ts_i_sample, anomalies, fpof_scores, t)
The previous steps (discretisation, pattern mining and embedding, anomaly detection) can be combined on the full taxi dataset in one call to npbad.ipbad.main.run_ipbad. In this case we set the stride to 1.
df, anomalies = load_numenta_data('nyc_taxi.csv', 'combined_labels.json', key='realKnownCause/nyc_taxi.csv')
prep_par = {'interval': 96, 'stride':1, 'no_symbols': 5, 'no_bins': 5, 'remove_outliers':False}
parameter = {'use_iso': True, 'use_MDL': True}
#run anomaly detection
an_scores, ts_i, segments_discretised, windows = run_ipbad(
df=df, interval=prep_par["interval"], stride=prep_par["stride"],
no_symbols=prep_par["no_symbols"], no_bins=prep_par["no_bins"],
preprocess_outlier=prep_par["remove_outliers"],
binning_method='normal',
use_MDL=parameter['use_MDL'], use_iso=parameter['use_iso'],
verbose=False)
#evaluate
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_scores, anomalies)
print(f"F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
#plot
plot_predicted_anomalies(df, anomalies, an_scores, t)
The performance of anomaly detection depends on a good symbolic representation. Therefore we should optimise the hyper-parameters. An example using grid search is shown below.
results_ipbad = []
grid = cart_product(interval=[110,96,72], no_symbols=[5,6,7], no_bins=[5,6,7])
for preprocess_parameter in tqdm(grid):
try:
an_scores, ts_i, segments_discretised, windows = run_ipbad(df=df, interval=preprocess_parameter["interval"], stride=1,
no_symbols=preprocess_parameter["no_symbols"], no_bins=preprocess_parameter["no_bins"],
preprocess_outlier=False, binning_method='normal',verbose=False,
use_MDL=True, use_iso=True)
precAt10, recAt10 = eval_recall_at_k(an_scores, anomalies, threshold_top_k=50, K=10)
results_ipbad.append((recAt10, precAt10, preprocess_parameter))
except Exception as e:
print(e)
df_results = pd.DataFrame.from_records(results_ipbad, columns=["recall@10", "precision@10", "parameters"])
df_results = df_results.sort_values(by=["recall@10","precision@10"], ascending=False)
display(df_results)
As a second use-case we detect anomalies in a multivariate dataset containing multiple time series, or sensors, related to a server, such as CPU and RAM usage.
The Server Machine Dataset (SMD) is split into a train and a test part, where the test part consists of the last 50% of the sensor values. For model training we concatenate both parts and use all data for frequent pattern mining, thereby ignoring labels until evaluation. We load the first machine from SMD; the complete dataset is available here. In SMD each machine consists of 38 different sensors.
data_folder = './pbad_network_private/data/SMD_sample/'
machine = 'machine-1-1'
machine_1_test = f'{machine}_test.csv'
machine_1_train = f'{machine}_train.csv'
machine_1_test_labels = f'{machine}_test_label.csv'
machine_1_interpretation_label = f'{machine}_interpretation_label.csv'
#Load training and test data. Original data has only 'time_id' so we add actual timestamps.
#We concatenate both datasets, since the test data comes after the training data.
#We only have labels for the test part and thus only predict anomalies during the test period.
df_train = pd.read_csv(data_folder + machine_1_train) #i.e. contains metric-1,...,metric-38,time_id,time
df_train['time'] = pd.to_datetime(df_train['time'])
df_train = df_train.drop(columns=['time_id'])
df_test = pd.read_csv(data_folder + machine_1_test) #i.e. contains metric-1,...,metric-38,time_id,time
df_test['time'] = pd.to_datetime(df_test['time'])
df_test = df_test.drop(columns=['time_id'])
#Test comes right after train, so shift the test timestamps by the offset between train end and test start
diff_seconds = int((df_train['time'].max() - df_test['time'].min()) / np.timedelta64(1, 's'))
df_test['time'] = df_test['time'].apply(lambda x: x + np.timedelta64(diff_seconds,'s'))
df = pd.concat([df_train, df_test])
labels_df = pd.read_csv(data_folder + machine_1_test_labels) #i.e. contains label,time_id,time where label==1
labels_df['time'] = pd.to_datetime(labels_df['time'])
#also for labels
labels_df['time'] = labels_df['time'].apply(lambda x: x + np.timedelta64(diff_seconds,'s'))
anomalies = [row['time'] for idx, row in labels_df.iterrows()]
print(f'No anomalies: {len(anomalies)}. First: {anomalies[0]}')
pd.options.display.float_format = '{:,.3f}'.format
display(df)
from npbad.visualisation import plot_mv_data
plot_mv_data('machine-1-1', df, anomalies)
In the raw data we find many correlated time series. For anomaly detection we pre-process the time series and remove correlated time series, which is done by the function pre_process_mv_df. After pre-processing we keep only 16 of the 38 time series for the first device.
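As a rough illustration of this idea (with an assumed correlation threshold of 0.9, which may differ from what pre_process_mv_df actually does), dropping sensors that are highly correlated with an already kept sensor could look like this:
def drop_correlated_sensors(df_mv, threshold=0.9):
    #illustrative only: keep 'time' plus each sensor that is not highly correlated
    #with any sensor we already decided to keep
    sensors = [c for c in df_mv.columns if c != 'time']
    corr = df_mv[sensors].corr().abs()
    keep = []
    for col in sensors:
        if all(pd.isna(corr.loc[col, k]) or corr.loc[col, k] < threshold for k in keep):
            keep.append(col)
    return df_mv[['time'] + keep]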
from npbad.preprocess_timeseries import pre_process_mv_df
df2, components = pre_process_mv_df(df)
plot_mv_data('machine-1-1', df2, anomalies)
from npbad.baselines.iso_forest import run_iso_forest_mv
preprocess_parameter = {'interval': 72, 'stride':1, 'no_symbols': 4, 'no_bins': 5, 'preprocess_outlier': False}
#run iso forest
an_scores = run_iso_forest_mv(df2, interval=preprocess_parameter['interval'],
stride=preprocess_parameter['stride'],
preprocess_outlier=preprocess_parameter['preprocess_outlier'])
#evaluate (using only test data which has labelled anomalies)
an_score_second_half = []
for window, score in an_scores:
if window[0] > df_test['time'].min():
an_score_second_half.append((window,score))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies)
print(f"ISO F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
#without point-adjust:
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"ISO W/O F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
from npbad.ipbad.main import run_ipbad_mv
an_scores = run_ipbad_mv(df2, interval=preprocess_parameter['interval'], stride=preprocess_parameter['stride'],
no_symbols=preprocess_parameter['no_symbols'], no_bins=preprocess_parameter['no_bins'],
preprocess_outlier=preprocess_parameter['preprocess_outlier'],
use_MDL=True, binning_method='normal', use_iso=True)
#eval after time
an_score_second_half = []
for window, score in an_scores:
if window[0] > df_test['time'].min():
an_score_second_half.append((window,score))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies)
print(f"IPBAD F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
#without point-adjust:
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"IPBAD W/O F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN:{TN}")
plot_mv_data('machine-1-1', df2, anomalies, an_scores_computed=an_score_second_half, threshold=t)
We can use random search to find optimal parameters.
We remark that for parameter optimisation we prefer to set point_adjust to False. We consider an anomaly detected if a window contains the anomaly and the anomaly score is higher than a certain threshold. Using point-adjust we also consider other (overlapping) windows and check whether one of those windows has a high anomaly score: a true positive is then reported if a neighbouring window with a higher anomaly score detects that anomaly, even if the current window has an anomaly score below the threshold. However, this is an (overly) optimistic evaluation protocol.
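The difference can be sketched as follows (an illustrative simplification; npbad.eval_methods.eval_best_f1 is the reference implementation):
def count_windows(an_scores, anomalies, t, point_adjust=True):
    #an_scores: list of ((window_start, window_end), score); anomalies: list of timestamps
    #a window is predicted anomalous if its score >= t; with point-adjust, a window that contains
    #an anomaly also counts as a true positive if ANY window covering that anomaly fires
    covered = {a: any(s >= t for (w, s) in an_scores if w[0] <= a <= w[1]) for a in anomalies}
    TP = FP = FN = TN = 0
    for (w, s) in an_scores:
        contains = [a for a in anomalies if w[0] <= a <= w[1]]
        if not contains:
            FP += int(s >= t)
            TN += int(s < t)
        elif s >= t or (point_adjust and any(covered[a] for a in contains)):
            TP += 1
        else:
            FN += 1
    return TP, FP, FN, TN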
import random
grid={"interval": [24,48,72,96],
"stride": [6],
"no_symbols": list(range(4,20,2)),
"no_bins" : list(range(3,20,2)) + [50,100],
"preprocess_outlier": [True, False],
"use_MDL": [True],
"binning_method": ['normal'],
"use_iso": [True]
}
no_iter = 5
best_parameters = None
best_metrics = None
best_an_scores = None
while no_iter > 0:
random_parameter = {}
for key, options in grid.items():
idx = random.randint(0,len(options)-1)
random_parameter[key] = options[idx]
try:
an_scores = run_ipbad_mv(df2, **random_parameter)
an_score_second_half = []
for window, score in an_scores:
if window[0] > df_test['time'].min():
an_score_second_half.append((window,score))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"Parameters: {random_parameter},\n F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}")
no_iter = no_iter - 1
if best_metrics is None or f1 > best_metrics[0]:
best_metrics = (f1,prec,rec,TP,FP,FN,TN, t)
best_parameters = random_parameter
best_an_scores = an_scores
except Exception as e:
print(e)
print('#### Best result ####')
(f1,prec,rec,TP,FP,FN,TN, t) = best_metrics
print(f"Parameters: {best_parameters}")
print(f"F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}, TN: {TN}")
#best parameters
#Parameters: {'interval': 96, 'stride': 6, 'no_symbols': 12, 'no_bins': 50, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal', 'use_iso': True},
#F1@best: 0.8451, precision: 0.8333, recall: 0.8571, threshold: 0.4618, TP: 30, FP: 6, FN:5
par_fpof = {'interval': 6, 'stride': 6, 'no_symbols': 10, 'no_bins': 10, 'preprocess_outlier': False, 'use_MDL': True, 'binning_method': 'normal',
'use_iso': False}
df2_sample = df2  #optionally restrict to a sub-period, e.g. df2[df2['time'] > np.datetime64('2022-02-01')]
an_scores = run_ipbad_mv(df2_sample, **par_fpof)
an_score_second_half = []
for window, score in an_scores:
if window[0] > df_test['time'].min():
an_score_second_half.append((window,score))
(f1,prec,rec,TP,FP,FN,TN, t) = eval_best_f1(an_score_second_half, anomalies, point_adjust=False)
print(f"Parameters: {par_fpof},\n F1@best: {f1:.4f}, precision: {prec:.4f}, recall: {rec:.4f}, threshold: {t:.4f}, TP: {TP}, FP: {FP}, FN:{FN}")
#was: F1@best: 0.4528, precision: 0.3077, recall: 0.8571, threshold: 0.9876, TP: 12, FP: 27, FN:2
#was: F1@best: 0.5000, precision: 0.6000, recall: 0.4286, threshold: 0.9964, TP: 6, FP: 4, FN:8
In SMD we have additional data on which sensors caused which anomalies. First we load this data.
#SMD contains an interpretation label, which shows which sensors caused each anomaly
df_int_labels = pd.read_csv(data_folder + machine_1_interpretation_label)
df_int_labels = pd.merge(df_int_labels, labels_df, how='left', left_on='from_time_id', right_on='time_id')
df_int_labels = df_int_labels.rename(columns={"time":"from_time"})
#convert to_time_id to a timestamp: time_id 15849 corresponds to 2022-01-31 18:47:00 and consecutive ids are one minute apart
df_int_labels['to_time'] = df_int_labels['to_time_id'].apply(lambda i: np.datetime64('2022-01-31 18:47:00') + np.timedelta64(i - 15849,'m'))
display(df_int_labels)
def get_anomalies_col(anomalies, col):
    '''Return the anomalies caused by the given sensor (column), according to the interpretation labels.'''
    matching_anomalies = []
    for anomaly in anomalies:
        found = False
        filter_df = df_int_labels[(df_int_labels['from_time'] <= anomaly) & (df_int_labels['to_time'] >= anomaly)]
        for idx, row in filter_df.iterrows():
            if '-C' in col:
                #strip any '-C...' suffix from the column name before matching against the labelled dimensions
                col = col[0:col.index('-C')]
            if col in row['dimensions']:
                found = True
                break
        if found:
            matching_anomalies.append(anomaly)
    return matching_anomalies
Next, we show each time series, the anomalies associated with that time series and the corresponding discretised time series.
no_dim = len(df2.columns)
fig, axs = plt.subplots(no_dim, figsize=(30,30), sharex=True)
for i, col in enumerate(df2.columns):
if col == 'time':
continue
axs[i].plot(df2_sample['time'], df2_sample[col], color='blue',alpha=0.5)
#show anomalies for known sensors
anomalies_dim = get_anomalies_col(anomalies, col)
axs[i].scatter(anomalies_dim, [1.0 for anomaly in anomalies_dim], color='red', marker='X', alpha=0.5, s=5)
#discretise
ts_i = df2_sample[['time',col]].copy()
ts_i = ts_i.rename(columns={col:'average_value'})
windows = create_windows(ts_i, interval=par_fpof['interval'], stride=par_fpof['stride'])
segments = create_segments(ts_i, windows)
    segments_d = discretise_segments_equal_distance_bins_global(segments, no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'])
print(windows[0],segments_d[0],len(windows))
#plot discretise
axs[i].text(df2_sample['time'].min() - np.timedelta64(1,'D'), 0, f'{col}', fontsize=12)
plot_discrete(axs[i], segments_d, windows, interval_width=par_fpof['interval'], stride=par_fpof['stride'], no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'])
Next, we compute the frequent-pattern outlier factor (FPOF) for each time series individually. The FPOF anomaly score often has a lower accuracy than using an isolation forest. However, the anomaly score is easy to interpret, as it is essentially the ratio of the number of matching patterns in each window to the total number of patterns. To be fully technically correct, the frequent-pattern outlier factor is the ratio of the weighted sum of the relative support of each matching pattern to the total number of patterns.
In the first plot, we render each time series of the first device in the SMD dataset, the discretised time series, and the computed anomaly score using FPOF. We skip time series that have no patterns, i.e. time series that are mostly flat after discretisation.
from npbad.visualisation import *
#show FPOF score each dimension
data = []
columns = [col for col in df2_sample.columns if col != 'time']
fig, axs = plt.subplots(len(columns), figsize=(30,30), sharex=True)
records = []
for i, col in enumerate(columns):
ts_i = df2_sample[['time', col]].copy()
ts_i = ts_i.rename(columns={col: 'average_value'})
print(f'run-single(col={col})')
result = run_ipbad(ts_i, interval=par_fpof['interval'], stride=par_fpof['stride'],
no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'],
preprocess_outlier=par_fpof['preprocess_outlier'],
use_MDL=par_fpof['use_MDL'], binning_method=par_fpof['binning_method'],
use_iso=False, verbose=False, return_patterns=True) #RETURN_PATTERN=TRUE
if len(result) == 4:
an_scores, ts_i, segments_discretised, windows = result
records.append([an_scores, ts_i, segments_discretised, windows, None, None])
else:
an_scores, ts_i, segments_discretised, windows, patterns_df, embedding = result
records.append([ an_scores, ts_i, segments_discretised, windows, patterns_df, embedding])
#plot_ts
axs[i].plot(df2_sample['time'], df2_sample[col], color='blue',alpha=0.5)
#show anomalies for known sensors
anomalies_dim = get_anomalies_col(anomalies, col)
axs[i].scatter(anomalies_dim, [1.0 for anomaly in anomalies_dim], color='red', marker='X', alpha=0.5, s=5)
#plot discretise
plot_discrete(axs[i], segments_discretised, windows,
interval_width=par_fpof['interval'], stride=par_fpof['stride'], no_symbols=par_fpof['no_symbols'], no_bins=par_fpof['no_bins'])
#plot an_scores
if len(result) != 6:
continue
timestamps = [average_timestamps(window[0],window[1]) for window, score in an_scores]
scores = [score for window, score in an_scores]
if (max(scores) - min(scores)) !=0:
scores = [(score - min(scores))/(max(scores) - min(scores)) for score in scores]
axs[i].plot(timestamps, scores, color='orange', marker='X', alpha=0.5)
#start = pd.to_datetime(df2_sample['time'].min()) - np.timedelta64(1.5,'D')
axs[i].text(timestamps[0] - np.timedelta64(2,'D'), 0, f'{col}', fontsize=12)
In this plot we show the pattern-based embedding, that is, all patterns and each pattern occurrence for each sensor, corresponding with the previous run.
def remove_redundant_patterns(patterns_df):
    '''Remove patterns whose occurrences are identical to those of another pattern that saves more bits.'''
    redundant = set()
    for idx, pattern_row in patterns_df.iterrows():
        for idx2, pattern_row2 in patterns_df.iterrows():
            if pattern_row['instances'] == pattern_row2['instances'] and pattern_row2['bits_saved'] > pattern_row['bits_saved']:
                redundant.add(pattern_row['id'])
    return patterns_df[~patterns_df['id'].isin(redundant)]
for i, col in enumerate(columns):
an_scores, ts_i, segments_discretised, windows, patterns_df, embedding = records[i]
if patterns_df is None or patterns_df.shape[0] == 0:
continue
#display(patterns_df)
#display(remove_redundant_patterns(patterns_df))
anomalies_dim = get_anomalies_col(anomalies, col)
patterns_df = remove_redundant_patterns(patterns_df)
plot_patterns(ts_i, anomalies_dim, segments_discretised, windows, patterns_df, embedding,
interval_width=par_fpof['interval'],stride=par_fpof['stride'],no_symbols=par_fpof['no_symbols'])
!jupyter nbconvert --to html tutorial_npbad.ipynb