[15]:

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

Log Encodings¶

Declare4Py provides several of the main encoding techniques for vectorizing a log of traces, which is a prerequisite for applying Machine Learning techniques. The encoding classes (see the Declare4Py.Encodings package) take as input a log in Pandas dataframe format and return a Pandas dataframe in which each row represents a single trace and each column an extracted feature. The Declare4Py encodings are implemented as scikit-learn transformers, so it is straightforward to use them in a Machine Learning pipeline.

The tutorial will cover the following points:

Encoding families:
1. The boolean encoding;
2. The frequency-based encoding;
3. The aggregated encoding;
4. Indexed encodings:
  1. The simple-index encoding;
  2. The complex-index encoding;
5. Static encodings:
  1. The first-state encoding;
  2. The second-to-last-state encoding;
  3. The last-state encoding;
6. The Ngram encoding;
7. The Declare encoding.
Encoding combinations:
1. The index-latest-payload encoding.
A Machine Learning pipeline.

Before using the encodings, the necessary packages need to be imported.

[1] [2] [3] [4]

[16]:

import os
import pandas as pd


from Declare4Py.Encodings.Aggregate import Aggregate
from Declare4Py.Encodings.IndexBased import IndexBased
from Declare4Py.Encodings.Static import Static
from Declare4Py.Encodings.PreviousState import PreviousState
from Declare4Py.Encodings.LastState import LastState
from Declare4Py.Encodings.Ngram import Ngram
from Declare4Py.Encodings.Declare import Declare

The encoding classes expect the log as a Pandas dataframe. Therefore, we import the event log and convert it into a Pandas dataframe.

[17]:

from Declare4Py.D4PyEventLog import D4PyEventLog

log_path = os.path.join("../../../", "tests", "test_logs", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog(case_name="case:concept:name")
event_log.parse_xes_log(log_path)
case_id_key = event_log.get_case_name()
event_log.to_dataframe()
df = event_log.log
df.head()

[17]:

	InfectionSuspected	org:group	DiagnosticBlood	DisfuncOrg	SIRSCritTachypnea	Hypotensie	SIRSCritHeartRate	Infusion	DiagnosticArtAstrup	concept:name	...	DiagnosticLacticAcid	lifecycle:transition	Diagnose	Hypoxie	DiagnosticUrinarySediment	DiagnosticECG	Leucocytes	CRP	LacticAcid	case:concept:name
0	True	A	True	True	True	True	True	True	True	ER Registration	...	True	complete	A	False	True	True	NaN	NaN	NaN	A
1	NaN	B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Leucocytes	...	NaN	complete	NaN	NaN	NaN	NaN	9.6	NaN	NaN	A
2	NaN	B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	CRP	...	NaN	complete	NaN	NaN	NaN	NaN	NaN	21.0	NaN	A
3	NaN	B	NaN	NaN	NaN	NaN	NaN	NaN	NaN	LacticAcid	...	NaN	complete	NaN	NaN	NaN	NaN	NaN	NaN	2.2	A
4	NaN	C	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ER Triage	...	NaN	complete	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A

5 rows × 32 columns
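The dataframe above has one row per event, and the case identifier column ties the events of a trace together. As a library-independent illustration of that layout, the following sketch groups a hypothetical toy log (plain Python dictionaries, not the Sepsis log) into traces. The function name group_into_traces and the toy values are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy log: one dict per event, already in chronological order.
events = [
    {"case:concept:name": "A", "concept:name": "ER Registration", "CRP": None},
    {"case:concept:name": "A", "concept:name": "CRP", "CRP": 21.0},
    {"case:concept:name": "B", "concept:name": "ER Registration", "CRP": None},
    {"case:concept:name": "A", "concept:name": "Release A", "CRP": None},
    {"case:concept:name": "B", "concept:name": "ER Triage", "CRP": None},
]


def group_into_traces(events, case_id_key):
    """Collect the events of each case, preserving their original order."""
    traces = defaultdict(list)
    for event in events:
        traces[event[case_id_key]].append(event)
    return dict(traces)


traces = group_into_traces(events, "case:concept:name")
for case_id, trace in traces.items():
    print(case_id, [event["concept:name"] for event in trace])
```

Every encoding in this tutorial turns each such trace into exactly one row of features.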

Encoding families¶

Each Declare4Py encoding is implemented as a scikit-learn transformer class: instantiate the corresponding encoder object and call fit_transform(df) on the input dataframe. The feature names can then be retrieved with the get_feature_names() function.
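To show the calling convention in isolation, here is a minimal sketch of an object that follows the same fit / transform / fit_transform pattern. TraceLengthEncoder is a hypothetical toy class written for this example; it is not part of Declare4Py and does not inherit from scikit-learn, and it works on a list of (case id, activity) tuples rather than on a dataframe.

```python
class TraceLengthEncoder:
    """Toy encoder following the fit/transform convention (not a Declare4Py class)."""

    def fit(self, events):
        # Nothing is learned from the data here apart from the feature name.
        self.feature_names_ = ["trace_length"]
        return self

    def transform(self, events):
        lengths = {}
        for case_id, _activity in events:
            lengths[case_id] = lengths.get(case_id, 0) + 1
        # One row per case, one column per feature.
        return {case_id: [n] for case_id, n in lengths.items()}

    def fit_transform(self, events):
        return self.fit(events).transform(events)

    def get_feature_names(self):
        return self.feature_names_


events = [("A", "ER Registration"), ("A", "CRP"), ("B", "ER Registration")]
encoder = TraceLengthEncoder()
encoded = encoder.fit_transform(events)
print(encoder.get_feature_names(), encoded)
```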

The Boolean Encoding¶

In the boolean encoding, each sequence of events is represented as a feature vector in which each feature corresponds to an event class (an activity) from the log and indicates whether that event class occurs in the trace. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by passing the categorical attributes in cat_cols and setting the boolean parameter to True.

[18]:

encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=True)
enc_df = encoder.fit_transform(df)

print(f"Log features:\n {encoder.get_feature_names()}")
enc_df.head()

Log features:
 Index(['concept:name_Admission IC', 'concept:name_Admission NC',
       'concept:name_CRP', 'concept:name_ER Registration',
       'concept:name_ER Sepsis Triage', 'concept:name_ER Triage',
       'concept:name_IV Antibiotics', 'concept:name_IV Liquid',
       'concept:name_LacticAcid', 'concept:name_Leucocytes',
       'concept:name_Release A', 'concept:name_Release B',
       'concept:name_Release C', 'concept:name_Release D',
       'concept:name_Release E', 'concept:name_Return ER', 'org:group_?',
       'org:group_A', 'org:group_B', 'org:group_C', 'org:group_D',
       'org:group_E', 'org:group_F', 'org:group_G', 'org:group_H',
       'org:group_I', 'org:group_J', 'org:group_K', 'org:group_L',
       'org:group_M', 'org:group_N', 'org:group_O', 'org:group_P',
       'org:group_Q', 'org:group_R', 'org:group_S', 'org:group_T',
       'org:group_U', 'org:group_V', 'org:group_W', 'org:group_X',
       'org:group_Y'],
      dtype='object')

[18]:

	concept:name_Admission IC	concept:name_Admission NC	concept:name_CRP	concept:name_ER Registration	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Antibiotics	concept:name_IV Liquid	concept:name_LacticAcid	concept:name_Leucocytes	...	org:group_P	org:group_Q	org:group_R	org:group_S	org:group_T	org:group_U	org:group_V	org:group_W	org:group_X	org:group_Y
case:concept:name
A	False	True	True	True	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False
AA	False	False	True	True	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False
AAA	False	True	True	True	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False
AB	False	False	True	True	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False
ABA	False	True	True	True	True	True	True	True	True	True	...	False	False	False	False	False	False	False	False	False	False

5 rows × 42 columns
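The idea behind the boolean encoding can be sketched without the library. The following example uses a hypothetical toy log with a single categorical attribute; boolean_encode is a name chosen for illustration, and the code is only a sketch of the concept, not Declare4Py's implementation.

```python
# Hypothetical toy log: case id -> ordered list of activity names.
traces = {
    "A": ["ER Registration", "CRP", "CRP", "Release A"],
    "B": ["ER Registration", "ER Triage"],
}


def boolean_encode(traces, attribute="concept:name"):
    # Feature columns: every distinct value seen in the log, sorted for a stable order.
    values = sorted({value for trace in traces.values() for value in trace})
    columns = [f"{attribute}_{value}" for value in values]
    # A feature is True when the value occurs at least once in the trace.
    rows = {case: [value in trace for value in values] for case, trace in traces.items()}
    return columns, rows


columns, rows = boolean_encode(traces)
print(columns)
print(rows)
```

Note that the repeated CRP event in case A leaves no mark here: presence is all that is recorded.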

The Frequency-Based Encoding¶

Instead of boolean values, the frequency-based encoding represents the control flow of a case with the frequency of each event class in the case. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by passing the categorical attributes in cat_cols and setting the boolean parameter to False.

[19]:

encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()

[19]:

	concept:name_Admission IC	concept:name_Admission NC	concept:name_CRP	concept:name_ER Registration	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Antibiotics	concept:name_IV Liquid	concept:name_LacticAcid	concept:name_Leucocytes	...	org:group_P	org:group_Q	org:group_R	org:group_S	org:group_T	org:group_U	org:group_V	org:group_W	org:group_X	org:group_Y
case:concept:name
A	0	1	7	1	1	1	1	1	1	7	...	0	0	0	0	0	0	0	0	0	0
AA	0	0	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	0	0	0
AAA	0	1	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	0	0	0
AB	0	0	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	0	0	0
ABA	0	1	4	1	1	1	1	1	1	5	...	0	0	0	0	0	0	0	0	0	0

5 rows × 42 columns
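As a library-independent sketch of the same idea, the example below counts occurrences per trace on a hypothetical toy log. The function name frequency_encode is an assumption for this example, and the code is not Declare4Py's implementation.

```python
from collections import Counter

# Hypothetical toy log: case id -> ordered list of activity names.
traces = {
    "A": ["ER Registration", "CRP", "CRP", "Release A"],
    "B": ["ER Registration", "ER Triage"],
}


def frequency_encode(traces, attribute="concept:name"):
    values = sorted({value for trace in traces.values() for value in trace})
    columns = [f"{attribute}_{value}" for value in values]
    rows = {}
    for case, trace in traces.items():
        counts = Counter(trace)
        # Counter returns 0 for values that never occur in this trace.
        rows[case] = [counts[value] for value in values]
    return columns, rows


columns, rows = frequency_encode(traces)
print(columns)
print(rows)
```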

The Aggregated Encoding¶

The aggregated encoding considers all events since the beginning of the case but ignores their order. Several aggregation functions can be applied to the values that an event attribute has taken throughout the case. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by passing the categorical attributes (cat_cols) and the numeric attributes (num_cols), setting the boolean parameter to False, and passing in aggregation_functions a list of functions used to aggregate the numeric attributes, e.g., ‘mean’, ‘max’, ‘min’, ‘sum’, ‘std’.

[20]:

encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], boolean=False, aggregation_functions=['min', 'mean', 'max'])
enc_df = encoder.fit_transform(df)
enc_df.head()

[20]:

	concept:name_Admission IC	concept:name_Admission NC	concept:name_CRP	concept:name_ER Registration	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Antibiotics	concept:name_IV Liquid	concept:name_LacticAcid	concept:name_Leucocytes	...	org:group_S	org:group_T	org:group_U	org:group_V	org:group_W	org:group_X	org:group_Y	CRP_min	CRP_mean	CRP_max
case:concept:name
A	0	1	7	1	1	1	1	1	1	7	...	0	0	0	0	0	0	0	6.0	30.857143	109.0
AA	0	0	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	23.0	23.000000	23.0
AAA	0	1	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	68.0	68.000000	68.0
AB	0	0	1	1	1	1	1	1	1	1	...	0	0	0	0	0	0	0	48.0	48.000000	48.0
ABA	0	1	4	1	1	1	1	1	1	5	...	0	0	0	0	0	0	0	78.0	105.000000	140.0

5 rows × 45 columns
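The numeric part of the aggregated encoding can be sketched as follows on hypothetical toy values. The helper aggregate_numeric and the AGGREGATIONS table are names invented for this example. Skipping missing values before aggregating is an assumption of the sketch, not a statement about how Declare4Py treats them.

```python
from statistics import mean

# Hypothetical CRP measurements per case; None marks events without a CRP value.
crp_values = {
    "A": [21.0, None, 109.0, 6.0],
    "B": [None, 23.0],
}

AGGREGATIONS = {"min": min, "mean": mean, "max": max}


def aggregate_numeric(values_per_case, name, functions):
    columns = [f"{name}_{function}" for function in functions]
    rows = {}
    for case, values in values_per_case.items():
        # Assumption: missing values are ignored; each case has at least one value.
        observed = [value for value in values if value is not None]
        rows[case] = [AGGREGATIONS[function](observed) for function in functions]
    return columns, rows


columns, rows = aggregate_numeric(crp_values, "CRP", ["min", "mean", "max"])
print(columns)
print(rows)
```

The order of the measurements within a case has no effect on any of these features, which is the defining property of this family.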

Indexed Encodings¶

The Simple-Index Encoding¶

A sequence can also be encoded so that the order in which events occur is taken into account, as in the simple-index encoding. Here, each feature corresponds to a position in the sequence, and its possible values are the event classes. With create_dummies set to True, every pair of position and event class becomes a boolean column indicating whether that event class occurs at that position. This is achieved with the Declare4Py.Encodings.IndexBased.IndexBased class by passing the categorical attributes, setting the create_dummies parameter to True, and setting max_events to an integer less than or equal to the maximum number of events in a trace in the log. If max_events is None, it is set to the maximum number of events in a trace in the log. This parameter determines how many events, counted from the start of each trace, are used for indexing.

[21]:

# max_events is not set, so it defaults to the maximum number of events in a trace in the log.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

[21]:

	concept:name_0_CRP	concept:name_0_ER Registration	concept:name_0_ER Sepsis Triage	concept:name_0_ER Triage	concept:name_0_IV Liquid	concept:name_0_Leucocytes	concept:name_1_CRP	concept:name_1_ER Registration	concept:name_1_ER Sepsis Triage	concept:name_1_ER Triage	...	concept:name_175_Leucocytes	concept:name_176_CRP	concept:name_177_CRP	concept:name_178_Leucocytes	concept:name_179_Leucocytes	concept:name_180_CRP	concept:name_181_Leucocytes	concept:name_182_CRP	concept:name_183_Leucocytes	concept:name_184_Release C
case:concept:name
A	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
AA	False	True	False	False	False	False	False	False	False	True	...	False	False	False	False	False	False	False	False	False	False
AAA	False	True	False	False	False	False	False	False	False	True	...	False	False	False	False	False	False	False	False	False	False
AB	False	True	False	False	False	False	False	False	False	True	...	False	False	False	False	False	False	False	False	False	False
ABA	False	True	False	False	False	False	False	False	False	True	...	False	False	False	False	False	False	False	False	False	False

5 rows × 656 columns

[22]:

# with max_events equal to 2.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], max_events=2, create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

[22]:

	concept:name_0_CRP	concept:name_0_ER Registration	concept:name_0_ER Sepsis Triage	concept:name_0_ER Triage	concept:name_0_IV Liquid	concept:name_0_Leucocytes	concept:name_1_CRP	concept:name_1_ER Registration	concept:name_1_ER Sepsis Triage	concept:name_1_ER Triage	concept:name_1_IV Antibiotics	concept:name_1_IV Liquid	concept:name_1_LacticAcid	concept:name_1_Leucocytes
case:concept:name
A	False	True	False	False	False	False	False	False	False	False	False	False	False	True
AA	False	True	False	False	False	False	False	False	False	True	False	False	False	False
AAA	False	True	False	False	False	False	False	False	False	True	False	False	False	False
AB	False	True	False	False	False	False	False	False	False	True	False	False	False	False
ABA	False	True	False	False	False	False	False	False	False	True	False	False	False	False
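The column naming seen above (attribute, position, value) and the effect of max_events can be reproduced in a small sketch on a hypothetical toy log. simple_index_encode is a name chosen for this example; the code illustrates the concept and is not Declare4Py's implementation.

```python
# Hypothetical toy log: case id -> ordered list of activity names.
traces = {
    "A": ["ER Registration", "Leucocytes", "CRP"],
    "B": ["ER Registration", "ER Triage"],
}


def simple_index_encode(traces, max_events=None, attribute="concept:name"):
    if max_events is None:
        # Default: use the length of the longest trace.
        max_events = max(len(trace) for trace in traces.values())
    # One column per (position, value) pair observed within the first max_events events.
    observed = sorted(
        {(i, value) for trace in traces.values() for i, value in enumerate(trace[:max_events])}
    )
    columns = [f"{attribute}_{i}_{value}" for i, value in observed]
    rows = {
        case: [i < len(trace) and trace[i] == value for i, value in observed]
        for case, trace in traces.items()
    }
    return columns, rows


columns, rows = simple_index_encode(traces, max_events=2)
all_columns, all_rows = simple_index_encode(traces)
print(columns)
print(rows)
```

Traces shorter than max_events simply get False in the columns of the positions they never reach.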

The Complex-Index Encoding¶

The complex-index encoding also takes into account the payload columns passed in the cat_cols or num_cols parameters.

[23]:

encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()

[23]:

	CRP_0	CRP_1	CRP_2	CRP_3	CRP_4	CRP_5	CRP_6	CRP_7	CRP_8	CRP_9	...	org:group_175_B	org:group_176_B	org:group_177_B	org:group_178_B	org:group_179_B	org:group_180_B	org:group_181_B	org:group_182_B	org:group_183_B	org:group_184_E
case:concept:name
A	0.0	0.0	21.0	0.0	0.0	0.0	0.0	0.0	0.0	109.0	...	False	False	False	False	False	False	False	False	False	False
AA	0.0	0.0	0.0	0.0	0.0	23.0	0.0	0.0	0.0	0.0	...	False	False	False	False	False	False	False	False	False	False
AAA	0.0	0.0	0.0	0.0	0.0	68.0	0.0	0.0	0.0	0.0	...	False	False	False	False	False	False	False	False	False	False
AB	0.0	0.0	0.0	48.0	0.0	0.0	0.0	0.0	0.0	0.0	...	False	False	False	False	False	False	False	False	False	False
ABA	0.0	0.0	0.0	0.0	0.0	0.0	78.0	0.0	0.0	0.0	...	False	False	False	False	False	False	False	False	False	False

5 rows × 1400 columns
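For a numeric payload attribute, the indexed layout yields one column per position, as in the CRP_0, CRP_1, ... columns above. The sketch below reproduces that layout on a hypothetical toy log. Filling missing positions with 0.0 mirrors what the output above displays, but it is an assumption of this example, and complex_index_numeric is not a Declare4Py function. Categorical payload columns would be handled like the activity name in the simple-index case.

```python
# Hypothetical toy log: case id -> ordered list of (activity, CRP value) pairs.
traces = {
    "A": [("ER Registration", None), ("CRP", 21.0)],
    "B": [("ER Registration", None), ("ER Triage", None), ("CRP", 48.0)],
}


def complex_index_numeric(traces, name="CRP", fill=0.0):
    max_events = max(len(trace) for trace in traces.values())
    columns = [f"{name}_{i}" for i in range(max_events)]
    rows = {}
    for case, trace in traces.items():
        values = [value for _activity, value in trace]
        # Pad short traces, then replace missing values with the fill value.
        values += [None] * (max_events - len(values))
        rows[case] = [fill if value is None else value for value in values]
    return columns, rows


columns, rows = complex_index_numeric(traces)
print(columns)
print(rows)
```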

Static Encodings¶

In a static encoding, only a single snapshot of the data is used. Therefore, the size of the feature vector is proportional to the number of event attributes and is fixed throughout the execution of a case.

Using the last-state abstraction, only one value (e.g., the last snapshot) of each data attribute is available. Here, the numeric attributes are added to the feature vector “as is”, while one-hot encoding is applied to each categorical attribute.

The First-State Encoding¶

In the first-state encoding, only the information (control flow and payload) of the first event is retained. This is achieved with the Declare4Py.Encodings.Static.Static class by passing the categorical and numeric attributes.

[24]:

encoder = Static(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

[24]:

	CRP	concept:name_CRP	concept:name_ER Registration	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Liquid	concept:name_Leucocytes	org:group_A	org:group_B	org:group_C	org:group_L
case:concept:name
A	21.0	False	True	False	False	False	False	True	False	False	False
AA	23.0	False	True	False	False	False	False	True	False	False	False
AAA	68.0	False	True	False	False	False	False	True	False	False	False
AB	48.0	False	True	False	False	False	False	True	False	False	False
ABA	78.0	False	True	False	False	False	False	True	False	False	False

The Second-to-Last-State Encoding¶

In the second-to-last-state encoding, only the information (control flow and payload) of the second-to-last event is retained. This is achieved with the Declare4Py.Encodings.PreviousState.PreviousState class by passing the categorical and numeric attributes.

[25]:

encoder = PreviousState(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

[25]:

	CRP	concept:name_Admission NC	concept:name_CRP	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Antibiotics	concept:name_IV Liquid	concept:name_LacticAcid	concept:name_Leucocytes	concept:name_Release A	...	org:group_M	org:group_N	org:group_O	org:group_P	org:group_Q	org:group_R	org:group_S	org:group_T	org:group_U	org:group_V
case:concept:name
A	0.0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
AA	0.0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
AAA	0.0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
AB	0.0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
ABA	0.0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 36 columns

The Last-State Encoding¶

In the last-state encoding, only the information (control flow and payload) of the last event is retained. This is achieved with the Declare4Py.Encodings.LastState.LastState class by passing the categorical and numeric attributes.

[26]:

encoder = LastState(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()

[26]:

	CRP	concept:name_Admission NC	concept:name_CRP	concept:name_ER Sepsis Triage	concept:name_ER Triage	concept:name_IV Antibiotics	concept:name_IV Liquid	concept:name_LacticAcid	concept:name_Leucocytes	concept:name_Release A	...	org:group_B	org:group_C	org:group_D	org:group_E	org:group_F	org:group_G	org:group_I	org:group_L	org:group_R	org:group_V
case:concept:name
A	6.0	False	False	False	False	False	False	False	False	True	...	False	False	False	True	False	False	False	False	False	False
AA	23.0	False	False	False	False	True	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
AAA	68.0	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
AB	48.0	False	False	False	False	True	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
ABA	140.0	False	False	False	False	False	False	False	False	True	...	False	False	False	True	False	False	False	False	False	False

5 rows × 27 columns
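Conceptually, the three static encodings differ only in which event of the trace is used as the snapshot. The sketch below makes that explicit on a hypothetical toy log: the numeric attribute is copied as is and the categorical attribute is one-hot encoded. snapshot_encode is a name invented for this example, it assumes every trace has at least two events, and it takes the values literally from the selected event, so it does not attempt to reproduce how Declare4Py handles missing numeric values.

```python
# Hypothetical toy log: case id -> ordered list of events.
traces = {
    "A": [
        {"concept:name": "ER Registration", "CRP": None},
        {"concept:name": "CRP", "CRP": 21.0},
        {"concept:name": "Release A", "CRP": None},
    ],
    "B": [
        {"concept:name": "ER Registration", "CRP": None},
        {"concept:name": "IV Antibiotics", "CRP": None},
    ],
}


def snapshot_encode(traces, position, cat_col="concept:name", num_col="CRP"):
    # Keep a single event per case: 0 = first, -2 = second-to-last, -1 = last.
    picked = {case: trace[position] for case, trace in traces.items()}
    values = sorted({event[cat_col] for event in picked.values()})
    columns = [num_col] + [f"{cat_col}_{value}" for value in values]
    rows = {
        case: [event[num_col]] + [event[cat_col] == value for value in values]
        for case, event in picked.items()
    }
    return columns, rows


first = snapshot_encode(traces, 0)
second_to_last = snapshot_encode(traces, -2)
last = snapshot_encode(traces, -1)
print(last)
```

Because only the values present in the selected events become columns, the three snapshots generally produce different column sets, which matches the differing column counts in the outputs above.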

The Ngram Encoding¶

[27]:

encoder = Ngram(case_id_col=case_id_key, n=2, v=0.7, act_col='concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()

[27]:

	Admission NC\|CRP	ER Triage\|IV Liquid	ER Sepsis Triage\|CRP	CRP\|IV Liquid	IV Liquid\|LacticAcid	CRP\|Release B	Admission NC\|Release A	ER Triage\|ER Registration	LacticAcid\|CRP	Admission NC\|LacticAcid	...	Release D\|Return ER	IV Liquid\|Admission IC	Admission NC\|IV Antibiotics	CRP\|Release D	LacticAcid\|Release B	ER Triage\|LacticAcid	Admission IC\|LacticAcid	Leucocytes\|Release E	ER Sepsis Triage\|Leucocytes	CRP\|LacticAcid
case:concept:name
A	1.188124	0.49000	0.407527	0.2401	0.000	0.0	0.009689	0.0	0.199688	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.0	0.0	0.381729	0.70
B	1.190000	0.16807	0.408170	0.2401	0.000	0.0	0.343000	0.0	0.200003	0.0	...	0.0	0.0	0.0	0.0	0.0	0.49000	0.0	0.0	0.000000	0.70
C	1.241170	0.24010	0.575896	0.7000	0.000	0.0	0.285719	0.0	0.000000	0.0	...	0.0	0.0	0.0	0.0	0.0	0.00000	0.0	0.0	0.822708	0.00
D	0.490000	0.16807	0.757648	0.3430	0.000	0.0	0.343000	0.0	0.117649	0.0	...	0.0	0.0	0.0	0.0	0.0	0.34300	0.0	0.0	0.425354	0.70
E	0.000000	0.49000	0.490000	0.0000	0.343	0.0	0.000000	0.0	0.000000	0.0	...	0.0	0.0	0.0	0.0	0.0	0.16807	0.0	0.0	0.343000	0.49

5 rows × 115 columns