[15]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

Log Encodings

Declare4Py provides several of the main encoding techniques for vectorizing an event log, which is useful for applying Machine Learning techniques. The encoding classes provided by Declare4Py (see the Declare4Py.Encodings package) take as input a log in Pandas dataframe format and return a Pandas dataframe in which each row represents a single trace and each column an extracted feature. The Declare4Py encodings are implemented as scikit-learn transformers, so it is straightforward to use them in a Machine Learning pipeline.

The tutorial will cover the following points:

  1. Encoding families:

    1. The boolean encoding;

    2. The frequency-based encoding;

    3. Aggregated encodings;

    4. Indexed encodings:

      1. The simple-index encoding;

      2. The complex-index encoding;

    5. Static Encodings:

      1. The first-state encoding;

      2. The second-to-last-state encoding;

      3. The last-state encoding;

    6. The Ngram encoding;

    7. The Declare encoding;

  2. Encoding combinations:

    1. The index-latest-payload encoding;

  3. A Machine Learning pipeline.

Before using the encodings, the necessary packages need to be imported.

[1] [2] [3] [4]

[16]:
import os
import pandas as pd


from Declare4Py.Encodings.Aggregate import Aggregate
from Declare4Py.Encodings.IndexBased import IndexBased
from Declare4Py.Encodings.Static import Static
from Declare4Py.Encodings.PreviousState import PreviousState
from Declare4Py.Encodings.LastState import LastState
from Declare4Py.Encodings.Ngram import Ngram
from Declare4Py.Encodings.Declare import Declare

The Encodings classes take logs in Pandas dataframe format as input. Therefore, we import the event log and convert it into a Pandas dataframe.

[17]:
from Declare4Py.D4PyEventLog import D4PyEventLog

log_path = os.path.join("../../../", "tests", "test_logs", "Sepsis Cases.xes.gz")
event_log = D4PyEventLog(case_name="case:concept:name")
event_log.parse_xes_log(log_path)
case_id_key = event_log.get_case_name()
event_log.to_dataframe()
df = event_log.log
df.head()
[17]:
InfectionSuspected org:group DiagnosticBlood DisfuncOrg SIRSCritTachypnea Hypotensie SIRSCritHeartRate Infusion DiagnosticArtAstrup concept:name ... DiagnosticLacticAcid lifecycle:transition Diagnose Hypoxie DiagnosticUrinarySediment DiagnosticECG Leucocytes CRP LacticAcid case:concept:name
0 True A True True True True True True True ER Registration ... True complete A False True True NaN NaN NaN A
1 NaN B NaN NaN NaN NaN NaN NaN NaN Leucocytes ... NaN complete NaN NaN NaN NaN 9.6 NaN NaN A
2 NaN B NaN NaN NaN NaN NaN NaN NaN CRP ... NaN complete NaN NaN NaN NaN NaN 21.0 NaN A
3 NaN B NaN NaN NaN NaN NaN NaN NaN LacticAcid ... NaN complete NaN NaN NaN NaN NaN NaN 2.2 A
4 NaN C NaN NaN NaN NaN NaN NaN NaN ER Triage ... NaN complete NaN NaN NaN NaN NaN NaN NaN A

5 rows × 32 columns

Encoding families

A Declare4Py encoding is implemented as a scikit-learn transformer class: you just need to instantiate the corresponding encoder object and call the function fit_transform(df) on the input dataframe. The names of the features can be retrieved with the get_feature_names() function.
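
To make the transformer contract concrete, here is a minimal sketch of a scikit-learn style encoder written from scratch. The class ToyFrequencyEncoder, its parameters and the two toy logs are hypothetical and are not part of Declare4Py; the sketch only illustrates why fit and transform are separate steps, namely so that a log encoded later gets exactly the columns learned at fit time.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ToyFrequencyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder for illustration only (not a Declare4Py class)."""

    def __init__(self, case_id_col, act_col):
        self.case_id_col = case_id_col
        self.act_col = act_col

    def fit(self, X, y=None):
        # Remember the activities seen at fit time so that transform()
        # always returns the same columns, in the same order.
        self.columns_ = sorted(X[self.act_col].unique())
        return self

    def transform(self, X):
        # One row per case, one column per activity, values are counts.
        counts = pd.crosstab(X[self.case_id_col], X[self.act_col])
        return counts.reindex(columns=self.columns_, fill_value=0)

    def get_feature_names(self):
        return list(self.columns_)


train = pd.DataFrame({"case": ["c1", "c1", "c2"], "activity": ["A", "B", "A"]})
test = pd.DataFrame({"case": ["c3", "c3"], "activity": ["B", "C"]})

encoder = ToyFrequencyEncoder(case_id_col="case", act_col="activity")
train_enc = encoder.fit_transform(train)  # fit_transform comes from TransformerMixin
test_enc = encoder.transform(test)        # activity "C" was unseen at fit time and is dropped

print(encoder.get_feature_names())
print(test_enc)
```

Because the class follows the fit/transform convention, an object like this could be placed as a step in a scikit-learn Pipeline in the same way as the Declare4Py encoders.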

The Boolean Encoding

In the boolean encoding, sequences of events are represented as feature vectors in such a way that each feature corresponds to an event class (an activity) from the log and indicates whether that event class occurs in the trace. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by passing the categorical attributes and setting the boolean parameter to True.

[18]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=True)
enc_df = encoder.fit_transform(df)

print(f"Log features:\n {encoder.get_feature_names()}")
enc_df.head()
Log features:
 Index(['concept:name_Admission IC', 'concept:name_Admission NC',
       'concept:name_CRP', 'concept:name_ER Registration',
       'concept:name_ER Sepsis Triage', 'concept:name_ER Triage',
       'concept:name_IV Antibiotics', 'concept:name_IV Liquid',
       'concept:name_LacticAcid', 'concept:name_Leucocytes',
       'concept:name_Release A', 'concept:name_Release B',
       'concept:name_Release C', 'concept:name_Release D',
       'concept:name_Release E', 'concept:name_Return ER', 'org:group_?',
       'org:group_A', 'org:group_B', 'org:group_C', 'org:group_D',
       'org:group_E', 'org:group_F', 'org:group_G', 'org:group_H',
       'org:group_I', 'org:group_J', 'org:group_K', 'org:group_L',
       'org:group_M', 'org:group_N', 'org:group_O', 'org:group_P',
       'org:group_Q', 'org:group_R', 'org:group_S', 'org:group_T',
       'org:group_U', 'org:group_V', 'org:group_W', 'org:group_X',
       'org:group_Y'],
      dtype='object')
[18]:
concept:name_Admission IC concept:name_Admission NC concept:name_CRP concept:name_ER Registration concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Antibiotics concept:name_IV Liquid concept:name_LacticAcid concept:name_Leucocytes ... org:group_P org:group_Q org:group_R org:group_S org:group_T org:group_U org:group_V org:group_W org:group_X org:group_Y
case:concept:name
A False True True True True True True True True True ... False False False False False False False False False False
AA False False True True True True True True True True ... False False False False False False False False False False
AAA False True True True True True True True True True ... False False False False False False False False False False
AB False False True True True True True True True True ... False False False False False False False False False False
ABA False True True True True True True True True True ... False False False False False False False False False False

5 rows × 42 columns

The Frequency-Based Encoding

The frequency-based encoding, instead of boolean values, represents the control flow in a case with the frequency of each event class in the case. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by setting the categorical attributes and the boolean parameter to False.

[19]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=False)
enc_df = encoder.fit_transform(df)
enc_df.head()
[19]:
concept:name_Admission IC concept:name_Admission NC concept:name_CRP concept:name_ER Registration concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Antibiotics concept:name_IV Liquid concept:name_LacticAcid concept:name_Leucocytes ... org:group_P org:group_Q org:group_R org:group_S org:group_T org:group_U org:group_V org:group_W org:group_X org:group_Y
case:concept:name
A 0 1 7 1 1 1 1 1 1 7 ... 0 0 0 0 0 0 0 0 0 0
AA 0 0 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
AAA 0 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
AB 0 0 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
ABA 0 1 4 1 1 1 1 1 1 5 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 42 columns
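
The relationship between the frequency-based and the boolean encoding can be reproduced on a toy log with plain pandas. This is only a sketch of the idea, not the Declare4Py implementation; the toy log is made up, and the column prefix simply mimics the feature names shown above.

```python
import pandas as pd

# Made-up toy log with two cases.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Registration", "CRP", "CRP", "ER Registration", "ER Triage"],
})

# Frequency-based encoding: how often each activity occurs in each case.
frequency_enc = pd.crosstab(
    toy_log["case:concept:name"], toy_log["concept:name"]
).add_prefix("concept:name_")

# Boolean encoding: whether each activity occurs at all in each case.
boolean_enc = frequency_enc > 0

print(frequency_enc)
print(boolean_enc)
```

The boolean encoding is therefore a thresholded version of the frequency-based one: it discards how many times an activity was repeated.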

The Aggregated Encoding

The aggregated encoding considers all events since the beginning of the case, but ignores the order of the events. In this case, several aggregation functions can be applied to the values that an event attribute has taken throughout the case. This is achieved with the Declare4Py.Encodings.Aggregate.Aggregate class by passing the categorical attributes and the numeric attributes, setting the boolean parameter to False, and passing a list of functions used to aggregate the numeric attributes, e.g., ‘mean’, ‘max’, ‘min’, ‘sum’, ‘std’.

[20]:
encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], boolean=False, aggregation_functions=['min', 'mean', 'max'])
enc_df = encoder.fit_transform(df)
enc_df.head()
[20]:
concept:name_Admission IC concept:name_Admission NC concept:name_CRP concept:name_ER Registration concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Antibiotics concept:name_IV Liquid concept:name_LacticAcid concept:name_Leucocytes ... org:group_S org:group_T org:group_U org:group_V org:group_W org:group_X org:group_Y CRP_min CRP_mean CRP_max
case:concept:name
A 0 1 7 1 1 1 1 1 1 7 ... 0 0 0 0 0 0 0 6.0 30.857143 109.0
AA 0 0 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 23.0 23.000000 23.0
AAA 0 1 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 68.0 68.000000 68.0
AB 0 0 1 1 1 1 1 1 1 1 ... 0 0 0 0 0 0 0 48.0 48.000000 48.0
ABA 0 1 4 1 1 1 1 1 1 5 ... 0 0 0 0 0 0 0 78.0 105.000000 140.0

5 rows × 45 columns
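
The numeric part of the aggregated encoding corresponds to a group-by aggregation over the events of each case. The following sketch reproduces columns in the style of CRP_min, CRP_mean and CRP_max on a made-up toy log with plain pandas; it illustrates the idea and is not the Declare4Py implementation.

```python
import pandas as pd

# Made-up toy log: CRP is only recorded for some events.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2"],
    "CRP": [21.0, float("nan"), 9.0, 48.0],
})

# Aggregate the numeric attribute over all events of a case.
# Missing values are skipped by the pandas aggregation functions.
agg = toy_log.groupby("case:concept:name")["CRP"].agg(["min", "mean", "max"])
agg.columns = [f"CRP_{name}" for name in agg.columns]

print(agg)
```

As with the output above, a case with a single recorded value gets identical minimum, mean and maximum.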

Indexed Encodings

The Simple-Index Encoding

Another way of encoding a sequence is to also take into account the order in which events occur in the sequence, as in the simple-index encoding. Here, each feature corresponds to a position in the sequence, and the possible values for each feature are the event classes. With create_dummies set to True, each position is one-hot encoded, so a feature indicates the presence of an event class at that position. This is achieved with the Declare4Py.Encodings.IndexBased.IndexBased class by passing the categorical attributes, setting the create_dummies parameter to True and setting max_events to an integer value lower than or equal to the maximum number of events in a trace in the log. If max_events is None, it is set to the maximum number of events in a trace in the log. This parameter determines how many of the first events of each trace are used for indexing.

[21]:
# max_events is not set, so it defaults to the maximum number of events in a trace in the log.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()
[21]:
concept:name_0_CRP concept:name_0_ER Registration concept:name_0_ER Sepsis Triage concept:name_0_ER Triage concept:name_0_IV Liquid concept:name_0_Leucocytes concept:name_1_CRP concept:name_1_ER Registration concept:name_1_ER Sepsis Triage concept:name_1_ER Triage ... concept:name_175_Leucocytes concept:name_176_CRP concept:name_177_CRP concept:name_178_Leucocytes concept:name_179_Leucocytes concept:name_180_CRP concept:name_181_Leucocytes concept:name_182_CRP concept:name_183_Leucocytes concept:name_184_Release C
case:concept:name
A False True False False False False False False False False ... False False False False False False False False False False
AA False True False False False False False False False True ... False False False False False False False False False False
AAA False True False False False False False False False True ... False False False False False False False False False False
AB False True False False False False False False False True ... False False False False False False False False False False
ABA False True False False False False False False False True ... False False False False False False False False False False

5 rows × 656 columns

[22]:
# with max_events equal to 2.
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], max_events=2, create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()
[22]:
concept:name_0_CRP concept:name_0_ER Registration concept:name_0_ER Sepsis Triage concept:name_0_ER Triage concept:name_0_IV Liquid concept:name_0_Leucocytes concept:name_1_CRP concept:name_1_ER Registration concept:name_1_ER Sepsis Triage concept:name_1_ER Triage concept:name_1_IV Antibiotics concept:name_1_IV Liquid concept:name_1_LacticAcid concept:name_1_Leucocytes
case:concept:name
A False True False False False False False False False False False False False True
AA False True False False False False False False False True False False False False
AAA False True False False False False False False False True False False False False
AB False True False False False False False False False True False False False False
ABA False True False False False False False False False True False False False False
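
The simple-index encoding can be sketched with plain pandas: number the events inside each trace, keep the first max_events positions, spread them into one column per position and one-hot encode the result. The toy log below is made up and the code only mimics the column naming shown above; it is not the Declare4Py implementation.

```python
import pandas as pd

# Made-up toy log, assumed to be already sorted by time within each case.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Registration", "Leucocytes", "CRP", "ER Registration", "ER Triage"],
})

max_events = 2

# Position of each event inside its trace (0, 1, 2, ...), in log order.
toy_log["position"] = toy_log.groupby("case:concept:name").cumcount()
first_events = toy_log[toy_log["position"] < max_events]

# One column per position, holding the activity found at that position.
wide = first_events.pivot(
    index="case:concept:name", columns="position", values="concept:name"
)
wide.columns = [f"concept:name_{i}" for i in wide.columns]

# One-hot encode every position (the role played by create_dummies=True).
simple_index = pd.get_dummies(wide, columns=list(wide.columns), prefix_sep="_")

print(simple_index)
```

The number of columns grows with both the number of positions and the number of activities seen at each position, which is why the unrestricted encoding above is so wide.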

The Complex-Index Encoding

The complex-index encoding also takes into account payload columns, passed in the cat_cols or num_cols parameters.

[23]:
encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], create_dummies=True)
enc_df = encoder.fit_transform(df)
enc_df.head()
[23]:
CRP_0 CRP_1 CRP_2 CRP_3 CRP_4 CRP_5 CRP_6 CRP_7 CRP_8 CRP_9 ... org:group_175_B org:group_176_B org:group_177_B org:group_178_B org:group_179_B org:group_180_B org:group_181_B org:group_182_B org:group_183_B org:group_184_E
case:concept:name
A 0.0 0.0 21.0 0.0 0.0 0.0 0.0 0.0 0.0 109.0 ... False False False False False False False False False False
AA 0.0 0.0 0.0 0.0 0.0 23.0 0.0 0.0 0.0 0.0 ... False False False False False False False False False False
AAA 0.0 0.0 0.0 0.0 0.0 68.0 0.0 0.0 0.0 0.0 ... False False False False False False False False False False
AB 0.0 0.0 0.0 48.0 0.0 0.0 0.0 0.0 0.0 0.0 ... False False False False False False False False False False
ABA 0.0 0.0 0.0 0.0 0.0 0.0 78.0 0.0 0.0 0.0 ... False False False False False False False False False False

5 rows × 1400 columns
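
For a numeric payload attribute, the index-based layout amounts to one column per position holding the value recorded at that position. The sketch below builds CRP_0, CRP_1, ... columns on a made-up toy log with plain pandas. Filling missing values with 0.0 is an assumption made here to mirror the output above; it is not confirmed to be how Declare4Py handles them.

```python
import pandas as pd

# Made-up toy log: CRP is only recorded for the CRP events.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Registration", "CRP", "CRP", "ER Registration", "CRP"],
    "CRP": [float("nan"), 21.0, 9.0, float("nan"), 48.0],
})

# Position of each event inside its trace.
toy_log["position"] = toy_log.groupby("case:concept:name").cumcount()

# One column per position with the CRP value recorded at that position.
# Assumption: positions without a value are filled with 0.0.
crp_wide = toy_log.pivot(
    index="case:concept:name", columns="position", values="CRP"
).fillna(0.0)
crp_wide.columns = [f"CRP_{i}" for i in crp_wide.columns]

print(crp_wide)
```

Note that filling with zero makes a missing measurement indistinguishable from a measured value of zero, which may or may not matter for the downstream model.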

Static Encodings

In a static encoding, only an available snapshot of the data is used. Therefore, the size of the feature vector is proportional to the number of event attributes and is fixed throughout the execution of a case.

Using the last-state abstraction, only one value (e.g., the last snapshot) of each data attribute is available. Here, the numeric attributes are added to the feature vector “as is”, while one-hot encoding is applied to each categorical attribute.

The First-State Encoding

In the first-state encoding only the information (control flow and payload) of the first event is retained. This is achieved with the Declare4Py.Encodings.Static.Static class by setting the categorical and numeric attributes.

[24]:
encoder = Static(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()
[24]:
CRP concept:name_CRP concept:name_ER Registration concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Liquid concept:name_Leucocytes org:group_A org:group_B org:group_C org:group_L
case:concept:name
A 21.0 False True False False False False True False False False
AA 23.0 False True False False False False True False False False
AAA 68.0 False True False False False False True False False False
AB 48.0 False True False False False False True False False False
ABA 78.0 False True False False False False True False False False
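
In the output above, case A has a CRP value of 21.0 although its first event (ER Registration) has no CRP value in the dataframe shown earlier. This is consistent with taking the first non-null value of each attribute rather than the literal first row, although the tutorial does not state which of the two is implemented. The following sketch shows the difference between the two readings on a made-up toy log with plain pandas.

```python
import pandas as pd

# Made-up toy log: the first event of each case has no CRP value.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Registration", "CRP", "CRP", "ER Registration", "ER Triage"],
    "CRP": [float("nan"), 21.0, 9.0, float("nan"), float("nan")],
})

# Reading 1: the literal first event of every case.
first_rows = toy_log.groupby("case:concept:name").head(1).set_index("case:concept:name")

# Reading 2: the first non-null value of every attribute within the case.
first_non_null = toy_log.groupby("case:concept:name").first()

print(first_rows)
print(first_non_null)
```

Both readings agree on the activity of the first event; they differ only for attributes that are missing in the first event and recorded later.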

The Second-to-Last-State Encoding

In the second-to-last-state encoding only the information (control flow and payload) of the second-to-last event is retained. This is achieved with the Declare4Py.Encodings.PreviousState.PreviousState class by setting the categorical and numeric attributes.

[25]:
encoder = PreviousState(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()
[25]:
CRP concept:name_Admission NC concept:name_CRP concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Antibiotics concept:name_IV Liquid concept:name_LacticAcid concept:name_Leucocytes concept:name_Release A ... org:group_M org:group_N org:group_O org:group_P org:group_Q org:group_R org:group_S org:group_T org:group_U org:group_V
case:concept:name
A 0.0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AA 0.0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAA 0.0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AB 0.0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
ABA 0.0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 36 columns

The Last-State Encoding

In the last-state encoding only the information (control flow and payload) of the last event is retained. This is achieved with the Declare4Py.Encodings.LastState.LastState class by setting the categorical and numeric attributes.

[26]:
encoder = LastState(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'])
enc_df = encoder.fit_transform(df)
enc_df.head()
[26]:
CRP concept:name_Admission NC concept:name_CRP concept:name_ER Sepsis Triage concept:name_ER Triage concept:name_IV Antibiotics concept:name_IV Liquid concept:name_LacticAcid concept:name_Leucocytes concept:name_Release A ... org:group_B org:group_C org:group_D org:group_E org:group_F org:group_G org:group_I org:group_L org:group_R org:group_V
case:concept:name
A 6.0 False False False False False False False False True ... False False False True False False False False False False
AA 23.0 False False False False True False False False False ... False False False False False False False False False False
AAA 68.0 False False False False False False False False False ... False False False False False False False False False False
AB 48.0 False False False False True False False False False ... False False False False False False False False False False
ABA 140.0 False False False False False False False False True ... False False False True False False False False False False

5 rows × 27 columns
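
Selecting the last or the second-to-last event of each case can be sketched with plain pandas by counting events from the end of the trace. The toy log is made up and the code illustrates only the selection step (no one-hot encoding); it is not the Declare4Py implementation.

```python
import pandas as pd

# Made-up toy log, assumed to be sorted by time within each case.
# Case c3 has a single event.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "concept:name": [
        "ER Registration", "CRP", "Release A",
        "ER Registration", "ER Triage",
        "ER Registration",
    ],
})

# Distance of each event from the end of its trace: 0 is the last event,
# 1 the second-to-last event, and so on.
from_end = toy_log.groupby("case:concept:name").cumcount(ascending=False)

last_state = toy_log[from_end == 0].set_index("case:concept:name")
second_to_last = toy_log[from_end == 1].set_index("case:concept:name")

print(last_state)
print(second_to_last)
```

A trace with a single event has no second-to-last event, so case c3 is absent from the second result; an encoder has to decide how to represent such traces.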

The Ngram Encoding

[27]:
encoder = Ngram(case_id_col=case_id_key, n=2, v=0.7, act_col='concept:name')
enc_df = encoder.fit_transform(df)
enc_df.head()
[27]:
Admission NC|CRP ER Triage|IV Liquid ER Sepsis Triage|CRP CRP|IV Liquid IV Liquid|LacticAcid CRP|Release B Admission NC|Release A ER Triage|ER Registration LacticAcid|CRP Admission NC|LacticAcid ... Release D|Return ER IV Liquid|Admission IC Admission NC|IV Antibiotics CRP|Release D LacticAcid|Release B ER Triage|LacticAcid Admission IC|LacticAcid Leucocytes|Release E ER Sepsis Triage|Leucocytes CRP|LacticAcid
case:concept:name
A 1.188124 0.49000 0.407527 0.2401 0.000 0.0 0.009689 0.0 0.199688 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.381729 0.70
B 1.190000 0.16807 0.408170 0.2401 0.000 0.0 0.343000 0.0 0.200003 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.49000 0.0 0.0 0.000000 0.70
C 1.241170 0.24010 0.575896 0.7000 0.000 0.0 0.285719 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.822708 0.00
D 0.490000 0.16807 0.757648 0.3430 0.000 0.0 0.343000 0.0 0.117649 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.34300 0.0 0.0 0.425354 0.70
E 0.000000 0.49000 0.490000 0.0000 0.343 0.0 0.000000 0.0 0.000000 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.16807 0.0 0.0 0.343000 0.49

5 rows × 115 columns
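
As a point of comparison, the simplest form of an n-gram encoding counts the contiguous pairs of activities (bigrams, n=2) in each trace. The sketch below does this on a made-up toy log with plain pandas, naming the features with the same "A|B" pattern as above. It is deliberately not a reimplementation of the Ngram class: the output above contains fractional values, so the class evidently applies some weighting (presumably governed by the v parameter) that this sketch does not attempt to reproduce.

```python
import pandas as pd

# Made-up toy log, assumed to be sorted by time within each case.
toy_log = pd.DataFrame({
    "case:concept:name": ["c1", "c1", "c1", "c2", "c2"],
    "concept:name": ["ER Registration", "CRP", "CRP", "ER Registration", "ER Triage"],
})

# Collect every pair of consecutive activities, case by case.
rows = []
for case_id, activities in toy_log.groupby("case:concept:name")["concept:name"]:
    acts = list(activities)
    for first, second in zip(acts, acts[1:]):
        rows.append((case_id, f"{first}|{second}"))

pairs = pd.DataFrame(rows, columns=["case:concept:name", "bigram"])

# One row per case, one column per bigram, values are plain counts.
bigram_counts = pd.crosstab(pairs["case:concept:name"], pairs["bigram"])

print(bigram_counts)
```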