{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Log Encodings\n", "\n", "Declare4Py provides several of the main encoding techniques for vectorizing a log of traces. These encodings are useful for applying Machine Learning techniques. The encoding classes provided by Declare4Py (see the `Declare4Py.Encodings` package) take as input a log in Pandas dataframe format and return a Pandas dataframe in which each row represents a single trace and each column an extracted feature. The Declare4Py encodings are implemented as scikit-learn transformers, so it is straightforward to use them in a Machine Learning pipeline.\n", "\n", "The tutorial will cover the following points:\n", "\n", "1. Encoding families:\n", " 1. The boolean encoding;\n", " 2. The frequency-based encoding;\n", " 3. Aggregated encodings;\n", " 4. Indexed encodings:\n", " 1. The simple-index encoding;\n", " 2. The complex-index encoding;\n", " 5. Static encodings:\n", " 1. The first-state encoding;\n", " 2. The second-to-last-state encoding;\n", " 3. The last-state encoding;\n", " 6. The Ngram encoding;\n", " 7. The Declare encoding;\n", "2. Encoding combinations:\n", " 1. The index-latest-payload encoding;\n", "3. 
A Machine Learning pipeline.\n", "\n", "Before using the encodings, the necessary packages need to be imported." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.11/dist-packages/lark/utils.py:163: DeprecationWarning: module 'sre_parse' is deprecated\n", " import sre_parse\n", "/usr/local/lib/python3.11/dist-packages/lark/utils.py:164: DeprecationWarning: module 'sre_constants' is deprecated\n", " import sre_constants\n" ] } ], "source": [ "import os\n", "import pm4py\n", "import pandas as pd\n", "\n", "\n", "from Declare4Py.Encodings.Aggregate import Aggregate\n", "from Declare4Py.Encodings.IndexBased import IndexBased\n", "from Declare4Py.Encodings.Static import Static\n", "from Declare4Py.Encodings.PreviousState import PreviousState\n", "from Declare4Py.Encodings.LastState import LastState\n", "from Declare4Py.Encodings.Ngram import Ngram\n", "from Declare4Py.Encodings.Declare import Declare" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Encodings` classes take logs in Pandas dataframe format as input. Therefore, we import the event log and convert it into a Pandas dataframe." 
 ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "045e2ff8f482454cbcca76e01f8f72f7", "version_major": 2, "version_minor": 0 }, "text/plain": [ "parsing log, completed traces :: 0%| | 0/1050 [00:00" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ " InfectionSuspected org:group DiagnosticBlood DisfuncOrg SIRSCritTachypnea \\\n", "0 True A True True True \n", "1 NaN B NaN NaN NaN \n", "2 NaN B NaN NaN NaN \n", "3 NaN B NaN NaN NaN \n", "4 NaN C NaN NaN NaN \n", "\n", " Hypotensie SIRSCritHeartRate Infusion DiagnosticArtAstrup concept:name \\\n", "0 True True True True ER Registration \n", "1 NaN NaN NaN NaN Leucocytes \n", "2 NaN NaN NaN NaN CRP \n", "3 NaN NaN NaN NaN LacticAcid \n", "4 NaN NaN NaN NaN ER Triage \n", "\n", " ... DiagnosticLacticAcid lifecycle:transition Diagnose Hypoxie \\\n", "0 ... True complete A False \n", "1 ... NaN complete NaN NaN \n", "2 ... NaN complete NaN NaN \n", "3 ... NaN complete NaN NaN \n", "4 ... NaN complete NaN NaN \n", "\n", " DiagnosticUrinarySediment DiagnosticECG Leucocytes CRP LacticAcid \\\n", "0 True True NaN NaN NaN \n", "1 NaN NaN 9.6 NaN NaN \n", "2 NaN NaN NaN 21.0 NaN \n", "3 NaN NaN NaN NaN 2.2 \n", "4 NaN NaN NaN NaN NaN \n", "\n", " case:concept:name \n", "0 A \n", "1 A \n", "2 A \n", "3 A \n", "4 A \n", "\n", "[5 rows x 32 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from Declare4Py.D4PyEventLog import D4PyEventLog\n", "\n", "log_path = os.path.join(\"../../../\", \"tests\", \"test_logs\", \"Sepsis Cases.xes.gz\")\n", "event_log = D4PyEventLog(case_name=\"case:concept:name\")\n", "event_log.parse_xes_log(log_path)\n", "case_id_key = event_log.get_case_name()\n", "event_log.to_dataframe()\n", "df = event_log.log\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding families\n", "\n", "A Declare4Py encoding is implemented as a scikit-learn transformer class: you just need to instantiate the corresponding `encoder` object and call the function `fit_transform(df)` on the input dataframe. 
The feature names can be retrieved with the `get_feature_names()` function.\n", "\n", "### The Boolean Encoding\n", "\n", "In the __boolean encoding__, sequences of events are represented as feature vectors in which each feature corresponds to an event class (an activity) from the log and records whether that event class occurs in the trace. This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes and the `boolean` parameter to `True`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Log features:\n", " Index(['concept:name_Admission IC', 'concept:name_Admission NC',\n", " 'concept:name_CRP', 'concept:name_ER Registration',\n", " 'concept:name_ER Sepsis Triage', 'concept:name_ER Triage',\n", " 'concept:name_IV Antibiotics', 'concept:name_IV Liquid',\n", " 'concept:name_LacticAcid', 'concept:name_Leucocytes',\n", " 'concept:name_Release A', 'concept:name_Release B',\n", " 'concept:name_Release C', 'concept:name_Release D',\n", " 'concept:name_Release E', 'concept:name_Return ER', 'org:group_?',\n", " 'org:group_A', 'org:group_B', 'org:group_C', 'org:group_D',\n", " 'org:group_E', 'org:group_F', 'org:group_G', 'org:group_H',\n", " 'org:group_I', 'org:group_J', 'org:group_K', 'org:group_L',\n", " 'org:group_M', 'org:group_N', 'org:group_O', 'org:group_P',\n", " 'org:group_Q', 'org:group_R', 'org:group_S', 'org:group_T',\n", " 'org:group_U', 'org:group_V', 'org:group_W', 'org:group_X',\n", " 'org:group_Y'],\n", " dtype='object')\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " concept:name_Admission IC concept:name_Admission NC \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 0 \n", "AAA 0 1 \n", "AB 0 0 \n", "ABA 0 1 \n", "\n", " concept:name_CRP concept:name_ER Registration \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_IV Antibiotics concept:name_IV Liquid \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_LacticAcid concept:name_Leucocytes ... \\\n", "case:concept:name ... \n", "A 1 1 ... \n", "AA 1 1 ... \n", "AAA 1 1 ... \n", "AB 1 1 ... \n", "ABA 1 1 ... \n", "\n", " org:group_P org:group_Q org:group_R org:group_S \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_T org:group_U org:group_V org:group_W \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_X org:group_Y \n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", "[5 rows x 42 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=True)\n", "enc_df = encoder.fit_transform(df)\n", "\n", "print(f\"Log features:\\n {encoder.get_feature_names()}\")\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Frequency-Based Encoding\n", "\n", "The __frequency-based encoding__, instead of boolean values, represents the control flow in a case with the frequency of each event class in the case. 
This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes and the `boolean` parameter to `False`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " concept:name_Admission IC concept:name_Admission NC \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 0 \n", "AAA 0 1 \n", "AB 0 0 \n", "ABA 0 1 \n", "\n", " concept:name_CRP concept:name_ER Registration \\\n", "case:concept:name \n", "A 7 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 4 1 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_IV Antibiotics concept:name_IV Liquid \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_LacticAcid concept:name_Leucocytes ... \\\n", "case:concept:name ... \n", "A 1 7 ... \n", "AA 1 1 ... \n", "AAA 1 1 ... \n", "AB 1 1 ... \n", "ABA 1 5 ... \n", "\n", " org:group_P org:group_Q org:group_R org:group_S \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_T org:group_U org:group_V org:group_W \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_X org:group_Y \n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", "[5 rows x 42 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], boolean=False)\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Aggregated Encoding\n", "\n", "The __aggregated encoding__ considers all events since the beginning of the case, but ignores the order of the events. In this case, several aggregation functions can be applied to the values that an event attribute has taken throughout the case. 
This is achieved with the `Declare4Py.Encodings.Aggregate.Aggregate` class by setting the categorical attributes, the numeric attributes, the `boolean` parameter to `False`, and the `aggregation_functions` parameter to a list of functions used to aggregate the numeric attributes, e.g., 'mean', 'max', 'min', 'sum', 'std'." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " concept:name_Admission IC concept:name_Admission NC \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 0 \n", "AAA 0 1 \n", "AB 0 0 \n", "ABA 0 1 \n", "\n", " concept:name_CRP concept:name_ER Registration \\\n", "case:concept:name \n", "A 7 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 4 1 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_IV Antibiotics concept:name_IV Liquid \\\n", "case:concept:name \n", "A 1 1 \n", "AA 1 1 \n", "AAA 1 1 \n", "AB 1 1 \n", "ABA 1 1 \n", "\n", " concept:name_LacticAcid concept:name_Leucocytes ... \\\n", "case:concept:name ... \n", "A 1 7 ... \n", "AA 1 1 ... \n", "AAA 1 1 ... \n", "AB 1 1 ... \n", "ABA 1 5 ... \n", "\n", " org:group_S org:group_T org:group_U org:group_V \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_W org:group_X org:group_Y CRP_min CRP_mean \\\n", "case:concept:name \n", "A 0 0 0 6.0 30.857143 \n", "AA 0 0 0 23.0 23.000000 \n", "AAA 0 0 0 68.0 68.000000 \n", "AB 0 0 0 48.0 48.000000 \n", "ABA 0 0 0 78.0 105.000000 \n", "\n", " CRP_max \n", "case:concept:name \n", "A 109.0 \n", "AA 23.0 \n", "AAA 68.0 \n", "AB 48.0 \n", "ABA 140.0 \n", "\n", "[5 rows x 45 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = Aggregate(case_id_col=case_id_key, cat_cols=['concept:name', 'org:group'], num_cols=['CRP'], boolean=False, aggregation_functions=['min', 'mean', 'max'])\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexed Encodings\n", "\n", "#### The Simple-Index Encoding\n", "\n", "Another way of encoding a sequence is by taking into account also information about the order in which events occur in the sequence, as in the __simple-index 
encoding__. Here, each feature corresponds to a position in the sequence, and the possible values for each feature are the event classes. This is achieved with the `Declare4Py.Encodings.IndexBased.IndexBased` class by setting the categorical attributes, the `create_dummies` parameter to `True` and the `max_events` parameter to an integer value lower than or equal to the maximum number of events in a trace in the log. If `max_events` is `None`, it is set to the maximum number of events in a trace in the log. This parameter determines how many of the first events of each trace are used for indexing." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " concept:name_0_CRP concept:name_0_ER Registration \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 1 \n", "AAA 0 1 \n", "AB 0 1 \n", "ABA 0 1 \n", "\n", " concept:name_0_ER Sepsis Triage concept:name_0_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_0_IV Liquid concept:name_0_Leucocytes \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_1_CRP concept:name_1_ER Registration \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_1_ER Sepsis Triage concept:name_1_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 1 \n", "AAA 0 1 \n", "AB 0 1 \n", "ABA 0 1 \n", "\n", " ... concept:name_175_Leucocytes concept:name_176_CRP \\\n", "case:concept:name ... \n", "A ... 0 0 \n", "AA ... 0 0 \n", "AAA ... 0 0 \n", "AB ... 0 0 \n", "ABA ... 0 0 \n", "\n", " concept:name_177_CRP concept:name_178_Leucocytes \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_179_Leucocytes concept:name_180_CRP \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_181_Leucocytes concept:name_182_CRP \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_183_Leucocytes concept:name_184_Release C \n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", "[5 rows x 656 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# with max_events the maximum number of events in a trace in the log.\n", "encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], create_dummies=True)\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "code", 
"execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " concept:name_0_CRP concept:name_0_ER Registration \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 1 \n", "AAA 0 1 \n", "AB 0 1 \n", "ABA 0 1 \n", "\n", " concept:name_0_ER Sepsis Triage concept:name_0_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_0_IV Liquid concept:name_0_Leucocytes \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_1_CRP concept:name_1_ER Registration \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_1_ER Sepsis Triage concept:name_1_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 1 \n", "AAA 0 1 \n", "AB 0 1 \n", "ABA 0 1 \n", "\n", " concept:name_1_IV Antibiotics concept:name_1_IV Liquid \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_1_LacticAcid concept:name_1_Leucocytes \n", "case:concept:name \n", "A 0 1 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# with max_events equal to 2.\n", "encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], max_events=2, create_dummies=True)\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Complex-Index Encoding\n", "\n", "The __complex-index encoding__ also takes into account the payload columns passed in the `cat_cols` or `num_cols` parameters." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " CRP_0 CRP_1 CRP_2 CRP_3 CRP_4 CRP_5 CRP_6 CRP_7 \\\n", "case:concept:name \n", "A 0.0 0.0 21.0 0.0 0.0 0.0 0.0 0.0 \n", "AA 0.0 0.0 0.0 0.0 0.0 23.0 0.0 0.0 \n", "AAA 0.0 0.0 0.0 0.0 0.0 68.0 0.0 0.0 \n", "AB 0.0 0.0 0.0 48.0 0.0 0.0 0.0 0.0 \n", "ABA 0.0 0.0 0.0 0.0 0.0 0.0 78.0 0.0 \n", "\n", " CRP_8 CRP_9 ... org:group_175_B org:group_176_B \\\n", "case:concept:name ... \n", "A 0.0 109.0 ... 0 0 \n", "AA 0.0 0.0 ... 0 0 \n", "AAA 0.0 0.0 ... 0 0 \n", "AB 0.0 0.0 ... 0 0 \n", "ABA 0.0 0.0 ... 0 0 \n", "\n", " org:group_177_B org:group_178_B org:group_179_B \\\n", "case:concept:name \n", "A 0 0 0 \n", "AA 0 0 0 \n", "AAA 0 0 0 \n", "AB 0 0 0 \n", "ABA 0 0 0 \n", "\n", " org:group_180_B org:group_181_B org:group_182_B \\\n", "case:concept:name \n", "A 0 0 0 \n", "AA 0 0 0 \n", "AAA 0 0 0 \n", "AB 0 0 0 \n", "ABA 0 0 0 \n", "\n", " org:group_183_B org:group_184_E \n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", "[5 rows x 1400 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = IndexBased(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'], create_dummies=True)\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Static Encodings\n", "\n", "In a static encoding, only an available snapshot of the data is used. Therefore, the size of the feature vector is proportional to the number of event attributes and is fixed throughout the execution of a case.\n", "\n", "Using the last state abstraction, only one value (e.g., the last snapshot) of each data attribute is available. 
Here, the numeric attributes are added to the feature vector \"as is\", while one-hot encoding is applied to each categorical attribute.\n", "\n", "#### The First-State Encoding\n", "In the __first-state encoding__, only the information (control flow and payload) of the first event is retained. This is achieved with the `Declare4Py.Encodings.Static.Static` class by setting the categorical and numeric attributes." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " CRP concept:name_CRP concept:name_ER Registration \\\n", "case:concept:name \n", "A 21.0 0 1 \n", "AA 23.0 0 1 \n", "AAA 68.0 0 1 \n", "AB 48.0 0 1 \n", "ABA 78.0 0 1 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_IV Liquid concept:name_Leucocytes \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " org:group_A org:group_B org:group_C org:group_L \n", "case:concept:name \n", "A 1 0 0 0 \n", "AA 1 0 0 0 \n", "AAA 1 0 0 0 \n", "AB 1 0 0 0 \n", "ABA 1 0 0 0 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = Static(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Second-to-Last-State Encoding\n", "\n", "In the __second-to-last-state encoding__ only the information (control flow and payload) of the second-to-last event is retained. This is achieved with the `Declare4Py.Encodings.PreviousState.PreviousState` class by setting the categorical and numeric attributes." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CRPconcept:name_Admission NCconcept:name_CRPconcept:name_ER Sepsis Triageconcept:name_ER Triageconcept:name_IV Antibioticsconcept:name_IV Liquidconcept:name_LacticAcidconcept:name_Leucocytesconcept:name_Release A...org:group_Morg:group_Norg:group_Oorg:group_Porg:group_Qorg:group_Rorg:group_Sorg:group_Torg:group_Uorg:group_V
case:concept:name
A0.0000000010...0000000000
AA0.0000001000...0000000000
AAA0.0000000001...0000000000
AB0.0000001000...0000000000
ABA0.0000000010...0000000000
\n", "

5 rows × 36 columns

\n", "
" ], "text/plain": [ " CRP concept:name_Admission NC concept:name_CRP \\\n", "case:concept:name \n", "A 0.0 0 0 \n", "AA 0.0 0 0 \n", "AAA 0.0 0 0 \n", "AB 0.0 0 0 \n", "ABA 0.0 0 0 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_IV Antibiotics concept:name_IV Liquid \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 1 \n", "AAA 0 0 \n", "AB 0 1 \n", "ABA 0 0 \n", "\n", " concept:name_LacticAcid concept:name_Leucocytes \\\n", "case:concept:name \n", "A 0 1 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 1 \n", "\n", " concept:name_Release A ... org:group_M org:group_N \\\n", "case:concept:name ... \n", "A 0 ... 0 0 \n", "AA 0 ... 0 0 \n", "AAA 1 ... 0 0 \n", "AB 0 ... 0 0 \n", "ABA 0 ... 0 0 \n", "\n", " org:group_O org:group_P org:group_Q org:group_R \\\n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", " org:group_S org:group_T org:group_U org:group_V \n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", "[5 rows x 36 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = PreviousState(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Last-State Encoding\n", "\n", "In the __last-state encoding__ only the information (control flow and payload) of the last event is retained. This is achieved with the `Declare4Py.Encodings.LastState.LastState` class by setting the categorical and numeric attributes." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CRPconcept:name_Admission NCconcept:name_CRPconcept:name_ER Sepsis Triageconcept:name_ER Triageconcept:name_IV Antibioticsconcept:name_IV Liquidconcept:name_LacticAcidconcept:name_Leucocytesconcept:name_Release A...org:group_Borg:group_Corg:group_Dorg:group_Eorg:group_Forg:group_Gorg:group_Iorg:group_Lorg:group_Rorg:group_V
case:concept:name
A6.0000000001...0001000000
AA23.0000010000...0000000000
AAA68.0000000000...0000000000
AB48.0000010000...0000000000
ABA140.0000000001...0001000000
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " CRP concept:name_Admission NC concept:name_CRP \\\n", "case:concept:name \n", "A 6.0 0 0 \n", "AA 23.0 0 0 \n", "AAA 68.0 0 0 \n", "AB 48.0 0 0 \n", "ABA 140.0 0 0 \n", "\n", " concept:name_ER Sepsis Triage concept:name_ER Triage \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_IV Antibiotics concept:name_IV Liquid \\\n", "case:concept:name \n", "A 0 0 \n", "AA 1 0 \n", "AAA 0 0 \n", "AB 1 0 \n", "ABA 0 0 \n", "\n", " concept:name_LacticAcid concept:name_Leucocytes \\\n", "case:concept:name \n", "A 0 0 \n", "AA 0 0 \n", "AAA 0 0 \n", "AB 0 0 \n", "ABA 0 0 \n", "\n", " concept:name_Release A ... org:group_B org:group_C \\\n", "case:concept:name ... \n", "A 1 ... 0 0 \n", "AA 0 ... 0 0 \n", "AAA 0 ... 0 0 \n", "AB 0 ... 0 0 \n", "ABA 1 ... 0 0 \n", "\n", " org:group_D org:group_E org:group_F org:group_G \\\n", "case:concept:name \n", "A 0 1 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 1 0 0 \n", "\n", " org:group_I org:group_L org:group_R org:group_V \n", "case:concept:name \n", "A 0 0 0 0 \n", "AA 0 0 0 0 \n", "AAA 0 0 0 0 \n", "AB 0 0 0 0 \n", "ABA 0 0 0 0 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = LastState(case_id_col=case_id_key, cat_cols = ['concept:name', 'org:group'], num_cols=['CRP'])\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Ngram encoding" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Leucocytes|CRPIV Liquid|IV AntibioticsRelease A|LeucocytesER Sepsis Triage|IV LiquidAdmission NC|Release CLacticAcid|ER Sepsis TriageLacticAcid|IV AntibioticsCRP|Release BER Triage|IV LiquidAdmission IC|Admission IC...Release E|Return ERIV Antibiotics|LeucocytesER Sepsis Triage|LeucocytesER Sepsis Triage|IV AntibioticsAdmission NC|IV AntibioticsIV Liquid|LeucocytesER Triage|CRPIV Antibiotics|LacticAcidLacticAcid|Admission ICRelease C|Return ER
case:concept:name
A6.7158400.70000.00.70000.00.490.24010.00.490000.0...0.00.7790390.3817290.490000.00.5453270.2852690.00.00.0
B0.2857190.70000.00.70000.00.490.24010.00.168070.0...0.00.0000000.0000000.490000.00.0000000.7980020.00.00.0
C2.5657080.70000.00.34300.00.000.00000.00.240100.0...0.00.5110700.8227080.240100.00.3577490.4031270.00.00.0
D0.8680700.70000.00.24010.00.000.34300.00.168070.0...0.00.4900000.4253540.168070.00.3430000.5303540.00.00.0
E0.0000000.24010.00.70000.00.000.70000.00.490000.0...0.00.0000000.3430000.168070.00.4900000.3430000.00.00.0
\n", "

5 rows × 115 columns

\n", "
" ], "text/plain": [ " Leucocytes|CRP IV Liquid|IV Antibiotics \\\n", "case:concept:name \n", "A 6.715840 0.7000 \n", "B 0.285719 0.7000 \n", "C 2.565708 0.7000 \n", "D 0.868070 0.7000 \n", "E 0.000000 0.2401 \n", "\n", " Release A|Leucocytes ER Sepsis Triage|IV Liquid \\\n", "case:concept:name \n", "A 0.0 0.7000 \n", "B 0.0 0.7000 \n", "C 0.0 0.3430 \n", "D 0.0 0.2401 \n", "E 0.0 0.7000 \n", "\n", " Admission NC|Release C LacticAcid|ER Sepsis Triage \\\n", "case:concept:name \n", "A 0.0 0.49 \n", "B 0.0 0.49 \n", "C 0.0 0.00 \n", "D 0.0 0.00 \n", "E 0.0 0.00 \n", "\n", " LacticAcid|IV Antibiotics CRP|Release B \\\n", "case:concept:name \n", "A 0.2401 0.0 \n", "B 0.2401 0.0 \n", "C 0.0000 0.0 \n", "D 0.3430 0.0 \n", "E 0.7000 0.0 \n", "\n", " ER Triage|IV Liquid Admission IC|Admission IC ... \\\n", "case:concept:name ... \n", "A 0.49000 0.0 ... \n", "B 0.16807 0.0 ... \n", "C 0.24010 0.0 ... \n", "D 0.16807 0.0 ... \n", "E 0.49000 0.0 ... \n", "\n", " Release E|Return ER IV Antibiotics|Leucocytes \\\n", "case:concept:name \n", "A 0.0 0.779039 \n", "B 0.0 0.000000 \n", "C 0.0 0.511070 \n", "D 0.0 0.490000 \n", "E 0.0 0.000000 \n", "\n", " ER Sepsis Triage|Leucocytes \\\n", "case:concept:name \n", "A 0.381729 \n", "B 0.000000 \n", "C 0.822708 \n", "D 0.425354 \n", "E 0.343000 \n", "\n", " ER Sepsis Triage|IV Antibiotics \\\n", "case:concept:name \n", "A 0.49000 \n", "B 0.49000 \n", "C 0.24010 \n", "D 0.16807 \n", "E 0.16807 \n", "\n", " Admission NC|IV Antibiotics IV Liquid|Leucocytes \\\n", "case:concept:name \n", "A 0.0 0.545327 \n", "B 0.0 0.000000 \n", "C 0.0 0.357749 \n", "D 0.0 0.343000 \n", "E 0.0 0.490000 \n", "\n", " ER Triage|CRP IV Antibiotics|LacticAcid \\\n", "case:concept:name \n", "A 0.285269 0.0 \n", "B 0.798002 0.0 \n", "C 0.403127 0.0 \n", "D 0.530354 0.0 \n", "E 0.343000 0.0 \n", "\n", " LacticAcid|Admission IC Release C|Return ER \n", "case:concept:name \n", "A 0.0 0.0 \n", "B 0.0 0.0 \n", "C 0.0 0.0 \n", "D 0.0 0.0 \n", "E 0.0 0.0 \n", "\n", "[5 
rows x 115 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = Ngram(case_id_col=case_id_key, n=2, v=0.7, act_col='concept:name')\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Declare encoding" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'DeclareTransformer' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn [23], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m encoder \u001b[38;5;241m=\u001b[39m \u001b[43mDeclareTransformer\u001b[49m(case_id_col\u001b[38;5;241m=\u001b[39mcase_id_key, n\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m3\u001b[39m , v\u001b[38;5;241m=\u001b[39m \u001b[38;5;241m0.7\u001b[39m, act_col\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mconcept:name\u001b[39m\u001b[38;5;124m'\u001b[39m)\n\u001b[1;32m 2\u001b[0m enc_df \u001b[38;5;241m=\u001b[39m encoder\u001b[38;5;241m.\u001b[39mfit_transform(df)\n\u001b[1;32m 3\u001b[0m enc_df\u001b[38;5;241m.\u001b[39mhead()\n", "\u001b[0;31mNameError\u001b[0m: name 'DeclareTransformer' is not defined" ] } ], "source": [ "encoder = DeclareTransformer(case_id_col=case_id_key, n=3 , v= 0.7, act_col='concept:name')\n", "enc_df = encoder.fit_transform(df)\n", "enc_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding combinations\n", "\n", "### The Index-Latest-Payload Encoding\n", "\n", "The __index-latest-payload encoding__ adds the latest (last-state) encoding to the simple-index encoding. In other words, it is the combination of an index-based encoding with a static one (the last state)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Last-state (latest payload) encoding of the resource attribute\n", "last_state_encoder = LastState(case_id_col=case_id_key, cat_cols=['org:group'], num_cols=[])\n", "latest_df = last_state_encoder.fit_transform(df)\n", "\n", "# Simple-index encoding of the control flow (activity names only)\n", "simple_index_encoder = IndexBased(case_id_col=case_id_key, cat_cols=['concept:name'], num_cols=[], create_dummies=True)\n", "simple_df = simple_index_encoder.fit_transform(df)\n", "\n", "# Both dataframes are indexed by case id, so they can be concatenated column-wise\n", "index_latest_payload_df = pd.concat([latest_df, simple_df], axis=1)\n", "index_latest_payload_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Machine Learning pipeline\n", "\n", "An example pipeline for variant discovery based on the control flow.\n", "\n", "### TODO: put trace id and label in a dataframe\n", "### TODO: cluster the variants\n", "### TODO: show that two traces with the same label have similar variants, and that two classes with different labels have different variants" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "2\n", "0\n", "2\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "0\n", "0\n", "1\n", "1\n", "2\n", "1\n", "1\n", "2\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "2\n", "1\n", "1\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", 
"0\n", "0\n", "1\n", "1\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "2\n", "2\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "1\n", "0\n", "1\n", "2\n", "0\n", "2\n", "2\n", "1\n", "0\n", "2\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "1\n", "0\n", "2\n", "0\n", "1\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "2\n", "1\n", "1\n", "2\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "2\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "1\n", "0\n", "2\n", "0\n", "2\n", "1\n", "1\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "2\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "0\n", "0\n", "2\n", "2\n", "0\n", "1\n", "1\n", "2\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "2\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "2\n", "2\n", "2\n", "0\n", "1\n", "1\n", "1\n", "0\n", "0\n", "2\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "1\n", "0\n", "1\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "2\n", "0\n", 
"0\n", "0\n", "1\n", "2\n", "1\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "2\n", "0\n", "1\n", "0\n", "0\n", "1\n", "2\n", "2\n", "0\n", "2\n", "0\n", "2\n", "0\n", "2\n", "2\n", "1\n", "0\n", "1\n", "0\n", "1\n", "2\n", "2\n", "0\n", "1\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "2\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "2\n", "2\n", "2\n", "0\n", "2\n", "0\n", "0\n", "2\n", "2\n", "1\n", "0\n", "2\n", "2\n", "1\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "1\n", "0\n", "0\n", "1\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "1\n", "2\n", "1\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "1\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "0\n", "2\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "1\n", "2\n", "2\n", "1\n", "2\n", "1\n", "2\n", "1\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "1\n", "1\n", "2\n", "0\n", "2\n", "1\n", "1\n", "2\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "1\n", "0\n", 
"2\n", "0\n", "1\n", "0\n", "1\n", "1\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "2\n", "1\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "2\n", "0\n", "1\n", "0\n", "2\n", "1\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "2\n", "0\n", "2\n", "0\n", "2\n", "1\n", "0\n", "1\n", "1\n", "2\n", "1\n", "0\n", "0\n", "0\n", "2\n", "2\n", "2\n", "1\n", "2\n", "2\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "2\n", "2\n", "0\n", "1\n", "1\n", "2\n", "1\n", "2\n", "2\n", "1\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "0\n", "2\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "0\n", "1\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "1\n", "0\n", "0\n", "0\n", "1\n", "2\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "1\n", "0\n", "2\n", "2\n", "0\n", "1\n", "1\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "0\n", "2\n", "0\n", "1\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "1\n", "1\n", "0\n", "0\n", "0\n", "1\n", "0\n", "2\n", "0\n", "2\n", "0\n", "1\n", "0\n", "0\n", "2\n", "2\n", "2\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "1\n", "1\n", "2\n", "0\n", "2\n", "1\n", "1\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "2\n", 
"1\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "2\n", "0\n", "1\n", "0\n", "2\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "0\n", "1\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "2\n", "0\n", "0\n", "1\n", "2\n", "0\n", "2\n", "2\n", "0\n", "2\n", "0\n", "2\n", "2\n", "1\n", "0\n", "0\n", "0\n", "2\n", "2\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "1\n", "0\n", "0\n", "0\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "1\n", "1\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "0\n", "0\n", "1\n", "2\n", "0\n", "0\n", "0\n", "2\n", "2\n", "1\n", "0\n", "2\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "1\n", "0\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n", "0\n", "0\n", "0\n", "2\n", "0\n" ] } ], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.cluster import KMeans\n", "\n", "variants_discovery = Pipeline([('vect', Aggregate(case_id_col=case_id_key, cat_cols=['concept:name'], num_cols=[], boolean=True)),\n", " ('kmeans', KMeans(n_clusters=3, random_state=0))])\n", "variants_discovery.fit_transform(df)\n", "\n", "for label in discover_variants['kmeans'].labels_:\n", " print(label)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "vscode": { "interpreter": { "hash": "9b13726099ff4a9270d97cd5a303046c40236cea9d4b3d3acf7f22861afad882" } } }, "nbformat": 4, "nbformat_minor": 2 }