1 Introduction

The First Order Markov Model (FOMM) is the simplest Process Discovery algorithm implemented in pMineR. It produces a graph in which, for each event, an edge is drawn to every event that directly follows it in the event log. Let’s look at it in more depth with a practical example.

library(pMineR,quietly = TRUE)
library(DiagrammeR,quietly = TRUE)
library(kableExtra,quietly = TRUE)
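Before using pMineR, it may help to see the idea in plain R. The following is only a minimal sketch, not pMineR's implementation: the toy traces and the helper build_transition_matrix() are hypothetical, and we assume each trace is padded with BEGIN and END markers. It counts direct successions and normalises each row into transition probabilities.

```r
# Toy event log: one character vector per patient (hypothetical data)
traces <- list(
  p1 = c("Imaging", "Biopsy", "death"),
  p2 = c("Medical Visit", "Imaging", "Biopsy", "death"),
  p3 = c("Imaging", "Imaging", "Biopsy", "death")
)

# Hypothetical helper: count direct successions, then normalise each row
build_transition_matrix <- function(traces) {
  padded <- lapply(traces, function(tr) c("BEGIN", tr, "END"))
  states <- unique(unlist(padded))
  counts <- matrix(0, nrow = length(states), ncol = length(states),
                   dimnames = list(states, states))
  for (tr in padded) {
    for (i in seq_len(length(tr) - 1)) {
      counts[tr[i], tr[i + 1]] <- counts[tr[i], tr[i + 1]] + 1
    }
  }
  # Rows with no outgoing transition (END) are left as zeros
  row_tot <- rowSums(counts)
  has_out <- row_tot > 0
  probs <- counts
  probs[has_out, ] <- counts[has_out, , drop = FALSE] / row_tot[has_out]
  probs
}

P <- build_transition_matrix(traces)
print(round(P, 2))
```

Each non-zero cell of P corresponds to one edge of the graph, labelled with its probability.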

2 Loading the data

In pMineR there is a class devoted to the creation of synthetic data. In this case we create a dummy dataset that simulates a set of possible diagnostic-treatment paths for 500 patients. Through the method syntheticDataCreator::cohort.RT() we can easily obtain a dataLoader object:

objSDC <- syntheticDataCreator()
objDL <- objSDC$cohort.RT(numOfPat = 500,giveBack = "dataLoader")

3 Training a FOMM

Quite easy: we have to create the model object, load the dataset and train the model:

FOMM <- firstOrderMarkovModel(verbose.mode = FALSE)
FOMM$loadDataset( dataList = objDL$getData() )
FOMM$trainModel()

We can then easily plot the graph through the DiagrammeR library:

grViz(FOMM$plot(giveItBack = TRUE))

At a glance, the graph flows from the first events in the traces (the diagnostic ones, such as Imaging, Medical Visit and Biopsy) to the therapies and, finally, to death.

Note: to reduce the spaghetti effect, we can also specify a minimum threshold so that infrequent transitions are not plotted. However, this must be used with caution, as it can create disconnected nodes.

FOMM.v2 <- firstOrderMarkovModel(parameters.list = list("threshold"=0.1))
FOMM.v2$loadDataset( dataList = objDL$getData() )
FOMM.v2$trainModel()
grViz(FOMM.v2$plot(giveItBack = TRUE))
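To see why a threshold can leave nodes disconnected, here is a small sketch in plain R. The matrix values are made up for illustration, and we assume that the threshold simply hides every edge whose probability is below it; how pMineR applies the threshold internally is not shown here.

```r
# Hypothetical transition probabilities between four events
states <- c("Imaging", "Biopsy", "chemotherapy", "death")
P <- matrix(0, nrow = 4, ncol = 4, dimnames = list(states, states))
P["Imaging", c("Biopsy", "chemotherapy")] <- c(0.92, 0.08)
P["Biopsy", "death"] <- 1
P["chemotherapy", c("chemotherapy", "death")] <- c(0.95, 0.05)

# Assumption: edges below the threshold are not drawn
threshold <- 0.1
visible <- P
visible[visible < threshold] <- 0

# A node is isolated when no visible edge links it to a different node
off_diag <- visible
diag(off_diag) <- 0
isolated <- rownames(P)[rowSums(off_diag) == 0 & colSums(off_diag) == 0]
print(isolated)
```

In this toy case chemotherapy keeps only its self-loop: both the edge entering it (0.08) and the edge leaving it (0.05) fall under the threshold.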

4 play

A trained FOMM can be used to generate data with the firstOrderMarkovModel::play() method. As usual, we can generate a csv:

dummyEL.csv <- FOMM$play(numberOfPlays = 50,toReturn = "csv")

or, probably more practical, a dataLoader object:

dummyEL.DL <- FOMM$play(numberOfPlays = 50,toReturn = "dataLoader")
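Conceptually, generating a trace from a first-order Markov model is a random walk on the transition matrix. The sketch below shows the idea in plain R with a made-up matrix; simulate_trace() is a hypothetical helper, not the code behind play(), and the max_len guard is our own assumption to avoid endless loops.

```r
set.seed(42)

# Hypothetical transition matrix with BEGIN and END markers
states <- c("BEGIN", "Imaging", "Biopsy", "death", "END")
P <- matrix(0, nrow = 5, ncol = 5, dimnames = list(states, states))
P["BEGIN", "Imaging"] <- 1
P["Imaging", c("Imaging", "Biopsy")] <- c(0.3, 0.7)
P["Biopsy", "death"] <- 1
P["death", "END"] <- 1

# Walk from BEGIN, drawing each next event from the current row of P
simulate_trace <- function(P, max_len = 100) {
  current <- "BEGIN"
  trace <- character(0)
  for (step in seq_len(max_len)) {
    current <- sample(colnames(P), size = 1, prob = P[current, ])
    if (current == "END") break
    trace <- c(trace, current)
  }
  trace
}

sim <- replicate(50, simulate_trace(P), simplify = FALSE)
print(sim[[1]])
```

Since the next event depends only on the current one, the generated traces reproduce the direct-succession frequencies of the model but nothing about longer-range dependencies.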

5 replay

With the firstOrderMarkovModel::replay() method we can run new traces through a pre-trained FOMM. In our case, for example:

res <- FOMM$replay(dataList = dummyEL.DL$getData())
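As a rough illustration of what replaying means, the sketch below checks in plain R whether a trace can be walked on a transition matrix, i.e. whether every direct succession has a non-zero probability. This is only an assumption about the idea: replay_trace() and its return fields are hypothetical and do not describe the structure actually returned by replay().

```r
# Hypothetical transition matrix with BEGIN and END markers
states <- c("BEGIN", "Imaging", "Biopsy", "death", "END")
P <- matrix(0, nrow = 5, ncol = 5, dimnames = list(states, states))
P["BEGIN", "Imaging"] <- 1
P["Imaging", c("Imaging", "Biopsy")] <- c(0.3, 0.7)
P["Biopsy", "death"] <- 1
P["death", "END"] <- 1

# Hypothetical helper: follow the trace edge by edge, stop at the first
# transition that the model has never seen
replay_trace <- function(P, trace) {
  path <- c("BEGIN", trace, "END")
  for (i in seq_len(length(path) - 1)) {
    from <- path[i]
    to <- path[i + 1]
    known <- from %in% rownames(P) && to %in% colnames(P)
    if (!known || P[from, to] == 0) {
      return(list(fits = FALSE, stopped_at = i))
    }
  }
  list(fits = TRUE, stopped_at = NA_integer_)
}

ok  <- replay_trace(P, c("Imaging", "Biopsy", "death"))
bad <- replay_trace(P, c("Biopsy", "death"))
print(c(ok$fits, bad$fits))
```

The second trace fails at its first transition, because the model has no edge from BEGIN to Biopsy.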

6 Comparing two FOMMs

FOMM also provides a method to compare two FOMMs. Let’s create, for example, a second Event Log with the previously created syntheticDataCreator object.

We build the csv and then load it into a dataLoader object:

objDL.v2.csv <- objSDC$cohort.RT(numOfPat = 500,giveBack = "csv")

objDL.v2 <- dataLoader(verbose.mode = FALSE)
objDL.v2$load.data.frame(mydata = objDL.v2.csv,IDName = "ID",EVENTName = "Event",dateColumnName = "Date",format.column.date = "%Y-%m-%d"  )

Now, querying the new objDL.v2, we get the IDs of the long-surviving patients, i.e. those whose time from partial resection to death is higher than the 0.6 quantile:

QODObj <- QOD()
QODObj$loadDataset(dataList = objDL.v2$getData())
mtr.t <- QODObj$query(from = "partial resection",to = "death",returnCompleteMatrix = TRUE)
quant.thres <- as.numeric(quantile(as.numeric(mtr.t[,"time"]),probs = c(0.60)))
IDs <- mtr.t[which(as.numeric(mtr.t[,"time"] ) > quant.thres),1]

Now we alter the data, forcing partial resection to total resection for the long survivors:

objDL.v2.csv[which(objDL.v2.csv[,"ID"] %in% IDs & objDL.v2.csv[,"Event"] == "partial resection"),"Event"] <- "total resection"

and we create a new dataLoader object with the modified dataset:

newDLobj <- dataLoader(verbose.mode = FALSE)
newDLobj$load.data.frame(mydata = objDL.v2.csv,IDName = "ID",EVENTName = "Event",dateColumnName = "Date",format.column.date = "%Y-%m-%d"  )

Now we train a new model, FOMM.v2:

FOMM.v2 <- firstOrderMarkovModel()
FOMM.v2$loadDataset(dataList = newDLobj$getData() )
FOMM.v2$trainModel()

To compare the two FOMMs we can use the method firstOrderMarkovModel::plot.delta.graph(). It shows a graph overlapping the two FOMMs, keeping the edge values of the FOMM that invokes the method but highlighting the edges where the difference, in terms of probability, is greater than 0.1:

grViz(FOMM$plot.delta.graph(objToCheck = FOMM.v2,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.1))

The converse comparison is, obviously:

grViz(FOMM.v2$plot.delta.graph(objToCheck = FOMM,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.1))

As expected, the events partial resection and total resection are involved.

Intuitively, reducing the threshold.4.overlapped value causes many more edges to be colored:

grViz(FOMM$plot.delta.graph(objToCheck = FOMM.v2,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.05))

A similar comparison can also be made by getting the MMatrix.perc from the dataLoader objects used to train the two FOMMs:

objDL$getData()$MMatrix.perc %>%
  kbl() %>%
  kable_minimal()
| | BEGIN | END | Imaging | Medical Visit | Biopsy | partial resection | radiotherapy | death | total resection | chemotherapy |
|---|---|---|---|---|---|---|---|---|---|---|
| BEGIN | 0 | 0.0000000 | 0.5040000 | 0.4960000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| END | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Imaging | 0 | 0.0000000 | 0.3415205 | 0.3847953 | 0.2350877 | 0.0000000 | 0.0000000 | 0.0385965 | 0.0000000 | 0.0000000 |
| Medical Visit | 0 | 0.0334545 | 0.2261818 | 0.4203636 | 0.1745455 | 0.0000000 | 0.0000000 | 0.1454545 | 0.0000000 | 0.0000000 |
| Biopsy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4943311 | 0.0000000 | 0.1088435 | 0.3968254 | 0.0000000 |
| partial resection | 0 | 0.0000000 | 0.0000000 | 0.3027523 | 0.0000000 | 0.0000000 | 0.0917431 | 0.3119266 | 0.0000000 | 0.2935780 |
| radiotherapy | 0 | 0.0000000 | 0.0000000 | 0.0922587 | 0.0000000 | 0.0000000 | 0.8960764 | 0.0116649 | 0.0000000 | 0.0000000 |
| death | 0 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| total resection | 0 | 0.0000000 | 0.0000000 | 0.2628571 | 0.0000000 | 0.0000000 | 0.0571429 | 0.4628571 | 0.0000000 | 0.2171429 |
| chemotherapy | 0 | 0.0000000 | 0.0000000 | 0.0502392 | 0.0000000 | 0.0000000 | 0.1626794 | 0.0311005 | 0.0000000 | 0.7559809 |

versus

newDLobj$getData()$MMatrix.perc %>%
  kbl() %>%
  kable_minimal()
| | BEGIN | END | Imaging | Biopsy | partial resection | death | Medical Visit | total resection | chemotherapy | radiotherapy |
|---|---|---|---|---|---|---|---|---|---|---|
| BEGIN | 0 | 0.0000000 | 0.4760000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.5240000 | 0.0000000 | 0.0000000 | 0.0000000 |
| END | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Imaging | 0 | 0.0000000 | 0.3572254 | 0.2670520 | 0.0000000 | 0.0335260 | 0.3421965 | 0.0000000 | 0.0000000 | 0.0000000 |
| Biopsy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.3407572 | 0.1314031 | 0.0000000 | 0.5278396 | 0.0000000 | 0.0000000 |
| partial resection | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4771242 | 0.1764706 | 0.0000000 | 0.2679739 | 0.0784314 |
| death | 0 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| Medical Visit | 0 | 0.0425056 | 0.2371365 | 0.1625652 | 0.0000000 | 0.1469053 | 0.4108874 | 0.0000000 | 0.0000000 | 0.0000000 |
| total resection | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.2531646 | 0.3459916 | 0.0000000 | 0.2953586 | 0.1054852 |
| chemotherapy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0360169 | 0.0699153 | 0.0000000 | 0.7648305 | 0.1292373 |
| radiotherapy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0092915 | 0.1045296 | 0.0000000 | 0.0000000 | 0.8861789 |
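Note that the two matrices list the events in a different order, so they cannot be subtracted cell by cell as they are. The sketch below, in plain R, copies the partial resection and total resection rows from the two tables above (only the four non-zero columns), aligns them by name and flags the differences above a threshold. The variable names are ours and the flagging rule is an assumption; this is not the code behind plot.delta.graph().

```r
# Values copied from the first table (objDL)
m1 <- matrix(c(0.3027523, 0.0917431, 0.3119266, 0.2935780,
               0.2628571, 0.0571429, 0.4628571, 0.2171429),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("partial resection", "total resection"),
                             c("Medical Visit", "radiotherapy", "death", "chemotherapy")))

# Values copied from the second table (newDLobj), different column order
m2 <- matrix(c(0.4771242, 0.1764706, 0.2679739, 0.0784314,
               0.2531646, 0.3459916, 0.2953586, 0.1054852),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("partial resection", "total resection"),
                             c("death", "Medical Visit", "chemotherapy", "radiotherapy")))

# Index by name so that rows and columns line up before subtracting
delta <- m2[rownames(m1), colnames(m1)] - m1

# Keep the transitions whose probability changed by more than 0.1
changed <- which(abs(delta) > 0.1, arr.ind = TRUE)
flagged <- data.frame(
  from  = rownames(delta)[changed[, "row"]],
  to    = colnames(delta)[changed[, "col"]],
  delta = delta[changed]
)
print(flagged)
```

Lowering the cut-off to 0.05 on the same delta matrix selects more transitions, which mirrors what happens in the graph when threshold.4.overlapped is reduced.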

7 Survival Analysis

FOMM supports survival analysis through the firstOrderMarkovModel::KaplanMeier() method. With this method we can query traces by defining a starting point, an event (in the sense of Survival Analysis) and an array of censoring events (if any). In our case, we don’t have any censoring event (e.g.: “lost”):

KM.1 <- FOMM$KaplanMeier(fromState = "Medical Visit",toState = "death",passingThrough = "partial resection" ,UM = "months")
KM.2 <- FOMM$KaplanMeier(fromState = "Medical Visit",toState = "death",passingNotThrough = "total resection", UM = "months")

comparison <- FOMM$LogRankTest(KM1 = KM.1 , KM2 = KM.2)

A rough Kaplan-Meier curve can be seen with:

plot(comparison$survfit)

and the log-rank test with:

comparison$survdiff
## Call:
## survdiff(formula = Surv(time, outcome) ~ KM, data = new.df)
## 
##          N Observed Expected (O-E)^2/E (O-E)^2/V
## KM=KM1 167      167      203      6.38      11.9
## KM=KM2 284      284      248      5.22      11.9
## 
##  Chisq= 11.9  on 1 degrees of freedom, p= 6e-04
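The Call line in the output shows that the test goes through survdiff() from the survival package. For readers who want to reproduce the same kind of analysis outside pMineR, here is a minimal sketch on made-up data: the data frame toy and its simulated times are hypothetical, with column names chosen to mirror the formula in the Call above, and every patient is assumed to reach the event (no censoring).

```r
library(survival)

set.seed(1)

# Hypothetical survival times (months) for two groups, no censoring
toy <- data.frame(
  time    = c(rexp(40, rate = 1 / 12), rexp(40, rate = 1 / 20)),
  outcome = 1,
  KM      = rep(c("KM1", "KM2"), each = 40)
)

# Kaplan-Meier curves, one per group
km_fit <- survfit(Surv(time, outcome) ~ KM, data = toy)

# Log-rank test comparing the two curves
lr_test <- survdiff(Surv(time, outcome) ~ KM, data = toy)

# p-value from the chi-square statistic (two groups: 1 degree of freedom)
p_value <- 1 - pchisq(lr_test$chisq, df = 1)

plot(km_fit, lty = 1:2, xlab = "months", ylab = "survival probability")
print(lr_test)
```

The p-value is derived from the Chisq statistic on 1 degree of freedom, as in the output reported above.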

8 PROS

Well understood and appreciated by clinicians, thanks to its simplicity.

9 CONS

It is not very useful when events do not naturally unfold in succession over time. In this case the presence of loops can create chaotic graphs, which are useless in most cases.