The First Order Markov Model (FOMM) is the simplest Process Discovery algorithm implemented in pMineR. It produces a graph in which, for each event, an edge is drawn to every event that directly follows it in the event log. Let’s look at it in more depth with a practical example.
library(pMineR,quietly = TRUE)
library(DiagrammeR,quietly = TRUE)
library(kableExtra,quietly = TRUE)
pMineR has a class devoted to the creation of synthetic data. Here we create a dummy dataset that simulates the possible diagnostic-treatment paths of 500 patients. Through the method syntheticDataCreator::cohort.RT() we can easily obtain a dataLoader object:
objSDC <- syntheticDataCreator()
objDL <- objSDC$cohort.RT(numOfPat = 500,giveBack = "dataLoader")
Training is quite easy: we have to instantiate a firstOrderMarkovModel object, load the dataset and train the model:
FOMM <- firstOrderMarkovModel(verbose.mode = FALSE)
FOMM$loadDataset( dataList = objDL$getData() )
FOMM$trainModel()
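To make the idea of "directly subsequent" events concrete, here is a minimal base R sketch of how a first-order transition matrix can be derived from an event log. It is an illustration of the concept only, not pMineR's internal implementation; the toy event log and all variable names are hypothetical.

```r
# Toy event log (hypothetical data): one row per event,
# already sorted by date within each patient
event_log <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2),
  Event = c("Imaging", "Biopsy", "death",
            "Imaging", "Imaging", "Biopsy", "death"),
  stringsAsFactors = FALSE
)

# Wrap every trace with artificial BEGIN and END markers
traces <- lapply(split(event_log$Event, event_log$ID),
                 function(ev) c("BEGIN", ev, "END"))

# Collect all (from, to) pairs of directly subsequent events
from <- unlist(lapply(traces, function(tr) head(tr, -1)))
to   <- unlist(lapply(traces, function(tr) tail(tr, -1)))

# Count the transitions between every pair of states
states <- c("BEGIN", "Imaging", "Biopsy", "death", "END")
counts <- table(factor(from, levels = states),
                factor(to,   levels = states))

# Row-normalise the counts into transition probabilities;
# rows without outgoing transitions (END) stay at zero
row_tot    <- rowSums(counts)
trans_prob <- counts / ifelse(row_tot == 0, 1, row_tot)
print(round(trans_prob, 2))
```

In this toy log, Imaging is followed twice by Biopsy and once by Imaging, so the Imaging row holds 2/3 and 1/3. Every non-zero cell corresponds to one edge of the graph.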
We can then easily plot the graph, passing through the DiagrammeR library:
grViz(FOMM$plot(giveItBack = TRUE))
At a glance, the graph flows from the first events in the traces (the diagnostic ones, such as Imaging, Medical Visit and Biopsy) to the therapies and, finally, to death.
Note: to reduce the spaghetti effect, we can also specify a minimum threshold so that infrequent transitions are not plotted. However, this must be used with caution, as it can create disconnected nodes.
FOMM.v2 <- firstOrderMarkovModel(parameters.list = list("threshold"=0.1))
FOMM.v2$loadDataset( dataList = objDL$getData() )
FOMM.v2$trainModel()
grViz(FOMM.v2$plot(giveItBack = TRUE))
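The risk of disconnected nodes can be seen on a small example. The sketch below uses a hypothetical three-state transition matrix (not taken from the cohort above) and simply hides every edge below the threshold; how pMineR applies the threshold internally is not shown here, so treat this as an assumption about the general idea.

```r
# Hypothetical transition probabilities (rows = from, columns = to)
states <- c("A", "B", "lost")
P <- matrix(c(0.00, 0.92, 0.08,
              0.00, 0.00, 0.00,
              0.00, 0.00, 0.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(states, states))

threshold <- 0.1

# Hide the edges whose probability is below the threshold
P_plot <- P
P_plot[P_plot < threshold] <- 0

# A node with no remaining incoming or outgoing edge is disconnected
isolated <- states[rowSums(P_plot) == 0 & colSums(P_plot) == 0]
print(isolated)
```

Here the only edge reaching "lost" has probability 0.08, so with a threshold of 0.1 the node would still exist but would no longer be attached to the rest of the graph.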
A trained FOMM can be used to generate data with the firstOrderMarkovModel::play() method. As usual, we can generate a csv:
dummyEL.csv <- FOMM$play(numberOfPlays = 50,toReturn = "csv")
or a probably more practical dataLoader object:
dummyEL.DL <- FOMM$play(numberOfPlays = 50,toReturn = "dataLoader")
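Conceptually, generating a trace from a first-order Markov model is a random walk over the transition matrix, from BEGIN to END. The following base R sketch shows that idea on a hypothetical matrix; it is not the code behind play(), and the function name simulate_trace and its max_len guard are assumptions made for this illustration.

```r
set.seed(1)

# Hypothetical transition matrix (rows = from, columns = to)
states <- c("BEGIN", "Imaging", "Biopsy", "death", "END")
P <- matrix(0, nrow = 5, ncol = 5, dimnames = list(states, states))
P["BEGIN", "Imaging"]                <- 1
P["Imaging", c("Imaging", "Biopsy")] <- c(0.3, 0.7)
P["Biopsy", "death"]                 <- 1
P["death", "END"]                    <- 1

# Random walk from BEGIN until END is reached (or max_len events are drawn)
simulate_trace <- function(P, max_len = 100) {
  current <- "BEGIN"
  trace   <- character(0)
  while (current != "END" && length(trace) < max_len) {
    # The next state depends only on the current one (first-order property)
    current <- sample(colnames(P), size = 1, prob = P[current, ])
    if (current != "END") trace <- c(trace, current)
  }
  trace
}

synthetic <- replicate(5, simulate_trace(P), simplify = FALSE)
print(synthetic)
```

Each generated trace starts with one or more Imaging events, followed by Biopsy and death, because those are the only paths with non-zero probability in this matrix.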
With the firstOrderMarkovModel::replay() method we can let new traces flow through a pre-trained FOMM. In our case, for example:
res <- FOMM$replay(dataList = dummyEL.DL$getData())
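The structure of the object returned by replay() is not covered here. As a rough intuition only, replaying a trace on a first-order model can be thought of as checking each transition of the trace against the transition matrix. The sketch below is an assumption-laden illustration in base R: the matrix is hypothetical and trace_probability is a made-up helper, not a pMineR function.

```r
# Hypothetical transition matrix (rows = from, columns = to)
states <- c("BEGIN", "Imaging", "Biopsy", "death", "END")
P <- matrix(0, nrow = 5, ncol = 5, dimnames = list(states, states))
P["BEGIN", "Imaging"]                <- 1
P["Imaging", c("Imaging", "Biopsy")] <- c(0.3, 0.7)
P["Biopsy", "death"]                 <- 1
P["death", "END"]                    <- 1

# Probability of a whole trace = product of its transition probabilities;
# a trace using an unknown event or an unseen transition gets probability 0
trace_probability <- function(P, events) {
  path <- c("BEGIN", events, "END")
  if (!all(path %in% rownames(P))) return(0)
  steps <- cbind(head(path, -1), tail(path, -1))  # (from, to) pairs
  prod(P[steps])
}

p_ok  <- trace_probability(P, c("Imaging", "Biopsy", "death"))
p_bad <- trace_probability(P, c("Biopsy", "death"))
print(c(compatible = p_ok, incompatible = p_bad))
```

The first trace follows existing edges and gets probability 0.7; the second starts with a BEGIN to Biopsy transition that the model has never seen, so it cannot be replayed.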
The FOMM class also provides a method to compare two FOMM models. Let’s create, for example, a second event log with the previously created syntheticDataCreator object.
We build the csv and then load it into a dataLoader object:
objDL.v2.csv <- objSDC$cohort.RT(numOfPat = 500,giveBack = "csv")
objDL.v2 <- dataLoader(verbose.mode = FALSE)
objDL.v2$load.data.frame(mydata = objDL.v2.csv,IDName = "ID",EVENTName = "Event",dateColumnName = "Date",format.column.date = "%Y-%m-%d" )
Now, querying the new objDL.v2, we get the IDs of the long-surviving patients, i.e. those whose time from partial resection to death is above the 0.6 quantile:
QODObj <- QOD()
QODObj$loadDataset(dataList = objDL.v2$getData())
mtr.t <- QODObj$query(from = "partial resection",to = "death",returnCompleteMatrix = TRUE)
quant.thres <- as.numeric(quantile(as.numeric(mtr.t[,"time"]),probs = c(0.60)))
IDs <- mtr.t[which(as.numeric(mtr.t[,"time"] ) > quant.thres),1]
Now we alter the data, changing partial resection to total resection for the long survivors:
objDL.v2.csv[which(objDL.v2.csv[,"ID"] %in% IDs & objDL.v2.csv[,"Event"] == "partial resection"),"Event"] <- "total resection"
and we create a new dataLoader object with the modified dataset:
newDLobj <- dataLoader(verbose.mode = FALSE)
newDLobj$load.data.frame(mydata = objDL.v2.csv,IDName = "ID",EVENTName = "Event",dateColumnName = "Date",format.column.date = "%Y-%m-%d" )
Now we train a new FOMM.v2 model on it (overwriting the thresholded model created earlier):
FOMM.v2 <- firstOrderMarkovModel()
FOMM.v2$loadDataset(dataList = newDLobj$getData() )
FOMM.v2$trainModel()
To compare the two FOMMs we can use the method firstOrderMarkovModel::plot.delta.graph(). It shows a graph overlapping the two FOMMs, keeping the edge values of the FOMM that invokes the method but highlighting the edges where the difference, in terms of probability, is greater than 0.1:
grViz(FOMM$plot.delta.graph(objToCheck = FOMM.v2,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.1))
The reverse comparison is, of course:
grViz(FOMM.v2$plot.delta.graph(objToCheck = FOMM,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.1))
As expected, the events partial resection and total resection are involved.
Intuitively, lowering the threshold.4.overlapped value highlights many more edges:
grViz(FOMM$plot.delta.graph(objToCheck = FOMM.v2,giveBackGrViz = TRUE,type.of.graph = "overlapped",threshold.4.overlapped = 0.05))
A similar comparison can also be made by inspecting the MMatrix.perc matrices in the dataLoader objects used to train the two FOMMs:
objDL$getData()$MMatrix.perc %>%
kbl() %>%
kable_minimal()
from \ to | BEGIN | END | Imaging | Medical Visit | Biopsy | partial resection | radiotherapy | death | total resection | chemotherapy |
---|---|---|---|---|---|---|---|---|---|---|
BEGIN | 0 | 0.0000000 | 0.5040000 | 0.4960000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
END | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
Imaging | 0 | 0.0000000 | 0.3415205 | 0.3847953 | 0.2350877 | 0.0000000 | 0.0000000 | 0.0385965 | 0.0000000 | 0.0000000 |
Medical Visit | 0 | 0.0334545 | 0.2261818 | 0.4203636 | 0.1745455 | 0.0000000 | 0.0000000 | 0.1454545 | 0.0000000 | 0.0000000 |
Biopsy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4943311 | 0.0000000 | 0.1088435 | 0.3968254 | 0.0000000 |
partial resection | 0 | 0.0000000 | 0.0000000 | 0.3027523 | 0.0000000 | 0.0000000 | 0.0917431 | 0.3119266 | 0.0000000 | 0.2935780 |
radiotherapy | 0 | 0.0000000 | 0.0000000 | 0.0922587 | 0.0000000 | 0.0000000 | 0.8960764 | 0.0116649 | 0.0000000 | 0.0000000 |
death | 0 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
total resection | 0 | 0.0000000 | 0.0000000 | 0.2628571 | 0.0000000 | 0.0000000 | 0.0571429 | 0.4628571 | 0.0000000 | 0.2171429 |
chemotherapy | 0 | 0.0000000 | 0.0000000 | 0.0502392 | 0.0000000 | 0.0000000 | 0.1626794 | 0.0311005 | 0.0000000 | 0.7559809 |
versus
newDLobj$getData()$MMatrix.perc %>%
kbl() %>%
kable_minimal()
from \ to | BEGIN | END | Imaging | Biopsy | partial resection | death | Medical Visit | total resection | chemotherapy | radiotherapy |
---|---|---|---|---|---|---|---|---|---|---|
BEGIN | 0 | 0.0000000 | 0.4760000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.5240000 | 0.0000000 | 0.0000000 | 0.0000000 |
END | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
Imaging | 0 | 0.0000000 | 0.3572254 | 0.2670520 | 0.0000000 | 0.0335260 | 0.3421965 | 0.0000000 | 0.0000000 | 0.0000000 |
Biopsy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.3407572 | 0.1314031 | 0.0000000 | 0.5278396 | 0.0000000 | 0.0000000 |
partial resection | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.4771242 | 0.1764706 | 0.0000000 | 0.2679739 | 0.0784314 |
death | 0 | 1.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
Medical Visit | 0 | 0.0425056 | 0.2371365 | 0.1625652 | 0.0000000 | 0.1469053 | 0.4108874 | 0.0000000 | 0.0000000 | 0.0000000 |
total resection | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.2531646 | 0.3459916 | 0.0000000 | 0.2953586 | 0.1054852 |
chemotherapy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0360169 | 0.0699153 | 0.0000000 | 0.7648305 | 0.1292373 |
radiotherapy | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0092915 | 0.1045296 | 0.0000000 | 0.0000000 | 0.8861789 |
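Since the two matrices do not list the states in the same order, a cell-by-cell comparison has to match them by name rather than by position. The sketch below does this for the non-zero cells of the Biopsy rows copied from the two tables above; the variable names are hypothetical, and the 0.1 cut-off simply mirrors the threshold used for the delta graph.

```r
# Outgoing probabilities from Biopsy in the first matrix (objDL)
biopsy_1 <- c("partial resection" = 0.4943311,
              "death"             = 0.1088435,
              "total resection"   = 0.3968254)

# Same row in the second matrix (newDLobj), deliberately in another order
biopsy_2 <- c("death"             = 0.1314031,
              "total resection"   = 0.5278396,
              "partial resection" = 0.3407572)

# Align by name before subtracting, then flag the large differences
delta   <- abs(biopsy_1 - biopsy_2[names(biopsy_1)])
flagged <- names(delta)[delta > 0.1]

print(round(delta, 4))
print(flagged)
```

The flagged transitions are Biopsy to partial resection and Biopsy to total resection, consistent with the edges involving those two events in the delta graph.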
FOMM supports survival analysis through the firstOrderMarkovModel::KaplanMeier() method. With this method we can query traces by defining a starting point, an event (in the survival-analysis sense) and an array of censoring events (if any). In our case, we don’t have any censoring event (e.g. “lost”):
KM.1 <- FOMM$KaplanMeier(fromState = "Medical Visit",toState = "death",passingThrough = "partial resection" ,UM = "months")
KM.2 <- FOMM$KaplanMeier(fromState = "Medical Visit",toState = "death",passingNotThrough = "total resection", UM = "months")
comparison <- FOMM$LogRankTest(KM1 = KM.1 , KM2 = KM.2)
A rough Kaplan-Meier curve can be plotted with:
plot(comparison$survfit)
and the log-rank test :
comparison$survdiff
## Call:
## survdiff(formula = Surv(time, outcome) ~ KM, data = new.df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## KM=KM1 167 167 203 6.38 11.9
## KM=KM2 284 284 248 5.22 11.9
##
## Chisq= 11.9 on 1 degrees of freedom, p= 6e-04
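As a reminder of what a Kaplan-Meier curve encodes, the estimator can be computed by hand in a few lines of base R. This is a minimal sketch on hypothetical follow-up times, unrelated to the cohort above; km_estimate is a made-up helper, and one censored observation is included only to show how it affects the number at risk.

```r
# Kaplan-Meier estimate: at each event time, multiply the running survival
# by (1 - events / number at risk)
km_estimate <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  n_risk  <- sapply(event_times, function(t) sum(time >= t))
  n_event <- sapply(event_times, function(t) sum(time == t & event == 1))
  data.frame(time     = event_times,
             n.risk   = n_risk,
             n.event  = n_event,
             survival = cumprod(1 - n_event / n_risk))
}

# Hypothetical follow-up in months; event = 1 is a death, 0 is censored
km <- km_estimate(time  = c(2, 3, 3, 5, 8),
                  event = c(1, 1, 0, 1, 1))
print(km)
```

With these five patients the survival estimate steps down to 0.8, 0.6, 0.3 and 0 at months 2, 3, 5 and 8. The log-rank test then compares two such curves by contrasting observed and expected events in each group, as in the output above.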
The FOMM is well understood and appreciated by clinicians, thanks to its simplicity.
It is not very useful when events do not naturally unfold in succession over time: in that case the presence of loops can produce chaotic graphs, which are useless in most cases.