--- title: "SEM Forests" author: "Andreas Brandmaier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{SEM Forests} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(semtree) ``` This example demonstrates how SEM forests can be grown. SEM forests are ensembles of typically hundreds to thousands of SEM trees. Using permutation-based variable importance estimates, we can aggregate the importance of each predictor for improving model fit. Here, we use the `affect` dataset and a simple SEM with only a single observed variable and no latent variables. ## Load data Load affect dataset from the `psychTools` package. These are data from two studies conducted in the Personality, Motivation and Cognition Laboratory at Northwestern University to study affect dimensionality and the relationship to various personality dimensions. ```{r} library(psychTools) data(affect) knitr::kable(head(affect)) affect$Film <- as.factor(affect$Film) affect$lie <- as.ordered(affect$lie) affect$imp <- as.ordered(affect$imp) ``` ## Create simple model of state anxiety The following code implements a simple SEM with only a single manifest variables and two parameters, the mean of state anxiety after having watched a movie (`state2`), $\mu$, and the variance of state anxiety, $\sigma^2$. ```{r} library(OpenMx) manifests<-c("state2") latents<-c() model <- mxModel("Univariate Normal Model", type="RAM", manifestVars = manifests, latentVars = latents, mxPath(from="one",to=manifests, free=c(TRUE), value=c(50.0) , arrows=1, label=c("mu") ), mxPath(from=manifests,to=manifests, free=c(TRUE), value=c(100.0) , arrows=2, label=c("sigma2") ), mxData(affect, type = "raw") ); result <- mxRun(model) ``` These are the estimates of the model when run on the entire sample: ```{r} summary(result) ``` ## Forest Create a forest control object that stores all tuning parameters of the forest. Note that we use only 5 trees for illustration. Please increase the number in real applications to several hundreds. To speed up computation time, consider score-based test for variable selection in the trees. ```{r} control <- semforest_control(num.trees = 5) print(control) ``` Now, run the forest using the `control` object: ```{r message=FALSE, echo=TRUE, warning=FALSE, results="hide"} forest <- semforest( model=model, data = affect, control = control, covariates = c("Study","Film", "state1", "PA2","NA2","TA2")) ``` ## Variable importance Next, we compute permutation-based variable importance. This may take some time. ```{r} vim <- varimp(forest) print(vim, sort.values=TRUE) plot(vim) ``` From this, we can learn that variables such as `NA2` representing negative affect (after the movie), `TA2` representing tense arousal (after the movie), and `state1` representing the state anxiety before having watched the movie, are the best predictors of difference in the distribution of state anxiety (in either mean, variance or both) after having watched the movie.