Exploratory Factor Analysis (EFA) identifies a number of latent factors that explain correlations between observed variables. A key issue in the application of EFA is the selection of an adequate number of factors. This is a non-trivial problem because more factors always improve the fit of the model. Most methods for selecting the number of factors fall into two categories: either they analyze the patterns of eigenvalues of the correlation matrix, such as parallel analysis; or they frame the selection of the number of factors as a model selection problem and use approaches such as likelihood ratio tests or information criteria.

In a recent paper we proposed a new method based on model selection. We use the connection between model-implied correlation matrices and standardized regression coefficients to do model selection based on out-of-sample prediction errors, as is common in the field of machine learning. We show in a simulation study that our method slightly outperforms other standard methods on average and is relatively robust across specifications of the true model. An implementation is available in the R-package fspe, which I present here with a short code example.
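To make the idea concrete, here is a minimal base-R sketch of an out-of-sample prediction error for a factor model. It is not the fspe implementation. It assumes simulated toy data, a single train/test split instead of repeated cross-validation, and maximum-likelihood fitting with `factanal()`; all variable names are made up for illustration. For each candidate number of factors it builds the model-implied correlation matrix, converts it into standardized regression coefficients for each variable given all the others, and scores the predictions on held-out data.

```r
set.seed(1)

# Toy data: 8 variables generated from 2 latent factors (illustrative only).
n <- 400; p <- 8
L_true <- matrix(0, p, 2)
L_true[1:4, 1] <- 0.7
L_true[5:8, 2] <- 0.7
scores <- matrix(rnorm(n * 2), n, 2)
noise  <- matrix(rnorm(n * p, sd = sqrt(1 - 0.49)), n, p)
X <- scores %*% t(L_true) + noise

# Single train/test split; standardize the test set with training moments.
train <- X[1:300, ]
mu <- colMeans(train)
s  <- apply(train, 2, sd)
test <- scale(X[301:400, ], center = mu, scale = s)

pred_error <- function(k) {
  fit <- factanal(train, factors = k, control = list(nstart = 3))
  Lam <- unclass(fit$loadings)                       # p x k loading matrix
  R_hat <- Lam %*% t(Lam) + diag(fit$uniquenesses)   # model-implied correlations
  errs <- sapply(seq_len(p), function(j) {
    # Standardized regression coefficients of variable j on all other variables
    beta <- solve(R_hat[-j, -j], R_hat[-j, j])
    mean((test[, j] - test[, -j] %*% beta)^2)        # held-out squared error
  })
  mean(errs)
}

pe <- sapply(1:3, pred_error)
names(pe) <- 1:3
pe
which.min(pe)
```

Because the error is computed on data that were not used for fitting, adding factors does not automatically reduce it, which is what makes it usable as a selection criterion.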

We use a dataset with 24 measurements of cognitive tasks from 301 individuals from Holzinger and Swineford (1939). Harman (1967) presents both a four- and a five-factor solution for this dataset. In the four-factor solution, the fifth factor corresponding to variables 20–24 is eliminated. For this reason, we exclude variables 20–24, which gives us an example dataset in which we would theoretically expect four factors. This reduced dataset is included in the fspe package.

In addition to passing the data to the fspe() function, we specify that factor models with 1, 2, …, 10 factors should be considered (maxK = 10), that the cross-validation scheme should use 10 folds (nfold = 10) and be repeated 10 times (rep = 10), and that prediction errors (method = "PE") should be used. An alternative method (method = "CovE") computes an out-of-sample estimation error on the covariance matrix instead of a prediction error on the raw data. This method is similar to the one proposed by Browne & Cudeck (1989). Finally, we set a seed so that the analysis demonstrated here is fully reproducible.
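The call described above might look as follows. The argument names maxK, nfold, rep, and method are taken from the text. The dataset name `holzinger19` and the seed value are assumptions on my part, so check the package documentation if the data object is named differently in your installed version.

```r
library(fspe)

# Assumption: the reduced 19-variable dataset ships under the name holzinger19.
data(holzinger19)
dim(holzinger19)

# Arbitrary seed (the value is an assumption); it fixes the random fold assignment.
set.seed(1)

fspe_out <- fspe(holzinger19,
                 maxK = 10,      # consider models with 1, ..., 10 factors
                 nfold = 10,     # 10-fold cross-validation
                 rep = 10,       # repeat the cross-validation 10 times
                 method = "PE")  # prediction error; "CovE" is the alternative
```

Repeating the cross-validation scheme averages over the randomness of the fold assignment, at the cost of a proportionally longer run time.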

We can inspect the out-of-sample prediction error, averaged across variables, folds, and repetitions, as a function of the number of factors.

We see that the out-of-sample prediction error is minimized by the factor model with four factors. The number of factors with lowest prediction error can also be directly obtained from the output object:
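A sketch of both inspection steps, plotting the averaged errors and reading off the selected number of factors, is given here. The fit is repeated so that the snippet runs on its own. The element names `PEs` and `nfactor`, the dataset name `holzinger19`, and the seed value are assumptions that should be checked against the package documentation; with a different seed the numbers may differ slightly from those described in the text.

```r
library(fspe)

# Same call as described in the text, repeated so this snippet is self-contained.
data(holzinger19)   # assumed dataset name
set.seed(1)         # arbitrary seed value
fspe_out <- fspe(holzinger19, maxK = 10, nfold = 10, rep = 10, method = "PE")

# Averaged out-of-sample prediction error for 1, ..., 10 factors
# (assumed element name: PEs).
plot(1:10, fspe_out$PEs, type = "b",
     xlab = "Number of factors", ylab = "Prediction error")
abline(h = min(fspe_out$PEs), col = "grey")

# Number of factors with the lowest averaged prediction error
# (assumed element name: nfactor); which.min() gives the same information.
fspe_out$nfactor
which.min(fspe_out$PEs)
```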

The unaggregated prediction errors from the 10 repetitions of the cross-validation scheme can be found in fspe_out$PE_array.