Maximum penalised likelihood in Factor Analysis

Philipp Sterzinger

Department of Statistics, LSE

p.sterzinger@lse.ac.uk
psterzinger.at   psterzinger


Statistics Research Showcase
London School of Economics

May 28th

Joint with

Irini Moustaki
LSE

Ioannis Kosmidis
Warwick

This talk

Sterzinger P., Kosmidis I., and Moustaki I. (2026). Maximum Softly Penalized Likelihood in Factor Analysis, Psychometrika. DOI:10.1017/psy.2026.10092

Outline

Normal-linear factor model

Normal-linear factor model

Data

\(x_1,x_2,\ldots,x_n \in \Re^p\)

Model

\[X_i = \mu_0 + \Lambda_0 Y_i + E_i\]

Unobservables

Latent factors: \(Y_i \sim \mathrm{N}(0_q, I_q)\)

Unique errors: \(E_i \sim \mathrm{N}(0_p,\Psi_0)\)

Parameters

Conditional mean: \(\mu_0 \in \Re^p\)

Factor loadings: \(\Lambda_0 \in \Re^{p \times q}\)

Specific variances \(\Psi_0\): \(p \times p\) diagonal, positive definite matrix

Normal-linear factor model

Log-likelihood

\(X_i \sim \mathrm{N}(\mu_0, \Sigma_0)\), for \(\Sigma_0 = \Lambda_0 {\Lambda_0}^\top + \Psi_0\) so that

\[ \ell(\Sigma; S) = - \frac{n}{2} \left \{\log \mathrm{det}(\Sigma) + \mathrm{tr}\left(\Sigma^{-1}S \right) \right\} + C \,, \] where \(S = 1/n \sum_{i = 1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\), \(\bar{x} = 1/n \sum_{i = 1}^n x_i\)

Maximum likelihood (ML) estimator

\(\hat{\Lambda}, \hat{\Psi} = \arg \max \,\ell(\Lambda \Lambda^\top + \Psi; S)\)

for \(\Lambda \in \Re^{p \times q}\), \(\Psi \in \Re^{p \times p}\) diagonal and positive definite

Motivating Example

Job application data

Data from Kendall (1980) which scores job applicants \((n = 48)\) according to a range of criteria \((p = 15)\)


Form Appearance Ability Likeability Selfconf. Lucidity Honesty Salesmanship Experience Drive Ambition Grasp Potential Keenness Suitability
6 7 2 5 8 7 8 8 3 8 9 7 5 7 10
9 10 5 8 10 9 9 10 5 9 9 8 8 8 10
7 8 3 6 9 8 9 7 4 9 9 8 6 8 10
5 6 8 5 6 5 9 2 8 4 5 8 7 6 5
6 8 8 8 4 5 9 2 8 5 5 8 8 7 7
7 7 7 6 8 7 10 5 9 6 5 8 6 6 6
9 9 8 8 8 8 8 8 10 8 10 8 9 8 10
9 9 9 8 9 9 8 8 10 9 10 9 9 9 10
9 9 7 8 8 8 8 5 9 8 9 8 8 8 10
4 7 10 2 10 10 7 10 3 10 10 10 9 3 10
4 7 10 0 10 8 3 9 5 9 10 8 10 2 5
4 7 10 4 10 10 7 8 2 8 8 10 10 3 7
6 9 8 10 5 4 9 4 4 4 5 4 7 6 8
8 9 8 9 6 3 8 2 5 2 6 6 7 5 6
4 8 8 7 5 4 10 2 7 5 3 6 6 4 6
6 9 3 7 8 9 8 9 8 8 7 6 8 6 10
8 7 7 7 9 5 8 6 6 7 8 6 6 7 8
6 8 8 4 8 8 6 4 3 3 6 7 2 6 4
6 7 8 4 7 8 5 4 4 2 6 8 3 5 4
4 8 7 8 8 9 10 5 2 6 7 9 8 8 9
3 8 6 8 8 8 10 5 3 6 7 8 8 5 8
9 8 7 8 9 10 10 10 3 10 8 10 8 10 8
7 10 7 9 9 9 10 10 3 9 9 10 9 10 8
9 8 7 10 8 10 10 10 2 9 7 9 9 10 8
6 9 7 7 4 5 9 3 2 4 4 4 4 5 4
7 8 7 8 5 4 8 2 3 4 5 6 5 5 6
2 10 7 9 8 9 10 5 3 5 6 7 6 4 5
6 3 5 3 5 3 5 0 0 3 3 0 0 5 0
4 3 4 3 3 0 0 0 0 4 4 0 0 5 0
4 6 5 6 9 4 10 3 1 3 3 2 2 7 3
5 5 4 7 8 4 10 3 2 5 5 3 4 8 3
3 3 5 7 7 9 10 3 2 5 3 7 5 5 2
2 3 5 7 7 9 10 3 2 2 3 6 4 5 2
3 4 6 4 3 3 8 1 1 3 3 3 2 5 2
6 7 4 3 3 0 9 0 1 0 2 3 1 5 3
9 8 5 5 6 6 8 2 2 2 4 5 6 6 3
4 9 6 4 10 8 8 9 1 3 9 7 5 3 2
4 9 6 6 9 9 7 9 1 2 10 8 5 5 2
10 6 9 10 9 10 10 10 10 10 8 10 10 10 10
10 6 9 10 9 10 10 10 10 10 10 10 10 10 10
10 7 8 0 2 1 2 0 10 2 0 3 0 0 10
10 3 8 0 1 1 0 0 10 0 0 0 0 0 10
3 4 9 8 2 4 5 3 6 2 1 3 3 3 8
7 7 7 6 9 8 8 6 8 8 10 8 8 6 5
9 6 10 9 7 7 10 2 1 5 5 7 8 4 5
9 8 10 10 7 9 110 3 1 5 7 9 9 4 4
0 7 10 3 5 0 10 0 0 2 2 0 0 0 0
0 6 10 1 5 0 10 0 0 2 2 0 0 0 0


Want to use exploratory factor analysis to find common factors underlying the data

Exploratory Factor Analysis with R

Most popular R packages for ML estimation of EFA on CRAN*:

* by monthly downloads; CRAN metadata search for EFA keywords & EFA fit function in documentation; stats::factanal excluded

Exploratory Factor Analysis with R

stats::factanal

factanal_fit <- stats::factanal(x = job_data, factors = 4)
 
factanal_uniquenesses
 
FormAppearanceAbilityLikeabilitySelfconf.
0.50334512 0.70264896 0.57633097 0.27770086 0.12490328
LucidityHonestySalesmanshipExperienceDrive
0.17331439 0.68782086 0.10727743 0.37295753 0.21792774
AmbitionGraspPotentialKeennessSuitability
0.15640507 0.13156330 0.09499911 0.00500000 0.14549804

Exploratory Factor Analysis with R

lavaan::efa

lavaan_fit <- efa(data = job_data, nfactors = 4)
 
lavaan_uniquenesses
 
FormAppearanceAbilityLikeabilitySelfconf.
3.5220432 2.6588417 2.4236585 2.1569829 0.7153340
LucidityHonestySalesmanshipExperienceDrive
1.6915113 150.1761574 1.2578323 4.0055043 1.8532189
AmbitionGraspPotentialKeennessSuitability
1.3201438 1.1856554 0.9438065 0.0000000 1.5426059

Exploratory Factor Analysis with R

psych::fa

fa_fit <- fa(job_data, nfactors = 4)
 
Warning: An ultra-Heywood case was detected. Examine the results carefully


fa_fit$uniquenesses
 
FormAppearanceAbilityLikeabilitySelfconf.
0.509321362 0.706583340 0.576785093 0.297299960 0.133265973
LucidityHonestySalesmanshipExperienceDrive
0.197023150 0.640299618 0.096908170 0.250289966 0.248683074
AmbitionGraspPotentialKeennessSuitability
0.157873233 0.137698609 0.095711717 -0.004485347 0.207264486

Exploratory Factor Analysis with R

psych::fa

fa_fit <- fa(job_data, nfactors = 4)
 
Warning: An ultra-Heywood case was detected. Examine the results carefully


fa_fit$uniquenesses
 
FormAppearanceAbilityLikeabilitySelfconf.
0.509321362 0.706583340 0.576785093 0.297299960 0.133265973
LucidityHonestySalesmanshipExperienceDrive
0.197023150 0.640299618 0.096908170 0.250289966 0.248683074
AmbitionGraspPotentialKeennessSuitability
0.157873233 0.137698609 0.095711717 -0.004485347 0.207264486


The ML estimate does not exist

This phenomenon is termed a Heywood (1931) case

Nonexistence of ML estimates

Dominating parameter sequence

\(\{ (\Lambda_n, \Psi_n) \}_{n \in \mathbb{N}}\):

  • for all \((\Lambda, \Psi)\), there is a \(N \in \mathbb{N}\) such that for all \(n > N\), \(\ell(\Lambda_n, \Psi_n) > \ell(\Lambda, \Psi)\)

Convergence to boundary

  • \(\lim_\limits{n \to \infty} \min_{i} [\Psi_n]_{ii} = 0\)

Nonexistence of ML estimates

Dominating parameter sequence

Profiled boundary path \((\Lambda_{\psi},\Psi_{\psi})\) from maximising \(\ell(\Lambda,\Psi)\) at fixed \(\psi_{\mathrm{Keenness}}=\psi\)

Maximum softly-penalized likelihood

Maximum penalized likelihood

Add penalty function \(P(\Lambda, \Psi)\) that diverges to \(-\infty\) whenever \(\Psi \to 0\), or \(\Psi,\Lambda \to \infty\)

Penalized log-likelihood

\[\ell^*(\Lambda, \Psi; S) = \ell(\Lambda \Lambda^\top + \Psi; S) + P(\Lambda, \Psi)\]

Maximum penalized likelihood (MPL) estimator

\[\bar{\Lambda}, \bar{\Psi} \in \arg \max \, \ell^*(\Lambda, \Psi; S) \tag{1}\]

Maximum penalized likelihood

Discouraging dominating parameter sequences

Existence

Assume:

  • \(S\) is full rank
  • \(P(\Lambda,\Psi)\) is continuous and bounded from above
  • \(P(\Lambda_n,\Psi_n)\to-\infty\) for any convergent sequence \((\Lambda_n,\Psi_n)\) with positive uniquenesses such that

\[ \lim_{n\to\infty}\min_i[\Psi_n]_{ii}=0, \qquad \lim_{n\to\infty} \lambda_{\min}(\Lambda_n\Lambda_n^\top+\Psi_n)>0 \]

Then the MPL estimates \(\bar{\Lambda}, \bar{\Psi}\) exist, i.e.  \[ \operatorname*{arg\,max}_{\Lambda,\Psi: \Psi_{ii} > 0} \, \ell^*(\Lambda,\Psi;S)\neq \emptyset \,. \]

Soft penalisation

Penalisation distorts MPL estimates away from ML estimates

Scale penalty by sequence \(c_n\) to make \(c_n P(\cdot, \cdot)\) asymptotically negligible

Maximum softly-penalized likelihood (MSPL) estimator

\[\bar{\Lambda}, \bar{\Psi} \in \arg \max \, \ell(\Lambda \Lambda^\top + \Psi; S) + c_n P(\Lambda, \Psi)\]

Consistency

Assume:

  • The set of MPL estimates is nonempty
  • \(P(\Lambda, \Psi)\) is non-positive
  • \(\Sigma_0 = \Lambda_0 \Lambda_0^\top + \Psi_0\) is strongly identifiable in that for any \(\epsilon>0\) there exists a \(\delta>0\) such that \[ ||| \Sigma_0 - \Sigma |||_{\max} < \delta \implies ||| \Lambda_0 - \Lambda Q|||_{\max} < \epsilon, \, ||| \Psi_0 - \Psi|||_{\max} < \epsilon \,, \] for some \(q \times q\) orthogonal matrix \(Q\).

For any \(\epsilon > 0\), there exists a \(\delta > 0\), such that \[ \begin{aligned} ||| \Sigma_0 - S |||_{\max} &< \delta \\ |n^{-1}c_n P(\Lambda, \Psi)| &< \delta \end{aligned} \quad \implies \quad \begin{aligned} ||| \Lambda_0 - \bar{\Lambda} Q|||_{\max} &< \epsilon \\ ||| \Psi_0 - \bar{\Psi}|||_{\max} &< \epsilon \end{aligned} \,, \] for some \(q \times q\) orthogonal matrix \(Q\).

Consistency

For any \(\epsilon > 0\), there exists a \(\delta > 0\), such that \[ \begin{aligned} ||| \Sigma_0 - S |||_{\max} &< \delta \\ \class{math-highlight}{|n^{-1}c_n P(\Lambda, \Psi)|} &< \delta \end{aligned} \quad \implies \quad \begin{aligned} ||| \Lambda_0 - \bar{\Lambda} Q|||_{\max} &< \epsilon \\ ||| \Psi_0 - \bar{\Psi}|||_{\max} &< \epsilon \end{aligned} \,, \] for some \(q \times q\) orthogonal matrix \(Q\).

If \(S \longrightarrow \Sigma_0\) and \(P(\Lambda,\Psi)\) is an \(O(1)\) function, then \(c_n=o(n)\) is sufficient for consistency

Consistency

For any \(\epsilon > 0\), there exists a \(\delta > 0\), such that \[ \begin{aligned} ||| \Sigma_0 - S |||_{\max} &< \delta \\ \class{math-highlight}{|n^{-1}c_n P(\Lambda, \Psi)|} &< \delta \end{aligned} \quad \implies \quad \begin{aligned} ||| \Lambda_0 - \bar{\Lambda} Q|||_{\max} &< \epsilon \\ ||| \Psi_0 - \bar{\Psi}|||_{\max} &< \epsilon \end{aligned} \,, \] for some \(q \times q\) orthogonal matrix \(Q\).

If \(S \longrightarrow \Sigma_0\) and \(P(\Lambda,\Psi)\) is an \(O(1)\) function, then \(c_n=o(n)\) is sufficient for consistency


Can be strengthened to \[ ||| \sqrt{n}(\hat{\Lambda}Q_n - \bar \Lambda O_n) |||_{\max} = o_p(1), \qquad |||\sqrt{n}(\hat \Psi - \bar \Psi) |||_{\max} = o_p(1), \] for rotation matrices \(O_n\) and \(Q_n\)

Choosing \(c_n\)

Independence model

\[ \mu_0 = 0_p, \qquad \Lambda_0 = 0_{p\times q}, \qquad \Psi_0 = \operatorname{diag}(\sigma_1^2,\ldots,\sigma_p^2). \]

Each coordinate reduces to a univariate normal model: \[ X_{ij} \sim N(0,\sigma_j^2) \,. \]

Rate of information accumulation

\[ I_j(\sigma_j^2) = -\mathbb E\left[ \frac{\partial^2 \ell_j}{\partial(\sigma_j^2)^2} \right] = \frac{n}{2\sigma_j^4}, \qquad \sqrt{\frac{n}{2}} \left( \frac{\hat\sigma_j^2}{\sigma_j^2}-1 \right) \longrightarrow N(0,1) \,. \]

Root-inverse-information scale is \(c_n = \sqrt{2/n}\)

Candidate penalties

Akaike (1987)

\[ P(\Lambda, \Psi) = -\frac{\rho n}{2} \mathrm{tr}\left(\Psi^{-1/2} \Lambda \Lambda^\top \Psi^{-1/2}\right) \]

Hirose et al. (2011)

\[ P(\Psi) = -\frac{\rho n}{2} \mathrm{tr}\left(\Psi^{-1/2} S \Psi^{-1/2}\right) \]

Both are scale-equivariant under response rescaling and rotation-invariant under \(\Lambda \mapsto \Lambda Q\).

Simulation

Setup

Based on Cooperman & Waller (2022)

  • \(q = 3\)

  • \(n \in \{50, 100, 400\}\)

  • item-to-factor ratio \(3:1, 5:1, 8:1\)

Methods

Method Figure label Scale
ML None
MPL Akaike[\(n\)], Hirose[\(n\)] \(\rho = 1\)
MSPL Akaike[\(n^{-1/2}\)], Hirose[\(n^{-1/2}\)] \(\rho = 2\sqrt{2/n^3}\)

Setup

Loading Matrix A

Items Loading 1 Loading 2 Loading 3
1 0.80 0 0
2 0.65 0 0
3 0.45 0 0
4 0 0.80 0
5 0 0.65 0
6 0 0.45 0
7 0 0 0.80
8 0 0 0.65
9 0 0 0.45

Loading Matrix B

Items Loading 1 Loading 2 Loading 3
1 0.80 0 0
2 0.80 0 0
3 0.80 0 0
4 0 0.80 0
5 0 0.80 0
6 0 0.80 0
7 0 0 0.30
8 0 0 0.30
9 0 0 0.30

% Heywood cases

Frequentist performance

Model selection

References

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332. https://doi.org/10.1007/BF02294359
Cooperman, A. W., & Waller, N. G. (2022). Heywood you go away! Examinig causes, effects, and treatments for Heywood cases in exploratory factor analysis. Psychological Methods, 27(2), 156–176. https://doi.org/10.1037/met0000384
Heywood, H. B. (1931). On finite sequences of real numbers. Proceedings of the Royal Society, A, 134, 486–510. https://doi.org/10.1098/rspa.1931.0209
Hirose, K., Kawano, S., Konishi, S., & Ichikawa, M. (2011). Bayesian information criterion and selection of the number of factors in factor analysis models. Journal of Data Science, 9(2), 243–259. https://doi.org/10.6339/JDS.201104_09(2).0007
Kendall, M. G. (1980). Multivariate analysis. Charles Griffin & Co. Ltd.

Maximum softly-penalised likelihood framework is a robust alternative to ML estimation in factor analysis

  • Penalisation guarantees existence of estimates

  • Soft scaling preserves ML asymptotics

Sterzinger P., Kosmidis I., and Moustaki I. (2026). Maximum Softly Penalized Likelihood in Factor Analysis, Psychometrika. DOI:10.1017/psy.2026.10092