Overcoming Heywood cases in factor analysis: a maximum penalized likelihood approach

Philipp Sterzinger

Department of Statistics, LSE

p.sterzinger@lse.ac.uk
lse.ac.uk/statistics/people/philipp-sterzinger   psterzinger


Social Statistics seminar
London

24 October 2024

Joint with




Irini Moustaki

Ioannis Kosmidis

Outline

Normal-linear factor model

Normal-linear factor model

Data

\(x_1,x_2,\ldots,x_n \in \Re^p\)

Model

\[X_i = \mu_0 + \Lambda_0 Y_i + E_i\]

Unobservables

Latent factors: \(Y_i \sim \mathrm{N}(0_q, I_q)\)

Unique errors: \(E_i \sim \mathrm{N}(0_p,\Psi_0)\)

Parameters

Conditional mean: \(\mu_0 \in \Re^p\)

Factor loadings: \(\Lambda_0 \in \Re^{p \times q}\)

Specific variances \(\Psi_0\): \(p \times p\) diagonal, postive definite matrix

Normal-linear factor model

Log-likelihood

\(X_i \sim \mathrm{N}(\mu_0, \Sigma_0)\), for \(\Sigma_0 = \Lambda_0 {\Lambda_0}^\top + \Psi_0\) so that

\[ \ell(\Sigma; S) = - \frac{n}{2} \left \{\log \mathrm{det}(\Sigma) + \mathrm{tr}\left(\Sigma^{-1}S \right) \right\} + C \,, \] where \(S = 1/n \sum_{i = 1}^n (x_i - \bar{x})(x_i - \bar{x})^\top\), \(\bar{x} = 1/n \sum_{i = 1}^n x_i\)

Maximum likelihood (ML) estimator

\(\hat{\Lambda}, \hat{\Psi} = \arg \max \,\ell(\Lambda \Lambda^\top + \Psi; S)\)

Motivating Example

Job application data

Data from Hirose et al. (2011) which scores job applicants (n = 48) according to a range of criteria (p = 15)


Form Appearance Ability Likeability Selfconf. Lucidity Honesty Salesmanship Experience Drive Ambition Grasp Potential Keenness Suitability
6 7 2 5 8 7 8 8 3 8 9 7 5 7 10
9 10 5 8 10 9 9 10 5 9 9 8 8 8 10
7 8 3 6 9 8 9 7 4 9 9 8 6 8 10
5 6 8 5 6 5 9 2 8 4 5 8 7 6 5
6 8 8 8 4 5 9 2 8 5 5 8 8 7 7
7 7 7 6 8 7 10 5 9 6 5 8 6 6 6
9 9 8 8 8 8 8 8 10 8 10 8 9 8 10
9 9 9 8 9 9 8 8 10 9 10 9 9 9 10
9 9 7 8 8 8 8 5 9 8 9 8 8 8 10
4 7 10 2 10 10 7 10 3 10 10 10 9 3 10
4 7 10 0 10 8 3 9 5 9 10 8 10 2 5
4 7 10 4 10 10 7 8 2 8 8 10 10 3 7
6 9 8 10 5 4 9 4 4 4 5 4 7 6 8
8 9 8 9 6 3 8 2 5 2 6 6 7 5 6
4 8 8 7 5 4 10 2 7 5 3 6 6 4 6
6 9 3 7 8 9 8 9 8 8 7 6 8 6 10
8 7 7 7 9 5 8 6 6 7 8 6 6 7 8
6 8 8 4 8 8 6 4 3 3 6 7 2 6 4
6 7 8 4 7 8 5 4 4 2 6 8 3 5 4
4 8 7 8 8 9 10 5 2 6 7 9 8 8 9
3 8 6 8 8 8 10 5 3 6 7 8 8 5 8
9 8 7 8 9 10 10 10 3 10 8 10 8 10 8
7 10 7 9 9 9 10 10 3 9 9 10 9 10 8
9 8 7 10 8 10 10 10 2 9 7 9 9 10 8
6 9 7 7 4 5 9 3 2 4 4 4 4 5 4
7 8 7 8 5 4 8 2 3 4 5 6 5 5 6
2 10 7 9 8 9 10 5 3 5 6 7 6 4 5
6 3 5 3 5 3 5 0 0 3 3 0 0 5 0
4 3 4 3 3 0 0 0 0 4 4 0 0 5 0
4 6 5 6 9 4 10 3 1 3 3 2 2 7 3
5 5 4 7 8 4 10 3 2 5 5 3 4 8 3
3 3 5 7 7 9 10 3 2 5 3 7 5 5 2
2 3 5 7 7 9 10 3 2 2 3 6 4 5 2
3 4 6 4 3 3 8 1 1 3 3 3 2 5 2
6 7 4 3 3 0 9 0 1 0 2 3 1 5 3
9 8 5 5 6 6 8 2 2 2 4 5 6 6 3
4 9 6 4 10 8 8 9 1 3 9 7 5 3 2
4 9 6 6 9 9 7 9 1 2 10 8 5 5 2
10 6 9 10 9 10 10 10 10 10 8 10 10 10 10
10 6 9 10 9 10 10 10 10 10 10 10 10 10 10
10 7 8 0 2 1 2 0 10 2 0 3 0 0 10
10 3 8 0 1 1 0 0 10 0 0 0 0 0 10
3 4 9 8 2 4 5 3 6 2 1 3 3 3 8
7 7 7 6 9 8 8 6 8 8 10 8 8 6 5
9 6 10 9 7 7 10 2 1 5 5 7 8 4 5
9 8 10 10 7 9 110 3 1 5 7 9 9 4 4
0 7 10 3 5 0 10 0 0 2 2 0 0 0 0
0 6 10 1 5 0 10 0 0 2 2 0 0 0 0


Want to use exploratory factor analysis to find common factors underlying the data

Job application data

Fitting a normal-linear factor model with 4 factors in R using the psych package yields

fa_fit <- fa(job_data, nfactors = 4)
 
Warning: simpleWarning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An ultra-Heywood case was detected. Examine the results carefully


Estimated specific variances

fa_fit$uniquenesses
 
FormAppearanceAbilityLikeabilitySelfconf.
0.509321362 0.706583340 0.576785093 0.297299960 0.133265973
LucidityHonestySalesmanshipExperienceDrive
0.197023150 0.640299618 0.096908170 0.250289966 0.248683074
AmbitionGraspPotentialKeennessSuitability
0.157873233 0.137698609 0.095711717 -0.004485347 0.207264486


The ML estimate does not exist

Heywood cases

Dominating parameter sequence

  • \(\{ (\Lambda_n, \Psi_n) \}_{n \in \mathbb{N}}\): for all \((\Lambda, \Psi)\), there is a \(n \in \mathbb{N}\) such that \(\ell(\Lambda_n, \Psi_n) > \ell(\Lambda, \Psi)\)

Divergence to boundary

  • \(\lim\limits_{n \to \infty} \min_{i} [\Psi_n]_{ii} = 0\)

Depending optimisation routine and constraints, observe small or negative values of \(\hat{\Psi}\)

This phenomenon is termed a Heywood (1931) case

Maximum softly-penalized likelihood

Maximum penalized likelihood

Add penalty function \(P(\Psi, \Lambda)\) that diverges to \(-\infty\) whenever \(\Psi \to 0\), or \(\Psi,\Lambda \to \infty\)

Penalized log-likelihood

\[\ell^*(\Lambda, \Psi; S) = \ell(\Lambda \Lambda^\top + \Psi; S) + P(\Lambda, \Psi)\]

Maximum penalized likelihood (MPL) estimator

\[\bar{\Lambda}, \bar{\Psi} = \arg \max \, \ell^*(\Lambda, \Psi; S) \tag{1}\]

Existence

  • \(S\) is full rank
  • \(P(\Lambda, \Psi)\) is continuous and bounded from above
  • \(\lim_\limits{n \to \infty} P(\Lambda_n,\Psi_n) = - \infty\) for any sequence \(\{\Lambda_n, \Psi_n\}_{n \in \mathbb{N}}\) such that \(\min_{i} [\Psi_n]_{ii} \to 0\) or either \(\max_{i} [\Psi_n]_{ii} \to \infty\), \(|||\Lambda_n|||_{\max} \to \infty\)

The MPL estimates \[\bar{\Lambda}, \bar{\Psi} = \arg \max \, \ell^*(\Lambda, \Psi; S)\] exist

Soft penalisation

Distort MPL estimates away from ML estimates through penalisation

Scale penalty by sequence \(c_n\) to make \(c_n P(\cdot, \cdot)\) asymtpotically negligible

Maximum softly-penalized likelihood (MSPL) estimator

\[\bar{\Lambda}, \bar{\Psi} = \arg \max \, \ell(\Lambda \Lambda^\top + \Psi; S) + c_n P(\Lambda, \Psi)\]

Consistency

  • \(\Sigma_0 = \Lambda_0 \Lambda_0^\top + \Psi_0\) is strongly identifiable in that for any \(\epsilon>0\) there exists a \(\delta>0\) such that \[ ||| \Sigma_0 - \Sigma |||_{\max} < \delta \implies ||| \Lambda_0 - \Lambda O|||_{\max} < \epsilon, \, ||| \Psi_0 - \Psi|||_{\max} < \epsilon \,, \] for some orthogonal matrix \(O\) of order \(q\)

  • \(P(\Lambda, \Psi)\) is non-positive

For any \(\epsilon > 0\), there exists a \(\delta > 0\), such that \[ \begin{aligned} ||| \Sigma_0 - S |||_{\max} &< \delta \\ |n^{-1}c_n P(\Lambda, \Psi)| &< \delta \end{aligned} \quad \implies \quad \begin{aligned} ||| \Lambda_0 - \bar{\Lambda} O|||_{\max} &< \epsilon \\ ||| \Psi_0 - \bar{\Psi}|||_{\max} &< \epsilon \end{aligned} \,, \] for some \(q \times q\) orthogonal matrix \(O\).

Asymptotic normality

  • \((\hat{\Lambda}, \hat{\Psi})\), \((\bar{\Lambda}, \bar{\Psi})\) are consistent for \((\Lambda_0, \Psi_0)\)

  • \({\sup}_{\theta \in \mathcal{N}_{0}} \, ||| n^{-1}\nabla \nabla^\top \ell(\theta) - J(\theta)|||_{\max} = o_{p}(1)\), where \(\theta = (\textrm{vec}(\Lambda)^\top, \textrm{diag}(\Psi)^\top)^\top\), \(J(\theta)\) is continuous and invertible in a neighbourhood \(\mathcal{N}_0\) around \(\theta_0\)

  • \(P(\theta)\) is differentiable in a neighbourhood \(\mathcal{N}_0\) around \(\theta_0\) a

\[ {\sup}_{\theta \in \mathcal{N}_0} \, || c_n \nabla P(\theta)||_{\max} = o_p(\sqrt{n}) \quad \implies \quad \begin{aligned} ||| \sqrt{n} (\hat{\Lambda}Q_n - \bar{\Lambda} O_n) |||_{\max} &= o_p(1) \\ \quad ||| \sqrt{n} (\hat{\Psi} - \bar{\Psi}) |||_{\max} &= o_p(1) \\ \end{aligned} \] for some sequence of orthogonal matrices \(\{O_n, Q_n\}_{n \in \mathbb{N}}\).

Candidate penalties

Huber loss

\[ P(\Psi) =- \sum_{j =1}^p g(\Psi_{ii}), \quad g(x) = \begin{cases} \frac{1}{2} \log(x)^2 & \textrm{if } |\log(x)| < 1, \\ |\log(x)| - \frac{1}{2} & \textrm{else} \end{cases} \]

Akaike (1987)

\[ P(\Lambda, \Psi) = -\frac{\rho n}{2} \mathrm{tr}\left(\Psi^{-1/2} \Lambda \Lambda^\top \Psi^{-1/2}\right) \]

Hirose et al. (2011)

\[ P(\Psi) = -\frac{\rho n}{2} \mathrm{tr}\left(\Psi^{-1/2} S \Psi^{-1/2}\right) \]

Simulation

Simulation setup

Based on Cooperman & Waller (2022)

  • \(q = 3\)
  • \(n \in \{50, 100, 400\}\)
  • item-to-factor ratio \(3:1, 5:1, 8:1\)
  • Two loading matrix designs

Loading Matrix A

Items Loading 1 Loading 2 Loading 3
1 0.80 0 0
2 0.65 0 0
3 0.45 0 0
4 0 0.80 0
5 0 0.65 0
6 0 0.45 0
7 0 0 0.80
8 0 0 0.65
9 0 0 0.45

Loading Matrix B

Items Loading 1 Loading 2 Loading 3
1 0.80 0 0
2 0.80 0 0
3 0.80 0 0
4 0 0.80 0
5 0 0.80 0
6 0 0.80 0
7 0 0 0.30
8 0 0 0.30
9 0 0 0.30

Huber loss penalty with \(c_n = n^{-1/2}\) on \(\Psi\) (Huber[\(\Psi\)]) and \(\Psi,\Lambda \Lambda^\top\) (Huber[\(\Psi, \Lambda \Lambda'\)])

Akaike (1987), Hirose et al. (2011) penalties with \(\rho \in \{1, n^{-3/2} \}\)

% Used

Bias

Probabiliy of underestimation

Root mean-squared error

Model selection

Take away

Maximum softly-penalized likelihood framework is a robust alternative to ML estimation in factor analysis with desirable theoretical properties and strong finite-sample performance

  • Penalisation guarantees existence of estimates

  • Soft scaling preserves ML asymptotics

  • Strong finite-sample & model selection performance

  • Simple Huber loss penalty favourable to penalties of Hirose et al. (2011) and Akaike (1987)

References

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332.
Cooperman, A. W., & Waller, N. G. (2022). Heywood you go away! Examinig causes, effects, and treatments for Heywood cases in exploratory factor analysis. Psychological Methods, 27, 156–176.
Heywood, H. (1931). On finite sequences of real numbers. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 134(824), 486–501.
Hirose, K., Kawano, S., Konishi, S., & Ichikawa, M. (2011). Bayesian information criterion and selection of the number of factors in factor analysis models. Journal of Data Science, 9(2), 243–259.