Improving estimation and prediction from statistical models

Philipp Sterzinger

Department of Statistics, London School of Economics

p.sterzinger@lse.ac.uk
https://www.lse.ac.uk/people/philipp-sterzinger psterzinger

Harrison Award Ceremony
University of Warwick

12 December 2025

Joint with

Ioannis Kosmidis

This talk

Improving estimation and prediction from statistical models

High-dimensional logistic regresison

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290

Improving estimation and prediction from statistical models

General theme

Estimation problem

\[ \hat \theta \in \underset{\theta \in \Theta}{\arg \min} \, \ell(\theta; X, y) \]

Nonexistence

\(\{\theta_n\}_{n \in \mathbb{N}}\) such that either

\(\lim\limits_{n \to \infty} \| \theta_n \| = \infty\), or
\(\lim\limits_{n \to \infty} \theta_n \in \partial \Theta\), and

\[ \lim\limits_{n \to \infty} \ell(\theta_n; y, X) = \underset{\theta \in \Theta}{\inf} \, \ell(\theta; y, X) \]

Penalised estimation

\[ \tilde \theta \in \underset{\theta \in \Theta}{\arg \min} \, \ell(\theta; y, X) + P(\theta) \]

Penalty function

\[ \lim\limits_{n \to \infty} P(\theta_n) = \infty \]

for \(\{\theta_n\}_{n \in \mathbb{N}}\) such that either

\(\lim\limits_{n \to \infty} \| \theta_n \| = \infty\), or
\(\lim\limits_{n \to \infty} \theta_n \in \partial \Theta\), and

Improved penalised estimation

Maximum softly penalized likelihood

Soft scaling of penalty function to make distortions of penalty asymptotically negligible
Preserve potential optimal properties of original estimator

Sterzinger P, Kosmidis I (2023). Maximum softly-penalized likelihood for mixed effects logistic regression. Statistics and Computing, 33(2), 53.

Sterzinger, P., Kosmidis, I., and Moustaki, I. (2025). Maximum softly penalized likelihood for factor analysis. ArXiV: 2510.06465

High-dimensional corrections

Derive limiting distribution of penalised estimator in high-dimensional asymptotic regime
Devise corrections of estimator and test statistics to recover signal recovery and inference

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290

Logistic regression

Logistic regression model

Data

\(y_1, \ldots, y_n \in \{0, 1\}\)

\(x_1, \ldots, x_n \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = x_i^\top \beta \]

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta; y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \arg \max l(\beta; y, X)\)

\(p / n \to \kappa \in (0, 1)\)

Artificial example

\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \approx 4.7 \\ \kappa = 0.2 \\ \end{array} \]

\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]

\(W = 2 ( \hat{l} - \hat{l}_{0} )\), \(\hat{l}_{0}\) is maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)

Classical theory predicts \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)

\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)

Classical theory predicts \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)

Recent developments

Candès & Sur (2020)

sharp phase transition about when the ML estimate does not exist, when \(\eta_i = \delta + x_i^\top \beta\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists

Salehi et al. (2019)

aggregate asymptotics of the logistic ridge and LASSO based on convex gaussian min max theorem, but no inference

Rigon & Aliverti (2023)

Study maximum Diaconis-Ylvisaker prior penalised likelihood estimator for logistic regression and report solid empirical performance in high-dimensional setting

Phase transition, \(\hat\beta\) (Candès & Sur, 2020)

Phase transition, \(\hat\beta / {\mu}_{\star}\) (Sur & Candès, 2019)

Diaconis-Ylvisaker prior

Prior

\[ \log p(\beta; X) = \frac{1 - {\color[]{#E16A86}{\alpha}}}{{\color[]{#E16A86}{\alpha}}} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top {\color[]{#E16A86}{\beta}_P}\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)

Penalized log-likelihood

\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses: \(l(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations

Shrinkage

\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)

Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)

Shrinkage direction

Fixed \(p\) (Cordeiro & McCullagh, 1991)

ML estimator of a logistic regression model is biased away from \(0\)

\(p / n \to \kappa \in (0, 1)\) (Sur & Candès, 2019; Zhao et al., 2022)

Empirical evidence that this continues to hold in high dimensions

\[\LARGE\Downarrow\]

\[\beta_{P} = 0\]

Estimation

Setting

Signal \(\beta_0\)

The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)

Covariate distribution, dimension, and signal strength

\(x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1 / n)\)

As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma^2 > 0\)

Idea

Develop a generalized AMP recursion (see Feng et al., 2022 for an overview of AMP), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)

AMP recursion

\[ \begin{aligned} \hat{\beta}^t &= \hat \beta^{t - 1} + \kappa^{-1} \Psi^{t-1}(y^*, S^{t-1}) \\ S^t &= X \hat \beta^t - \Psi^{t-1}(y^*, S^{t-1}) ,, \end{aligned} \] for \[ \Psi^t(y_i^*, S^t_i) = b_t (y^*_i - \zeta'(\textrm{prox}_{b_t \zeta}\left(b_t y_i^* + S_i\right))) \] and where \(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator

Fixed point satisfies \[ X^\top (y^* - X\beta^*) = 0 \]

State evolution of AMP recursion

Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\color[]{#E16A86}{\alpha}}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''\left(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right)\right)} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\color[]{#E16A86}{\alpha}}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)

\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

Solvers for nonlinear systems, after cubature approximation of expectations

MDYPL asymptotics

Theorem 1 (Aggregate asymptotic behaviour of mDYPL estimator)

Assume that \((\alpha,\kappa,\gamma)\) are such that (1) admits a solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) such that the Jacobian of the RHS of (1) is nonsingular.

Then, for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2¹ \[ \frac{1}{p} \sum_{j=1}^{n} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\)	Bias	\(0\)
\((t-(1-{\mu}_{\star})u)^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\)	MSE	\({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)
\(t^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\)	Variance	\({\sigma}_{\star}^2\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t / {\mu}_{\star}\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\)	Bias	\(0\)
\(t^2/{\mu}_{\star}^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\)	MSE / Variance	\({\sigma}_{\star}^2 / {\mu}_{\star}^2\)

\(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) for \(\alpha = 1 / (1 + \kappa)\)

Abritrary covariate covariance

\(\hat{\beta}^{\textrm{\small DY}}\) is the mDYPL estimator with covariate vectors \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\) and signal \(\beta_0\)

Dimension and signal strength

As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) = \beta_0^\top \Sigma \beta_0 \to \gamma^2 > 0\)

Aggregate asymptotics

Under the assumption \(\underset{n \to \infty}{\limsup } \, \lambda_{\max}(\Sigma) / \lambda_{\min}(\Sigma) < \infty\), Theorem 1 still holds with \[ \frac{1}{p} \sum_{j=1}^p \psi \left( {\color[]{#E16A86}{\sqrt{n} \tau_j}} \left( \hat{\beta}^{\textrm{\small DY}}_{j}-{\mu}_{\star}\beta_{0,j}\right), {\color[]{#E16A86}{\sqrt{n} \tau_j}} \beta_{0,j} \right) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right] \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)

Inference

Adjusted \(Z\)-statistics

Theorem 2 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.

Then for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \]

DY prior penalized likelihood ratio test statistics

Theorem 3 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.

Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]

Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]

Frequentist performance

Penalized likelihood ratio tests
Adjusted Wald tests

Extensions

Current work

Inclusion of intercept

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i ={ \color[]{#E16A86}{\theta}_{0}} + x_i^\top \beta_0 \]

Nonzero prior signal

\[ y_i^* = \alpha y_i + (1-\alpha) \zeta'({\color[]{#E16A86}{\theta}_P} + x_i^\top {\color[]{#E16A86}{\beta_P}}) \]

with

\[ \operatorname{cov}\left(\theta_0 + x_i^\top \beta_0, \theta_P + x_i^\top \beta_P\right) \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \]

CGMT (Thrampoulidis et al., 2018)

Let

\[ \begin{aligned} \Phi(G) &= \underset{w \in \mathcal{S}_w}{\min} \, \underset{u \in \mathcal S_u}{\max} \, w^\top G u + \psi(u, w) && (\text{PO})\\ \phi(g,h) &= \underset{w \in \mathcal{S}_w}{\min} \, \underset{u \in \mathcal S_u}{\max} \, \|u\|_2 g^\top w + \|w\|_2 h^\top u + \psi(u, w) && (\text{AO}) \end{aligned} \] where \(G \in \Re^{n \times m}\), \(g \in \Re^n\), \(h \in \Re^m\), \(S_w \subset \Re^n\), \(S_u \subset \Re^m\), and \(\psi : \Re^n \times \Re^m \rightarrow \Re\) is convex concave.

If \(\mathcal{S}_u, \mathcal{S}_w\) are convex, compact sets and \(G, g, h\) have i.i.d. \(\mathrm{N}(0,1)\) entries, then

Tail bounds of the PO in terms of the AO
Performance metrics of the PO optimiser converge to the AO-predicted limits

CGMT pipeline

Original problem

\[ \hat \theta,\, \hat \beta = \underset{\substack{\theta \in \Re \\ \beta \in \Re^p}}{\min} \, \frac{1}{n} \left\{1_n^\top \zeta(\theta 1_n + X\beta) - (y^*)^\top (\theta 1_n + X\beta) \right\} \]

Primary optimisation (PO)

\[ \underset{\substack{\theta \in \Re \\ \beta \in S_1}}{\min} \, \underset{\lambda \in S_2}{\max} \, \frac{1}{n} \left\{1_n^\top \zeta(\eta) - \eta^\top y^* + \lambda^\top \left(\eta - \theta 1_n - X\beta \right) \right\} = \underset{\beta \in S_1}{\min} \, \underset{\lambda \in S_2}{\max} \, \frac{1}{n} \lambda^\top X \beta + \psi(\lambda) \]

Auxiliary optimisation (AO)

\[ \underset{\beta \in S_1}{\min} \, \underset{\lambda \in S_2}{\max} \, \| \lambda \|_2 g^\top \beta + \|\beta\|_2 h^\top \lambda + \psi(\lambda) \]

CGMT pipeline

Limiting AO

\[ \underset{\substack{\lvert \theta \rvert \leq C \\ \sigma \in [0,C] \\ \mu_1, \mu_2 \in [-C, C]}}{\min} \, \underset{\lambda \in [0, C]}{\max} \, \mathop{\mathrm{E}}[M_{\zeta}(\theta + \mu_1 Q_1 + \mu_2 Q_2 + \sigma Z - \lambda y; \lambda)] + \mathop{\mathrm{E}}\left[y\left(\theta + \mu_1 Q_1 + \mu_2 Q_2 - \frac{\lambda y}{2}\right)\right] - \frac{\kappa \sigma^2}{2\lambda} \,, \tag{2}\]

\[ \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} \sim \mathrm{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \right), \quad Z \sim \mathrm{N}(0,1) \, \perp Q_1, Q_2, \quad M_\zeta(x; \lambda) = \min_{z} \zeta(x) - \frac{1}{2 \lambda} \| z - x \|_2^2 \]

CGMT link

\[ \begin{aligned} \hat \theta &\overset{\text{a.s.}}{\longrightarrow} \theta_{\star}\\ \hat \beta &\approx \mu_{1,\star} \beta_0 + \mu_{2,\star} \beta_P + {\sigma}_{\star}\varepsilon \end{aligned} \] where \((\mu_{1,\star}, \mu_{2,\star}, {\sigma}_{\star}, \lambda_{\star}, \theta_{\star})\) are the solution to Equation 2, and \(\varepsilon\) is some mean zero noise

Opens door to inference, subgaussian isotropic covariates

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789

Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x

Feng, O. Y., Venkataramanan, R., Rush, C., & Samworth, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4), 335–536. https://doi.org/10.1561/2200000092

Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901

Salehi, F., Abbasi, E., & Hassibi, B. (2019). The impact of regularization on high-dimensional logistic regression. Proceedings of the 33rd International Conference on Neural Information Processing Systems.

Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116

Thrampoulidis, C., Abbasi, E., & Hassibi, B. (2018). Precise error analysis of regularized \(M\)-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8), 5592–5628. https://doi.org/10.1109/TIT.2018.2840720

Zhao, Q., Sur, P., & Candes, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3), 1835–1861. https://doi.org/10.3150/21-BEJ1401

Thank you!

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290