Department of Statistics, London School of Economics
p.sterzinger@lse.ac.uk
https://www.lse.ac.uk/people/philipp-sterzinger psterzinger
Harrison Award Ceremony
University of Warwick
12 December 2025
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290
\[ \hat \theta \in \underset{\theta \in \Theta}{\arg \min} \, \ell(\theta; X, y) \]
\(\{\theta_n\}_{n \in \mathbb{N}}\) such that either
\[ \lim\limits_{n \to \infty} \ell(\theta_n; y, X) = \underset{\theta \in \Theta}{\inf} \, \ell(\theta; y, X) \]
\[ \tilde \theta \in \underset{\theta \in \Theta}{\arg \min} \, \ell(\theta; y, X) + P(\theta) \]
\[ \lim\limits_{n \to \infty} P(\theta_n) = \infty \]
for \(\{\theta_n\}_{n \in \mathbb{N}}\) such that either
Soft scaling of penalty function to make distortions of penalty asymptotically negligible
Preserve potential optimal properties of original estimator
Sterzinger P, Kosmidis I (2023). Maximum softly-penalized likelihood for mixed effects logistic regression. Statistics and Computing, 33(2), 53.
Sterzinger, P., Kosmidis, I., and Moustaki, I. (2025). Maximum softly penalized likelihood for factor analysis. ArXiV: 2510.06465
Derive limiting distribution of penalised estimator in high-dimensional asymptotic regime
Devise corrections of estimator and test statistics to recover signal recovery and inference
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290
\(y_1, \ldots, y_n \in \{0, 1\}\)
\(x_1, \ldots, x_n \in \Re^p\)
\(Y_{1}, \ldots, Y_{n}\) conditionally independent with
\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = x_i^\top \beta \]
\(\displaystyle l(\beta; y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)
\(\hat{\beta} = \arg \max l(\beta; y, X)\)
\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \approx 4.7 \\ \kappa = 0.2 \\ \end{array} \]
\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]
\(W = 2 ( \hat{l} - \hat{l}_{0} )\), \(\hat{l}_{0}\) is maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)
Classical theory predicts \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)
\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)
Classical theory predicts \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)
sharp phase transition about when the ML estimate does not exist, when \(\eta_i = \delta + x_i^\top \beta\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)
a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists
aggregate asymptotics of the logistic ridge and LASSO based on convex gaussian min max theorem, but no inference
Study maximum Diaconis-Ylvisaker prior penalised likelihood estimator for logistic regression and report solid empirical performance in high-dimensional setting
?
\[ \log p(\beta; X) = \frac{1 - {\color[]{#E16A86}{\alpha}}}{{\color[]{#E16A86}{\alpha}}} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top {\color[]{#E16A86}{\beta}_P}\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)
\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)
\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)
MDYPL is ML with pseudo-responses: \(l(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)
Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations
\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)
Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)
ML estimator of a logistic regression model is biased away from \(0\)
Empirical evidence that this continues to hold in high dimensions
\[\LARGE\Downarrow\]
\[\beta_{P} = 0\]
The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)
\(x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1 / n)\)
As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma^2 > 0\)
Develop a generalized AMP recursion (see Feng et al., 2022 for an overview of AMP), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)
\[ \begin{aligned} \hat{\beta}^t &= \hat \beta^{t - 1} + \kappa^{-1} \Psi^{t-1}(y^*, S^{t-1}) \\ S^t &= X \hat \beta^t - \Psi^{t-1}(y^*, S^{t-1}) ,, \end{aligned} \] for \[ \Psi^t(y_i^*, S^t_i) = b_t (y^*_i - \zeta'(\textrm{prox}_{b_t \zeta}\left(b_t y_i^* + S_i\right))) \] and where \(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator
Fixed point satisfies \[ X^\top (y^* - X\beta^*) = 0 \]
Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)
\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\color[]{#E16A86}{\alpha}}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''\left(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right)\right)} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\color[]{#E16A86}{\alpha}}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\color[]{#E16A86}{\alpha}}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]
\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)
\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)
Solvers for nonlinear systems, after cubature approximation of expectations
Theorem 1 (Aggregate asymptotic behaviour of mDYPL estimator)
Assume that \((\alpha,\kappa,\gamma)\) are such that (1) admits a solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) such that the Jacobian of the RHS of (1) is nonsingular.
Then, for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 21 \[ \frac{1}{p} \sum_{j=1}^{n} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)
| \(\psi(t,u)\) | statistic | a.s. limit | |
|---|---|---|---|
| \(t\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\) | Bias | \(0\) |
| \((t-(1-{\mu}_{\star})u)^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) | MSE | \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\) |
| \(t^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) | Variance | \({\sigma}_{\star}^2\) |
| \(\psi(t,u)\) | statistic | a.s. limit | |
|---|---|---|---|
| \(t / {\mu}_{\star}\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) | Bias | \(0\) |
| \(t^2/{\mu}_{\star}^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) | MSE / Variance | \({\sigma}_{\star}^2 / {\mu}_{\star}^2\) |
\(\hat{\beta}^{\textrm{\small DY}}\) is the mDYPL estimator with covariate vectors \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\) and signal \(\beta_0\)
As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) = \beta_0^\top \Sigma \beta_0 \to \gamma^2 > 0\)
Under the assumption \(\underset{n \to \infty}{\limsup } \, \lambda_{\max}(\Sigma) / \lambda_{\min}(\Sigma) < \infty\), Theorem 1 still holds with \[ \frac{1}{p} \sum_{j=1}^p \psi \left( {\color[]{#E16A86}{\sqrt{n} \tau_j}} \left( \hat{\beta}^{\textrm{\small DY}}_{j}-{\mu}_{\star}\beta_{0,j}\right), {\color[]{#E16A86}{\sqrt{n} \tau_j}} \beta_{0,j} \right) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right] \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)
Theorem 2 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.
Then for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \]
Theorem 3 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.
Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]
Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i ={ \color[]{#E16A86}{\theta}_{0}} + x_i^\top \beta_0 \]
\[ y_i^* = \alpha y_i + (1-\alpha) \zeta'({\color[]{#E16A86}{\theta}_P} + x_i^\top {\color[]{#E16A86}{\beta_P}}) \]
with
\[ \operatorname{cov}\left(\theta_0 + x_i^\top \beta_0, \theta_P + x_i^\top \beta_P\right) \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \]
Let
\[ \begin{aligned} \Phi(G) &= \underset{w \in \mathcal{S}_w}{\min} \, \underset{u \in \mathcal S_u}{\max} \, w^\top G u + \psi(u, w) && (\text{PO})\\ \phi(g,h) &= \underset{w \in \mathcal{S}_w}{\min} \, \underset{u \in \mathcal S_u}{\max} \, \|u\|_2 g^\top w + \|w\|_2 h^\top u + \psi(u, w) && (\text{AO}) \end{aligned} \] where \(G \in \Re^{n \times m}\), \(g \in \Re^n\), \(h \in \Re^m\), \(S_w \subset \Re^n\), \(S_u \subset \Re^m\), and \(\psi : \Re^n \times \Re^m \rightarrow \Re\) is convex concave.
If \(\mathcal{S}_u, \mathcal{S}_w\) are convex, compact sets and \(G, g, h\) have i.i.d. \(\mathrm{N}(0,1)\) entries, then
\[ \hat \theta,\, \hat \beta = \underset{\substack{\theta \in \Re \\ \beta \in \Re^p}}{\min} \, \frac{1}{n} \left\{1_n^\top \zeta(\theta 1_n + X\beta) - (y^*)^\top (\theta 1_n + X\beta) \right\} \]
\[ \underset{\substack{\theta \in \Re \\ \beta \in S_1}}{\min} \, \underset{\lambda \in S_2}{\max} \, \frac{1}{n} \left\{1_n^\top \zeta(\eta) - \eta^\top y^* + \lambda^\top \left(\eta - \theta 1_n - X\beta \right) \right\} = \underset{\beta \in S_1}{\min} \, \underset{\lambda \in S_2}{\max} \, \frac{1}{n} \lambda^\top X \beta + \psi(\lambda) \]
\[ \underset{\beta \in S_1}{\min} \, \underset{\lambda \in S_2}{\max} \, \| \lambda \|_2 g^\top \beta + \|\beta\|_2 h^\top \lambda + \psi(\lambda) \]
\[ \underset{\substack{\lvert \theta \rvert \leq C \\ \sigma \in [0,C] \\ \mu_1, \mu_2 \in [-C, C]}}{\min} \, \underset{\lambda \in [0, C]}{\max} \, \mathop{\mathrm{E}}[M_{\zeta}(\theta + \mu_1 Q_1 + \mu_2 Q_2 + \sigma Z - \lambda y; \lambda)] + \mathop{\mathrm{E}}\left[y\left(\theta + \mu_1 Q_1 + \mu_2 Q_2 - \frac{\lambda y}{2}\right)\right] - \frac{\kappa \sigma^2}{2\lambda} \,, \tag{2}\]
\[ \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} \sim \mathrm{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \right), \quad Z \sim \mathrm{N}(0,1) \, \perp Q_1, Q_2, \quad M_\zeta(x; \lambda) = \min_{z} \zeta(x) - \frac{1}{2 \lambda} \| z - x \|_2^2 \]
\[ \begin{aligned} \hat \theta &\overset{\text{a.s.}}{\longrightarrow} \theta_{\star}\\ \hat \beta &\approx \mu_{1,\star} \beta_0 + \mu_{2,\star} \beta_P + {\sigma}_{\star}\varepsilon \end{aligned} \] where \((\mu_{1,\star}, \mu_{2,\star}, {\sigma}_{\star}, \lambda_{\star}, \theta_{\star})\) are the solution to Equation 2, and \(\varepsilon\) is some mean zero noise
Opens door to inference, subgaussian isotropic covariates
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290
Philipp Sterzinger - Harrison Award Ceremony