Department of Statistics, London School of Economics
p.sterzinger@lse.ac.uk
psterzinger.at psterzinger
Data Lab Hell
Zirl, Austria
16 June 2026
High-dimensional statistics · proportional-limit asymptotics · penalized likelihood
LSE Fellow in Statistics · LSE
PhD Statistics · Warwick
MSc Applicable Mathematics · LSE
MPhil Economics · Cambridge
BA Economics · St. Gallen
200 handwritten numeral patterns per class, from Dutch utility maps, digitized as binary images
1000 digits for training + 1000 digits for testing
Explain the contribution of the feature sets in describing the digit “7”
Font, digitization
noise, downscaling,
complicate
discrimination
If only rotation
invariant features are
used, “7” and “2” can
look similar
Widely used for inference about covariate effects on binomial probabilities, and prediction
\(y_1, \ldots, y_n \in \{0, 1\}\)
\(x_1, \ldots, x_n \in \Re^p\)
\(Y_{1}, \ldots, Y_{n}\) conditionally independent with
\[ \Pr \left(Y_i = 1 \mid x_i \right) = \pi(x_i^\top\beta_0) , \quad \pi(\eta) = \frac{1}{1 + \exp\{-\eta \}} \]
\(L(\beta; y,X) = \Pr \left(Y = y \mid X, \beta \right) = \prod_{i = 1}^n \pi(x_i^\top \beta)^{y_i} (1 - \pi(x_i^\top \beta))^{1 - y_i}\)
\(\displaystyle \ell(\beta; y, X) = \log(L(\beta; y, X)) = \sum_{i = 1}^n \left\{y_i (x_i^\top \beta) - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\}\)
\(\hat{\beta} = \underset{\beta \in \Re^p}{\arg \max} \, \ell(\beta; y, X)\)
\(\phi(x) = \mathbb{1}\{ \pi(x^\top \hat \beta) > 1/2 \}\)
We want to make statements about the signal \(\beta_0\)
From the data, we only have access to \(\displaystyle \hat\beta = \hat\beta(Y,X) \approx \beta_0 + \textrm{noise}\)
To quantify uncertainty, we need the distribution of \(\displaystyle \hat\beta - \beta_0\)
\[ \hat \beta: X^\top(y - \pi(X \hat \beta)) = 0 \]
No closed form of \(\hat{\beta}\) or its distribution
\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]
\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{black}{+ \epsilon_n}}\rlap{\phantom{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)
\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]
\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)
\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]
\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)
\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]
\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{#E16A86}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)
Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then
\[ \frac{1}{\sqrt n}\sum_{i=1}^n Z_i \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]
Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then
\[ {\color{#E16A86}{\frac{1}{\sqrt n}\sum_{i=1}^n Z_i}} \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]
\[ \sqrt{n} (\hat \beta - \beta_0 ) \overset{d}{\longrightarrow} \mathrm{N}\left(0, \mathcal{I}(\beta_0)^{-1}\right), \quad \text{as } n \to \infty\] with \(\mathcal I(\beta) = \lim\limits_{n \to \infty} n^{-1}X^\top W(\beta) X\)
\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ p = 5 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1) \\ \beta_0 = \sqrt{\frac{5}{3}}\,(1,1,1,0,0)^\top \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]
\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]
| Df | LR |
|---|---|
| 64 | 64.36 |
\[ \Longrightarrow \Pr\!\left(\chi^2_{64} > 64.36\right) \approx 0.464 \]
Fail to reject the null hypothesis \(\beta_{\texttt{kar}} = 0\)
Historically: number of parameters is small relative to sample size
Tall narrow dataframes
Approximate with the asymptotic regime: \(\displaystyle p/n \to 0 \)
Number of parameters is no longer negligible relative to the sample size
Wider, shallower dataframes
Approximate with the asymptotic regime: \(\displaystyle p/n \to \kappa \)
\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ \kappa = p/n = 0.2 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \beta_0 =\sqrt{\frac{10}{\kappa}}\, (b_1,\ldots,b_{p/2},0,\ldots,0)^\top, \quad b_j = 1 \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]
\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]
sharp phase transition about when the ML estimate does not exist, when \(\eta_i = \theta_0 + x_i^\top \beta_0\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(\eta_i) \to \gamma^2\)
a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists
study maximum Diaconis-Ylvisaker prior penalized likelihood estimator, which always exists, and report solid empirical performance in high-dimensional setting
Phase transition, \(\hat\beta\) (Candès & Sur, 2020)
Phase transition, \(\hat\beta / \mu_{\star}\) (Sur & Candès, 2019)
Crossing the phase transition
?
\(\displaystyle \log p(\beta; X) = \frac{1 - {\color[]{#E16A86}{\alpha}}}{{\color[]{#E16A86}{\alpha}}} \sum_{i=1}^n \left\{ \pi\left(x_i^\top {\color[]{#E16A86}{\beta}_P}\right) x_i^\top \beta - \log(1 + \exp \{x_i^\top \beta \}) \right\} + C\)
\(\displaystyle \ell^*(\beta; y, X) = \ell(\beta; y, X) + \log p(\beta; X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\} + C' \,,\)
\(y_i^* = \alpha y_i + (1 - \alpha) \pi\left(x_i^\top \beta_P\right)\)
\(\hat{\beta}^{\textrm{\small DY}}= \arg \max \ell^*(\beta; y, X)\)
MDYPL is ML with pseudo-responses: \(\ell(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)
Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations
\(\displaystyle \begin{array}{@{}l@{}} x_{ij} \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \left. \begin{array}{@{}l@{}} p/n \to \kappa \in (0,1) \\ \operatorname{var}(x_i^\top \beta_0) \to \gamma^2 > 0 \end{array} \quad \right\} \quad \text{as } n \to \infty \end{array}\)
The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)
Develop a generalized AMP Feng et al. (2022), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)
\[ \begin{aligned} \hat{\beta}^t &= \hat \beta^{t - 1} + \kappa^{-1} X^\top \Psi^{t-1}(y^*, S^{t-1}) \\ S^t &= X \hat \beta^t - \Psi^{t-1}(y^*, S^{t-1}) \,, \end{aligned} \] for \[ \Psi^t(y_i^*, S^t_i) = b_t (y^*_i - \zeta'(\textrm{prox}_{b_t \zeta}\left(b_t y_i^* + S_i^t\right))) \,, \] where \(\zeta(x) = \log(1 + \exp\{x\})\) and \(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator
Fixed point \(\hat \beta^\infty\) satisfies \(\displaystyle X^\top (y^* - \pi(X\hat \beta^\infty)) = 0 \Rightarrow \hat \beta^{\infty} = \hat{\beta}^{\textrm{\small DY}}\)
Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \(({\color[]{#E16A86}{\mu}, b, \sigma})\)
\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+ \alpha}{2} - \zeta' \left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + {\color[]{#E16A86}{b}} \zeta''\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2} {\color[]{#E16A86}{b}}\right)\right)} \right] \\ 0 & = {\color[]{#E16A86}{\sigma}}^2 - \frac{{\color[]{#E16A86}{b}}^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+ \alpha}{2} - \zeta'\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left( Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]
\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\), \(Z_{*} = {\color[]{#E16A86}{\mu }}Z + \kappa^{1/2} {\color[]{#E16A86}{\sigma}} G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)
Informally: \(\displaystyle \hat{\beta}^{\textrm{\small DY}}\approx {\color[]{#E16A86}{{\mu}_{\star}}} \beta_0 + {\color[]{#E16A86}{{\sigma}_{\star}}} { \varepsilon}, \quad \text{as } n \to \infty\)
Theorem 2 (Aggregate asymptotic behaviour of MDYPL estimator)
Assume that \((\alpha,\kappa,\gamma)\) are such that (1) admits a solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) such that the Jacobian of the RHS of (1) is nonsingular.
Then, for any pseudo-Lipschitz function \(\psi: \Re^2 \to \Re\) of order 21, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)
1 For some \(L > 0\), \(\lvert \psi(u) - \psi(v) \rvert \leq L (1 + \|u\|_2 + \|v\|_2)\) \(\|u - v\|_2\).
| \(\psi(t,u)\) | statistic | a.s. limit | |
|---|---|---|---|
| \(t - (1 - \mu_\star) u\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)\) | Bias | \(({\mu}_{\star}- 1) \mathop{\mathrm{E}}[\bar \beta]\) |
| \(t^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) | Variance | \({\sigma}_{\star}^2\) |
| \((t-(1-{\mu}_{\star})u)^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) | MSE | \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\) |
| \(\psi(t,u)\) | statistic | a.s. limit | |
|---|---|---|---|
| \(t / {\mu}_{\star}\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) | Bias | \(0\) |
| \(t^2/{\mu}_{\star}^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) | MSE / Var | \({\sigma}_{\star}^2 / {\mu}_{\star}^2\) |
\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)
\(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) for \(\alpha = 1 / (1 + \kappa)\)
Theorem 3 (Adjusted \(Z\)-statistics)
Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.
\[
Z^*_j = \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathrm{N}(0,1)
\]
Theorem 4 (DY prior penalized likelihood ratio test statistics)
Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.
Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]
Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
Arbitrary covariance: \(x_{i} \sim \mathrm{N}(0, \Sigma)\), subgaussian covariate distributions
\(\displaystyle Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i ={ \color[]{#E16A86}{\theta}_{0}} + x_i^\top \beta_0\)
\(\displaystyle y_i^* = \alpha y_i + (1-\alpha) \zeta'({\color[]{#E16A86}{\theta}_P} + x_i^\top {\color[]{#E16A86}{\beta_P}})\) with \[ \begin{bmatrix} \theta_0 \\ \theta_P \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \theta_0^* \\ \theta_P^* \end{bmatrix}, \qquad \begin{bmatrix} \frac{\| \beta_0 \|_2^2}{p} & \frac{\langle \beta_0, \beta_P \rangle}{p} \\ \frac{\langle \beta_0, \beta_P \rangle}{p} & \frac{\| \beta_P \|_2^2}{p} \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \]
\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \qquad \pi_i = \pi(\eta_0(x_i)), \qquad x_i \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \]
\[ \eta_0(x) = c_0 + \sum_{k=1}^K \gamma_{0,k} (u_{0,k}^\top x - t_{0,k})_+ \]
\[ p/n \to \kappa, \qquad K \text{ fixed} \]
ReLU networks have non-unique parameters
Transform to a common representation
\[ \eta_{\theta}(x) = c + \sum_{k=1}^K \gamma_k (u_k^\top x - t_k)_+, \qquad \|u_k\|_2 = 1 \]
Hidden units are matched before comparison
\[ \hat\theta \in \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \left\{ \log(1 + \exp\{ \eta_\theta(x_i) \}) - y_i\eta_\theta(x_i) \right\} \]
\(\displaystyle \hat\beta \approx \mu_\star \beta_0+\sigma_\star \varepsilon, \quad \mu_\star = \operatorname{a.s.}\!\lim \frac{\langle \beta_0,\hat\beta\rangle}{\|\beta_0\|_2^2}, \quad \sigma_\star^2 = \operatorname{a.s.}\!\lim \frac{\|\hat\beta-\mu_\star\beta_0\|_2^2}{p}\)
Coefficients: \(\displaystyle V_0=[u_{0,1},\ldots,u_{0,K}], \, \hat V=[\hat u_1,\ldots,\hat u_K]\)
\(\displaystyle \hat V \approx V_0A_\star+\text{noise}, \, A_\star \approx (V_0^\top V_0)^{-1}V_0^\top\hat V, \, \Omega_\star \approx (\hat V-V_0A_\star)^\top(\hat V-V_0A_\star)\)
For a new \(x\),
\[ \hat q(x) \approx A_\star^\top q_0(x)+\Omega_\star^{1/2}G, \qquad q_0(x)=V_0^\top x, \quad \hat q(x)=\hat V^\top x \,. \]
The network uses ridge functions
\[ \gamma_k(q_k(x)-t_k)_+ \,. \]
Correct the fitted ridge coordinates:
\[ \tilde q(x)=A_\star^{-\top}\hat q(x), \quad \tilde\gamma_k=\hat\gamma_k/\rho_{\star,k}, \quad \tilde t_k=\hat t_k-d_{\star,k}, \quad \tilde c=\hat c-d_{\star,c} \,. \]
Thus
\[ \tilde\eta(x) = \tilde c+ \sum_{k=1}^K \tilde\gamma_k (\tilde q_k(x)-\tilde t_k)_+ \,. \]
\[ \begin{array}{l} p \in \{100,200,400\}, \quad \kappa = p/n = 0.2 \\ x_i \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \\ K=3 \\ v_1,\ldots,v_3 \in \Re^p \text{ block-dense orthonormal} \\ u_{0,k}=v_k \\ \gamma_0\approx 0.96 (0.75,1,1.25)^\top, t_0=0 \quad c_0=-1.11 \end{array} \qquad \Longrightarrow \qquad \begin{array}{l} \mathop{\mathrm{var}}(\eta_0(x)) \approx 1 \\ \mathop{\mathrm{E}}\{\pi(\eta_0(x))\}\approx 0.5 \end{array} \]
\[ \eta_0(x) = c_0 + \sum_{k=1}^3 \gamma_{0,k}(u_{0,k}^\top x)_+ \]
\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log\frac{\pi_i}{1-\pi_i}=\eta_0(x_i) \]
Sterzinger P, Kosmidis I (2026). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. Biometrika. 113(2), asag014 DOI: 10.1093/biomet/asag014
Philipp Sterzinger – High-dimensional logistic regression