High-dimensional phenomena in statistics:
Logistic regression and beyond

Philipp Sterzinger

Department of Statistics, London School of Economics

p.sterzinger@lse.ac.uk
psterzinger.at psterzinger

Data Lab Hell
Zirl, Austria

16 June 2026

About me

High-dimensional statistics · proportional-limit asymptotics · penalized likelihood

Employment

2024–2026

LSE Fellow in Statistics · LSE

Education

2020–2024

PhD Statistics · Warwick

2018–2019

MSc Applicable Mathematics · LSE

2017–2018

MPhil Economics · Cambridge

2013–2017

BA Economics · St. Gallen

Outside work

mountains · competitive sports · brewing

Logistic regression

Predicting “7”

UCI’s Multiple Features

200 handwritten numeral patterns per class, from Dutch utility maps, digitized as binary images

76 Fourier coefficients of the character shapes, rotation invariant, per digit
64 Karhunen-Loève coefficients per digit

1000 digits for training + 1000 digits for testing

Aim

Explain the contribution of the feature sets in describing the digit “7”

Considerations

Font, digitization
noise, downscaling,
complicate
discrimination

If only rotation
invariant features are
used, “7” and “2” can
look similar

Logistic regression

Widely used for inference about covariate effects on binomial probabilities, and prediction

Data

\(y_1, \ldots, y_n \in \{0, 1\}\)

\(x_1, \ldots, x_n \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ \Pr \left(Y_i = 1 \mid x_i \right) = \pi(x_i^\top\beta_0) , \quad \pi(\eta) = \frac{1}{1 + \exp\{-\eta \}} \]

Maximum likelihood estimation

Likelihood

\(L(\beta; y,X) = \Pr \left(Y = y \mid X, \beta \right) = \prod_{i = 1}^n \pi(x_i^\top \beta)^{y_i} (1 - \pi(x_i^\top \beta))^{1 - y_i}\)

Log-likelihood

\(\displaystyle \ell(\beta; y, X) = \log(L(\beta; y, X)) = \sum_{i = 1}^n \left\{y_i (x_i^\top \beta) - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \underset{\beta \in \Re^p}{\arg \max} \, \ell(\beta; y, X)\)

Classification

\(\phi(x) = \mathbb{1}\{ \pi(x^\top \hat \beta) > 1/2 \}\)

Predicting “7”

Uncertainty quantification

Target

We want to make statements about the signal \(\beta_0\)

E.g.: Should the Karhunen–Loève features be included? \(\displaystyle (\beta_{\texttt{kar}} = 0?)\)

Observable

From the data, we only have access to \(\displaystyle \hat\beta = \hat\beta(Y,X) \approx \beta_0 + \textrm{noise}\)

Problem

To quantify uncertainty, we need the distribution of \(\displaystyle \hat\beta - \beta_0\)

\[ \hat \beta: X^\top(y - \pi(X \hat \beta)) = 0 \]

No closed form of \(\hat{\beta}\) or its distribution

Fixed-\(p\) asymptotics

Limiting distribution

Score expansion

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{black}{+ \epsilon_n}}\rlap{\phantom{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{#E16A86}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

The CLT is your friend

Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then

\[ \frac{1}{\sqrt n}\sum_{i=1}^n Z_i \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]

The CLT is your friend

Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then

\[ {\color{#E16A86}{\frac{1}{\sqrt n}\sum_{i=1}^n Z_i}} \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]

Fixed-\(p\) asymptotics

\[ \sqrt{n} (\hat \beta - \beta_0 ) \overset{d}{\longrightarrow} \mathrm{N}\left(0, \mathcal{I}(\beta_0)^{-1}\right), \quad \text{as } n \to \infty\] with \(\mathcal I(\beta) = \lim\limits_{n \to \infty} n^{-1}X^\top W(\beta) X\)

Artificial example

\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ p = 5 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1) \\ \beta_0 = \sqrt{\frac{5}{3}}\,(1,1,1,0,0)^\top \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]

\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]

Signal recovery

\(\hat\beta \overset{p}{\longrightarrow} \beta_0\)

Estimation error

\( \| \hat \beta - \beta_0 \|_2 / \sqrt{p} \overset{p}{\longrightarrow} 0\)

Predicted probabilities

\(\pi(x^\top \hat \beta) \overset{p}{\longrightarrow} \pi(x^\top\beta_0)\)

Classification risk

\( \Pr\{\hat y \neq y\} \to \text{oracle}\)

Likelihood ratio

\(W = 2(\hat\ell-\hat\ell_0) \overset{d}{\longrightarrow} \chi^2_2\)

\(H_0:\beta_4=\beta_5=0\)

Wald

\(Z_k=\hat\beta_k/[(X^\top \hat{W} X)^{-1}]_{kk}^{1/2} \overset{d}{\longrightarrow} \mathrm N(0,1)\)

\(H_0:\beta_k=0, k > p/2 \)

Predicting “7”

Likelihood ratio test statistic

Df	LR
64	64.36

\[ \Longrightarrow \Pr\!\left(\chi^2_{64} > 64.36\right) \approx 0.464 \]

Fail to reject the null hypothesis \(\beta_{\texttt{kar}} = 0\)

\(p/n \to \kappa \in (0,1)\)

Tall data

Historically: number of parameters is small relative to sample size

Tall narrow dataframes

\(y\)

\(X\)

\(y_1\)

\(y_2\)

⋮

\(y_n\)

\(x_{11}\) \(x_{12}\) ⋯ \(x_{1p}\)

\(x_{21}\) \(x_{22}\) ⋯ \(x_{2p}\)

⋮

\(x_{n1}\) \(x_{n2}\) ⋯ \(x_{np}\)

\(n\) observations, \(p\) covariates

Approximate with the asymptotic regime: \(\displaystyle p/n \to 0 \)

Modern data

Number of parameters is no longer negligible relative to the sample size

Wider, shallower dataframes

\(y\)

\(X\)

\(y_1\)

\(y_2\)

⋮

\(y_n\)

\(x_{11}\) \(x_{12}\) ⋯ \(x_{1p}\)

\(x_{21}\) \(x_{22}\) ⋯ \(x_{2p}\)

⋮

\(x_{n1}\) \(x_{n2}\) ⋯ \(x_{np}\)

\(n\) observations, \(p\) covariates

Approximate with the asymptotic regime: \(\displaystyle p/n \to \kappa \)

Artificial example

\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ \kappa = p/n = 0.2 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \beta_0 =\sqrt{\frac{10}{\kappa}}\, (b_1,\ldots,b_{p/2},0,\ldots,0)^\top, \quad b_j = 1 \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]

\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]

Performance

Estimation
Prediction
Testing

Signal recovery

Estimation error

Predicted probabilities

Classification error

Likelihood ratio

\(W\) vs \(\chi^2_{10}\)

\(H_0:\beta_{p/2+1}=\cdots=\beta_{p/2+10}=0\)

Wald

\(Z_k\) vs \(\mathrm N(0,1)\)

\(H_0: \beta_k = 0, k > p/2\)

Recent developments

Candès & Sur (2020)

sharp phase transition about when the ML estimate does not exist, when \(\eta_i = \theta_0 + x_i^\top \beta_0\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(\eta_i) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists

Rigon & Aliverti (2023)

study maximum Diaconis-Ylvisaker prior penalized likelihood estimator, which always exists, and report solid empirical performance in high-dimensional setting

Phase transition

Phase transition, \(\hat\beta\) (Candès & Sur, 2020)

Phase transition, \(\hat\beta / \mu_{\star}\) (Sur & Candès, 2019)

Crossing the phase transition

Diaconis-Ylvisaker prior

Prior

\(\displaystyle \log p(\beta; X) = \frac{1 - {\color[]{#E16A86}{\alpha}}}{{\color[]{#E16A86}{\alpha}}} \sum_{i=1}^n \left\{ \pi\left(x_i^\top {\color[]{#E16A86}{\beta}_P}\right) x_i^\top \beta - \log(1 + \exp \{x_i^\top \beta \}) \right\} + C\)

Penalized log-likelihood

\(\displaystyle \ell^*(\beta; y, X) = \ell(\beta; y, X) + \log p(\beta; X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\} + C' \,,\)

\(y_i^* = \alpha y_i + (1 - \alpha) \pi\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max \ell^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses: \(\ell(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations

Shrinkage

\(\beta_P = 0\) (Cordeiro & McCullagh, 1991; Sur & Candès, 2019; Zhao et al., 2022)
\(\alpha = 1 / (1 + \kappa)\) (Rigon & Aliverti, 2023)

Shrinkage

α = 0.50

β_P = (0.00, 0.00, 0.00)

Asymptotic results

Setup

Covariate distribution, dimension, and signal strength

\(\displaystyle \begin{array}{@{}l@{}} x_{ij} \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \left. \begin{array}{@{}l@{}} p/n \to \kappa \in (0,1) \\ \operatorname{var}(x_i^\top \beta_0) \to \gamma^2 > 0 \end{array} \quad \right\} \quad \text{as } n \to \infty \end{array}\)

Signal \(\beta_0\)

The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)

Idea

Develop a generalized AMP Feng et al. (2022), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)

AMP recursion

\[ \begin{aligned} \hat{\beta}^t &= \hat \beta^{t - 1} + \kappa^{-1} X^\top \Psi^{t-1}(y^*, S^{t-1}) \\ S^t &= X \hat \beta^t - \Psi^{t-1}(y^*, S^{t-1}) \,, \end{aligned} \] for \[ \Psi^t(y_i^*, S^t_i) = b_t (y^*_i - \zeta'(\textrm{prox}_{b_t \zeta}\left(b_t y_i^* + S_i^t\right))) \,, \] where \(\zeta(x) = \log(1 + \exp\{x\})\) and \(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator

Fixed point \(\hat \beta^\infty\) satisfies \(\displaystyle X^\top (y^* - \pi(X\hat \beta^\infty)) = 0 \Rightarrow \hat \beta^{\infty} = \hat{\beta}^{\textrm{\small DY}}\)

State evolution of AMP recursion

Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \(({\color[]{#E16A86}{\mu}, b, \sigma})\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+ \alpha}{2} - \zeta' \left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + {\color[]{#E16A86}{b}} \zeta''\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2} {\color[]{#E16A86}{b}}\right)\right)} \right] \\ 0 & = {\color[]{#E16A86}{\sigma}}^2 - \frac{{\color[]{#E16A86}{b}}^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+ \alpha}{2} - \zeta'\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left( Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\), \(Z_{*} = {\color[]{#E16A86}{\mu }}Z + \kappa^{1/2} {\color[]{#E16A86}{\sigma}} G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

MDYPL asymptotics

Informally: \(\displaystyle \hat{\beta}^{\textrm{\small DY}}\approx {\color[]{#E16A86}{{\mu}_{\star}}} \beta_0 + {\color[]{#E16A86}{{\sigma}_{\star}}} { \varepsilon}, \quad \text{as } n \to \infty\)

Theorem 2 (Aggregate asymptotic behaviour of MDYPL estimator)

Assume that \((\alpha,\kappa,\gamma)\) are such that (1) admits a solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) such that the Jacobian of the RHS of (1) is nonsingular.

Then, for any pseudo-Lipschitz function \(\psi: \Re^2 \to \Re\) of order 21, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

1 For some \(L > 0\), \(\lvert \psi(u) - \psi(v) \rvert \leq L (1 + \|u\|_2 + \|v\|_2)\) \(\|u - v\|_2\).

Aggregate asymptotics

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t - (1 - \mu_\star) u\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)\)	Bias	\(({\mu}_{\star}- 1) \mathop{\mathrm{E}}[\bar \beta]\)
\(t^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\)	Variance	\({\sigma}_{\star}^2\)
\((t-(1-{\mu}_{\star})u)^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\)	MSE	\({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t / {\mu}_{\star}\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\)	Bias	\(0\)
\(t^2/{\mu}_{\star}^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\)	MSE / Var	\({\sigma}_{\star}^2 / {\mu}_{\star}^2\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) for \(\alpha = 1 / (1 + \kappa)\)

Testing

Theorem 3 (Adjusted \(Z\)-statistics)

Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.
\[ Z^*_j = \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathrm{N}(0,1) \]

Theorem 4 (DY prior penalized likelihood ratio test statistics)

Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.

Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]

Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]

MDYPL performance

Estimation
Prediction
Testing

Signal Recovery

\(\hat\beta^{*}=\hat\beta^{\mathrm{DY}}/\mu_\star\)

Estimation error

\( \| \hat \beta^* - \beta_0 \|_2 / \sqrt{p} \)

Predicted probabilities

\(\pi(x^\top \hat \beta^*)\)

Classification error

\(\Pr(\hat y \neq y) \longrightarrow \textrm{HD limit} \)

Adjusted PLR

\(W^* = 2b_\star(\hat \ell - \hat \ell_0) /(\kappa\sigma_\star^2) \overset{d}{\longrightarrow} \chi^2_{10}\)

\(H_0:\beta_{p/2 + 1}= \ldots = \beta_{p/2 + 10}=0\)

Adjusted Wald

\(Z^*_k = \hat\beta^{\mathrm{DY}}_k/\sigma_\star \overset{d}{\longrightarrow} \mathrm N(0,1)\)

\(H_0:\beta_{k}=0, k > p / 2 \)

Extensions

Covariate structure

Arbitrary covariance: \(x_{i} \sim \mathrm{N}(0, \Sigma)\), subgaussian covariate distributions

Inclusion of intercept

\(\displaystyle Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i ={ \color[]{#E16A86}{\theta}_{0}} + x_i^\top \beta_0\)

Nonzero prior signal

\(\displaystyle y_i^* = \alpha y_i + (1-\alpha) \zeta'({\color[]{#E16A86}{\theta}_P} + x_i^\top {\color[]{#E16A86}{\beta_P}})\) with \[ \begin{bmatrix} \theta_0 \\ \theta_P \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \theta_0^* \\ \theta_P^* \end{bmatrix}, \qquad \begin{bmatrix} \frac{\| \beta_0 \|_2^2}{p} & \frac{\langle \beta_0, \beta_P \rangle}{p} \\ \frac{\langle \beta_0, \beta_P \rangle}{p} & \frac{\| \beta_P \|_2^2}{p} \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \]

Outlook

Feedforward neural network

Setup

Data

\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \qquad \pi_i = \pi(\eta_0(x_i)), \qquad x_i \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \]

Signal network

\[ \eta_0(x) = c_0 + \sum_{k=1}^K \gamma_{0,k} (u_{0,k}^\top x - t_{0,k})_+ \]

\[ p/n \to \kappa, \qquad K \text{ fixed} \]

Setup

Identifiable parametrisation

ReLU networks have non-unique parameters

Transform to a common representation

\[ \eta_{\theta}(x) = c + \sum_{k=1}^K \gamma_k (u_k^\top x - t_k)_+, \qquad \|u_k\|_2 = 1 \]

Hidden units are matched before comparison

Estimator

\[ \hat\theta \in \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \left\{ \log(1 + \exp\{ \eta_\theta(x_i) \}) - y_i\eta_\theta(x_i) \right\} \]

High-dimensional limit

Logistic regression

\(\displaystyle \hat\beta \approx \mu_\star \beta_0+\sigma_\star \varepsilon, \quad \mu_\star = \operatorname{a.s.}\!\lim \frac{\langle \beta_0,\hat\beta\rangle}{\|\beta_0\|_2^2}, \quad \sigma_\star^2 = \operatorname{a.s.}\!\lim \frac{\|\hat\beta-\mu_\star\beta_0\|_2^2}{p}\)

FNN analogue

Coefficients: \(\displaystyle V_0=[u_{0,1},\ldots,u_{0,K}], \, \hat V=[\hat u_1,\ldots,\hat u_K]\)

\(\displaystyle \hat V \approx V_0A_\star+\text{noise}, \, A_\star \approx (V_0^\top V_0)^{-1}V_0^\top\hat V, \, \Omega_\star \approx (\hat V-V_0A_\star)^\top(\hat V-V_0A_\star)\)

Function-level correction

For a new \(x\),

\[ \hat q(x) \approx A_\star^\top q_0(x)+\Omega_\star^{1/2}G, \qquad q_0(x)=V_0^\top x, \quad \hat q(x)=\hat V^\top x \,. \]

The network uses ridge functions

\[ \gamma_k(q_k(x)-t_k)_+ \,. \]

Correct the fitted ridge coordinates:

\[ \tilde q(x)=A_\star^{-\top}\hat q(x), \quad \tilde\gamma_k=\hat\gamma_k/\rho_{\star,k}, \quad \tilde t_k=\hat t_k-d_{\star,k}, \quad \tilde c=\hat c-d_{\star,c} \,. \]

Thus

\[ \tilde\eta(x) = \tilde c+ \sum_{k=1}^K \tilde\gamma_k (\tilde q_k(x)-\tilde t_k)_+ \,. \]

Artificial example

\[ \begin{array}{l} p \in \{100,200,400\}, \quad \kappa = p/n = 0.2 \\ x_i \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \\ K=3 \\ v_1,\ldots,v_3 \in \Re^p \text{ block-dense orthonormal} \\ u_{0,k}=v_k \\ \gamma_0\approx 0.96 (0.75,1,1.25)^\top, t_0=0 \quad c_0=-1.11 \end{array} \qquad \Longrightarrow \qquad \begin{array}{l} \mathop{\mathrm{var}}(\eta_0(x)) \approx 1 \\ \mathop{\mathrm{E}}\{\pi(\eta_0(x))\}\approx 0.5 \end{array} \]

\[ \eta_0(x) = c_0 + \sum_{k=1}^3 \gamma_{0,k}(u_{0,k}^\top x)_+ \]

\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log\frac{\pi_i}{1-\pi_i}=\eta_0(x_i) \]

Performance

Function-level estimation
Prediction

Function-level coefficients

Hidden layer

\(\gamma_1 u_1\)

Hidden layer

\(\gamma_2 u_2\)

Hidden layer

\(\gamma_3 u_3\)

Estimation error

Predicted probabilities

\(\hat{\pi}\)

Classification error

\(\Pr(\hat{y} \neq y) \)

Unadjusted

Function plug-in

Channel posterior

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789

Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x

Feng, O. Y., Venkataramanan, R., Rush, C., & Samworth, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4), 335–536. https://doi.org/10.1561/2200000092

Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901

Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116

Zhao, Q., Sur, P., & Candes, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3), 1835–1861. https://doi.org/10.3150/21-BEJ1401

Thank you!

Sterzinger P, Kosmidis I (2026). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. Biometrika. 113(2), asag014 DOI: 10.1093/biomet/asag014

High-dimensional phenomena in statistics: Logistic regression and beyond

About me

Logistic regression

Predicting “7”

UCI’s Multiple Features

Aim

Considerations

Logistic regression

Data

Model

Maximum likelihood estimation

Likelihood

Log-likelihood

Maximum likelihood (ML) estimator

Classification

Predicting “7”

Uncertainty quantification

Target

Observable

Problem

Fixed-\(p\) asymptotics

Limiting distribution

Score expansion

The CLT is your friend

The CLT is your friend

Fixed-\(p\) asymptotics

Artificial example

Performance

Predicting “7”

Likelihood ratio test statistic

\(p/n \to \kappa \in (0,1)\)

Tall data

Modern data

Artificial example

Performance

Recent developments

Candès & Sur (2020)

Sur & Candès (2019), Zhao et al. (2022)

Rigon & Aliverti (2023)

Phase transition

Diaconis-Ylvisaker prior

Diaconis-Ylvisaker prior

Prior

Penalized log-likelihood

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

MDYPL implementation using ML procedures

Existence and uniqueness

Shrinkage

Shrinkage

Asymptotic results

Setup

Covariate distribution, dimension, and signal strength

Signal \(\beta_0\)

Idea

AMP recursion

State evolution of AMP recursion

MDYPL asymptotics

Aggregate asymptotics

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

Testing

MDYPL performance

Extensions

Covariate structure

Inclusion of intercept

Nonzero prior signal

Outlook

Feedforward neural network

Setup

Data

Signal network

Setup

Identifiable parametrisation

Estimator

High-dimensional limit

Logistic regression

FNN analogue

Function-level correction

High-dimensional phenomena in statistics:
Logistic regression and beyond