High-dimensional phenomena in statistics:
Logistic regression and beyond

Philipp Sterzinger

Department of Statistics, London School of Economics

p.sterzinger@lse.ac.uk
psterzinger.at   psterzinger


Data Lab Hell
Zirl, Austria

16 June 2026

About me

High-dimensional statistics · proportional-limit asymptotics · penalized likelihood

2024–2026

LSE Fellow in Statistics · LSE

2020–2024

PhD Statistics · Warwick

2018–2019

MSc Applicable Mathematics · LSE

2017–2018

MPhil Economics · Cambridge

2013–2017

BA Economics · St. Gallen

mountains · competitive sports · brewing

Logistic regression

Predicting “7”

UCI’s Multiple Features

200 handwritten numeral patterns per class, from Dutch utility maps, digitized as binary images

  • 76 Fourier coefficients of the character shapes, rotation invariant, per digit
  • 64 Karhunen-Loève coefficients per digit

1000 digits for training + 1000 digits for testing

Aim

Explain the contribution of the feature sets in describing the digit “7”

Considerations

Font, digitization
noise, downscaling,
complicate
discrimination

If only rotation
invariant features are
used, “7” and “2” can
look similar

Logistic regression

Widely used for inference about covariate effects on binomial probabilities, and prediction

Data

\(y_1, \ldots, y_n \in \{0, 1\}\)

\(x_1, \ldots, x_n \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ \Pr \left(Y_i = 1 \mid x_i \right) = \pi(x_i^\top\beta_0) , \quad \pi(\eta) = \frac{1}{1 + \exp\{-\eta \}} \]

Maximum likelihood estimation

Likelihood

\(L(\beta; y,X) = \Pr \left(Y = y \mid X, \beta \right) = \prod_{i = 1}^n \pi(x_i^\top \beta)^{y_i} (1 - \pi(x_i^\top \beta))^{1 - y_i}\)

Log-likelihood

\(\displaystyle \ell(\beta; y, X) = \log(L(\beta; y, X)) = \sum_{i = 1}^n \left\{y_i (x_i^\top \beta) - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \underset{\beta \in \Re^p}{\arg \max} \, \ell(\beta; y, X)\)

Classification

\(\phi(x) = \mathbb{1}\{ \pi(x^\top \hat \beta) > 1/2 \}\)

Predicting “7”

Uncertainty quantification

Target

We want to make statements about the signal \(\beta_0\)

  • E.g.: Should the Karhunen–Loève features be included? \(\displaystyle (\beta_{\texttt{kar}} = 0?)\)

Observable

From the data, we only have access to \(\displaystyle \hat\beta = \hat\beta(Y,X) \approx \beta_0 + \textrm{noise}\)

Problem

To quantify uncertainty, we need the distribution of \(\displaystyle \hat\beta - \beta_0\)

\[ \hat \beta: X^\top(y - \pi(X \hat \beta)) = 0 \]

No closed form of \(\hat{\beta}\) or its distribution

Fixed-\(p\) asymptotics

Limiting distribution

Score expansion

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{black}{+ \epsilon_n}}\rlap{\phantom{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{black}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

\[ 0 = \nabla \ell(\hat \beta) = \nabla \ell(\beta_0) + [\nabla \nabla^\top \ell(\beta_0)] (\hat \beta - \beta_0 ) + \epsilon_n \]

\[ \sqrt{n}(\hat \beta - \beta_0 ) = {\color{black}{\left(\frac{1}{n} X^\top W(\beta_0) X\right)^{-1}}} {\color{#E16A86}{\left(\frac{1}{\sqrt{n}} X^\top \left[y - \pi(X\beta_0)\right]\right)}} {\color{#777777}{+ \epsilon_n}}\rlap{\hspace{-1.1em}\raise{-0.20em}{\color{#E16A86}{\Large \times}}} \,, \] with \(\operatorname{diag}(W(\beta)) = \pi(X\beta)(1 - \pi (X \beta))\)

The CLT is your friend

Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then

\[ \frac{1}{\sqrt n}\sum_{i=1}^n Z_i \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]

The CLT is your friend

Theorem 1 (Central limit theorem) If \(Z_1,\ldots,Z_n\) are \(n\) independent copies of a random variable with mean zero, and variance \(\sigma^2\), then

\[ {\color{#E16A86}{\frac{1}{\sqrt n}\sum_{i=1}^n Z_i}} \overset{d}{\longrightarrow} \mathrm{N}(0,\sigma^2), \quad \text{as } n \to \infty \,. \]

Fixed-\(p\) asymptotics

\[ \sqrt{n} (\hat \beta - \beta_0 ) \overset{d}{\longrightarrow} \mathrm{N}\left(0, \mathcal{I}(\beta_0)^{-1}\right), \quad \text{as } n \to \infty\] with \(\mathcal I(\beta) = \lim\limits_{n \to \infty} n^{-1}X^\top W(\beta) X\)

Artificial example

\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ p = 5 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1) \\ \beta_0 = \sqrt{\frac{5}{3}}\,(1,1,1,0,0)^\top \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]



\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]

Performance

Signal recovery
\(\hat\beta \overset{p}{\longrightarrow} \beta_0\)
Estimation error
\( \| \hat \beta - \beta_0 \|_2 / \sqrt{p} \overset{p}{\longrightarrow} 0\)
Predicted probabilities
\(\pi(x^\top \hat \beta) \overset{p}{\longrightarrow} \pi(x^\top\beta_0)\)
Classification risk
\( \Pr\{\hat y \neq y\} \to \text{oracle}\)
Likelihood ratio
\(W = 2(\hat\ell-\hat\ell_0) \overset{d}{\longrightarrow} \chi^2_2\)
\(H_0:\beta_4=\beta_5=0\)
Wald
\(Z_k=\hat\beta_k/[(X^\top \hat{W} X)^{-1}]_{kk}^{1/2} \overset{d}{\longrightarrow} \mathrm N(0,1)\)
\(H_0:\beta_k=0, k > p/2 \)

Predicting “7”

Likelihood ratio test statistic

Df LR
64 64.36

\[ \Longrightarrow \Pr\!\left(\chi^2_{64} > 64.36\right) \approx 0.464 \]

Fail to reject the null hypothesis \(\beta_{\texttt{kar}} = 0\)

\(p/n \to \kappa \in (0,1)\)

Tall data

Historically: number of parameters is small relative to sample size

Tall narrow dataframes

\(y\)
\(X\)
\(y_1\)
\(y_2\)
\(y_n\)
\(x_{11}\) \(x_{12}\) \(x_{1p}\)
\(x_{21}\) \(x_{22}\) \(x_{2p}\)
\(x_{n1}\) \(x_{n2}\) \(x_{np}\)
\(n\) observations, \(p\) covariates

Approximate with the asymptotic regime: \(\displaystyle p/n \to 0 \)

Modern data

Number of parameters is no longer negligible relative to the sample size

Wider, shallower dataframes

\(y\)
\(X\)
\(y_1\)
\(y_2\)
\(y_n\)
\(x_{11}\) \(x_{12}\) \(x_{1p}\)
\(x_{21}\) \(x_{22}\) \(x_{2p}\)
\(x_{n1}\) \(x_{n2}\) \(x_{np}\)
\(n\) observations, \(p\) covariates

Approximate with the asymptotic regime: \(\displaystyle p/n \to \kappa \)

Artificial example

\[ \begin{array}{l} n \in \{100, 200, \ldots, 6400 \} \\ \kappa = p/n = 0.2 \\ x_{ij} \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \beta_0 =\sqrt{\frac{10}{\kappa}}\, (b_1,\ldots,b_{p/2},0,\ldots,0)^\top, \quad b_j = 1 \end{array} \qquad \Longrightarrow \qquad \gamma^2 = \mathop{\mathrm{var}}(x_i^\top\beta_0) = 5 \]



\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}\left(\pi_i\right), \quad \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta_0 \]

Performance

Signal recovery
Estimation error
Predicted probabilities
Classification error
Likelihood ratio
\(W\) vs \(\chi^2_{10}\)
\(H_0:\beta_{p/2+1}=\cdots=\beta_{p/2+10}=0\)
Wald
\(Z_k\) vs \(\mathrm N(0,1)\)
\(H_0: \beta_k = 0, k > p/2\)

Recent developments

Candès & Sur (2020)

sharp phase transition about when the ML estimate does not exist, when \(\eta_i = \theta_0 + x_i^\top \beta_0\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(\eta_i) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists

Rigon & Aliverti (2023)

study maximum Diaconis-Ylvisaker prior penalized likelihood estimator, which always exists, and report solid empirical performance in high-dimensional setting

Phase transition

Phase transition, \(\hat\beta\) (Candès & Sur, 2020)

Phase transition, \(\hat\beta / \mu_{\star}\) (Sur & Candès, 2019)

Crossing the phase transition

?

Diaconis-Ylvisaker prior

Diaconis-Ylvisaker prior

Prior

\(\displaystyle \log p(\beta; X) = \frac{1 - {\color[]{#E16A86}{\alpha}}}{{\color[]{#E16A86}{\alpha}}} \sum_{i=1}^n \left\{ \pi\left(x_i^\top {\color[]{#E16A86}{\beta}_P}\right) x_i^\top \beta - \log(1 + \exp \{x_i^\top \beta \}) \right\} + C\)

Penalized log-likelihood

\(\displaystyle \ell^*(\beta; y, X) = \ell(\beta; y, X) + \log p(\beta; X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \log\left(1 + \exp\{x_i^\top \beta\}\right) \right\} + C' \,,\)

\(y_i^* = \alpha y_i + (1 - \alpha) \pi\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max \ell^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses: \(\ell(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations

Shrinkage

Shrinkage

α = 0.50
βP = (0.00, 0.00, 0.00)

Asymptotic results

Setup

Covariate distribution, dimension, and signal strength

\(\displaystyle \begin{array}{@{}l@{}} x_{ij} \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \left. \begin{array}{@{}l@{}} p/n \to \kappa \in (0,1) \\ \operatorname{var}(x_i^\top \beta_0) \to \gamma^2 > 0 \end{array} \quad \right\} \quad \text{as } n \to \infty \end{array}\)

Signal \(\beta_0\)

The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)

Idea

Develop a generalized AMP Feng et al. (2022), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)

AMP recursion

\[ \begin{aligned} \hat{\beta}^t &= \hat \beta^{t - 1} + \kappa^{-1} X^\top \Psi^{t-1}(y^*, S^{t-1}) \\ S^t &= X \hat \beta^t - \Psi^{t-1}(y^*, S^{t-1}) \,, \end{aligned} \] for \[ \Psi^t(y_i^*, S^t_i) = b_t (y^*_i - \zeta'(\textrm{prox}_{b_t \zeta}\left(b_t y_i^* + S_i^t\right))) \,, \] where \(\zeta(x) = \log(1 + \exp\{x\})\) and \(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator


Fixed point \(\hat \beta^\infty\) satisfies \(\displaystyle X^\top (y^* - \pi(X\hat \beta^\infty)) = 0 \Rightarrow \hat \beta^{\infty} = \hat{\beta}^{\textrm{\small DY}}\)

State evolution of AMP recursion

Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \(({\color[]{#E16A86}{\mu}, b, \sigma})\)


\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+ \alpha}{2} - \zeta' \left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + {\color[]{#E16A86}{b}} \zeta''\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left(Z_{*} + \frac{1+ \alpha}{2} {\color[]{#E16A86}{b}}\right)\right)} \right] \\ 0 & = {\color[]{#E16A86}{\sigma}}^2 - \frac{{\color[]{#E16A86}{b}}^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+ \alpha}{2} - \zeta'\left(\textrm{prox}_{{\color[]{#E16A86}{b}} \zeta}\left( Z_{*} + \frac{1+ \alpha}{2}{\color[]{#E16A86}{b}}\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]


\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\), \(Z_{*} = {\color[]{#E16A86}{\mu }}Z + \kappa^{1/2} {\color[]{#E16A86}{\sigma}} G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

MDYPL asymptotics

Informally: \(\displaystyle \hat{\beta}^{\textrm{\small DY}}\approx {\color[]{#E16A86}{{\mu}_{\star}}} \beta_0 + {\color[]{#E16A86}{{\sigma}_{\star}}} { \varepsilon}, \quad \text{as } n \to \infty\)

Theorem 2 (Aggregate asymptotic behaviour of MDYPL estimator)

Assume that \((\alpha,\kappa,\gamma)\) are such that (1) admits a solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) such that the Jacobian of the RHS of (1) is nonsingular.


Then, for any pseudo-Lipschitz function \(\psi: \Re^2 \to \Re\) of order 21, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

1 For some \(L > 0\), \(\lvert \psi(u) - \psi(v) \rvert \leq L (1 + \|u\|_2 + \|v\|_2)\) \(\|u - v\|_2\).

Aggregate asymptotics

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\) statistic a.s. limit
\(t - (1 - \mu_\star) u\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)\) Bias \(({\mu}_{\star}- 1) \mathop{\mathrm{E}}[\bar \beta]\)
\(t^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) Variance \({\sigma}_{\star}^2\)
\((t-(1-{\mu}_{\star})u)^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) MSE \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\psi(t,u)\) statistic a.s. limit
\(t / {\mu}_{\star}\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) Bias \(0\)
\(t^2/{\mu}_{\star}^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) MSE / Var \({\sigma}_{\star}^2 / {\mu}_{\star}^2\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) for \(\alpha = 1 / (1 + \kappa)\)

Testing

Theorem 3 (Adjusted \(Z\)-statistics)

Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.
\[ Z^*_j = \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathrm{N}(0,1) \]

Theorem 4 (DY prior penalized likelihood ratio test statistics)

Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 2 are met.

Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]

Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]

MDYPL performance

Signal Recovery
\(\hat\beta^{*}=\hat\beta^{\mathrm{DY}}/\mu_\star\)
Estimation error
\( \| \hat \beta^* - \beta_0 \|_2 / \sqrt{p} \)
Predicted probabilities
\(\pi(x^\top \hat \beta^*)\)
Classification error
\(\Pr(\hat y \neq y) \longrightarrow \textrm{HD limit} \)
Adjusted PLR
\(W^* = 2b_\star(\hat \ell - \hat \ell_0) /(\kappa\sigma_\star^2) \overset{d}{\longrightarrow} \chi^2_{10}\)
\(H_0:\beta_{p/2 + 1}= \ldots = \beta_{p/2 + 10}=0\)
Adjusted Wald
\(Z^*_k = \hat\beta^{\mathrm{DY}}_k/\sigma_\star \overset{d}{\longrightarrow} \mathrm N(0,1)\)
\(H_0:\beta_{k}=0, k > p / 2 \)

Extensions

Covariate structure

Arbitrary covariance: \(x_{i} \sim \mathrm{N}(0, \Sigma)\), subgaussian covariate distributions

Inclusion of intercept

\(\displaystyle Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i ={ \color[]{#E16A86}{\theta}_{0}} + x_i^\top \beta_0\)

Nonzero prior signal

\(\displaystyle y_i^* = \alpha y_i + (1-\alpha) \zeta'({\color[]{#E16A86}{\theta}_P} + x_i^\top {\color[]{#E16A86}{\beta_P}})\) with \[ \begin{bmatrix} \theta_0 \\ \theta_P \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \theta_0^* \\ \theta_P^* \end{bmatrix}, \qquad \begin{bmatrix} \frac{\| \beta_0 \|_2^2}{p} & \frac{\langle \beta_0, \beta_P \rangle}{p} \\ \frac{\langle \beta_0, \beta_P \rangle}{p} & \frac{\| \beta_P \|_2^2}{p} \end{bmatrix} \overset{\text{a.s.}}{\longrightarrow} \begin{bmatrix} \gamma^2 & \varphi \\ \varphi & \delta^2 \end{bmatrix} \]

Outlook

Feedforward neural network

Setup

Data

\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i), \qquad \pi_i = \pi(\eta_0(x_i)), \qquad x_i \stackrel{\mathrm{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \]

Signal network

\[ \eta_0(x) = c_0 + \sum_{k=1}^K \gamma_{0,k} (u_{0,k}^\top x - t_{0,k})_+ \]

\[ p/n \to \kappa, \qquad K \text{ fixed} \]

Setup

Identifiable parametrisation

ReLU networks have non-unique parameters

Transform to a common representation

\[ \eta_{\theta}(x) = c + \sum_{k=1}^K \gamma_k (u_k^\top x - t_k)_+, \qquad \|u_k\|_2 = 1 \]

Hidden units are matched before comparison

Estimator

\[ \hat\theta \in \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \left\{ \log(1 + \exp\{ \eta_\theta(x_i) \}) - y_i\eta_\theta(x_i) \right\} \]

High-dimensional limit

Logistic regression

\(\displaystyle \hat\beta \approx \mu_\star \beta_0+\sigma_\star \varepsilon, \quad \mu_\star = \operatorname{a.s.}\!\lim \frac{\langle \beta_0,\hat\beta\rangle}{\|\beta_0\|_2^2}, \quad \sigma_\star^2 = \operatorname{a.s.}\!\lim \frac{\|\hat\beta-\mu_\star\beta_0\|_2^2}{p}\)

FNN analogue

Coefficients: \(\displaystyle V_0=[u_{0,1},\ldots,u_{0,K}], \, \hat V=[\hat u_1,\ldots,\hat u_K]\)

\(\displaystyle \hat V \approx V_0A_\star+\text{noise}, \, A_\star \approx (V_0^\top V_0)^{-1}V_0^\top\hat V, \, \Omega_\star \approx (\hat V-V_0A_\star)^\top(\hat V-V_0A_\star)\)

Function-level correction

For a new \(x\),

\[ \hat q(x) \approx A_\star^\top q_0(x)+\Omega_\star^{1/2}G, \qquad q_0(x)=V_0^\top x, \quad \hat q(x)=\hat V^\top x \,. \]

The network uses ridge functions

\[ \gamma_k(q_k(x)-t_k)_+ \,. \]

Correct the fitted ridge coordinates:

\[ \tilde q(x)=A_\star^{-\top}\hat q(x), \quad \tilde\gamma_k=\hat\gamma_k/\rho_{\star,k}, \quad \tilde t_k=\hat t_k-d_{\star,k}, \quad \tilde c=\hat c-d_{\star,c} \,. \]

Thus

\[ \tilde\eta(x) = \tilde c+ \sum_{k=1}^K \tilde\gamma_k (\tilde q_k(x)-\tilde t_k)_+ \,. \]

Artificial example

\[ \begin{array}{l} p \in \{100,200,400\}, \quad \kappa = p/n = 0.2 \\ x_i \stackrel{\text{iid}}{\sim} \mathop{\mathrm{N}}(0,I_p) \\ K=3 \\ v_1,\ldots,v_3 \in \Re^p \text{ block-dense orthonormal} \\ u_{0,k}=v_k \\ \gamma_0\approx 0.96 (0.75,1,1.25)^\top, t_0=0 \quad c_0=-1.11 \end{array} \qquad \Longrightarrow \qquad \begin{array}{l} \mathop{\mathrm{var}}(\eta_0(x)) \approx 1 \\ \mathop{\mathrm{E}}\{\pi(\eta_0(x))\}\approx 0.5 \end{array} \]


\[ \eta_0(x) = c_0 + \sum_{k=1}^3 \gamma_{0,k}(u_{0,k}^\top x)_+ \]


\[ y_i \stackrel{\text{ind}}{\sim} \mathop{\mathrm{Bernoulli}}(\pi_i), \quad \log\frac{\pi_i}{1-\pi_i}=\eta_0(x_i) \]

Performance

Function-level coefficients
Hidden layer
\(\gamma_1 u_1\)
Hidden layer
\(\gamma_2 u_2\)
Hidden layer
\(\gamma_3 u_3\)
Estimation error
Predicted probabilities
\(\hat{\pi}\)
Classification error
\(\Pr(\hat{y} \neq y) \)
Unadjusted
Function plug-in
Channel posterior

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789
Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x
Feng, O. Y., Venkataramanan, R., Rush, C., & Samworth, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4), 335–536. https://doi.org/10.1561/2200000092
Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901
Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116
Zhao, Q., Sur, P., & Candes, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3), 1835–1861. https://doi.org/10.3150/21-BEJ1401

Thank you!

Sterzinger P, Kosmidis I (2026). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. Biometrika. 113(2), asag014 DOI: 10.1093/biomet/asag014