Jahr 2024 Jahr 2025 Jahr 2026
Modify filter

Filter is active: Show only talks in the following category : Oberseminar Statistics and Data Science.


08.01.2025 12:15 Hannah Laus (TU München) :
Non-Asymptotic Uncertainty Quantification in High-Dimensional Learning8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Uncertainty quantification (UQ) is a crucial but challenging task in many high-dimensional regression or learning problems to increase the confidence of a given predictor. In this talk we discuss a new data-driven approach for UQ in regression that applies both to classical regression approaches such as the LASSO as well as to neural networks. One of the most notable UQ techniques is the debiased LASSO, which modifies the LASSO to allow for the construction of asymptotic confidence intervals by decomposing the estimation error into a Gaussian and an asymptotically vanishing bias component. However, in real-world problems with finite-dimensional data, the bias term is often too significant to be neglected, resulting in overly narrow confidence intervals. In this talk we will address this issue and derive a data-driven adjustment that corrects the confidence intervals for a large class of predictors by estimating the means and variances of the bias terms from training data, exploiting high-dimensional concentration phenomena. This gives rise to non-asymptotic confidence intervals, which can help avoid overestimating uncertainty in critical applications such as MRI diagnosis. Importantly, this analysis extends beyond sparse regression to data-driven predictors like neural networks, enhancing the reliability of model-based deep learning. Our findings, discussed in this talk, bridge the gap between established theory and the practical applicability of such debiased methods. This talk is based on joint work with Frederik Hoppe, Claudio Mayrink Verdun, Felix Krahmer and Holger Rauhut.

15.01.2025 11:15 Elisabeth Maria Griesbauer (Institute of Basic Medical Sciences, Oslo, NO):
Synthetic data generation balancing privacy and utility, using vine copulasOnline: attend8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

The availability of diverse, high-quality data has led to tremendous advances in science, technology and society at large, when analysed by means of statistical and machine learning (ML) methods. However, real-world data, in many cases, cannot be made public to the research community due to privacy restrictions, obstructing progress, especially in bio-medical research. Synthetic data can substitute the sensitive real data, and as long as they do not disclose private aspects. This has proven to be successful in training downstream ML applications. We propose TVineSynth, a vine copula based synthetic tabular data generator, which is designed to balance privacy and utility, using the vine tree structure and its truncation to do the trade-off. Contrary to synthetic data generators that achieve differential privacy (DP) by globally adding noise, TVineSynth performs a controlled approximation of the estimated data generating distribution, so that it does not suffer from poor utility of the resulting synthetic data for downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is here measured with membership (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared to competitor models, with and without DP, on simulated and on real-world data, TVineSynth achieves a superior privacy-utility balance.

15.01.2025 12:15 David Huk (University of Warwick, Coventry, UK):
Quasi-Bayes meets VinesOnline: attend8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Recently developed quasi-Bayesian (QB) methods proposed a stimulating change of paradigm in Bayesian computation by directly constructing the Bayesian predictive distribution through recursion, removing the need for expensive computations involved in sampling the Bayesian posterior distribution. This has proved to be data-efficient for univariate predictions, however, existing constructions for higher dimensional densities are only possible by relying on restrictive assumptions on the model's multivariate structure. In this talk, we discuss a wholly different approach to extend Quasi-Bayesian prediction to high dimensions through the use of Sklar's theorem, by decomposing the predictive distribution into one-dimensional predictive marginals and a high-dimensional copula. We use the efficient recursive QB construction for the one-dimensional marginals and model the dependence using highly expressive vine copulas. Further, we tune hyperparameters using robust divergences (eg. energy score) and show that our proposed Quasi-Bayesian Vine (QB-Vine) is a fully non-parametric density estimator with an analytical form and convergence rate independent of the dimension of the data in some situations. Our experiments illustrate that the QB-Vine is appropriate for high dimensional distributions (64), needs very few samples to train (200), and outperforms state-of-the-art methods with analytical forms for density estimation and supervised tasks by a considerable margin.

20.01.2025 13:45 Yuhao Wang (Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, CN; Shanghai Qi Zhi Institute, CN):
Residual permutation test for regression coefficient testing8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates p can be up to a constant fraction of sample size n. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noises, whenever p < n/2. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noises with bounded (1+t)-th order moment when the true coefficient is at least of order n^{-t/(1+t)} for t \in [0, 1]. We further proved that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.

22.01.2025 12:15 Ingrid van Keilegom (KU Leuven, BE):
Semiparametric estimation of the survival function under dependent censoring8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

This paper proposes a novel estimator of the survival function under dependent random right censoring, a situation frequently encountered in survival analysis. We model the relation between the survival time T and the censoring C by using a parametric copula, whose association parameter is not supposed to be known. Moreover, the survival time distribution is left unspecified, while the censoring time distribution is modeled parametrically. We develop sufficient conditions under which our model for (T,C) is identifiable, and propose an estimation procedure for the distribution of the survival time T of interest. Our model and estimation procedure build further on the work on the copula-graphic estimator proposed by Zheng and Klein (1995) and Rivest and Wells (2001), which has the drawback of requiring the association parameter of the copula to be known, and on the recent work by Czado and Van Keilegom (2023), who suppose that both marginal distributions are parametric whereas we allow one margin to be unspecified. Our estimator is based on a pseudo-likelihood approach and maintains low computational complexity. The asymptotic normality of the proposed estimator is shown. Additionally, we discuss an extension to include a cure fraction, addressing both identifiability and estimation issues. The practical performance of our method is validated through extensive simulation studies and an application to a breast cancer data set.

29.01.2025 12:15 Siegfried Hörmann (Graz University of Technology, AT):
Measuring dependence between a scalar response and a functional covariate8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

We extend the scope of a recently introduced dependence coefficient between a scalar response Y and a multivariate covariate X to the case where X takes values in a general metric space. Particular attention is paid to the case where X is a curve. While on the population level, this extension is straight forward, the asymptotic behavior of the estimator we consider is delicate. It crucially depends on the nearest neighbor structure of the infinite-dimensional covariate sample, where deterministic bounds on the degrees of the nearest neighbor graphs available in multivariate settings do no longer exist. The main contribution of this paper is to give some insight into this matter and to advise a way how to overcome the problem for our purposes. As an important application of our results, we consider an independence test.

05.02.2025 12:15 Cecilie Recke (University of Copenhagen, DK):
Identifiability and Estimation in Continuous Lyapunov Models8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

We study causality in systems that allow for feedback loops among the variables via models of cross-sectional data from a dynamical system. Specifically, we consider the set of distributions which appears as the steady-state distributions of a stochastic differential equation (SDE) where the drift matrix is parametrized by a directed graph. The nth-order cumulant of the steady state distribution satisfies the corresponding nth-order continuous Lyapunov equation. Under the assumption that the driving Lévy process of the SDE is not a Brownian motion (so the steady state distribution is non-Gaussian) and the coordinates are independent, we are able to prove generic identifiability for any connected graph from the second and third-order Lyapunov equations while allowing the cumulants of the driving process to be unknown diagonal. We propose a minimum distance estimator of the drift matrix, which we are able to prove is consistent and asymptotically normal by utilizing the identifiability result.

19.02.2025 12:15 Jane Coons (Max Planck Institute of Molecular Cell Biology and Genetics, Dresden):
Iterative Proportional Scaling and Log-Linear Models with Rational Maximum Likelihood Estimator8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

In the field of algebraic statistics, we view statistical models as part of an algebraic variety and use tools from algebra, geometry, and combinatorics to learn statistically relevant information about these models. In this talk, we discuss the algebraic interpretation of likelihood inference for discrete statistical models. We present recent work on the iterative proportional scaling (IPS) algorithm, which is used to compute the maximum likelihood estimate (MLE), and give algebraic conditions under which this algorithm outputs the exact MLE in one cycle. Next, we introduce quasi-independence models, which describe the joint distribution of two random variables where some combinations of their states cannot co-occur, but they are otherwise independent. We combinatorially classify the quasi-independence models whose MLEs are rational functions of the data. We show that each of these has a parametrization which satisfies the conditions that guarantee one-cycle convergence of the IPS algorithm.

19.03.2025 12:15 Vincent Fortuin (Helmholtz/TU München):
Recent Advances in Bayesian Deep Learning8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Combining Bayesian principles with the power of deep learning has long been an attractive direction of research, but its real-world impact has fallen short of the promises. Especially in the context of uncertainty estimation, there seem to be simpler methods that perform at least as well. In this talk, I want to argue that uncertainties are not the only reason to use Bayesian deep learning models, but that they also offer improved model selection and incorporation of prior knowledge. I will showcase these benefits supported by the results of two recent papers and situate them in the context of current research trends in Bayesian deep learning. \[ \] Bio: Vincent Fortuin is a tenure-track research group leader at Helmholtz AI in Munich, leading the group for Efficient Learning and Probabilistic Inference for Science (ELPIS), and a faculty member at the Technical University of Munich. He is also a Branco Weiss Fellow, an ELLIS Scholar, a Fellow of the Konrad Zuse School of Excellence in Reliable AI, and a Senior Researcher at the Munich Center for Machine Learning. His research focuses on reliable and data-efficient AI approaches, leveraging Bayesian deep learning, deep generative modeling, meta-learning, and PAC-Bayesian theory. Before that, he did his PhD in Machine Learning at ETH Zürich and was a Research Fellow at the University of Cambridge. He is a regular reviewer and area chair for all major machine learning conferences, an action editor for TMLR, and a co-organizer of the Symposium on Advances in Approximate Bayesian Inference (AABI) and the ICBINB initiative.

14.05.2025 12:15 Luciana Dalla Valle (University of Torino, IT):
Approximate Bayesian conditional copulasOnline: attend8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

According to Sklar’s theorem, any multidimensional absolutely continuous distribution function can be uniquely represented as a copula, which captures the dependence structure among the vector components. In real data applications, the interest of the analyses often lies on specific functionals of the dependence, which quantify aspects of it in a few numerical values. A broad literature exists on such functionals, however extensions to include covariates are still limited. This is mainly due to the lack of unbiased estimators of the conditional copula, especially when one does not have enough information to select the copula model. Several Bayesian methods to approximate the posterior distribution of functionals of the dependence varying according covariates are presented and compared; the main advantage of the investigated methods is that they use nonparametric models, avoiding the selection of the copula, which is usually a delicate aspect of copula modelling. These methods are compared in simulation studies and in two realistic applications, from civil engineering and astrophysics.

14.05.2025 16:15 Rajen Shah (University of Cambridge, UK):
Robustness in Semiparametric StatisticsOnline: attend (Meeting-ID: 631 1190 7291; Password: StatsCol)Raum 144 (Ludwigstrasse 33, 80333 München)

Given that all models are wrong, it is important to understand the performance of methods when the settings for which they have been designed are not met, and to modify them where possible so they are robust to these sorts of departures from the ideal. We present two examples with this broad goal in mind. \[ \] We first look at a classical case of model misspecification in (linear) mixed-effect models for grouped data. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximizing a (restricted) likelihood from random effects modelling or by using generalized estimating equations. We introduce a new ‘sandwich loss’ whose population minimizer coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements when they are not. \[ \] The starting point of our second vignette is the recognition that semiparametric efficient estimation can be hard to achieve in practice: estimators that are in theory efficient may require unattainable levels of accuracy for the estimation of complex nuisance functions. As a consequence, estimators deployed on real datasets are often chosen in a somewhat ad hoc fashion and may suffer high variance. We study this gap between theory and practice in the context of a broad collection of semiparametric regression models that includes the generalized partially linear model. We advocate using estimators that are robust in the sense that they enjoy root n consistent uniformly over a sufficiently rich class of distributions characterized by certain conditional expectations being estimable by user-chosen machine learning methods. We show that even asking for locally uniform estimation within such a class narrows down possible estimators to those parametrized by certain weight functions and develop a new random forest-based estimation scheme to estimate the optimal weights. We demonstrate the effectiveness of the resulting estimator in a variety of semiparametric settings on simulated and real-world data.

21.05.2025 12:15 Michael Muma (TU Darmstadt):
The T-Rex Selector: Fast High-Dimensional Variable Selection with False Discovery Rate Control8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Providing guarantees on the reproducibility of discoveries is essential when drawing inferences from high-dimensional data. Such data is common in numerous scientific domains, for example, in biomedicine, it is imperative to reliably detect the genes that are truly associated with the survival time of patients diagnosed with a certain type of cancer, or in finance, one aims at determining a sparse portfolio to reliably perform index tracking. This talk introduces the Terminating-Random Experiments (T-Rex) selector, a fast multivariate variable selection framework for high-dimensional data. The T-Rex selector provably controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. It scales to settings with millions of variables. Its computational complexity is linear in the number of variables, making it more than two orders of magnitude faster than, e.g., the existing model-X knockoff methods. An easy-to-use open-source R package that implements the TRexSelector is available on CRAN. The focus of this talk lies on high-dimensional linear regression models, but we also describe extensions to principal component analysis (PCA) and Gaussian graphical models (GGMs).

04.06.2025 12:15 Gilles Blanchard (Université Paris-Saclay, FR):
Estimating a large number of high-dimensional vector means 8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

The problem of simultaneously estimating multiple means from independent samples has a long history in statistics, from the seminal works of Stein, Robbins in the 50s, Efron and Morris in the 70s and up to the present day. This setting can be also seen as an (extremely stylized) instance of "personalized federated learning" problem, where each user has their own data and target (the mean of their personal distribution), but potentially want to share some relevant information with "similar" users (though there is no information available a priori about which users are "similar"). In this talk I will concentrate on contributions to the high-dimensional case, where the samples and their means belong to R^d with "large" d. \[ \] We consider a weighted aggregation scheme of empirical means of each sample, and study the possible improvement in quadratic risk over the simple empirical means. To make the stylized problem closer to challenges encountered in practice, we allow (a) full heterogeneity of sample sizes (b) zero a priori knowledge of the structure of the mean vectors (c) unknown and possibly heterogeneous sample covariances. \[ \] We focus on the role of the effective dimension of the data in a "dimensional asymptotics'' point of view, highlighting that the risk improvement of the proposed method satisfies an oracle inequality approaching an adaptive (minimax in a suitable sense) improvement as the effective dimension grows large. \[ \] (This is joint work with Jean-Baptiste Fermanian and Hannah Marienwald)

23.06.2025 12:15 Speaker has cancelled: Qingqing Zhai (Shanghai University, CN):
Modeling Complex System Deterioration: From Unit Degradation to Networked Recurrent FailuresMI 02.08.011 (Boltzmannstr. 3, 85748 Garching)

This presentation addresses statistical challenges in modeling the deterioration of complex systems, spanning from individual unit degradation to interdependent network failures. First, we introduce statistical degradation data modeling using stochastic processes. Then, we shift to modeling recurrent failures in large-scale infrastructure networks (e.g., water distribution systems). Motivated by 16 years of Scottish Water pipe failure data, we propose the novel Network Gamma-Poisson Autoregressive NHPP (GPAN) model. This two-layer framework captures temporal dynamics via Non-Homogeneous Poisson Processes (NHPPs) with node-specific frailties and spatial dependencies through a gamma-Poisson autoregressive scheme structured by the network's Directed Acyclic Graph (DAG). To overcome computational intractability, a scalable sum-product algorithm based on factor graphs and message passing is developed for efficient inference, enabling application to networks with tens of thousands of nodes. We demonstrate how this approach provides accurate failure predictions, identifies high-risk clusters, and supports operational management and risk assessment. The methodologies presented offer powerful tools for reliability analysis across diverse engineering contexts, from product lifespan prediction to critical infrastructure resilience.

09.07.2025 12:15 Nils Sturma (TU München):
Identifiability in Sparse Factor Analysis8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Factor analysis is a statistical technique that explains correlations among observed random variables with the help of a smaller number of unobserved factors. In traditional full-factor analysis, each observed variable is influenced by every factor. However, many applications exhibit interesting sparsity patterns, that is, each observed variable only depends on a subset of the factors. In this talk, we will discuss parameter identifiability of sparse factor analysis models. In particular, we present a sufficient condition for parameter identifiability that generalizes the well-known Anderson-Rubin condition and is tailored to the sparse setup. This is joint work with Mathias Drton, Miriam Kranzlmüller, and Irem Portakal.

09.07.2025 13:15 Pratik Misra (TU München):
Structural identifiability in graphical continuous Lyapunov models8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Graphical continuous Lyapunov models offer a novel framework for the statistical modeling of correlated multivariate data. These models define the covariance matrix through a continuous Lyapunov equation, parameterized by the drift matrix of the underlying dynamic process. In this talk, I will discuss key results on the defining equations of these models and explore the challenge of structural identifiability. Specifically, I will present conditions under which models derived from different directed acyclic graphs (DAGs) are equivalent and provide a transformational characterization of such equivalences. This is based on ongoing work with Carlos Amendola, Tobias Boege, and Ben Hollering.

22.07.2025 10:00 Junhyung Park (ETH Zürich, CH):
Causal Spaces: A Measure-Theoretic Axiomatisation of Causality8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

While the theory of causality is widely viewed as an extension of probability theory, a view which we share, there was no universally accepted, axiomatic framework for causality analogous to Kolmogorov's measure-theoretic axiomatization for the theory of probabilities. Instead, many competing frameworks exist, such as the structural causal models or the potential outcomes framework, that mostly have the flavor of statistical models. To fill this gap, we propose the notion of causal spaces, consisting of a probability space along with a collection of transition probability kernels, called causal kernels, which satisfy two simple axioms and which encode causal information that probability spaces cannot encode. The proposed framework is not only rigorously grounded in measure theory, but it also sheds light on long-standing limitations of existing frameworks, including, for example, cycles, latent variables, and stochastic processes. Our hope is that causal spaces will play the same role for the theory of causality that probability spaces play for the theory of probabilities.

23.07.2025 12:15 Oezge Sahin (TU Delft, NL):
Effects of covariate discretization on conditional quantiles in bivariate copulas8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Clinical data often include a mix of continuous measurements and covariates that have been discretized, typically to protect privacy, meet reporting obligations, or simplify clinical interpretation. This combination, along with the nonlinear and tail-asymmetric dependence frequently observed in clinical data, affects the behavior of regression and variable-selection methods. Copula models, which separate marginal behavior from the dependence structure, provide a principled approach to studying these effects. In this talk, we analyze how discretizing a continuous covariate into equiprobable categories impacts conditional quantiles and likelihoods in bivariate copula models. For the Clayton and Frank families, we derive closed-form anchor points: for a given category, we identify the continuous covariate value at which the conditional quantile under the continuous model matches that of the discretized one. These anchors provide an exact measure of discretization bias, which is small near the center but can be substantial in the tails. Simulations across five copula families show that likelihood-based variable selection may over- or under-weight discretized covariates, depending on the dependence structure. Through simulations, we conclude by comparing polyserial and Pearson correlations, as well as Kendall’s tau (-b), in the same settings. Our results have practical implications for copula-based modeling of mixed-type data.

23.07.2025 16:00 Thomas Nagler (LMU Munich):
On dimension reduction in conditional dependence modelsOnline: attend (Meeting-ID: 670 7627 9498; Passcode: 335101)8101.02.110 / BC1 2.01.10 (Parkring 11, 85748 Garching)

Inference of the conditional dependence structure is challenging when many covariates are present. In numerous applications, only a low-dimensional projection of the covariates influences the conditional distribution. The smallest subspace that captures this effect is called the central subspace in the literature. We show that inference of the central subspace of a vector random variable Y conditioned on a vector of covariates can be separated into inference of the marginal central subspaces of the components of Y conditioned on X and on the copula central subspace, which we define in this paper. Further discussion addresses sufficient dimension reduction subspaces for conditional association measures. An adaptive nonparametric method is introduced for estimating the central dependence subspaces, achieving parametric convergence rates under mild conditions. Simulation studies illustrate the practical performance of the proposed approach.