Causal discovery procedures are popular methods for discovering causal structure across the physical, biological, and social sciences. However, most procedures for causal discovery only output a single estimated causal model or single equivalence class of models. In this work, we propose a procedure for quantifying uncertainty in causal discovery. Specifically, we consider structural equation models where a unique graph can be identified and propose a procedure which returns a confidence set of causal orderings which are not ruled out by the data. We show that, asymptotically, a true causal ordering will be contained in the returned set with a user-specified probability. In addition, the confidence set can be used to form conservative sets of ancestral relationships.
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S. Intuitively speaking, DWP(S) measures how frequently features in S appear together in an RF tree ensemble. We prove that, with high probability, DWP(S) attains a universal upper bound that does not involve any model coefficients, if and only if S corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
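The co-occurrence idea behind DWP can be illustrated with a toy computation. The path representation and the 2^(-depth) weighting below are our illustrative assumptions for the sketch, not the paper's exact definition:

```python
# Toy depth-weighted co-occurrence score for signed features on
# decision paths. The 2**(-depth) weighting and the path layout are
# illustrative assumptions, not the exact DWP definition.

def dwp_toy(paths, S):
    """paths: root-to-leaf paths, each a list of signed features like
    ('x1', '+'); S: set of signed features to look for together."""
    total, hits = 0.0, 0.0
    for path in paths:
        w = 2.0 ** (-len(path))   # deeper leaves get less weight
        total += w
        if S <= set(path):        # all of S appears on this path
            hits += w
    return hits / total if total else 0.0

paths = [
    [("x1", "+"), ("x2", "+")],   # leaf where x1 and x2 both split "up"
    [("x1", "+"), ("x3", "-")],
    [("x2", "+")],
]
print(dwp_toy(paths, {("x1", "+"), ("x2", "+")}))  # → 0.25
```

The key property mirrored here is that the score depends only on how often (and how shallowly) the signed features co-occur, not on any regression coefficients.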
Reference: https://www.pnas.org/doi/10.1073/pnas.2118636119
Co-authors: Yu Wang, Xiao Li, and Bin Yu (UC Berkeley)
Causal inference is a branch of machine learning and statistics that aims to develop theoretical models and practical algorithms to infer the statistical causal dynamics in complex systems. The incorporation of causality in learning is what predominantly sets human judgment apart from machines. In this talk, I will briefly explain new developments in causal discovery, the problem of identifiability, inverse (a.k.a. experimental) design, causal transfer learning, and Granger’s notion of causality in time series, and discuss how his formulation can go beyond linear dynamics. Finally, I will present an application of causality in imitation learning and the development of self-driving vehicles.
Networks are often used to represent complex dependencies in data, and network models can aid the understanding of such dependencies. These models can be parametric, but they could also be implicit, such as the output of an automated synthetic data generator. For assessing the goodness of fit of a model, independent replicas are often assumed. However, when the data are given in the form of a network, usually only one network is available. Classical likelihood ratio methods may fail even in parametric models such as exponential random graph models, because the likelihood involves an intractable normalising constant and cannot be calculated explicitly. This talk will present some network models. We shall introduce a kernelized goodness-of-fit test (based on Stein’s method), give performance guarantees, and illustrate its use. This talk is based on joint work with Nathan Ross and with Wenkai Xu.
Biography: Gesine Reinert is Research Professor in the Department of Statistics at the University of Oxford, and a Fellow of the Alan Turing Institute. Her research includes probabilistic and statistical methods for network analysis, as well as applied probability – in particular Stein’s method – and connections with machine learning methods. Gesine is a Fellow of the Institute of Mathematical Statistics.
Adaptive numerical quadrature is used to normalize posterior distributions in many Bayesian models. We provide the first stochastic convergence rate for the error incurred when normalizing a posterior distribution under typical regularity conditions. We give approximations to moments, marginal densities, and quantiles, and provide convergence rates for several of these summaries. Low- and high-dimensional applications are presented, the latter using adaptive quadrature as one component of a more sophisticated approximation framework, for which limited theory is given. Extension of the theory to the high-dimensional framework for the Laplace approximation (a specific instance of an adaptive quadrature method) is considered, and guarantees are provided under additional regularity assumptions.
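The core idea can be sketched in one dimension with adaptive Gauss–Hermite quadrature: the quadrature rule is centered at the posterior mode and scaled by the curvature there. The function name and test density below are ours:

```python
import numpy as np

def adaptive_ghq_norm_const(log_post, mode, hess, n=20):
    """Approximate Z = ∫ exp(log_post(x)) dx by Gauss-Hermite quadrature
    adapted to the mode and negative Hessian (hess) of log_post there."""
    sigma = 1.0 / np.sqrt(hess)                # local Gaussian scale
    t, w = np.polynomial.hermite.hermgauss(n)  # rule for ∫ e^{-t²} h(t) dt
    x = mode + np.sqrt(2.0) * sigma * t        # adapted nodes
    # Z = √2 σ Σ_i w_i exp(log_post(x_i) + t_i²)
    return np.sqrt(2.0) * sigma * np.sum(w * np.exp(log_post(x) + t**2))

# Sanity check on an unnormalized N(1, 0.5²) density: true Z = √(2π)·0.5.
log_post = lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2
Z = adaptive_ghq_norm_const(log_post, mode=1.0, hess=1.0 / 0.5**2)
print(Z, np.sqrt(2 * np.pi) * 0.5)
```

Because the integrand here is exactly Gaussian, the adapted rule is exact; for non-Gaussian posteriors the error is what the convergence rates in the talk quantify.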
Renewable energies, in particular wind and solar power, have become responsible for a large part of the variation in electricity prices in recent years. Moreover, traders are increasingly interested in models which can be used to forecast the spread between day-ahead and intraday prices. In this talk we shed some light on new models for electricity prices using continuous autoregressive processes. In addition, we discuss intraday price modeling and spread forecasting based on Bayesian statistics and artificial intelligence.
Machine learning excels at learning associations and patterns from data and is increasingly adopted in the natural, life, and social sciences, as well as in engineering. However, many relevant research questions about such complex systems are inherently causal, and machine learning alone is not designed to answer them. At the same time, there often exists ample theoretical and empirical knowledge in the application domains. In this talk, I will briefly outline causal inference as a powerful framework providing the theoretical foundations to combine data and machine learning models with qualitative domain assumptions to quantitatively answer causal questions. I will discuss challenges ahead and selected application scenarios to spark interest in integrating causal thinking into data-driven science.
Short bio: Jakob Runge has headed the Causal Inference group at the German Aerospace Center’s Institute of Data Science in Jena since 2017 and has been a guest professor of computer science at TU Berlin since 2021. His group develops theory, methods, and accessible software for causal inference on time series data, inspired by challenges in various application domains. Jakob studied physics at Humboldt University Berlin and finished his PhD project at the Potsdam Institute for Climate Impact Research in 2014. His studies were funded by the German National Academic Foundation (Studienstiftung), and his thesis was awarded the Carl-Ramsauer prize by the Berlin Physical Society. In 2014 he won a $200,000 Fellowship Award in Studying Complex Systems from the James S. McDonnell Foundation and was at the Grantham Institute, Imperial College London, from 2016 to 2017. In 2020 he won an ERC Starting Grant with his interdisciplinary project CausalEarth. On https://github.com/jakobrunge/tigramite.git he provides Tigramite, a Python module for causal inference on time series. For more details, see: www.climateinformaticslab.com.
The Kronecker covariance structure for array data posits that the covariances along comparable modes, such as rows and columns, of an array are similar. For example, when modelling a multivariate time series, it might be assumed that each individual series follows the same AR process, up to changes in scale, while at each particular timepoint the observations across series have the same correlation structure. Over and above being a plausible model for many types of data, the Kronecker covariance assumption is especially useful in high-dimensional settings, where unconstrained covariance matrix estimates are typically unstable. In this talk we explore the information geometric aspects of the estimation of Kronecker covariance matrices. The asymptotic properties of two estimators, the maximum likelihood estimator and an estimator based on partial traces, are contrasted. It is shown that the partial trace estimator is inefficient, where the relative performance of this estimator can be quantified in terms of a principal angle between tangent spaces. This principal angle can be related to the eigenvalues of the underlying Kronecker covariance matrix. By defining a rescaled version of the partial trace operator, an asymptotically efficient correction to the partial trace estimator is proposed. This estimator has a closed-form expression and also has a useful equivariance property. An orthogonal parameterization of the collection of Kronecker covariances is subsequently motivated by the rescaled partial trace estimator. Orthogonal parameterizations imply that the components of the parameterization are asymptotically independent, which in the Kronecker case has implications for tests concerning row and column covariances.
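The partial trace operator at the heart of the second estimator can be illustrated on a population Kronecker covariance; in practice the sample covariance would be plugged in, and the trace normalizations below are one common way to fix the scale indeterminacy of the factors:

```python
import numpy as np

p, q = 3, 4

# Ground-truth Kronecker covariance Sigma = A ⊗ B.
A = np.eye(p) + 0.3 * np.ones((p, p))
B = np.diag(np.arange(1.0, q + 1))
Sigma = np.kron(A, B)

# View Sigma as a p x p grid of q x q blocks: block (i,j) equals A_ij * B.
blocks = Sigma.reshape(p, q, p, q)

# Partial traces: tr over the second factor gives A * tr(B); summing the
# diagonal blocks gives tr(A) * B.
A_hat = np.trace(blocks, axis1=1, axis2=3)         # = A * tr(B)
B_hat = sum(blocks[i, :, i, :] for i in range(p))  # = tr(A) * B

# Fix the scale indeterminacy A ⊗ B = (cA) ⊗ (B/c).
A_hat /= np.trace(B)
B_hat /= np.trace(A)
print(np.allclose(np.kron(A_hat, B_hat), Sigma))   # → True
```

On the exact population covariance the factors are recovered perfectly; the inefficiency discussed in the talk concerns what happens when a noisy sample covariance replaces Sigma.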
This talk will provide an overview of the recent progress made in exploring Sourav Chatterjee's newly introduced rank correlation. The objective is to elaborate on its practical utility and present several new findings pertaining to (a) the asymptotic normality and limiting variance of Chatterjee's rank correlation, (b) its statistical efficiency for testing independence, and (c) the issue of its bootstrap inconsistency. Notably, the presentation will reveal that Chatterjee's rank correlation is root-n consistent, asymptotically normal, but bootstrap inconsistent - an unusual phenomenon in the literature.
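In the no-ties case, Chatterjee's coefficient has a simple closed form: sort the pairs by x, rank the y's in that order, and sum the absolute increments of consecutive ranks. A minimal implementation:

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation (no-ties case):
    xi = 1 - 3 * sum_i |r_{i+1} - r_i| / (n^2 - 1),
    where r_i are the ranks of y after sorting the pairs by x."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                      # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
print(chatterjee_xi(x, x**2))                   # noiseless dependence: near 1
print(chatterjee_xi(x, rng.normal(size=2000)))  # independence: near 0
```

Note the asymmetry in x and y: the coefficient is designed to detect whether y is a (noisy) function of x, which is why it approaches 1 even for the non-monotone map y = x².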
Multiple stochastic integrals with respect to Brownian motion are a classical topic, while their counterparts with respect to stable processes have attracted comparatively little interest. Their distributions can be simulated using U-statistics. This will be discussed in the first part of the talk. On the other hand, this representation allows for statistical applications to observations with slowly decaying tail distributions. I shall present some simulations and give an application from neuroscience.
We study the weak convergence of conditional empirical copula processes indexed by general families of conditioning events that have nonzero probabilities. Moreover, we also study the case where the conditioning events are chosen in a data-driven way. The validity of several bootstrap schemes is established, including the exchangeable bootstrap. We define general multivariate measures of association, possibly given some fixed or random conditioning events. By applying our theoretical results, we prove the asymptotic normality of the estimators of such measures. We illustrate our results with financial data.
Sample splitting is one of the most tried-and-true tools in the data scientist toolbox. It breaks a data set into two independent parts, allowing one to perform valid inference after an exploratory analysis or after training a model. A recent paper (Neufeld, et al. 2023) provided a remarkable alternative to sample splitting, which the authors showed to be attractive in situations where sample splitting is not possible. Their method, called convolution-closed data thinning, proceeds very differently from sample splitting, and yet it also produces two statistically independent data sets from the original. In this talk, we will show that sufficiency is the key underlying principle that makes their approach possible. This insight leads naturally to a new framework, which we call generalized data thinning. This generalization unifies both sample splitting and convolution-closed data thinning as different applications of the same procedure. Furthermore, we show that this generalization greatly widens the scope of distributions where thinning is possible. This work is a collaboration with Ameer Dharamshi, Anna Neufeld, Keshav Motwani, Lucy Gao, and Daniela Witten.
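The Poisson case of convolution-closed thinning is easy to state: a Binomial(X, ε) draw and its complement are independent Poisson variables whose rates split λ, because X = X1 + X2 is sufficient for λ. A short sketch (parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# If X ~ Poisson(lam) and X1 | X ~ Binomial(X, eps), then
# X1 ~ Poisson(eps*lam), X2 = X - X1 ~ Poisson((1-eps)*lam), X1 ⟂ X2.
lam, eps, n = 10.0, 0.3, 200_000
x = rng.poisson(lam, size=n)
x1 = rng.binomial(x, eps)   # e.g. a "training" part
x2 = x - x1                 # an independent "test" part

print(np.corrcoef(x1, x2)[0, 1])  # ≈ 0: the two parts are independent
print(x1.mean(), eps * lam)       # both ≈ 3: rates split as advertised
```

Unlike sample splitting, every observation contributes to both parts; the randomness is injected inside each observation rather than across observations.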
Causal effect estimation from observational data is a fundamental task in the empirical sciences. It becomes particularly challenging when unobserved confounders are involved in a system. This presentation provides an introduction to front-door adjustment – a classic technique which, using observed mediators, allows one to identify causal effects even in the presence of unobserved confounding. Focusing on the algorithmic aspects, this talk presents recent results for finding front-door adjustment sets in linear time in the size of the causal graph.
Link to technical report: https://arxiv.org/abs/2211.16468
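The identification step that front-door adjustment performs can be checked numerically on a small binary model with a hidden confounder (the probability tables below are illustrative):

```python
import numpy as np

# Binary SCM with hidden confounder U: U -> X, U -> Y, and X -> M -> Y.
# Front-door adjustment recovers P(y | do(x)) from the observational
# joint over (X, M, Y) alone, using the observed mediator M.
p_u = np.array([0.4, 0.6])
p_x_u = np.array([[0.8, 0.2], [0.3, 0.7]])    # p_x_u[u, x]
p_m_x = np.array([[0.9, 0.1], [0.2, 0.8]])    # p_m_x[x, m]
p_y_mu = np.array([[[0.7, 0.3], [0.4, 0.6]],  # p_y_mu[m, u, y]
                   [[0.5, 0.5], [0.1, 0.9]]])

# Observational joint p(x, m, y) = sum_u p(u) p(x|u) p(m|x) p(y|m,u).
p_xmy = np.einsum('u,ux,xm,muy->xmy', p_u, p_x_u, p_m_x, p_y_mu)
p_x = p_xmy.sum(axis=(1, 2))
p_m_given_x = p_xmy.sum(axis=2) / p_x[:, None]
p_y_given_xm = p_xmy / p_xmy.sum(axis=2, keepdims=True)

# Front-door formula: P(y|do(x)) = sum_m P(m|x) sum_x' P(y|x',m) P(x').
fd = np.einsum('xm,zmy,z->xy', p_m_given_x, p_y_given_xm, p_x)

# Ground truth from the SCM: P(y|do(x)) = sum_u p(u) sum_m p(m|x) p(y|m,u).
truth = np.einsum('u,xm,muy->xy', p_u, p_x_u * 0 + p_m_x, p_y_mu)
print(np.allclose(fd, truth))  # → True: the effect is identified
```

The formula uses only observational conditionals, which is exactly what makes the effect identifiable despite U being unobserved.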
If explanatory variables and a response variable of interest are simultaneously observed, then multivariate models based on vine pair-copula constructions can be fitted, and inferences can be based on the conditional distribution of the response variable given the explanatory variables.
In applications, several practical issues arise when implementing this idea. Topics include: (a) inclusion of categorical predictors; (b) a right-censored response variable; (c) for a pair with one ordinal and one continuous variable, diagnostics for copula choice and for assessing the fit of the copula; (d) use of the empirical beta copula; (e) performance metrics for prediction/classification and sensitivity to the choice of vine structure and of pair-copulas on the edges of the vine; (f) a weighted log-likelihood for an ordinal response variable; (g) comparisons with linear regression methods.
This paper extends the technique of gradient boosting in mortality forecasting. The two novel contributions are to use stochastic mortality models as weak learners in gradient boosting rather than trees, and to include a penalty that shrinks the forecasts of mortality in adjacent age groups and nearby geographical regions closer together. The proposed method demonstrates superior forecasting performance based on US male mortality data from 1969 to 2019. The boosted model with age-based shrinkage yields the most accurate national-level mortality forecast. For state-level forecasts, spatial shrinkage provides further improvement in accuracy in addition to the benefits achieved by age-based shrinkage. This additional improvement can be attributed to data sharing across states with both large and small populations in adjacent regions, as well as states which share common risk factors.
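The first contribution, swapping trees for other weak learners inside a gradient boosting loop, can be sketched generically. The componentwise linear learner and synthetic data below are stand-ins for the stochastic mortality models in the paper:

```python
import numpy as np

# Generic L2-boosting: any fit/predict "weak learner" can replace trees.
# Here the learner is a greedy one-variable linear fit; it stands in for
# the mortality-model learners of the paper (an illustrative assumption).

def fit_weak(X, r):
    """Fit the single column that best explains the residual r."""
    best = None
    for j in range(X.shape[1]):
        b = X[:, j] @ r / (X[:, j] @ X[:, j])
        sse = ((r - b * X[:, j]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, j, b)
    _, j, b = best
    return lambda Z, j=j, b=b: b * Z[:, j]

def boost(X, y, n_rounds=200, nu=0.1):
    """Repeatedly fit the residuals and take shrunken (nu) steps."""
    pred, learners = np.full(len(y), y.mean()), []
    for _ in range(n_rounds):
        f = fit_weak(X, y - pred)   # fit current residuals
        pred = pred + nu * f(X)     # shrinkage keeps steps conservative
        learners.append(f)
    return pred, learners

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=300)
pred, _ = boost(X, y)
print(np.mean((y - pred) ** 2))  # small residual error
```

The paper's second contribution adds a penalty to this loop so that forecasts for adjacent ages and neighbouring states are shrunk towards each other; that penalty is not sketched here.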
**September 27, 2023**
09:00-09:45 Stijn Vansteelandt (Ghent University)
09:45-10:30 Vanessa Didelez (Leibniz Institute for Prevention Research and Epidemiology - BIPS)
break
11:00-11:45 Peter Bühlmann (ETH Zürich)
11:45-12:30 Dominik Janzing (Amazon Research)
lunch
14:00-14:45 Giusi Moffa (University of Basel)
14:45-15:30 Ricardo Silva (University College London)
**September 28, 2023**
10:00-10:45 Kun Zhang (Carnegie Mellon University)
10:45-11:30 Robin Evans (University of Oxford)
break
11:45-12:30 Niels Richard Hansen (University of Copenhagen)
lunch
14:00-14:45 Niki Kilbertus (Helmholtz / TUM)
14:45-15:30 Mathias Drton (TUM)
See https://collab.dvb.bayern/display/TUMmathstat/Miniworkshop+on+Graphical+Models+and+Causality for more details.
Motivated by applications to water quality monitoring using fluorescence spectroscopy, we develop the source apportionment model for high dimensional profiles of dissolved organic matter (DOM). We describe simple methods to estimate the parameters of a linear source apportionment model, and we show how the estimates are related to those of ordinary and generalized least squares. Using this least squares framework, we analyze the variability of the estimates, and we propose predictors for missing elements of a DOM profile. We demonstrate the practical utility of our results on fluorescence spectroscopy data collected from the Neuse River in North Carolina.
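A toy version of the least squares step can be sketched as follows, assuming known source profiles; the source shapes and dimensions are our illustrative assumptions:

```python
import numpy as np

# Toy linear source apportionment: an observed fluorescence profile is a
# mix of known source profiles plus noise; contributions are estimated
# by ordinary least squares. All dimensions/values are illustrative.
rng = np.random.default_rng(0)
d, k = 100, 3
S = np.abs(rng.normal(size=(d, k)))  # columns: source profiles
a_true = np.array([2.0, 0.5, 1.0])   # source contributions
y = S @ a_true + 0.05 * rng.normal(size=d)

# OLS estimate of the contributions.
a_hat, *_ = np.linalg.lstsq(S, y, rcond=None)
print(a_hat)

# Predict a "missing" element of the profile from the fitted mix.
i = 42
print(S[i] @ a_hat, y[i])
```

Generalized least squares would replace the implicit identity weighting with the error covariance, which is one of the comparisons the abstract alludes to.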
In this talk, we begin with a motivation for and brief introduction to causal graphical modeling of time series. We then discuss two recent works in this area. First, a complete characterization of a class of graphical models for describing lag-resolved causal relationships in the presence of latent confounders. This characterization sheds new light on existing time series causal discovery algorithms and shows that there is room for stronger identifiability results than previously thought. Second, a method for projecting infinite time series graphs with time-invariant edges onto finite marginal graphs. We argue that the construction of these marginal graphs is a big step towards a method-agnostic generalization of causal effect identifiability results to time series.
Many datasets for modern machine learning consist of high dimensional observations that are generated from some low dimensional latent variables. While recent advances in deep learning allow us to sample from distributions of almost arbitrary complexity, the recovery of the ground truth latent variables is still challenging even in simple settings. We study this problem through the lens of identifiability, i.e., when can we, at least theoretically, hope to recover the latent structure up to certain symmetries? We will present a general identifiability result for interventional data and a contrastive algorithm to find the latent variables. In the second part, we study the robustness of identifiability results to misspecification as one challenge for practical applications of representation learning. This talk is based on joint work with Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar.
Bio: Simon Buchholz received his PhD in mathematics from the University of Bonn, where he was advised by Stefan Mueller. Currently he is a Postdoctoral Researcher with Bernhard Schölkopf in the Empirical Inference department at the Max Planck Institute for Intelligent Systems in Tübingen, where he works on problems in causal representation learning.
Many modern statistical procedures are randomized in the sense that the output is a random function of data. For example, many procedures employ data splitting, which randomly divides the dataset into disjoint parts for separate purposes. Despite their flexibility and popularity, data splitting and other constructions of randomized procedures have obvious drawbacks. First, two analyses of the same dataset may lead to different results due to the extra randomness introduced. Second, randomized procedures typically lose statistical power because the entire sample is not fully utilized.
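The first drawback is easy to demonstrate: running the same split-then-test recipe with two different random splits of the same data yields two different p-values (the toy setup below is ours):

```python
import numpy as np
from math import erfc, sqrt

def split_and_test(x, seed):
    """Randomly hold out half the data and run a two-sided one-sample
    z-test for zero mean on the held-out half."""
    rng = np.random.default_rng(seed)
    half = rng.permutation(len(x))[len(x) // 2:]   # "inference" half
    z = x[half].mean() / (x[half].std(ddof=1) / sqrt(len(half)))
    return erfc(abs(z) / sqrt(2))                  # two-sided normal p-value

rng = np.random.default_rng(0)
x = 0.12 + rng.normal(size=400)   # a small true effect
print(split_and_test(x, seed=1))
print(split_and_test(x, seed=2))  # same data, same test, different answer
```

Two analysts applying this identical procedure to the identical dataset can thus report different conclusions, which is exactly the reproducibility issue the derandomization framework addresses.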
To address these drawbacks, in this talk, I will study how to properly combine the results from multiple realizations (such as through multiple data splits) of a randomized procedure. I will introduce rank-transformed subsampling as a general method for delivering large sample inference of the combined result under minimal assumptions. I will illustrate the method with three applications: (1) a “hunt-and-test” procedure for detecting cancer subtypes using high-dimensional gene expression data, (2) testing the hypothesis of no direct effect in a sequentially randomized trial and (3) calibrating cross-fit “double machine learning” confidence intervals. For these problems, our method is able to derandomize and improve power. Moreover, in contrast to existing approaches for combining p-values, our method enjoys type-I error control that asymptotically approaches the nominal level. This new development opens up the possibility of designing procedures that explicitly randomize and derandomize: extra randomness is introduced to make the problem easier before being marginalized out.
This talk is based on joint work with Prof. Rajen Shah.
Bio: Richard Guo is a research associate in the Statistical Laboratory at the University of Cambridge, mentored by Prof. Rajen Shah. Previously, he was the Richard M. Karp Research Fellow in the 2022 causality program at the Simons Institute for the Theory of Computing. He received his PhD in Statistics from University of Washington in 2021, advised by Thomas Richardson. His research interests include graphical models, causal inference, semiparametric methods and replicability of data analysis. Dr. Guo will start as a tenure-track assistant professor in Biostatistics at the University of Washington in 2024.
...