In our manufacturing plants, many tens of thousands of components for the automotive industry, like cameras or brake boosters, are produced each day. For many of our products, thousands of quality measurements are collected and individually checked during the assembly process. Understanding the relations and interconnections between these measurements is key to maintaining high production uptime and keeping scrap to a minimum. Graphical models, like Bayesian networks, provide a rich statistical framework for investigating these relationships, not least because they represent them as a graph. However, learning their graph structure is an NP-hard problem, and most existing algorithms are designed to deal with either a small number of variables or a small number of observations. On our datasets, with many thousands of variables and many hundreds of thousands of observations, classic learning algorithms do not converge. In this talk, we show how we use an adapted version of the NOTEARS algorithm that uses mixture density neural networks to learn the structure of Bayesian networks even for very high-dimensional manufacturing data.
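A minimal sketch of the smooth acyclicity constraint at the heart of NOTEARS (Zheng et al., 2018), which the talk's adaptation builds on; the mixture-density-network extension described above is not reproduced here. The function h(W) = tr(exp(W ∘ W)) - d is zero exactly when the weighted adjacency matrix W encodes an acyclic graph.

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W: np.ndarray) -> float:
    """h(W) = tr(exp(W * W)) - d; zero iff the graph encoded by W is acyclic."""
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # elementwise square keeps entries non-negative

W_acyclic = np.array([[0.0, 0.9], [0.0, 0.0]])  # single edge 1 -> 2
W_cyclic = np.array([[0.0, 0.9], [0.8, 0.0]])   # edges 1 -> 2 and 2 -> 1
print(notears_acyclicity(W_acyclic))  # ~0.0
print(notears_acyclicity(W_cyclic))   # > 0, penalizing the cycle
```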
In this talk I will introduce a class of discrete statistical models to represent context-specific conditional independence relations for discrete data. These models can also be represented by sequences of context DAGs (directed acyclic graphs). We prove that two of these models are statistically equivalent if and only if their contexts are equal and the context DAGs have the same skeleton and v-structures. This is a generalization of the Verma and Pearl criterion for equivalence of DAGs. This is joint work with Liam Solus. A 3-minute video abstract for this talk is available at https://youtu.be/CccVNRFmR1I .
In this talk, the relationship between strongly chordal graphs and m-saturated vines (regular vines with certain nodes removed or assigned the independence copula) is established. Moreover, an algorithm to construct an m-saturated vine structure corresponding to a strongly chordal graph is provided. When the underlying data is sparse, this approach leads to improvements in the estimation process compared to current heuristic methods. Furthermore, due to the reduction in model complexity, it is possible to evaluate all vine structures as well as to fit non-simplified vines.
Toric varieties have a strong combinatorial flavor: those algebraic varieties are described in terms of a fan. Based on joint work with M. Borinsky, B. Sturmfels, and S. Telen (https://arxiv.org/abs/2204.06414), I explain how to understand toric varieties as probability spaces. Bayesian integrals for discrete statistical models that are parameterized by a toric variety can be computed by a tropical sampling method. Our methods are motivated by the study of Feynman integrals and positive geometries in particle physics.
In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower-dimensional space and base the classification on the resulting lower-dimensional projections. In this talk, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general theory is established for analyzing such two-step classifiers based on any low-dimensional projections. We derive explicit rates of convergence of the excess risk of the proposed PC-based classifier. The obtained rates are further shown to be optimal up to logarithmic factors in the minimax sense. Our theory allows, but does not require, the lower dimension to grow with the sample size and the feature dimension to exceed the sample size. Simulations support our theoretical findings. This is joint work with Xin Bing (Department of Statistical Sciences, University of Toronto).
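As a rough illustration of the two-step recipe analyzed in the talk (project onto leading principal components, then classify in the reduced space), here is a hedged scikit-learn sketch; the toy data, the choice of 10 components, and the LDA classifier are illustrative assumptions, whereas the talk selects the number of retained PCs in a data-driven way.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))             # n = 200 samples, p = 500 features
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # toy labels driven by a low-dimensional structure

# Step 1: project onto the leading PCs; Step 2: classify in the reduced space.
clf = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
clf.fit(X, y)
print(clf.score(X, y))
```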
Topic models have been and continue to be an important modeling tool for an ensemble of independent multinomial samples with shared commonality. Although applications of topic models span many disciplines, the jargon used to define them stems from text analysis. In keeping with the standard terminology, one has access to a corpus of n independent documents, each utilizing words from a given dictionary of size p. One draws N words from each document and records their respective counts, thereby representing the corpus as a collection of n samples from independent, p-dimensional multinomial distributions, each having a different, document-specific, true word probability vector Π. The topic model assumption is that each Π is a mixture of K discrete distributions, common to the corpus, with document-specific mixture weights. The corpus is assumed to cover K topics that are not directly observable, and each of the K mixture components corresponds to conditional probabilities of words given a topic. The vector of the K mixture weights, per document, is viewed as a document-specific topic distribution T and is thus expected to be sparse, as most documents will only cover a few of the K topics of the corpus.
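In symbols (a hedged restatement of the setup above, writing A for the p x K matrix whose k-th column holds the word probabilities given topic k):

X_j ~ Multinomial_p(N, Π_j),   Π_j = A T_j,   j = 1, ..., n,

where T_j is the document-specific vector of K topic weights, which sums to one and is expected to be sparse.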
Despite the large body of work on learning topic models, the estimation of sparse topic distributions of unknown sparsity, especially when the mixture components are not known and are estimated from the same corpus, is not well understood and will be the focus of this talk. We provide estimators of T, with sharp theoretical guarantees, valid in many practically relevant situations, including the scenario p >> N (short documents, sparse data) and unknown K. Moreover, the results are valid when the dimensions p and K are allowed to grow with the sample sizes N and n.
When the mixture components are known, we propose MLE estimation of the sparse vector T, the analysis of which has been open until now. The surprising result, and a remarkable property of the MLE in these models, is that, under appropriate conditions and without further regularization, it can be exactly sparse and contain the true zero pattern of the target. When the mixture components are not known, we exhibit computationally fast and rate-optimal estimators for them, and propose a quasi-MLE estimator of T, shown to retain the properties of the MLE. The practical implication of our sharp, finite-sample rate analyses of the MLE and quasi-MLE is that having short documents can be compensated for, in terms of estimation precision, by having a large corpus.
Our main application is to the estimation of Wasserstein distances between document generating distributions. We propose, estimate, and analyze Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. The effectiveness of the proposed Wasserstein distance estimates, and their contrast with the more commonly used Word Mover's Distance between empirical frequency estimates, are illustrated by an analysis of an IMDb movie reviews data set.
Brief Bio: Florentina Bunea obtained her Ph.D. in Statistics at the University of Washington, Seattle. She is now a Professor of Statistics in the Department of Statistics and Data Science, and she is affiliated with the Center for Applied Mathematics and the Department of Computer Science at Cornell University. She is a fellow of the Institute of Mathematical Statistics, and she is or has been part of numerous editorial boards, such as JRSS-B, JASA, Bernoulli, and the Annals of Statistics. Her work has been continuously funded by the US National Science Foundation. Her most recent research interests include latent space models, topic models, and optimal transport in high dimensions.
The Cox proportional hazards model is a semiparametric regression model that can be used in medical research, engineering, or insurance for investigating the association between the survival time (the so-called lifetime) of an object and predictor variables. We investigate the Cox proportional hazards model for right-censored data, where the baseline hazard rate belongs to an unbounded set of nonnegative Lipschitz functions with a fixed constant, the vector of regression parameters belongs to a compact parameter set, and, in addition, the time-independent covariates are subject to measurement errors. We construct a simultaneous estimator of the baseline hazard rate and the regression parameter, present asymptotic results, and discuss goodness-of-fit tests.
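For orientation, a standard Cox fit on right-censored data, specifying the hazard as λ(t | x) = λ0(t) exp(β'x), can be sketched with the third-party lifelines package; this illustrates only the classical model, not the measurement-error setting or the simultaneous estimator developed in the talk.

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()  # example right-censored survival data shipped with lifelines
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")  # hazard: lambda0(t) * exp(beta' x)
cph.print_summary()
```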
Bayesian networks are probabilistic graphical models widely employed to understand dependencies in high-dimensional data, and even to facilitate causal discovery. Learning the underlying network structure, which is encoded as a directed acyclic graph (DAG), is highly challenging, mainly due to the vast number of possible networks in combination with the acyclicity constraint, and a plethora of algorithms have been developed for this task. Efforts have focused on two fronts: constraint-based methods that perform conditional independence tests to exclude edges, and score-and-search approaches which explore the DAG space with greedy or MCMC schemes. We synthesize these two fields in a novel hybrid method which reduces the complexity of Bayesian MCMC approaches to that of a constraint-based method. This enables full Bayesian model averaging for much larger Bayesian networks and offers significant improvements in structure learning. To facilitate the benchmarking of different methods, we further present a novel automated workflow for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms. It is interfaced via a simple config file, which makes it accessible to all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. We demonstrate the applicability of this workflow for learning Bayesian networks in typical data scenarios.
References: doi:10.1080/10618600.2021.2020127 and arXiv:2107.03863
A dependence graph for a set of variables has rules for which pairs of variables are connected. In the literature on dependence graphs for gene expression measurements, several rules for connecting pairs of variables based on a correlation matrix have been used: (a) the absolute correlation of the pair exceeds a threshold; (b) the absolute partial correlation of the pair given the rest exceeds a threshold; (c) the first-order conditional independence rule of Magwene and Kim (2004).
These three methods will be compared with the dependence graph from a truncated partial correlation vine with thresholding. The comparisons are made for correlation matrices that are derived from (a) factor dependence structures, (b) a Markov tree structure, and (c) variables that form groups with strong within-group dependence and weaker between-group dependence. If there are latent variables, the graphs are compared with and without them. The goal is to show that more parsimonious and interpretable graphs can be obtained with the inclusion of latent variables.
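A minimal sketch of rules (a) and (b) from the abstract above, building a dependence graph by thresholding absolute (partial) correlations; rule (c) and the truncated partial correlation vine comparison are not reproduced here, and the threshold value is an illustrative assumption.

```python
import numpy as np

def correlation_graph(R: np.ndarray, threshold: float) -> np.ndarray:
    """Rule (a): connect i and j if |corr(i, j)| exceeds the threshold."""
    A = (np.abs(R) > threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A

def partial_correlation_graph(R: np.ndarray, threshold: float) -> np.ndarray:
    """Rule (b): connect i and j if |partial corr(i, j | rest)| exceeds the threshold."""
    P = np.linalg.inv(R)                               # precision matrix
    scale = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    pcorr = -P / scale                                 # partial correlations given all other variables
    A = (np.abs(pcorr) > threshold).astype(int)
    np.fill_diagonal(A, 0)
    return A
```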
This research project contributes to insurance risk management by modeling extreme climate risk and extreme mortality risk in an integrated manner via extreme value theory (EVT). We conduct an empirical study using monthly temperature and death data and find that the joint extremes in cold weather and old-age death counts exhibit the strongest level of dependence. Based on the estimated bivariate generalized Pareto distribution, we quantify the extremal dependence between death counts and temperature indexes. Methodologically, we employ the bivariate peaks over threshold (POT) approach, which is readily applicable to a wide range of topics in extreme risk management.
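As a hedged reference point for the peaks-over-threshold idea, the univariate building block fits a generalized Pareto distribution to exceedances over a high threshold; the bivariate POT model and the extremal dependence measures used in the project go beyond this sketch, and the toy data and threshold level below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=5000)    # heavy-tailed toy data
u = np.quantile(x, 0.95)               # high threshold
exceedances = x[x > u] - u
shape, loc, scale = genpareto.fit(exceedances, floc=0)  # GPD fit to the exceedances
print(shape, scale)
```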
Testing for the presence of changepoints and determining their location is a common problem in time series analysis. Applying changepoint procedures to multivariate data results in higher power and more precise location estimates, both in online and offline detection. However, this requires that all changepoints occur at the same time. We study the problem of testing the equality of changepoint locations. One approach is to treat common breaks as a common feature and test whether an appropriate linear combination of the data can cancel the breaks. We propose how to determine such a linear combination and derive the asymptotic distribution of the resulting CUSUM and MOSUM statistics. We also study the power of the test under local alternatives and provide simulation results on its finite-sample performance. Finally, we suggest a clustering algorithm to group variables into clusters that are co-breaking.
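A minimal sketch of a univariate CUSUM statistic for a single change in mean, the kind of statistic whose multivariate and co-breaking extensions the talk studies; the normalization and the toy data are illustrative assumptions.

```python
import numpy as np

def cusum_statistic(x: np.ndarray) -> tuple[float, int]:
    """Return max_k |S_k - (k/n) S_n| / (sigma_hat * sqrt(n)) and the maximizing k."""
    n = len(x)
    partial = np.cumsum(x - x.mean())            # S_k - (k/n) S_n
    stat = np.abs(partial[:-1]) / (np.std(x, ddof=1) * np.sqrt(n))
    k = int(np.argmax(stat))
    return float(stat[k]), k + 1                 # statistic and estimated changepoint location

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
print(cusum_statistic(x))                        # location estimate should be near 100
```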
Manifold learning can be used to obtain a low-dimensional representation of the underlying manifold given the high-dimensional data. However, kernel density estimates of the low-dimensional embedding with a fixed bandwidth fail to account for the way manifold learning algorithms distort the geometry of the underlying Riemannian manifold. We propose a novel kernel density estimator for any manifold learning embedding by introducing the estimated Riemannian metric of the manifold as the variable bandwidth matrix for each point. The geometric information of the manifold guarantees a more accurate density estimation of the true manifold, which subsequently could be used for anomaly detection. To compare our proposed estimator with a fixed-bandwidth kernel density estimator, we run two simulations with 2-D metadata mapped into a 3-D Swiss roll or twin peaks shape and a 5-D semi-hypersphere mapped into a 100-D space, and demonstrate that the proposed estimator can improve the density estimates given a good manifold learning embedding and has higher rank correlations between the true and estimated manifold densities. A Shiny app in R is also developed for various simulation scenarios. The proposed method is applied to density estimation in statistical manifolds of electricity usage with the Irish smart meter data. This demonstrates our estimator's capability to correct the distortion of the manifold geometry and to be further used for anomaly detection in high-dimensional data.
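A hedged sketch of the key idea of a point-wise bandwidth matrix: each data point contributes a Gaussian kernel with its own covariance H_i. In the talk, H_i is taken from the estimated Riemannian metric at the embedded point; here the matrices are simply supplied by the caller.

```python
import numpy as np

def variable_bandwidth_kde(query: np.ndarray, points: np.ndarray, H_list) -> float:
    """Average of Gaussian kernels N(points[i], H_list[i]) evaluated at the query point."""
    d = points.shape[1]
    total = 0.0
    for xi, H in zip(points, H_list):
        diff = query - xi
        norm_const = ((2 * np.pi) ** d * np.linalg.det(H)) ** (-0.5)
        total += norm_const * np.exp(-0.5 * diff @ np.linalg.solve(H, diff))
    return total / len(points)
```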
We study statistical models of regular Gaussian distributions given by assumptions about the signs of partial correlations. This includes conditional independence models and graphical modeling devices such as Markov and Bayes networks. For these models, we consider the following basic questions: (1) How hard is it (complexity-theoretically) to check if the model specification is inconsistent? (2) If it is consistent, how hard is it (algebraically) to write down a covariance matrix from the model? (3) How badly shaped (homotopy-theoretically) can these models be? For all of these questions the answer is "it is as bad as it could possibly be".
A statistical model is identifiable if the map parameterizing the model is injective. This means that the parameters producing a probability distribution in the model can be uniquely determined from the distribution itself which is a critical property for meaningful data analysis. In this talk I'll discuss a new strategy for proving that discrete parameters are identifiable that uses algebraic matroids associated to statistical models. This technique allows us to avoid elimination and is also parallelizable. I'll then discuss a new extension of this technique which utilizes oriented matroids to prove identifiability results that the original matroid technique is unable to obtain.
Bayesian networks (BNs) are a versatile and powerful tool to model complex phenomena and the interplay of their components in a probabilistically principled way. Moving beyond the comparatively simple case of completely observed, static data, which has received the most attention in the literature, I will discuss how BNs can be extended to model continuous data and data in which observations are not independent and identically distributed.
For the former, I will discuss continuous-time BNs. For the latter, I will show how mixed effects models can be integrated with BNs to get the best of both worlds.
Deep learning models exhibit a rather curious phenomenon. They optimize over hugely complex model classes and are often trained to memorize the training data. This is seemingly contradictory to classical statistical wisdom, which suggests avoiding interpolation in favor of reducing the complexity of the prediction rules. A large body of recent work partially resolves this contradiction. It suggests that interpolation does not necessarily harm statistical generalization and may even be necessary for optimal statistical generalization in some settings. This is, however, an incomplete picture. In modern ML, we care about more than building good statistical models. We want to learn models which are reliable and have good causal implications. Under a simple linear model in high dimensions, we will discuss the role of interpolation and its counterpart --- regularization --- in learning better causal models.
Sparsity is popular in statistics and machine learning, because it can avoid overfitting, speed up computations, and facilitate interpretations. In deep learning, however, the full potential of sparsity still needs to be explored. This presentation first recaps sparsity in the framework of high-dimensional statistics and then introduces sparsity-inducing methods and corresponding theory for modern deep-learning pipelines.
Multivariate extreme value theory mostly focuses on asymptotic dependence, where the probability of observing a large value in one of the variables is of the same order as that of observing a large value in all variables simultaneously. There is growing evidence, however, that asymptotic independence prevails in many data sets. Available statistical methodology in this setting is scarce and not well understood theoretically. We revisit non-parametric estimation of bivariate tail dependence and introduce rank-based M-estimators for parametric models that may include both asymptotic dependence and asymptotic independence, without requiring prior knowledge on which of the two regimes applies. We further show how the method can be leveraged to obtain parametric estimators in spatial tail models. All the estimators are proved to be asymptotically normal under minimal regularity conditions. The methodology is illustrated through an application to extreme rainfall data.
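For context, a simple rank-based empirical estimate of the upper tail dependence coefficient chi(u) = P(F1(X1) > u | F2(X2) > u) can be written in a few lines; the M-estimators, asymptotic normality results, and spatial tail models in the talk go well beyond this sketch, and the level u = 0.95 is an illustrative assumption.

```python
import numpy as np
from scipy.stats import rankdata

def chi_hat(x: np.ndarray, y: np.ndarray, u: float = 0.95) -> float:
    """Empirical estimate of P(F1(X) > u | F2(Y) > u) using rank-based pseudo-observations."""
    n = len(x)
    ux, uy = rankdata(x) / (n + 1), rankdata(y) / (n + 1)
    return float(np.mean((ux > u) & (uy > u)) / np.mean(uy > u))
```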
The Laplacian matrix of an undirected graph with positive edge weights encodes graph properties in matrix form. In this talk, we will discuss how Laplacian matrices appear prominently in multiple applications in statistics and machine learning. Our interest in Laplacian matrices originates in graphical models for extremes. For Hüsler-Reiss distributions, which are considered an analogue of Gaussians in extreme value theory, they characterize an extremal notion of multivariate total positivity of order 2 (MTP2). This leads to a consistent estimation procedure with a typically sparse graphical structure. Furthermore, the underlying convex optimization problem under Laplacian constraints allows for a simple block descent algorithm that we implemented in R. An active area of research in machine learning is Laplacian-constrained Gaussian graphical models. These models admit structure learning under various connectivity constraints. Multiple algorithms for these problems with different lasso-type penalties are available in the literature. A surprising appearance of Laplacian matrices is in the design of discrete choice experiments. Here, the Fisher information of a discrete choice design is a Laplacian matrix, which gives rise to a new approach for learning locally D-optimal designs.
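A minimal sketch of the object these applications are built on: the Laplacian L = D - W of an undirected graph with positive edge weights, where W is the symmetric weighted adjacency matrix and D the diagonal degree matrix; the example graph is an illustrative assumption.

```python
import numpy as np

W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])        # symmetric weighted adjacency matrix
L = np.diag(W.sum(axis=1)) - W         # Laplacian: degree matrix minus adjacency
print(L)                               # rows sum to zero; L is positive semidefinite
```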
Under an endogenous binary treatment with heterogeneous effects and multiple instruments, we propose a two-step procedure for identifying complier groups with identical local average treatment effects (LATE) despite relying on distinct instruments, even if several instruments violate the identifying assumptions. Our procedure is based on the fact that the LATE is homogeneous for instruments which (i) satisfy the LATE assumptions (instrument validity and treatment monotonicity in the instrument) and (ii) generate identical complier groups in terms of treatment propensities given the respective instruments. Under the plurality assumption that, within each set of instruments with identical treatment propensities, the instruments truly satisfying the LATE assumptions form the largest group, our procedure permits identifying these true instruments in a data-driven way. We also provide a simulation study investigating the finite-sample properties of our approach and an empirical application investigating the effect of incarceration on recidivism in the US, with judge assignments serving as instruments.
Phylogenetic networks provide a means of describing the evolutionary history of sets of species believed to have undergone hybridization or horizontal gene flow during the course of their evolution. The mutation process for a set of such species can be modeled as a Markov process on a phylogenetic network. Previous work has shown that site-pattern probability distributions from a Jukes-Cantor phylogenetic network model must satisfy certain algebraic invariants, i.e., polynomial relationships. As a corollary, aspects of the phylogenetic network are theoretically identifiable from site-pattern frequencies. In practice, because of the probabilistic nature of sequence evolution, the phylogenetic network invariants will rarely be satisfied, even for data generated under the model. Thus, using network invariants for inferring phylogenetic networks requires some means of interpreting the residuals when observed site-pattern frequencies are substituted into the invariants. In this work, we propose an approach that combines statistical learning and phylogenetic invariants to infer small, level-one phylogenetic networks, and we discuss how the approach can be extended to infer larger networks. This is joint work with Travis Barton, Colby Long, and Joseph Rusinko.