Talks in the category Oberseminar Statistics and Data Science.
The Minkowski functionals are a family of geometric quantities including the volume, the surface area, and the Euler characteristic. In this talk, we consider the Minkowski functionals of the excursion set (sup-level set) of an isotropic smooth random field on Euclidean space of arbitrary dimension. Assuming the random field is weakly non-Gaussian, we provide a perturbation formula for the expected Minkowski functionals. This result generalizes Matsubara (2003), who treated the 2- and 3-dimensional cases under weak skewness. The Minkowski functionals are used in astronomy and cosmology as test statistics for Gaussianity of the cosmic microwave background (CMB) and to characterize the large-scale structure of the universe. Moreover, the expected Minkowski functional of the highest degree is the expected Euler characteristic of the excursion set, which approximates the upper tail probability of the maximum of the random field; this methodology is used in multiple testing problems. We explain some applications of the perturbation formulas in these contexts. This talk is based on joint work with Takahiko Matsubara.
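The Euler characteristic of an excursion set is easy to compute on a pixel grid. The sketch below (a basic illustration of the quantity itself, not the talk's perturbation formula) treats the thresholded field as a closed cubical complex and uses χ = V − E + F in two dimensions:

```python
import numpy as np

def euler_characteristic_2d(mask):
    """Euler characteristic of a binary excursion set on a 2D grid,
    computed as V - E + F over the closed cubical complex of active pixels."""
    m = np.asarray(mask).astype(bool)
    # Vertices: pixel corners touched by at least one active pixel.
    V = np.zeros((m.shape[0] + 1, m.shape[1] + 1), dtype=bool)
    V[:-1, :-1] |= m; V[1:, :-1] |= m; V[:-1, 1:] |= m; V[1:, 1:] |= m
    # Horizontal edges (top/bottom sides of pixels).
    Eh = np.zeros((m.shape[0] + 1, m.shape[1]), dtype=bool)
    Eh[:-1, :] |= m; Eh[1:, :] |= m
    # Vertical edges (left/right sides of pixels).
    Ev = np.zeros((m.shape[0], m.shape[1] + 1), dtype=bool)
    Ev[:, :-1] |= m; Ev[:, 1:] |= m
    return int(V.sum()) - int(Eh.sum() + Ev.sum()) + int(m.sum())
```

For instance, a filled square has χ = 1, a ring (square with a hole) has χ = 0, and two disjoint components have χ = 2, which is what makes the Euler characteristic a useful summary of excursion-set topology.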
When studying survival data in the presence of right censoring, it often happens that a certain proportion of the individuals under study never experience the event of interest and are considered cured. The mixture cure model is one of the common models that take this feature into account. It consists of a model for the conditional probability of being cured (called the incidence) and a model for the conditional survival function of the uncured individuals (called the latency). This work considers a logistic model for the incidence and a semiparametric accelerated failure time model for the latency. The model is estimated by maximizing the semiparametric likelihood, in which the unknown error density is replaced by a kernel estimator based on the Kaplan-Meier estimator of the error distribution. Asymptotic theory for consistency and asymptotic normality of the parameter estimators is provided. Moreover, the proposed estimation method is compared with the method of Lu (2010), which uses a kernel approach based on the EM algorithm to estimate the model parameters. Finally, the new method is applied to data from a cancer clinical trial.
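To fix ideas, the two components combine into the population survival function S(t | x, z) = 1 − p(z) + p(z) S_u(t | x), with logistic incidence p(z) and an accelerated failure time latency S_u. The sketch below only illustrates this decomposition; the function names, the single covariate per part, and the baseline survival are hypothetical choices, not the paper's estimator:

```python
import numpy as np

def incidence(z, gamma):
    """Logistic model for the probability of being uncured given z."""
    return 1.0 / (1.0 + np.exp(-(gamma[0] + gamma[1] * z)))

def latency_survival(t, x, beta, base_survival):
    """AFT latency: survival of the uncured, S_u(t | x) = S0(t * exp(-beta * x))."""
    return base_survival(t * np.exp(-beta * x))

def population_survival(t, x, z, beta, gamma, base_survival):
    """Mixture cure survival: the cured fraction 1 - p(z) never
    experiences the event, the uncured fraction follows the latency."""
    p = incidence(z, gamma)
    return (1.0 - p) + p * latency_survival(t, x, beta, base_survival)
```

Note the characteristic plateau: as t grows, the population survival levels off at the cure fraction 1 − p(z) instead of decaying to zero.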
In many real-life applications, it is of interest to study how the distribution of a (continuous) response variable changes with covariates. Dependent Dirichlet process (DDP) mixtures of normals, a Bayesian nonparametric method, successfully address this goal. The approach with covariate-independent mixture weights, also known as the single-weights dependent Dirichlet process mixture model, is very popular due to its computational convenience but can have limited flexibility in practice. To overcome this lack of flexibility while retaining computational tractability, this work develops a single-weights DDP mixture of normals model in which the component means are modelled using Bayesian penalised splines (P-splines). We coin our approach psDDP. A practically important feature of psDDP models is that all parameters have conjugate full conditional distributions, leading to straightforward Gibbs sampling. In addition, they allow the effect associated with each covariate to be learned automatically from the data. The validity of our approach is supported by simulations, and the method is applied to a study concerning the association of a toxic metabolite with preterm birth.
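The single-weights structure means the conditional density is f(y | x) = Σ_k w_k N(y; m_k(x), σ_k²): the weights are free of x, and only the component means move with the covariate. A toy evaluation of such a density is sketched below; the supplied mean functions stand in for the paper's penalised-spline fits, and nothing here is the psDDP sampler itself:

```python
import numpy as np

def mixture_density(y, x, weights, mean_funcs, sds):
    """Single-weights DDP-style mixture of normals: covariate-free
    weights, covariate-dependent component means m_k(x)."""
    dens = 0.0
    for w, m, s in zip(weights, mean_funcs, sds):
        mu = m(x)  # e.g. a P-spline fit in the actual model
        dens += w * np.exp(-0.5 * ((y - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return dens
```

Because only the means depend on x, shifting the covariate slides the mixture components along the y-axis without reshaping the weights, which is exactly the flexibility/tractability trade-off the abstract describes.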
Over-parameterized models, in particular deep networks, often exhibit a "double descent" phenomenon: as a function of model size, error first decreases, then increases, and finally decreases again. This intriguing double-descent behavior also occurs as a function of training time, and it has been conjectured that such "epoch-wise double descent" arises because training time controls the model complexity. In this paper, we show that double descent arises for a different reason: it is caused by two overlapping bias-variance tradeoffs that arise because different parts of the network are learned at different speeds.
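The claimed mechanism can be mimicked with a toy risk curve: give each "part" of the network a U-shaped risk over training time (bias decays at the part's learning speed, variance grows more slowly as it fits noise) and sum a fast and a slow part. The functional forms and constants below are invented purely to illustrate the overlap effect and are not the paper's model:

```python
import numpy as np

def part_risk(t, speed):
    """Toy risk of one part of the network at training time t:
    squared bias decays at the part's learning speed, while its
    variance grows (more slowly) as the part starts fitting noise."""
    bias2 = np.exp(-speed * t)
    variance = 0.5 * (1.0 - np.exp(-0.5 * speed * t))
    return bias2 + variance

def total_risk(t, speeds=(1.0, 0.01)):
    """Sum of part-wise risks for a fast and a slow part; the overlap
    of the two bias-variance tradeoffs makes the total non-monotone."""
    return sum(part_risk(t, s) for s in speeds)
```

On this toy curve the total risk falls quickly at first (the fast part learns), bumps back up as the fast part's variance grows, and descends again once the slow part's bias finally decays: an epoch-wise double descent from two overlapping tradeoffs.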
Thousands of researchers use social media data to analyze human behavior at scale. The underlying assumption is that millions of people leave digital traces, and that by collecting these traces we can reconstruct the activities, topics, and opinions of groups or societies. Some data biases are obvious. For instance, most social media platforms do not represent the socio-demographic makeup of society. Social bots can also obscure actual human activity on these platforms. Consequently, it is not trivial to use social media analyses to draw conclusions about societal questions. In this presentation, I will focus on a more specific question: do we even get good social media samples? In other words, does the social media data available to researchers represent the overall platform activity? I will show how nontransparent sampling algorithms create non-representative data samples and how technical artifacts of hidden algorithms can create surprising side effects with potentially devastating implications for data sample quality.
I will start with a general motivation for cause-effect estimation and describe common challenges such as identifiability. We will then take a closer look at the instrumental variable setting and at how an instrument can help with identification. Most approaches to achieving identifiability require one-size-fits-all assumptions such as an additive error model for the outcome. Instead, I will present a framework for partial identification, which provides lower and upper bounds on the causal treatment effect. Our approach leverages advances in gradient-based optimization for the non-convex objective and works in the most general case, where instrument, treatment, and outcome are continuous. Finally, we demonstrate on synthetic and real-world data that our bounds capture the causal effect when additive methods fail, providing a useful range of answers compatible with the observations rather than relying on unwarranted structural assumptions.
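For contrast with the bounds approach, here is what the one-size-fits-all additive-error assumption buys: under y = θt + u with an instrument z correlated with t but not with u, the effect θ is point-identified as cov(z, y)/cov(z, t). This is the textbook scalar instrumental-variable (Wald/2SLS) estimator, sketched as a baseline, not the partial-identification method of the talk:

```python
import numpy as np

def iv_estimate(z, t, y):
    """Simple instrumental-variable estimate of the treatment effect,
    cov(z, y) / cov(z, t). Point identification hinges on the
    additive error model for the outcome."""
    z, t, y = (np.asarray(a, dtype=float) for a in (z, t, y))
    zc = z - z.mean()
    return (zc @ (y - y.mean())) / (zc @ (t - t.mean()))
```

On simulated data with a hidden confounder, this recovers the true effect where a naive regression of y on t is biased, which is the basic promise of an instrument; the talk's framework then asks what remains identifiable once the additive model is dropped.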
We analyse the temporal and regional structure of COVID-19 infections, making use of the openly available data on registered cases in Germany published daily by the Robert Koch Institute (RKI). We demonstrate the necessity of applying nowcasting to cope with delayed reporting, which occurs because local health authorities report infections with delay due to delayed test results, delayed reporting chains, or other issues not controllable by the RKI. A reporting delay also occurs for fatal cases, where the death occurs after the infection is registered (unless post-mortem tests are applied). The talk gives a general discussion of nowcasting and applies it in two settings. First, we derive an estimate of the number of present-day infections that will, at a later date, prove to be fatal. Our district-level modelling approach allows us to disentangle spatial variation into a global pattern for Germany, district-specific long-term effects, and short-term dynamics, taking the demographic composition of the local population into account. This is joint work with Marc Schneble, Giacomo De Nicola & Ursula Berger. The second application combines nowcasting with forecasting of infection numbers. This leads to a fore-nowcast, which is motivated methodologically. The method is suitable for all data reported with delay, and we demonstrate its usability on COVID-19 infections.
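The basic nowcasting idea, correcting recent counts for cases not yet reported, can be sketched with a simple multiplicative estimator: learn the reporting-delay distribution from days that are already completely observed, then divide each incomplete recent count by the estimated probability that a case would have been reported by now. This is a deliberately simple stand-in for illustration, not the district-level model of the talk:

```python
import numpy as np

def nowcast(reported, max_delay):
    """Multiplicative nowcast sketch.
    reported[t, d] = cases with event day t reported after delay d
    (zero where the report would lie in the future)."""
    reported = np.asarray(reported, dtype=float)
    T = reported.shape[0]
    # Days old enough that every delay has been observed.
    complete = reported[: T - max_delay]
    delay_pmf = complete.sum(axis=0) / complete.sum()
    cum = np.cumsum(delay_pmf)                 # P(delay <= d)
    totals = np.empty(T)
    for t in range(T):
        observed_d = min(max_delay, T - 1 - t)  # largest delay seen for day t
        totals[t] = reported[t, : observed_d + 1].sum() / cum[observed_d]
    return totals
```

If the delay distribution is stable over time, the scaled-up recent counts are unbiased for the eventual totals; in practice (as in the talk) the delay structure itself has to be modelled, since it varies across districts and over time.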
Multivariate time series exhibit two types of dependence: across variables and across time points. Vine copulas are graphical dependence models that can conveniently capture both types of dependence in the same model. We derive the maximal class of graph structures that guarantees stationarity under a condition called translation invariance. Translation invariance is not only a necessary condition for stationarity, but also the only condition we can reasonably check in practice. In this sense, the new model class characterizes all practically relevant vine structures for modeling stationary time series. We propose computationally efficient methods for estimation, simulation, prediction, and uncertainty quantification and show their validity by asymptotic results and simulations. The theoretical results allow for misspecified models and, even when specialized to the iid case, go beyond what is available in the literature. The new model class is illustrated by an application to forecasting returns of a portfolio of 20 stocks, where it shows excellent forecast performance. The paper is accompanied by an open-source software implementation.
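As a minimal illustration of serial dependence through a single pair copula (the building block of the vine constructions above), the sketch below simulates a stationary first-order Markov chain with uniform margins whose lag-one dependence is a bivariate Gaussian copula. This is a toy univariate example, not the stationary vine model of the paper:

```python
import numpy as np
from math import erf, sqrt

def gaussian_copula_markov(n, rho, seed=None):
    """Stationary Markov chain on uniform margins: a latent AR(1)
    process with unit marginal variance is pushed through the standard
    normal CDF, so consecutive values follow a Gaussian copula with
    parameter rho while each margin stays exactly uniform."""
    rng = np.random.default_rng(seed)
    z = np.empty(n)
    z[0] = rng.normal()  # start in the stationary distribution N(0, 1)
    for t in range(1, n):
        z[t] = rho * z[t - 1] + sqrt(1.0 - rho ** 2) * rng.normal()
    # Phi(z): probability integral transform to uniform margins.
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])
```

Decoupling the margins (uniform here) from the dependence (the copula on consecutive values) is exactly the modelling split that vine copulas extend to many variables and many lags at once.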