Research Showcase Symposium
April 1, 2022
1680 & 1690 SPHI
The 2022 Senior PhD Student Research Showcase Symposium will feature current research by our PhD students who are graduating this year.
Additional details will be shared as they become available.
| Ph.D. Candidate Speaker | Postdoc Facilitator |
| --- | --- |
| Jiaqiang Zhu | Fred Boehm |
| Emily Roberts | Xianshi Yu |
| Jung Yeon Won | Yan Li |
| Jingyi Zhai | Fred Boehm |
| Andrew Whiteman | Jiyeon Song |
| Jonathan Boss | Jade Xiaoqing Wang |
| Guangyu Yang | Kendrick Li |
| Yi Zhao | Jiyeon Song |
| Tianwen Ma | Satwik Acharyya |
| Hengshi Yu | Kalins Banerjee |
| Wenbo Wu | Chen Shen |
| Pedro Orozco del Pino | Satwik Acharyya |
Featuring presentations from:
- Jonathan Boss
- Tianwen Ma
- Pedro Orozco del Pino
- Emily Roberts
- Andrew Whiteman
- Jung Yeon Won
- Wenbo Wu
- Guangyu Yang
- Hengshi Yu
- Yi Zhao
- Jingyi Zhai
- Jiaqiang Zhu
Mediation Analysis with External Summary-Level Information on the Total Effect
As modern assaying technologies continue to improve, environmental health studies are increasingly measuring endogenous omics data to study intermediary biological pathways of exposure-outcome associations. Mediation analysis is often carried out when there is a well-established literature showing the statistical and practical significance of the association between an exogenous exposure and a health outcome of interest, i.e., the total effect. For example, there is a plethora of studies associating maternal phthalate exposure with preterm delivery, and researchers are now trying to characterize the mechanisms by which phthalate exposure impacts gestational age at the time of delivery. Existing methodology for performing mediation analyses does not leverage the rich external information available on the total effect. The first goal of this presentation is to show that incorporating external summary-level information on the total effect improves estimation efficiency of the direct and indirect effects, provided that the outcome-mediator association conditional on exposure is non-zero. The second goal is to discuss how to make the method more robust when the external information on the total effect is incongruous with the internal information. The proposed framework blends mediation analysis with data integration techniques.
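The familiar decomposition of a total effect into direct and indirect components can be sketched in a few lines. The simulation below is purely illustrative (all effect sizes are invented), and it shows only the standard internal-data analysis, not the external-information method proposed in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate a simple linear mediation model (illustrative values, not from the talk):
# exposure X -> mediator M -> outcome Y, plus a direct path X -> Y.
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)            # mediator model: alpha = 0.5
y = 0.3 * x + 0.4 * m + rng.normal(size=n)  # outcome model: direct = 0.3, beta = 0.4

def ols_coefs(columns, response):
    """Least-squares coefficients for a design matrix with an intercept."""
    X = np.column_stack([np.ones(len(response))] + list(columns))
    return np.linalg.lstsq(X, response, rcond=None)[0]

alpha = ols_coefs([x], m)[1]            # X -> M
_, direct, beta = ols_coefs([x, m], y)  # X -> Y given M, and M -> Y given X
indirect = alpha * beta                 # product-of-coefficients indirect effect
total = ols_coefs([x], y)[1]            # marginal X -> Y: the total effect

# For linear models fit on the same data, total = direct + indirect exactly.
print(direct, indirect, total)
```

In linear models the identity total = direct + indirect holds algebraically, which is what makes external summary-level information on the total effect informative about the other two quantities.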
Bayesian Inferences on Neural Activity in EEG-Based Brain-Computer Interface
A brain-computer interface (BCI) is a system that translates brain activity into commands to operate technology. A common design for an electroencephalogram (EEG) BCI relies on the classification of the P300 event-related potential (ERP), a response elicited by the rare occurrence of target stimuli among common non-target stimuli. Few existing ERP classifiers directly explore the underlying mechanism of the neural activity. To this end, we perform a novel Bayesian analysis of the probability distribution of real multi-channel EEG signals under the P300 ERP-BCI design. We aim to identify relevant spatial-temporal differences in the neural activity, which provides statistical evidence of P300 ERP responses and helps design individually efficient and accurate BCIs. As one key finding of our single-participant analysis, there is a 90% posterior probability that the target ERPs of the channels around the visual cortex reach their negative peaks around 200 milliseconds post-stimulus. Our analysis identifies five important channels (PO7, PO8, Oz, P4, Cz) for the BCI speller, leading to 100% prediction accuracy. From the analyses of nine other participants, we consistently select the same five channels, and the selection frequencies are robust to small variations of bandpass filters and kernel hyper-parameters.
Transferability of Polygenic Risk Scores Across Populations of Different Ancestry
Research has found evidence that polygenic risk scores (PRS) can identify at-risk individuals whom non-genetic tools cannot, so clinical interest in PRS is increasing rapidly. A PRS estimates an individual's risk of developing a disease as a weighted sum of the number of risk variants found in their genome, with weights given by summary statistics obtained from genome-wide association studies (GWAS). Since most existing GWAS samples are of European ancestry, current PRS have lower predictive ability in populations that are more genetically distant from Europeans. Thus, diagnosing and treating based on current PRS may increase health disparities.
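The weighted-sum construction of a PRS can be sketched directly; all effect sizes and genotypes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical GWAS summary statistics: per-variant effect sizes (e.g., log odds ratios).
gwas_weights = np.array([0.12, -0.08, 0.20, 0.05, -0.15])

# Genotypes for 4 individuals at the same 5 variants, coded as 0/1/2 risk-allele counts.
genotypes = rng.integers(0, 3, size=(4, 5))

# A PRS is the weighted sum of risk-allele counts, weighted by the GWAS effect sizes.
prs = genotypes @ gwas_weights
print(prs)  # one score per individual
```

When the weights come from a GWAS in one ancestry group, differences in allele frequencies and linkage disequilibrium patterns in another group degrade the score's predictive accuracy, which is the transferability problem studied here.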
We want to understand the factors that contribute to the loss of predictive ability of PRS across populations. Assuming the underlying disease architecture is the same across populations, we show through systematic and extensive simulations that increasing the sample size of European GWAS is suboptimal for overcoming the loss of PRS transferability. Thus, we highlight the importance of fine mapping, functional data, diverse cohorts in GWAS, and methods that integrate GWAS from different populations. In the latter case, this means using a large-sample GWAS to increase the predictive ability of a PRS built from a much smaller GWAS. We present a Bayesian model that dynamically borrows information from a large-sample external GWAS to increase the predictive ability of the PRS in a small-sample GWAS target population. The model explicitly protects against the risk of the large external sample overpowering the information from the target population. We will also show that the model's flexibility allows it to accommodate different biological assumptions.
Causal Inference Methods to Validate Surrogate Endpoints with Time-to-Event Data
A common practice in clinical trials is to evaluate a treatment effect on an intermediate endpoint when the true outcome of interest would be difficult or costly to measure. We consider how to incorporate intermediate endpoints in a causally valid way. Using counterfactual outcomes (those that would be observed if the counterfactual treatment had been given), the causal association paradigm assesses the relationship of the treatment effect on the surrogate S with the treatment effect on the true endpoint T. In particular, we propose illness-death models to accommodate the censored and semi-competing-risk structure of survival data. We assess the estimation properties of a Bayesian method using Markov chain Monte Carlo.
Bayesian Inference for Brain Activity from Functional Magnetic Resonance Imaging Collected at Two Spatial Resolutions
Neuroradiologists and neurosurgeons increasingly opt to use functional magnetic resonance imaging (fMRI) to map functionally relevant brain regions for noninvasive presurgical planning and intraoperative neuronavigation. This application requires a high degree of spatial accuracy, but the fMRI signal-to-noise ratio (SNR) decreases as spatial resolution increases. In practice, fMRI scans can be collected at multiple spatial resolutions, and it is of interest to make more accurate inference on brain activity by combining data with different resolutions. To this end, we develop a new Bayesian model to leverage both the better anatomical precision of high resolution fMRI and the higher SNR of standard resolution fMRI. We assign a Gaussian process prior to the mean intensity function and develop an efficient, scalable posterior computation algorithm to integrate both sources of data. We draw posterior samples using an algorithm analogous to Riemann manifold Hamiltonian Monte Carlo in an expanded parameter space. We illustrate our method in an analysis of presurgical fMRI data, and show in simulation that it infers the mean intensity more accurately than alternatives that use either the high or standard resolution fMRI data alone.
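As a much simpler stand-in for the full Gaussian process model, the sketch below illustrates why combining the two resolutions helps: under independent Gaussian noise, a precision-weighted average of the two sources estimates a common mean intensity better than either source alone. All noise levels and sample sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
true_intensity = 1.5

# Two noisy measurement sets of the same mean intensity (hypothetical noise levels):
# high-resolution fMRI is noisier (lower SNR), standard resolution is cleaner.
sigma_high, sigma_std = 1.0, 0.4
y_high = true_intensity + sigma_high * rng.normal(size=200)
y_std = true_intensity + sigma_std * rng.normal(size=200)

# Under independent Gaussian noise, the precision-weighted average is the
# maximum-likelihood combination (and the conjugate posterior mean under a flat prior).
w_high = len(y_high) / sigma_high**2
w_std = len(y_std) / sigma_std**2
combined = (w_high * y_high.mean() + w_std * y_std.mean()) / (w_high + w_std)

print(combined)  # closer to the truth, on average, than either source alone
```

The Bayesian model in the talk generalizes this intuition to a spatially varying intensity function with a Gaussian process prior rather than a single scalar mean.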
Bias reduction method for estimating the effect of latent time-varying count exposures using multiple lists
A major challenge in longitudinal built-environment health studies is the accuracy of the commercial business databases used to characterize dynamic food environments. These databases may miss existing businesses or include businesses that either no longer exist or never existed in a given area. Moreover, different databases often provide conflicting exposure measures for the same area due to differing source credibility. As on-site verification is not feasible for historical data, we suggest combining multiple databases to compensate for incomplete listings. Given that these databases often do have external references for quality (specifically, their sensitivity and positive predictive value), we propose a joint model for the time-varying health outcomes, the observed count exposures, and the latent true count exposures. Our model estimates the time-specific quality of each source and incorporates the time dependence of the true count exposure through a Poisson integer-valued first-order autoregressive (INAR(1)) process. We take a Bayesian nonparametric approach to flexibly account for location-specific exposures. By resolving the discordance between databases, our method reduces bias in estimating the longitudinal health effect of the true exposures. We demonstrate our method on childhood obesity data from California public schools with respect to convenience store exposures in school neighborhoods from 2001 to 2008.
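The Poisson INAR(1) process used for the latent exposure series can be simulated in a few lines; the parameters below are illustrative, not estimates from the study:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_inar1(n_steps, alpha, lam):
    """Poisson INAR(1): X_t = (binomial thinning of X_{t-1}) + Poisson innovations.
    Each of the X_{t-1} counts survives to time t with probability alpha,
    and eps_t ~ Poisson(lam) new counts arrive."""
    x = np.empty(n_steps, dtype=int)
    # The stationary marginal of a Poisson INAR(1) is Poisson with mean lam / (1 - alpha).
    x[0] = rng.poisson(lam / (1 - alpha))
    for t in range(1, n_steps):
        survivors = rng.binomial(x[t - 1], alpha)
        x[t] = survivors + rng.poisson(lam)
    return x

# e.g. yearly store counts near one school over 2001-2008 (hypothetical rates)
counts = simulate_inar1(8, alpha=0.8, lam=1.0)
print(counts)
```

Binomial thinning keeps the series integer-valued, which is why INAR models are a natural fit for count exposures such as numbers of nearby stores.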
Analysis of Hospital Readmissions with Competing Risks
The 30-day hospital readmission rate has been used in provider profiling to evaluate inter-provider care coordination, medical cost effectiveness, and patient quality of life. Current profiling analyses use logistic regression to model readmission as a binary outcome without explicitly considering competing risks (e.g., death) or event times. Overlooking competing risks and event times leads to less comprehensive modeling and distorted provider evaluation. To address these drawbacks, we propose a discrete-time competing risk model wherein the cause-specific readmission hazard is used to assess provider-level effects. This readmission-focused assessment uses the standardized readmission ratio as the associated quality measure, which is not systematically affected by the rate of competing risks. Most existing methods do not explicitly account for competing risks; an unintended consequence is that a given provider may appear to have a lower readmission rate simply because it has a higher rate of competing risks. To facilitate estimation and inference for a large number of provider effects, we develop an efficient Blockwise Inversion Newton algorithm and a stabilized robust score test that overcomes the conservative nature of the classical robust score test. An application to dialysis patients demonstrates improved profiling, model fitting, and outlier detection over existing methods.
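A standardized readmission ratio is, at its core, a ratio of observed to model-expected events. The sketch below illustrates that idea with simulated data; it is not the paper's cause-specific hazard model or its estimation algorithm, and all probabilities are invented:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical patient-level data for one provider: each patient has a
# 30-day readmission probability predicted by a reference (all-provider)
# risk model, plus an observed readmission indicator.
expected_prob = rng.uniform(0.05, 0.35, size=500)
# Simulate a provider that readmits ~30% more often than the reference model predicts.
observed = rng.binomial(1, np.clip(expected_prob * 1.3, 0, 1))

# Standardized readmission ratio: observed events / expected events.
# SRR > 1 suggests more readmissions than the reference model predicts.
srr = observed.sum() / expected_prob.sum()
print(srr)
```

The point made in the abstract is that when the expected counts come from a model ignoring competing risks, a provider with high mortality mechanically accrues fewer readmissions and looks spuriously good; the cause-specific hazard formulation avoids that distortion.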
Estimation of Knots in Linear Spline Models
The linear spline model accommodates nonlinear effects while allowing for easy interpretation, with significant applications in studying threshold effects and change-points. However, its use in practice has been limited by the lack of a knot-estimation method that is both rigorously studied and computationally convenient. A key difficulty in estimating knots lies in the nondifferentiability of the model with respect to the knots. In this article, we study influence functions of regular and asymptotically linear estimators for linear spline models using semiparametric theory. Based on this theoretical development, we propose a simple semismooth estimating equation approach that circumvents the nondifferentiability issue using modified derivatives, in contrast to previous smoothing-based methods. Because it does not rely on any smoothing parameters, the proposed method is computationally convenient. To further improve numerical stability, we develop a two-step algorithm that exploits the analytic solution available when the knots are known. Consistency and asymptotic normality are rigorously derived using empirical process theory. Simulation studies show that the two-step algorithm performs well in terms of both statistical and computational properties and improves over existing methods.
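As a simple baseline for knot estimation (not the semismooth estimating equation proposed here), one can exploit the closed-form least-squares fit available once the knot is fixed and profile the knot over a grid; all simulation parameters below are invented:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Simulate a linear spline with a single knot at 2.0 (illustrative values):
# y = b0 + b1 * x + b2 * max(x - knot, 0) + noise
x = rng.uniform(0, 4, size=n)
y = 1.0 + 0.5 * x + 1.5 * np.maximum(x - 2.0, 0) + 0.3 * rng.normal(size=n)

def fit_given_knot(knot):
    """With the knot fixed, the model is linear in its coefficients,
    so ordinary least squares gives a closed-form fit."""
    X = np.column_stack([np.ones(n), x, np.maximum(x - knot, 0)])
    coefs, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    return rss[0], coefs

# Profile the knot over a grid; pick the candidate minimizing the residual sum of squares.
grid = np.linspace(0.5, 3.5, 61)
rss_values = [fit_given_knot(k)[0] for k in grid]
knot_hat = grid[int(np.argmin(rss_values))]
print(knot_hat)  # close to the true knot 2.0
```

The grid search makes the nondifferentiability in the knot concrete: the fit criterion is smooth in the coefficients but only piecewise smooth in the knot, which is exactly the difficulty the semismooth approach is designed to handle without a grid or smoothing parameters.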
Predicting unobserved cell states from disentangled representations of single-cell data using generative adversarial networks
Deep generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) generate and manipulate high-dimensional images. VAEs excel at learning disentangled image representations, while GANs excel at generating realistic images. We systematically assess the disentanglement and generation performance of these models on single-cell gene expression data and find that their respective strengths and weaknesses carry over to this setting. We also develop MichiGAN, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of three large single-cell RNA-seq datasets and use MichiGAN to sample from these representations. MichiGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment.
Statistical Methods for Replicability Assessment
The replicability of scientific discoveries is a hallmark of scientific research. Although its criticality is widely appreciated in the scientific community, a precise definition of replicability and methods for quantitative assessment are still lacking. In our work, we re-examine the characterization of replicable signals in a setting where each experimental unit is assessed with a signed effect size estimate. The proposed approaches are built on a novel definition of replicability, which emphasizes directional consistency. Based on the definition, we discuss some inference principles for replicability assessment and develop statistical tools to apply to various important statistical and scientific problems.
Statistical methods for gene differential expression detection and cell trajectory reconstruction from single-cell RNA sequencing data
Single-cell RNA sequencing (scRNA-seq) is a recent technological advancement that enables the measurement of gene expression at the single-cell level. Gene differential expression (DE) detection and cell trajectory reconstruction (TR) are two common tasks in the analysis of scRNA-seq data. However, the data generated by this new technology create many challenges for statistical analysis due to both technological and biological variation; for instance, read counts contain inflated numbers of zeros, and gene expression distributions are often multimodal. In this dissertation, we propose novel statistical methods for DE and TR analyses of scRNA-seq data. In Chapter II, we introduce a semi-parametric statistical model to detect differentially expressed genes between two biological conditions from scRNA-seq data. We model the read counts as zero-inflated Poisson variables with means following flexible distributions in the exponential family using g-modeling. A two-sample Kolmogorov-Smirnov test statistic is used to measure the discrepancy between the two biological conditions, and the bootstrap method is used to estimate the statistical significance of the test statistic. Simulated and real data analyses are performed to assess the performance of the proposed method. In Chapter III, we develop a new TR method to estimate a tree-structured cell trajectory from scRNA-seq data. We derive a penalized likelihood framework and a stochastic optimization algorithm to search through the non-convex tree space and obtain the global solution. We compare our approach with existing methods using simulated and real scRNA-seq data sets. In Chapter IV, we extend the algorithm developed in Chapter III to tree-structured cell trajectories with non-linear edges. We model each edge with a principal curve bounded by its two vertices. An L1-penalty on the principal-curve parameters constrains the total degrees of freedom of the curved embedding tree, and the stochastic search algorithm of Chapter III is modified accordingly to optimize the parameters. We will apply the proposed method to simulated and real scRNA-seq data sets to evaluate its performance.
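The Chapter II setup can be sketched with simulated data: zero-inflated Poisson read counts for one gene under two conditions, compared with a two-sample Kolmogorov-Smirnov statistic. A permutation test stands in for the bootstrap calibration used in the dissertation, and the g-modeling of the mean distribution is omitted; all parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(6)

def zip_counts(n_cells, pi_zero, lam):
    """Zero-inflated Poisson reads: excess zeros with probability pi_zero, else Poisson(lam)."""
    dropout = rng.random(n_cells) < pi_zero
    counts = rng.poisson(lam, size=n_cells)
    counts[dropout] = 0
    return counts

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    values = np.union1d(a, b)
    cdf_a = np.searchsorted(np.sort(a), values, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), values, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

# One gene's read counts under two biological conditions (hypothetical parameters).
cond_a = zip_counts(300, pi_zero=0.4, lam=2.0)
cond_b = zip_counts(300, pi_zero=0.4, lam=4.0)  # higher expression in condition B

obs = ks_statistic(cond_a, cond_b)

# Resampling-based significance: shuffle condition labels and recompute the statistic.
pooled = np.concatenate([cond_a, cond_b])
null_stats = []
for _ in range(500):
    rng.shuffle(pooled)
    null_stats.append(ks_statistic(pooled[:300], pooled[300:]))
pvalue = (np.array(null_stats) >= obs).mean()
print(obs, pvalue)  # a small p-value suggests differential expression
```

Comparing whole distributions with a KS-type statistic, rather than only means, is what lets such tests pick up the multimodality and zero inflation described above.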
Statistical Analysis of Spatial Expression Pattern in Spatial Transcriptomics
Spatial transcriptomics is a collection of groundbreaking genomics technologies that enable the measurement of gene expression with spatial localization information in tissues or cell cultures. Identifying genes that display spatial expression patterns in spatially resolved transcriptomic studies is an important first step towards characterizing the spatial transcriptomic landscape of complex tissues. Here, I will describe two statistical methods, SPARK and SPARK-X, for identifying such spatially expressed genes in data generated by various spatially resolved transcriptomic techniques. SPARK directly models spatial count data through generalized linear spatial models, while SPARK-X models the data in a non-parametric fashion to ensure scalable computation. Both SPARK and SPARK-X provide effective type I error control and yield high statistical power. I will illustrate the benefits of SPARK and SPARK-X on published spatially resolved transcriptomic data sets and show that these methods can lead to new biological findings that cannot be revealed by existing approaches.