Senior PhD Student Research Showcase
April 3, 2023
9:00am - 4:00pm
SPH I, Room 1680
The 2023 Senior PhD Student Research Showcase will feature the current research of our PhD students who are graduating this year.
Each of the four sessions scheduled for the Senior PhD Student Research Showcase will feature four PhD candidates delivering 10-minute presentations about their research. In the remaining time in each session (approximately 15 minutes), the session chair, a departmental postdoctoral researcher, will facilitate discussion among the presenters and audience members.
| EVENT SCHEDULE | |
|---|---|
| 9:00am - 9:30am | Breakfast Social |
| SESSION ONE 9:30am - 10:30am | Irena Chen: Individual Variances as a Predictor of Health Outcomes: A Hierarchical Bayesian Approach |
| | Tsung-Hung Yao: Bayesian Learning of Structured Covariances with Applications to Cancer Data |
| | Rupam Bhattacharyya: Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data |
| | Yibo Wang: A Latent Variable Model for Individual Degree Estimation in Respondent-Driven Sampling |
| 10:30am - 10:45am | Coffee/Refreshment Break |
| SESSION TWO 10:45am - 11:45am | Margaret Banker: Regularized Simultaneous Estimation of Changepoint and Functional Parameter in Functional Accelerometer Data Analysis |
| | Yuhua Zhang: Statistical Modeling of Large-Scale Network Data |
| | Stephen Salerno: Novel Deep Learning Approaches for Semi-Competing Risk Prediction |
| | Jieru Shi: A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation |
| 12:00pm - 1:30pm | Lunch |
| SESSION THREE 1:30pm - 2:30pm | Ying Ma: Statistical and Computational Methods for High-Dimensional Genomics Data |
| | Lulu Shang: Statistical Methods for Genetic and Genomic Studies |
| | Lam Tran: Approaches for Constrained Variable Selection in Large Datasets |
| | Lap Sum Chan: Censoring-based Differential Abundance Analysis for Microbiome Data |
| 2:30pm - 2:45pm | Coffee/Refreshment Break |
| SESSION FOUR 2:45pm - 3:45pm | Yuqi Zhai: Improving Estimation Efficiency by Integrating External Summary Information from Heterogeneous Populations |
| | Elizabeth Chase: Modeling Basal Body Temperature Data Using Horseshoe Process Regression |
| | Xuemei Ding: Models and Methods for Analyzing Clustered Recurrent Hospitalizations in the Presence of COVID-19 Effects |
| | Fatema Shafie Khorassani: Data Fusion for Time-to-Event Outcomes |
| 3:45pm - 4:00pm | Closing Remarks |
Margaret Banker
Regularized Simultaneous Estimation of Changepoint and Functional Parameter in Functional Accelerometer Data Analysis
Abstract coming soon.
Rupam Bhattacharyya
Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data
Large-scale multi-omics datasets offer complementary, partly independent, high-resolution views of the human genome. Modeling and inference using such data pose challenges such as high dimensionality and structured dependencies, but offer potential for understanding the complex biological processes characterizing a disease. We propose fiBAG, an integrative hierarchical Bayesian framework for modeling the fundamental biological relationships underlying such cross-platform molecular features. Using Gaussian processes, fiBAG identifies mechanistic evidence for covariates from corresponding upstream information. Such evidence, mapped to prior inclusion probabilities, informs a calibrated Bayesian variable selection (cBVS) model identifying genes/proteins associated with the outcome. Simulation studies illustrate that cBVS has higher power to detect disease-related markers than non-integrative approaches. A pan-cancer analysis of 14 TCGA cancer datasets is performed to identify markers associated with cancer stemness and patient survival. Our findings include both known associations, such as the role of RPS6KA1/p90RSK in gynecological cancers, and novel ones, such as EGFR in gastrointestinal cancers.
Lap Sum Chan
Censoring-based Differential Abundance Analysis for Microbiome Data
Abstract coming soon.
Elizabeth Chase
Modeling Basal Body Temperature Data Using Horseshoe Process Regression
Biomedical data often exhibit jumps or abrupt changes. For example, women’s basal body temperature may jump at the time of ovulation and menstruation. These sudden changes make the data challenging to model: many methods will oversmooth the sharp changes or overfit in response to measurement error. We develop horseshoe process regression (HPR) to address this problem. We define a horseshoe process as a stochastic process in which each increment is horseshoe-distributed. We use the horseshoe process as a nonparametric Bayesian prior for modeling a potentially nonlinear association between an outcome and its continuous predictor. We find that HPR performs well when fitting functions that have sharp changes, such as women’s basal body temperature trajectory. We apply HPR to model women’s basal body temperatures over the course of the menstrual cycle and propose modifications to more fully incorporate prior information about basal body temperature patterns.
Irena Chen
Individual Variances as a Predictor of Health Outcomes: A Hierarchical Bayesian Approach
Modeling variability as a predictor may provide critical information about disease risk and health outcomes. Existing methods for longitudinal data limit scientists’ ability to leverage subject-level biomarker variability for predicting health outcomes. In this talk, I will describe a joint modeling framework that estimates subject-level means and variances of multiple longitudinal predictors in order to predict an outcome of interest. This framework enables systematic investigation of the role of multi-marker variability in health outcomes. I will also present a simulation study in which the model demonstrates excellent recovery of true parameters. Finally, I will present a concrete application of this model to women’s health, where we investigate the effects of individual estradiol and follicle-stimulating hormone variabilities and co-variability on women’s fat distribution over the course of menopause. I will also outline ongoing and future research directions for modeling subject-level variances.
Xuemei Ding
Models and Methods for Analyzing Clustered Recurrent Hospitalizations in the Presence of COVID-19 Effects
Current methods are inadequate to analyze data from many dialysis facilities with multiple hospitalizations, especially when adjustments are needed for multiple time scales. We propose a method that has a flexible baseline rate function and is computationally efficient. The proposed method demonstrates substantially improved computational efficiency over the existing R package survival in simulations. Finally, we illustrate the method with an important application to monitoring dialysis facilities in the U.S., while making time-dependent adjustments for COVID-19’s effects.
Ying Ma
Statistical and Computational Methods for High-Dimensional Genomics Data
The recent explosion of transcriptomic technologies, such as single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT), has provided comprehensive cell atlases and enabled the thorough characterization of transcriptomic landscapes in tissues for mechanistic understanding of many biological processes. At the same time, improvements in transcriptomic technologies have increased both the volume and complexity of data, introducing new computational and statistical challenges for data analysis. In this talk, I will present several methods that address these challenges by capturing and dissecting the heterogeneity within cells and tissues with high statistical power and accuracy while providing new insight into the biological systems. Specifically, we develop effective and efficient statistical methods for integrative differential expression and gene set enrichment analysis in scRNA-seq studies, for spatially informed cell type deconvolution, and for integrative reference-informed tissue segmentation analysis in SRT studies. I will illustrate our methods by showing results from applications to human embryonic stem cell data, human pancreatic ductal adenocarcinoma (PDAC) data, and human dorsolateral prefrontal cortex (DLPFC) data.
Stephen Salerno
Novel Deep Learning Approaches for Semi-Competing Risk Prediction
In the era of precision medicine, time-to-event outcomes such as time to death or disease progression are routinely collected, along with risk factors that often have complex relationships. Recent emphasis has been placed on developing novel machine learning approaches for survival estimation and prognostication in settings with one outcome of interest; however, many survival processes in real applications involve multiple competing events. Semi-competing risk problems, a variant of competing risk problems, are commonly encountered in clinical studies. By semi-competing, we mean that the occurrence of one event, i.e., a non-terminal event, is subject to the occurrence of another, terminal event, but not vice versa. In this dissertation, we propose a series of deep learning approaches for survival prediction and causal inference in this setting of semi-competing risks. Our motivation comes from the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.
Fatema Shafie Khorassani
Data Fusion for Time-to-Event Outcomes
Despite significant reductions in cancer mortality over the past three decades, racial disparities in cancer-specific mortality persist. Studying factors associated with these observed disparities requires data on many variables, including demographics, healthcare access, socioeconomic status, and comorbidities. There are existing national cancer surveillance databases that each collect parts of the information needed for studying racial disparities in cancer. Integrating data from multiple sources allows us to study associations between race and cancer-specific mortality over time adjusted for important confounders. We propose a method for data fusion of time-to-event outcomes motivated by confounder adjustment when studying racial disparities in cancer-specific mortality. Data fusion is a particularly challenging problem in data integration, in which no subject has complete data on all the covariates and outcome. Some existing missing data methods have been extended to the setting of data fusion; however, they do not account for time-to-event outcomes. We present a method for regressing a time-to-event outcome on a set of covariates from two integrated datasets that include some overlapping variables. We will present a class of doubly robust estimators which are unbiased if either the data source model or the model of the unobserved covariates is specified correctly. Through simulation studies we will present the bias and coverage of our estimators under correctly specified and misspecified models and will apply the method to fuse cancer-specific mortality information from the Surveillance, Epidemiology, and End Results (SEER) Program with confounders collected in the National Cancer Database (NCDB) that are not available in SEER.
Lulu Shang
Statistical Methods for Genetic and Genomic Studies
Recent advances in array-based and sequencing-based technologies have enabled genome-wide profiling of gene expression and various epigenetic markers. Extracting valuable biological information from these various omics data types requires the development of new computational and statistical methods. My dissertation centers on developing statistical methods and analyzing various omics data. In this dissertation, we propose several effective and efficient statistical and computational methods to address critical biological problems encountered in various genomics fields, including spatial transcriptomics, single cell, and bulk RNA-seq studies. In addition, we have conducted two large-scale comprehensive quantitative trait loci (QTL) mapping studies in underrepresented African Americans in the GENOA cohort, to carefully examine how inherited genetic variation affects local gene expression and DNA methylation in underrepresented populations.
In Chapter II, I focus on data collected from various spatial transcriptomic technologies and develop a method called SpatialPCA for spatially aware dimension reduction in spatial transcriptomics. We demonstrate the advantages of SpatialPCA through spatial transcriptomics visualization, spatial domain detection, spatial trajectory inference on the tissue, and high-resolution spatial map reconstruction. In Chapter III, I continue to focus on spatial transcriptomics data and develop a method, Stella (SpaTially variable cELL type specific gene identificAtion), that enables spatially variable cell type specific gene identification for spatial transcriptomics studies. We demonstrate the ability of Stella to detect genes that display spatial expression patterns in a cell type specific fashion, providing calibrated type I error control with enhanced detection power across a variety of technical platforms. In Chapter IV, I connect genome-wide association studies (GWAS) with single cell and bulk RNA-seq data and develop a method, CoCoNet (COmposite likelihood-based COvariance regression NETwork model). CoCoNet utilizes tissue-specific gene co-expression networks to infer trait-relevant tissues by integrating GWAS and gene expression studies. We demonstrate how CoCoNet can be used to identify specific glial cell types associated with neurological disorders and disease-targeted colon tissues associated with autoimmune disorders. In Chapter V, I conduct two large-scale cis-QTL mapping studies to link genetic variants with gene expression and various epigenetic markers. We perform expression and methylation cis-QTL mapping studies on African Americans in the GENOA cohort to identify genetic variants that influence either gene expression or DNA methylation.
Our results promote diversity, equity, and inclusion in genetic research and enhance the current understanding of the genetic architecture underlying gene expression and DNA methylation in the underrepresented African American population.
Jieru Shi
A Meta-Learning Method for Estimation of Causal Excursion Effects to Assess Time-Varying Moderation
Twin revolutions in wearable technologies and smartphone-delivered digital health interventions have significantly expanded the accessibility and uptake of mobile health (mHealth) interventions in multiple health science domains. In this talk, the estimation of causal excursion effects is revisited from a meta-learner perspective, where the analyst is agnostic to the choices of supervised learning algorithms used to estimate nuisance parameters.
Lam Tran
Approaches for Constrained Variable Selection in Large Datasets
Abstract coming soon.
Yibo Wang
A Latent Variable Model for Individual Degree Estimation in Respondent-Driven Sampling
Individual network size (degree) is a crucial factor in respondent-driven sampling analysis, as it is often used as a proxy for sampling probability. However, the self-reported degree collected during the interview, which is a commonly used estimate, typically suffers from substantial measurement error. To address this issue, we propose a latent variable model that blends the analysis of reporting behaviors and responses to questions about the number of acquaintances in a particular subpopulation. We demonstrate via simulation studies that our approach provides accurate degree estimation and improves statistical inference when the estimated degree is used as the proxy for sampling probability.
Tsung-Hung Yao
Bayesian Learning of Structured Covariances with Applications to Cancer Data
The identification of scientifically driven dependence structures is of interest across many biomedical domains. Examples include tree- and graph-based structures that manifest themselves in precision medicine and genomic contexts. Such dependence structures can be compactly represented as covariance or precision matrices, which is useful for both characterization and interpretation of the complex dependencies. This presentation focuses on tree-structured dependence with an application to cancer treatments. Specifically, we propose a novel Bayesian probabilistic tree-based framework for patient-derived xenograft (PDX) data to investigate the hierarchical relationships between treatments by inferring treatment cluster trees, referred to as treatment trees (Rx-tree). The framework motivates a new metric of mechanistic similarity between two or more treatments accounting for inherent uncertainty in tree estimation; treatments with a high estimated similarity have potentially high mechanistic synergy. Building upon Dirichlet Diffusion Trees, we derive a closed-form marginal likelihood encoding the tree structure, which facilitates computationally efficient posterior inference via a new two-stage algorithm. Simulation studies demonstrate superior performance of the proposed method in recovering the tree structure and treatment similarities. The analyses of a recently collated PDX dataset produce treatment similarity estimates that show a high degree of concordance with known biological mechanisms across treatments in five different cancers. More importantly, our analysis uncovers new and potentially effective combination therapies that confer synergistic regulation of specific downstream biological pathways for future clinical investigations.
Yuqi Zhai
Improving Estimation Efficiency by Integrating External Summary Information from Heterogeneous Populations
Abstract coming soon.
Yuhua Zhang
Statistical Modeling of Large-Scale Network Data
Scientists are increasingly interested in discovering community structure from modern relational data arising on large-scale social networks. While many methods have been proposed, few account for the fact that modern networks arise from processes of interactions in the population and that interactions may fall into different categories. In this presentation, we first introduce a novel statistical model for the study of interaction networks with latent node-level community structure. In particular, this model accommodates network properties such as sparsity and power-law degree distributions, which are frequently observed in real-world networks. We then discuss a joint model that allows integration of interaction-wise prior knowledge into node-level community detection. We demonstrate the proposed models using post-comment interaction data from Talklife, a large-scale online peer-to-peer support network, by identifying its underlying online user groups.