CSBBD Research

The Center for Statistics in Biomedical Big Data (CSBBD) conducts both methodological and collaborative research, focusing on advanced statistical and computational approaches to big data challenges in the health sciences. Current research areas include integrative analysis of genomic data, methods for studying the microbiome and metagenomics, analysis of wearable biomedical device data, and high-dimensional causal inference in genomics. These efforts aim to use big data to gain insight into disease mechanisms and to improve patient care.

Focus Areas

Integrative Analysis of Genomic Data

This research area focuses on methods to quantify the genetic control over disease phenotypes and identify specific genetic factors and disrupted pathways that influence disease risk. By integrating diverse genomic, epigenomic, and metagenomic data, our researchers aim to uncover underlying mechanisms. Key applications include integrative cancer genomics, as well as multi-omics studies of cardiovascular disease, diabetes, and renal disease. The initial focus is on leveraging eQTL data to address complex issues in genetic analysis within a causal mediation framework, helping to clarify genetic contributions to complex diseases.
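
As a rough illustration of what a causal mediation analysis with eQTL data estimates, the sketch below simulates a variant whose effect on a phenotype passes partly through gene expression, then splits the total effect into an indirect (expression-mediated) and a direct part using two linear regressions. The simulated data, the effect sizes, and the helper names `ols` and `mediation_effects` are assumptions made for this example; it is a textbook product-of-coefficients sketch, not the Center's method.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X, with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef  # coef[0] is the intercept

def mediation_effects(genotype, expression, phenotype):
    """Product-of-coefficients decomposition for a single mediator."""
    a = ols(genotype, expression)[1]  # variant -> expression (eQTL effect)
    # expression -> phenotype (b) and variant -> phenotype (c), fitted jointly
    _, b, c = ols(np.column_stack([expression, genotype]), phenotype)
    return {"indirect": a * b, "direct": c, "total": a * b + c}

# Simulated data (hypothetical effect sizes, chosen only for illustration).
rng = np.random.default_rng(0)
n = 5000
genotype = rng.binomial(2, 0.3, size=n).astype(float)  # allele count 0/1/2
expression = 0.8 * genotype + rng.normal(size=n)
phenotype = 0.5 * expression + 0.2 * genotype + rng.normal(size=n)

effects = mediation_effects(genotype, expression, phenotype)
print(effects)
```

In this simulation the true indirect effect is 0.8 × 0.5 = 0.4 and the true direct effect is 0.2, so the estimates should land near those values. The decomposition is only interpretable causally under strong assumptions (linearity, no unmeasured confounding of the expression-phenotype relationship), which real eQTL studies rarely satisfy without further methodology.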

Methods for Microbiome & Metagenomics

This research area develops methods to measure, annotate, and prioritize microbial taxa and genes, examining their associations with diverse phenotypes. It also advances techniques to study microbial community dynamics and how they are affected by environmental factors and treatments. Additionally, researchers focus on uncovering the role of gut microbial metabolism in influencing susceptibility to, and treatment outcomes for, heart disease, cancer, and autoimmune diseases by integrating metagenomic and metabolomics data.

Methods for Analysis of Wearable Biomedical Device Data

This research area focuses on methods for analyzing data from wearable biomedical devices, which capture detailed information on physical activity, sleep patterns, environmental factors, physiological signals, and overall health status. These wearable systems enable clinicians to monitor patients continuously over extended periods, providing data on an unprecedented scale. A central aim is to integrate this high-density data with electronic health records (EHRs) to predict patients’ current health states, future health trajectories, and optimal interventions. Researchers are developing methods to create richer, data-driven disease profiles and generate digital phenotypes. Innovative study designs will be essential to address pressing clinical questions. Functional data analysis and deep learning approaches are anticipated to play a major role in processing this data, while cloud computing and advanced data storage solutions are necessary to support real-time health state predictions.

High-Dimensional Causal Inference in Genomics

This research area focuses on developing machine learning and high-dimensional data methods to estimate heterogeneous causal effects in genomics and disease research. It aims to create novel approaches for assessing the impact of multiple simultaneous interventions—such as gene knockouts or CRISPR-Cas9 gene editing—based on observational data modeled by unknown linear structural equations with independent errors. Initial efforts will emphasize high-dimensional graphical models and formal statistical theory for causal inference through data fusion. These methods will provide critical insights into complex genetic interactions and their causal roles in disease.
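
To make the setting concrete, the sketch below simulates a small linear structural equation model with independent errors and shows what clamping several variables at once does to a downstream outcome. It assumes the coefficient matrix `B` is known, whereas the research described above treats it as unknown and works from observational data; the four-variable graph, the coefficients, and the function name `sample_sem` are hypothetical and serve only to illustrate the quantity such methods target.

```python
import numpy as np

def sample_sem(B, n, rng, interventions=None):
    """Sample n draws from the linear SEM X_j = sum_i B[i, j] * X_i + e_j.

    Errors are independent standard normal. `interventions` maps a variable
    index to the value it is clamped at (a do-intervention).
    """
    p = B.shape[0]
    B = B.copy()
    errors = rng.normal(size=(n, p))
    for node, value in (interventions or {}).items():
        B[:, node] = 0.0         # remove all incoming edges of the clamped node
        errors[:, node] = value  # fix the node at the chosen value
    # Row form: x^T (I - B) = e^T, so solve (I - B)^T x = e for every sample.
    return np.linalg.solve((np.eye(p) - B).T, errors.T).T

# Hypothetical graph: genes 0 and 1 regulate gene 2; genes 0 and 2 affect outcome 3.
B = np.zeros((4, 4))
B[0, 2], B[1, 2] = 0.5, 0.7
B[2, 3], B[0, 3] = 1.2, 0.3

rng = np.random.default_rng(1)
observed = sample_sem(B, 20000, rng)
# Two simultaneous interventions: clamp genes 0 and 1 at -1 on the centered scale.
clamped = sample_sem(B, 20000, rng, interventions={0: -1.0, 1: -1.0})

print("mean outcome, observational:", observed[:, 3].mean())
print("mean outcome, genes 0 and 1 clamped:", clamped[:, 3].mean())
```

With these coefficients, the interventional mean of the outcome is 1.2 × (0.5 × (−1) + 0.7 × (−1)) + 0.3 × (−1) = −1.74, while the observational mean is 0. The hard part, and the subject of the research area, is recovering such interventional quantities when `B` must be inferred from high-dimensional observational data.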

Publications

Selected Publications from CSBBD Researchers

Bittinger K, Zhao C, Li Y, Ford E, Friedman ES, Ni J, Kulkarni CV, Cai J, Tian Y, Liu Q, Patterson AD, Sarkar D, Chan SHJ, Maranas C, Saha-Shah A, Lund P, Garcia BA, Mattei LM, Gerber JS, Elovitz MA, Kelly A, DeRusso P, Kim D, Hofstaedter CE, Goulian M, Li H, Bushman FD, Zemel BS, Wu GD. Bacterial colonization reprograms the neonatal gut metabolome. Nature Microbiology, 2020 Jun;5(6):838-847. doi: 10.1038/s41564-020-0694-0.

Bushman FD, Conrad M, Ren Y, Zhao C, Gu C, Petucci C, Kim MS, Abbas A, Downes KJ, Devas N, Mattei LM, Breton J, Kelsen J, Marakos S, Galgano A, Kachelries K, Erlichman J, Hart JL, Moraskie M, Kim D, Zhang H, Hofstaedter CE, Wu GD, Lewis JD, Zackular JP, Li H, Bittinger K, Baldassano R. Multi-omic Analysis of the Interaction Between Clostridioides difficile Infection and Pediatric Inflammatory Bowel Disease. Cell Host Microbe. 2020 Sep 9;28(3):422-433.e7. doi: 10.1016/j.chom.2020.07.020. 

Cao Y, Zhang A, Li H. Multisample Estimation of Bacterial Composition Matrices in Metagenomics Data. Biometrika. 2020 Mar;107(1):75-92.

Cullen CM, Aneja KK, Beyhan S, Cho CE, Woloszynek S, Convertino M, McCoy SJ, Zhang Y, Anderson MZ, Alvarez-Ponce D, Smirnova E, Karstens L, Dorrestein PC, Li H, Sen Gupta A, Cheung K, Powers JG, Zhao Z, Rosen GL. Emerging Priorities for Microbiome Research. Frontiers in Microbiology. 2020 Feb 19;11:136. doi: 10.3389/fmicb.2020.00136.

Ma R, Cai TT, Li H. Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models. J Am Stat Assoc. 2021;116(534):984-998. doi: 10.1080/01621459.2019.1699421.

Ma R, Cai TT, Li H. Optimal Permutation Recovery in Permuted Monotone Matrix Model. J Am Stat Assoc. 2021;116(535):1358-1372. doi: 10.1080/01621459.2020.1713794.

Ma R, Cai TT, Li H (2020): Optimal estimation of bacterial growth rates based on permuted monotone matrix. Biometrika, accepted.

Sheng Z, Qiu C, Liu H, Gluck C, Hsu J, He J, Hsu CY, Sha D, Weir MR, Isakova T, Raj DS, Rincon-Choles H, Feldman HI, Townsend R, Li H, Susztak K (2020): Integrated genetic-epigenetic analysis supports a causal role for inflammation in diabetic kidney disease pathogenesis. Proceedings of the National Academy of Sciences, accepted.

Wang S, Cai TT, Li H (2020): Optimal estimation of Wasserstein distance on a tree with an application to microbiome studies. Journal of the American Statistical Association, accepted.

Wang S, Cai TT, Li H (2020): Hypothesis testing for phylogenetic composition: A minimum-cost flow perspective. Biometrika, accepted.

Yarmarkovich M, Farrell A, Sison A III, Di Marco M, Raman P, Parris J, Monos DS, Lee H, Stevanovic S, Maris JM (2020): Immunogenicity and immune silence in human cancer. Frontiers in Immunology, 11:69.