Hyun Min Kang is an Associate Professor in the Department of Biostatistics. He received his PhD in Computer Science from the University of California, San Diego in 2009 and joined the University of Michigan faculty in the same year. Prior to his doctoral studies, he worked as a research fellow at the Genome Research Center for Diabetes and Endocrine Disease at Seoul National University Hospital for a year and a half, after completing his bachelor's and master's degrees in Electrical Engineering at Seoul National University. His research interests lie in big data genome science. Methodologically, his primary focus is on developing statistical methods and computational tools for large-scale genetic studies. Scientifically, his research aims to understand the etiology of complex disease traits, including type 2 diabetes, bipolar disorder, cardiovascular diseases, and glomerular diseases.
- PhD, Computer Science, University of California, San Diego, 2009
- M.S., Electrical Engineering, Seoul National University, 2000
- B.S., Electrical Engineering, Seoul National University, 1998
Research Interests & Projects
- Summary - Big Data Genome Science: My research focuses on practical, accurate, and efficient methods for big data genome science. Methodologically, my primary focus is on developing statistical methods and computational tools for large-scale genetic studies. Scientifically, my research aims to understand the molecular mechanisms of gene regulation and, subsequently, their impact on the etiology of complex disease traits. I have comprehensive expertise in the analysis of ultra-high-throughput sequence data for studying the genetics of complex traits. I have developed widely used statistical and computational methods for association analysis under hidden sample structure (EMMA, EMMAX), for analysis of high-throughput DNA sequence data (GotCloud, EPACTS, verifyBamID, cleanCall), for haplotyping and genotype imputation (EMINIM, thunderVCF), and for analysis of bulk and single-cell expression data (demuxlet, SCRUB, ICE, MMC). My integrative expertise in statistical and computational methods for sequence-based genetic studies, together with analytic experience from large-scale sequencing projects, enables me to develop practical and scalable methods for the analytic challenges of genetic studies using ultra-high-throughput sequence reads, which will be produced at an unprecedented scale in the next several years.
- Statistical methods for genome-wide association studies: My research has focused on developing statistical methods for genome-wide association studies (GWAS), especially with sequence data. I have developed efficient algorithms (EMMA, EMMAX, and ICE) for performing GWAS and eQTL analyses under linear mixed models. Together, these papers have been cited over 1,000 times to date and have motivated the development of many other association analysis tools based on linear mixed models. With the advent of sequencing technologies, I am implementing these algorithms and other widely available association analysis methods in software tools.
- Single cell transcriptomics and epigenomics: Over the past few years, my research has shifted toward understanding the molecular mechanisms of gene regulation at single-cell resolution. Recently, I developed a method (demuxlet) that enables multiplexed single-cell library preparation at population scale, allowing single-cell RNA-sequencing experimental designs that substantially reduce batch effects, doublet rates, and experimental cost. With Dr. Hojoong Kwak at Cornell University, I am looking into the genetic regulation of enhancer transcription activities, which is fascinating science in itself. I very much look forward to using my computational expertise to enable cost-effective experiments using (yet to emerge) single-cell epigenomic technologies such as ATAC-seq, PRO-seq, and ChIP-seq.
- Rapid and accurate methods and tools for big data genomics: Handling high-throughput sequence data efficiently is essential for successful genetic analysis at population scale. I developed the GotCloud (Genomes On The Cloud) sequence processing and variant calling pipeline, which produces high-quality variant calls from high-throughput DNA sequence reads. The pipeline has been applied to sequence data from hundreds of thousands of human genomes in cloud computing environments. My particular focus is on methods for tandem repeat variants enriched in repeat-rich regions of genomes. I also developed novel mixture model methods, implemented in the verifyBamID and cleanCall software packages, to detect and correct for DNA contamination from sequence reads. These methods have helped maintain the quality of sequencing data by enabling early detection of DNA contamination in a number of large-scale sequencing studies.
- Large-scale sequence analysis at population scale: In parallel with the method development efforts described above, I have also led the scientific analysis of many large-scale whole genome sequencing and genome-wide association studies. The studies in which I have played a key role include the Trans-Omics Precision Medicine (TOPMed) study, the 1000 Genomes Project, the Genetics of Type 2 Diabetes (GoT2D) study, the Bipolar Research in Deep Genome and Epigenome Sequencing (BRIDGES) study, the NHLBI Exome Sequencing Project study, the Nephrotic Syndrome Study Network (NEPTUNE) study, the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) study, the HUNT sequencing study focusing on cardiovascular traits, and the systems genetic renal studies of Pima Indians. Earlier in my career, I was also heavily involved in mouse genetics, building the Hybrid Mouse Diversity Panel (HMDP), mouse HapMap resources, and the Perlegen high-density mouse panel, which enabled high-resolution genetic studies using inbred mouse strains.
- Zhang F, Flickinger M, InPSYght Psychiatric Genetics Consortium, Abecasis GR, Boehnke M, Kang HM. (2019) Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Research (in press).
- Quick C, Fuchsberger C, Taliun D, Abecasis G, Boehnke M, Kang HM. (2019) emeraLD: rapid linkage disequilibrium estimation with massive datasets. Bioinformatics. 35(1):164-166. PMID: 30204848; PMCID: PMC6298049.
- Kang HM, Subramaniam M, Targ S, Nguyen M, Maliskova L, Wan E, Wong S, Byrnes L, Lanata C, Gate R, Mostafavi S, Marson A, Zaitlen NA, Criswell LA, Ye CJ (2018) Multiplexing droplet-based single cell RNA-sequencing using natural genetic barcodes, Nat Biotechnol, 36(1):89. PMID: 29227470; PMCID: PMC5784859.
- Zhou W, Fritsche LG, Das S, Zhang H, Nielsen JB, Holmen OL, Chen J, Lin M, Elvestad MB, Hveem K, Abecasis GR, Kang HM*, Willer CJ*. (2017) Improving power of association tests using multiple sets of imputed genotypes from distributed reference panels. Genet Epidemiol. 41(8):744-755. PMID: 28861891; PMCID: PMC6324190.
- Gillies CE, Robertson CC, Sampson MG, Kang HM. (2015) GeneVetter: a web tool for quantitative monogenic assessment of rare diseases, Bioinformatics, 31(22) 3682-4. PMID: 26209433; PMCID: PMC4643620
- 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. (2015) A global reference for human genetic variation. Nature. 526(7571):68-74. PMID: 26432245; PMCID: PMC4750478.
- Tan A, Abecasis GR, Kang HM. (2015) Unified representation of genetic variants. Bioinformatics. 31(13):2202-4. PMID: 25701572; PMCID: PMC4481842.
- Jun G, Wing MK, Abecasis GR, Kang HM. (2015) An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data. Genome Res. 25(6):918-25. PMID: 25883319; PMCID: PMC4448687.
- Sampson MG, Gillies CE, Ju W, Kretzler M, Kang HM. (2013) Gene-level Integrated Metric of negative Selection (GIMS) Prioritizes Candidate Genes for Nephrotic Syndrome. PLoS One. 8(11):e81062. PMID: 24260533; PMCID: PMC3832435.
- Jun G, Flickinger M, Hetrick KN, Romm JM, Doheny KF, Abecasis GR, Boehnke M, Kang HM. (2012) Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am J Hum Genet. 91(5):839-48. PMID: 23103226; PMCID: PMC3487130.