Research Activities
2018 Big Data Summer Institute Projects
Each afternoon from 2-5 p.m., you will work on big data projects. For project work, students are divided into small research teams, and each team is assigned to a faculty member leading a project area that matches the students' interests. A graduate student instructor will support each project and will be present each afternoon to help the students.
There are four project areas this year:
Project One: Genomics
Mentor: Matt Zawistowski (BIOS) and TBD
Co-Mentor: TBD
For this project, students will be given access to large-scale genetic datasets containing
millions of genetic markers and extensive phenotypes for thousands of samples. Students
will have the opportunity to explore various issues in large genetic studies, including
visualizing patterns of variation to estimate the ancestry of study participants and
identify artifacts due to technological protocols, searching for associations between
individual genetic variants and disease risk, and creating polygenic risk scores
to predict disease onset.
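As a rough illustration of the last idea, a polygenic risk score is commonly computed as a weighted sum of a person's risk-allele counts. The sketch below is only a minimal example under stated assumptions: the variant names, the effect sizes, and the helper `polygenic_risk_score` are all hypothetical and are not taken from the project materials.

```python
# Hypothetical per-variant effect sizes (for example, log odds ratios
# estimated in an association study). These values are made up.
effect_sizes = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 1.0}


def polygenic_risk_score(genotype, weights):
    """Return the weighted sum of risk-allele counts (0, 1, or 2).

    Variants missing from `genotype` are counted as 0 copies, which is a
    simplifying assumption made for this sketch.
    """
    return sum(w * genotype.get(variant, 0) for variant, w in weights.items())


# One hypothetical study participant: number of risk alleles per variant.
sample = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
score = polygenic_risk_score(sample, effect_sizes)
print(score)
```

In a real study the weights would come from association results on a separate cohort, and the score would be evaluated against observed disease outcomes.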
Project Two: Imaging
Mentors: Tim Johnson (BIOS), Jian Kang (BIOS), and Eunjee Lee (BIOS)
Co-Mentor: TBD
Students will be given a large biomedical imaging data set to analyze. The project
will consist of using the imaging data to predict the disease status or the cognitive
state of subjects (e.g., what they are thinking about). A training set will be used
to build a classifier, and a separate test set will be used to evaluate the final predictive
performance of the model. The students, with the help of the instructors and the graduate
student assistant, will decide how to model the data. Statistical models, machine
learning algorithms, or both may be used.
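The train-then-test workflow described above can be sketched in a few lines. This is only an illustrative example under stated assumptions: the data are synthetic stand-ins for image-derived features, and the nearest-centroid classifier and the helper names (`make_subject`, `fit_centroids`, `predict`) are hypothetical choices for the sketch, not the methods the project will use.

```python
import random


def make_subject(label, rng):
    # Hypothetical stand-in for image-derived features:
    # two noisy numbers per subject, centered differently per class.
    center = 0.0 if label == 0 else 3.0
    return [rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label


rng = random.Random(0)  # fixed seed so the split is reproducible
data = [make_subject(i % 2, rng) for i in range(200)]
rng.shuffle(data)
train, test = data[:150], data[150:]  # the test set is held out from fitting


def fit_centroids(rows):
    """Average the feature vectors of each class in the training set."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((f - m) ** 2 for f, m in zip(features, centroids[c])),
    )


centroids = fit_centroids(train)
accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The key design point is that the test subjects play no part in fitting the centroids, so the reported accuracy estimates performance on unseen data.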
Project Three: Machine Learning
Mentors: Jenna Wiens (CS) and Danai Koutra (CS)
Co-Mentor: TBD
Today, hospitals collect an immense amount of data pertaining to their patients. Put to good use, these data could help improve healthcare. In this project group, students will learn to apply machine learning approaches to real (i.e., messy) health data to stratify patients by their risk of adverse health outcomes (e.g., in-hospital mortality). Students will explore a variety of approaches ranging from supervised learning (e.g., deep learning) to unsupervised learning (e.g., graph mining). These techniques will be explored in a range of settings across multiple modalities (e.g., graphs, images, waveforms, and text). Implementation will be largely in Python and will rely on external packages and libraries. Students will be guided through the full "data intensive science" pipeline, from data extraction through preprocessing, model selection, and evaluation to interpretation and visualization of results. Students will see firsthand the data science opportunities that exist in healthcare.
Project Four: Data Mining on Large Complex Datasets
Mentor: Johann Gagnon-Bartsch (STAT)
Co-Mentor: TBD
Students will be given a large, complex dataset from a cancer drug screening experiment. The dataset will include information on the effectiveness of hundreds of drugs on hundreds of different cell lines, in addition to genomic information on the cell lines. The goal of the project will be to build prediction algorithms that can determine which drugs are most effective against which types of cancer, with an ultimate goal of customizing drug treatments to individual patients. Students will learn to work with several kinds of data (e.g., gene expression), to integrate different types of complex data, and to apply various machine learning algorithms.