2018 Big Data Summer Institute Projects
Each afternoon from 2 to 5 p.m., you will work on big data projects. Students are divided into small research teams, and each team is assigned to a faculty member leading a project area that matches the students' interests. A graduate student instructor supports each project and will be present every afternoon to help.
There are four project areas this year:
Project One: Genomics
Mentors: Matt Zawistowski (BIOS) and TBD
For this project, students will be given access to large-scale genetic datasets containing millions of genetic markers and extensive phenotypes for thousands of samples. Students will have the opportunity to explore various issues in large genetic studies, including visualizing patterns of variation to estimate the ancestry of study participants and identify artifacts due to technological protocols, searching for associations between individual genetic variants and disease risk, and creating polygenic risk scores to predict disease onset.
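To make the idea of a polygenic risk score concrete, here is a minimal sketch in Python. It assumes the common formulation in which a score is the weighted sum of a person's risk-allele counts (0, 1, or 2 per variant). The function name, the effect sizes, and the sample genotypes are all hypothetical and are not taken from the project data.

```python
def polygenic_risk_score(dosages, weights):
    """Return the weighted sum of risk-allele counts for one sample.

    dosages: risk-allele counts (0, 1, or 2), one per variant.
    weights: per-variant effect sizes, in the same order as dosages.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical effect sizes for three variants (illustration only).
weights = [0.5, -0.25, 1.0]

# Hypothetical genotypes for two samples.
samples = {
    "sample_a": [0, 1, 2],
    "sample_b": [2, 2, 0],
}

scores = {name: polygenic_risk_score(g, weights) for name, g in samples.items()}
```

In a real study the weights would typically come from an association analysis, and the scores would be evaluated against observed disease status in held-out samples.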
Project Two: Imaging
Students will be given a large biomedical imaging data set to analyze. The project will consist of using the imaging data to predict the disease status or the cognitive state of subjects (e.g., what they are thinking about). A training set will be used to build a classifier, and a testing set will be used to measure the final predictive performance of the model. The students, with the help of the instructors and graduate student assistant, will decide how they wish to model the data. Statistical models, machine learning algorithms, or both may be used.
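The training and testing workflow described above can be sketched as follows. This is an illustration only: it uses a nearest-centroid classifier as a stand-in for whatever model a team chooses, and the two-feature synthetic data, the labels, and all function names are hypothetical rather than part of the project materials.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Compute the mean feature vector for each label in the training set."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Assign the label whose centroid is closest in squared distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, test):
    """Fraction of held-out rows that are classified correctly."""
    correct = sum(predict(centroids, f) == label for f, label in test)
    return correct / len(test)


# Synthetic, well-separated stand-ins for image-derived features.
rows = [([i * 0.1, i * 0.1], "control") for i in range(10)]
rows += [([10 + i * 0.1, 10 + i * 0.1], "case") for i in range(10)]

train, test = split_train_test(rows)
centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)
```

The key design point is that the test rows are never used while fitting, so the reported accuracy reflects performance on data the model has not seen.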
Project Three: Machine Learning
Today, hospitals collect an immense amount of data pertaining to their patients. Put to good use, these data could help improve healthcare. In this project group, students will learn to apply machine learning approaches to real (i.e., messy) health data for patient risk stratification for adverse health outcomes (e.g., in-hospital mortality). Students will explore a variety of approaches ranging from supervised learning (e.g., deep learning) to unsupervised learning (e.g., graph mining). These techniques will be explored in a range of settings across multiple modalities (e.g., graphs, images, waveforms, and text). Implementation will be largely in Python and will rely on external packages and libraries. Students will be guided through the full "data intensive science" pipeline, from data extraction to preprocessing, model selection, evaluation, interpretation, and visualization of results. Students will see firsthand the data science opportunities that exist in healthcare.
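As one small example of the preprocessing step for messy data, the sketch below fills in missing measurements with the median of the observed values. Median imputation is only one of several possible choices and is not necessarily the method used in the project. The record layout, the field name heart_rate, and the values are hypothetical.

```python
from statistics import median


def impute_median(records, field):
    """Return copies of the records with missing values of `field`
    (represented as None) replaced by the median of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = median(observed)
    return [
        dict(r, **{field: fill if r[field] is None else r[field]})
        for r in records
    ]


# Hypothetical patient records; None marks a missing measurement.
records = [
    {"id": 1, "heart_rate": 72},
    {"id": 2, "heart_rate": None},
    {"id": 3, "heart_rate": 88},
    {"id": 4, "heart_rate": 80},
]

cleaned = impute_median(records, "heart_rate")
```

The function returns new dictionaries rather than modifying the input, so the raw extract stays available for comparison.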
Project Four: Data Mining on Large Complex Datasets
Mentor: Johann Gagnon-Bartsch (STAT)
Students will be given a large, complex dataset from a cancer drug screening experiment. The dataset will include information on the effectiveness of hundreds of drugs on hundreds of different cell lines, in addition to genomic information on the cell lines. The goal of the project will be to build prediction algorithms that are able to determine which drugs are most effective against which types of cancers, with an ultimate goal of customizing drug treatments to individual patients. Students will learn to work with several kinds of data (e.g., gene expression), to integrate different types of complex data, and to apply various machine learning algorithms.
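One simple form of data integration is joining drug response measurements to genomic features by cell line, so that each row can feed a prediction algorithm. The sketch below assumes a tuple layout for responses and a dictionary of expression values keyed by cell line. The identifiers, gene name, and numbers are hypothetical and do not describe the actual dataset.

```python
def integrate(responses, expression):
    """Join drug responses with gene expression on the cell line identifier.

    responses: iterable of (cell_line, drug, response) tuples.
    expression: dict mapping cell_line to a dict of gene expression values.
    Cell lines without expression data are skipped.
    """
    merged = []
    for cell_line, drug, response in responses:
        genes = expression.get(cell_line)
        if genes is None:
            continue
        merged.append(
            {"cell_line": cell_line, "drug": drug, "response": response, **genes}
        )
    return merged


# Hypothetical screening results and expression values.
responses = [
    ("line_1", "drug_a", 0.25),
    ("line_2", "drug_a", 0.75),
    ("line_3", "drug_b", 0.5),
]
expression = {
    "line_1": {"gene_x": 1.5},
    "line_2": {"gene_x": 0.5},
}

merged = integrate(responses, expression)
```

Here line_3 is dropped because it has no expression data. Whether to drop or impute such rows is a modeling decision each team would need to make.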