Statistics Online Computational Resource (SOCR) DataSifter: a statistical technique to protect research participant privacy while enabling data sharing
University of Michigan School of Public Health
3755 SPH I, 1415 Washington Heights, Ann Arbor, MI 48109-2029

ABSTRACT: Effective and pragmatic sharing of data that include sensitive information is difficult. The validation and reproducibility of findings in many health, financial, intelligence, socioeconomic, and other high-dimensional case studies are inhibited when the data cannot be shared and the results cannot be independently confirmed. Either heavy masking compromises the utility of the data, or there is a high risk of exposing private personal or secure organizational information. Excessive scrambling or encoding makes the information less useful for modeling or analytical processing. Insufficient preprocessing may compromise sensitive information and introduce a substantial risk of re-identification of individuals through various stratification techniques. To address this problem, the SOCR lab developed a novel statistical method (DataSifter) that provides on-the-fly obfuscation of high-dimensional structured and unstructured sensitive data, e.g., clinical data from electronic health records (EHR). This technique provides complete administrative control over the balance between the risk of data re-identification and the preservation of the information in the data. With careful setup of user-defined privacy levels, our simulation experiments suggest that the DataSifter protects privacy while maintaining data utility for different types of outcomes of interest. The application of DataSifter to ABIDE data provides a realistic demonstration of how to employ the proposed algorithm on EHR with more than 500 features. We are extending the DataSifter to desensitize longitudinal data and free text. Time permitting, some additional SOCR tools and resources may be demonstrated (http://www.socr.umich.edu).
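To make the idea of a user-defined privacy level concrete, the following is a minimal toy sketch in Python. It is not the DataSifter algorithm, which the abstract does not specify; it only illustrates how a single level between 0 and 1 could trade record-level fidelity against obfuscation. The function names (toy_obfuscate, fraction_changed) and the within-column shuffling scheme are assumptions made purely for illustration.

```python
import random


def toy_obfuscate(rows, level, seed=0):
    """Toy illustration only (NOT the DataSifter method).

    For each column, pick round(level * n) rows at random and shuffle
    that column's values among the picked rows. level=0 leaves the data
    untouched; level=1 shuffles every column across all rows.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be in [0, 1]")
    rng = random.Random(seed)
    n = len(rows)
    out = [list(r) for r in rows]  # copy so the input is not modified
    if n == 0:
        return out
    k = round(level * n)  # number of rows touched per column
    for col in range(len(rows[0])):
        picked = rng.sample(range(n), k)
        values = [out[i][col] for i in picked]
        rng.shuffle(values)
        for i, v in zip(picked, values):
            out[i][col] = v
    return out


def fraction_changed(original, sifted):
    """Share of cells whose value differs between the two tables."""
    total = sum(len(r) for r in original)
    diff = sum(
        a != b
        for row_o, row_s in zip(original, sifted)
        for a, b in zip(row_o, row_s)
    )
    return diff / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical records: two perfectly related columns.
    records = [[i, 2 * i] for i in range(20)]
    for level in (0.0, 0.5, 1.0):
        sifted = toy_obfuscate(records, level)
        print(level, fraction_changed(records, sifted))
```

In this sketch each column keeps exactly the same set of values, so per-column summaries are preserved, while the links between columns within a record are progressively broken as the level increases. A real method would need a far more careful treatment of re-identification risk than this shuffle provides.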

Integrated Health Sciences Core of M-LEEaD (Michigan Center on Lifestage Environmental Exposures and Disease)

Environmental Research Seminar presented by Ivo D. Dinov, PhD, MS (Director, SOCR; Professor, Health Behavior & Biological Sciences; Computational Medicine & Bioinformatics)

December 3, 2019
12:00 pm - 12:50 pm
3755 SPH I
1415 Washington Heights
Ann Arbor, MI 48109-2029
Sponsored by: Integrated Health Sciences Core of M-LEEaD (Michigan Center on Lifestage Environmental Exposures and Disease)
Contact Information: Meredith McGehee ([email protected] | 647-0819)
