Point/Counterpoint: Big Numbers vs. Little Numbers

Why I Love Big Numbers

by Sharon Kardia

I love big numbers in so many ways. In public health genetics we deeply need to accumulate information on the lived experience of millions of people to best predict what can happen to a single individual. There are 25,000 genes and millions of mutations. To sort out what is causing what, we need really large databases of people's health experiences. For example, rare genetic mutations or abnormalities occur in approximately one in 10,000 births. Every baby in the country is screened so those with the rare diseases can be identified and treated. Newborn screening programs worldwide then pool together information on the new cases to help companies develop and test new drugs. Without really large efforts like millions of babies being tested every year, we couldn't save the lives of those one in 10,000 infants.

Here is another way I love big numbers: the genome is big, really big. It is approximately three billion DNA base pairs large, and I share 99.9 percent of those base pairs with every other human on the planet. So, in one way, I have seven billion brothers and sisters—that's a big family. However, the 0.1 percent difference in our genomes means that we are each different by approximately three to six million mutations. It is the perfect paradox: we are simultaneously all the same while being completely unique.
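The arithmetic behind that paradox can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; reading the "three to six million" range as one versus two inherited copies of the genome is an assumption made here, not something the text states.

```python
# Figures quoted in the text.
GENOME_BASE_PAIRS = 3_000_000_000   # about three billion base pairs
DIFFERING_PER_THOUSAND = 1          # 0.1 percent, i.e. 1 base pair in 1,000

# Integer arithmetic avoids floating-point rounding.
differences_one_copy = GENOME_BASE_PAIRS * DIFFERING_PER_THOUSAND // 1000

# Assumption: the upper figure counts both inherited copies of the genome.
differences_two_copies = 2 * differences_one_copy

print(f"{differences_one_copy:,} to {differences_two_copies:,} differing base pairs")
```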

Very little of the genome can be used to predict health outcomes without knowing a person's context. Large population studies are the key to understanding which things in the genome have predictive power and which do not. There are great studies of bacteria and mice where scientists have systematically deleted an entire gene to see what happens. Essentially, those studies show that some 50 percent of our genes are "not essential." Our bodies are set up to have a lot of redundancy and back-up systems so that most mutations don't have catastrophic effects.

That's why we need large numbers, so that we can understand how a given individual's genome is likely to play out in different environments or over a lifetime. As we move closer to the time when major medical systems (including U-M) start to incorporate genetic sequencing and other genomic technologies into clinical practice, we face a huge paradox: we'll be able to sequence people's big genomes cheaply, and we'll find thousands of rare or "private" mutations, but we won't know how to interpret those mutations unless we have huge data sets to find other people with those same rare mutations. The more ways we can learn from the health experience of others (entire populations) the better we will be able to serve each individual in our society. For example, I have loved ones who are allergic to sulfa drugs, but they only learned this after being given sulfa drugs. It might have been nice to have had that information beforehand so that they didn't have to learn through trial and error. The type of genetic epidemiology we do here at U-M SPH uses large populations to discover the genes and the mutations that make people sick.

There is some really great work being done by our health informatics program to develop a prototype for a regional or national Learning Health System that can integrate every conceivable type of health information—across counties and states, including data from primary and emergency care, Medicare, and Medicaid—so that we can learn how to better care for people. This system would also revolutionize public health, since it would give us real-time health data about the health of our communities and help us improve health outcomes in those communities. It is a great way to merge individual benefit with public health values.

The most personal way I love big numbers is the mind-boggling mega-relationship between me and my genome. My body (and everyone else's body) is an entire world made up of the grass-roots, autonomous action of bazillions of molecules. For example, I am about 10 to 20 trillion cells large (not counting my 100-trillion microbial inhabitants), and each cell has two copies of an entire genome (three billion base pairs from my mom and three billion base pairs from my dad). Through the miracle of meiosis, which happens each generation, the DNA in my cells is the DNA of thousands of my ancestors. For example, I can trace my family line back 250 years, and at 25 years per generation that means about ten generations, so through that lens, my genome can be traced back to 2^10 genomes (which is 1,024 people). The whole mega-collection of genomic worlds within my cells works together, moment by moment, to process information about where I am, what I am eating and doing to adapt and optimize my health. I couldn't be here without large numbers.
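The ancestor count is a simple doubling: two parents, four grandparents, and so on. Here is a minimal sketch of that calculation, assuming (as the rough estimate above implicitly does) that no ancestor appears twice in the family tree. The function name is made up for illustration.

```python
YEARS_TRACED = 250
YEARS_PER_GENERATION = 25

def ancestors_at(generation: int) -> int:
    """Number of ancestors a given number of generations back.

    Assumption: every ancestor is a distinct person, so the count
    doubles with each generation.
    """
    return 2 ** generation

generations = YEARS_TRACED // YEARS_PER_GENERATION

for g in range(1, generations + 1):
    print(f"{g:2d} generations back: {ancestors_at(g):,} ancestors")
```

In real family trees distant cousins sometimes marry, so the true number of distinct ancestors can be smaller than the doubling suggests.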

Sharon Kardia is the U-M SPH senior associate dean for administration and a professor of epidemiology, whose research focuses on the genetic epidemiology of common chronic diseases. She directs the U-M Life Sciences and Society Program.

Why I Love Small Numbers

By Bhramar Mukherjee

In the public health and medical research studies I'm involved with as a biostatistician, my collaborators are happy if their findings reach statistical significance, and that comes with what we call "small P-values." P-values are a measure of evidence, a way of determining whether your data is compatible with the baseline, or "null," hypothesis that's under consideration.

The null hypothesis usually indicates that not much is happening with an intervention, or there is no evidence of association. As a researcher, you often want to refute the null and establish an "alternative hypothesis," which means that something is actually going on in your study—for example, your intervention has an effect, or a new drug is better than an existing therapy, or a set of genetic markers is indeed associated with a particular disease. To quantify the strength of that finding you need a measure of evidence, and P-values are the most commonly used measures for this purpose, with P<0.05 considered to be a statistically significant finding. Researchers want tiny P-values—in fact whenever my analysis leads to large P-values, my collaborators become sad because it means their study doesn't show much. This obsession with tiny P-values has generated considerable publication bias. As a statistician, I am bothered by this obsession, but I can't avoid it.
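To make the idea concrete, here is a small sketch of one way to compute a P-value: an exact one-sided binomial test. The trial numbers (60 responders out of 100 patients, against a null response rate of 50 percent) are invented purely for illustration, and the function name is hypothetical.

```python
from math import comb

def one_sided_binomial_p_value(successes: int, trials: int, null_rate: float) -> float:
    """Probability of seeing at least `successes` out of `trials`
    if the null hypothesis (true rate == null_rate) holds."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical study: 60 of 100 patients respond; the null says 50 percent would.
p_value = one_sided_binomial_p_value(successes=60, trials=100, null_rate=0.5)

print(f"P-value: {p_value:.4f}")
print("Statistically significant at 0.05:", p_value < 0.05)
```

With these made-up numbers the P-value comes out at roughly 0.03, small enough to clear the conventional 0.05 bar.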

I'm currently working with a large consortium, funded by the National Cancer Institute, that's exploring gene-environment interactions related to colorectal cancer. We've put together various study cohorts from all over the world and are trying to understand the extent to which genes modify risk due to environmental factors such as high red-meat intake or lack of physical exercise. We're studying millions of genes and about a dozen established environmental factors for colorectal cancer, and our work is giving rise to many gene-by-environment interaction tests. As the number of potential hypotheses that we are testing increases, we need even tinier P-values, say, on the order of 10^-8. On the other hand, larger sample sizes lead to a greater probability of statistically significant results with small P-values, and thus large and small numbers play a very important dual role in all this.
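One common way to see why more tests demand tinier P-values is the Bonferroni correction, which divides the overall error allowance by the number of tests. The sketch below uses it only as an illustration; the article does not say which correction the consortium applies, and the marker count is a hypothetical round number.

```python
def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the chance of any
    false positive across all tests at or below `family_alpha`."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests

N_GENETIC_MARKERS = 1_000_000      # hypothetical round number
N_ENVIRONMENTAL_FACTORS = 12       # "about a dozen"

n_tests = N_GENETIC_MARKERS * N_ENVIRONMENTAL_FACTORS
threshold = bonferroni_threshold(0.05, n_tests)

print(f"{n_tests:,} interaction tests -> per-test threshold {threshold:.1e}")
```

Bonferroni is deliberately conservative; other procedures trade some of that strictness for more power.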

When it comes to public health, particularly genomics, little numbers can add up to big differences. In risk prediction models, one individual gene may not confer much risk, but if you have an ensemble of genes, each with modest relative risk, they can contribute collectively to disease risk in a significant way.

As we move closer to the era of personalized medicine, we're paying more attention to the potential collective significance of individual genes with tiny effects. Scientists are trying to create composite cancer risk scores, for example, for people who have a number of cancer genes, each with a tiny elevated risk. Such scores can be used for risk prediction as well as for defining subgroups of patients who might benefit from a particular treatment regimen.
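A toy calculation shows how modest effects can pile up. The sketch assumes the variants act independently and that their relative risks multiply, a simplification that real risk models do not always make. The figures (20 variants, each raising risk by 10 percent) are invented for illustration.

```python
from math import exp, log

def combined_relative_risk(relative_risks: list[float]) -> float:
    """Combine per-variant relative risks into one score.

    Assumption: variants are independent and risks multiply, so the
    log relative risks simply add up.
    """
    return exp(sum(log(rr) for rr in relative_risks))

# Hypothetical person carrying 20 variants, each with relative risk 1.1.
modest_risks = [1.1] * 20
combined = combined_relative_risk(modest_risks)

print(f"Each variant alone: 1.1x; all 20 together: {combined:.1f}x")
```

Under these assumptions, twenty individually unremarkable variants add up to a risk more than six times the baseline.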

On the other hand, just as little effects can add up to big differences, little errors can add up to big problems. When you are conducting studies with a large number of tests, small violations from classical assumptions can lead to trouble. In the modern era, scientists are confronted with zillions of tests, and even minute departures in the distribution of a test statistic—say, a violation of normality—can substantially inflate your false discovery rate. We need to pay more attention to the behavior of the extremes—or "tails"—of the distribution of these test statistics than we did in classical statistical inference.
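Here is a rough sketch of why the tails matter. It sets a cutoff assuming the test statistic is standard normal, then asks how often a null statistic would cross that cutoff if it actually followed a heavier-tailed distribution with the same mean and variance. The Laplace distribution is chosen here only because its tail has a simple formula; it is an illustrative stand-in, not a claim about any real study.

```python
from math import exp, sqrt
from statistics import NormalDist

def normal_cutoff(alpha: float) -> float:
    """Two-sided |z| cutoff with false-positive rate `alpha`
    when the statistic really is standard normal."""
    return -NormalDist().inv_cdf(alpha / 2)

def laplace_two_sided_tail(cutoff: float) -> float:
    """P(|X| > cutoff) for a Laplace distribution with mean 0 and variance 1."""
    scale = 1 / sqrt(2)  # Laplace variance is 2 * scale**2, so this gives variance 1
    return exp(-cutoff / scale)

alpha = 1e-8                      # nominal per-test false-positive rate
cutoff = normal_cutoff(alpha)
actual_rate = laplace_two_sided_tail(cutoff)
inflation = actual_rate / alpha

print(f"Cutoff assuming normality: |z| > {cutoff:.2f}")
print(f"Nominal rate {alpha:.0e}, actual rate under Laplace {actual_rate:.1e}")
print(f"False positives inflated by a factor of about {inflation:,.0f}")
```

The two distributions look similar near the center, yet far out in the tails the false-positive rate differs by several orders of magnitude.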

I've always been fascinated by the duality between small and large numbers. In our theoretical statistics courses, we teach about limiting behaviors, and we speak of things "tending to infinity" or "becoming infinitesimally close to zero" in very similar ways.

Ultimately it's all relative. Take a small number, and its reciprocal is large. Take a big number, put a minus sign in front of it, and it becomes small. As a culture, we want some numbers to be small (the unemployment rate, price inflation) and some numbers to be large (the gross domestic product of a country, per capita income). One thing I can tell you for sure is that I do not like small numbers for my teaching evaluations or my salary!

Bhramar Mukherjee is an associate professor of biostatistics at U-M SPH, whose principal research interests are Bayesian methods in epidemiology and studies of gene-environment interaction. She is a co-investigator in several studies led by faculty in the U-M Departments of Epidemiology, Environmental Health Sciences, and Internal Medicine.

Overlapping Passions

Student: Amanda Eccleston

Amanda Eccleston
First-year MPH student in epidemiology (global health certificate); member, U-M women's varsity cross-country and track teams

"Running and public health draw similar types of people. If you want to get involved in running, become passionate about it. People who go into public health are also very passionate. It's not a field you go into just to make money or have a job—it's a field you go into because you really care about it. Success in both is long-term. There's no instant gratification. With running there can be years and years of training. Public health can also take years of work, research, negotiation, and program implementation to achieve results.

"Running and public health both require teamwork. Last fall, I got about halfway through the Big 10 championship cross-country race and then fell off the lead pack. One of my teammates caught up to me, and for the second half of the race we ran every step together. Together we were able to get through the race—and Michigan won the Big 10 championship! In public health also, it's about collaborating with people and organizations.

"I've been reading lately about the United Nations Millennium Development goals related to immunization, and about how much international cooperation this takes. These are things you can't do alone. You can't win a national championship as a team alone, and you can't cure diseases and solve public health problems by yourself."

Faculty: Mousumi Banerjee

Mousumi Banerjee
U-M research professor, biostatistics; director of biostatistics, Center for Healthcare Outcomes & Policy; member, U-M Comprehensive Cancer Center

"I have a very deep love and affection for the arts, particularly music and poetry. They counterbalance my work as a scientist. As it happens, I did my training as a statistician at the same time I was studying Indian music—specifically Rabindrasangeet, or songs written by the Bengali poet Rabindranath Tagore. What I love most about this genre is that I can find a song for any mood that I'm experiencing.

"Music helps me feel the human link for my work as a scientist doing statistical modeling for cancer. It's important to have that connection—to realize that there is a human face on the other side of the data. To me, data and music both hold mysteries—whether it's about an underlying biological mechanism, or patterns of disease in a population, or connections to one's deep inner self. You just have to know how to unravel these mysteries.

"I enjoy making sense of data, using statistics to learn important stories and discover interesting phenomena. Music gives me similar joy, because through music I discover myself. There is a level of abstraction in both statistics and music, which I love. Both fields are all about patterns and rhythms."

Staff: Vlad Wielbut

Vlad Wielbut
Director, U-M SPH Informatics and Computing Services; co-organizer and host, Ann Arbor Polish Film Festival; member, U-M Collaborative Domain Group

"I have always been fascinated by how people communicate. When I was a kid growing up in Poland, my grandmother would take me to the village where she was born. I would listen to the peculiar dialect these people spoke, and after a couple of weeks I would talk like them.

"In school, we all had to learn Russian. It was a language of the Soviet oppressor, but I enjoyed it so much that I wrote poetry in it. After high school I wanted to learn English, so I bought a textbook and went to see American movies. That helped a lot when my wife and I left Poland in 1987 and spent two years in Germany waiting for our visas to the U.S. I of course wanted to learn the language, so I studied German literature.

"When we came to America, I found a wholly different set of languages, which allowed me to communicate with a machine, so I studied computer science. Once I wrote my first program, I was hooked. I "spoke" with a machine and we "understood" each other.

"Underlying all this is my love of learning. I think I inherited it from my grandmother, who had only a fifth-grade education but was always surrounded by books. When I lived in Warsaw, I always had a book with me. Now I carry a small library on my Kindle."