Finding the needle: How big data can diagnose disease and improve patients’ lives
By Dr. Raymond Ng, Director, UBC Data Science Institute
Modern medicine produces a massive amount of information about patients, but what good is the data without a way to understand it? A UBC computer scientist is finding ways to uncover secrets in the haystack of data.
If I told you to find a needle in a haystack, how would you do it? You might start by diving in and throwing hay around the room in hopes of getting lucky and finding the needle quickly. But you’d probably realize this approach was messy and start neatly sorting one piece of hay at a time. You’d develop some sort of system and, eventually, find the needle.
My job is to find needles in haystacks. Actually, looking for a needle in a haystack is a piece of cake compared to what I’m trying to find. Your average bale of hay may have a few thousand pieces of hay in it, but the datasets I work with can contain millions of records with tens of thousands of attributes. It’s a lot of hay.
The needle-in-the-haystack problem is really a data science problem. You have a lot of pieces of data (your hay) and you need a way of organizing, processing and analyzing it so you can find what’s hidden.
While you could get a computer to sort one piece of data at a time, there are more effective strategies for finding hidden patterns and useful lessons in the sea of information. We call that data mining.
Data mining is the discovery of patterns in large (or sometimes very large) sets of data. Data mining is interdisciplinary — combining computer science, statistics, data management and machine learning.
I helped develop and regularly use two main techniques for analyzing data: outlier detection and data clustering.
An outlier is a piece of data that doesn’t fit expected patterns — the odd duck. Being able to identify which pieces of data are outliers, and which aren’t, is very valuable. It can help catch errors in the way data is collected and confirm that the data is of high quality. Before we spend a lot of time and energy analyzing data, we need to know it’s trustworthy. Otherwise, whatever conclusions we draw from it won’t be useful. Garbage in, garbage out.
But identifying outliers isn’t always simple. Rather than treating each point as a binary case, either an outlier or not, you can score it on a scale of more or less outlier-y and draw more meaningful conclusions.
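To make the idea of an outlier scale concrete, here is a minimal sketch in Python. The data is invented, and the scoring rule is one simple distance-based notion of outlierness (a point’s average distance to its k nearest neighbours), not the specific methods described in this article: the higher the score, the more outlier-y the point.

```python
# Score each one-dimensional reading by its average distance to its
# k nearest neighbours; large scores suggest possible outliers.
def outlier_scores(points, k=2):
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

readings = [4.8, 5.0, 5.1, 5.2, 5.3, 12.0]  # one suspicious value
scores = outlier_scores(readings)
# The last reading (12.0) gets by far the highest score.
```

Note that the scores form a continuum: a reading of 6.0 would score higher than its neighbours but far lower than 12.0, which is exactly the "more or less outlier-y" distinction described above.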
Identifying outliers can help find fraud in e-commerce and banking, or identify who will get a particular disease based on their DNA.
Another technique I regularly use, and helped develop, is data clustering: the process of sorting data into groups so that the objects or individuals within a group are more similar to each other than to those outside it. In other words, we try to identify groups that have different characteristics and may require different treatments or support.
Previously, data clustering only worked on smaller data sets, but by basing clustering on a randomized search we can now apply it to much larger data sets. Methods we developed almost 30 years ago have paved the way for today’s state-of-the-art solutions that can cluster datasets with millions of records and tens of thousands of attributes in a matter of minutes, and sometimes only seconds.
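As a flavour of clustering by randomized search, here is a toy Python sketch, loosely in the spirit of medoid-based methods; the data, parameters, and function names are invented for illustration. It starts from randomly chosen cluster representatives (medoids), then repeatedly tries a random swap and keeps it only when it lowers the total distance from points to their nearest medoid.

```python
import random

def total_cost(points, medoids):
    # Sum of each point's distance to its nearest medoid.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def randomized_cluster(points, k=2, tries=200, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)       # random starting medoids
    for _ in range(tries):
        candidate = medoids.copy()
        candidate[rng.randrange(k)] = rng.choice(points)  # random swap
        if total_cost(points, candidate) < total_cost(points, medoids):
            medoids = candidate           # keep only improving swaps
    return sorted(medoids)

data = [1.0, 1.2, 0.9, 1.1, 8.0, 8.2, 7.9, 8.1]
print(randomized_cluster(data))  # one medoid near 1, one near 8
```

Because the search samples swaps at random instead of evaluating every possibility, the same idea scales to data sets far too large for exhaustive comparison.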
Great, so now we can train a computer to find a needle in a haystack, but why do we want the needle?
Modern medical research can create massive data sets. Today, we can test patients’ genomes (the collection of all of their genes), proteomes (the collection of all of their proteins), transcriptomes (the collection of all of their genes being transcribed into RNA), metabolomes (the collection of all of the small-molecule metabolites their bodies produce), and microbiomes (the collection of microorganisms living on and in them). For a single patient, we can collect thousands and thousands of data points. But without a way of sorting through all of that data, it isn’t useful for learning about, diagnosing or treating disease.
At the Centre of Excellence for the Prevention of Organ Failures, we’re working to develop two blood tests to diagnose organ transplant rejection. Using data mining, we were able to whittle down tens of thousands of genes and over a million human proteins to a small number that can reliably signal rejection. Building a blood test around these key genes and proteins allows physicians to quickly and cheaply diagnose organ rejection, helping patients stay healthier.
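To give a flavour of how that whittling-down step can work, here is a hypothetical Python sketch. All gene names and expression values are invented, and the scoring rule is a simple signal-to-noise ratio standing in for the more sophisticated statistics a real biomarker pipeline would use: each candidate gene is scored by how well it separates rejection samples from stable ones, and only the top scorers are kept for the panel.

```python
def mean(xs):
    return sum(xs) / len(xs)

def separation_score(rejection, stable):
    # How far apart the two group means are, relative to the spread.
    spread = (max(rejection) - min(rejection)) + (max(stable) - min(stable))
    return abs(mean(rejection) - mean(stable)) / (spread + 1e-9)

# Expression level of each (made-up) gene in two patient groups.
genes = {
    "GENE_A": ([5.1, 5.3, 5.2], [5.0, 5.2, 5.1]),  # little difference
    "GENE_B": ([9.0, 8.8, 9.2], [2.1, 2.0, 2.3]),  # strong separation
    "GENE_C": ([4.0, 6.0, 5.0], [4.5, 5.5, 5.0]),  # noisy, weak signal
}

ranked = sorted(genes, key=lambda g: separation_score(*genes[g]), reverse=True)
panel = ranked[:1]  # keep the most discriminative gene(s)
```

Scaled up from three genes to tens of thousands, this kind of ranking-and-filtering is how a sprawling molecular data set can be reduced to a handful of measurements cheap enough to run as a routine blood test.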
This sort of big data approach could not only be key to learning more about how our bodies work, but could also be instrumental in developing less invasive, less expensive blood tests for better patient care.
We have applied the same general approach to many other conditions, including chronic obstructive pulmonary disease, which affects millions of Canadians. Other groups have used similar approaches to rank treatment options for cancer patients. There are also promising attempts to create blood tests for diagnosing early stages of various types of cancer. In the next decade, we will see many of these tests being used clinically.
Beyond genes and proteins, we have also performed data mining on text messages provided by patients and their families, with their consent of course. Very promising tools are being developed to mine these conversations and identify the needs of patients and their families, including their emotional needs. These tools are particularly valuable for monitoring mental wellbeing and supporting patients with chronic conditions such as cancer, HIV, and diabetes. In the foreseeable future, such tools may also help otherwise healthy people, like seniors, receive better support.
But these big data tools have applications outside of medical fields. There are just as many compelling examples in other domains, such as finance, manufacturing, and education. Only the future will show us how these tools can help us live happier and healthier lives.