For researchers using machine learning (ML) to inform and advance health, maintaining the privacy of patient data is one of the greatest challenges.
So when two CIFAR catalyst grants were recently awarded to UBC researchers for synthetic health data projects, the recipients welcomed the support, which will enable their research to progress. The projects explore different AI methods for generating synthetic imaging and clinical data while ensuring fidelity and patient privacy, and will be conducted over the next year.
The projects arose from the Synthetic Data for Health Symposium, an event that drew more than 250 attendees, including Canadian and international experts, to review case studies on synthetic health data for use in medical imaging, clinical data and genomics. The symposium's goal was to facilitate discussions and collaborations between academic researchers and potential private-sector or hospital partners, and the emergence of these two new projects shows that it succeeded.
ML: an invaluable tool in healthcare
Through the input and processing of large quantities of data, machine learning technology can help healthcare professionals diagnose and cure diseases, help with drug development and discovery, assist with organizing and classifying medical information, and much more.
But preserving the privacy of medical data in the process is extremely challenging. That’s because medical researchers require large amounts of data to develop conclusive results, but that data must remain ‘anonymous’ and not expose patients’ private information.
Seeing clearly without seeing the details
The first UBC grant will fund a research project entitled “Privacy-preserving generative models for retina image synthesis used for diagnosis purposes”, co-led by Dr. Xiaoxiao Li (UBC Electrical & Computer Engineering) and Dr. Mi Jung Park (UBC Computer Science), in partnership with Roche, one of the world’s largest biotech companies.
They propose generating privacy-preserving synthetic ophthalmological image data from available public data sources for the purpose of diagnosing glaucoma, in the hope that this will eventually lead to a cure for the eye disease.
Dr. Li explained, “One major obstacle for rare diseases is that we don’t have sufficient data. The data is in different locations (multiple hospitals), and when you try to share such data, you come up against legal barriers. If you want to train a large-scale synthesis model, you still need a large quantity of data.”
Even as it draws on a large quantity of health data, a model must not compromise the privacy of the original data, she explained. A privacy-preserving generative model can be the answer.
Dr. Park said, “We proposed novel metrics to compare the synthetic and privacy-sensitive data distributions which use high-dimensional features learned from public data, so there is no compromise with the privacy-sensitive data.” She continued, “Let’s consider the analogy of a cake. By slicing your cake in many ways or at different angles, you can gain a lot of knowledge about the cake. The ways to slice the cake can be considered the high-dimensional features. We are only measuring the similarity of the synthetic and privacy-sensitive data distributions from a multitude of those slices.”
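To make the cake analogy concrete, here is a minimal sketch of comparing two data distributions "slice by slice". It is not the researchers' metric: it swaps in random one-dimensional projections where their method uses high-dimensional features learned from public data, and it averages a simple one-dimensional distance over those projections (an approach usually called a sliced Wasserstein distance). The function names, the number of slices and the toy data are all assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction: one hypothetical way to 'slice the cake'."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(points, direction):
    """Project every point onto the direction and sort the resulting values."""
    return sorted(sum(p * d for p, d in zip(point, direction)) for point in points)


def sliced_distance(sample_a, sample_b, n_slices=50, seed=0):
    """Average 1-D Wasserstein-1 distance over random projections.

    Assumes both samples contain the same number of points, so the sorted
    projections can be compared position by position.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(sample_a[0])
    total = 0.0
    for _ in range(n_slices):
        direction = random_unit_vector(dim, rng)
        proj_a = project(sample_a, direction)
        proj_b = project(sample_b, direction)
        total += sum(abs(a - b) for a, b in zip(proj_a, proj_b)) / len(proj_a)
    return total / n_slices


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy stand-ins for "privacy-sensitive" and "synthetic" feature vectors.
    real = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    synthetic = [[rng.gauss(0.5, 1.0) for _ in range(5)] for _ in range(200)]
    print("real vs. real:     ", sliced_distance(real, real))
    print("real vs. synthetic:", sliced_distance(real, synthetic))
```

A smaller value means the two samples look more alike across the chosen slices. In the approach Dr. Park describes, the slices come from features learned on public data, which is what keeps the comparison from depending on the privacy-sensitive data.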
In terms of applications, the researchers say their method is versatile enough to generate privacy-preserving synthetic data across a multitude of fields, from law to finance, or wherever people want to use private data.
Dr. Li added, “Privacy has become a very hot topic – people in our field are paying more attention to using data more carefully. You can save money on not having to purchase super expensive machines, you can solve disease-related problems more readily, and avoid legal fees and administrative hassles related to compromised privacy.”
Learning every step of the way
The second UBC proposal to receive a CIFAR grant is for a research project that involves a generator capable of creating images and associated labels for different types of medical images, such as retina images, skin lesions and histopathology. The co-leads, Dr. Raymond Ng and Dr. Mathias Lécuyer, are both UBC Computer Science researchers and members of the UBC Data Science Institute. Their research project is in partnership with Microsoft Research.
The project is entitled “Synthetic data generation through recycled gradients: reducing the privacy footprint of ML for health”. Their focus is on training differentially private (DP) ML models and improving the process of data sharing and learning in health applications, while ensuring the data remains strictly private.
Dr. Lécuyer explained their research: “Typically, most approaches to training DP ML models iteratively improve a model’s parameters by computing incremental updates. These updates contain large amounts of information about the data they were computed on, but they are only applied once for a small improvement, and never re-used.”
He says the project attempts to find a way to leverage these updates more extensively to improve future computations. “The generator inherits the privacy guarantees of the model, but can also be used for additional things such as debugging or improving the original model.”
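The incremental updates Dr. Lécuyer mentions are commonly computed with a recipe known as DP-SGD: clip each example's gradient, add Gaussian noise to the sum, then take an ordinary gradient step. The sketch below shows only that generic update under stated assumptions. It is not the project's recycled-gradient technique, it omits the privacy accounting a real system needs, and the function and parameter names are hypothetical.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy update: clip per-example gradients, add noise, average, step."""
    n = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    # Noise is scaled to the clipping bound, which limits any one example's influence.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [0.0, 0.0]
    # Toy per-example gradients; in practice these come from the model's loss.
    grads = [[3.0, 4.0], [0.1, -0.2], [-1.0, 0.5]]
    params = dp_sgd_step(params, grads, lr=0.1, max_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
    print(params)
```

In this generic recipe each noisy update is applied once and then discarded, which is the single use that the project aims to go beyond.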
The grant will allow the researchers to explore the topic in greater depth. Their exploratory work on the subject has already produced promising results.
Together, Dr. Ng and Dr. Lécuyer bring a wealth of knowledge to this project, combining expertise in ML for health, ML with privacy, and data generation.
Many applications and expert collaborations in ML at UBC
Beyond these two synthetic data projects, UBC CS is making great strides in machine learning overall. Through collaborative efforts with other faculties, lab groups and associations like LEAP and the Data Science Institute, in addition to the AI-focused CAIDA group, UBC researchers are finding new avenues to expand machine learning research and its applications.
With the efforts of researchers like Dr. Park, Dr. Li, Dr. Ng and Dr. Lécuyer, the hope is that significant advances can be made in the use of health data, as well as in other fields. The positive implications are well within view.