NeurIPS’22: Improving the quality of synthetic data

Part 6 in a series about some of the department’s accepted papers at NeurIPS 2022 (conference being held Nov. 28 - Dec. 9)

Raymond Ng, a professor with UBC Computer Science, and his co-authors, grad student Ali Seyfi and senior data scientist Jean-Francois Rajotte, have a paper accepted at the upcoming NeurIPS conference.

The paper:
Generating multivariate time series with COmmon Source CoordInated GAN (COSCI-GAN), Ali Seyfi, Jean-Francois Rajotte, Raymond T. Ng

What is COSCI-GAN?

COSCI-GAN stands for "Common Source Coordinated GAN."

Rajotte explains, “In our collaboration at the Data Science Institute, we were exploring improved ways of generating matched synthetic data features.”

He said they realized there is a lack of methodology for ensuring that data features match. In the health domain, for example, if one has multivariate data with a heartbeat feature and a respiratory feature, it’s not easy to tell whether they come from the same source; that is, the same person.

When researchers create synthetic data (in order to protect private data sources), they want the relationships between multiple features to be preserved, so that data from connected sources remain correlated. In this health case, that would mean being able to create realistic heart rate and respiration data that look like they come from the same person.

“We decided to generate signals individually, using a Generative Adversarial Network, or GAN for short,” said Rajotte.

What is a GAN? A GAN (Generative Adversarial Network) is a class of machine learning frameworks in which two neural networks, a generator and a discriminator, compete with each other to become more accurate in their respective tasks, where one agent’s gain is the other agent’s loss.
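
The “one agent’s gain is another agent’s loss” idea can be made concrete with the value function from the standard GAN formulation, which the discriminator tries to raise and the generator tries to lower. The sketch below is a minimal illustration in plain Python, not code from the paper; the function name `gan_value` and the example scores are hypothetical.

```python
import math

def gan_value(d_real, d_fake):
    """Standard GAN value function V(D, G), estimated on a batch.

    d_real: discriminator outputs in (0, 1) for real samples
    d_fake: discriminator outputs in (0, 1) for generated samples
    The discriminator tries to maximize V; the generator tries to minimize it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator scores (probability that a sample is real).
confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])  # cannot tell them apart
sharp = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])   # separates them well

# A sharper discriminator raises V (its gain); a generator that fools it
# pushes V back down toward the "confused" value of -2 * log(2).
print(confused, sharp)
```

Training alternates between the two networks, each one adjusting its own parameters to move this single quantity in its preferred direction.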

Then the researchers added another step whereby the data was analyzed together after being generated individually. If the data points looked realistic together (when grouped), they were determined to be reasonably accurate. “It’s a two-stage process of individual generation and combined evaluation,” he said.

With this extra step of evaluation and iteration, it’s easy to determine if the synthetic data is a match or a mismatch.
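
To see why evaluating the channels together matters, consider a toy example. This is a rough sketch in plain Python and is not the authors’ implementation: the two “generators” are fixed hand-written functions rather than trained networks, the shared random value standing in for the common source is an assumption suggested by the method’s name, and a simple correlation check stands in for the combined evaluation. All names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def heart_channel(source, rng):
    # Toy stand-in for a per-channel generator: heart rate driven by the source.
    return 70.0 + 10.0 * source + rng.gauss(0.0, 1.0)

def breath_channel(source, rng):
    # Toy stand-in for a second generator fed by the same source.
    return 16.0 + 3.0 * source + rng.gauss(0.0, 0.5)

rng = random.Random(0)
sources = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # one common source per "person"

# Stage 1: each channel is generated individually from the shared source.
heart = [heart_channel(s, rng) for s in sources]
breath = [breath_channel(s, rng) for s in sources]

# Mismatched pairs: same breath values, but assigned to the wrong "person".
breath_shuffled = breath[:]
rng.shuffle(breath_shuffled)

# Stage 2: look at the channels together. Each channel on its own looks
# the same in both cases; only the joint view separates match from mismatch.
matched_corr = pearson(heart, breath)
mismatched_corr = pearson(heart, breath_shuffled)
print(f"matched: {matched_corr:.2f}, mismatched: {mismatched_corr:.2f}")
```

In the matched case the correlation is high; after shuffling it falls close to zero, even though neither channel’s individual values have changed. A check that only looks at one channel at a time cannot see this difference.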

Rajotte said, “COSCI-GAN could help people share data. For example, let’s say a team is doing health research but they cannot share the details of their results because of privacy regulations. They could create a synthetic data set that represents these multiple related data points, which are called a multivariate time series, so tests can be run using COSCI-GAN on that set of synthetic data.”

They have shown in their paper that this framework is relevant for generating multivariate time series from a common source and is particularly suited for human-based biometric measurements.

As for future work, they say their method could be extended to more practical use cases where various channels correspond to different types of time series, e.g. heartbeats, temperature, respiration, wearable measurements.

On the technical side, their framework can be implemented with a wide variety of GANs chosen based on the data type, including modern architectures like transformers.

“COSCI-GAN can help pivot synthetic data to a place of significant increased reliability, while maintaining strict privacy protocols. We’re encouraged by the outcomes to date,” said Dr. Ng.

In total, the department has 13 accepted papers by 9 professors at the NeurIPS conference. Read more about the accepted papers and their authors.

More about Dr. Raymond Ng and his research

More about Jean-Francois Rajotte

More about Ali Seyfi