Data Augmentation: How do GANs synthesize tabular data in recent studies?

Clive Humby's phrase “Data is the new oil” highlights the importance of data in the 21st century. In this context, data augmentation is one of the AI trends that addresses the lack of data for machine learning tasks. It is even more critical when collecting large amounts of data is difficult, which is often the case for tabular data. Moreover, synthetic data help to protect private information contained in the original data (e.g., medical records) when the data are shared with other parties and the research community. Existing applications of synthetic data include deepfakes[1], oversampling for imbalanced data, and public healthcare records.
More recently, generative models have emerged as the state-of-the-art technique for generating synthetic data: they learn the patterns of the original data and generate new samples similar to it. In this article, we explore several recent studies on one of the most popular generative architectures, the Generative Adversarial Network (GAN), for generating synthetic tabular data. We also discuss the challenges of the tabular data generation task and how GANs overcome these obstacles.
I. Overview of GANs
GANs are based on a game-theoretic scenario between two neural networks: a generator and a discriminator (Goodfellow, et al., 2016). The generator is a directed latent variable model that deterministically generates synthetic samples from noise and tries to fool the discriminator. The discriminator, in turn, tries to distinguish real samples from the synthetic samples produced by the generator (see Figure 1).
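We use the following notation for GANs:
- Discriminator $D$: a neural network with parameters $\theta_d$.
- Generator $G$: a neural network with parameters $\theta_g$.
The objective of a vanilla GAN is

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$$

where:
- $x$ is sampled from the real data distribution $p_{data}(x)$.
- $z$ is noise sampled from a noise distribution $p_{z}(z)$, and the generator learns to generate samples $G(z)$ from this noise.
- $D(x)$ returns the probability of $x$ being a real sample.
- $\mathbb{E}_{x \sim p_{data}(x)}$ and $\mathbb{E}_{z \sim p_{z}(z)}$ are respectively the expectation with $x$ sampled from the data distribution and the expectation with $z$ sampled from its noise distribution.
For this GAN, the generator tries to increase the probability $D(G(z))$ that a synthetic sample is judged to be real, which means

$$\min_{G} \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$$

On the other hand, the discriminator tries to increase the probability $D(x)$ of a real sample and decrease the probability $D(G(z))$ of the synthetic sample,

$$\max_{D} \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$$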
Note that there are several variants of GAN, e.g., Wasserstein GAN (Arjovsky, et al., 2017), PacGAN (Lin, et al., 2017), PATE-GAN (Jordon, et al., 2019), etc., which help to improve the performance of the vanilla GAN.
Figure 1: GAN structure.
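To make the vanilla GAN objective of this section more concrete, the following minimal sketch estimates the value function from batches of discriminator outputs. It is only an illustration: the function name `gan_value` and the example probabilities are assumptions made here, not part of any cited work, and no network is actually trained.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on synthetic samples
    """
    # Clip to avoid log(0); eps is an illustrative choice.
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Hypothetical outputs of a discriminator that separates the two classes well.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])
v_good_d = gan_value(d_real, d_fake)

# Hypothetical outputs of a fully fooled discriminator (always 0.5).
v_confused = gan_value(np.full(3, 0.5), np.full(3, 0.5))

print(v_good_d, v_confused)
```

The discriminator pushes this value up, while the generator pushes it down. When the discriminator cannot tell real from synthetic samples and outputs 0.5 everywhere, the value equals $-2\log 2$, which is lower than the value obtained by the well-separating discriminator in this example.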
II. Synthetic table generation task
Tabular data are data structured in table form. A table consists of columns and rows, which stand for attributes and data samples, respectively. These attributes can depend on one another and can have different data types (i.e., categorical and numerical data).
The task of synthesizing tabular data is to generate an artificial table (or multiple relational tables) based on an original tabular dataset. The synthetic table must share similar properties with the original dataset. In practice, it must satisfy the two following properties:
- A machine learning task must achieve similar performance on real and synthetic data.
- The mutual dependency between any pair of attributes must be preserved.
Note that every sample of the synthetic data must differ from the original data.
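As a rough illustration of how the dependency requirement and the "no copied samples" requirement could be checked, here is a minimal sketch. It makes several assumptions: all attributes are numerical, pairwise dependency is approximated by Pearson correlation (categorical attributes would need a different measure), and the "synthetic" table is simply a second draw from the same distribution standing in for a generator's output. The helper names are hypothetical, and the machine learning performance criterion is not covered here.

```python
import numpy as np

def max_correlation_gap(real, synth):
    """Largest absolute difference between pairwise Pearson correlations."""
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    return float(np.max(np.abs(corr_real - corr_synth)))

def count_copied_rows(real, synth):
    """Number of synthetic rows that exactly match some real row."""
    real_rows = {tuple(row) for row in real.tolist()}
    return sum(tuple(row) in real_rows for row in synth.tolist())

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]  # two correlated numerical attributes

# Toy "real" table and a stand-in for a generator's output.
real = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
synth = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

gap = max_correlation_gap(real, synth)
copies = count_copied_rows(real, synth)
print(f"max correlation gap: {gap:.3f}, copied rows: {copies}")
```

A small gap suggests that pairwise linear dependencies are preserved, and a copy count of zero indicates that no synthetic row reproduces an original row exactly. Exact matching is a weak privacy check, so it should be read as a sanity test rather than a guarantee.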
Generation task notation
We present the notation of the generation task used in (Xu, et al., 2019). In this work, we apply GANs to learn from a single table and generate a single synthetic table.
- Input: an original table $T$ consisting of $N_c + N_d$ variables (i.e., columns): $N_c$ continuous random variables (i.e., numerical data) and $N_d$ discrete random variables (i.e., categorical data, ordinal data). Note that these variables follow an unknown joint distribution. The $n$ independent samples (i.e., rows) come from this joint distribution.
- Columns:
continuous random variables:
- Columns: