Trustworthy AI, Machine Learning, Research and Development

Data Augmentation: How do GANs synthesize tabular data in recent studies?

Author: Anh Khoa NGO HO
Reading time: 15 minutes
Quantmetry.com

“Data is the new oil”: Clive Humby's phrase highlights the importance of data in the 21st century. In this context, data augmentation is an AI trend that addresses the lack of data for machine learning tasks. It is even more critical when collecting large amounts of data is difficult, especially for tabular data. Moreover, synthetic data helps to protect private information contained in the original data (e.g., medical records) when that data is shared with other parties and the research community. Existing applications of synthetic data include deepfakes[1], oversampling for imbalanced data, and public healthcare records.

More recently, generative models have emerged as the state-of-the-art technique for generating synthetic data: they learn the distribution of the original data and generate new samples that resemble it. In this article, we explore several recent studies on one of the most popular generative architectures, the Generative Adversarial Network (GAN), for generating synthetic tabular data. We also discuss the challenges of the tabular data generation task and how GANs overcome these obstacles.

I. Overview of GANs

GANs are based on a game-theoretic scenario between two neural networks: a generator and a discriminator (Goodfellow, et al., 2016). The generator is a directed latent variable model that deterministically generates synthetic samples from noise and tries to fool the discriminator. The discriminator, in turn, distinguishes real samples from the generator's synthetic ones (see Figure 1).
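As a minimal structural sketch of this two-network setup (assuming NumPy; both networks are reduced to single linear layers for illustration, and the class names are our own):

```python
import numpy as np

rng = np.random.default_rng(42)

class Generator:
    """Deterministically maps noise z to a synthetic sample (parameters theta)."""
    def __init__(self, dim_z, dim_x):
        self.W = rng.normal(scale=0.1, size=(dim_z, dim_x))

    def __call__(self, z):
        return z @ self.W

class Discriminator:
    """Returns the probability that a sample is real (parameters phi)."""
    def __init__(self, dim_x):
        self.w = rng.normal(scale=0.1, size=dim_x)

    def __call__(self, x):
        # Sigmoid squashes the score into a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-(x @ self.w)))

G = Generator(dim_z=4, dim_x=2)
D = Discriminator(dim_x=2)
fake = G(rng.normal(size=(8, 4)))  # eight synthetic samples from noise
p_real = D(fake)                   # D's probability that each sample is real
```

In a real GAN, both networks would be deep and trained jointly, but the interface is the same: the generator consumes noise, the discriminator scores samples.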

We use the following notation for GANs:

  • Discriminator D_\phi: a neural network with parameters \phi
  • Generator G_\theta: a neural network with parameters \theta

The objective of a vanilla GAN is \min_\theta \max_\phi v(G_\theta, D_\phi) where:

    \[v(G_\theta, D_\phi) = E_{x \sim p_{data}} \log D_\phi (x) + E_{z \sim p(z)} \log (1 - D_\phi (G_\theta (z)))\]

  • x is sampled from the real distribution p_{data}.
  • z is noise, and the generator learns to generate samples from this noise.
  • D_\phi (x) returns the probability of being a real sample.
  • E_{x \sim p_{data}} and E_{z \sim p(z)} are respectively the expectation with x sampled from the data distribution and the expectation with z sampled from the noise distribution p(z).

For this GAN, the generator tries to increase D_\phi (G_\theta (z)), the probability that its synthetic sample is classified as real, which amounts to

    \[\min_\theta v(G_\theta, D_\phi) = E_{z \sim p(z)} \log (1 - D_\phi (G_\theta (z))).\]

On the other hand, the discriminator tries to increase the probability assigned to real samples and decrease the probability assigned to synthetic samples,

    \[\max_\phi v(G_\theta, D_\phi) = E_{x \sim p_{data}} \log D_\phi (x) + E_{z \sim p(z)} \log (1 - D_\phi (G_\theta (z))).\]
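The value function above can be estimated from samples by Monte Carlo. A hedged sketch, assuming NumPy; D and G here are hypothetical stand-ins for trained networks, not the real models:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Hypothetical discriminator: a sigmoid score (illustration only).
    return 1.0 / (1.0 + np.exp(-x))

def G(z):
    # Hypothetical generator: an affine map of the noise (illustration only).
    return 0.5 * z - 1.0

x_real = rng.normal(loc=1.0, size=10_000)  # samples from p_data
z = rng.normal(size=10_000)                # samples from the noise p(z)

# Monte Carlo estimate of v(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]
v = np.log(D(x_real)).mean() + np.log(1.0 - D(G(z))).mean()
```

During training, the discriminator step ascends this estimate in \phi while the generator step descends it in \theta, alternating between the two.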

Note that there are several variants of GAN, e.g., Wasserstein GAN (Arjovsky, et al., 2017), PacGAN (Lin, et al., 2017), PATE-GAN (Jordon, et al., 2019), etc., which help to improve the performance of the vanilla GAN.

Figure 1: GAN structure.

II. Synthetic table generation task

Tabular data are data structured in table form: the columns and rows stand for attributes and data samples, respectively. These attributes can depend on one another and have different data types (i.e., categorical and numerical).
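To make the mixed-type nature of a table concrete, here is a toy example (plain Python; the column names and values are invented for illustration):

```python
# A toy tabular dataset: each row is a data sample, each key an attribute.
rows = [
    {"age": 34, "income": 52_000.0, "job": "engineer", "smoker": "no"},
    {"age": 58, "income": 31_500.0, "job": "teacher",  "smoker": "yes"},
    {"age": 41, "income": 77_250.0, "job": "engineer", "smoker": "no"},
]

# Split the columns by data type: numerical vs. categorical.
numerical   = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
categorical = [k for k, v in rows[0].items() if isinstance(v, str)]
```

A generative model for such a table must handle both column groups at once, which is precisely what makes the tabular setting harder than, say, images.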

The task of synthesizing tabular data is to generate an artificial table (or multiple relational tables) based on an original tabular dataset. The synthetic table must have properties similar to those of the original dataset. In practice, it must satisfy the two following properties:

  • A machine learning task must achieve similar performance on the real and the synthetic data.
  • Mutual dependency between any pair of attributes must be preserved. Note that every sample of the synthetic data must differ from the original data.
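A simple way to probe the second property is to compare pairwise correlations between the real and the synthetic table. A sketch under toy assumptions (NumPy; both "tables" below are synthetic constructions for illustration, not output of an actual GAN):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" table: two strongly dependent numerical columns.
x = rng.normal(size=1_000)
real = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=1_000)])

# Toy "synthetic" table built to mimic that dependency (illustration only).
y = rng.normal(size=1_000)
synth = np.column_stack([y, 2 * y + rng.normal(scale=0.1, size=1_000)])

# If the dependency is preserved, the pairwise Pearson correlations match.
corr_real = np.corrcoef(real, rowvar=False)[0, 1]
corr_synth = np.corrcoef(synth, rowvar=False)[0, 1]
gap = abs(corr_real - corr_synth)
```

Pearson correlation only captures linear dependency between numerical columns; categorical columns call for other measures, but the comparison principle is the same.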

Generation task notation

We present the notation of the generation task used in (Xu, et al., 2019), where a GAN learns from a single table and generates a single synthetic table.

  • Input: an original table consisting of n variables (i.e., columns): n_c continuous random variables (i.e., numerical data) and n_d discrete random variables (i.e., categorical or ordinal data). These variables follow an unknown joint distribution, from which J independent samples (i.e., rows) are drawn.
    • Columns:
      • n_c continuous random variables: