24/01/2020

Tabular data generation using Generative Adversarial Networks 



Aurélia Nègre and Michaël Sok / Reading time: 17 minutes

Generative models are an exciting area of research with multiple applications. Synthetic data can be used as a replacement for, or a complement to, real data for simulation purposes, for robustness improvement, or to handle privacy issues. This is why Generative Adversarial Networks (GANs) have been hugely successful since their introduction in 2014 by Ian Goodfellow et al. [1], their generated images being outstandingly realistic. You can read more in our previous blog article.

Image from the website: https://thispersondoesnotexist.com 

The vast majority of the GAN literature focuses on images, yet every data scientist knows that images are not the most frequent data source they work with. Tabular data are indeed the most common one. Structured data is data that conforms to a data model, has a well-defined structure, follows a consistent order and can be easily accessed and used by a person or a computer program. Tabular data are structured data with clearly defined columns and rows (e.g. an Excel spreadsheet).

Fortunately, there are some scientific papers that address the topic of tabular data generation. This literature was the starting point of our own implementation of tabular data generation.

Context

As consultants, we were working on a project where a classifier was going into production, and we considered using synthetic data generated by a GAN to improve its robustness.

NB: the customer and project context are confidential, therefore no information regarding the use case can be provided. Simply imagine a standard churn or fraud classifier, using Machine Learning or Deep Learning models.

We had a two-step approach in mind:

  1. Can we design a GAN that generates realistic synthetic data in our specific context? If so, the synthetic data could replace real data to train the model, complement real data to improve the model's robustness, or be shared with other group entities that cannot share real data.
  2. Can we generate adversarial examples whose sole purpose would be to improve robustness?

Prompted by very recent papers on GANs for tabular data [2, 6, 7, 8, 9], we launched a Proof of Concept (POC) to test GANs in our context. The objective was to verify whether data subject to strong constraints and specificities could be generated while keeping the statistical properties needed to learn the original classification task.

We therefore focused on Mottini et al. (2018) [2], whose problem was similar to ours, and tried to answer our questions within around 60 person-days.

Our starting point – an academic paper

As described in their article [2], Mottini et al. use Passenger Name Record (PNR) data, which contain travel and passenger information. The authors come from both academia (Inria) and industry (Amadeus, a major Spanish IT provider for the global travel and tourism industry). Amadeus’ issue with PNR data was the following: the data could not be kept for more than 3 months after the trip because of data protection regulations (GDPR). However, they wanted to train Machine Learning models to improve their company’s efficiency (e.g. to predict the next trip).

Therefore, they considered building a history of synthetic data to train their models on. For such a use case, GANs were the reference methodology. Note that PNRs contain both categorical and numerical features, with missing (NaN) values, which makes the use of GANs quite challenging.

Thus, they proposed a solution based on Cramér GANs [3], categorical feature embedding (also known as entity embedding in the literature) and a Cross-Net layer backbone [10]. The dataset used for training was a real PNR dataset, while their evaluation focused on distribution matching, memorization and performance of inference models for two real business use cases: customer segmentation and passenger nationality prediction.
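
To make two of these building blocks more concrete, here is a minimal pure-Python sketch of an entity embedding lookup and of one cross layer. It is an illustration under our own assumptions, not the implementation of [2]: we assume the cross layer follows the Deep & Cross Network formulation x_{l+1} = x_0 (x_l · w) + b + x_l, and the function names, the toy categories and the dimensions are hypothetical.

```python
import random


def make_embedding_table(categories, dim, seed=0):
    """Map each category to a small dense vector (learned during training in practice)."""
    rng = random.Random(seed)
    return {c: [rng.gauss(0.0, 0.1) for _ in range(dim)] for c in categories}


def embed(table, category):
    """Look up the dense vector of one categorical value."""
    return table[category]


def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl (explicit feature crossing)."""
    dot = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0_i * dot + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


# Hypothetical toy record: one categorical feature and two numerical ones.
table = make_embedding_table(["economy", "business", "first"], dim=2)
x0 = embed(table, "business") + [0.5, -1.0]   # concatenated network input
w = [0.1] * len(x0)                           # untrained, illustrative weights
b = [0.0] * len(x0)
x1 = cross_layer(x0, x0, w, b)                # the first cross layer uses xl = x0
```

The embedding replaces a one-hot encoding by a dense vector of fixed size, while the cross layer multiplies the input by a learned projection of itself, which lets the network model feature interactions explicitly.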

Although the real PNRs were not memorized, the synthetic records matched them quite well and could be used to train the models of their use cases.

In conclusion, their use case was close to ours (tabular data, mixed types, occasional missing values) and, since the paper was co-written by industry practitioners, we considered that their approach might be sufficiently pragmatic and transposable for us.

Reminders about GANs: Vanilla, WGAN, WGAN-GP

Before presenting our specific approach, let’s recall how the original GAN (also known as Vanilla GAN) is defined:

Figure 1: Vanilla GAN schema

 

In the above picture, two “agents” are defined: the generator G and the discriminator D. These two names will be kept throughout this article, so remember them well! The objective of the generator G is to deceive the discriminator into thinking its outputs are real data. Conversely, the discriminator D should classify its inputs as generated or real. These opposing objectives explain where the “adversarial” part of GAN comes from.
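
These two opposing objectives can be written as two losses computed from the discriminator outputs. The sketch below is a minimal illustration, assuming D outputs a probability of being real strictly between 0 and 1; the function names are hypothetical, and the generator loss is the non-saturating variant suggested in [1] rather than the exact minimax form.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: D wants d_real close to 1 and d_fake close to 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating loss: G wants D to output values close to 1 on its samples."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# A confident discriminator has a low loss, and the generator a high one.
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
g_when_caught = generator_loss(d_fake=[0.1, 0.2])
g_when_fooling = generator_loss(d_fake=[0.5, 0.5])
```

When the discriminator cannot tell real from generated data and outputs 0.5 everywhere, its loss equals 2·log 2, which is the equilibrium the generator is aiming for.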

This GAN, defined in 2014 by Ian Goodfellow et al. [1], has many extensions, whether on its loss, its network backbone or the discriminator output. For reference, the Vanilla GAN problem above can be reformulated, for an optimal discriminator, as the minimization of the Jensen-Shannon divergence between the real and generated distributions. In practice, the use of the Jensen-Shannon divergence posed some problems for data generation, such as:

  • Vanishing gradients (the gradients are too small for the generator to learn anything)
  • Mode collapse (the generator produces only a single kind of instance, on which the discriminator is bad at telling whether the data is fake or not)
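
The vanishing-gradient issue can be illustrated on a toy case. The sketch below assumes two discrete distributions that put all their mass on different bins (the bins and names are hypothetical): the Jensen-Shannon divergence is then equal to log 2 whatever the distance between the bins, so it gives no signal about how far the generated distribution is from the real one.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


real = [1.0, 0.0, 0.0, 0.0, 0.0]        # all real mass on bin 0
fake_near = [0.0, 1.0, 0.0, 0.0, 0.0]   # generated mass one bin away
fake_far = [0.0, 0.0, 0.0, 0.0, 1.0]    # generated mass four bins away

js_near = js_divergence(real, fake_near)
js_far = js_divergence(real, fake_far)
```

Both `js_near` and `js_far` are equal to log 2: moving the generated mass closer to the real one does not change the divergence, hence a gradient that carries no information.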

These problems motivated the use of a GAN minimizing another divergence or distance. That is how the Wasserstein GAN (WGAN for short) was conceived, in a publication by Arjovsky et al. [4]. Several things changed between the original GAN and the WGAN:

  1. a new loss function is used, based on the Wasserstein-1 distance
  2. the output of D is no longer a probability of being real, but rather an unbounded real-valued score (in ℝ), which is why the discriminator is now called a critic
  3. the optimization problem constrains the critic to be a K-Lipschitz function, which is enforced by clipping the weights of the critic, or by adding a gradient penalty in the variant known as WGAN-GP
  4. an optimizer other than Adam (RMSProp) is used, since the momentum in Adam posed convergence problems
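
The first three changes can be sketched in a few lines. This is a minimal illustration on plain lists of critic scores and weights, not a training loop; the function names are hypothetical and the clipping value 0.01 is the default we recall from the WGAN paper [4].

```python
def mean(values):
    return sum(values) / len(values)


def critic_loss(scores_real, scores_fake):
    """The critic maximizes mean(D(real)) - mean(D(fake)); we minimize the opposite."""
    return mean(scores_fake) - mean(scores_real)


def wgan_generator_loss(scores_fake):
    """The generator tries to raise the critic's score on generated samples."""
    return -mean(scores_fake)


def clip_weights(weights, c=0.01):
    """Weight clipping: keep every critic weight in [-c, c] after each update."""
    return [max(-c, min(c, w)) for w in weights]


# Scores are unbounded real numbers, not probabilities.
loss_d = critic_loss(scores_real=[1.0, 3.0], scores_fake=[0.0, 1.0])
loss_g = wgan_generator_loss(scores_fake=[0.0, 1.0])
clipped = clip_weights([0.5, -0.5, 0.001])
```

In WGAN-GP the clipping step is replaced by a penalty term added to the critic loss, which pushes the norm of the critic's gradient towards 1 on points interpolated between real and generated samples.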

The resulting GAN could be summed up in the following schema:

Figure 2: WGAN schema

 

Although we will not dive into the details of the implementation, note that it optimizes a problem equivalent to minimizing the Wasserstein-1 distance, obtained through the Kantorovich-Rubinstein duality [5], which explains both the K-Lipschitz constraint and the output domain of the critic.
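
For reference, the duality can be written as follows. The notation is ours and not taken from the cited papers: P_r and P_g denote the real and generated distributions, Π(P_r, P_g) the set of joint distributions with these marginals, and f plays the role of the critic.

```latex
W_1(\mathbb{P}_r, \mathbb{P}_g)
  = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)}
    \mathbb{E}_{(x, y) \sim \gamma}\big[ \lVert x - y \rVert \big]
  = \frac{1}{K} \sup_{\lVert f \rVert_L \le K}
    \Big( \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)]
        - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[f(\tilde{x})] \Big)
```

The supremum runs over K-Lipschitz functions with real-valued outputs, which is why the critic has to satisfy the Lipschitz constraint and why its output is a score rather than a probability.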

Application to tabular data

As explained in the “Context” section, we had a classifier and wanted to know, as a first step, whether it could learn from synthetic data and, as a second step, whether generated adversarial examples could be used to make the model more robust.

Before diving into the implementation, let’s talk about the specificities of the data. As mentioned, our constraints were close to those of the Amadeus paper: we had categorical features, as well as discrete and continuous numerical ones. We also had to avoid replicating the observed data points (otherwise we would simply have a kind of random oversampling method). Some features had multimodal distributions. Finally, we had around 100 features with significant dependencies (for example, the sum of some features must be less than another feature).
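
Two of these requirements are easy to turn into automatic checks on generated rows. The sketch below is only an illustration: the feature names `a`, `b` and `total` are hypothetical, the constraint is assumed to be non-strict (a + b <= total), and a "copy" is defined as an exact match with a real row.

```python
def violates_sum_constraint(row, parts, total):
    """True when the sum of the `parts` features exceeds the `total` feature."""
    return sum(row[name] for name in parts) > row[total]


def quality_report(real_rows, synthetic_rows, parts, total):
    """Share of synthetic rows that copy a real row or break the constraint."""
    real_set = {tuple(sorted(row.items())) for row in real_rows}
    n = len(synthetic_rows)
    copies = sum(tuple(sorted(row.items())) in real_set for row in synthetic_rows)
    broken = sum(violates_sum_constraint(row, parts, total) for row in synthetic_rows)
    return {"copy_rate": copies / n, "violation_rate": broken / n}


real = [{"a": 1, "b": 2, "total": 5}, {"a": 0, "b": 3, "total": 4}]
synthetic = [
    {"a": 1, "b": 2, "total": 5},   # exact copy of a real row
    {"a": 2, "b": 2, "total": 3},   # breaks a + b <= total
    {"a": 1, "b": 1, "total": 6},   # new and valid
    {"a": 0, "b": 2, "total": 5},   # new and valid
]
report = quality_report(real, synthetic, parts=["a", "b"], total="total")
```

On this toy example one row out of four is a copy and one row out of four breaks the dependency, so both rates are 0.25.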

The classifier was binary, so we knew that, to test a model trained on synthetic data, we could make use of the label. Regarding the label, three methods could be used:

  1. ignoring the label knowledge and generating the label like any other feature
  2. creating two generation models, one per label
  3. using the label as a generation input, as in a Conditional GAN (cGAN).

We chose the second method because of time constraints, though the third option is tempting because it is less restrictive when dealing with a multi-class classification problem.
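
The second method boils down to splitting the training set by label, fitting one generator per subset and labelling the samples according to the generator they come from. The sketch below shows that logic only: `GaussianStandIn` is a hypothetical placeholder for a trained GAN (it draws each feature from an independent Gaussian), and all names are ours.

```python
import random
import statistics


class GaussianStandIn:
    """Placeholder for a GAN generator: independent Gaussians per feature."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.params = [(statistics.mean(c), statistics.pstdev(c)) for c in columns]
        return self

    def sample(self, n, rng):
        return [[rng.gauss(mu, sd) for mu, sd in self.params] for _ in range(n)]


def fit_per_label(rows, labels, make_model):
    """Train one generation model per label value."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: make_model().fit(group) for label, group in groups.items()}


def sample_labelled(models, counts, seed=0):
    """Draw counts[label] rows from each model and attach the matching label."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, n in counts.items():
        rows.extend(models[label].sample(n, rng))
        labels.extend([label] * n)
    return rows, labels


rows = [[0.0, 1.0], [0.2, 1.2], [5.0, 9.0], [5.2, 9.2]]
labels = [0, 0, 1, 1]
models = fit_per_label(rows, labels, GaussianStandIn)
synthetic_rows, synthetic_labels = sample_labelled(models, {0: 3, 1: 2})
```

A side benefit of this design is that the class balance of the synthetic dataset is chosen explicitly through `counts`, at the cost of training as many generators as there are classes.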

For this generation problem we tested multiple GAN implementations:

  • Vanilla GAN [1]
  • Wasserstein-GAN [4]
  • Wasserstein-GAN Gradient Penalty [4]
  • Cramér GAN [3], with and without crossnet layers
  • Tabular GAN [7] (off-the-shelf implementation, see https://github.com/DAI-Lab/TGAN)

The POC objective was twofold: testing GANs on our specific problem, and implementing a generation pipeline together with performance criteria. Here is how we chose to implement our generation pipeline:

Figure 3: Generation and performances pipeline

As shown in the above figure, raw data (in XML format) is transformed into classifier features (in pickle format), which we preprocess into a GAN input. After training, the generator creates new observations in the preprocessed domain, which we pass through an inverse transformer. This maps the new observations back into the original feature domain, and the result is saved in pickle format. These generated features are then subject to quality analysis tests as well as “model-compatibility” tests, which we will define later on.
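
The core of this pipeline (preprocess, train, generate, inverse-transform) can be sketched as follows. This is a simplified illustration and not our production code: XML parsing and pickle storage are left out, the `MinMaxTransformer` scaling to [-1, 1] is an assumed preprocessing choice, and `train_dummy_generator` is a hypothetical placeholder for the GAN training step.

```python
class MinMaxTransformer:
    """Scale each numerical column to [-1, 1] and back (assumed preprocessing)."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.low = [min(col) for col in columns]
        self.high = [max(col) for col in columns]
        return self

    def transform(self, rows):
        return [
            [2 * (v - lo) / (hi - lo) - 1 if hi > lo else 0.0
             for v, lo, hi in zip(row, self.low, self.high)]
            for row in rows
        ]

    def inverse_transform(self, rows):
        return [
            [(v + 1) / 2 * (hi - lo) + lo
             for v, lo, hi in zip(row, self.low, self.high)]
            for row in rows
        ]


def run_pipeline(features, train_generator, n_samples):
    """Preprocess, train, generate, then map back to the original feature domain."""
    transformer = MinMaxTransformer().fit(features)
    gan_input = transformer.transform(features)
    generator = train_generator(gan_input)   # returns a callable: n -> rows
    generated = generator(n_samples)         # rows in the preprocessed domain
    return transformer.inverse_transform(generated)


def train_dummy_generator(gan_input):
    """Placeholder for GAN training: always emits the centre of the scaled domain."""
    n_features = len(gan_input[0])
    return lambda n: [[0.0] * n_features for _ in range(n)]


features = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
synthetic = run_pipeline(features, train_dummy_generator, n_samples=2)
```

Keeping the transformer object between the forward and the inverse step matters: the generated rows can only be mapped back correctly with the statistics fitted on the real features.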

As seen above, the generation is model-specific and the generated features may only make sense for our specific use case, which was acceptable for us. However, if you can, try to generate data as raw as possible, so that it stays compatible with more models.

The preprocessing part defined is a transformer (that can inverse-transform) of the data from an original feature domain into a neural network acceptable domain. Mainly scaling is done, but for very skewed distributions, you