MinDiff Data Preparation

View on TensorFlow.org

Run in Google Colab

View source on GitHub

Download notebook

Introduction

When implementing MinDiff, you will need to make complex decisions as you choose and shape your input before passing it on to the model. These decisions will largely determine the behavior of MinDiff within your model.

This guide will cover the technical aspects of this process, but will not discuss how to evaluate a model for fairness, or how to identify particular slices and metrics for evaluation. Please see the Fairness Indicators guidance for details on this.

To demonstrate MinDiff, this guide uses the UCI income dataset. The model task is to predict whether an individual has an income exceeding $50k, based on various personal attributes. This guide assumes there is a problematic gap in the FNR (false negative rate) between "Male" and "Female" slices and the model owner (you) has decided to apply MinDiff to address the issue. For more information on the scenarios in which one might choose to apply MinDiff, see the requirements page.

MinDiff works by penalizing the difference in distribution scores between examples in two sets of data. This guide will demonstrate how to choose and construct these additional MinDiff sets as well as how to package everything together so that it can be passed to a model for training.

Setup

pip install --upgrade tensorflow-model-remediation

import tensorflow as tf
from tensorflow_model_remediation import min_diff
from tensorflow_model_remediation.tools.tutorials_utils import uci as tutorials_utils

2024-07-19 09:12:10.760771: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-19 09:12:10.781683: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-19 09:12:10.787865: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

Original Data

For demonstration purposes and to reduce runtimes, this guide uses only a sample fraction of the UCI Income dataset. In a real production setting, the full dataset would be utilized.

# Sampled at 0.3 for reduced runtimes.
train = tutorials_utils.get_uci_data(split='train', sample=0.3)

print(len(train), 'train examples')

9768 train examples

Converting to `tf.data.Dataset`

MinDiffModel requires that the input be a tf.data.Dataset. If you were using a different format of input prior to integrating MinDiff, you will have to convert your input data.

Use tf.data.Dataset.from_tensor_slices to convert to tf.data.Dataset.

dataset = tf.data.Dataset.from_tensor_slices((x, y, weights))
dataset.shuffle(...)  # Optional.
dataset.batch(batch_size)

See Model.fit documentation for details on equivalences between the two methods of input.

In this guide, the input is downloaded as a Pandas DataFrame and therefore, needs this conversion.

# Function to convert a DataFrame into a tf.data.Dataset.
def df_to_dataset(dataframe, shuffle=True):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=5000)  # Reasonable but arbitrary buffer_size.
  return ds

# Convert the train DataFrame into a Dataset.
original_train_ds = df_to_dataset(train)

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1721380333.740706   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.745983   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.749582   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.753131   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.763393   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.766897   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.770266   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.773538   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.776460   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.779899   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.783299   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380333.786581   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.025282   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.027381   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.029430   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.032077   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.034163   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.036087   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.038088   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.040061   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.042022   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.043955   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.045985   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.047967   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.086347   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.088345   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.090375   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.092391   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.094523   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.096455   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.098382   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.100366   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.102381   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.104813   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.107188   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1721380335.109569   11983 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

Creating MinDiff data

During training, MinDiff will encourage the model to reduce differences in predictions between two additional datasets (which may include examples from the original dataset). The selection of these two datasets is the key decision which will determine the effect MinDiff has on the model.

The two datasets should be picked such that the disparity in performance that you are trying to remediate is evident and well-represented. Since the goal is to reduce a gap in FNR between "Male" and "Female" slices, this means creating one dataset with only positively labeled "Male" examples and another with only positively labeled "Female" examples; these will be the MinDiff datasets.

First, examine the data present.

female_pos = train[(train['sex'] == ' Female') & (train['target'] == 1)]
male_pos = train[(train['sex'] == ' Male') & (train['target'] == 1)]
print(len(female_pos), 'positively labeled female examples')
print(len(male_pos), 'positively labeled male examples')

385 positively labeled female examples
2063 positively labeled male examples

It is perfectly acceptable to create MinDiff datasets from subsets of the original dataset.

While there aren't 5,000 or more positive "Male" examples as recommended in the requirements guidance, there are over 2,000 and it is reasonable to try with that many before collecting more data.

min_diff_male_ds = df_to_dataset(male_pos)

Positive "Female" examples, however, are much scarcer at 385. This is probably too small for good performance and so will require pulling in additional examples.

full_uci_train = tutorials_utils.get_uci_data(split='train')
augmented_female_pos = full_uci_train[((full_uci_train['sex'] == ' Female') &
                                       (full_uci_train['target'] == 1))]
print(len(augmented_female_pos), 'positively labeled female examples')

1179 positively labeled female examples

Using the full dataset has more than tripled the number of examples that can be used for MinDiff. It’s still low but it is enough to try as a first pass.

min_diff_female_ds = df_to_dataset(augmented_female_pos)

Both the MinDiff datasets are significantly smaller than the recommended 5,000 or more examples. While it is reasonable to attempt to apply MinDiff with the current data, you may need to consider collecting additional data if you observe poor performance or overfitting during training.

Using `tf.data.Dataset.filter`

Alternatively, you can create the two MinDiff datasets directly from the converted original Dataset.

# Male
def male_predicate(x, y):
  return tf.equal(x['sex'], b' Male') and tf.equal(y, 0)

alternate_min_diff_male_ds = original_train_ds.filter(male_predicate).cache()

# Female
def female_predicate(x, y):
  return tf.equal(x['sex'], b' Female') and tf.equal(y, 0)

full_uci_train_ds = df_to_dataset(full_uci_train)
alternate_min_diff_female_ds = full_uci_train_ds.filter(female_predicate).cache()

The resulting alternate_min_diff_male_ds and alternate_min_diff_female_ds will be equivalent in output to min_diff_male_ds and min_diff_female_ds respectively.

Constructing your Training Dataset

As a final step, the three datasets (the two newly created ones and the original) need to be merged into a single dataset that can be passed to the model.

Batching the datasets

Before merging, the datasets need to batched.

The original dataset can use the same batching that was used before integrating MinDiff.
The MinDiff datasets do not need to have the same batch size as the original dataset. In all likelihood, a smaller one will perform just as well. While they don't even need to have the same batch size as each other, it is recommended to do so for best performance.

While not strictly necessary, it is recommended to use drop_remainder=True for the two MinDiff datasets as this will ensure that they have consistent batch sizes.

original_train_ds = original_train_ds.batch(128)  # Same as before MinDiff.

# The MinDiff datasets can have a different batch_size from original_train_ds
min_diff_female_ds = min_diff_female_ds.batch(32, drop_remainder=True)
# Ideally we use the same batch size for both MinDiff datasets.
min_diff_male_ds = min_diff_male_ds.batch(32, drop_remainder=True)

Packing the Datasets with `pack_min_diff_data`

Once the datasets are prepared, pack them into a single dataset which will then be passed along to the model. A single batch from the resulting dataset will contain one batch from each of the three datasets you prepared previously.

You can do this by using the provided utils function in the tensorflow_model_remediation package:

train_with_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(
    original_dataset=original_train_ds,
    sensitive_group_dataset=min_diff_female_ds,
    nonsensitive_group_dataset=min_diff_male_ds)

And that's it! You will be able to use other util functions in the package to unpack individual batches if needed.

for inputs, original_labels in train_with_min_diff_ds.take(1):
  # Unpacking min_diff_data
  min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)
  min_diff_examples, min_diff_membership = min_diff_data
  # Unpacking original data
  original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)

With your newly formed data, you are now ready to apply MinDiff in your model! To learn how this is done, please take a look at the other guides starting with Integrating MinDiff with MinDiffModel.

Using a Custom Packing Format (optional)

You may decide to pack the three datasets together in whatever way you choose. The only requirement is that you will need to ensure the model knows how to interpret the data. The default implementation of MinDiffModel assumes that the data was packed using min_diff.keras.utils.pack_min_diff_data.

One easy way to format your input as you want is to transform the data as a final step after you have used min_diff.keras.utils.pack_min_diff_data.

# Reformat input to be a dict.
def _reformat_input(inputs, original_labels):
  unpacked_min_diff_data = min_diff.keras.utils.unpack_min_diff_data(inputs)
  unpacked_original_inputs = min_diff.keras.utils.unpack_original_inputs(inputs)

  return {
      'min_diff_data': unpacked_min_diff_data,
      'original_data': (unpacked_original_inputs, original_labels)}

customized_train_with_min_diff_ds = train_with_min_diff_ds.map(_reformat_input)

Your model will need to know how to read this customized input as detailed in the Customizing MinDiffModel guide.

for batch in customized_train_with_min_diff_ds.take(1):
  # Customized unpacking of min_diff_data
  min_diff_data = batch['min_diff_data']
  # Customized unpacking of original_data
  original_data = batch['original_data']

Additional Resources

For an in depth discussion on fairness evaluation see the Fairness Indicators guidance
For general information on Remediation and MinDiff, see the remediation overview.
For details on requirements surrounding MinDiff see this guide.
To see an end-to-end tutorial on using MinDiff in Keras, see this tutorial.

Utility Functions for other Guides

This guide outlines the process and decision making that you can follow whenever applying MinDiff. The rest of the guides build off this framework. To make this easier, logic found in this guide has been factored out into helper functions:

get_uci_data: This function is already used in this guide. It returns a DataFrame containing the UCI income data from the indicated split sampled at whatever rate is indicated (100% if unspecified).
df_to_dataset: This function converts a DataFrame into a tf.data.Dataset as detailed in this guide with the added functionality of being able to pass the batch_size as a parameter.
get_uci_with_min_diff_dataset: This function returns a tf.data.Dataset containing both the original data and the MinDiff data packed together using the Model Remediation Library util functions as described in this guide.

The rest of the guides will build off of these to show how to use other parts of the library.