Private Heavy Hitters

This tutorial shows how to use the tff.analytics.heavy_hitters.iblt.build_iblt_computation API to build a federated analytics computation to discover the most frequent strings (private heavy hitters) in the population.

Environment Setup

Please run the following to make sure that your environment is correctly set up. If you don't see a greeting, please refer to the Installation guide for instructions.

# tensorflow_federated_nightly also brings in tf_nightly, which
# can cause a duplicate tensorboard install, leading to errors.
pip install --quiet tensorflow-text-nightly
pip install --quiet --upgrade tensorflow-federated
import collections

import numpy as np
import tensorflow as tf
import tensorflow_federated as tff
import tensorflow_text as tf_text

np.random.seed(0)
tff.backends.test.set_sync_test_cpp_execution_context()

tff.federated_computation(lambda: 'Hello, World!')()
b'Hello, World!'

Background: Private Heavy Hitters in Federated Analytics

Consider the following setting: Each client has a list of strings, and each string is from an open set, which means it could be arbitrary. The goal is to discover the most popular strings (heavy hitters) and their counts privately in a federated setting. This colab demonstrates a solution to this problem with the following privacy properties:

  • Secure aggregation: Computes the aggregated string counts such that it should not be possible for the server to learn any client's individual value. See tff.federated_secure_sum for more information.
  • Differential privacy (DP): A widely used method for bounding and quantifying the privacy leakage of sensitive data in analytics. You can apply user-level central DP to the heavy hitter results.

The secure aggregation API tff.federated_secure_sum supports linear sums of integer vectors. If the strings are from a closed set of size n, then it is easy to encode each client's strings as a vector of size n: let the value at index i of the vector be the count of the ith string in the closed set. Then you can securely sum the vectors of all clients to get the counts of strings in the whole population. However, if the strings are from an open set, it is not obvious how to encode them properly for secure sum. In this work, you can encode the strings into Invertible Bloom Lookup Tables (IBLT), a probabilistic data structure that can efficiently encode items in a large (or open) domain. IBLT sketches can be linearly summed, so they are compatible with secure sum.
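
To make the closed-set case concrete, here is a minimal sketch (plain NumPy, with a made-up vocabulary of size n = 3) of how each client's strings can be encoded as a fixed-length count vector. Because the encoding is linear, summing the per-client vectors, which is exactly what secure sum provides, yields the population counts:

import numpy as np

# A hypothetical closed vocabulary of size n = 3.
vocab = ['to', 'the', 'and']

def encode_counts(client_words):
  """Encodes a client's word list as a length-n count vector."""
  vec = np.zeros(len(vocab), dtype=np.int64)
  for word in client_words:
    vec[vocab.index(word)] += 1
  return vec

client_vectors = [
    encode_counts(['to', 'the', 'the']),
    encode_counts(['and', 'to']),
]
# Summing the per-client vectors gives the per-word counts over the
# whole population: [2 2 1].
print(np.sum(client_vectors, axis=0))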

You can use tff.analytics.heavy_hitters.iblt.build_iblt_computation to create a TFF computation that encodes each client's local strings into an IBLT structure. These structures are securely summed via a cryptographic multi-party computation protocol into an aggregated IBLT structure, which the server can then decode to return the top heavy hitters. The following sections show how to use this API to create a TFF computation and run simulations with the Shakespeare dataset.
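
To build intuition for the encode-sum-decode flow before using the TFF API, here is a heavily simplified, pure-Python toy IBLT. The cell count, the number of hash functions, and the cell layout here are made up for illustration; there is no checksum field, and this is not the data structure tff.analytics.heavy_hitters.iblt actually uses. The point is only to show that cell-wise (linear) summation of per-client sketches is enough for the server to recover the combined word counts:

import hashlib

# Made-up toy parameters; a real IBLT sizes its table from the capacity.
_NUM_CELLS = 50
_NUM_HASHES = 3

def _encode_word(word):
  """Maps a word to a positive integer."""
  return int.from_bytes(word.encode('utf-8'), 'big')

def _decode_word(value):
  """Inverse of _encode_word; returns None for non-UTF-8 byte patterns."""
  try:
    return value.to_bytes((value.bit_length() + 7) // 8, 'big').decode('utf-8')
  except UnicodeDecodeError:
    return None

def _cell_indices(word):
  """Deterministically maps a word to a few cell indices."""
  digest = hashlib.sha256(word.encode('utf-8')).digest()
  return sorted({digest[i] % _NUM_CELLS for i in range(_NUM_HASHES)})

def encode_iblt(words):
  """Encodes a client's words into a sketch of [count, key_sum] cells."""
  sketch = [[0, 0] for _ in range(_NUM_CELLS)]
  for word in words:
    for idx in _cell_indices(word):
      sketch[idx][0] += 1
      sketch[idx][1] += _encode_word(word)
  return sketch

def sum_iblts(a, b):
  """Cell-wise sum: the only aggregation the server needs (a linear op)."""
  return [[x + y for x, y in zip(cell_a, cell_b)]
          for cell_a, cell_b in zip(a, b)]

def decode_iblt(sketch):
  """Recovers (word, count) pairs by peeling 'pure' cells (in place)."""
  decoded = {}
  progress = True
  while progress:
    progress = False
    for idx, (count, key_sum) in enumerate(sketch):
      if count <= 0 or key_sum <= 0 or key_sum % count != 0:
        continue
      word = _decode_word(key_sum // count)
      if word is None or idx not in _cell_indices(word):
        continue  # Not a pure cell; it may become pure after more peeling.
      decoded[word] = count
      for j in _cell_indices(word):  # Remove the word from all its cells.
        sketch[j][0] -= count
        sketch[j][1] -= count * _encode_word(word)
      progress = True
  # Decoding can fail (yield a partial result) if too many words collide;
  # in the real IBLT the capacity parameter controls this probability.
  return decoded

client_a = encode_iblt(['to', 'the', 'the'])
client_b = encode_iblt(['and', 'to'])
print(decode_iblt(sum_iblts(client_a, client_b)))
# Expected (up to ordering): {'to': 2, 'the': 2, 'and': 1}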

Load and Preprocess the Federated Shakespeare Data

The Shakespeare dataset contains lines spoken by characters in Shakespeare plays. In this example, a subset of the characters (that is, clients) is selected. A preprocessor converts each character's lines into a list of strings, and any string that is only punctuation or symbols is dropped.

# Load the simulation data.
source, _ = tff.simulation.datasets.shakespeare.load_data()
# Preprocessing function to tokenize a line into words.
def tokenize(ds):
  """Tokenizes a line into words with alphanum characters."""
  def extract_strings(example):
    return tf.expand_dims(example['snippets'], 0)

  def tokenize_line(line):
    return tf.data.Dataset.from_tensor_slices(tokenizer.tokenize(line)[0])

  def mask_all_symbolic_words(word):
    return tf.math.logical_not(
        tf_text.wordshape(word, tf_text.WordShape.IS_PUNCT_OR_SYMBOL))

  tokenizer = tf_text.WhitespaceTokenizer()
  ds = ds.map(extract_strings)
  ds = ds.flat_map(tokenize_line)
  ds = ds.map(tf_text.case_fold_utf8)
  ds = ds.filter(mask_all_symbolic_words)
  return ds

batch_size = 5

def client_data(n: int) -> tf.data.Dataset:
  return tokenize(source.create_tf_dataset_for_client(
      source.client_ids[n])).batch(batch_size)

# Pick a subset of client devices to participate in the computation.
dataset = [client_data(n) for n in range(10)]

Simulations

To run simulations to discover the most popular words (heavy hitters) in the Shakespeare dataset, first you need to create a TFF computation using the tff.analytics.heavy_hitters.iblt.build_iblt_computation API with the following parameters:

  • capacity: The capacity of the IBLT sketch. This number should be roughly the total number of unique strings that could appear in one round of computation. Defaults to 1000. If this number is too small, the decoding could fail due to collision of hashed values. If this number is too large, it would consume more memory than necessary.
  • string_max_bytes: The maximum length, in bytes, of a string in the IBLT. Defaults to 10. Must be positive. Strings longer than string_max_bytes are truncated.
  • max_words_per_user: The maximum number of strings each client is allowed to contribute. If not None, must be a positive integer. Defaults to None, which means every client contributes all of its strings.
  • max_heavy_hitters: The maximum number of items to return. If the decoded results contain more than this number of items, they are sorted in decreasing order of estimated count and the top max_heavy_hitters items are returned. Defaults to None, which means to return all the heavy hitters in the result.
  • secure_sum_bitwidth: The bitwidth used for secure sum. The default value is None, which disables secure sum. If not None, must be in the range [1,62]. See tff.federated_secure_sum.
  • multi_contribution: Whether each client is allowed to contribute multiple counts or only a count of one for each unique word. Defaults to True. Setting this to False can improve utility when differential privacy is required.
  • batch_size: The number of elements in each batch of the dataset. Defaults to 1, which means the input dataset is processed by tf.data.Dataset.batch(1). Must be a positive integer.
max_words_per_user = 8
iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    max_heavy_hitters=10,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

Now you are ready to run simulations with the TFF computation iblt_computation and the preprocessed input dataset. The output of iblt_computation has four attributes:

  • clients: A scalar number of clients that participated in the computation.
  • heavy_hitters: A list of aggregated heavy hitters.
  • heavy_hitters_counts: A list of the counts of aggregated heavy hitters.
  • num_not_decoded: A scalar number of strings that are not successfully decoded.
def run_simulation(one_round_computation: tff.Computation, dataset):
  output = one_round_computation(dataset)
  heavy_hitters = [
      word.decode('utf-8', 'ignore') for word in output.heavy_hitters
  ]
  results = dict(zip(heavy_hitters, output.heavy_hitters_counts))
  return output.clients, results
clients, result = run_simulation(iblt_computation, dataset)
print(f'Number of participating clients: {clients}')
print('Discovered heavy hitters and counts:')
print(result)
Number of participating clients: 10
Discovered heavy hitters and counts:
{'to': 8, 'the': 8, 'and': 7, 'you': 4, 'i': 4, 'a': 3, 'he': 3, 'your': 3, 'is': 3, 'of': 2}

Private Heavy Hitters with Differential Privacy

To obtain private heavy hitters with central DP, a DP mechanism for open-set histograms is applied. The idea is to add noise to the counts of strings in the aggregated histogram, then keep only the strings with counts above a certain threshold. The noise and the threshold depend on the (epsilon, delta)-DP budget; see this doc for the detailed algorithm and proof. The noisy counts are rounded to integers as a post-processing step, which does not weaken the DP guarantee. Note that you will discover fewer heavy hitters when DP is required, because the thresholding step filters out strings with low counts.

iblt_computation = tff.analytics.heavy_hitters.iblt.build_iblt_computation(
    capacity=100,
    string_max_bytes=20,
    max_words_per_user=max_words_per_user,
    secure_sum_bitwidth=32,
    multi_contribution=False,
    batch_size=batch_size)

clients, result = run_simulation(iblt_computation, dataset)
# DP parameters
eps = 20
delta = 0.01

# Calculating scale for Laplace noise
scale = max_words_per_user / eps

# Calculating the threshold
tau = 1 + (max_words_per_user / eps) * np.log(max_words_per_user / (2 * delta))

result_with_dp = {}
for word in result:
  noised_count = result[word] + np.random.laplace(scale=scale)
  if noised_count >= tau:
    result_with_dp[word] = int(noised_count)
print('Discovered heavy hitters and counts with central DP:')
print(result_with_dp)
Discovered heavy hitters and counts with central DP:
{'the': 8, 'you': 4, 'to': 7, 'tear': 3, 'and': 7, 'i': 3}
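
As a sanity check on the thresholding step: with the parameters above, the Laplace noise scale is max_words_per_user / eps = 8 / 20 = 0.4, and the threshold is tau = 1 + 0.4 * ln(8 / (2 * 0.01)) ≈ 3.4. Only words whose noised counts reach roughly 3.4 survive, which is why the result with central DP contains fewer heavy hitters than the non-private result above.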