Understanding TensorFlow Distributions Shapes

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook
import collections

import tensorflow as tf
tf.compat.v2.enable_v2_behavior()

import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

Basics

There are three important concepts associated with TensorFlow Distributions shapes:

  • Event shape describes the shape of a single draw from the distribution; it may be dependent across dimensions. For scalar distributions, the event shape is []. For a 5-dimensional MultivariateNormal, the event shape is [5].
  • Batch shape describes independent, not identically distributed draws, aka a "batch" of distributions.
  • Sample shape describes independent, identically distributed draws of batches from the distribution family.

The event shape and the batch shape are properties of a Distribution object, whereas the sample shape is associated with a specific call to sample or log_prob.

This notebook's purpose is to illustrate these concepts through examples, so if this isn't immediately obvious, don't worry!

For another conceptual overview of these concepts, see this blog post.

A note on TensorFlow Eager.

This entire notebook is written using TensorFlow Eager. None of the concepts presented rely on Eager, although with Eager, distribution batch and event shapes are evaluated (and therefore known) when the Distribution object is created in Python, whereas in graph (non-Eager mode), it is possible to define distributions whose event and batch shapes are undetermined until the graph is run.

Scalar Distributions

As we noted above, a Distribution object has defined event and batch shapes. We'll start with a utility to describe distributions:

def describe_distributions(distributions):
  print('\n'.join([str(d) for d in distributions]))

In this section we'll explore scalar distributions: distributions with an event shape of []. A typical example is the Poisson distribution, specified by a rate:

poisson_distributions = [
    tfd.Poisson(rate=1., name='One Poisson Scalar Batch'),
    tfd.Poisson(rate=[1., 10., 100.], name='Three Poissons'),
    tfd.Poisson(rate=[[1., 10., 100.,], [2., 20., 200.]],
                name='Two-by-Three Poissons'),
    tfd.Poisson(rate=[1.], name='One Poisson Vector Batch'),
    tfd.Poisson(rate=[[1.]], name='One Poisson Expanded Batch')
]

describe_distributions(poisson_distributions)
tfp.distributions.Poisson("One_Poisson_Scalar_Batch", batch_shape=[], event_shape=[], dtype=float32)
tfp.distributions.Poisson("Three_Poissons", batch_shape=[3], event_shape=[], dtype=float32)
tfp.distributions.Poisson("Two_by_Three_Poissons", batch_shape=[2, 3], event_shape=[], dtype=float32)
tfp.distributions.Poisson("One_Poisson_Vector_Batch", batch_shape=[1], event_shape=[], dtype=float32)
tfp.distributions.Poisson("One_Poisson_Expanded_Batch", batch_shape=[1, 1], event_shape=[], dtype=float32)

The Poisson distribution is a scalar distribution, so its event shape is always []. If we specify more rates, these show up in the batch shape. The final pair of examples is interesting: there's only a single rate, but because that rate is embedded in a numpy array with non-empty shape, that shape becomes the batch shape.

The standard Normal distribution is also a scalar. It's event shape is [], just like for the Poisson, but we'll play with it to see our first example of broadcasting. The Normal is specified using loc and scale parameters:

normal_distributions = [
    tfd.Normal(loc=0., scale=1., name='Standard'),
    tfd.Normal(loc=[0.], scale=1., name='Standard Vector Batch'),
    tfd.Normal(loc=[0., 1., 2., 3.], scale=1., name='Different Locs'),
    tfd.Normal(loc=[0., 1., 2., 3.], scale=[[1.], [5.]],
               name='Broadcasting Scale')
]

describe_distributions(normal_distributions)
tfp.distributions.Normal("Standard", batch_shape=[], event_shape=[], dtype=float32)
tfp.distributions.Normal("Standard_Vector_Batch", batch_shape=[1], event_shape=[], dtype=float32)
tfp.distributions.Normal("Different_Locs", batch_shape=[4], event_shape=[], dtype=float32)
tfp.distributions.Normal("Broadcasting_Scale", batch_shape=[2, 4], event_shape=[], dtype=float32)

The interesting example above is the Broadcasting Scale distribution. The loc parameter has shape [4], and the scale parameter has shape [2, 1]. Using Numpy broadcasting rules, the batch shape is [2, 4]. An equivalent (but less elegant and not-recommended) way to define the "Broadcasting Scale" distribution would be:

describe_distributions(
    [tfd.Normal(loc=[[0., 1., 2., 3], [0., 1., 2., 3.]],
                scale=[[1., 1., 1., 1.], [5., 5., 5., 5.]])])
tfp.distributions.Normal("Normal", batch_shape=[2, 4], event_shape=[], dtype=float32)

We can see why the broadcasting notation is useful, although it's also a source of headaches and bugs.

Sampling Scalar Distributions

There are two main things we can do with distributions: we can sample from them and we can compute log_probs. Let's explore sampling first. The basic rule is that when we sample from a distribution, the resulting Tensor has shape [sample_shape, batch_shape, event_shape], where batch_shape and event_shape are provided by the Distribution object, and sample_shape is provided by the call to sample. For scalar distributions, event_shape = [], so the Tensor returned from sample will have shape [sample_shape, batch_shape]. Let's try it:

def describe_sample_tensor_shape(sample_shape, distribution):
    print('Sample shape:', sample_shape)
    print('Returned sample tensor shape:',
          distribution.sample(sample_shape).shape)

def describe_sample_tensor_shapes(distributions, sample_shapes):
    started = False
    for distribution in distributions:
      print(distribution)
      for sample_shape in sample_shapes:
        describe_sample_tensor_shape(sample_shape, distribution)
      print()

sample_shapes = [1, 2, [1, 5], [3, 4, 5]]
describe_sample_tensor_shapes(poisson_distributions, sample_shapes)
tfp.distributions.Poisson("One_Poisson_Scalar_Batch", batch_shape=[], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1,)
Sample shape: 2
Returned sample tensor shape: (2,)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5)

tfp.distributions.Poisson("Three_Poissons", batch_shape=[3], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 3)
Sample shape: 2
Returned sample tensor shape: (2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 3)

tfp.distributions.Poisson("Two_by_Three_Poissons", batch_shape=[2, 3], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 2, 3)
Sample shape: 2
Returned sample tensor shape: (2, 2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 2, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 2, 3)

tfp.distributions.Poisson("One_Poisson_Vector_Batch", batch_shape=[1], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 1)
Sample shape: 2
Returned sample tensor shape: (2, 1)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 1)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 1)

tfp.distributions.Poisson("One_Poisson_Expanded_Batch", batch_shape=[1, 1], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 1, 1)
Sample shape: 2
Returned sample tensor shape: (2, 1, 1)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 1, 1)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 1, 1)
describe_sample_tensor_shapes(normal_distributions, sample_shapes)
tfp.distributions.Normal("Standard", batch_shape=[], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1,)
Sample shape: 2
Returned sample tensor shape: (2,)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5)

tfp.distributions.Normal("Standard_Vector_Batch", batch_shape=[1], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 1)
Sample shape: 2
Returned sample tensor shape: (2, 1)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 1)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 1)

tfp.distributions.Normal("Different_Locs", batch_shape=[4], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 4)
Sample shape: 2
Returned sample tensor shape: (2, 4)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 4)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 4)

tfp.distributions.Normal("Broadcasting_Scale", batch_shape=[2, 4], event_shape=[], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 2, 4)
Sample shape: 2
Returned sample tensor shape: (2, 2, 4)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 2, 4)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 2, 4)

That's about all there is to say about sample: returned sample tensors have shape [sample_shape, batch_shape, event_shape].

Computing log_prob For Scalar Distributions

Now let's take a look at log_prob, which is somewhat trickier. log_prob takes as input a (non-empty) tensor representing the location(s) at which to compute the log_prob for the distribution. In the most straightforward case, this tensor will have a shape of the form [sample_shape, batch_shape, event_shape], where batch_shape and event_shape match the batch and event shapes of the distribution. Recall once more that for scalar distributions, event_shape = [], so the input tensor has shape [sample_shape, batch_shape] In this case, we get back a tensor of shape [sample_shape, batch_shape]:

three_poissons = tfd.Poisson(rate=[1., 10., 100.], name='Three Poissons')
three_poissons
<tfp.distributions.Poisson 'Three_Poissons' batch_shape=[3] event_shape=[] dtype=float32>
three_poissons.log_prob([[1., 10., 100.], [100., 10., 1]])  # sample_shape is [2].
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -2.0785608,   -3.2223587],
       [-364.73938  ,   -2.0785608,  -95.39484  ]], dtype=float32)>
three_poissons.log_prob([[[[1., 10., 100.], [100., 10., 1.]]]])  # sample_shape is [1, 1, 2].
<tf.Tensor: shape=(1, 1, 2, 3), dtype=float32, numpy=
array([[[[  -1.       ,   -2.0785608,   -3.2223587],
         [-364.73938  ,   -2.0785608,  -95.39484  ]]]], dtype=float32)>

Note how in the first example, the input and output have shape [2, 3] and in the second example they have shape [1, 1, 2, 3].

That would be all there was to say, if it weren't for broadcasting. Here are the rules once we take broadcasting into account. We describe it in full generality and note simplifications for scalar distributions:

  1. Define n = len(batch_shape) + len(event_shape). (For scalar distributions, len(event_shape)=0.)
  2. If the input tensor t has fewer than n dimensions, pad its shape by adding dimensions of size 1 on the left until it has exactly n dimensions. Call the resulting tensor t'.
  3. Broadcast the n rightmost dimensions of t' against the [batch_shape, event_shape] of the distribution you're computing a log_prob for. In more detail: for the dimensions where t' already matches the distribution, do nothing, and for the dimensions where t' has a singleton, replicate that singleton the appropriate number of times. Any other situation is an error. (For scalar distributions, we only broadcast against batch_shape, since event_shape = [].)
  4. Now we're finally able to compute the log_prob. The resulting tensor will have shape [sample_shape, batch_shape], where sample_shape is defined to be any dimensions of t or t' to the left of the n-rightmost dimensions: sample_shape = shape(t)[:-n].

This might be a mess if you don't know what it means, so let's work some examples:

three_poissons.log_prob([10.])
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([-16.104412 ,  -2.0785608, -69.05272  ], dtype=float32)>

The tensor [10.] (with shape [1]) is broadcast across the batch_shape of 3, so we evaluate all three Poissons' log probability at the value 10.

three_poissons.log_prob([[[1.], [10.]], [[100.], [1000.]]])
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[-1.0000000e+00, -7.6974149e+00, -9.5394836e+01],
        [-1.6104412e+01, -2.0785608e+00, -6.9052719e+01]],

       [[-3.6473938e+02, -1.4348087e+02, -3.2223587e+00],
        [-5.9131279e+03, -3.6195427e+03, -1.4069575e+03]]], dtype=float32)>

In the above example, the input tensor has shape [2, 2, 1], while the distributions object has a batch shape of 3. So for each of the [2, 2] sample dimensions, the single value provided gets broadcats to each of the three Poissons.

A possibly useful way to think of it: because three_poissons has batch_shape = [2, 3], a call to log_prob must take a Tensor whose last dimension is either 1 or 3; anything else is an error. (The numpy broadcasting rules treat the special case of a scalar as being totally equivalent to a Tensor of shape [1].)

Let's test our chops by playing with the more complex Poisson distribution with batch_shape = [2, 3]:

poisson_2_by_3 = tfd.Poisson(
    rate=[[1., 10., 100.,], [2., 20., 200.]],
    name='Two-by-Three Poissons')
poisson_2_by_3.log_prob(1.)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [  -1.3068528,  -17.004269 , -194.70169  ]], dtype=float32)>
poisson_2_by_3.log_prob([1.])  # Exactly equivalent to above, demonstrating the scalar special case.
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [  -1.3068528,  -17.004269 , -194.70169  ]], dtype=float32)>
poisson_2_by_3.log_prob([[1., 1., 1.], [1., 1., 1.]])  # Another way to write the same thing. No broadcasting.
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [  -1.3068528,  -17.004269 , -194.70169  ]], dtype=float32)>
poisson_2_by_3.log_prob([[1., 10., 100.]])  # Input is [1, 3] broadcast to [2, 3].
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ -1.       ,  -2.0785608,  -3.2223587],
       [ -1.3068528,  -5.14709  , -33.90767  ]], dtype=float32)>
poisson_2_by_3.log_prob([[1., 10., 100.], [1., 10., 100.]])  # Equivalent to above. No broadcasting.
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[ -1.       ,  -2.0785608,  -3.2223587],
       [ -1.3068528,  -5.14709  , -33.90767  ]], dtype=float32)>
poisson_2_by_3.log_prob([[1., 1., 1.], [2., 2., 2.]])  # No broadcasting.
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [  -1.3068528,  -14.701683 , -190.09653  ]], dtype=float32)>
poisson_2_by_3.log_prob([[1.], [2.]])  # Equivalent to above. Input shape [2, 1] broadcast to [2, 3].
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [  -1.3068528,  -14.701683 , -190.09653  ]], dtype=float32)>

The above examples involved broadcasting over the batch, but the sample shape was empty. Suppose we have a collection of values, and we want to get the log probability of each value at each point in the batch. We could do it manually:

poisson_2_by_3.log_prob([[[1., 1., 1.], [1., 1., 1.]], [[2., 2., 2.], [2., 2., 2.]]])  # Input shape [2, 2, 3].
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[  -1.       ,   -7.697415 ,  -95.39484  ],
        [  -1.3068528,  -17.004269 , -194.70169  ]],

       [[  -1.6931472,   -6.087977 ,  -91.48282  ],
        [  -1.3068528,  -14.701683 , -190.09653  ]]], dtype=float32)>

Or we could let broadcasting handle the last batch dimension:

poisson_2_by_3.log_prob([[[1.], [1.]], [[2.], [2.]]])  # Input shape [2, 2, 1].
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[  -1.       ,   -7.697415 ,  -95.39484  ],
        [  -1.3068528,  -17.004269 , -194.70169  ]],

       [[  -1.6931472,   -6.087977 ,  -91.48282  ],
        [  -1.3068528,  -14.701683 , -190.09653  ]]], dtype=float32)>

We can also (perhaps somewhat less naturally) let broadcasting handle just the first batch dimension:

poisson_2_by_3.log_prob([[[1., 1., 1.]], [[2., 2., 2.]]])  # Input shape [2, 1, 3].
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[  -1.       ,   -7.697415 ,  -95.39484  ],
        [  -1.3068528,  -17.004269 , -194.70169  ]],

       [[  -1.6931472,   -6.087977 ,  -91.48282  ],
        [  -1.3068528,  -14.701683 , -190.09653  ]]], dtype=float32)>

Or we could let broadcasting handle both batch dimensions:

poisson_2_by_3.log_prob([[[1.]], [[2.]]])  # Input shape [2, 1, 1].
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[  -1.       ,   -7.697415 ,  -95.39484  ],
        [  -1.3068528,  -17.004269 , -194.70169  ]],

       [[  -1.6931472,   -6.087977 ,  -91.48282  ],
        [  -1.3068528,  -14.701683 , -190.09653  ]]], dtype=float32)>

The above worked fine when we had only two values we wanted, but suppose we had a long list of values we wanted to evaluate at every batch point. For that, the following notation, which adds extra dimensions of size 1 to the right side of the shape, is extremely useful:

poisson_2_by_3.log_prob(tf.constant([1., 2.])[..., tf.newaxis, tf.newaxis])
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[  -1.       ,   -7.697415 ,  -95.39484  ],
        [  -1.3068528,  -17.004269 , -194.70169  ]],

       [[  -1.6931472,   -6.087977 ,  -91.48282  ],
        [  -1.3068528,  -14.701683 , -190.09653  ]]], dtype=float32)>

This is an instance of strided slice notation, which is worth knowing.

Going back to three_poissons for completeness, the same example looks like:

three_poissons.log_prob([[1.], [10.], [50.], [100.]])
<tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [ -16.104412 ,   -2.0785608,  -69.05272  ],
       [-149.47777  ,  -43.34851  ,  -18.219261 ],
       [-364.73938  , -143.48087  ,   -3.2223587]], dtype=float32)>
three_poissons.log_prob(tf.constant([1., 10., 50., 100.])[..., tf.newaxis])  # Equivalent to above.
<tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[  -1.       ,   -7.697415 ,  -95.39484  ],
       [ -16.104412 ,   -2.0785608,  -69.05272  ],
       [-149.47777  ,  -43.34851  ,  -18.219261 ],
       [-364.73938  , -143.48087  ,   -3.2223587]], dtype=float32)>

Multivariate distributions

We now turn to multivariate distributions, which have non-empty event shape. Let's look at multinomial distributions.

multinomial_distributions = [
    # Multinomial is a vector-valued distribution: if we have k classes,
    # an individual sample from the distribution has k values in it, so the
    # event_shape is `[k]`.
    tfd.Multinomial(total_count=100., probs=[.5, .4, .1],
                    name='One Multinomial'),
    tfd.Multinomial(total_count=[100., 1000.], probs=[.5, .4, .1],
                    name='Two Multinomials Same Probs'),
    tfd.Multinomial(total_count=100., probs=[[.5, .4, .1], [.1, .2, .7]],
                    name='Two Multinomials Same Counts'),
    tfd.Multinomial(total_count=[100., 1000.],
                    probs=[[.5, .4, .1], [.1, .2, .7]],
                    name='Two Multinomials Different Everything')

]

describe_distributions(multinomial_distributions)
tfp.distributions.Multinomial("One_Multinomial", batch_shape=[], event_shape=[3], dtype=float32)
tfp.distributions.Multinomial("Two_Multinomials_Same_Probs", batch_shape=[2], event_shape=[3], dtype=float32)
tfp.distributions.Multinomial("Two_Multinomials_Same_Counts", batch_shape=[2], event_shape=[3], dtype=float32)
tfp.distributions.Multinomial("Two_Multinomials_Different_Everything", batch_shape=[2], event_shape=[3], dtype=float32)

Note how in the last three examples, the batch_shape is always [2], but we can use broadcasting to either have a shared total_count or a shared probs (or neither), because under the hood they are broadcast to have the same shape.

Sampling is straightforward, given what we know already:

describe_sample_tensor_shapes(multinomial_distributions, sample_shapes)
tfp.distributions.Multinomial("One_Multinomial", batch_shape=[], event_shape=[3], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 3)
Sample shape: 2
Returned sample tensor shape: (2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 3)

tfp.distributions.Multinomial("Two_Multinomials_Same_Probs", batch_shape=[2], event_shape=[3], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 2, 3)
Sample shape: 2
Returned sample tensor shape: (2, 2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 2, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 2, 3)

tfp.distributions.Multinomial("Two_Multinomials_Same_Counts", batch_shape=[2], event_shape=[3], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 2, 3)
Sample shape: 2
Returned sample tensor shape: (2, 2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 2, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 2, 3)

tfp.distributions.Multinomial("Two_Multinomials_Different_Everything", batch_shape=[2], event_shape=[3], dtype=float32)
Sample shape: 1
Returned sample tensor shape: (1, 2, 3)
Sample shape: 2
Returned sample tensor shape: (2, 2, 3)
Sample shape: [1, 5]
Returned sample tensor shape: (1, 5, 2, 3)
Sample shape: [3, 4, 5]
Returned sample tensor shape: (3, 4, 5, 2, 3)

Computing log probabilities is equally straightforward. Let's work an example with diagonal Multivariate Normal distributions. (Multinomials are not very broadcast friendly, since the constraints on the counts and probabilities mean broadcasting will often produce inadmissible values.) We'll use a batch of 2 3-dimensional distributions with the same mean but different scales (standard deviations):

two_multivariate_normals = tfd.MultivariateNormalDiag(loc=[1., 2., 3.], scale_diag=tf.ones([2, 3]) * [[1.], [2.]])
two_multivariate_normals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2] event_shape=[3] dtype=float32>

Now let's evaluate the log probability of each batch point at its mean and at a shifted mean:

two_multivariate_normals.log_prob([[[1., 2., 3.]], [[3., 4., 5.]]])  # Input has shape [2,1,3].
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-2.7568154, -4.836257 ],
       [-8.756816 , -6.336257 ]], dtype=float32)>

Exactly equivalently, we can use https://www.tensorflow.org/api_docs/cc/class/tensorflow/ops/strided-slice to insert an extra shape=1 dimension in the middle of a constant:

two_multivariate_normals.log_prob(
    tf.constant([[1., 2., 3.], [3., 4., 5.]])[:, tf.newaxis, :])  # Equivalent to above.
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-2.7568154, -4.836257 ],
       [-8.756816 , -6.336257 ]], dtype=float32)>

On the other hand, if we don't insert the extra dimension, we pass [1., 2., 3.] to the first batch point and [3., 4., 5.] to the second:

two_multivariate_normals.log_prob(tf.constant([[1., 2., 3.], [3., 4., 5.]]))
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-2.7568154, -6.336257 ], dtype=float32)>

Shape Manipulation Techniques

The Reshape Bijector

The Reshape bijector can be used to reshape the event_shape of a distribution. Let's see an example:

six_way_multinomial = tfd.Multinomial(total_count=1000., probs=[.3, .25, .2, .15, .08, .02])
six_way_multinomial
<tfp.distributions.Multinomial 'Multinomial' batch_shape=[] event_shape=[6] dtype=float32>

We created a multinomial with an event shape of [6]. The Reshape Bijector allows us to treat this as a distribution with an event shape of [2, 3].

A Bijector represents a differentiable, one-to-one function on an open subset of \({\mathbb R}^n\). Bijectors are used in conjunction with TransformedDistribution, which models a distribution \(p(y)\) in terms of a base distribution \(p(x)\) and a Bijector that represents \(Y = g(X)\). Let's see it in action:

transformed_multinomial = tfd.TransformedDistribution(
    distribution=six_way_multinomial,
    bijector=tfb.Reshape(event_shape_out=[2, 3]))
transformed_multinomial
<tfp.distributions.TransformedDistribution 'reshapeMultinomial' batch_shape=[] event_shape=[2, 3] dtype=float32>
six_way_multinomial.log_prob([500., 100., 100., 150., 100., 50.])
<tf.Tensor: shape=(), dtype=float32, numpy=-178.21973>
transformed_multinomial.log_prob([[500., 100., 100.], [150., 100., 50.]])
<tf.Tensor: shape=(), dtype=float32, numpy=-178.21973>

This is the only thing the Reshape bijector can do: it cannot turn event dimensions into batch dimensions or vice-versa.

The Independent Distribution

The Independent distribution is used to treat a collection of independent, not-necessarily-identical (aka a batch of) distributions as a single distribution. More concisely, Independent allows to convert dimensions in batch_shape to dimensions in event_shape. We'll illustrate by example:

two_by_five_bernoulli = tfd.Bernoulli(
    probs=[[.05, .1, .15, .2, .25], [.3, .35, .4, .45, .5]],
    name="Two By Five Bernoulli")
two_by_five_bernoulli
<tfp.distributions.Bernoulli 'Two_By_Five_Bernoulli' batch_shape=[2, 5] event_shape=[] dtype=int32>

We can think of this as two-by-five array of coins with the associated probabilities of heads. Let's evaluate the probability of a particular, arbitrary set of ones-and-zeros:

pattern = [[1., 0., 0., 1., 0.], [0., 0., 1., 1., 1.]]
two_by_five_bernoulli.log_prob(pattern)
<tf.Tensor: shape=(2, 5), dtype=float32, numpy=
array([[-2.9957323 , -0.10536051, -0.16251892, -1.609438  , -0.2876821 ],
       [-0.35667497, -0.4307829 , -0.91629076, -0.79850775, -0.6931472 ]],
      dtype=float32)>

We can use Independent to turn this into two different "sets of five Bernoulli's", which is useful if we want to consider a "row" of coin flips coming up in a given pattern as a single outcome:

two_sets_of_five = tfd.Independent(
    distribution=two_by_five_bernoulli,
    reinterpreted_batch_ndims=1,
    name="Two Sets Of Five")
two_sets_of_five
<tfp.distributions.Independent 'Two_Sets_Of_Five' batch_shape=[2] event_shape=[5] dtype=int32>

Mathematically, we're computing the log probability of each "set" of five by summing the log probabilities of the five "independent" coin flips in the set, which is where the distribution gets its name:

two_sets_of_five.log_prob(pattern)
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-5.160732 , -3.1954036], dtype=float32)>

We can go even further and use Independent to create a distribution where individual events are a set of two-by-five Bernoulli's:

one_set_of_two_by_five = tfd.Independent(
    distribution=two_by_five_bernoulli, reinterpreted_batch_ndims=2,
    name="One Set Of Two By Five")
one_set_of_two_by_five.log_prob(pattern)
<tf.Tensor: shape=(), dtype=float32, numpy=-8.356134>

It's worth noting that from the perspective of sample, using Independent changes nothing:

describe_sample_tensor_shapes(
    [two_by_five_bernoulli,
     two_sets_of_five,
     one_set_of_two_by_five],
    [[3, 5]])
tfp.distributions.Bernoulli("Two_By_Five_Bernoulli", batch_shape=[2, 5], event_shape=[], dtype=int32)
Sample shape: [3, 5]
Returned sample tensor shape: (3, 5, 2, 5)

tfp.distributions.Independent("Two_Sets_Of_Five", batch_shape=[2], event_shape=[5], dtype=int32)
Sample shape: [3, 5]
Returned sample tensor shape: (3, 5, 2, 5)

tfp.distributions.Independent("One_Set_Of_Two_By_Five", batch_shape=[], event_shape=[2, 5], dtype=int32)
Sample shape: [3, 5]
Returned sample tensor shape: (3, 5, 2, 5)

As a parting exercise for the reader, we suggest considering the differences and similarities between a vector batch of Normal distributions and a MultivariateNormalDiag distribution from a sampling and log probability perspective. How can we use Independent to construct a MultivariateNormalDiag from a batch of Normals? (Note that MultivariateNormalDiag is not actually implemented this way.)