tf.contrib.factorization.KMeans

View source on GitHub

Creates the graph for k-means clustering.

inputs An input tensor or list of input tensors. It is assumed that the data points have been previously randomly permuted.
num_clusters An integer tensor specifying the number of clusters. This argument is ignored if initial_clusters is a tensor or numpy array.
initial_clusters Specifies the clusters used during initialization. One of the following:

  • a tensor or numpy array with the initial cluster centers.
  • a function f(inputs, k) that returns up to k centers from inputs.
  • "random": Choose centers randomly from inputs.
  • "kmeans_plus_plus": Use kmeans++ to choose centers from inputs.
  • "kmc2": Use the fast k-MC2 algorithm to choose centers from inputs. In the last three cases, one batch of inputs may not yield num_clusters centers, in which case initialization will require multiple batches until enough centers are chosen. In the case of "random" or "kmeans_plus_plus", if the input size is <= num_clusters then the entire batch is chosen to be cluster centers.
distance_metric Distance metric used for clustering. Supported options: "squared_euclidean", "cosine".
use_mini_batch If true, use the mini-batch k-means algorithm. Else assume full batch.
mini_batch_steps_per_iteration Number of steps after which the updated cluster centers are synced back to a master copy.
random_seed Seed for PRNG used to initialize seeds.
kmeans_plus_plus_num_retries For each point that is sampled during kmeans++ initialization, this parameter specifies the number of additional points to draw from the current distribution before selecting the best. If a negative value is specified, a heuristic is used to sample O(log(num_to_sample)) additional points.
kmc2_chain_length Determines how many candidate points are used by the k-MC2 algorithm to produce one new cluster centers. If a (mini-)batch contains less points, one new cluster center is generated from the (mini-)batch.

ValueError An invalid argument was passed to initial_clusters or distance_metric.

Methods

training_graph

View source

Generate a training graph for kmeans algorithm.

This returns, among other things, an op that chooses initial centers (init_op), a boolean variable that is set to True when the initial centers are chosen (cluster_centers_initialized), and an op to perform either an entire Lloyd iteration or a mini-batch of a Lloyd iteration (training_op). The caller should use these components as follows. A single worker should execute init_op multiple times until cluster_centers_initialized becomes True. Then multiple workers may execute training_op any number of times.

Returns
A tuple consisting of:
all_scores A matrix (or list of matrices) of dimensions (num_input, num_clusters) where the value is the distance of an input vector and a cluster center.
cluster_idx A vector (or list of vectors). Each element in the vector corresponds to an input row in 'inp' and specifies the cluster id corresponding to the input.
scores Similar to cluster_idx but specifies the distance to the assigned cluster instead.
cluster_centers_initialized scalar indicating whether clusters have been initialized.
init_op an op to initialize the clusters.
training_op an op that runs an iteration of training.