Creates the graph for k-means clustering.
tf.contrib.factorization.KMeans(
inputs, num_clusters, initial_clusters=RANDOM_INIT,
distance_metric=SQUARED_EUCLIDEAN_DISTANCE, use_mini_batch=False,
mini_batch_steps_per_iteration=1, random_seed=0, kmeans_plus_plus_num_retries=2,
kmc2_chain_length=200
)
Args | |
---|---|
inputs
|
An input tensor or list of input tensors. It is assumed that the data points have been previously randomly permuted. |
num_clusters
|
An integer tensor specifying the number of clusters. This argument is ignored if initial_clusters is a tensor or numpy array. |
initial_clusters
|
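Specifies the clusters used during initialization. One of the following: a tensor or numpy array with the initial cluster centers; a function f(inputs, k) that returns up to k centers from inputs; "random" (choose centers randomly from inputs); "kmeans_plus_plus" (use kmeans++ to choose centers from inputs); "kmc2" (use the fast k-MC2 algorithm to choose centers from inputs). |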
distance_metric
|
Distance metric used for clustering. Supported options: "squared_euclidean", "cosine". |
use_mini_batch
|
If true, use the mini-batch k-means algorithm. Otherwise, assume full batch. |
mini_batch_steps_per_iteration
|
Number of steps after which the updated cluster centers are synced back to a master copy. |
random_seed
|
Seed for the PRNG used to initialize the cluster centers. |
kmeans_plus_plus_num_retries
|
For each point that is sampled during kmeans++ initialization, this parameter specifies the number of additional points to draw from the current distribution before selecting the best. If a negative value is specified, a heuristic is used to sample O(log(num_to_sample)) additional points. |
kmc2_chain_length
|
Determines how many candidate points are used by the k-MC2 algorithm to produce one new cluster center. If a (mini-)batch contains fewer points, one new cluster center is generated from the (mini-)batch. |
Raises | |
---|---|
ValueError
|
An invalid argument was passed to initial_clusters or distance_metric. |
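The effect of kmeans_plus_plus_num_retries can be sketched in plain NumPy. This is an illustration only, not the TensorFlow implementation: the function name `kmeans_pp_with_retries` is hypothetical, "best" is assumed to mean the candidate that gives the lowest total squared distance, the negative-value heuristic is not covered, and the input is assumed to contain at least num_clusters distinct points.

```python
import numpy as np


def kmeans_pp_with_retries(points, num_clusters, num_retries=2, seed=0):
    """Greedy k-means++ seeding (hypothetical helper, illustration only)."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # First center: chosen uniformly at random.
    first = rng.integers(n)
    centers = [points[first]]
    # Squared distance from every point to its nearest chosen center.
    min_sq_dist = np.sum((points - points[first]) ** 2, axis=1)
    while len(centers) < num_clusters:
        # Draw one point plus num_retries additional points from the
        # current distribution (probability proportional to min_sq_dist).
        probs = min_sq_dist / min_sq_dist.sum()
        candidates = rng.choice(n, size=1 + num_retries, p=probs)
        best_cost, best_dist, best_idx = None, None, None
        for idx in candidates:
            cand_sq_dist = np.sum((points - points[idx]) ** 2, axis=1)
            new_min = np.minimum(min_sq_dist, cand_sq_dist)
            cost = new_min.sum()
            # Assumption: the "best" candidate minimizes the total cost.
            if best_cost is None or cost < best_cost:
                best_cost, best_dist, best_idx = cost, new_min, idx
        centers.append(points[best_idx])
        min_sq_dist = best_dist
    return np.stack(centers)
```

With num_retries=0 this reduces to ordinary k-means++ sampling; larger values trade extra distance computations for a better starting cost.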
Methods
training_graph
training_graph()
Generates a training graph for the k-means algorithm.
This returns, among other things, an op that chooses initial centers (init_op), a boolean variable that is set to True when the initial centers are chosen (cluster_centers_initialized), and an op to perform either an entire Lloyd iteration or a mini-batch of a Lloyd iteration (training_op). The caller should use these components as follows. A single worker should execute init_op multiple times until cluster_centers_initialized becomes True. Then multiple workers may execute training_op any number of times.
Returns | |
---|---|
A tuple consisting of: | |
all_scores
|
A matrix (or list of matrices) of dimensions (num_input, num_clusters) where each value is the distance between an input vector and a cluster center. |
cluster_idx
|
A vector (or list of vectors). Each element in the vector corresponds to an input row in inputs and specifies the id of the cluster assigned to that row. |
scores
|
Similar to cluster_idx but specifies the distance to the assigned cluster instead. |
cluster_centers_initialized
|
A boolean scalar indicating whether the cluster centers have been initialized. |
init_op
|
An op to initialize the clusters. |
training_op
|
An op that runs an iteration of training. |
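How all_scores, cluster_idx, and scores relate to one another can be illustrated with a single full-batch Lloyd iteration in plain NumPy. This is a minimal sketch, not the graph that training_graph() builds: the helper name `lloyd_step` is hypothetical, squared Euclidean distance is assumed, and leaving an empty cluster's center unchanged is a choice made for this example.

```python
import numpy as np


def lloyd_step(inputs, centers):
    """One full-batch Lloyd iteration (hypothetical helper, illustration only)."""
    # all_scores[i, j]: squared Euclidean distance between input i and center j.
    diff = inputs[:, None, :] - centers[None, :, :]
    all_scores = np.sum(diff ** 2, axis=2)
    # cluster_idx[i]: id of the closest center for input i.
    cluster_idx = np.argmin(all_scores, axis=1)
    # scores[i]: distance from input i to its assigned center.
    scores = all_scores[np.arange(inputs.shape[0]), cluster_idx]
    # Move each center to the mean of its members; empty clusters stay put.
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        members = inputs[cluster_idx == j]
        if len(members) > 0:
            new_centers[j] = members.mean(axis=0)
    return all_scores, cluster_idx, scores, new_centers


inputs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers = inputs[[0, 2]].copy()  # stand-in for already-initialized centers
all_scores, cluster_idx, scores, centers = lloyd_step(inputs, centers)
```

Here all_scores has shape (4, 2), while cluster_idx and scores each have one entry per input row, matching the shapes described in the table above.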