Creates the graph for k-means clustering.
tf.contrib.factorization.KMeans(
inputs, num_clusters, initial_clusters=RANDOM_INIT,
distance_metric=SQUARED_EUCLIDEAN_DISTANCE, use_mini_batch=False,
mini_batch_steps_per_iteration=1, random_seed=0, kmeans_plus_plus_num_retries=2,
kmc2_chain_length=200
)
Args |
inputs
|
An input tensor or list of input tensors. It is assumed that the
data points have been previously randomly permuted.
|
num_clusters
|
An integer tensor specifying the number of clusters. This
argument is ignored if initial_clusters is a tensor or numpy array.
|
initial_clusters
|
Specifies the clusters used during initialization. One
of the following:
- a tensor or numpy array with the initial cluster centers.
- a function f(inputs, k) that returns up to k centers from
inputs .
- "random": Choose centers randomly from
inputs .
- "kmeans_plus_plus": Use kmeans++ to choose centers from
inputs .
- "kmc2": Use the fast k-MC2 algorithm to choose centers from
inputs .
In the last three cases, one batch of inputs may not yield
num_clusters centers, in which case initialization will require
multiple batches until enough centers are chosen. In the case of
"random" or "kmeans_plus_plus", if the input size is <= num_clusters
then the entire batch is chosen to be cluster centers.
|
distance_metric
|
Distance metric used for clustering. Supported options:
"squared_euclidean", "cosine".
|
use_mini_batch
|
If true, use the mini-batch k-means algorithm. Else assume
full batch.
|
mini_batch_steps_per_iteration
|
Number of steps after which the updated
cluster centers are synced back to a master copy.
|
random_seed
|
Seed for PRNG used to initialize seeds.
|
kmeans_plus_plus_num_retries
|
For each point that is sampled during
kmeans++ initialization, this parameter specifies the number of
additional points to draw from the current distribution before selecting
the best. If a negative value is specified, a heuristic is used to
sample O(log(num_to_sample)) additional points.
|
kmc2_chain_length
|
Determines how many candidate points are used by the
k-MC2 algorithm to produce one new cluster centers. If a (mini-)batch
contains less points, one new cluster center is generated from the
(mini-)batch.
|
Raises |
ValueError
|
An invalid argument was passed to initial_clusters or
distance_metric.
|
Methods
training_graph
View source
training_graph()
Generate a training graph for kmeans algorithm.
This returns, among other things, an op that chooses initial centers
(init_op), a boolean variable that is set to True when the initial centers
are chosen (cluster_centers_initialized), and an op to perform either an
entire Lloyd iteration or a mini-batch of a Lloyd iteration (training_op).
The caller should use these components as follows. A single worker should
execute init_op multiple times until cluster_centers_initialized becomes
True. Then multiple workers may execute training_op any number of times.
Returns |
A tuple consisting of:
|
all_scores
|
A matrix (or list of matrices) of dimensions (num_input,
num_clusters) where the value is the distance of an input vector and a
cluster center.
|
cluster_idx
|
A vector (or list of vectors). Each element in the vector
corresponds to an input row in 'inp' and specifies the cluster id
corresponding to the input.
|
scores
|
Similar to cluster_idx but specifies the distance to the
assigned cluster instead.
|
cluster_centers_initialized
|
scalar indicating whether clusters have been
initialized.
|
init_op
|
an op to initialize the clusters.
|
training_op
|
an op that runs an iteration of training.
|