Builds a graph based on dense embeddings and persists it in TSV format.

Used in the notebooks

Used in the tutorials

This function reads input instances from one or more TFRecord files, each containing tf.train.Example protos. Each input example is expected to contain at least the following 2 features:

  • id: A singleton bytes_list feature that identifies each example.
  • embedding: A float_list feature that contains the (dense) embedding of each example.

id and embedding are not necessarily the literal feature names; if your features have different names, you can specify them using the graph_builder_config fields named id_feature_name and embedding_feature_name, respectively.

This function then uses the node embeddings to compute the edges of a graph. Graph construction is configured by the graph_builder_config instance. The general algorithm is to consider pairs of input examples (each with an ID and an associated dense embedding, as described above), and to generate an edge in the graph between those two examples if the cosine similarity between the two embeddings is at least graph_builder_config.similarity_threshold.

Of course, the number of pairs of points is quadratic in the number of inputs, so comparing all pairs of points will take too long for large inputs. To address that problem, this implementation can be configured to use locality sensitive hashing (LSH) for better performance. The idea behind LSH is to randomly hash each input into one or more LSH "buckets" such that points in the same bucket are likely to be considered similar. In this way, we need to compare just the pairs of points within each bucket for similarity, which can lead to dramatically faster running times.

The lsh_splits configuration attribute is used to control the maximum number of LSH buckets. In particular, if lsh_splits has the value n, then there can be at most 2^n LSH buckets. Using a larger value for lsh_splits will (generally) result in a larger number of buckets, and therefore, smaller number of instances in each bucket that need to be compared to each other. As a result, increasing lsh_splits can lead to dramatically faster running times.

The disadvantage to using too many LSH buckets, however, is that we won't create a graph edge between two instances that are highly similar if they happen to be randomly hashed into two different LSH buckets. To address that problem, the lsh_rounds parameter can be used to perform multiple rounds of the LSH bucketing process. Even if two similar instances may get hashed to different LSH buckets during the first round, they may get hashed into the same LSH bucket on a subsequent round. An edge is created in the output graph if two intances are hashed into the same bucket and deemed to be similar enough on any of the LSH rounds (i.e., the resulting graph is the union of the graph edges generated on each LSH round).

To illustrate these concepts and how various lsh_splits and lsh_rounds values correlate with graph building running times, we performed multiple runs of the graph builder on a dataset containing 50,000 instances, each with a 100-dimensional embedding. When lsh_splits = 0, the program has to compare each instance against every other instance, for a total of roughly 2.5B comparisons, which takes nearly half an hour to complete and generates a total of 35,313 graph edges (when similarity_threshold = 0.9). As lsh_splits is increased, we lose recall (i.e., fewer than 35,313 edges are generated), but the recall can then be improved by increasing lsh_rounds. This table shows the minimum lsh_rounds value required to achieve a recall of >= 99.7% (except for the lsh_splits = 1 case), as well as the elapsed running time:

lsh_splits  lsh_rounds    Recall    Running time
    0          N/A        100.0%      27m 46s
    1           2          99.4%      24m 33s
    2           3          99.8%      15m 35s
    3           4          99.7%       9m 37.9s
    4           6          99.9%       7m 07.5s
    5           8          99.9%       4m 59.2s
    6           9          99.7%       3m 01.2s
    7          11          99.8%       2m 02.3s
    8          13          99.8%       1m 20.8s
    9          16          99.7%          58.5s
   10          18          99.7%          43.6s

As the table illustrates, by increasing both lsh_splits and lsh_rounds, we can dramatically decrease the running time of the graph builder without sacrificing edge recall. We have found that a good rule of thumb is to set lsh_splits >= ceiling(log_2(num_instances / 1000)), so the expected LSH bucket size will be at most 1000. However, if your instances are clustered or you want an even faster run, you may want to use a larger lsh_splits value. Note, however, that when the similarity threshold is lower, recall rates are reduced more quickly the larger the value of lsh_splits is, so be careful not to set that parameter too high for smaller similarity_threshold values.

The generated graph edges are written to the TSV file named by output_graph_path. Each output edge is represented by a TSV line with the following form:


All edges in the output will be symmetric (i.e., if edge A--w-->B exists in the output, then so will edge B--w-->A).

embedding_files A list of names of TFRecord files containing tf.train.Example objects, which in turn contain dense embeddings.
output_graph_path Name of the file to which the output graph in TSV format should be written.
graph_builder_config A nsl.configs.GraphBuilderConfig specifying the graph building parameters.

ValueError If lsh_splits < 0 or if lsh_splits > 0 and lsh_rounds < 1.