web_graph

  • Description:

This dataset contains a sparse graph representing web link structure for a small subset of the Web.

Its a processed version of a single crawl performed by CommonCrawl in 2021 where we strip everything and keep only the link->outlinks structure. The final dataset is basically int -> List[int] format with each integer id representing a url.

Also, in order to increase the value of this resource, we created 6 different version of WebGraph, each varying in the sparsity pattern and locale. We took the following processing steps, in order:

  • We started with WAT files from June 2021 crawl.
  • Since the outlinks in HTTP-Response-Metadata are stored as relative paths, we convert them to absolute paths using urllib after validating each link.
  • To study locale-specific graphs, we further filter based on 2 top level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude less number of nodes.
  • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have minimum of K ∈ [10, 50] inlinks and outlinks. Note that we only do this processing once, thus this is still an approximation i.e. the resulting graph might have nodes with less than K links.
  • Using both locale and count filters, we finalize 6 versions of WebGraph dataset, summarized in the folling table.
Version Top level domain Min count Num nodes Num edges
sparse 10 365.4M 30B
dense 50 136.5M 22B
de-sparse de 10 19.7M 1.19B
de-dense de 50 5.7M 0.82B
in-sparse in 10 1.5M 0.14B
in-dense in 50 0.5M 0.12B

All versions of the dataset have following features:

  • "row_tag": a unique identifier of the row (source link).
  • "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
  • "gt_tag": a list of unique identifiers of non-zero columns used as ground truth (dest outlinks), empty for train/train_t splits.

  • Homepage: https://arxiv.org/abs/2112.02194

  • Source code: tfds.structured.web_graph.WebGraph

  • Versions:

    • 1.0.0 (default): Initial release.
  • Download size: Unknown size

  • Auto-cached (documentation): No

  • Feature structure:

FeaturesDict({
    'col_tag': Sequence(int64),
    'gt_tag': Sequence(int64),
    'row_tag': int64,
})
  • Feature documentation:
Feature Class Shape Dtype Description
FeaturesDict
col_tag Sequence(Tensor) (None,) int64
gt_tag Sequence(Tensor) (None,) int64
row_tag Tensor int64
@article{mehta2021alx,
    title={ALX: Large Scale Matrix Factorization on TPUs},
    author={Harsh Mehta and Steffen Rendle and Walid Krichene and Li Zhang},
    year={2021},
    eprint={2112.02194},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

web_graph/sparse (default config)

  • Config description: WebGraph-sparse contains around 30B edges and around 365M nodes.

  • Dataset size: 273.38 GiB

  • Splits:

Split Examples
'test' 39,871,321
'train' 372,049,054
'train_t' 410,867,007

web_graph/dense

  • Config description: WebGraph-dense contains around 22B edges and around 136.5M nodes.

  • Dataset size: 170.87 GiB

  • Splits:

Split Examples
'test' 13,256,496
'train' 122,815,749
'train_t' 136,019,364

web_graph/de-sparse

  • Config description: WebGraph-de-sparse contains around 1.19B edges and around 19.7M nodes.

  • Dataset size: 10.25 GiB

  • Splits:

Split Examples
'test' 1,903,443
'train' 17,688,633
'train_t' 19,566,045

web_graph/de-dense

  • Config description: WebGraph-de-dense contains around 0.82B edges and around 5.7M nodes.

  • Dataset size: 5.90 GiB

  • Splits:

Split Examples
'test' 553,270
'train' 5,118,902
'train_t' 5,672,473

web_graph/in-sparse

  • Config description: WebGraph-de-sparse contains around 0.14B edges and around 1.5M nodes.

  • Dataset size: 960.57 MiB

  • Splits:

Split Examples
'test' 140,313
'train' 1,309,063
'train_t' 1,445,042

web_graph/in-dense

  • Config description: WebGraph-de-dense contains around 0.12B edges and around 0.5M nodes.

  • Dataset size: 711.72 MiB

  • Splits:

Split Examples
'test' 47,894
'train' 443,786
'train_t' 491,634