web_graph
This dataset contains a sparse graph representing web link structure for a small
subset of the Web.
It is a processed version of a single crawl performed by CommonCrawl in 2021, from
which we strip everything except the link -> outlinks structure. The final
dataset is in int -> List[int] format, with each integer id representing a
URL.
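As an illustration of the int -> List[int] format, the following is a minimal sketch of turning (source URL, destination URL) pairs into integer ids and an adjacency mapping. The function name `build_adjacency`, the example URLs, and the order in which ids are assigned are assumptions made for this example; the source does not describe how ids were actually assigned.

```python
from typing import Dict, Iterable, List, Tuple


def build_adjacency(
    links: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Map each URL to an integer id and group outlink ids by source id."""
    url_to_id: Dict[str, int] = {}
    adjacency: Dict[int, List[int]] = {}

    def get_id(url: str) -> int:
        # Assumption: ids are handed out in order of first appearance.
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id)
        return url_to_id[url]

    for source, dest in links:
        src_id = get_id(source)
        dst_id = get_id(dest)
        adjacency.setdefault(src_id, []).append(dst_id)
    return url_to_id, adjacency


# Hypothetical links, for illustration only.
links = [
    ("https://a.example/", "https://b.example/"),
    ("https://a.example/", "https://c.example/"),
    ("https://b.example/", "https://a.example/"),
]
url_to_id, adjacency = build_adjacency(links)
print(adjacency)  # {0: [1, 2], 1: [0]}
```

Each key of `adjacency` plays the role of a source id and each value the list of its outlink ids.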
To increase the value of this resource, we created 6 different
versions of WebGraph, each varying in sparsity pattern and locale. We took
the following processing steps, in order:
- We started with WAT files from the June 2021 crawl.
- Since the outlinks in HTTP-Response-Metadata are stored as relative paths,
we convert them to absolute paths using urllib after validating each link.
- To study locale-specific graphs, we further filter based on 2 top-level
domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude
fewer nodes.
- These graphs can still have arbitrary sparsity patterns and dangling links.
Thus we further filter the nodes in each graph to have a minimum of K ∈ {10,
50} inlinks and outlinks. Note that we apply this filter only once, so the
result is an approximation: the resulting graph might still have nodes
with fewer than K links.
- Using both locale and count filters, we finalize 6 versions of the WebGraph
dataset, summarized in the following table.
| Version   | Top level domain | Min count | Num nodes | Num edges |
|-----------|------------------|-----------|-----------|-----------|
| sparse    |                  | 10        | 365.4M    | 30B       |
| dense     |                  | 50        | 136.5M    | 22B       |
| de-sparse | de               | 10        | 19.7M     | 1.19B     |
| de-dense  | de               | 50        | 5.7M      | 0.82B     |
| in-sparse | in               | 10        | 1.5M      | 0.14B     |
| in-dense  | in               | 50        | 0.5M      | 0.12B     |
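The processing steps listed above can be sketched roughly as follows. This is an illustrative approximation, not the authors' pipeline: the helper names (`absolutize`, `has_tld`, `filter_min_count`), the validation rule (keep only http/https links with a hostname), and the toy graph are all assumptions. Only the use of urllib, the top-level-domain filter, and the single-pass minimum-count filter come from the description.

```python
from collections import Counter
from typing import Dict, List, Optional
from urllib.parse import urljoin, urlparse


def absolutize(page_url: str, link: str) -> Optional[str]:
    """Resolve a possibly relative link against the page URL.

    Assumed validation rule: keep only http(s) links that have a hostname.
    """
    absolute = urljoin(page_url, link)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None
    return absolute


def has_tld(url: str, tld: str) -> bool:
    """Return True if the URL's hostname ends with the given top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith("." + tld)


def filter_min_count(graph: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """Single pass: keep nodes with >= k inlinks and >= k outlinks in the input graph."""
    inlinks = Counter(dst for dsts in graph.values() for dst in dsts)
    keep = {n for n, dsts in graph.items() if len(dsts) >= k and inlinks[n] >= k}
    # Edges to dropped nodes are removed, which can push kept nodes below k.
    return {n: [d for d in graph[n] if d in keep] for n in sorted(keep)}


print(absolutize("https://example.de/a/b.html", "../c.html"))  # https://example.de/c.html

# Toy graph (hypothetical): with k=2, nodes "c" and "y" have too few inlinks.
toy = {"a": ["b", "y"], "b": ["a", "c"], "c": ["a", "b"], "y": ["a", "b"]}
filtered = filter_min_count(toy, k=2)
print(filtered)  # {'a': ['b'], 'b': ['a']}
```

In the toy graph, both surviving nodes end up with a single outlink even though K is 2, which is the approximation the note about filtering only once refers to. Repeating the filter until nothing changes would remove such nodes, but the description says the filter is applied once.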
All versions of the dataset have the following features:
- "row_tag": a unique identifier of the row (source link).
- "col_tag": a list of unique identifiers of non-zero columns (dest outlinks).
- "gt_tag": a list of unique identifiers of non-zero columns used as ground
truth (dest outlinks); empty for the train/train_t splits.
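Because each example is one row of a sparse matrix, a handful of examples can be packed into compressed-row arrays. The sketch below uses plain Python dictionaries with the three feature names above; the function name `to_csr` and the sample values are assumptions made for illustration, not part of the dataset API.

```python
from typing import Dict, Iterable, List, Tuple

Example = Dict[str, object]


def to_csr(examples: Iterable[Example]) -> Tuple[List[int], List[int], List[int]]:
    """Pack (row_tag, col_tag) examples into row ids, row pointers, and column ids."""
    row_ids: List[int] = []
    indptr: List[int] = [0]
    indices: List[int] = []
    for example in examples:
        row_ids.append(example["row_tag"])
        indices.extend(example["col_tag"])
        # indptr[i]:indptr[i + 1] is the slice of `indices` belonging to row i.
        indptr.append(len(indices))
    return row_ids, indptr, indices


# Hypothetical train-style examples: gt_tag is empty, as stated for train/train_t.
examples = [
    {"row_tag": 7, "col_tag": [2, 5], "gt_tag": []},
    {"row_tag": 9, "col_tag": [5], "gt_tag": []},
]
row_ids, indptr, indices = to_csr(examples)
print(row_ids, indptr, indices)  # [7, 9] [0, 2, 3] [2, 5, 5]
```

Every stored entry is implicitly a 1 (a link exists), so no separate value array is needed.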
Homepage: https://arxiv.org/abs/2112.02194
Source code: tfds.structured.web_graph.WebGraph
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Auto-cached (documentation): No
Feature structure:
FeaturesDict({
'col_tag': Sequence(int64),
'gt_tag': Sequence(int64),
'row_tag': int64,
})
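A minimal sketch of reading one of the configs with `tfds.load` is shown below. The helper names `config_name` and `load_split` are assumptions for this example; the config names follow the `web_graph/<version>` pattern used on this page. TensorFlow Datasets is imported lazily so that the name helper runs without it, and preparing the larger configs may need substantial disk space and time.

```python
from typing import Optional


def config_name(tld: Optional[str], min_count: int) -> str:
    """Build a config name such as 'web_graph/de-dense' from locale and min count."""
    densities = {10: "sparse", 50: "dense"}
    if min_count not in densities:
        raise ValueError("min_count must be 10 or 50")
    density = densities[min_count]
    return f"web_graph/{density}" if tld is None else f"web_graph/{tld}-{density}"


def load_split(tld: Optional[str], min_count: int, split: str = "train"):
    """Load one split as a tf.data.Dataset (requires tensorflow_datasets)."""
    import tensorflow_datasets as tfds  # imported lazily; not needed for config_name

    return tfds.load(config_name(tld, min_count), split=split)


print(config_name("in", 50))  # web_graph/in-dense

# Example usage (not executed here, since it prepares the dataset):
# ds = load_split("in", 50, split="test")
# for example in ds.take(1):
#     print(example["row_tag"], example["col_tag"], example["gt_tag"])
```

Starting with the smallest config, in-dense, is a reasonable way to check the feature structure before moving to the larger graphs.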
| Feature | Class            | Shape   | Dtype | Description |
|---------|------------------|---------|-------|-------------|
|         | FeaturesDict     |         |       |             |
| col_tag | Sequence(Tensor) | (None,) | int64 |             |
| gt_tag  | Sequence(Tensor) | (None,) | int64 |             |
| row_tag | Tensor           |         | int64 |             |
Citation:
@article{mehta2021alx,
title={ALX: Large Scale Matrix Factorization on TPUs},
author={Harsh Mehta and Steffen Rendle and Walid Krichene and Li Zhang},
year={2021},
eprint={2112.02194},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
web_graph/sparse (default config)
Dataset size: 273.38 GiB

| Split     | Examples    |
|-----------|-------------|
| 'test'    | 39,871,321  |
| 'train'   | 372,049,054 |
| 'train_t' | 410,867,007 |
web_graph/dense
Dataset size: 170.87 GiB

| Split     | Examples    |
|-----------|-------------|
| 'test'    | 13,256,496  |
| 'train'   | 122,815,749 |
| 'train_t' | 136,019,364 |
web_graph/de-sparse
Dataset size: 10.25 GiB

| Split     | Examples   |
|-----------|------------|
| 'test'    | 1,903,443  |
| 'train'   | 17,688,633 |
| 'train_t' | 19,566,045 |
web_graph/de-dense
Dataset size: 5.90 GiB

| Split     | Examples  |
|-----------|-----------|
| 'test'    | 553,270   |
| 'train'   | 5,118,902 |
| 'train_t' | 5,672,473 |
web_graph/in-sparse
Dataset size: 960.57 MiB

| Split     | Examples  |
|-----------|-----------|
| 'test'    | 140,313   |
| 'train'   | 1,309,063 |
| 'train_t' | 1,445,042 |
web_graph/in-dense
Dataset size: 711.72 MiB

| Split     | Examples |
|-----------|----------|
| 'test'    | 47,894   |
| 'train'   | 443,786  |
| 'train_t' | 491,634  |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-11-23 UTC.