protein_net
ProteinNet is a standardized data set for machine learning of protein structure.
It provides protein sequences, structures (secondary and tertiary), multiple
sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and
standardized training / validation / test splits. ProteinNet builds on the
biennial CASP assessments, which carry out blind predictions of recently solved
but publicly unavailable protein structures, to provide test sets that push the
frontiers of computational methodology. It is organized as a series of data
sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a
range of data set sizes that enable assessment of new methods in relatively
data-poor and data-rich regimes.
Homepage: https://github.com/aqlaboratory/proteinnet
Source code: tfds.datasets.protein_net.Builder
Versions: 1.0.0 (default): Initial release.
Auto-cached: No
Supervised keys: ('primary', 'tertiary')
Feature structure:

    FeaturesDict({
        'evolutionary': Tensor(shape=(None, 21), dtype=float32),
        'id': Text(shape=(), dtype=string),
        'length': int32,
        'mask': Tensor(shape=(None,), dtype=bool),
        'primary': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=20)),
        'tertiary': Tensor(shape=(None, 3), dtype=float32),
    })
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| evolutionary | Tensor | (None, 21) | float32 | |
| id | Text | | string | |
| length | Tensor | | int32 | |
| mask | Tensor | (None,) | bool | |
| primary | Sequence(ClassLabel) | (None,) | int64 | |
| tertiary | Tensor | (None, 3) | float32 | |
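The feature table can be turned into a quick sanity check on decoded examples. The sketch below is illustrative only: `dataset_name`, `load_split`, and `spec_violations` are hypothetical helper names, not part of TFDS, and the check covers only what the table states (21 columns for `evolutionary`, 3 columns for `tertiary`, 20 classes for `primary`). It makes no assumption about how the variable-length dimensions relate to `length`.

```python
from typing import Any, List, Mapping

CASP_VERSIONS = range(7, 13)  # configs casp7 ... casp12 listed on this page


def dataset_name(casp_version: int) -> str:
    """Return the TFDS name for one config, e.g. 'protein_net/casp7'."""
    if casp_version not in CASP_VERSIONS:
        raise ValueError(f"expected a CASP version in 7..12, got {casp_version}")
    return f"protein_net/casp{casp_version}"


def load_split(casp_version: int = 7, split: str = "validation"):
    """Load one split as a tf.data.Dataset (downloads several GiB on first use)."""
    import tensorflow_datasets as tfds  # imported lazily: heavy optional dependency

    return tfds.load(dataset_name(casp_version), split=split)


def spec_violations(example: Mapping[str, Any]) -> List[str]:
    """List mismatches between one decoded example and the feature table."""
    problems = []
    # 'evolutionary' is declared as (None, 21): every row needs 21 columns.
    if any(len(row) != 21 for row in example["evolutionary"]):
        problems.append("evolutionary rows must have 21 columns")
    # 'tertiary' is declared as (None, 3): every row needs 3 columns.
    if any(len(row) != 3 for row in example["tertiary"]):
        problems.append("tertiary rows must have 3 columns")
    # 'primary' is a ClassLabel sequence with num_classes=20.
    if any(not 0 <= int(label) < 20 for label in example["primary"]):
        problems.append("primary labels must lie in [0, 20)")
    return problems
```

With TFDS installed, one possible use is to iterate over `tfds.as_numpy(load_split(7, "validation").take(1))` and call `spec_violations` on each example; an empty list means the example agrees with the table.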
Citation:

    @article{ProteinNet19,
      title = {{ProteinNet}: a standardized data set for machine learning of protein structure},
      author = {AlQuraishi, Mohammed},
      journal = {BMC Bioinformatics},
      volume = {20},
      number = {1},
      pages = {1--10},
      year = {2019},
      publisher = {BioMed Central}
    }
protein_net/casp7 (default config)
Download size: 3.18 GiB
Dataset size: 2.53 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 93 |
| 'train_100' | 34,557 |
| 'train_30' | 10,333 |
| 'train_50' | 13,024 |
| 'train_70' | 15,207 |
| 'train_90' | 17,611 |
| 'train_95' | 17,938 |
| 'validation' | 224 |
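To see how much each training split shrinks relative to the full set, the counts above can be compared directly. The sketch below uses the casp7 numbers from the table. It assumes the suffix in the `train_*` names is the sequence-identity thinning threshold, as described in the upstream ProteinNet repository; this page itself does not say so. `retained_fraction` is a hypothetical helper name.

```python
# Training split sizes for protein_net/casp7, copied from the table above.
CASP7_TRAIN = {
    "train_30": 10_333,
    "train_50": 13_024,
    "train_70": 15_207,
    "train_90": 17_611,
    "train_95": 17_938,
    "train_100": 34_557,
}


def retained_fraction(splits):
    """Return each split's size as a fraction of the unthinned train_100 split."""
    full = splits["train_100"]
    return {name: count / full for name, count in splits.items()}


fractions = retained_fraction(CASP7_TRAIN)
# Print the splits from smallest to largest.
for name, frac in sorted(fractions.items(), key=lambda kv: kv[1]):
    print(f"{name:>9}: {frac:.1%}")
```

The same function works for the other configs if the dictionary is replaced with their split counts.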
protein_net/casp8
Download size: 4.96 GiB
Dataset size: 3.55 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 120 |
| 'train_100' | 48,087 |
| 'train_30' | 13,881 |
| 'train_50' | 17,970 |
| 'train_70' | 21,191 |
| 'train_90' | 24,556 |
| 'train_95' | 25,035 |
| 'validation' | 224 |
protein_net/casp9
Download size: 6.65 GiB
Dataset size: 4.54 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 116 |
| 'train_100' | 60,350 |
| 'train_30' | 16,973 |
| 'train_50' | 22,172 |
| 'train_70' | 26,263 |
| 'train_90' | 30,513 |
| 'train_95' | 31,128 |
| 'validation' | 224 |
protein_net/casp10
Download size: 8.65 GiB
Dataset size: 5.57 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 95 |
| 'train_100' | 73,116 |
| 'train_30' | 19,495 |
| 'train_50' | 25,897 |
| 'train_70' | 31,001 |
| 'train_90' | 36,258 |
| 'train_95' | 37,033 |
| 'validation' | 224 |
protein_net/casp11
Download size: 10.81 GiB
Dataset size: 6.72 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 81 |
| 'train_100' | 87,573 |
| 'train_30' | 22,344 |
| 'train_50' | 29,936 |
| 'train_70' | 36,005 |
| 'train_90' | 42,507 |
| 'train_95' | 43,544 |
| 'validation' | 224 |
protein_net/casp12
Download size: 13.18 GiB
Dataset size: 8.05 GiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 40 |
| 'train_100' | 104,059 |
| 'train_30' | 25,299 |
| 'train_50' | 34,039 |
| 'train_70' | 41,522 |
| 'train_90' | 49,600 |
| 'train_95' | 50,914 |
| 'validation' | 224 |
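Because the six configs differ mainly in size, picking one often comes down to a download budget. The following sketch collects the download sizes and `train_100` counts listed above and selects the largest config that fits a given budget. `largest_config_within` is a hypothetical helper name, and the budget values are only examples.

```python
from typing import Optional

# Download sizes (GiB) and train_100 example counts, copied from the config listings.
DOWNLOAD_GIB = {
    "casp7": 3.18,
    "casp8": 4.96,
    "casp9": 6.65,
    "casp10": 8.65,
    "casp11": 10.81,
    "casp12": 13.18,
}
TRAIN_100 = {
    "casp7": 34_557,
    "casp8": 48_087,
    "casp9": 60_350,
    "casp10": 73_116,
    "casp11": 87_573,
    "casp12": 104_059,
}


def largest_config_within(budget_gib: float) -> Optional[str]:
    """Return the largest config whose download fits the budget, or None."""
    fitting = [name for name, size in DOWNLOAD_GIB.items() if size <= budget_gib]
    return max(fitting, key=DOWNLOAD_GIB.get) if fitting else None


# Ratio of the largest to the smallest unthinned training set.
growth = TRAIN_100["casp12"] / TRAIN_100["casp7"]
```

For instance, a 7 GiB budget selects `casp9`, while a budget below 3.18 GiB fits none of the configs.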
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License.
Last updated 2022-12-16 UTC.