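Overview
This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: it reads several genomics file formats and provides common operations for preparing the data (for example, one-hot encoding or parsing Phred quality scores into probabilities).
This package uses the Google Nucleus library to provide some of its core functionality.
Setup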
try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install -q tensorflow-io
import tensorflow_io as tfio
import tensorflow as tf
FASTQ Data
FASTQ is a common genomics file format that stores sequence information together with per-base quality information.
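To make the format concrete, the following is a minimal standard-library sketch of the common four-line FASTQ record layout (an `@` header line, the bases, a `+` separator line, and one quality character per base). It is only an illustration, not how `tfio.genome.read_fastq` is implemented. The `parse_fastq` helper, the `FastqRecord` type, and the read names `read_1` and `read_2` are hypothetical; the bases and quality strings are taken from the sample output shown later in this tutorial.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ read: identifier, bases, and per-base quality characters."""
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text that uses the simple four-line layout."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical read names; bases and quality strings come from the tutorial output.
SAMPLE = """\
@read_1
GATTACA
+
BB>B@FA
@read_2
CGG
+
FAD
"""

records = list(parse_fastq(SAMPLE))
for record in records:
    print(record.name, record.sequence, record.quality)
```

Each record keeps the bases and the quality string side by side, which is the same pairing that the reader op exposes as two parallel string tensors.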
First, download a sample fastq file.
# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 407 100 407 0 0 1229 0 --:--:-- --:--:-- --:--:-- 1229
Read FASTQ Data
Now, use tfio.genome.read_fastq to read this file (note that a tf.data API is coming soon).
fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)
tf.Tensor(
[b'GATTACA'
b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'
b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)
tf.Tensor(
[b'BB>B@FA'
b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'
b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' b'FAD'], shape=(4,), dtype=string)
As you can see, the returned fastq_data has fastq_data.sequences, a string tensor of all sequences in the FASTQ file (each of which can have a different length), along with fastq_data.raw_quality, which holds the Phred-encoded quality of each base read in the sequence.
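The relationship between the two fields can be sketched in plain Python: entry i of the quality strings lines up character by character with entry i of the sequences. The values below are the first and last reads copied from the output above; the `pair_bases_with_quality` helper is a hypothetical name used only for illustration.

```python
from typing import List, Tuple

# First and last reads copied from the printed tensors above.
sequences = [b"GATTACA", b"CGG"]
raw_quality = [b"BB>B@FA", b"FAD"]


def pair_bases_with_quality(seq: bytes, qual: bytes) -> List[Tuple[str, str]]:
    """Pair each base with the quality character recorded for it."""
    if len(seq) != len(qual):
        raise ValueError("a read needs exactly one quality character per base")
    return list(zip(seq.decode("ascii"), qual.decode("ascii")))


paired = [pair_bases_with_quality(s, q) for s, q in zip(sequences, raw_quality)]
for read in paired:
    print(read)
```

Because reads differ in length (7 and 3 bases here), any per-base result derived from them is naturally ragged rather than rectangular.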
Quality
If you are interested, you can use a helper op to convert this quality information into probabilities.
quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
(4, None, 1)
[ 7 68 46 3]
<tf.RaggedTensor [[[0.0005011872854083776], [0.0005011872854083776], [0.0012589250691235065], [0.0005011872854083776], [0.0007943279924802482], [0.00019952619913965464], [0.0006309573072940111]], [[0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.00019952619913965464], [0.0007943279924802482], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.0003162277862429619], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952619913965464], [0.00019952619913965464], [0.0001584893325343728], [0.00019952619913965464], [0.0001584893325343728], [0.00012589251855388284], [0.0005011872854083776], [0.00019952619913965464], [0.0005011872854083776], [0.00019952619913965464], [0.0005011872854083776], [0.00019952619913965464], [0.0006309573072940111], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.00012589251855388284], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00019952619913965464], [0.00012589251855388284], [0.00012589251855388284], [0.00039810704765841365], [0.0002511885832063854], [0.00019952619913965464], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952619913965464], [0.0001584893325343728], [0.00019952619913965464], [0.00019952619913965464], [0.00012589251855388284], [0.0002511885832063854], [0.0003162277862429619], [0.0001584893325343728], [0.015848929062485695], [0.0002511885832063854], [0.00019952619913965464], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00012589251855388284], [0.0002511885832063854], [0.0001584893325343728], [0.00012589251855388284], [0.0001584893325343728]], [[0.00019952619913965464], [0.0006309573072940111], [0.00019952619913965464], 
[0.0006309573072940111], [0.00019952619913965464], [0.0025118854828178883], [0.00019952619913965464], [0.03981072083115578], [0.003981070592999458], [0.0025118854828178883], [0.050118714570999146], [0.003162277629598975], [0.03981072083115578], [0.0025118854828178883], [0.003981070592999458], [0.003981070592999458], [0.003981070592999458], [0.0005011872854083776], [0.03981072083115578], [0.003981070592999458], [0.0006309573072940111], [0.050118714570999146], [0.0003162277862429619], [0.00019952619913965464], [0.00019952619913965464], [0.00019952619913965464], [0.0025118854828178883], [0.06309573352336884], [0.06309573352336884], [0.0012589250691235065], [0.050118714570999146], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.03981072083115578], [0.00019952619913965464], [0.00039810704765841365], [0.0025118854828178883], [0.003981070592999458], [0.06309573352336884], [0.0007943279924802482], [0.06309573352336884], [0.00158489344175905], [0.0025118854828178883], [0.00158489344175905], [0.050118714570999146]], [[0.00019952619913965464], [0.0006309573072940111], [0.0003162277862429619]]]>
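The printed values are consistent with the standard Phred relation, where a quality score Q corresponds to an error probability of 10^(-Q/10) and each quality character encodes Q as its ASCII code minus 33. The sketch below reproduces that arithmetic in plain Python. The offset of 33 is an assumption inferred from the numbers above, not something this tutorial states about the op, and `phred_to_probabilities` is a hypothetical helper name.

```python
from typing import List


def phred_to_probabilities(quality: str, offset: int = 33) -> List[float]:
    """Convert an ASCII-encoded Phred string into per-base error probabilities.

    Assumes the common Phred+33 encoding: Q = ord(char) - offset, p = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(char) - offset) / 10) for char in quality]


# Quality string of the first read in the sample file.
probabilities = phred_to_probabilities("BB>B@FA")
for char, probability in zip("BB>B@FA", probabilities):
    print(char, ord(char) - 33, probability)
```

For the first read, the results agree with the first row of the ragged tensor above to within float32 rounding.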
One-hot encodings
You may also want to encode the genome sequence data (which consists of A, T, C, and G bases) using a one-hot encoder. There is a built-in operation that can help with this.
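one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)
print(one_hot)
print(one_hot.shape)
<tf.RaggedTensor [[[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]]>
(4, None, 4)
print(tfio.genome.sequences_to_onehot.__doc__)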
Convert DNA sequences into a one hot nucleotide encoding.
Each nucleotide in each sequence is mapped as follows:
A -> [1, 0, 0, 0]
C -> [0, 1, 0, 0]
G -> [0 ,0 ,1, 0]
T -> [0, 0, 0, 1]
If for some reason a non (A, T, C, G) character exists in the string, it is
currently mapped to a error one hot encoding [1, 1, 1, 1].
Args:
sequences: A tf.string tensor where each string represents a DNA sequence
Returns:
tf.RaggedTensor: The output sequences with nucleotides one hot encoded.
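The mapping described in the docstring can be mirrored in a few lines of plain Python, which may help when checking the op's output by hand. This is a sketch of the documented mapping only, not the library's implementation; `sequence_to_onehot` is a hypothetical helper, and treating lowercase letters as unknown characters is an assumption of this sketch.

```python
from typing import Dict, List

# Mapping copied from the docstring above.
NUCLEOTIDE_TO_ONEHOT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
# The docstring says any other character maps to this error encoding.
ERROR_ENCODING: List[int] = [1, 1, 1, 1]


def sequence_to_onehot(sequence: str) -> List[List[int]]:
    """Encode one DNA string as a list of four-element one-hot vectors."""
    return [list(NUCLEOTIDE_TO_ONEHOT.get(base, ERROR_ENCODING)) for base in sequence]


encoded = sequence_to_onehot("GATTACA")
for base, vector in zip("GATTACA", encoded):
    print(base, vector)
```

Encoding a list of reads this way yields one row per read with a varying number of four-element vectors, which corresponds to the (4, None, 4) ragged shape reported for the sample file.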
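Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-01-24 UTC.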