
Overview

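This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: reading several genomics file formats, plus some common operations for preparing the data (for example, one-hot encoding or parsing Phred quality into probabilities).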

This package uses the Google Nucleus library to provide some of its core functionality.

Setup

try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install -q tensorflow-io
import tensorflow_io as tfio
import tensorflow as tf

FASTQ Data

FASTQ is a common genomics file format that stores sequence information together with per-base quality information.
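To make the layout concrete, here is a minimal pure-Python sketch of the usual four-line FASTQ record: an `@` header, the sequence, a `+` separator, and a quality string of the same length as the sequence. It is only an illustration and not how tfio.genome reads files. The helper `parse_fastq` and the read IDs `read_1` and `read_2` are made up for this example; the sequences and quality strings are two of the records shown later in this tutorial.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (this sketch also stops at a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical read IDs; an in-memory stream stands in for a real file.
example = io.StringIO("@read_1\nGATTACA\n+\nBB>B@FA\n@read_2\nCGG\n+\nFAD\n")
records = list(parse_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, quality)
```

Each quality character describes the base at the same position in the sequence, which is why the two strings must have equal length.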

First, let's download a sample FASTQ file.

# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   407  100   407    0     0   1833      0 --:--:-- --:--:-- --:--:--  1833

Read FASTQ Data

Now, let's read this file with tfio.genome.read_fastq (note that a tf.data API is coming soon).

fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)
tf.Tensor(
[b'GATTACA'
 b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'
 b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)
tf.Tensor(
[b'BB>B@FA'
 b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'
 b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' b'FAD'], shape=(4,), dtype=string)

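As you can see, the returned fastq_data has fastq_data.sequences, a string tensor of all the sequences in the FASTQ file (each of which can be a different length), and fastq_data.raw_quality, which holds the Phred-encoded quality of each base read in the corresponding sequence.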

Quality

If you are interested, you can use a helper op to convert this quality information into probabilities.

quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
(4, None, 1)
[ 7 68 46  3]
<tf.RaggedTensor [[[0.0005011872854083776], [0.0005011872854083776], [0.0012589250691235065], [0.0005011872854083776], [0.0007943279924802482], [0.00019952619913965464], [0.0006309573072940111]], [[0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.00019952619913965464], [0.0007943279924802482], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.0003162277862429619], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952619913965464], [0.00019952619913965464], [0.0001584893325343728], [0.00019952619913965464], [0.0001584893325343728], [0.00012589251855388284], [0.0005011872854083776], [0.00019952619913965464], [0.0005011872854083776], [0.00019952619913965464], [0.0005011872854083776], [0.00019952619913965464], [0.0006309573072940111], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.00012589251855388284], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00019952619913965464], [0.00012589251855388284], [0.00012589251855388284], [0.00039810704765841365], [0.0002511885832063854], [0.00019952619913965464], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952619913965464], [0.0001584893325343728], [0.00019952619913965464], [0.00019952619913965464], [0.00012589251855388284], [0.0002511885832063854], [0.0003162277862429619], [0.0001584893325343728], [0.015848929062485695], [0.0002511885832063854], [0.00019952619913965464], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00012589251855388284], [0.0002511885832063854], [0.0001584893325343728], [0.00012589251855388284], [0.0001584893325343728]], [[0.00019952619913965464], [0.0006309573072940111], [0.00019952619913965464], 
[0.0006309573072940111], [0.00019952619913965464], [0.0025118854828178883], [0.00019952619913965464], [0.03981072083115578], [0.003981070592999458], [0.0025118854828178883], [0.050118714570999146], [0.003162277629598975], [0.03981072083115578], [0.0025118854828178883], [0.003981070592999458], [0.003981070592999458], [0.003981070592999458], [0.0005011872854083776], [0.03981072083115578], [0.003981070592999458], [0.0006309573072940111], [0.050118714570999146], [0.0003162277862429619], [0.00019952619913965464], [0.00019952619913965464], [0.00019952619913965464], [0.0025118854828178883], [0.06309573352336884], [0.06309573352336884], [0.0012589250691235065], [0.050118714570999146], [0.0006309573072940111], [0.0006309573072940111], [0.0005011872854083776], [0.03981072083115578], [0.00019952619913965464], [0.00039810704765841365], [0.0025118854828178883], [0.003981070592999458], [0.06309573352336884], [0.0007943279924802482], [0.06309573352336884], [0.00158489344175905], [0.0025118854828178883], [0.00158489344175905], [0.050118714570999146]], [[0.00019952619913965464], [0.0006309573072940111], [0.0003162277862429619]]]>
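The printed values are consistent with the standard Phred relation P = 10^(-Q/10), where Q is the ASCII code of the quality character minus 33. The sketch below reproduces the first row of the output in pure Python. The offset of 33 is an assumption inferred from the numbers above, not taken from the library source, and `phred_to_probability` is a hypothetical helper, not part of tfio.genome.

```python
def phred_to_probability(quality, offset=33):
    """Convert a Phred-encoded quality string to per-base error probabilities.

    Assumes the Phred+33 convention: Q = ord(char) - 33 and P = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


# Quality string of the first read ('GATTACA') from the output above.
probs = phred_to_probability("BB>B@FA")
for ch, p in zip("BB>B@FA", probs):
    print(ch, ord(ch) - 33, p)
```

For example, 'B' has ASCII code 66, so Q = 33 and P = 10^(-3.3), roughly 0.0005, which agrees with the first entry of the ragged tensor up to float32 rounding.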

One-Hot Encoding

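You may also want to encode the genome sequence data (which consists of A, T, C, and G bases) with a one-hot encoder. There is a built-in operation that can help with this.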

print(tfio.genome.sequences_to_onehot.__doc__)
Convert DNA sequences into a one hot nucleotide encoding.

    Each nucleotide in each sequence is mapped as follows:
    A -> [1, 0, 0, 0]
    C -> [0, 1, 0, 0]
    G -> [0 ,0 ,1, 0]
    T -> [0, 0, 0, 1]

    If for some reason a non (A, T, C, G) character exists in the string, it is
    currently mapped to a error one hot encoding [1, 1, 1, 1].

    Args:
        sequences: A tf.string tensor where each string represents a DNA sequence

    Returns:
        tf.RaggedTensor: The output sequences with nucleotides one hot encoded.
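The mapping described in the docstring can be sketched in pure Python as follows. This only illustrates the encoding table and is not the library's implementation. `sequence_to_onehot` is a hypothetical helper, and treating every character other than uppercase A, C, G, or T as an error is an assumption of this sketch.

```python
# Encoding table copied from the docstring above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
ERROR = [1, 1, 1, 1]  # error encoding for any other character


def sequence_to_onehot(sequence):
    """Return one 4-element vector per character of the input sequence."""
    return [ONE_HOT.get(base, ERROR) for base in sequence]


# 'N' is not one of A, C, G, T, so it maps to the error encoding.
encoded = sequence_to_onehot("GATN")
print(encoded)
```

Because sequences differ in length, encoding a batch yields rows of different sizes, which is why the built-in op returns a tf.RaggedTensor.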