TFRecord and tf.train.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files; these are often the easiest way to understand a message type.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.train.Example message, and then serialize, write, and read tf.train.Example messages to and from .tfrecord files.

Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB, and ideally 100 MB or more) so that you can benefit from I/O prefetching. For example, say you have X GB of data and you plan to train on up to N hosts. Ideally, you should shard the data into ~10*N files, as long as ~X/(10*N) is 10 MB or more (and ideally 100 MB or more). If it is less than that, you might need to create fewer shards to trade off parallelism benefits against I/O prefetching benefits.
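
The sizing rule in the note above can be sketched as a small helper. The name suggest_num_shards and its min_shard_bytes parameter are illustrative assumptions, not part of TensorFlow; the sketch only encodes the "10 files per host, at least 10 MB per file" rule of thumb:

```python
MB = 1024 ** 2
GB = 1024 ** 3


def suggest_num_shards(total_bytes, num_hosts, min_shard_bytes=10 * MB):
    """Hypothetical helper: picks a shard count from the rule of thumb."""
    # Start from the target of 10 files per host that reads the data.
    target = 10 * num_hosts
    # Keep the target only if every shard still reaches the minimum size.
    if total_bytes // target >= min_shard_bytes:
        return target
    # Otherwise create fewer, larger shards (but always at least one).
    return max(1, total_bytes // min_shard_bytes)


print(suggest_num_shards(100 * GB, num_hosts=8))  # plenty of data: 10 * N shards
print(suggest_num_shards(200 * MB, num_hosts=8))  # small dataset: fewer shards
```

Passing min_shard_bytes=100 * MB applies the stricter "ideally 100 MB" threshold instead.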

Setup

import tensorflow as tf

import numpy as np
import IPython.display as display

`tf.train.Example`

Data types for `tf.train.Example`

Fundamentally, a tf.train.Example is a {"string": tf.train.Feature} mapping.

The tf.train.Feature message type can accept one of the following three types (see the .proto file for reference). Most other generic types can be coerced into one of these:

tf.train.BytesList (the following types can be coerced)
- string
- byte
tf.train.FloatList (the following types can be coerced)
- float (float32)
- double (float64)
tf.train.Int64List (the following types can be coerced)
- bool
- enum
- int32
- uint32
- int64
- uint64
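
As a rough illustration of the table above, the sketch below maps a plain Python value to the name of the list field that would hold it. The function feature_kind is a hypothetical helper written for this example only; it is not a TensorFlow API and it does not build any protobuf message:

```python
import enum


def feature_kind(value):
    """Hypothetical helper: names the Feature list that would hold `value`."""
    # Text has to be encoded to bytes before it is stored in a BytesList.
    if isinstance(value, (bytes, str)):
        return 'bytes_list'
    if isinstance(value, float):
        return 'float_list'
    # bool is a subclass of int, and IntEnum members are ints as well.
    if isinstance(value, int):
        return 'int64_list'
    raise TypeError('no compatible list type for %r' % type(value))


class Color(enum.IntEnum):
    RED = 0
    GREEN = 1


for sample in (b'raw', 'text', 2.5, True, 7, Color.GREEN):
    print(repr(sample), '->', feature_kind(sample))
```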

To convert a standard TensorFlow type to a tf.train.Example-compatible tf.train.Feature, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types above:

# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function raises an exception (for example, _int64_feature(1.0) fails because 1.0 is a float, so it should be used with the _float_feature function instead):

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}
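
The float_list value printed above is close to, but not exactly, e. The reason is that FloatList holds 32-bit floats, so a float64 input such as np.exp(1) is rounded when it is stored. A minimal standard-library sketch of that rounding (as_float32 is a hypothetical helper, not a TensorFlow function):

```python
import math
import struct


def as_float32(value):
    """Rounds a Python float (float64) to the nearest float32 value."""
    # Packing as a 4-byte float and unpacking again drops the extra precision.
    return struct.unpack('<f', struct.pack('<f', value))[0]


stored = as_float32(math.e)
print(math.e)                        # the original float64 value
print(stored)                        # the value after the float32 round trip
print(abs(stored - math.e) < 1e-6)   # the difference is small but nonzero
```

If the extra precision matters, one option is to store the raw float64 bytes in a BytesList instead, at the cost of decoding them yourself.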

All proto messages can be serialized to a binary string using the .SerializeToString method:

feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'

Creating a `tf.train.Example` message

Suppose you want to create a tf.train.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure for creating the tf.train.Example message from a single observation is the same:

1. Within each observation, each value needs to be converted to a tf.train.Feature containing one of the 3 compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a Features message.

In this notebook, you will create a dataset using NumPy.

This dataset will have 4 features:

- a boolean feature, False or True with equal probability
- an integer feature uniformly randomly chosen from [0, 5), that is, from 0 to 4
- a string feature generated from a string table by using the integer feature as an index
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature.
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution.
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.train.Example-compatible type using one of _bytes_feature, _float_feature, or _int64_feature. You can then create a tf.train.Example message from these encoded features:

def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.train.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.train.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

For example, suppose you have a single observation from the dataset, [False, 4, b'goat', 0.9876]. You can create and print the tf.train.Example message for this observation using serialize_example(). Each single observation is written as a Features message as described above. Note that the tf.train.Example message is just a wrapper around the Features message:

# This is an example observation from the dataset.

example_observation = []

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

To decode the message, use the tf.train.Example.FromString method.

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
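
To make the "wrapper around Features" structure concrete, the serialized bytes shown earlier can also be walked by hand. The sketch below is a deliberately tiny protobuf wire-format reader written for this example; it assumes the field numbers from the .proto files (features = 1, map key = 1, map value = 2, bytes_list = 1, float_list = 2, int64_list = 3) and only understands length-delimited fields with packed numeric lists, which is what the bytes above contain. All function names are hypothetical, and tf.train.Example.FromString remains the practical way to decode a message:

```python
import struct

# The serialized tf.train.Example printed earlier, split per feature.
SERIALIZED_EXAMPLE = (
    b'\nR'
    b'\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'
    b'\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'
    b'\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'
    b'\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'
)


def read_varint(buf, pos):
    """Decodes one base-128 varint starting at `pos`."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_fields(buf):
    """Yields (field_number, payload) for length-delimited fields."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if tag & 0x07 != 2:
            raise ValueError('this sketch only handles length-delimited fields')
        length, pos = read_varint(buf, pos)
        yield tag >> 3, buf[pos:pos + length]
        pos += length


def decode_feature(buf):
    """Returns the values of the single list stored in a Feature."""
    for kind, list_msg in read_fields(buf):
        values = []
        for _, payload in read_fields(list_msg):
            if kind == 1:    # bytes_list: one payload per value
                values.append(payload)
            elif kind == 2:  # float_list: packed little-endian float32
                values.extend(struct.unpack('<%df' % (len(payload) // 4), payload))
            elif kind == 3:  # int64_list: packed varints (two's complement)
                pos = 0
                while pos < len(payload):
                    raw, pos = read_varint(payload, pos)
                    values.append(raw - (1 << 64) if raw >= 1 << 63 else raw)
        return values
    return []


def decode_example(serialized):
    """Returns {feature name: list of values} for a serialized Example."""
    result = {}
    for _, features in read_fields(serialized):   # Example.features
        for _, entry in read_fields(features):     # one map entry per feature
            fields = dict(read_fields(entry))      # 1 = key, 2 = Feature
            result[fields[1].decode('utf-8')] = decode_feature(fields.get(2, b''))
    return result


print(decode_example(SERIALIZED_EXAMPLE))
```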

TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte string for the data payload, plus the data length, and CRC-32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data

The records are concatenated together to produce the file. The CRCs are the CRC-32C values described above, and the mask of a CRC is:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
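
The record layout and masking formula above can be sketched in pure Python. This is an illustrative implementation, not the code TensorFlow uses: the function names are hypothetical, the CRC-32C is a plain table-driven version, and the integers are assumed to be stored little-endian. The mask is computed modulo 2^32, matching the unsigned 32-bit arithmetic implied by the formula:

```python
import struct


def _make_crc32c_table():
    poly = 0x82F63B78  # Castagnoli polynomial in reflected form
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data):
    """Plain table-driven CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """Applies the rotate-and-add mask from the formula above."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(data):
    """Wraps one payload as: length, masked CRC of length, data, masked CRC of data."""
    length = struct.pack('<Q', len(data))
    return (length + struct.pack('<I', masked_crc32c(length))
            + data + struct.pack('<I', masked_crc32c(data)))


def read_records(buffer):
    """Yields payloads from concatenated records, verifying both checksums."""
    offset = 0
    while offset < len(buffer):
        length_bytes = buffer[offset:offset + 8]
        (length,) = struct.unpack('<Q', length_bytes)
        (length_crc,) = struct.unpack('<I', buffer[offset + 8:offset + 12])
        if length_crc != masked_crc32c(length_bytes):
            raise ValueError('corrupt length field at offset %d' % offset)
        data = buffer[offset + 12:offset + 12 + length]
        (data_crc,) = struct.unpack(
            '<I', buffer[offset + 12 + length:offset + 16 + length])
        if data_crc != masked_crc32c(data):
            raise ValueError('corrupt data at offset %d' % offset)
        yield data
        offset += 16 + length


file_bytes = frame_record(b'first record') + frame_record(b'second')
print(list(read_records(file_bytes)))
```

Each record therefore costs 16 bytes of framing on top of its payload, and a reader has to walk the length fields from the start, which is why the file can only be read sequentially.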

TFRecord files using `tf.data`

The tf.data module also provides tools for reading and writing data in TensorFlow.

Writing a TFRecord file

The easiest way to get the data into a dataset is to use the from_tensor_slices method.

Applied to an array, it returns a dataset of scalars:

tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

Applied to a tuple of arrays, it returns a dataset of tuples:

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.bool, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.float64, name=None))>

# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_dataset.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.5251196235602504, shape=(), dtype=float64)

Use the tf.data.Dataset.map method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must operate on and return tf.Tensors. A non-tensor function, like serialize_example, can be wrapped with tf.py_function to make it compatible.

Using tf.py_function requires you to specify the shape and type information that is otherwise unavailable:

def tf_serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(
    serialize_example,
    (f0, f1, f2, f3),  # Pass these args to the above function.
    tf.string)      # The return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar.

tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>

Apply this function to each element in the dataset:

serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

def generator():
  for features in features_dataset:
    yield serialize_example(*features)

serialized_features_dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

serialized_features_dataset

<FlatMapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

And write them to a TFRecord file:

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

WARNING:tensorflow:From /tmp/ipykernel_25215/3575438268.py:2: TFRecordWriter.__init__ (from tensorflow.python.data.experimental.ops.writers) is deprecated and will be removed in a future version.
Instructions for updating:
To write TFRecords to disk, use `tf.io.TFRecordWriter`. To save and load the contents of a dataset, use `tf.data.experimental.save` and `tf.data.experimental.load`

Reading a TFRecord file

You can also read the TFRecord file using the tf.data.TFRecordDataset class.

More information on consuming TFRecord files using tf.data can be found in the tf.data: Build TensorFlow input pipelines guide.

Using TFRecordDatasets can be useful for standardizing input data and optimizing performance.

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

At this point the dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors.

Use the .take method to show only the first 10 records.

for raw_record in raw_dataset.take(10):
  print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x9d\xfa\x98\xbe\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04a\xc0r?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x92Q(?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04>\xc0\xe5>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04I!\xde\xbe\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe0\x1a\xab\xbf\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x87\xb2\xd7?\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04n\xe19>\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x1as\xd9\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>

These tensors can be parsed using the function below. Note that the feature_description is necessary here because tf.data.Datasets use graph execution and need this description to build their shape and type signature:

# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

Alternatively, use tf.io.parse_example to parse the whole batch at once. Apply _parse_function to each item in the dataset using the tf.data.Dataset.map method:

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. Each item is a tf.Tensor, and the numpy element of this tensor displays the value of the feature:

for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.5251196>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.29878703>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.94824797>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.65749466>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.44873232>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.4338477>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.3367577>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.6851357>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.18152401>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.6988251>}

Here, the tf.io.parse_single_example function unpacks the tf.train.Example fields into standard tensors.
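
The batch-oriented alternative mentioned earlier, tf.io.parse_example, can be sketched as follows. The example builds a few serialized messages in memory instead of reading test.tfrecord, and the helper make_example and the feature names 'label' and 'score' are assumptions made for this illustration only:

```python
import tensorflow as tf


def make_example(label, score):
    """Hypothetical helper: serializes one two-feature tf.train.Example."""
    feature = {
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        'score': tf.train.Feature(float_list=tf.train.FloatList(value=[score])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


serialized = [make_example(i, i * 0.5) for i in range(6)]

description = {
    'label': tf.io.FixedLenFeature([], tf.int64),
    'score': tf.io.FixedLenFeature([], tf.float32),
}

# Batch first, then parse each batch of serialized strings in one call.
dataset = (tf.data.Dataset.from_tensor_slices(serialized)
           .batch(3)
           .map(lambda batch: tf.io.parse_example(batch, description)))

batches = list(dataset)
for parsed in batches:
    print(parsed['label'].numpy(), parsed['score'].numpy())
```

Each parsed feature now has a leading batch dimension, so a scalar FixedLenFeature yields a tensor of shape (batch_size,).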

TFRecord files in Python

The tf.io module also contains pure-Python functions for reading and writing TFRecord files.

Writing a TFRecord file

Next, write the 10,000 observations to the file test.tfrecord. Each observation is converted to a tf.train.Example message, then written to file. You can then verify that the file test.tfrecord has been created:

# Write the `tf.train.Example` observations to the file.
with tf.io.TFRecordWriter(filename) as writer:
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

!du -sh {filename}

984K    test.tfrecord

Reading a TFRecord file

These serialized tensors can be easily parsed using tf.train.Example.ParseFromString:

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.5251196026802063
      }
    }
  }
}

That returns a tf.train.Example proto, which is awkward to use as is, but it is fundamentally a representation of a:

Dict[str,
     Union[List[float],
           List[int],
           List[str]]]

The following code manually converts the Example to a dictionary of NumPy arrays, without using TensorFlow ops. Refer to the .proto file for details.

result = {}
# example.features.feature is the dictionary
for key, feature in example.features.feature.items():
  # The values are the Feature objects which contain a `kind` which contains:
  # one of three fields: bytes_list, float_list, int64_list

  kind = feature.WhichOneof('kind')
  result[key] = np.array(getattr(feature, kind).value)

result

{'feature3': array([0.5251196]),
 'feature1': array([4]),
 'feature0': array([0]),
 'feature2': array([b'goat'], dtype='|S4')}

Walkthrough: Reading and writing image data

This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.

First, download this image of a cat in the snow and this photo of the Williamsburg Bridge, NYC under construction.

Fetch the images

cat_in_snow  = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

williamsburg_bridge = tf.keras.utils.get_file(
    '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg
24576/17858 [=========================================] - 0s 0us/step
32768/17858 [=======================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg
16384/15477 [===============================] - 0s 0us/step
24576/15477 [===============================================] - 0s 0us/step

display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

jpeg

display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

jpeg

Write the TFRecord file

As before, encode the features as types compatible with tf.train.Example. This stores the raw image string feature, as well as the height, width, depth, and an arbitrary label feature. The latter is used when you write the file to distinguish between the cat image and the bridge image. Use 0 for the cat image and 1 for the bridge image:

image_labels = {
    cat_in_snow : 0,
    williamsburg_bridge : 1,
}

# This is an example, just using the cat image.
image_string = open(cat_in_snow, 'rb').read()

label = image_labels[cat_in_snow]

# Create a dictionary with features that may be relevant.
def image_example(image_string, label):
  image_shape = tf.io.decode_jpeg(image_string).shape

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

for line in str(image_example(image_string, label)).split('\n')[:15]:
  print(line)
print('...')

features {
  feature {
    key: "depth"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "height"
    value {
      int64_list {
        value: 213
      }
...

Notice that all of the features are now stored in the tf.train.Example message. Next, functionalize the code above and write the example messages to a file named images.tfrecords:

# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.train.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with tf.io.TFRecordWriter(record_file) as writer:
  for filename, label in image_labels.items():
    image_string = open(filename, 'rb').read()
    tf_example = image_example(image_string, label)
    writer.write(tf_example.SerializeToString())

!du -sh {record_file}

36K images.tfrecords

Read the TFRecord file

You now have the file images.tfrecords and can iterate over the records in it to read back what you wrote. Since in this example you will only reproduce the image, the only feature you need is the raw image string. Extract it using the getters described above, namely example.features.feature['image_raw'].bytes_list.value[0]. You can also use the labels to determine which record is the cat and which one is the bridge:

raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

Recover the images from the TFRecord file:

for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))

jpeg
