TFRecord and tf.train.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout higher-level APIs such as TFX.

This notebook demonstrates how to create, parse, and use the tf.train.Example message, and then how to serialize, write, and read tf.train.Example messages to and from .tfrecord files.

Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB+ and ideally 100 MB+) so that you can benefit from I/O prefetching. For example, say you have X GB of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10*N files, as long as ~X/(10*N) is 10 MB+ (and ideally 100 MB+). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits.
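
The rule of thumb in the note can be turned into a small calculation. The following is a minimal sketch in plain Python. The function name `suggest_num_shards` and its parameters are hypothetical and not part of TensorFlow; the factor of 10 files per host and the 10 MB floor are simply the values quoted in the note.

```python
def suggest_num_shards(total_gb, num_hosts, min_shard_mb=10, files_per_host=10):
    """Suggest a shard count following the rule of thumb in the note.

    Hypothetical helper: aims for `files_per_host * num_hosts` files, but
    creates fewer shards when that would push each shard below `min_shard_mb`.
    """
    total_mb = total_gb * 1024
    target = files_per_host * num_hosts
    # Largest shard count that still keeps every shard at or above the floor.
    max_by_size = max(1, int(total_mb // min_shard_mb))
    return min(target, max_by_size)


# 100 GB read by up to 8 hosts: 80 shards of roughly 1.25 GB each.
large_dataset_shards = suggest_num_shards(total_gb=100, num_hosts=8)

# 0.5 GB read by up to 8 hosts: 80 shards would be about 6 MB each, so the
# helper falls back to fewer, larger shards.
small_dataset_shards = suggest_num_shards(total_gb=0.5, num_hosts=8)

print(large_dataset_shards, small_dataset_shards)
```

Passing `min_shard_mb=100` applies the stricter "ideally 100 MB+" target instead.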

Setup

import tensorflow as tf

import numpy as np
import IPython.display as display

`tf.train.Example`

Data types for `tf.train.Example`

Fundamentally, a tf.train.Example is a {"string": tf.train.Feature} mapping.

The tf.train.Feature message type can accept one of the following three types (see the .proto file for reference). Most other generic types can be coerced into one of these:

tf.train.BytesList (the following types can be coerced)
- string
- byte
tf.train.FloatList (the following types can be coerced)
- float (float32)
- double (float64)
tf.train.Int64List (the following types can be coerced)
- bool
- enum
- int32
- uint32
- int64
- uint64

In order to convert a standard TensorFlow type to a tf.train.Example-compatible tf.train.Feature, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a tf.train.Feature containing one of the three list types above:

# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (for example, _int64_feature(1.0) will error out because 1.0 is a float; it should be used with the _float_feature function instead):

print(_bytes_feature(b'test_string'))
print(_bytes_feature(u'test_bytes'.encode('utf-8')))

print(_float_feature(np.exp(1)))

print(_int64_feature(True))
print(_int64_feature(1))

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

float_list {
  value: 2.7182817459106445
}

int64_list {
  value: 1
}

int64_list {
  value: 1
}
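
Note that the float_list output above shows 2.7182817459106445 rather than the familiar 2.718281828459045. A FloatList stores 32-bit floats, so a Python float (a 64-bit double) is rounded when it is stored. The sketch below reproduces that rounding with only the standard library; `as_float32` is a hypothetical helper name introduced for illustration.

```python
import math
import struct


def as_float32(x):
    """Round a Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


rounded_e = as_float32(math.e)

print(math.e)      # the 64-bit value
print(rounded_e)   # the value after a round trip through 32 bits
```

If you need full 64-bit precision, one option is to serialize the tensor to bytes (for example with tf.io.serialize_tensor) and store it in a BytesList instead.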

All proto messages can be serialized to a binary string using the .SerializeToString method:

feature = _float_feature(np.exp(1))

feature.SerializeToString()

b'\x12\x06\n\x04T\xf8-@'

Creating a `tf.train.Example` message

Suppose you want to create a tf.train.Example message from existing data. In practice, the dataset may come from anywhere, but the procedure for creating the tf.train.Example message from a single observation is the same:

1. Within each observation, each value needs to be converted to a tf.train.Feature containing one of the 3 compatible types, using one of the functions above.
2. You create a map (dictionary) from the feature name string to the encoded feature value produced in step 1.
3. The map produced in step 2 is converted to a Features message.

In this notebook, you will create a dataset using NumPy.

This dataset will have 4 features:

- a boolean feature, False or True with equal probability
- an integer feature chosen uniformly at random from [0, 4]
- a string feature generated from a string table by using the integer feature as an index
- a float feature from a standard normal distribution

Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature.
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution.
feature3 = np.random.randn(n_observations)

Each of these features can be coerced into a tf.train.Example-compatible type using one of _bytes_feature, _float_feature, _int64_feature. You can then create a tf.train.Example message from these encoded features:

def serialize_example(feature0, feature1, feature2, feature3):
  """
  Creates a tf.train.Example message ready to be written to a file.
  """
  # Create a dictionary mapping the feature name to the tf.train.Example-compatible
  # data type.
  feature = {
      'feature0': _int64_feature(feature0),
      'feature1': _int64_feature(feature1),
      'feature2': _bytes_feature(feature2),
      'feature3': _float_feature(feature3),
  }

  # Create a Features message using tf.train.Example.

  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
  return example_proto.SerializeToString()

For example, suppose you have a single observation from the dataset, [False, 4, b'goat', 0.9876]. You can create and print the tf.train.Example message for this observation using serialize_example(). Each single observation will be written as a Features message as per the above. Note that the tf.train.Example message is just a wrapper around the Features message:

# This is an example observation from the dataset.

example_observation = []

serialized_example = serialize_example(False, 4, b'goat', 0.9876)
serialized_example

b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?'

To decode the message use the tf.train.Example.FromString method.

example_proto = tf.train.Example.FromString(serialized_example)
example_proto

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.9876000285148621
      }
    }
  }
}
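
To see that the serialized bytes really are just nested length-delimited protobuf fields, you can walk them by hand. The following is a rough sketch using only the standard library, not a replacement for FromString. It assumes the field numbers from TensorFlow's example.proto and feature.proto (1 for features, the map key and bytes_list; 2 for the map value and float_list; 3 for int64_list), so check the .proto files before relying on them. It only handles features that hold a single non-negative value, as in this notebook. The helper names are hypothetical.

```python
import struct


def read_varint(buf, pos):
    """Decode a protobuf varint starting at `pos`; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def length_delimited_fields(buf):
    """Yield (field_number, payload) for each length-delimited field in `buf`."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type != 2:
            raise ValueError('this sketch only handles length-delimited fields')
        length, pos = read_varint(buf, pos)
        yield field_number, buf[pos:pos + length]
        pos += length


def decode_example(serialized):
    """Decode a serialized Example whose features each hold one value."""
    result = {}
    (_, features), = length_delimited_fields(serialized)    # Example.features
    for _, entry in length_delimited_fields(features):      # one map entry per feature
        fields = dict(length_delimited_fields(entry))       # 1: key, 2: value
        key = fields[1].decode('utf-8')
        (kind, value_list), = length_delimited_fields(fields[2])
        (_, payload), = length_delimited_fields(value_list)  # the list's `value` field
        if kind == 1:      # bytes_list: payload is the raw bytes
            result[key] = payload
        elif kind == 2:    # float_list: packed little-endian float32
            result[key] = struct.unpack('<f', payload)[0]
        else:              # int64_list: packed varint
            result[key] = read_varint(payload, 0)[0]
    return result


# The serialized example printed earlier in this section.
serialized = (b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'
              b'\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'
              b'\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00'
              b'\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04[\xd3|?')

decoded = decode_example(serialized)
print(decoded)
```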

TFRecords format details

A TFRecord file contains a sequence of records. The file can only be read sequentially.

Each record contains a byte-string for the data payload, plus the data length, and CRC-32C (32-bit CRC using the Castagnoli polynomial) hashes for integrity checking.

Each record is stored in the following format:

uint64 length
uint32 masked_crc32_of_length
byte   data[length]
uint32 masked_crc32_of_data

The records are concatenated together to produce the file. The CRCs are CRC-32C checksums, and the mask of a CRC is:

masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul
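
The record layout and the masking formula can be sketched in pure Python. This is an illustrative sketch rather than TensorFlow's implementation. It assumes the integers are stored little-endian and that the length CRC is computed over the 8 encoded length bytes; the function names are hypothetical.

```python
import io
import struct


def _make_crc32c_table():
    """Build the lookup table for CRC-32C (reflected Castagnoli polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Apply the mask from the formula above, in 32-bit arithmetic."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload):
    """Append one record: length, CRC of length, data, CRC of data."""
    header = struct.pack('<Q', len(payload))
    stream.write(header)
    stream.write(struct.pack('<I', masked_crc(header)))
    stream.write(payload)
    stream.write(struct.pack('<I', masked_crc(payload)))


def read_records(stream):
    """Yield payloads one after another, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length_crc,) = struct.unpack('<I', stream.read(4))
        if length_crc != masked_crc(header):
            raise ValueError('corrupted length field')
        (length,) = struct.unpack('<Q', header)
        payload = stream.read(length)
        (data_crc,) = struct.unpack('<I', stream.read(4))
        if data_crc != masked_crc(payload):
            raise ValueError('corrupted payload')
        yield payload


buffer = io.BytesIO()
for item in [b'hello', b'world!']:
    write_record(buffer, item)

buffer.seek(0)
recovered = list(read_records(buffer))
print(recovered)
```

Because each record only stores its own length, a reader has to walk the file from the start to find a given record, which is why the file can only be read sequentially.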

Note: There is no requirement to use tf.train.Example in TFRecord files. tf.train.Example is just a method of serializing dictionaries to byte-strings. Any byte-string that can be decoded in TensorFlow could be stored in a TFRecord file. Examples include: lines of text, JSON (using tf.io.decode_json_example), encoded image data, or serialized tf.Tensors (using tf.io.serialize_tensor / tf.io.parse_tensor). See the tf.io module for more options.

TFRecord files using `tf.data`

The tf.data module also provides tools for reading and writing data in TensorFlow.

Writing a TFRecord file

The easiest way to get the data into a dataset is to use the from_tensor_slices method.

Applied to an array, it returns a dataset of scalars:

tf.data.Dataset.from_tensor_slices(feature1)

<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

Applied to a tuple of arrays, it returns a dataset of tuples:

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

<TensorSliceDataset element_spec=(TensorSpec(shape=(), dtype=tf.bool, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None), TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.float64, name=None))>

# Use `take(1)` to only pull one example from the dataset.
for f0,f1,f2,f3 in features_dataset.take(1):
  print(f0)
  print(f1)
  print(f2)
  print(f3)

tf.Tensor(False, shape=(), dtype=bool)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'goat', shape=(), dtype=string)
tf.Tensor(0.5251196235602504, shape=(), dtype=float64)

Use the tf.data.Dataset.map method to apply a function to each element of a Dataset.

The mapped function must operate in TensorFlow graph mode: it must operate on and return tf.Tensors. A non-tensor function, like serialize_example, can be wrapped with tf.py_function to make it compatible.

Using tf.py_function requires you to specify the shape and type information that is otherwise unavailable:

def tf_serialize_example(f0,f1,f2,f3):
  tf_string = tf.py_function(
    serialize_example,
    (f0, f1, f2, f3),  # Pass these args to the above function.
    tf.string)      # The return type is `tf.string`.
  return tf.reshape(tf_string, ()) # The result is a scalar.

tf_serialize_example(f0, f1, f2, f3)

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>

Apply this function to each element in the dataset:

serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset

<MapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

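# Alternatively, build the same dataset of serialized strings eagerly with a
# Python generator instead of `tf.py_function`.
def generator():
  for features in features_dataset:
    yield serialize_example(*features)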

serialized_features_dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.string, output_shapes=())

serialized_features_dataset

<FlatMapDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

And write them to a TFRecord file:

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

WARNING:tensorflow:From /tmp/ipykernel_25215/3575438268.py:2: TFRecordWriter.__init__ (from tensorflow.python.data.experimental.ops.writers) is deprecated and will be removed in a future version.
Instructions for updating:
To write TFRecords to disk, use `tf.io.TFRecordWriter`. To save and load the contents of a dataset, use `tf.data.experimental.save` and `tf.data.experimental.load`
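
The warning above points to tf.io.TFRecordWriter as the replacement for the deprecated tf.data.experimental.TFRecordWriter. One way to follow that advice is to iterate over the dataset eagerly and write each serialized string yourself. The snippet below is a minimal, self-contained sketch with a tiny stand-in dataset of byte-strings; the file name and the records are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Stand-in for a dataset of serialized records (scalar tf.string elements).
records = [b'first', b'second', b'third']
dataset = tf.data.Dataset.from_tensor_slices(records)

path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecord')

# Write every element of the dataset as one TFRecord.
with tf.io.TFRecordWriter(path) as writer:
  for record in dataset:
    writer.write(record.numpy())

# Read the file back to confirm the round trip.
read_back = [raw.numpy() for raw in tf.data.TFRecordDataset(path)]
print(read_back)
```

The same loop should work with serialized_features_dataset and filename from this notebook in place of the stand-in dataset and path.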

Reading a TFRecord file

You can also read the TFRecord file using the tf.data.TFRecordDataset class.

More information on consuming TFRecord files using tf.data can be found in the tf.data: Build TensorFlow input pipelines guide.

Using TFRecordDatasets can be useful for standardizing input data and optimizing performance.

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

At this point the dataset contains serialized tf.train.Example messages. When iterated over, it returns these as scalar string tensors.

Use the .take method to show only the first 10 records.

for raw_record in raw_dataset.take(10):
  print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04=n\x06?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x9d\xfa\x98\xbe\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03dog\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04a\xc0r?\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x92Q(?'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04>\xc0\xe5>\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nU\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04I!\xde\xbe\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x02\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x00\n\x17\n\x08feature2\x12\x0b\n\t\n\x07chicken'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\xe0\x1a\xab\xbf\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nQ\n\x13\n\x08feature2\x12\x07\n\x05\n\x03cat\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x87\xb2\xd7?\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x00'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04n\xe19>\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>
<tf.Tensor: shape=(), dtype=string, numpy=b'\nR\n\x14\n\x08feature3\x12\x08\x12\x06\n\x04\x1as\xd9\xbf\n\x11\n\x08feature0\x12\x05\x1a\x03\n\x01\x01\n\x11\n\x08feature1\x12\x05\x1a\x03\n\x01\x04\n\x14\n\x08feature2\x12\x08\n\x06\n\x04goat'>

These tensors can be parsed using the function below. Note that the feature_description is necessary here because tf.data.Datasets use graph execution, and need this description to build their shape and type signature:

# Create a description of the features.
feature_description = {
    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

Alternatively, use tf.io.parse_example to parse a whole batch at once. Apply _parse_function to each item in the dataset using the tf.data.Dataset.map method:

parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>

Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. Each item is a tf.Tensor, and the numpy element of this tensor displays the value of the feature:

for parsed_record in parsed_dataset.take(10):
  print(repr(parsed_record))

{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.5251196>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.29878703>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'dog'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.94824797>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.65749466>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.44873232>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'chicken'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-0.4338477>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.3367577>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'cat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=1.6851357>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=0.18152401>}
{'feature0': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'feature1': <tf.Tensor: shape=(), dtype=int64, numpy=4>, 'feature2': <tf.Tensor: shape=(), dtype=string, numpy=b'goat'>, 'feature3': <tf.Tensor: shape=(), dtype=float32, numpy=-1.6988251>}

Here, the tf.io.parse_single_example function unpacks the tf.train.Example fields into standard tensors.
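
For comparison, here is a minimal sketch of the batch alternative, tf.io.parse_example, which parses a vector of serialized messages in one call. The feature names `number` and `name` and the helper `make_example` are made up for this illustration and are separate from the dataset used in this notebook.

```python
import tensorflow as tf


def make_example(number, name):
  """Serialize a tiny two-feature tf.train.Example (illustrative only)."""
  feature = {
      'number': tf.train.Feature(int64_list=tf.train.Int64List(value=[number])),
      'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
  }
  example = tf.train.Example(features=tf.train.Features(feature=feature))
  return example.SerializeToString()


# A batch is a 1-D tensor of serialized messages.
serialized_batch = tf.constant([make_example(1, b'cat'), make_example(2, b'dog')])

description = {
    'number': tf.io.FixedLenFeature([], tf.int64),
    'name': tf.io.FixedLenFeature([], tf.string),
}

# Each parsed feature gains a leading batch dimension.
parsed = tf.io.parse_example(serialized_batch, description)
print(parsed['number'].shape, parsed['name'].shape)
```

With a dataset, the equivalent pattern would be to call .batch(...) first and then map a function that calls tf.io.parse_example on each batch.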

TFRecord files in Python

The tf.io module also contains pure-Python functions for reading and writing TFRecord files.

Writing a TFRecord file

Next, write the 10,000 observations to the file test.tfrecord. Each observation is converted to a tf.train.Example message, then written to file. You can then verify that the file test.tfrecord has been created:

# Write the `tf.train.Example` observations to the file.
with tf.io.TFRecordWriter(filename) as writer:
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

!du -sh {filename}

984K    test.tfrecord
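
As a rough sanity check, the reported size can be estimated from the record layout described in the format details section. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes each serialized example takes 80 bytes plus the length of its string (the 'goat' example printed earlier is 84 bytes) and that the five strings are equally likely.

```python
# Per-record framing: 8-byte length, 4-byte CRC of the length, 4-byte CRC of the data.
FRAMING_BYTES = 8 + 4 + 4

# Assumption: serialized payload size is 80 bytes plus the string length.
FIXED_PAYLOAD_BYTES = 80

strings = [b'cat', b'dog', b'chicken', b'horse', b'goat']
n_observations = 10_000

record_sizes = [FRAMING_BYTES + FIXED_PAYLOAD_BYTES + len(s) for s in strings]

# Expected file size if every string is drawn equally often.
expected_bytes = sum(record_sizes) * n_observations // len(strings)
expected_kib = expected_bytes / 1024

print(expected_bytes, round(expected_kib, 1))
```

The estimate lands close to the figure reported by du, which counts whole disk blocks and depends on the random draw, so an exact match is not expected.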

Reading a TFRecord file

These serialized tensors can be easily parsed using tf.train.Example.ParseFromString:

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

features {
  feature {
    key: "feature0"
    value {
      int64_list {
        value: 0
      }
    }
  }
  feature {
    key: "feature1"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "feature2"
    value {
      bytes_list {
        value: "goat"
      }
    }
  }
  feature {
    key: "feature3"
    value {
      float_list {
        value: 0.5251196026802063
      }
    }
  }
}

That returns a tf.train.Example proto which is difficult to use as is, but it is fundamentally a representation of a:

Dict[str,
     Union[List[float],
           List[int],
           List[str]]]

The following code manually converts the Example to a dictionary of NumPy arrays, without using TensorFlow Ops. Refer to the proto file for details.

result = {}
# example.features.feature is the dictionary
for key, feature in example.features.feature.items():
  # The values are the Feature objects which contain a `kind` which contains:
  # one of three fields: bytes_list, float_list, int64_list

  kind = feature.WhichOneof('kind')
  result[key] = np.array(getattr(feature, kind).value)

result

{'feature3': array([0.5251196]),
 'feature1': array([4]),
 'feature0': array([0]),
 'feature2': array([b'goat'], dtype='|S4')}

Walkthrough: Reading and writing image data

This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.

This can be useful if, for example, you want to use several models on the same input dataset. Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.

First, let's download this image of a cat in the snow and this photo of the Williamsburg Bridge, NYC under construction.

Fetch the images

cat_in_snow  = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

williamsburg_bridge = tf.keras.utils.get_file(
    '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg
24576/17858 [=========================================] - 0s 0us/step
32768/17858 [=======================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg
16384/15477 [===============================] - 0s 0us/step
24576/15477 [===============================================] - 0s 0us/step

display.display(display.Image(filename=cat_in_snow))
display.display(display.HTML('Image cc-by: <a href="https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg">Von.grzanka</a>'))

[JPEG image: cat in the snow]

display.display(display.Image(filename=williamsburg_bridge))
display.display(display.HTML('<a href="https://commons.wikimedia.org/wiki/File:New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg">From Wikimedia</a>'))

[JPEG image: Williamsburg Bridge under construction]

Write the TFRecord file

As before, encode the features as types compatible with tf.train.Example. This stores the raw image string feature, as well as the height, width, depth, and arbitrary label feature. The latter is used when you write the file to distinguish between the cat image and the bridge image. Use 0 for the cat image, and 1 for the bridge image:

image_labels = {
    cat_in_snow : 0,
    williamsburg_bridge : 1,
}

# This is an example, just using the cat image.
image_string = open(cat_in_snow, 'rb').read()

label = image_labels[cat_in_snow]

# Create a dictionary with features that may be relevant.
def image_example(image_string, label):
  image_shape = tf.io.decode_jpeg(image_string).shape

  feature = {
      'height': _int64_feature(image_shape[0]),
      'width': _int64_feature(image_shape[1]),
      'depth': _int64_feature(image_shape[2]),
      'label': _int64_feature(label),
      'image_raw': _bytes_feature(image_string),
  }

  return tf.train.Example(features=tf.train.Features(feature=feature))

for line in str(image_example(image_string, label)).split('\n')[:15]:
  print(line)
print('...')

features {
  feature {
    key: "depth"
    value {
      int64_list {
        value: 3
      }
    }
  }
  feature {
    key: "height"
    value {
      int64_list {
        value: 213
      }
...

Notice that all of the features are now stored in the tf.train.Example message. Next, functionalize the code above and write the example messages to a file named images.tfrecords:

# Write the raw image files to `images.tfrecords`.
# First, process the two images into `tf.train.Example` messages.
# Then, write to a `.tfrecords` file.
record_file = 'images.tfrecords'
with tf.io.TFRecordWriter(record_file) as writer:
  for filename, label in image_labels.items():
    image_string = open(filename, 'rb').read()
    tf_example = image_example(image_string, label)
    writer.write(tf_example.SerializeToString())

!du -sh {record_file}

36K images.tfrecords

Read the TFRecord file

You now have the file images.tfrecords and can iterate over the records in it to read back what you wrote. Given that in this example you will only reproduce the image, the only feature you will need is the raw image string. Extract it using the getters described above, namely example.features.feature['image_raw'].bytes_list.value[0]. You can also use the labels to determine which record is the cat and which one is the bridge:

raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')

# Create a dictionary describing the features.
image_feature_description = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}

def _parse_image_function(example_proto):
  # Parse the input tf.train.Example proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, image_feature_description)

parsed_image_dataset = raw_image_dataset.map(_parse_image_function)
parsed_image_dataset

<MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

Recover the images from the TFRecord file:

for image_features in parsed_image_dataset:
  image_raw = image_features['image_raw'].numpy()
  display.display(display.Image(data=image_raw))

[JPEG image output]
