
Load CSV with tf.data


This tutorial provides an example of how to load CSV data from a file into a tf.data.Dataset.

The data used in this tutorial are taken from the Titanic passenger list. The model will predict the likelihood a passenger survived based on characteristics like age, gender, ticket class, and whether the person was traveling alone.

Setup

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 0us/step
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Load data

To start, let's look at the top of the CSV file to see how it is formatted.

!head {train_file_path}
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

You could load this using pandas and pass the NumPy arrays to TensorFlow. If you need to scale up to a large set of files, or need a loader that integrates with TensorFlow and tf.data, use the tf.data.experimental.make_csv_dataset function, which is what this tutorial does.
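For a single file that fits in memory, the pandas route can be as simple as the sketch below. It is an aside rather than part of this tutorial's pipeline, and the choice of columns and the tf.data.Dataset.from_tensor_slices conversion are just one reasonable way to do it.

import pandas as pd

# A minimal sketch of the pandas alternative: read the whole file into memory,
# split off the label column, and build a tf.data.Dataset from the arrays.
df = pd.read_csv(train_file_path)
labels = df.pop('survived')

# Only the numeric columns are used here; the string columns would still need
# to be encoded before a model could consume them.
numeric_df = df[['age', 'n_siblings_spouses', 'parch', 'fare']]
pandas_dataset = tf.data.Dataset.from_tensor_slices((numeric_df.values, labels.values))

The rest of this tutorial uses make_csv_dataset instead, since it streams from disk and handles mixed column types for you.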

The only column you need to identify explicitly is the one with the value that the model is intended to predict.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

Now read the CSV data from the file and create a dataset.

(For the full documentation, see tf.data.experimental.make_csv_dataset)

def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).

It might help to see this yourself.

show_batch(raw_train_data)
sex                 : [b'female' b'female' b'male' b'male' b'male']
age                 : [41. 27. 49. 24. 27.]
n_siblings_spouses  : [0 1 1 2 0]
parch               : [5 0 0 0 0]
fare                : [39.688 13.858 89.104 73.5    7.896]
class               : [b'Third' b'Second' b'First' b'Second' b'Third']
deck                : [b'unknown' b'unknown' b'C' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'y']

As you can see, the columns in the CSV are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them in a list of strings to the column_names argument in the make_csv_dataset function.

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)
sex                 : [b'male' b'male' b'male' b'male' b'male']
age                 : [27. 33. 32. 20. 51.]
n_siblings_spouses  : [0 0 0 1 0]
parch               : [0 0 0 1 0]
fare                : [ 8.663 12.275  7.925 15.742 26.55 ]
class               : [b'Third' b'Second' b'Third' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'E']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Cherbourg' b'Southampton']
alone               : [b'y' b'y' b'y' b'n' b'y']

This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it into the (optional) select_columns argument of the constructor.

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)
age                 : [28. 18.  9. 22. 52.]
n_siblings_spouses  : [0 0 2 0 1]
class               : [b'Third' b'Second' b'Third' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'D']
alone               : [b'y' b'n' b'n' b'y' b'n']

Data preprocessing

A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed-length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: tf.feature_column. See this tutorial for details.

You can preprocess your data using any tool you like (such as nltk or sklearn), and just pass the processed output to TensorFlow.
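As an illustration of that external route (an aside, not used in the rest of this tutorial; the scaler and the column choices are assumptions made for the example):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# A minimal sketch of preprocessing outside TensorFlow: scale the numeric
# columns with scikit-learn, then hand the already-processed arrays to tf.data.
df = pd.read_csv(train_file_path)
numeric_cols = ['age', 'n_siblings_spouses', 'parch', 'fare']
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

sklearn_dataset = tf.data.Dataset.from_tensor_slices(
    (df[numeric_cols].values.astype('float32'), df.pop('survived').values))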

The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.

Continuous data

If your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)
age                 : [36. 28. 37. 30. 28.]
n_siblings_spouses  : [1. 0. 1. 1. 0.]
parch               : [2. 0. 0. 1. 0.]
fare                : [120.     56.496  53.1    24.15    8.05 ]
example_batch, labels_batch = next(iter(temp_dataset)) 

Here's a simple function that will pack together all the columns:

def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

Apply this to each element of the dataset:

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())
[[ 36.      1.      2.    120.   ]
 [ 28.      0.      0.     56.496]
 [ 37.      1.      0.     53.1  ]
 [ 30.      1.      1.     24.15 ]
 [ 28.      0.      0.      8.05 ]]

[1 1 0 0 0]

If you have mixed datatypes, you may want to separate out these simple numeric fields. The tf.feature_column API can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:

show_batch(raw_train_data)
sex                 : [b'female' b'female' b'male' b'male' b'male']
age                 : [41. 27. 49. 24. 27.]
n_siblings_spouses  : [0 1 1 2 0]
parch               : [5 0 0 0 0]
fare                : [39.688 13.858 89.104 73.5    7.896]
class               : [b'Third' b'Second' b'First' b'Second' b'Third']
deck                : [b'unknown' b'unknown' b'C' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'y']
example_batch, labels_batch = next(iter(temp_dataset)) 

So define a more general preprocessor that selects a list of numeric features and packs them into a single column:

class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features

    return features, labels
NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
show_batch(packed_train_data)
sex                 : [b'female' b'female' b'male' b'male' b'male']
class               : [b'Third' b'Second' b'First' b'Second' b'Third']
deck                : [b'unknown' b'unknown' b'C' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Cherbourg' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'y']
numeric             : [[41.     0.     5.    39.688]
 [27.     1.     0.    13.858]
 [49.     1.     0.    89.104]
 [24.     2.     0.    73.5  ]
 [27.     0.     0.     7.896]]
example_batch, labels_batch = next(iter(packed_train_data)) 

Data normalization

Continuous data should always be normalized.

import pandas as pd
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])
def normalize_numeric_data(data, mean, std):
  # Center the data
  return (data-mean)/std

Now create a numeric column. The tf.feature_column.numeric_column API accepts a normalizer_fn argument, which will be run on each batch.

Bind the MEAN and STD to the normalizer fn using functools.partial.

normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]

# See what you just created.
numeric_column
NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7fe3041c08c8>, std=array([12.512,  1.151,  0.793, 54.598]), mean=array([29.631,  0.545,  0.38 , 34.385])))

When you train the model, include this feature column to select and center this block of numeric data:

example_batch['numeric']
<tf.Tensor: id=579, shape=(5, 4), dtype=float32, numpy=
array([[41.   ,  0.   ,  5.   , 39.688],
       [27.   ,  1.   ,  0.   , 13.858],
       [49.   ,  1.   ,  0.   , 89.104],
       [24.   ,  2.   ,  0.   , 73.5  ],
       [27.   ,  0.   ,  0.   ,  7.896]], dtype=float32)>
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()
array([[ 0.909, -0.474,  5.827,  0.097],
       [-0.21 ,  0.395, -0.479, -0.376],
       [ 1.548,  0.395, -0.479,  1.002],
       [-0.45 ,  1.264, -0.479,  0.716],
       [-0.21 , -0.474, -0.479, -0.485]], dtype=float32)

The mean-based normalization used here requires knowing the mean of each column ahead of time.
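If you would rather not read the file a second time with pandas, similar statistics can be accumulated from the tf.data pipeline itself. The sketch below is only an illustration of that idea; it assumes the packed_train_data dataset defined above, and it computes the population standard deviation, so the numbers will differ slightly from pandas describe(), which uses the sample standard deviation.

# A rough sketch: accumulate per-column sums over one pass of the packed
# dataset to estimate the mean and standard deviation of each numeric column.
count = 0
total = np.zeros(len(NUMERIC_FEATURES))
total_sq = np.zeros(len(NUMERIC_FEATURES))

for features, _ in packed_train_data:
  batch = features['numeric'].numpy().astype('float64')
  count += batch.shape[0]
  total += batch.sum(axis=0)
  total_sq += (batch ** 2).sum(axis=0)

mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
print(mean, std)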

Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the tf.feature_column API to create a collection with a tf.feature_column.indicator_column for each categorical column.

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
# See what you just created.
categorical_columns
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]

This will become part of the data processing input later when you build the model.

Combined preprocessing layer

Add the two feature column collections together and pass them to a tf.keras.layers.DenseFeatures layer to create an input layer that extracts and preprocesses both input types:

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)
print(preprocessing_layer(example_batch).numpy()[0])
[ 0.     1.     0.     0.     1.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     1.     0.     0.909 -0.474
  5.827  0.097  0.     1.   ]

Build the model

Build a tf.keras.Sequential, starting with the preprocessing_layer.

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Train, evaluate, and predict

Now the model can be instantiated and trained.

train_data = packed_train_data.shuffle(500)
test_data = packed_test_data
model.fit(train_data, epochs=20)
Epoch 1/20
126/126 [==============================] - 3s 21ms/step - loss: 0.5129 - accuracy: 0.7182
Epoch 2/20
126/126 [==============================] - 1s 6ms/step - loss: 0.4282 - accuracy: 0.8238
Epoch 3/20
126/126 [==============================] - 1s 6ms/step - loss: 0.4118 - accuracy: 0.8258
Epoch 4/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3990 - accuracy: 0.8332
Epoch 5/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3894 - accuracy: 0.8379
Epoch 6/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3813 - accuracy: 0.8393
Epoch 7/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3738 - accuracy: 0.8352
Epoch 8/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3678 - accuracy: 0.8364
Epoch 9/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3619 - accuracy: 0.8433
Epoch 10/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3573 - accuracy: 0.8422
Epoch 11/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3515 - accuracy: 0.8473
Epoch 12/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3467 - accuracy: 0.8499
Epoch 13/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3419 - accuracy: 0.8519
Epoch 14/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3377 - accuracy: 0.8489
Epoch 15/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3333 - accuracy: 0.8480
Epoch 16/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3293 - accuracy: 0.8536
Epoch 17/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3257 - accuracy: 0.8536
Epoch 18/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3222 - accuracy: 0.8535
Epoch 19/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3187 - accuracy: 0.8577
Epoch 20/20
126/126 [==============================] - 1s 6ms/step - loss: 0.3156 - accuracy: 0.8675

<tensorflow.python.keras.callbacks.History at 0x7fe3080a30f0>

Once the model is trained, you can check its accuracy on the test_data set.

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
     53/Unknown - 1s 12ms/step - loss: 0.4901 - accuracy: 0.8068

Test Loss 0.49009885305081896, Test Accuracy 0.8068181872367859

Use tf.keras.Model.predict to infer labels on a batch or a dataset of batches.

predictions = model.predict(test_data)

# Show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))

Predicted survival: 82.06%  | Actual outcome:  SURVIVED
Predicted survival: 15.87%  | Actual outcome:  DIED
Predicted survival: 91.77%  | Actual outcome:  DIED
Predicted survival: 16.42%  | Actual outcome:  DIED
Predicted survival: 5.01%  | Actual outcome:  DIED