Load pandas DataFrames using tf.data

This tutorial shows how to load pandas DataFrames into a tf.data.Dataset.

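This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The dataset is a CSV file with a few hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which is a binary classification task.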

Read data using pandas

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/heart.csv
13273/13273 [==============================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which has dtype object in the DataFrame, to discrete numeric values.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
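
To make the conversion above concrete, here is a minimal sketch of how pd.Categorical assigns integer codes. The values in the toy series are hypothetical stand-ins, not taken from heart.csv. It also shows that a missing value is encoded as -1, which is worth checking for before training.

```python
import pandas as pd

# Hypothetical toy values standing in for the `thal` column.
thal = pd.Series(['fixed', 'normal', 'reversible', 'normal', None])

categorical = pd.Categorical(thal)
codes = pd.Series(categorical).cat.codes

# Inferred categories are sorted, and each code is an index into them.
print(list(categorical.categories))  # ['fixed', 'normal', 'reversible']
print(codes.tolist())                # [0, 1, 2, 1, -1]

# Keep the mapping if you need to decode predictions or inputs later.
code_to_label = dict(enumerate(categorical.categories))
print(code_to_label[2])              # reversible
```

Because the codes depend on the sorted set of values seen, the same mapping should be reused for any new data rather than recomputed.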

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas DataFrame.

One advantage of using tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. See the loading data guide to learn more.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
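
The features print as floats (for example 63.) because df.values returns a single NumPy array, so mixed integer and float columns are upcast to float64. The sketch below reproduces this on a hypothetical two-column frame and then chains a few pipeline steps. The toy data, the cast to float32, and the batch size are illustrative assumptions, not part of the tutorial.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy frame with one integer and one float feature column.
toy_df = pd.DataFrame({
    'age': [63, 67, 41, 37],
    'oldpeak': [2.3, 1.5, 1.4, 3.5],
    'target': [0, 1, 0, 0],
})
toy_target = toy_df.pop('target')

# Mixed int/float columns are upcast to one float64 array.
features = toy_df.values
print(features.dtype)  # float64

toy_ds = tf.data.Dataset.from_tensor_slices((features, toy_target.values))

# A small pipeline: cast features to float32, then batch and prefetch.
toy_ds = (toy_ds
          .map(lambda feat, targ: (tf.cast(feat, tf.float32), targ))
          .batch(2)
          .prefetch(tf.data.AUTOTUNE))

for feat, targ in toy_ds:
    print(feat.shape, targ.numpy())
```

Each stage returns a new Dataset, so steps can be added or reordered without touching the source DataFrame.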

Because pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
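
As a rough illustration of what the __array__ protocol means, the sketch below defines a hypothetical Celsius class that NumPy can convert because it exposes __array__, and then converts a pd.Series the same way with both NumPy and TensorFlow. The class and its values are invented for the example.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

class Celsius:
    """Hypothetical container that exposes its data through __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this method when asked to convert the object.
        return np.array(self._values, dtype=dtype)

temps = Celsius([20.5, 21.0, 19.5])
print(np.asarray(temps))

# pd.Series supports the same protocol, so both conversions work,
# and the pandas dtype (int8 here) carries over to the tensor.
series = pd.Series([2, 3, 4], dtype='int8')
as_array = np.asarray(series)
as_tensor = tf.convert_to_tensor(series)
print(as_array.dtype, as_tensor.dtype)
```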

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
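
With batch(1), every step sees a single row, which is why the training log below reports 303 steps per epoch. The following sketch uses a hypothetical 10-element dataset and a batch size of 4 (both chosen only for illustration) to show how shuffle and batch interact.

```python
import tensorflow as tf

# Hypothetical stand-in dataset holding the integers 0..9.
toy_ds = tf.data.Dataset.range(10)

# A shuffle buffer as large as the dataset gives a full uniform shuffle.
batched = toy_ds.shuffle(buffer_size=10, seed=0).batch(4)

# The last batch is smaller because 10 is not a multiple of 4.
sizes = [int(batch.shape[0]) for batch in batched]
print(sizes)  # [4, 4, 2]

# Shuffling changes the order, not the contents.
seen = sorted(int(v) for batch in batched for v in batch.numpy())
print(seen == list(range(10)))  # True
```

A larger batch size usually means fewer, faster steps per epoch. Pass drop_remainder=True to batch if every batch must have the same size.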

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
303/303 [==============================] - 2s 2ms/step - loss: 5.3122 - accuracy: 0.5611
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 2.6524 - accuracy: 0.5776
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 2.0161 - accuracy: 0.5809
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 1.5642 - accuracy: 0.5908
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 1.0725 - accuracy: 0.6535
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.9007 - accuracy: 0.6733
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6849 - accuracy: 0.7096
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6602 - accuracy: 0.7525
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7093 - accuracy: 0.7195
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6546 - accuracy: 0.7426
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6147 - accuracy: 0.7327
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6334 - accuracy: 0.7492
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6366 - accuracy: 0.7558
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5893 - accuracy: 0.7294
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5902 - accuracy: 0.7525
<keras.callbacks.History at 0x7f8c08467cd0>

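Alternative to feature columns

Passing a dictionary as input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and combining them using the functional API. You can use this as an alternative to feature columns.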

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
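
To see the dictionary-input pattern in isolation, here is a small sketch that builds a two-feature model and calls it on a dict of arrays. It is a variant, not the tutorial's code. It assumes Input layers with shape=(1,) joined by a Concatenate layer instead of tf.stack, since some newer Keras versions may not accept raw TensorFlow ops on symbolic inputs. The feature subset, layer sizes, and values are hypothetical.

```python
import numpy as np
import tensorflow as tf

feature_names = ['age', 'oldpeak']  # hypothetical subset of columns

# One Input per feature; shape=(1,) lets them be concatenated per row.
inputs = {name: tf.keras.layers.Input(shape=(1,), name=name)
          for name in feature_names}
x = tf.keras.layers.Concatenate(axis=-1)(list(inputs.values()))
x = tf.keras.layers.Dense(4, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
toy_model = tf.keras.Model(inputs=inputs, outputs=output)

# The dict keys must match the names of the Input layers.
batch = {
    'age': np.array([[63.0], [41.0]], dtype='float32'),
    'oldpeak': np.array([[2.3], [1.4]], dtype='float32'),
}
probs = toy_model(batch)
print(probs.shape)
```

The model returns one sigmoid output per row, so two input rows yield an output of shape (2, 1).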

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
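
The int32 and float32 dtypes in the output above come from to_dict('list'), which produces plain Python lists, so TensorFlow falls back to its default dtypes. The sketch below contrasts this with dict(df), which keeps each column as a pd.Series. The toy frame is hypothetical, and using dict(df) is an alternative shown for comparison rather than the approach this tutorial takes.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical toy frame with explicit 64-bit column dtypes.
toy_df = pd.DataFrame({
    'age': np.array([63, 67, 41], dtype='int64'),
    'oldpeak': np.array([2.3, 1.5, 1.4], dtype='float64'),
})

# Plain Python lists: TensorFlow infers int32 / float32.
from_lists = tf.data.Dataset.from_tensor_slices(toy_df.to_dict('list'))
# Dict of pd.Series: the pandas dtypes are preserved.
from_series = tf.data.Dataset.from_tensor_slices(dict(toy_df))

for name in toy_df.columns:
    print(name,
          from_lists.element_spec[name].dtype.name,
          from_series.element_spec[name].dtype.name)
# age int32 int64
# oldpeak float32 float64
```

Either form keeps the column names as dictionary keys, which is what lets the dataset line up with the named Input layers.
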
model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 4ms/step - loss: 77.0238 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 4ms/step - loss: 56.7002 - accuracy: 0.2739
Epoch 3/15
19/19 [==============================] - 0s 4ms/step - loss: 37.6057 - accuracy: 0.2739
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 19.2780 - accuracy: 0.2739
Epoch 5/15
19/19 [==============================] - 0s 4ms/step - loss: 5.9422 - accuracy: 0.4224
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 3.6487 - accuracy: 0.6271
Epoch 7/15
19/19 [==============================] - 0s 4ms/step - loss: 3.4792 - accuracy: 0.6205
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 3.3901 - accuracy: 0.6007
Epoch 9/15
19/19 [==============================] - 0s 4ms/step - loss: 3.3224 - accuracy: 0.5908
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 3.2393 - accuracy: 0.5908
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 3.1605 - accuracy: 0.5941
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 3.0813 - accuracy: 0.5941
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 2.9993 - accuracy: 0.5941
Epoch 14/15
19/19 [==============================] - 0s 4ms/step - loss: 2.9160 - accuracy: 0.5974
Epoch 15/15
19/19 [==============================] - 0s 4ms/step - loss: 2.8319 - accuracy: 0.6007
<keras.callbacks.History at 0x7f8c080f5fd0>