This tutorial provides an example of how to load pandas DataFrames into a tf.data.Dataset.
This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. There are a few hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which is a binary classification problem.
Read data using pandas
import pandas as pd
import tensorflow as tf
Download the CSV file containing the heart dataset.
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/heart.csv
13273/13273 [==============================] - 0s 0us/step
Read the CSV file using pandas.
df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object
Convert the thal column, which is an object in the DataFrame, to a discrete numerical value.
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
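If you later need to map the integer codes back to the original strings, note that pd.Categorical retains that mapping. A minimal sketch (illustrative only; it would have to run before the conversion above overwrites the column):

# Illustrative only: inspect the code -> category mapping *before*
# replacing the column with integer codes, as done above.
thal = pd.Categorical(df['thal'])
for code, category in enumerate(thal.categories):
  print(code, '->', category)  # prints each code next to its original label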
Load data using tf.data.Dataset
Use tf.data.Dataset.from_tensor_slices to read the values from the pandas DataFrame.
One of the advantages of using a tf.data.Dataset is that it allows you to write simple, highly efficient data pipelines. Refer to the loading data guide to find out more.
target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63. 1. 1. 145. 233. 1. 2. 150. 0. 2.3 3. 0. 2. ], Target: 0
Features: [ 67. 1. 4. 160. 286. 0. 2. 108. 1. 1.5 2. 3. 3. ], Target: 1
Features: [ 67. 1. 4. 120. 229. 0. 2. 129. 1. 2.6 2. 2. 4. ], Target: 0
Features: [ 37. 1. 3. 130. 250. 0. 0. 187. 0. 3.5 3. 0. 3. ], Target: 0
Features: [ 41. 0. 2. 130. 204. 0. 2. 172. 0. 1.4 1. 0. 3. ], Target: 0
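You can verify how each element is structured by inspecting dataset.element_spec; a quick sketch (the commented specs are what this pipeline should report, shown for illustration):

# Each element is a (features, target) pair: a vector of 13 patient
# attributes and a scalar label.
print(dataset.element_spec)
# Expected (illustrative):
# (TensorSpec(shape=(13,), dtype=tf.float64, name=None),
#  TensorSpec(shape=(), dtype=tf.int64, name=None))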
Since pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use an np.array or a tf.Tensor.
tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy= array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4, 2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2, 4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
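The same interoperability holds for NumPy; a minimal check (np.asarray is standard NumPy, shown here purely for illustration):

import numpy as np

# The __array__ protocol lets NumPy consume the Series directly.
print(np.asarray(df['thal']).dtype)  # int8, matching the tensor above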
Shuffle and batch the dataset.
train_dataset = dataset.shuffle(len(df)).batch(1)
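batch(1) keeps this tutorial simple, but real pipelines usually use larger batches and prefetching so input preparation overlaps with training. A sketch of such a variant (the name train_dataset_fast and the batch size 32 are illustrative, not part of this tutorial):

# Illustrative variant, not used below: larger batches plus prefetch.
train_dataset_fast = (
    dataset.shuffle(len(df))
           .batch(32)                   # arbitrary batch size
           .prefetch(tf.data.AUTOTUNE)  # overlap input prep with training
)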
Create and train a model
def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
303/303 [==============================] - 2s 2ms/step - loss: 5.3122 - accuracy: 0.5611
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 2.6524 - accuracy: 0.5776
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 2.0161 - accuracy: 0.5809
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 1.5642 - accuracy: 0.5908
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 1.0725 - accuracy: 0.6535
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.9007 - accuracy: 0.6733
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6849 - accuracy: 0.7096
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6602 - accuracy: 0.7525
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7093 - accuracy: 0.7195
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6546 - accuracy: 0.7426
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6147 - accuracy: 0.7327
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6334 - accuracy: 0.7492
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6366 - accuracy: 0.7558
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5893 - accuracy: 0.7294
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5902 - accuracy: 0.7525
<keras.callbacks.History at 0x7f8c08467cd0>
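Once trained, the model can produce predictions; a minimal sketch (this cell is illustrative rather than part of the original tutorial, and the 0.5 threshold is the usual convention for a sigmoid output):

# Illustrative inference: predict on a batch of 5 patients and threshold
# the sigmoid output at 0.5 to obtain class labels.
for features, labels in dataset.batch(5).take(1):
  probs = model.predict(features)
  print('predicted:', (probs[:, 0] > 0.5).astype(int))
  print('actual:   ', labels.numpy())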
An alternative to feature columns
Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)
x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model_func = tf.keras.Model(inputs=inputs, outputs=output)
model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict and slice that dictionary.
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57], dtype=int32)>,
  'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>,
  'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>,
  'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130, 120, 172, 150], dtype=int32)>,
  'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256, 263, 199, 168], dtype=int32)>,
  'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>,
  'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>,
  'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142, 173, 162, 174], dtype=int32)>,
  'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>,
  'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6, 0. , 0.5, 1.6], dtype=float32)>,
  'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>,
  'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>,
  'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>},
 <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 4ms/step - loss: 77.0238 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 4ms/step - loss: 56.7002 - accuracy: 0.2739
Epoch 3/15
19/19 [==============================] - 0s 4ms/step - loss: 37.6057 - accuracy: 0.2739
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 19.2780 - accuracy: 0.2739
Epoch 5/15
19/19 [==============================] - 0s 4ms/step - loss: 5.9422 - accuracy: 0.4224
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 3.6487 - accuracy: 0.6271
Epoch 7/15
19/19 [==============================] - 0s 4ms/step - loss: 3.4792 - accuracy: 0.6205
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 3.3901 - accuracy: 0.6007
Epoch 9/15
19/19 [==============================] - 0s 4ms/step - loss: 3.3224 - accuracy: 0.5908
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 3.2393 - accuracy: 0.5908
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 3.1605 - accuracy: 0.5941
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 3.0813 - accuracy: 0.5941
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 2.9993 - accuracy: 0.5941
Epoch 14/15
19/19 [==============================] - 0s 4ms/step - loss: 2.9160 - accuracy: 0.5974
Epoch 15/15
19/19 [==============================] - 0s 4ms/step - loss: 2.8319 - accuracy: 0.6007
<keras.callbacks.History at 0x7f8c080f5fd0>
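The dict-input model is used for inference the same way, as long as you feed it a dictionary keyed by column name; a hedged sketch (building the feature dict from the first 5 rows this way is illustrative, not part of the original tutorial):

# Illustrative inference with the dict-input model: a dict of tensors,
# one entry per column, each holding the first 5 rows.
sample = {name: tf.convert_to_tensor(values[:5])
          for name, values in df.to_dict('list').items()}
print(model_func.predict(sample))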