
Build a linear model with Estimators


This tutorial uses the tf.estimator API in TensorFlow to solve a benchmark binary classification problem. Estimators are TensorFlow's most scalable and production-oriented model type. For more details see the Estimator guide.

Overview

Using census data containing a person's age, education, marital status, and occupation (the features), we will try to predict whether or not the person earns more than $50,000 a year (the target label). We will train a logistic regression model that, given an individual's information, outputs a number between 0 and 1, which can be interpreted as the probability that the individual has an annual income of over $50,000.
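
For intuition, logistic regression squashes a weighted sum of the features through a sigmoid to produce that probability. Here's a minimal numpy sketch, with made-up weights and feature values purely for illustration (none of these numbers come from the model trained below):

import numpy as np

def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights, bias, and one person's features
# (age, years of education) -- illustration only.
w = np.array([0.03, 0.2])
b = -4.0
x = np.array([40.0, 13.0])

p = sigmoid(np.dot(w, x) + b)
print(p)  # ~0.45: read as a 45% chance of income over $50,000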

Setup

Import TensorFlow, feature column support, and supporting modules:

import tensorflow as tf
import tensorflow.feature_column as fc

import os
import sys

import matplotlib.pyplot as plt
from IPython.display import clear_output

Then enable eager execution to inspect this program as we run it:

tf.enable_eager_execution()

Download the official implementation

We'll use the wide and deep model available in TensorFlow's model repository. Download the code, add the root directory to your Python path, and jump to the wide_deep directory:

! pip install -q requests
! git clone --depth 1 https://github.com/tensorflow/models
Cloning into 'models'...
remote: Enumerating objects: 2999, done.
remote: Counting objects: 100% (2999/2999), done.
remote: Compressing objects: 100% (2544/2544), done.
remote: Total 2999 (delta 509), reused 1899 (delta 378), pack-reused 0
Receiving objects: 100% (2999/2999), 376.95 MiB | 41.75 MiB/s, done.
Resolving deltas: 100% (509/509), done.
Checking connectivity... done.

Add the root directory of the repository to your Python path:

models_path = os.path.join(os.getcwd(), 'models')

sys.path.append(models_path)

Download the dataset:

from official.wide_deep import census_dataset
from official.wide_deep import census_main

census_dataset.download("/tmp/census_data/")

Command line usage

The repository includes a complete program for experimenting with this type of model.

To execute the tutorial code from the command line, first add the tensorflow/models path to your PYTHONPATH:

# From a shell: export PYTHONPATH=${PYTHONPATH}:"$(pwd)/models"
# When running from Python, set `os.environ` as below, or the subprocess will not see the directory.

if "PYTHONPATH" in os.environ:
  os.environ['PYTHONPATH'] += os.pathsep +  models_path
else:
  os.environ['PYTHONPATH'] = models_path

Use --help to see what command line options are available:

!python -m official.wide_deep.census_main --help
2018-10-23 20:35:48.219648: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Train DNN on census income dataset.
flags:

/docker/output/models/official/wide_deep/census_main.py:
  -bs,--batch_size:
    Batch size for training and evaluation. When using multiple gpus, this is
    the
    global batch size for all devices. For example, if the batch size is 32 and
    there are 4 GPUs, each GPU will get 8 examples on each step.
    (default: '40')
    (an integer)
  --[no]clean:
    If set, model_dir will be removed if it exists.
    (default: 'false')
  -dd,--data_dir:
    The location of the input data.
    (default: '/tmp/census_data')
  --[no]download_if_missing:
    Download data to data_dir if it is not already present.
    (default: 'true')
  -ebe,--epochs_between_evals:
    The number of training epochs to run between evaluations.
    (default: '2')
    (an integer)
  -ed,--export_dir:
    If set, a SavedModel serialization of the model will be exported to this
    directory at the end of training. See the README for more details and
    relevant
    links.
  -hk,--hooks:
    A list of (case insensitive) strings to specify the names of training hooks.
      Hook:
        loggingmetrichook
        loggingtensorhook
        profilerhook
        examplespersecondhook
      Example: `--hooks ProfilerHook,ExamplesPerSecondHook`
    See official.utils.logs.hooks_helper for details.
    (default: 'LoggingTensorHook')
    (a comma separated list)
  -md,--model_dir:
    The location of the model checkpoint files.
    (default: '/tmp/census_model')
  -mt,--model_type: <wide|deep|wide_deep>: Select model topology.
    (default: 'wide_deep')
  -te,--train_epochs:
    The number of epochs used to train.
    (default: '40')
    (an integer)

Try --helpfull to get a list of all flags.

Now run the model:

!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2
2018-10-23 20:35:50.514356: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I1023 20:35:50.517620 140481409083136 tf_logging.py:115] Using config: {'_tf_random_seed': None, '_protocol': None, '_model_dir': '/tmp/census_model', '_service': None, '_evaluation_master': '', '_num_ps_replicas': 0, '_train_distribute': None, '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_device_fn': None, '_master': '', '_task_type': 'worker', '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc40b258780>, '_log_step_count_steps': 100, '_eval_distribute': None, '_save_checkpoints_secs': 600, '_keep_checkpoint_every_n_hours': 10000, '_session_config': device_count {
  key: "GPU"
}
, '_experimental_distribute': None, '_global_id_in_cluster': 0, '_is_chief': True, '_save_checkpoints_steps': None}
W1023 20:35:50.518451 140481409083136 tf_logging.py:120] 'cpuinfo' not imported. CPU info will not be logged.
W1023 20:35:50.518746 140481409083136 tf_logging.py:120] 'psutil' not imported. Memory info will not be logged.
I1023 20:35:50.690451 140481409083136 tf_logging.py:115] Benchmark run: {'model_name': 'wide_deep', 'run_date': '2018-10-23T20:35:50.518131Z', 'test_id': None, 'dataset': {'name': 'Census Income'}, 'tensorflow_environment_variables': [], 'machine_config': {'gpu_info': {'count': 0}}, 'tensorflow_version': {'git_hash': "b'unknown'", 'version': '1.12.0-rc1'}, 'run_parameters': [{'long_value': 40, 'name': 'batch_size'}, {'name': 'model_type', 'string_value': 'wide'}, {'long_value': 2, 'name': 'train_epochs'}]}
I1023 20:35:50.731611 140481409083136 tf_logging.py:115] Parsing /tmp/census_data/adult.data
I1023 20:35:50.768753 140481409083136 tf_logging.py:115] Calling model_fn.
I1023 20:35:51.686829 140481409083136 tf_logging.py:115] Done calling model_fn.
I1023 20:35:51.687152 140481409083136 tf_logging.py:115] Create CheckpointSaverHook.
I1023 20:35:52.062342 140481409083136 tf_logging.py:115] Graph was finalized.
I1023 20:35:52.136169 140481409083136 tf_logging.py:115] Running local_init_op.
I1023 20:35:52.152031 140481409083136 tf_logging.py:115] Done running local_init_op.
I1023 20:35:52.853039 140481409083136 tf_logging.py:115] Saving checkpoints for 0 into /tmp/census_model/model.ckpt.
I1023 20:35:53.326310 140481409083136 tf_logging.py:115] average_loss = 0.6931472, loss = 27.725887
I1023 20:35:53.326710 140481409083136 tf_logging.py:115] loss = 27.725887, step = 1
I1023 20:35:53.849500 140481409083136 tf_logging.py:115] global_step/sec: 190.937
I1023 20:35:53.850343 140481409083136 tf_logging.py:115] average_loss = 0.37621957, loss = 15.048783 (0.524 sec)
I1023 20:35:53.850646 140481409083136 tf_logging.py:115] loss = 15.048783, step = 101 (0.524 sec)
I1023 20:35:54.155820 140481409083136 tf_logging.py:115] global_step/sec: 326.465
I1023 20:35:54.156685 140481409083136 tf_logging.py:115] average_loss = 0.58093274, loss = 23.23731 (0.306 sec)
I1023 20:35:54.156997 140481409083136 tf_logging.py:115] loss = 23.23731, step = 201 (0.306 sec)
I1023 20:35:54.465681 140481409083136 tf_logging.py:115] global_step/sec: 322.717
I1023 20:35:54.466574 140481409083136 tf_logging.py:115] average_loss = 0.2793787, loss = 11.175148 (0.310 sec)
I1023 20:35:54.466908 140481409083136 tf_logging.py:115] loss = 11.175148, step = 301 (0.310 sec)
I1023 20:35:54.773294 140481409083136 tf_logging.py:115] global_step/sec: 325.113
I1023 20:35:54.774204 140481409083136 tf_logging.py:115] average_loss = 0.2649246, loss = 10.596983 (0.308 sec)
I1023 20:35:54.774599 140481409083136 tf_logging.py:115] loss = 10.596983, step = 401 (0.308 sec)
I1023 20:35:55.068387 140481409083136 tf_logging.py:115] global_step/sec: 338.842
I1023 20:35:55.069169 140481409083136 tf_logging.py:115] average_loss = 0.4484293, loss = 17.937172 (0.295 sec)
I1023 20:35:55.069455 140481409083136 tf_logging.py:115] loss = 17.937172, step = 501 (0.295 sec)
I1023 20:35:55.381950 140481409083136 tf_logging.py:115] global_step/sec: 318.911
I1023 20:35:55.382749 140481409083136 tf_logging.py:115] average_loss = 0.2969147, loss = 11.876588 (0.314 sec)
I1023 20:35:55.383039 140481409083136 tf_logging.py:115] loss = 11.876588, step = 601 (0.314 sec)
I1023 20:35:55.689289 140481409083136 tf_logging.py:115] global_step/sec: 325.381
I1023 20:35:55.690118 140481409083136 tf_logging.py:115] average_loss = 0.36402485, loss = 14.560994 (0.307 sec)
I1023 20:35:55.690504 140481409083136 tf_logging.py:115] loss = 14.560994, step = 701 (0.307 sec)
I1023 20:35:56.000257 140481409083136 tf_logging.py:115] global_step/sec: 321.56
I1023 20:35:56.001024 140481409083136 tf_logging.py:115] average_loss = 0.37127346, loss = 14.850939 (0.311 sec)
I1023 20:35:56.001280 140481409083136 tf_logging.py:115] loss = 14.850939, step = 801 (0.311 sec)
I1023 20:35:56.349296 140481409083136 tf_logging.py:115] global_step/sec: 286.496
I1023 20:35:56.350041 140481409083136 tf_logging.py:115] average_loss = 0.29277754, loss = 11.711102 (0.349 sec)
I1023 20:35:56.350322 140481409083136 tf_logging.py:115] loss = 11.711102, step = 901 (0.349 sec)
I1023 20:35:56.644212 140481409083136 tf_logging.py:115] global_step/sec: 339.082
I1023 20:35:56.644878 140481409083136 tf_logging.py:115] average_loss = 0.29483682, loss = 11.793472 (0.295 sec)
I1023 20:35:56.645112 140481409083136 tf_logging.py:115] loss = 11.793472, step = 1001 (0.295 sec)
I1023 20:35:56.936556 140481409083136 tf_logging.py:115] global_step/sec: 342.074
I1023 20:35:56.937312 140481409083136 tf_logging.py:115] average_loss = 0.3045118, loss = 12.180471 (0.292 sec)
I1023 20:35:56.937615 140481409083136 tf_logging.py:115] loss = 12.180471, step = 1101 (0.293 sec)
I1023 20:35:57.226970 140481409083136 tf_logging.py:115] global_step/sec: 344.336
I1023 20:35:57.227720 140481409083136 tf_logging.py:115] average_loss = 0.45661148, loss = 18.26446 (0.290 sec)
I1023 20:35:57.228007 140481409083136 tf_logging.py:115] loss = 18.26446, step = 1201 (0.290 sec)
I1023 20:35:57.543720 140481409083136 tf_logging.py:115] global_step/sec: 315.785
I1023 20:35:57.544777 140481409083136 tf_logging.py:115] average_loss = 0.44982958, loss = 17.993183 (0.317 sec)
I1023 20:35:57.545178 140481409083136 tf_logging.py:115] loss = 17.993183, step = 1301 (0.317 sec)
I1023 20:35:57.845958 140481409083136 tf_logging.py:115] global_step/sec: 330.78
I1023 20:35:57.846761 140481409083136 tf_logging.py:115] average_loss = 0.42331782, loss = 16.932713 (0.302 sec)
I1023 20:35:57.847053 140481409083136 tf_logging.py:115] loss = 16.932713, step = 1401 (0.302 sec)
I1023 20:35:58.156264 140481409083136 tf_logging.py:115] global_step/sec: 322.325
I1023 20:35:58.157321 140481409083136 tf_logging.py:115] average_loss = 0.44173947, loss = 17.669579 (0.311 sec)
I1023 20:35:58.157631 140481409083136 tf_logging.py:115] loss = 17.669579, step = 1501 (0.311 sec)
I1023 20:35:58.440013 140481409083136 tf_logging.py:115] global_step/sec: 352.351
I1023 20:35:58.440789 140481409083136 tf_logging.py:115] average_loss = 0.4071808, loss = 16.287231 (0.283 sec)
I1023 20:35:58.441095 140481409083136 tf_logging.py:115] loss = 16.287231, step = 1601 (0.283 sec)
I1023 20:35:58.530341 140481409083136 tf_logging.py:115] Saving checkpoints for 1629 into /tmp/census_model/model.ckpt.
I1023 20:35:58.664809 140481409083136 tf_logging.py:115] Loss for final step: 0.32237834.
I1023 20:35:58.681264 140481409083136 tf_logging.py:115] Parsing /tmp/census_data/adult.test
I1023 20:35:58.714178 140481409083136 tf_logging.py:115] Calling model_fn.
W1023 20:35:59.812553 140481409083136 tf_logging.py:125] Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
W1023 20:35:59.830484 140481409083136 tf_logging.py:125] Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.
I1023 20:35:59.848089 140481409083136 tf_logging.py:115] Done calling model_fn.
I1023 20:35:59.866235 140481409083136 tf_logging.py:115] Starting evaluation at 2018-10-23-20:35:59
I1023 20:35:59.986163 140481409083136 tf_logging.py:115] Graph was finalized.
I1023 20:35:59.987520 140481409083136 tf_logging.py:115] Restoring parameters from /tmp/census_model/model.ckpt-1629
I1023 20:36:00.079737 140481409083136 tf_logging.py:115] Running local_init_op.
I1023 20:36:00.108657 140481409083136 tf_logging.py:115] Done running local_init_op.
I1023 20:36:01.741130 140481409083136 tf_logging.py:115] Finished evaluation at 2018-10-23-20:36:01
I1023 20:36:01.741404 140481409083136 tf_logging.py:115] Saving dict for global step 1629: accuracy = 0.8356366, accuracy_baseline = 0.76377374, auc = 0.88405776, auc_precision_recall = 0.6956143, average_loss = 0.35087836, global_step = 1629, label/mean = 0.23622628, loss = 14.001595, precision = 0.6881029, prediction/mean = 0.23948322, recall = 0.55642223
I1023 20:36:02.061108 140481409083136 tf_logging.py:115] Saving 'checkpoint_path' summary for global step 1629: /tmp/census_model/model.ckpt-1629
I1023 20:36:02.061893 140481409083136 tf_logging.py:115] Results at epoch 2 / 2
I1023 20:36:02.062013 140481409083136 tf_logging.py:115] ------------------------------------------------------------
I1023 20:36:02.062126 140481409083136 tf_logging.py:115] accuracy: 0.8356366
I1023 20:36:02.062221 140481409083136 tf_logging.py:115] accuracy_baseline: 0.76377374
I1023 20:36:02.062330 140481409083136 tf_logging.py:115] auc: 0.88405776
I1023 20:36:02.062401 140481409083136 tf_logging.py:115] auc_precision_recall: 0.6956143
I1023 20:36:02.062466 140481409083136 tf_logging.py:115] average_loss: 0.35087836
I1023 20:36:02.062540 140481409083136 tf_logging.py:115] global_step: 1629
I1023 20:36:02.062605 140481409083136 tf_logging.py:115] label/mean: 0.23622628
I1023 20:36:02.062668 140481409083136 tf_logging.py:115] loss: 14.001595
I1023 20:36:02.062729 140481409083136 tf_logging.py:115] precision: 0.6881029
I1023 20:36:02.062791 140481409083136 tf_logging.py:115] prediction/mean: 0.23948322
I1023 20:36:02.062853 140481409083136 tf_logging.py:115] recall: 0.55642223
I1023 20:36:02.062995 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'accuracy', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.062954Z', 'global_step': 1629, 'value': 0.8356366157531738}
I1023 20:36:02.063118 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'accuracy_baseline', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063097Z', 'global_step': 1629, 'value': 0.7637737393379211}
I1023 20:36:02.063222 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'auc', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063202Z', 'global_step': 1629, 'value': 0.8840577602386475}
I1023 20:36:02.063322 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'auc_precision_recall', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063303Z', 'global_step': 1629, 'value': 0.6956142783164978}
I1023 20:36:02.063421 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'average_loss', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063402Z', 'global_step': 1629, 'value': 0.35087835788726807}
I1023 20:36:02.063518 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'label/mean', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063500Z', 'global_step': 1629, 'value': 0.23622627556324005}
I1023 20:36:02.063615 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'loss', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063596Z', 'global_step': 1629, 'value': 14.001594543457031}
I1023 20:36:02.063710 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'precision', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063692Z', 'global_step': 1629, 'value': 0.6881029009819031}
I1023 20:36:02.063807 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'prediction/mean', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063788Z', 'global_step': 1629, 'value': 0.23948322236537933}
I1023 20:36:02.063902 140481409083136 tf_logging.py:115] Benchmark metric: {'name': 'recall', 'extras': [], 'unit': None, 'timestamp': '2018-10-23T20:36:02.063884Z', 'global_step': 1629, 'value': 0.556422233581543}

Read the U.S. Census data

This example uses the U.S. Census Income Dataset from 1994 and 1995. We have provided the census_dataset.py script to download the data and perform a little cleanup.

Since the task is a binary classification problem, we'll construct a label column named "label" whose value is 1 if the income is over $50,000, and 0 otherwise. For reference, see the input_fn in census_main.py.
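
As a rough sketch of what that labeling amounts to (the real logic lives in the scripts just mentioned; the column name and threshold here simply follow the description above):

import pandas as pd

# Minimal sketch of the labeling rule: 1 if income is over $50,000, else 0.
df = pd.DataFrame({'income_bracket': ['<=50K', '>50K', '<=50K']})
df['label'] = (df['income_bracket'] == '>50K').astype(int)
print(df)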

Let's look at the data to see which columns we can use to predict the target label:

!ls  /tmp/census_data/
adult.data  adult.test
train_file = "/tmp/census_data/adult.data"
test_file = "/tmp/census_data/adult.test"

pandas provides some convenient utilities for data analysis. Here's a list of columns available in the Census Income dataset:

import pandas

train_df = pandas.read_csv(train_file, header = None, names = census_dataset._CSV_COLUMNS)
test_df = pandas.read_csv(test_file, header = None, names = census_dataset._CSV_COLUMNS)

train_df.head()
age workclass fnlwgt education education_num marital_status occupation relationship race gender capital_gain capital_loss hours_per_week native_country income_bracket
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

The columns can be grouped into two types, categorical and continuous (a quick way to tell them apart is sketched right after this list):

  • A column is called categorical if its value can only be one of the categories in a finite set. For example, the relationship status of a person (wife, husband, unmarried, etc.) or the education level (high school, college, etc.) are categorical columns.
  • A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
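
Since the data is already loaded above, the DataFrame dtypes give a quick, if imperfect, signal: columns read as dtype object hold categorical strings, while the integer columns hold continuous values:

# object-dtype columns (strings) are candidates for categorical columns;
# integer columns are candidates for continuous columns.
print(train_df.dtypes)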

Converting Data into Tensors

When building a tf.estimator model, the input data is specified by an input function (input_fn). This builder function returns a tf.data.Dataset of batches of (features-dict, label) pairs. It is not called until it is passed to tf.estimator.Estimator methods such as train and evaluate.

The input builder function returns the following pair:

  1. features: a dict mapping feature names to Tensors or SparseTensors containing batches of features.
  2. labels: a Tensor containing batches of labels.

The keys of the features are used to configure the model's input layer.

For small problems like this, it's easy to make a tf.data.Dataset by slicing the pandas.DataFrame:

def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):
  label = df[label_key]
  ds = tf.data.Dataset.from_tensor_slices((dict(df),label))

  if shuffle:
    ds = ds.shuffle(10000)

  ds = ds.batch(batch_size).repeat(num_epochs)

  return ds

Since we have eager execution enabled, it's easy to inspect the resulting dataset:

ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Some feature keys:', list(feature_batch.keys())[:5])
  print()
  print('A batch of Ages  :', feature_batch['age'])
  print()
  print('A batch of Labels:', label_batch )
Some feature keys: ['marital_status', 'workclass', 'native_country', 'relationship', 'gender']

A batch of Ages  : tf.Tensor([29 30 31 46 49 30 20 47 19 52], shape=(10,), dtype=int32)

A batch of Labels: tf.Tensor(
[b'>50K' b'<=50K' b'>50K' b'>50K' b'<=50K' b'<=50K' b'<=50K' b'<=50K'
 b'<=50K' b'<=50K'], shape=(10,), dtype=string)

But this approach has severely limited scalability. Larger datasets should be streamed from disk. The census_dataset.input_fn provides an example of how to do this using tf.decode_csv and tf.data.TextLineDataset:

import inspect
print(inspect.getsource(census_dataset.input_fn))
def input_fn(data_file, num_epochs, shuffle, batch_size):
  """Generate an input function for the Estimator."""
  assert tf.gfile.Exists(data_file), (
      '%s not found. Please make sure you have run census_dataset.py and '
      'set the --data_dir argument to the correct path.' % data_file)

  def parse_csv(value):
    tf.logging.info('Parsing {}'.format(data_file))
    columns = tf.decode_csv(value, record_defaults=_CSV_COLUMN_DEFAULTS)
    features = dict(zip(_CSV_COLUMNS, columns))
    labels = features.pop('income_bracket')
    classes = tf.equal(labels, '>50K')  # binary classification
    return features, classes

  # Extract lines from input files using the Dataset API.
  dataset = tf.data.TextLineDataset(data_file)

  if shuffle:
    dataset = dataset.shuffle(buffer_size=_NUM_EXAMPLES['train'])

  dataset = dataset.map(parse_csv, num_parallel_calls=5)

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat(num_epochs)
  dataset = dataset.batch(batch_size)
  return dataset

This input_fn returns equivalent output:

ds = census_dataset.input_fn(train_file, num_epochs=5, shuffle=True, batch_size=10)

for feature_batch, label_batch in ds.take(1):
  print('Feature keys:', list(feature_batch.keys())[:5])
  print()
  print('Age batch   :', feature_batch['age'])
  print()
  print('Label batch :', label_batch )
INFO:tensorflow:Parsing /tmp/census_data/adult.data
WARNING: Logging before flag parsing goes to stderr.
I1023 20:36:03.893301 140603783849728 tf_logging.py:115] Parsing /tmp/census_data/adult.data
Feature keys: ['marital_status', 'workclass', 'native_country', 'relationship', 'gender']

Age batch   : tf.Tensor([33 33 59 21 19 66 55 41 35 44], shape=(10,), dtype=int32)

Label batch : tf.Tensor([False False  True False False False  True  True False False], shape=(10,), dtype=bool)

Because Estimators expect an input_fn that takes no arguments, we typically wrap the configurable input function into an object with the expected signature. For this notebook, configure train_inpf to iterate over the data twice:

import functools

train_inpf = functools.partial(census_dataset.input_fn, train_file, num_epochs=2, shuffle=True, batch_size=64)
test_inpf = functools.partial(census_dataset.input_fn, test_file, num_epochs=1, shuffle=False, batch_size=64)

Selecting and Engineering Features for the Model

Estimators use a system called feature columns to describe how the model should interpret each of the raw input features. An Estimator expects a vector of numeric inputs, and feature columns describe how the model should convert each feature.

Selecting and crafting the right set of feature columns is key to learning an effective model. A feature column can be either one of the raw inputs in the original features dict (a base feature column), or any new column created using transformations defined over one or multiple base columns (a derived feature column).

A feature column is an abstract concept of any raw or derived variable that can be used to predict the target label.

Base Feature Columns

Numeric columns

The simplest feature_column is numeric_column. It indicates that a feature is a numeric value and should be input to the model directly. For example:

age = fc.numeric_column('age')

The model will use the feature_column definitions to build the model input. You can inspect the resulting output using the input_layer function:

fc.input_layer(feature_batch, [age]).numpy()
array([[33.],
       [33.],
       [59.],
       [21.],
       [19.],
       [66.],
       [55.],
       [41.],
       [35.],
       [44.]], dtype=float32)

The following will train and evaluate a model using only the age feature:

classifier = tf.estimator.LinearClassifier(feature_columns=[age])
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()  # used for display in notebook
print(result)
{'recall': 0.13858554, 'accuracy': 0.7088017, 'loss': 34.97567, 'precision': 0.2718001, 'average_loss': 0.54780394, 'global_step': 1018, 'auc': 0.67835975, 'accuracy_baseline': 0.76377374, 'label/mean': 0.23622628, 'prediction/mean': 0.33516836, 'auc_precision_recall': 0.31139234}

Similarly, we can define a NumericColumn for each continuous feature column that we want to use in the model:

education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

my_numeric_columns = [age,education_num, capital_gain, capital_loss, hours_per_week]

fc.input_layer(feature_batch, my_numeric_columns).numpy()
array([[33.,  0.,  0., 13., 45.],
       [33.,  0.,  0.,  9., 40.],
       [59.,  0.,  0., 14., 50.],
       [21.,  0.,  0., 10., 10.],
       [19.,  0.,  0., 10., 40.],
       [66.,  0.,  0.,  5., 30.],
       [55.,  0.,  0., 13., 40.],
       [41.,  0.,  0., 11., 43.],
       [35.,  0.,  0.,  9., 40.],
       [44.,  0.,  0.,  8., 40.]], dtype=float32)

You could retrain a model on these features by changing the feature_columns argument to the constructor:

classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns)
classifier.train(train_inpf)

result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))
accuracy: 0.7828143
accuracy_baseline: 0.76377374
auc: 0.80679595
auc_precision_recall: 0.5932311
average_loss: 0.63657933
global_step: 1018
label/mean: 0.23622628
loss: 40.64372
precision: 0.5752427
prediction/mean: 0.35611457
recall: 0.30811232

Categorical columns

To define a feature column for a categorical feature, create a CategoricalColumn using one of the tf.feature_column.categorical_column* functions.

If you know the set of all possible feature values of a column, and there are only a few of them, use categorical_column_with_vocabulary_list. Each key in the list is assigned an auto-incremented ID starting from 0. For example, for the relationship column we can assign the feature string Husband an integer ID of 0, "Not-in-family" an ID of 1, and so on.

relationship = fc.categorical_column_with_vocabulary_list(
    'relationship',
    ['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried', 'Other-relative'])

This creates a sparse one-hot vector from the raw input feature.

The input_layer function we're using is designed for DNN models and expects dense inputs. To demonstrate the categorical column we must wrap it in a tf.feature_column.indicator_column to create the dense one-hot output (Linear Estimators can often skip this dense step).

Run the input layer, configured with both the age and relationship columns:

fc.input_layer(feature_batch, [age, fc.indicator_column(relationship)])
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W1023 20:36:17.327173 140603783849728 tf_logging.py:125] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/sparse_ops.py:1165: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.

If we don't know the set of possible values in advance, use categorical_column_with_hash_bucket instead:

occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

Here, each possible value in the occupation feature column is hashed to an integer ID as we encounter it during training. The example batch has a few different occupations:

for item in feature_batch['occupation'].numpy():
    print(item.decode())
Prof-specialty
Other-service
Prof-specialty
Adm-clerical
Other-service
Other-service
Exec-managerial
Craft-repair
Craft-repair
Transport-moving

If we run input_layer with the hashed column, we see that the output shape is (batch_size, hash_bucket_size):

occupation_result = fc.input_layer(feature_batch, [fc.indicator_column(occupation)])

occupation_result.numpy().shape
(10, 1000)

It's easier to see the actual results if we take the tf.argmax over the hash_bucket_size dimension. Notice how any duplicate occupations are mapped to the same pseudo-random index:

tf.argmax(occupation_result, axis=1).numpy()
array([979, 527, 979,  96, 527, 527, 800, 466, 466, 420])

No matter how we choose to define a SparseColumn, each feature string is mapped into an integer ID by looking up a fixed mapping or by hashing. Under the hood, the LinearModel class is responsible for managing the mapping and creating a tf.Variable to store the model parameters (model weights) for each feature ID. These model parameters are learned through the model training process introduced later (we take a peek at the resulting weight variables right after training the combined model below).

We can use a similar trick to define the other categorical features:

education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

my_categorical_columns = [relationship, occupation, education, marital_status, workclass]

It's easy to use both sets of columns to configure a model that uses all these features:

classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns)
classifier.train(train_inpf)
result = classifier.evaluate(test_inpf)

clear_output()

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))
accuracy: 0.8248265
accuracy_baseline: 0.76377374
auc: 0.8244441
auc_precision_recall: 0.64094245
average_loss: 0.91969806
global_step: 1018
label/mean: 0.23622628
loss: 58.720016
precision: 0.6778812
prediction/mean: 0.20155187
recall: 0.49245968
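
As a concrete look at the mapping described earlier, you can list the variables the Estimator just created: each feature column owns a weights variable, and for categorical columns it holds one weight per feature ID. The name filter below is a loose assumption (exact variable names vary across TensorFlow versions), so trust the printed list over any specific name:

# Each feature column owns a weights variable; categorical columns hold
# one weight per feature ID (exact names vary by TF version).
for name in classifier.get_variable_names():
  if name.endswith('weights'):
    print(name, classifier.get_variable_value(name).shape)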

Derived feature columns

Make continuous features categorical through bucketization

Sometimes the relationship between a continuous feature and the label is not linear. For example, consider age and income: a person's income may grow in the early stage of their career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the raw age as a real-valued feature column might not be a good choice because the model can only learn one of three cases:

  1. Income always increases at some rate as age grows (positive correlation),
  2. Income always decreases at some rate as age grows (negative correlation), or
  3. Income stays the same no matter what the age is (no correlation).

If we want to learn the fine-grained correlation between income and each age group separately, we can leverage bucketization. Bucketization is the process of dividing the entire range of a continuous feature into a set of consecutive buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket the value falls into. So we can define a bucketized_column over age as:

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

boundaries is a list of bucket boundaries. Here there are 10 boundaries, resulting in 11 age group buckets (age 17 and below, 18-24, 25-29, ..., and 65 and over).
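
For intuition about which bucket a given age lands in, np.digitize applies the same left-inclusive boundary rule; this is just a sketch for illustration (bucketized_column does the equivalent inside the graph):

import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ages = [17, 21, 33, 59, 66]
# Bucket 0 is "17 and below", bucket 10 is "65 and over".
print(np.digitize(ages, boundaries))  # => [ 0  1  3  8 10]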

With bucketing, the model sees each bucket as a one-hot feature:

fc.input_layer(feature_batch, [age, age_buckets]).numpy()
array([[33.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [33.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [59.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [21.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [19.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [66.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [55.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [41.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [35.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [44.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.]],
      dtype=float32)

Learn complex relationships with crossed columns

Using each base feature column separately may not be enough to explain the data. For example, the correlation between education and the label (earning more than $50,000) may be different for different occupations. Therefore, if we only learn a single model weight for education="Bachelors" and education="Masters", we won't be able to capture every education-occupation combination (e.g. distinguishing between education="Bachelors" AND occupation="Exec-managerial" and education="Bachelors" AND occupation="Craft-repair").

To learn the differences between different feature combinations, we can add crossed feature columns to the model:

education_x_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

We can also create a crossed_column over more than two columns. Each constituent column can be a categorical base feature column (SparseColumn), a bucketized real-valued feature column, or even another CrossedColumn. For example:

age_buckets_x_education_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

These crossed columns always use hash buckets to avoid an exponential explosion in the number of categories, and they put the control over the number of model weights in the hands of the user.
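
Conceptually, a cross joins the constituent string values and hashes the result into hash_bucket_size bins. Here's a toy Python sketch of the idea; Python's built-in hash is not the hash TensorFlow uses (and is salted per process), so these indices won't match TF's:

hash_bucket_size = 1000

def cross_bucket(*values):
  # Join the constituent values, then map the combination into a fixed
  # number of buckets; distinct combinations may collide.
  return hash('_X_'.join(values)) % hash_bucket_size

print(cross_bucket('Bachelors', 'Exec-managerial'))
print(cross_bucket('Bachelors', 'Craft-repair'))  # usually a different bucket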

For a visual example of the effect of hash buckets with crossed columns, see this notebook.

Define the logistic regression model

After processing the input data and defining all the feature columns, we can put them together and build a logistic regression model. The previous section showed several types of base and derived feature columns, including:

  • CategoricalColumn
  • NumericColumn
  • BucketizedColumn
  • CrossedColumn

All of these are subclasses of the abstract FeatureColumn class and can be added to the feature_columns field of a model:

import tempfile

base_columns = [
    education, marital_status, relationship, workclass, occupation,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ['education', 'occupation'], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, 'education', 'occupation'], hash_bucket_size=1000),
]

model = tf.estimator.LinearClassifier(
    model_dir=tempfile.mkdtemp(),
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(learning_rate=0.1))
INFO:tensorflow:Using default config.
I1023 20:36:26.724623 140603783849728 tf_logging.py:115] Using default config.
INFO:tensorflow:Using config: {'_experimental_distribute': None, '_num_worker_replicas': 1, '_task_type': 'worker', '_save_checkpoints_steps': None, '_service': None, '_num_ps_replicas': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe07c8f77b8>, '_keep_checkpoint_max': 5, '_evaluation_master': '', '_protocol': None, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_eval_distribute': None, '_train_distribute': None, '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_model_dir': '/tmp/tmpcu3z47jc', '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None}
I1023 20:36:26.727277 140603783849728 tf_logging.py:115] Using config: {'_experimental_distribute': None, '_num_worker_replicas': 1, '_task_type': 'worker', '_save_checkpoints_steps': None, '_service': None, '_num_ps_replicas': 0, '_is_chief': True, '_keep_checkpoint_every_n_hours': 10000, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe07c8f77b8>, '_keep_checkpoint_max': 5, '_evaluation_master': '', '_protocol': None, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_eval_distribute': None, '_train_distribute': None, '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_model_dir': '/tmp/tmpcu3z47jc', '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None}

The model automatically learns a bias term, which controls the prediction made without observing any features. The learned model files are stored in model_dir.
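
Once the model has been trained (next section), you can read that bias back. Here's a small helper that assumes only that the variable's name contains 'bias' (which holds for LinearClassifier in the TF 1.x versions this notebook targets; verify against get_variable_names() if in doubt):

def show_bias(estimator):
  # Print any learned bias variables; for this LinearClassifier that is
  # the single bias term described above.
  for name in estimator.get_variable_names():
    if 'bias' in name:
      print(name, estimator.get_variable_value(name))

Calling show_bias(model) after the train step below prints the learned bias.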

Train and evaluate the model

After adding all the features to the model, let's train it. Training a model is just a single command using the tf.estimator API:

train_inpf = functools.partial(census_dataset.input_fn, train_file,
                               num_epochs=40, shuffle=True, batch_size=64)

model.train(train_inpf)

clear_output()  # used for notebook display

After the model is trained, evaluate its accuracy by predicting the labels of the holdout data:

results = model.evaluate(test_inpf)

clear_output()

for key, value in sorted(results.items()):
  print('%s: %0.2f' % (key, value))
accuracy: 0.82
accuracy_baseline: 0.76
auc: 0.82
auc_precision_recall: 0.64
average_loss: 0.92
global_step: 1018.00
label/mean: 0.24
loss: 58.72
precision: 0.68
prediction/mean: 0.20
recall: 0.49

The first line of the output should display something like accuracy: 0.83, which means the accuracy is 83%. Try using more features and transformations to see if you can do better!

After the model is evaluated, we can use it to predict whether an individual has an annual income of over $50,000 given an individual's information.

Let's look in more detail at how the model performed:

import numpy as np

predict_df = test_df[:20].copy()

pred_iter = model.predict(
    lambda:easy_input_function(predict_df, label_key='income_bracket',
                               num_epochs=1, shuffle=False, batch_size=10))

classes = np.array(['<=50K', '>50K'])
pred_class_id = []

for pred_dict in pred_iter:
  # class_ids is a length-1 array; take the scalar so the column is 1-D.
  pred_class_id.append(pred_dict['class_ids'][0])

predict_df['predicted_class'] = classes[np.array(pred_class_id)]
predict_df['correct'] = predict_df['predicted_class'] == predict_df['income_bracket']

clear_output()

predict_df[['income_bracket','predicted_class', 'correct']]
income_bracket predicted_class correct
0 <=50K <=50K True
1 <=50K <=50K True
2 >50K <=50K False
3 >50K <=50K False
4 <=50K <=50K True
5 <=50K <=50K True
6 <=50K <=50K True
7 >50K >50K True
8 <=50K <=50K True
9 <=50K <=50K True
10 >50K <=50K False
11 <=50K >50K False
12 <=50K <=50K True
13 <=50K <=50K True
14 >50K <=50K False
15 >50K >50K True
16 <=50K <=50K True
17 <=50K <=50K True
18 <=50K <=50K True
19 >50K >50K True

For a working end-to-end example, download our example code and set the model_type flag to wide.

Adding Regularization to Prevent Overfitting

Regularization is a technique used to avoid overfitting. Overfitting happens when a model performs well on the data it is trained on, but worse on test data that the model has not seen before. It can occur when a model is excessively complex, such as having too many parameters relative to the amount of observed training data. Regularization allows you to control the model's complexity and makes the model more generalizable to unseen data.
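
In penalized form, the objective these regularized models aim at adds an L1 and/or L2 term to the data loss. A plain numpy sketch of the penalty terms (FtrlOptimizer applies them per-coordinate under the hood, but the effect on the objective is the same idea):

import numpy as np

def penalized_loss(data_loss, weights, l1=0.0, l2=0.0):
  # L1 pushes weights to exactly zero; L2 smoothly shrinks large weights.
  return (data_loss
          + l1 * np.sum(np.abs(weights))
          + l2 * np.sum(np.square(weights)))

w = np.array([0.0, 0.5, -2.0])
print(penalized_loss(1.0, w, l1=10.0))  # 1 + 10*(0 + 0.5 + 2)  = 26.0
print(penalized_loss(1.0, w, l2=10.0))  # 1 + 10*(0 + 0.25 + 4) = 43.5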

You can add L1 and L2 regularization to the model with the following code:

model_l1 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=10.0,
        l2_regularization_strength=0.0))

model_l1.train(train_inpf)

results = model_l1.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))
accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.47
precision: 0.69
prediction/mean: 0.24
recall: 0.56
model_l2 = tf.estimator.LinearClassifier(
    feature_columns=base_columns + crossed_columns,
    optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.1,
        l1_regularization_strength=0.0,
        l2_regularization_strength=10.0))

model_l2.train(train_inpf)

results = model_l2.evaluate(test_inpf)
clear_output()
for key in sorted(results):
  print('%s: %0.2f' % (key, results[key]))
accuracy: 0.84
accuracy_baseline: 0.76
auc: 0.88
auc_precision_recall: 0.69
average_loss: 0.35
global_step: 20351.00
label/mean: 0.24
loss: 22.46
precision: 0.69
prediction/mean: 0.24
recall: 0.56

These regularized models don't perform much better than the base model. Let's look at the models' weight distributions to better see the effect of the regularization:

def get_flat_weights(model):
  weight_names = [
      name for name in model.get_variable_names()
      if "linear_model" in name and "Ftrl" not in name]

  weight_values = [model.get_variable_value(name) for name in weight_names]

  weights_flat = np.concatenate([item.flatten() for item in weight_values], axis=0)

  return weights_flat

weights_flat = get_flat_weights(model)
weights_flat_l1 = get_flat_weights(model_l1)
weights_flat_l2 = get_flat_weights(model_l2)

The models have many zero-valued weights caused by unused hash bins (there are many more hash bins than categories in some columns). We can mask these weights when viewing the weight distributions:

weight_mask = weights_flat != 0

weights_base = weights_flat[weight_mask]
weights_l1 = weights_flat_l1[weight_mask]
weights_l2 = weights_flat_l2[weight_mask]

Now plot the distributions:

plt.figure()
_ = plt.hist(weights_base, bins=np.linspace(-3,3,30))
plt.title('Base Model')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l1, bins=np.linspace(-3,3,30))
plt.title('L1 - Regularization')
plt.ylim([0,500])

plt.figure()
_ = plt.hist(weights_l2, bins=np.linspace(-3,3,30))
plt.title('L2 - Regularization')
_=plt.ylim([0,500])

[Three weight histograms: "Base Model", "L1 - Regularization", "L2 - Regularization"]

Both types of regularization squeeze the distribution of weights towards zero. L2 regularization has a greater effect on the tail of the distribution, eliminating extreme weights. L1 regularization produces more exact-zero values; in this case it sets ~200 weights to exactly zero.
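
You can check that count directly on the masked weight vectors from above:

# Count how many of the base model's nonzero weights each regularizer
# drove to exactly zero.
print('L1 exact zeros:', np.sum(weights_l1 == 0))
print('L2 exact zeros:', np.sum(weights_l2 == 0))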