
In a *regression* problem, we aim to predict a continuous value, like a price or a probability. Contrast this with a *classification* problem, where we aim to select a class from a list of classes (for example, given a picture that contains an apple or an orange, recognizing which fruit is in the picture).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late 1970s and early 1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes such as cylinders, displacement, horsepower, and weight.

This example uses the `tf.keras` API; see this guide for details.

```
# Use seaborn for pairplot
!pip install -q seaborn
# Use some functions from tensorflow_docs
!pip install -q git+https://github.com/tensorflow/docs
```

```
import pathlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
```

```
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)
```

2.1.0

```
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling
```

## The Auto MPG dataset

The dataset is available from the UCI Machine Learning Repository.

### Get the data

First download the dataset.

```
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path
```

Downloading data from http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data 32768/30286 [================================] - 0s 1us/step '/home/kbuilder/.keras/datasets/auto-mpg.data'

Import it using pandas:

```
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                          na_values="?", comment='\t',
                          sep=" ", skipinitialspace=True)
dataset = raw_dataset.copy()
dataset.tail()
```

### Clean the data

The dataset contains a few unknown values.

```
dataset.isna().sum()
```

MPG 0 Cylinders 0 Displacement 0 Horsepower 6 Weight 0 Acceleration 0 Model Year 0 Origin 0 dtype: int64

To keep this initial tutorial simple, drop those rows.

```
dataset = dataset.dropna()
```
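To see how the `na_values="?"` and `comment='\t'` arguments lead to those missing entries, here is a minimal sketch on two made-up rows laid out like the raw file (space-separated numbers, then a tab and the car name). The rows and the shortened column list are illustrative assumptions, not records from the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the raw file: numeric fields
# separated by runs of spaces, then a tab and a quoted car name.
raw_text = (
    '18.0   8   130.0\t"example car a"\n'
    '25.0   4   ?\t"example car b"\n'
)

toy = pd.read_csv(
    io.StringIO(raw_text),
    names=["MPG", "Cylinders", "Horsepower"],  # shortened, hypothetical column list
    na_values="?",          # treat "?" as a missing value
    comment="\t",           # ignore everything after the tab (the car name)
    sep=" ",
    skipinitialspace=True,  # skip the extra spaces between fields
)

missing_per_column = toy.isna().sum()  # one missing Horsepower value
cleaned = toy.dropna()                 # keeps only the complete row
print(missing_per_column)
print(cleaned)
```

Dropping rows is the simplest option; it is reasonable here only because very few rows are affected.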

The `"Origin"` column is really categorical, not numeric. So convert it to a one-hot encoding:

```
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
```

```
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()
```
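The two steps above can be tried on a tiny frame to see what the one-hot columns look like. This is a sketch on three hypothetical rows; the `dtype=float` argument is an addition made here so that the indicator columns hold 0.0/1.0 whatever the default dtype of the installed pandas version is.

```python
import pandas as pd

# Hypothetical three-row frame; only the 'Origin' codes mirror the tutorial.
toy = pd.DataFrame({"MPG": [18.0, 26.0, 31.0], "Origin": [1, 2, 3]})

# Step 1: replace the numeric codes with readable category names.
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})

# Step 2: expand the categorical column into one indicator column per category.
# With an empty prefix and separator, the new columns are named after the
# categories themselves.
one_hot = pd.get_dummies(toy, prefix="", prefix_sep="", dtype=float)
print(one_hot)
```

Each row has exactly one of the `USA`, `Europe`, and `Japan` columns set to 1, so the model no longer sees the origin codes as ordered numbers.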

### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.

```
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
```
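The `sample` and `drop` combination works because `drop` removes exactly the index labels that `sample` picked. A quick check on a hypothetical ten-row frame shows that the two pieces are disjoint and together cover every row:

```python
import pandas as pd

# Hypothetical frame with ten rows and the default integer index.
toy = pd.DataFrame({"value": range(10)})

train = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, chosen at random
test = toy.drop(train.index)                  # whatever was not sampled

overlap = set(train.index) & set(test.index)
print(len(train), len(test), overlap)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models trained on different runs.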

### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

```
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
```


Also look at the overall statistics:

```
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats
```

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.

```
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
```

### Normalize the data

Look again at the `train_stats` block above and note how different the ranges of each feature are.

It is good practice to normalize features that use different scales and ranges. Although the model *might* converge without feature normalization, it makes training more difficult, and it makes the resulting model dependent on the choice of units used in the input.

```
def norm(x):
    return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
```

This normalized data is what we will use to train the model.
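As a small worked example of the same idea, the sketch below normalizes two hypothetical features with very different ranges. The numbers are made up; the point is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import pandas as pd

# Made-up values chosen so the statistics are easy to check by hand.
train = pd.DataFrame({"Weight": [2000.0, 3000.0, 4000.0],
                      "Cylinders": [4.0, 6.0, 8.0]})
test = pd.DataFrame({"Weight": [3500.0], "Cylinders": [8.0]})

train_mean = train.mean()  # Weight: 3000, Cylinders: 6
train_std = train.std()    # sample std, as in describe(): Weight 1000, Cylinders 2

def normalize(frame):
    # Subtract the training mean and divide by the training std, column by column.
    return (frame - train_mean) / train_std

normed_train = normalize(train)  # both columns become -1, 0, 1
normed_test = normalize(test)    # Weight: 0.5, Cylinders: 1.0
print(normed_train)
print(normed_test)
```

After normalization both features live on the same scale, even though the raw weights are hundreds of times larger than the cylinder counts.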

## The model

### Build the model

Let's build our model. Here, we'll use a `Sequential` model with two densely connected hidden layers, and an output layer that returns a single, continuous value. The model building steps are wrapped in a function, `build_model`, since we'll create a second model later on.

```
def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])

    optimizer = tf.keras.optimizers.RMSprop(0.001)

    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])
    return model
```

```
model = build_model()
```

### Inspect the model

Use the `.summary` method to print a simple description of the model:

```
model.summary()
```

Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 64) 640 _________________________________________________________________ dense_1 (Dense) (None, 64) 4160 _________________________________________________________________ dense_2 (Dense) (None, 1) 65 ================================================================= Total params: 4,865 Trainable params: 4,865 Non-trainable params: 0 _________________________________________________________________
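The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes nine input features (the six numeric columns plus the three one-hot origin columns), which is consistent with the 640 parameters reported for the first layer. `dense_param_count` is a helper defined here for illustration, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 9             # assumption: 6 numeric columns + 3 one-hot columns
layer_units = [64, 64, 1]  # the three Dense layers in build_model

counts = []
n_in = n_features
for units in layer_units:
    counts.append(dense_param_count(n_in, units))
    n_in = units  # the next layer's inputs are this layer's outputs

total = sum(counts)
print(counts, total)  # [640, 4160, 65] 4865
```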

Now try out the model. Take a batch of `10` examples from the training data and call `model.predict` on it.

```
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result
```

array([[ 0.03299644], [ 0.22193381], [-0.32025507], [ 0.11727733], [ 0.26874357], [-0.10959077], [ 0.26036227], [ 0.34398937], [-0.16758749], [ 0.30486187]], dtype=float32)

It seems to be working, and it produces a result of the expected shape and type.
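To make the expected shape concrete, here is a rough NumPy sketch of the same forward pass: two ReLU layers of 64 units followed by a single linear output. The weights are random placeholders and the input batch is random noise, so the values mean nothing; this is not how Keras initializes or stores its layers, only an illustration of why ten input rows produce a `(10, 1)` result.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def dense(x, weights, bias, activation=None):
    # A dense layer is a matrix product plus a bias, optionally passed
    # through an activation function.
    out = x @ weights + bias
    return activation(out) if activation is not None else out

n_features = 9  # assumption: matches the feature count implied by the summary
batch = rng.normal(size=(10, n_features))  # placeholder for 10 normalized rows

# Random placeholder weights for the three layers.
w1, b1 = rng.normal(scale=0.1, size=(n_features, 64)), np.zeros(64)
w2, b2 = rng.normal(scale=0.1, size=(64, 64)), np.zeros(64)
w3, b3 = rng.normal(scale=0.1, size=(64, 1)), np.zeros(1)

hidden1 = dense(batch, w1, b1, relu)
hidden2 = dense(hidden1, w2, b2, relu)
predictions = dense(hidden2, w3, b3)  # no activation: a plain continuous value

print(predictions.shape)  # (10, 1)
```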

### Train the model

Train the model for 1000 epochs, and record the training and validation loss and metrics (MAE and MSE) in the `history` object.

```
EPOCHS = 1000

history = model.fit(
    normed_train_data, train_labels,
    epochs=EPOCHS, validation_split=0.2, verbose=0,
    callbacks=[tfdocs.modeling.EpochDots()])
```

Epoch: 0, loss:568.1550, mae:22.5943, mse:568.1550, val_loss:556.9423, val_mae:22.3343, val_mse:556.9423, .................................................................................................... Epoch: 100, loss:6.2290, mae:1.7288, mse:6.2290, val_loss:8.0316, val_mae:2.1801, val_mse:8.0316, .................................................................................................... Epoch: 200, loss:5.4309, mae:1.6137, mse:5.4309, val_loss:8.3624, val_mae:2.2117, val_mse:8.3624, .................................................................................................... Epoch: 300, loss:4.9404, mae:1.5381, mse:4.9404, val_loss:8.5247, val_mae:2.2191, val_mse:8.5247, .................................................................................................... Epoch: 400, loss:4.5385, mae:1.4569, mse:4.5385, val_loss:8.6025, val_mae:2.1262, val_mse:8.6025, .................................................................................................... Epoch: 500, loss:4.3172, mae:1.4095, mse:4.3172, val_loss:8.5202, val_mae:2.1395, val_mse:8.5202, .................................................................................................... Epoch: 600, loss:3.9400, mae:1.3338, mse:3.9400, val_loss:9.1496, val_mae:2.2391, val_mse:9.1496, .................................................................................................... Epoch: 700, loss:3.4086, mae:1.2328, mse:3.4086, val_loss:8.6998, val_mae:2.1477, val_mse:8.6998, .................................................................................................... Epoch: 800, loss:3.1403, mae:1.1747, mse:3.1403, val_loss:8.7389, val_mae:2.1732, val_mse:8.7389, .................................................................................................... Epoch: 900, loss:3.0132, mae:1.1281, mse:3.0132, val_loss:8.9137, val_mae:2.2226, val_mse:8.9137, ....................................................................................................
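The log is easier to read as a table. The sketch below copies the `loss` and `val_loss` values printed above (one checkpoint every 100 epochs) into a DataFrame and computes the gap between them. It only covers the logged checkpoints, so the "best" epoch it reports is the best among those ten rows, not necessarily the best of all 1000 epochs.

```python
import pandas as pd

# Values copied from the training log printed above (every 100th epoch).
log = pd.DataFrame({
    "epoch": [0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
    "loss": [568.1550, 6.2290, 5.4309, 4.9404, 4.5385,
             4.3172, 3.9400, 3.4086, 3.1403, 3.0132],
    "val_loss": [556.9423, 8.0316, 8.3624, 8.5247, 8.6025,
                 8.5202, 9.1496, 8.6998, 8.7389, 8.9137],
})

# A growing gap between validation and training loss is a sign of overfitting.
log["gap"] = log["val_loss"] - log["loss"]

# Logged checkpoint with the lowest validation loss.
best_logged_epoch = int(log.loc[log["val_loss"].idxmin(), "epoch"])
print(log)
print(best_logged_epoch)  # 100
```

The training loss keeps falling while the validation loss stays near 8 to 9, so the gap widens as training continues.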

Visualize the model's training progress using the stats stored in the `history` object.

```
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()
```

```
plotter = tfdocs.plots.HistoryPlotter(smoothing_std=2)
```

```
plotter.plot({'Basic': history}, metric = "mae")
plt.ylim([0, 10])
plt.ylabel('MAE [MPG]')
```


```
plotter.plot({'Basic': history}, metric = "mse")
plt.ylim([0, 20])
plt.ylabel('MSE [MPG^2]')
```


This graph shows little improvement, or even degradation, in the validation error after about 100 epochs. Let's update the `model.fit` call to automatically stop training when the validation score doesn't improve. We'll use an *EarlyStopping callback* that tests a training condition for every epoch. If a set number of epochs elapses without showing improvement, training stops automatically.

You can learn more about this callback here.

```
model = build_model()
# The patience parameter is the number of epochs to wait for an improvement
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels,
                          epochs=EPOCHS, validation_split=0.2, verbose=0,
                          callbacks=[early_stop, tfdocs.modeling.EpochDots()])
```

Epoch: 0, loss:577.7917, mae:22.6916, mse:577.7917, val_loss:580.4527, val_mae:22.6452, val_mse:580.4528, ........................................................................................
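The patience rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the idea, not the Keras implementation: it ignores options such as `min_delta`, and `stopping_epoch` and the loss values are made up for this example.

```python
def stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without the
    validation loss dropping below the best value seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None  # ran out of epochs before patience was exhausted

# Made-up validation losses: the best value so far (7.0) is reached at epoch 2.
losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]

print(stopping_epoch(losses, patience=3))  # 5
print(stopping_epoch(losses, patience=4))  # None
```

A larger patience tolerates longer plateaus: with `patience=4`, the improvement at the last epoch arrives in time and training is never stopped early.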

```
plotter.plot({'Early Stopping': early_history}, metric = "mae")
plt.ylim([0, 10])
plt.ylabel('MAE [MPG]')
```


The graph shows that on the validation set, the average error is usually around +/- 2 MPG. Is this good? We'll leave that decision up to you.

Let's see how well the model generalizes by using the **test** set, which we did not use when training the model. This tells us how well we can expect the model to predict when we use it in the real world.

```
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)
print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))
```

78/78 - 0s - loss: 6.3184 - mae: 2.0134 - mse: 6.3184 Testing set Mean Abs Error: 2.01 MPG
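For reference, the two metrics reported by `evaluate` can be computed by hand. The sketch below uses NumPy and three made-up label/prediction pairs; the helper names are defined here for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Average size of the errors, in the same units as the labels (MPG).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

def mean_squared_error(y_true, y_pred):
    # Average squared error, in squared units (MPG^2); penalizes large misses more.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Made-up values: the errors are +2, 0, and -4 MPG.
true_mpg = [20.0, 30.0, 25.0]
predicted_mpg = [22.0, 30.0, 21.0]

mae = mean_absolute_error(true_mpg, predicted_mpg)  # (2 + 0 + 4) / 3 = 2.0
mse = mean_squared_error(true_mpg, predicted_mpg)   # (4 + 0 + 16) / 3
print(mae, mse)
```

MAE is the easier number to interpret here because it is expressed directly in MPG.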

### Make predictions

Finally, predict MPG values using data in the testing set:

```
test_predictions = model.predict(normed_test_data).flatten()
a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
lims = [0, 50]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)
```

It looks like our model predicts reasonably well. Let's take a look at the error distribution.

```
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
```

It's not quite Gaussian, but we might expect that because the number of samples is very small.

## Conclusion

This notebook introduced a few techniques to handle a regression problem.

- Mean Squared Error (MSE) is a common loss function used for regression problems (different loss functions are used for classification problems).
- Similarly, evaluation metrics used for regression differ from classification. A common regression metric is Mean Absolute Error (MAE).
- When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
- If there is not much training data, one technique is to prefer a small network with few hidden layers to avoid overfitting.
- Early stopping is a useful technique to prevent overfitting.

```
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
```