TensorFlow Cloud

TensorFlow Cloud is a library that makes it easier to do training and hyperparameter tuning of Keras models on Google Cloud.

Using TensorFlow Cloud's run API, you can send your model code directly to your Google Cloud account and use Google Cloud compute resources without needing to log in and interact with the Cloud UI (once you have set up your project in the console).

This means that you can use your Google Cloud compute resources directly from inside a Python notebook, like this one. You can also send models to Google Cloud from a plain .py Python script.
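
The core pattern looks roughly like the sketch below (a simplified illustration of the flow used later in this tutorial, not code to run as-is): anything wrapped in if not tfc.remote() runs only in your local notebook, anything wrapped in if tfc.remote() runs only on Google Cloud, and the call to tfc.run packages the notebook and submits it as a training job.

# Minimal sketch of the TensorFlow Cloud workflow (illustration only).
import tensorflow_cloud as tfc

if not tfc.remote():
    # Runs only in the local notebook, e.g. authentication or a quick
    # one-epoch sanity check of the model.
    pass

if tfc.remote():
    # Runs only on Google Cloud, e.g. the full training run.
    pass

# Packages this notebook (and the files alongside it) and submits it to
# Google Cloud as a training job.
tfc.run(requirements_txt='requirements.txt')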

Simple example

This is a simple introductory example to demonstrate how to train a model remotely using TensorFlow Cloud and Google Cloud.

You can just read through it to get an idea of how this works, or you can run the notebook in Google Colab. Running the notebook requires connecting to a Google Cloud account and entering your credentials and project ID. See Setting Up and Connecting To Your Google Cloud Account if you don't have an account yet or are not sure how to set up a project in the console.

Import required modules

import tensorflow as tf
tf.version.VERSION
'2.6.0'
! pip install -q tensorflow-cloud
import tensorflow_cloud as tfc
print(tfc.__version__)
import os
import sys

Project Configurations

Set project parameters. If you don't know what your GCP_PROJECT_ID or GCS_BUCKET should be, see Setting Up and Connecting To Your Google Cloud Account.

The JOB_NAME is optional, and you can set it to any string. If you are running multiple training experiments (for example, as part of a larger project), you may want to give each of them a unique JOB_NAME.

# Set Google Cloud Specific parameters

# TODO: Please set GCP_PROJECT_ID to your own Google Cloud project ID.
GCP_PROJECT_ID = 'YOUR_PROJECT_ID'

# TODO: set GCS_BUCKET to your own Google Cloud Storage (GCS) bucket.
GCS_BUCKET = 'YOUR_GCS_BUCKET_NAME'

# DO NOT CHANGE: Currently only the 'us-central1' region is supported.
REGION = 'us-central1'

# OPTIONAL: You can change the job name to any string.
JOB_NAME = 'mnist'

# Set the location where training logs and checkpoints will be stored
GCS_BASE_PATH = f'gs://{GCS_BUCKET}/{JOB_NAME}'
TENSORBOARD_LOGS_DIR = os.path.join(GCS_BASE_PATH,"logs")
MODEL_CHECKPOINT_DIR = os.path.join(GCS_BASE_PATH,"checkpoints")
SAVED_MODEL_DIR = os.path.join(GCS_BASE_PATH,"saved_model")

Authenticating the notebook to use your Google Cloud Project

This code authenticates the notebook by checking for valid Google Cloud credentials and identity. It is inside the if not tfc.remote() block to ensure that it only runs in the notebook, and will not run when the notebook code is sent to Google Cloud.

# Using tfc.remote() to ensure this code only runs in notebook
if not tfc.remote():

    # Authentication for Kaggle Notebooks
    if "kaggle_secrets" in sys.modules:
        from kaggle_secrets import UserSecretsClient
        UserSecretsClient().set_gcloud_credentials(project=GCP_PROJECT_ID)

    # Authentication for Colab Notebooks
    if "google.colab" in sys.modules:
        from google.colab import auth
        auth.authenticate_user()
        os.environ["GOOGLE_CLOUD_PROJECT"] = GCP_PROJECT_ID

Model and data setup

From here we are following the basic procedure for setting up a simple Keras model to run classification on the MNIST dataset.

Load and split data

Load the raw MNIST data and keep the training split, then flatten and normalize the images.

(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()

x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype('float32') / 255

Create a model and prepare for training

Create a simple model and set up a few callbacks for it.

from tensorflow.keras import layers

model = tf.keras.Sequential([
  layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
  layers.Dropout(0.2),
  layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

Quick validation training

We'll train the model for a single epoch just to make sure everything is set up correctly, and we'll wrap that training call in if not tfc.remote() so that it only happens here in the runtime environment in which you are reading this, not when the code is sent to Google Cloud.

if not tfc.remote():
    # Run the training for 1 epoch and a small subset of the data to validate setup
    model.fit(x=x_train[:100], y=y_train[:100], validation_split=0.2, epochs=1)

Prepare for remote training

The code below will only run when the notebook code is sent to Google Cloud, not inside the runtime in which you are reading this.

First, we set up callbacks which will:

  • Create logs for TensorBoard.
  • Create checkpoints and save them to the checkpoints directory specified above.
  • Stop model training if loss is not improving sufficiently.

Then we call model.fit and model.save, which (when this code is running on Google Cloud) actually run the full training (100 epochs) and then save the trained model to the GCS bucket and directory defined above.

if tfc.remote():
    # Configure callbacks: TensorBoard logs, model checkpoints, and early stopping
    callbacks = [
        tf.keras.callbacks.TensorBoard(log_dir=TENSORBOARD_LOGS_DIR),
        tf.keras.callbacks.ModelCheckpoint(
            MODEL_CHECKPOINT_DIR,
            save_best_only=True),
        tf.keras.callbacks.EarlyStopping(
            monitor='loss',
            min_delta=0.001,
            patience=3)]

    model.fit(x=x_train, y=y_train, epochs=100,
              validation_split=0.2, callbacks=callbacks)

    model.save(SAVED_MODEL_DIR)

Start the remote training

TensorFlow Cloud takes all the code from its local execution environment (this notebook), wraps it up, and sends it to Google Cloud for execution. (That's why the if tfc.remote() and if not tfc.remote() wrappers are important.)

This step will prepare your code from this notebook for remote execution and then start a remote training job on Google Cloud Platform to train the model.

First we add the tensorflow-cloud Python package to a requirements.txt file, which will be sent along with the code in this notebook. You can add other packages here as needed.

Then a GPU and a CPU image are specified. You only need to specify one or the other; the GPU is used in the code that follows.

Finally, the heart of TensorFlow Cloud: the call to tfc.run. When this is executed inside this notebook, all the code from this notebook, and the rest of the files in this directory, will be packaged and sent to Google Cloud for execution. The parameters of the run method specify the details of the execution environment and the distribution strategy (if any) to be used.

Once the job is submitted, you can go to the next step to monitor the job's progress via TensorBoard.

# If you are using a custom image, you can install additional modules via a
# requirements.txt file.
with open('requirements.txt','w') as f:
    f.write('tensorflow-cloud\n')

# Optional: Some recommended base images. If you provide none, the system
# will choose one for you.
TF_GPU_IMAGE = "tensorflow/tensorflow:latest-gpu"
TF_CPU_IMAGE = "tensorflow/tensorflow:latest"

# Submit a single node training job using GPU.
tfc.run(
    distribution_strategy='auto',
    requirements_txt='requirements.txt',
    docker_config=tfc.DockerConfig(
        parent_image=TF_GPU_IMAGE,
        image_build_bucket=GCS_BUCKET
        ),
    chief_config=tfc.COMMON_MACHINE_CONFIGS['K80_1X'],
    job_labels={'job': JOB_NAME}
)
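
The call above submits the job on a GPU machine. If you would rather use a CPU-only machine, a variant of the same call (a sketch, assuming 'CPU' is an available key in tfc.COMMON_MACHINE_CONFIGS) would point at the CPU base image defined above instead:

# Alternative (sketch, not run in this tutorial): submit the same job on a
# CPU-only machine using the CPU base image defined above.
tfc.run(
    distribution_strategy='auto',
    requirements_txt='requirements.txt',
    docker_config=tfc.DockerConfig(
        parent_image=TF_CPU_IMAGE,
        image_build_bucket=GCS_BUCKET
        ),
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    job_labels={'job': JOB_NAME}
)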

Training Results

Reconnect your Colab instance

Most remote training jobs are long running. If you are using Colab, it may time out before the training results are available.

In that case, rerun the following sections in order to reconnect and configure your Colab instance to access the training results.

  1. Import required modules
  2. Project Configurations
  3. Authenticating the notebook to use your Google Cloud Project

DO NOT rerun the rest of the code.

Load TensorBoard

While the training is in progress, you can use TensorBoard to view the results. Note that the results will only appear after your training has started; this may take a few minutes.

%load_ext tensorboard
%tensorboard --logdir $TENSORBOARD_LOGS_DIR

Load your trained model

Once training is complete, you can retrieve your model from the GCS Bucket you specified above.

trained_model = tf.keras.models.load_model(SAVED_MODEL_DIR)
trained_model.summary()
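
As an optional sanity check (not part of the original tutorial), you can evaluate the reloaded model on the MNIST test split, preprocessing it the same way as the training data above:

# Optional: evaluate the reloaded model on the MNIST test split,
# preprocessed the same way as the training data.
(_, _), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test.reshape((10000, 28 * 28)).astype('float32') / 255
trained_model.evaluate(x=x_test, y=y_test)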