tf.experimental.dtensor.initialize_accelerator_system

Initializes accelerators and communication fabrics for DTensor.

DTensor configures TensorFlow to run in either local mode or multi-client mode.

  • In local mode, a mesh can only use devices attached to the current process.
  • In multi-client mode, a mesh can span across devices from multiple clients.

If DTENSOR_JOBS is non-empty, DTensor configures TensorFlow to run in multi-client mode using the distributed runtime. In multi-client mode, devices on different clients can communicate with each other.
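The mode-selection rule above can be sketched in a few lines. This is only an illustration of the documented rule, not TensorFlow's implementation; the helper name dtensor_mode is hypothetical and is not part of the DTensor API.

```python
import os
from typing import Mapping, Optional


def dtensor_mode(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return "multi-client" if DTENSOR_JOBS is non-empty, else "local".

    Hypothetical helper: it mirrors the rule stated in the documentation
    and does not call into TensorFlow.
    """
    env = os.environ if environ is None else environ
    # An unset variable and an empty string both mean local mode.
    return "multi-client" if env.get("DTENSOR_JOBS", "") else "local"
```

Passing a plain dict instead of reading os.environ makes the rule easy to check without touching the real process environment.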

The following environment variables control the behavior of this function.

  • DTENSOR_JOBS: string, a comma-separated list. Each item in the list has the format {hostname}:{port}. If empty, DTensor runs in local mode. Examples of valid DTENSOR_JOBS values:
    • 4 clients on localhost: localhost:10000,localhost:10001,localhost:10002,localhost:10003
    • 2 clients on host1, 2 clients on host2: host1:10000,host1:10001,host2:10000,host2:10003
    If the hostnames are BNS addresses, the items must be sorted in alphabetical order.
  • DTENSOR_CLIENT_ID: integer, between 0 and num_clients - 1, identifying the client id of the current process. The default value is 0.
  • DTENSOR_JOB_NAME: string, the name of the TensorFlow job. The job name controls the job name section of the TensorFlow DeviceSpecs, e.g., job:worker in /job:worker/replica:0/task:0/device:TPU:0 when the job name is worker. The default value is localhost in local mode and worker in multi-client mode. All DTensor clients within the same multi-client cluster share the same job name.
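As a rough illustration of how these variables fit together, the sketch below builds the environment for one client of a multi-client cluster. The helper multi_client_env is hypothetical (it is not a DTensor function), and the validation it performs only restates the constraints listed above; TensorFlow may check more or less than this.

```python
from typing import Dict, List


def multi_client_env(jobs: List[str], client_id: int,
                     job_name: str = "worker") -> Dict[str, str]:
    """Build DTENSOR_* variables for one client (hypothetical helper)."""
    if not jobs:
        raise ValueError("multi-client mode needs at least one job address")
    for item in jobs:
        # Each item is documented as {hostname}:{port}.
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port:
            raise ValueError(f"expected hostname:port, got {item!r}")
    # The client id must lie between 0 and num_clients - 1.
    if not 0 <= client_id < len(jobs):
        raise ValueError(f"client_id must be in [0, {len(jobs) - 1}]")
    return {
        "DTENSOR_JOBS": ",".join(jobs),
        "DTENSOR_CLIENT_ID": str(client_id),
        # Every client in the same cluster shares this job name.
        "DTENSOR_JOB_NAME": job_name,
    }
```

A launcher script could call this once per client and merge the result into the environment of each process it starts, so that every process sees the same DTENSOR_JOBS and DTENSOR_JOB_NAME but a distinct DTENSOR_CLIENT_ID.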

Args

  • device_type: Type of accelerator to use; one of CPU, GPU, or TPU. If None, uses tf.experimental.dtensor.preferred_device_type().
  • enable_coordination_service: If true, enables the distributed coordination service so that workers know about each other's devices when there is more than one client.

Returns

  • device_type: The type of accelerator that was initialized.
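A minimal usage sketch follows, assuming a TensorFlow build that ships tf.experimental.dtensor. The wrapper name init_dtensor is hypothetical; it only forwards to the documented function and imports TensorFlow lazily so that the module can be loaded on machines where TensorFlow is not installed.

```python
from typing import Optional


def init_dtensor(device_type: Optional[str] = None) -> str:
    """Initialize DTensor and return the device type that was initialized.

    Hypothetical wrapper around
    tf.experimental.dtensor.initialize_accelerator_system.
    """
    # Lazy import: TensorFlow is only needed when the function is called.
    from tensorflow.experimental import dtensor

    # With device_type=None, DTensor falls back to
    # tf.experimental.dtensor.preferred_device_type().
    return dtensor.initialize_accelerator_system(device_type)
```

For example, init_dtensor("CPU") in a process where DTENSOR_JOBS is unset would run in local mode, and the returned string can be used afterwards when building a mesh for that device type.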