How to use multiple GPU nodes to train/fine-tune large language models?

I am using UT's Texas Advanced Computing Center (TACC) resources to fine-tune large language models. However, I am only able to fine-tune with a batch size of 1, which makes the fine-tuning extremely slow and unstable.

My understanding is that using a multi-GPU node would allow me to mitigate the out-of-memory problem. So far, when submitting jobs that request multiple GPU nodes, the out-of-memory problem still exists, even though the size of the language model and the batch size of the training data are at least 3 times smaller than the combined GPU memory (80 GB x 2).

The node's memory is 256 GB, and other details are in this link: Lonestar6

I have been unsuccessful in fine-tuning GPT-2 (small/medium): GitHub - openai/gpt-2: Code for the paper "Language Models are Unsupervised Multitask Learners"
I am planning to train LLaMA, which is even larger than GPT-2.

I assume TensorFlow automatically takes advantage of the multiple GPUs in the node, based on what I can find in the TensorFlow documentation.

TensorFlow API for configuring distributed workloads:
My understanding is that TensorFlow does not utilize multiple GPUs automatically. It offers the tools and capabilities to harness multiple GPUs, but it does not use them by default. Can you share the documentation stating that TensorFlow automatically takes advantage of multiple GPUs so I can take a look at it?

TensorFlow has documentation on how to distribute computation across multiple GPUs; doing so requires explicit programming with specific TensorFlow APIs and strategies.
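
As a first sanity check, it is worth confirming that TensorFlow can actually see all of the GPUs inside your job allocation; a minimal sketch, assuming TensorFlow 2.x:

import tensorflow as tf

# List the GPUs visible to this process; on a two-GPU allocation this should
# print two entries. If it prints fewer, the job script or CUDA environment
# is hiding devices from TensorFlow.
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs visible:", len(gpus))
for gpu in gpus:
    print(gpu)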

Guide to Distributed Training

You can use TensorFlow APIs such as tf.distribute.Strategy to distribute computation across multiple GPUs or machines for improved performance and faster training. However, the responsibility lies with the developer to explicitly implement and configure these distribution strategies in their TensorFlow code.
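
For example, here is a minimal sketch (TensorFlow 2.x) that creates a MirroredStrategy and reports how many GPU replicas it will actually use:

import tensorflow as tf

# MirroredStrategy picks up every GPU visible to the process by default;
# you can also pass an explicit device list, e.g. devices=["/gpu:0", "/gpu:1"].
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

If this prints 1 even though the node has two GPUs, the strategy is not seeing the second device, which would explain why the out-of-memory behavior does not change.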

TensorFlow on TACC Resources
Also, look into TACC's own documentation for using TensorFlow on their systems; this will help with submitting an appropriate job script to the HPC system. The guide below covers Lonestar6, Maverick2, Frontera, and Stampede2; since you are using Lonestar6, it should apply to your use case:

Guide for TensorFlow at TACC

Based on this guide, TACC supports the TensorFlow + Horovod stack, which offers convenient interfaces for deep learning architecture specification, model training, tuning, and validation. The guide includes instructions for installing TensorFlow as well as for downloading and running benchmarks in both single-node and multi-node configurations.
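
As a rough illustration of what the TensorFlow + Horovod combination looks like in user code, here is a minimal sketch based on Horovod's standard Keras usage; it is not TACC-specific, and x_train/y_train stand in for your own training data:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each Horovod process to one local GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Wrap the optimizer so gradients are averaged across all processes.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Keep all workers' initial weights in sync with rank 0.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# x_train / y_train: your training data.
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)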

The installation instructions are detailed and specific, taking into account variations in TensorFlow and Python versions as well as their compatibility with Intel compilers and CUDA libraries. TACC advises that paying careful attention to these instructions is crucial for a successful setup.

TensorFlow Distributed Training Example:
To use distributed training in TensorFlow, you first need to create a tf.distribute.Strategy object. There are a number of different strategies available, each with its own advantages and disadvantages. Once you have created a strategy object, you can use it to distribute your training code.
See the example code below:

import tensorflow as tf

# MirroredStrategy replicates the model onto every GPU visible to the process.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Build and compile the model inside the strategy scope so that its
    # variables are mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Each global batch of x_train/y_train is split across the replicas.
model.fit(x_train, y_train, epochs=10)

This code creates one copy (replica) of the model on each visible GPU, two in this case. Each batch of training data is then split between the two GPUs, and the gradients are aggregated across the replicas after every step.
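
One practical consequence is that the batch size you feed to the input pipeline is the global batch size, which is divided among the replicas. A common pattern, continuing the sketch above, is to scale the per-GPU batch size by the number of replicas:

# The batch size used by the input pipeline is the *global* batch size;
# each of the N replicas receives global_batch_size / N examples per step.
per_replica_batch_size = 8
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(10_000)
                 .batch(global_batch_size))

model.fit(train_dataset, epochs=10)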

There are several ways of using distributed training with TensorFlow. For example,

  • tf.distribute.Strategy: This is the main API for distributed training in TensorFlow. It provides a number of different strategies for distributing training across multiple machines or GPUs.
  • Keras: Keras is a high-level API for building and training machine learning models in TensorFlow. It provides built-in support for distributed training through the tf.distribute.Strategy API.
  • tf.data: The tf.data API provides a number of features for working with large datasets, including support for distributed loading and processing (see the sketch after this list).
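
For instance, when you write a custom training loop instead of using model.fit, a tf.data pipeline can be distributed explicitly; a sketch that continues the example above:

# Split each global batch of the tf.data pipeline across the replicas.
dist_dataset = strategy.experimental_distribute_dataset(train_dataset)

@tf.function
def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
            labels, predictions)
        # Average over the *global* batch so gradients are scaled correctly.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=global_batch_size)
    grads = tape.gradient(loss, model.trainable_variables)
    model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for batch in dist_dataset:
    # strategy.run executes train_step once per replica, each on its own shard.
    strategy.run(train_step, args=(batch,))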

There are other strategies for distributing training as well. For example,

  • MultiWorkerMirroredStrategy: synchronous training across multiple workers/nodes (see the sketch after this list)
  • CentralStorageStrategy: variables live on the CPU while computation is mirrored across the local GPUs
  • ParameterServerStrategy: asynchronous training with parameter server and worker tasks
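
Of these, MultiWorkerMirroredStrategy is the one that spans multiple GPU nodes. Below is a minimal sketch of how it is typically configured; the hostnames, port, and hard-coded TF_CONFIG are placeholders, and on an HPC system this variable is normally constructed in the job script from the allocated node list rather than written by hand:

import json
import os
import tensorflow as tf

# Each worker process needs a TF_CONFIG describing the cluster and its own
# position in it. The same script runs on every node.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['node1:12345', 'node2:12345']},  # placeholder hosts
    'task': {'type': 'worker', 'index': 0},  # index 1 on the second node
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build and compile the model exactly as in the single-node example.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# train_dataset as in the earlier sketches; each global batch is now split
# across all GPUs on all participating nodes.
model.fit(train_dataset, epochs=10)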