Amazon SageMaker AI Adds NVIDIA Blackwell GPU Support

• Amazon SageMaker AI now supports P6-B200 instances equipped with 8 NVIDIA Blackwell GPUs.
• Blackwell GPUs provide up to 268 GB of memory (180 GB on the B200, 268 GB on the B300), enabling larger batch sizes and longer sequences for training.
• Developers can use activation checkpointing and precision formats like MXFP8 to optimize throughput for models exceeding 14B parameters.

Amazon SageMaker AI now supports P6-B200 instances, each with 8 NVIDIA Blackwell GPUs, to improve training efficiency for large-scale machine learning models. The Blackwell architecture offers higher memory bandwidth and new precision formats that address common training bottlenecks such as memory limits and communication overhead. Amazon SageMaker AI manages the compute infrastructure, so developers can focus on algorithm tuning and data preparation, and can use Flexible Training Plans for predictable capacity.

Optimizing training requires balancing batch size, sequence length, and model sharding. Blackwell’s B200 and B300 GPUs provide 180 GB and 268 GB of memory respectively, enabling larger batch sizes that reduce the number of gradient synchronization steps. For models exceeding 14B parameters, activation checkpointing (a technique that discards intermediate activations in the forward pass and recomputes them in the backward pass to save memory) is a prerequisite for stable training. In testing, a 1B-parameter model using MXFP8 precision and an 8K sequence length achieved ~51K tokens/sec with checkpointing and a batch size of 16, roughly 8.5 times the ~6K tokens/sec baseline.
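The batch-size trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes plain data-parallel training with one gradient synchronization round per optimizer step and no gradient accumulation, treats the batch size as per-GPU, and uses a hypothetical 1B-token budget. The function names are made up for this example; the throughput figures are the ones quoted above.

```python
import math


def tokens_per_step(per_gpu_batch: int, seq_len: int, num_gpus: int = 8) -> int:
    """Tokens consumed by one optimizer step across all GPUs."""
    return per_gpu_batch * seq_len * num_gpus


def sync_steps(token_budget: int, per_gpu_batch: int, seq_len: int,
               num_gpus: int = 8) -> int:
    """Optimizer steps (one gradient sync round each, by assumption)
    needed to consume the whole token budget."""
    per_step = tokens_per_step(per_gpu_batch, seq_len, num_gpus)
    return math.ceil(token_budget / per_step)


def hours_to_train(token_budget: int, tokens_per_sec: float) -> float:
    """Wall-clock hours to process the budget at a fixed throughput."""
    return token_budget / tokens_per_sec / 3600


if __name__ == "__main__":
    budget = 10**9   # hypothetical token budget, not from the article
    seq_len = 8192   # the 8K sequence length from the reported test

    for batch in (2, 16):
        steps = sync_steps(budget, batch, seq_len)
        print(f"batch={batch:>2}: {steps} optimizer steps")

    for label, tps in (("baseline", 6_000), ("checkpointing + batch 16", 51_000)):
        print(f"{label}: {hours_to_train(budget, tps):.1f} h")
```

Under these assumptions, moving from a per-GPU batch of 2 to 16 cuts the number of synchronization rounds by a factor of eight for the same token budget, which is why the extra memory matters even when the model itself already fits.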

Precision formats such as FP8, MXFP8, and NVFP4 use Blackwell's fifth-generation Tensor Cores to boost throughput. These formats primarily help compute-bound workloads, and their effectiveness depends on model scale. For small models under 14B parameters, FP8 is a recommended default. For larger models where memory is the primary constraint, MXFP8 balances accuracy and efficiency, while NVFP4 offers higher throughput at the cost of increased implementation complexity. Engineers should benchmark their specific configurations, because reduced-precision techniques introduce quantization overhead.
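The guidance above can be written down as a small decision helper. This is only a sketch of the rule of thumb as stated here, not an official selection algorithm: the function name and the `accept_extra_complexity` flag are hypothetical, and treating exactly 14B as "large" is an assumption, since the text only distinguishes "under" from "larger".

```python
def suggest_precision(num_params_billion: float,
                      accept_extra_complexity: bool = False) -> str:
    """Return a starting-point precision format for a given model size.

    Encodes the rule of thumb from the text:
      - under 14B parameters: FP8 as the default
      - larger models: MXFP8 for a balance of accuracy and efficiency,
        or NVFP4 when higher throughput justifies more implementation work
    """
    if num_params_billion <= 0:
        raise ValueError("num_params_billion must be positive")
    if num_params_billion < 14:
        return "FP8"
    # Assumption: 14B itself is handled like the larger models.
    return "NVFP4" if accept_extra_complexity else "MXFP8"


if __name__ == "__main__":
    for size in (1, 8, 14, 70):
        print(f"{size}B -> {suggest_precision(size)}")
```

The result is a first candidate to benchmark, not a final answer; quantization overhead can change which format wins for a specific configuration.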

To run training on SageMaker, developers must use custom Docker containers built on AWS Deep Learning Containers (DLC) with TransformerEngine 2.11 installed. The configuration process involves creating a training script that uses PyTorch Fully Sharded Data Parallel (FSDP) and defining a launch script that passes hyperparameters to the training run. Users can secure capacity via Flexible Training Plans for reserved, continuous access or Managed Spot Training for cost-optimized, interruptible workloads. Once configured, jobs are submitted using the SageMaker Python SDK; checkpointing to Amazon S3 is recommended so that interrupted Spot jobs can resume from the last saved state.
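As a rough illustration of the submission step, the sketch below assembles keyword arguments for the `Estimator` class in version 2 of the SageMaker Python SDK, with Managed Spot Training and S3 checkpointing enabled. It builds a plain dictionary so it can be inspected without AWS access. Several things here are assumptions rather than details from the article: the helper name, the instance type string `ml.p6-b200.48xlarge`, the local checkpoint path, and all placeholder values (image URI, role ARN, bucket).

```python
from typing import Any, Dict, Optional


def build_estimator_kwargs(
    image_uri: str,
    role_arn: str,
    checkpoint_s3_uri: str,
    use_spot: bool = True,
    max_run_seconds: int = 86_400,
    max_wait_seconds: Optional[int] = None,
    hyperparameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for sagemaker.estimator.Estimator (SDK v2)."""
    kwargs: Dict[str, Any] = {
        "image_uri": image_uri,                  # custom container built on a DLC
        "role": role_arn,
        "instance_count": 1,
        "instance_type": "ml.p6-b200.48xlarge",  # assumed name for P6-B200
        "max_run": max_run_seconds,
        # SageMaker syncs this local directory to S3, so a restarted
        # job can pick up the most recent checkpoint.
        "checkpoint_s3_uri": checkpoint_s3_uri,
        "checkpoint_local_path": "/opt/ml/checkpoints",
        "hyperparameters": dict(hyperparameters or {}),
    }
    if use_spot:
        wait = max_run_seconds if max_wait_seconds is None else max_wait_seconds
        # For Spot jobs, the total wait time must cover the run time.
        if wait < max_run_seconds:
            raise ValueError("max_wait_seconds must be >= max_run_seconds")
        kwargs["use_spot_instances"] = True
        kwargs["max_wait"] = wait
    return kwargs


if __name__ == "__main__":
    kwargs = build_estimator_kwargs(
        image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>",
        role_arn="arn:aws:iam::<account>:role/<role-name>",
        checkpoint_s3_uri="s3://<bucket>/checkpoints/",
        hyperparameters={"precision": "mxfp8", "seq_len": 8192},
    )
    print(kwargs)
    # With the SDK installed and AWS credentials configured, submission
    # would look roughly like:
    #   from sagemaker.estimator import Estimator
    #   Estimator(**kwargs).fit()
```

The training script still has to save checkpoints into the local checkpoint directory and reload the latest one on startup; the configuration above only handles copying that directory to and from S3.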