NVIDIA GPUs H100 vs. A100 – Architecture, Performance, and Cost Comparison

NVIDIA's H100 and A100 GPUs represent the cutting edge of AI acceleration. Although GPUs were initially designed for tasks like rendering images and video, they have become indispensable for AI and ML thanks to Tensor Cores that speed up the matrix operations fundamental to neural networks.

Both the H100 and A100 offer impressive speedups over CPU-only computing. However, the H100 is a significant upgrade over the A100 and adds several new features. This article compares their architectures, performance, and cost, helping you determine which GPU best meets the needs of your AI workload.

NVIDIA GPUs H100 vs. A100: Architectural differences

NVIDIA GPU architecture overview

A modern NVIDIA GPU contains several Graphics Processing Clusters (GPCs). Each GPC has a raster engine and several Texture Processing Clusters (TPCs). Each TPC contains multiple Streaming Multiprocessors (SMs), where the actual computation occurs. The SMs contain CUDA cores, Tensor Cores, and other key components.

NVIDIA regularly releases new architectures that keep the same basic design but differ in their technical specifications. The A100, built on the NVIDIA Ampere architecture, launched in 2020 with a primary focus on ML workloads. The H100, launched in 2022, was designed primarily for transformer-based neural networks; its NVIDIA Hopper architecture was planned with large language models (LLMs) in mind.

The full Ampere GA100 chip contains 8 GPCs, 64 TPCs, and 128 SMs, of which 108 SMs are enabled on the A100. The full Hopper GH100 chip contains 8 GPCs, 72 TPCs, and 144 SMs, with 132 SMs enabled on the H100 SXM5. These numbers reflect the design choices NVIDIA made to optimize performance and efficiency in each architecture.
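If you want to verify the SM count on your own hardware, the CUDA runtime reports it directly. A minimal sketch, assuming the CUDA toolkit is installed and device 0 is the GPU of interest:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    // An A100 reports 108 SMs; an H100 SXM5 reports 132.
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```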

Next, let's look at more design differences between Ampere (A100) and Hopper (H100).

Tensor Cores

The Tensor Core is the main component in the SM that performs deep learning mathematical calculations. The A100 contains third-generation Tensor Cores that support:

  • FP64 (64-bit floating-point precision) for high-performance scientific computing.
  • TF32 (TensorFloat-32), which keeps FP32's numeric range with a reduced mantissa to balance speed and accuracy.
  • INT8/INT4 for inference acceleration, where precision can be traded for speed.

Precision is the level of detail and accuracy used in numerical computations, particularly in representing floating-point (decimal) numbers. It determines how accurately a GPU can process and store numerical data during computations.
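To make this concrete, the sketch below (plain host code, no GPU required) emulates what TF32 does to an FP32 value by truncating its 23-bit mantissa to 10 bits: the numeric range is preserved while fine detail is dropped. FP8 on the H100 goes further still, keeping only 2 or 3 mantissa bits (E5M2/E4M3 formats).

```cuda
// Illustration of precision: the same value stored with fewer mantissa bits.
// TF32 is emulated here by truncating an FP32 mantissa from 23 to 10 bits,
// which approximates what the Tensor Core does internally (real TF32 rounds).
#include <cstdio>
#include <cstring>

float to_tf32(float x) {
    unsigned int bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;                  // zero the 13 lowest mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float v = 0.1234567f;
    printf("FP32: %.9f\n", v);            // full 23-bit mantissa
    printf("TF32: %.9f\n", to_tf32(v));   // same exponent range, 10-bit mantissa
    return 0;
}
```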

In contrast, Hopper brings fourth-generation Tensor Cores that handle all the mixed-precision matrix operations the A100 supports and add FP8 precision. This enables even faster training and inference without significant accuracy loss in transformer models.

Structured sparsity

Both generations of Tensor Cores support structured sparsity, a feature that skips computation on weights pruned from a neural network (e.g., redundant or near-zero weights). The GPU processes only the essential weights, which boosts performance without reducing model accuracy.

However, the Hopper Tensor Core handles sparsity more efficiently: it delivers double the math throughput per SM, clock-for-clock, compared to the A100, and saves roughly 30% in operand delivery power for similar workloads.
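Structured sparsity on both GPUs uses a 2:4 pattern: in every group of four consecutive weights, at most two are non-zero. The sketch below illustrates the pruning step with a simple magnitude heuristic; the function name and heuristic are ours for illustration only, not an NVIDIA API (production pruning is done with NVIDIA's model optimization tools).

```cuda
// Illustrative 2:4 structured pruning: keep the two largest-magnitude weights
// in each group of four and zero the rest.
#include <cmath>
#include <cstdio>
#include <utility>

void prune_2_4(float *w, int n) {
    for (int g = 0; g + 4 <= n; g += 4) {
        // Find the indices of the two largest-magnitude weights in this group.
        int keep0 = g, keep1 = g + 1;
        if (std::fabs(w[keep1]) > std::fabs(w[keep0])) std::swap(keep0, keep1);
        for (int i = 2; i < 4; ++i) {
            int idx = g + i;
            if (std::fabs(w[idx]) > std::fabs(w[keep0])) {
                keep1 = keep0; keep0 = idx;
            } else if (std::fabs(w[idx]) > std::fabs(w[keep1])) {
                keep1 = idx;
            }
        }
        // Zero the other two weights so the group matches the 2:4 pattern.
        for (int i = 0; i < 4; ++i) {
            int idx = g + i;
            if (idx != keep0 && idx != keep1) w[idx] = 0.0f;
        }
    }
}

int main() {
    float w[8] = {0.9f, -0.1f, 0.02f, 0.5f, -0.7f, 0.3f, 0.01f, -0.4f};
    prune_2_4(w, 8);
    for (float x : w) printf("% .2f ", x);   // two zeros in each group of four
    printf("\n");
    return 0;
}
```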

The H100 has a more efficient, fourth-generation Tensor Core (Source)

CUDA Cores

Along with Tensor Cores, both GPUs include CUDA cores for general-purpose parallel computing. CUDA gives developers direct access to the GPU's virtual instruction set and memory, supporting non-AI workloads like graphics rendering, physics simulations, gaming, and 3D modeling.

The A100 delivers up to 2.5x the throughput of the previous Volta generation for HPC workloads, while the H100's CUDA cores raise throughput further, with roughly 3x faster FP32 and FP64 rates chip-to-chip than the A100.
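As a reminder of what general-purpose CUDA computing looks like, here is a minimal, standard SAXPY kernel (not specific to either GPU) that spreads the same scalar operation across thousands of CUDA cores:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);   // 256 threads per block
    cudaDeviceSynchronize();

    printf("y[0] = %.1f (expected 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```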

Thread block cluster

CUDA programming organizes GPU work into a hierarchy of threads, thread blocks, and grids. On the A100, each thread block runs on a single SM, and thread blocks within a grid execute independently of one another. The H100 introduces a new thread block cluster feature that lets developers group multiple thread blocks into a cluster spanning several SMs.

Thread blocks running on different SMs can then synchronize and exchange data far more easily, and developers can control clusters programmatically for greater efficiency in AI and HPC tasks, as the sketch below shows.
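Here is a minimal sketch of the cluster API, assuming CUDA 12 or later and an H100 (compile with -arch=sm_90). The kernel groups two thread blocks into a cluster and synchronizes across them, something independent thread blocks on the A100 cannot do directly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two thread blocks per cluster, declared at compile time.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel() {
    cg::cluster_group cluster = cg::this_cluster();

    // block_rank() identifies this block within its cluster (0 or 1 here).
    if (threadIdx.x == 0)
        printf("block %u is rank %u in its cluster\n",
               blockIdx.x, cluster.block_rank());

    // All blocks in the cluster wait for each other.
    cluster.sync();
}

int main() {
    cluster_kernel<<<4, 64>>>();   // 4 blocks -> 2 clusters of 2 blocks
    cudaDeviceSynchronize();
    return 0;
}
```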

DPX instructions

The H100 also introduces DPX instructions, which accelerate dynamic programming: a technique that solves complex problems by breaking them into simpler subproblems, solving each subproblem once, and storing the results. DPX instructions on the H100 can accelerate these algorithms by up to 7x compared to the A100.
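To illustrate the kind of algorithm DPX targets, the toy sketch below computes Levenshtein edit distance, a classic dynamic programming recurrence built from additions and min operations (plain host code for clarity; on the H100, the min/add patterns at the heart of such recurrences are what DPX instructions accelerate, exposed through CUDA intrinsics and libraries):

```cuda
// Toy dynamic-programming recurrence: edit distance between two short strings.
// Each cell depends on previously solved subproblems via min/add operations.
#include <algorithm>
#include <cstdio>
#include <cstring>

int edit_distance(const char *a, const char *b) {
    int la = (int)std::strlen(a), lb = (int)std::strlen(b);
    int dp[64];                       // current row; previous row reused in place
    for (int j = 0; j <= lb; ++j) dp[j] = j;
    for (int i = 1; i <= la; ++i) {
        int prev = dp[0];             // dp[i-1][j-1]
        dp[0] = i;
        for (int j = 1; j <= lb; ++j) {
            int cur = dp[j];          // dp[i-1][j]
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            dp[j] = std::min({dp[j] + 1, dp[j - 1] + 1, prev + cost});
            prev = cur;
        }
    }
    return dp[lb];
}

int main() {
    printf("edit_distance = %d\n", edit_distance("hopper", "ampere"));  // prints 4
    return 0;
}
```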

Multi-Instance GPU technology

Multi-Instance GPU (MIG) technology allows the A100 to be partitioned into multiple isolated instances, each with its own memory, cache, and compute cores. This lets you allocate a single GPU to different users or workloads and increase overall hardware utilization.

The H100 ships with second-generation MIG, providing roughly 3x more compute capacity and nearly 2x more memory bandwidth per GPU instance. It also adds Confidential Computing support with trusted execution environments (TEEs) to secure workloads at the MIG level.

Memory architecture and bandwidth

The A100 features high-bandwidth memory: HBM2e on the 80 GB model, delivering around 2 TB/s of bandwidth. The H100 upgrades to HBM3, offering 80 GB of capacity and up to 3.35 TB/s of bandwidth, giving ultra-fast data transfer between memory and the GPU for processing large AI and HPC datasets.

In traditional GPU designs, including the A100, data exchanged between SMs has to travel through global memory. The H100 introduces distributed shared memory: SMs within a thread block cluster can read and write each other's shared memory directly, without a round trip through global memory. This reduces bottlenecks and data access times and enables faster synchronization between SMs.
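A sketch of how distributed shared memory is exposed in CUDA (again assuming CUDA 12+ and an sm_90 GPU): each block in a cluster can obtain a pointer into a peer block's shared memory through the cooperative groups API and read it directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block writes its own rank into shared memory, then reads the value
// written by the next block in its cluster, directly from that block's
// shared memory, without going through global memory.
__global__ void __cluster_dims__(2, 1, 1) dsm_kernel(unsigned int *out) {
    __shared__ unsigned int my_value;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) my_value = cluster.block_rank();
    cluster.sync();   // make every block's shared memory visible cluster-wide

    // Pointer into the peer block's copy of my_value (distributed shared memory).
    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    unsigned int *peer_value = cluster.map_shared_rank(&my_value, peer);

    if (threadIdx.x == 0) out[blockIdx.x] = *peer_value;
    cluster.sync();   // keep peer shared memory alive until all reads finish
}

int main() {
    unsigned int *out;
    cudaMallocManaged(&out, 4 * sizeof(unsigned int));
    dsm_kernel<<<4, 32>>>(out);   // 4 blocks -> 2 clusters of 2 blocks
    cudaDeviceSynchronize();
    for (int i = 0; i < 4; ++i) printf("block %d read %u\n", i, out[i]);
    cudaFree(out);
    return 0;
}
```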

NVIDIA NVLink and NVSwitch

In practice, AI and HPC workloads run across multiple GPUs working together on the same problem. Ampere supports NVLink and NVSwitch technology, allowing multiple GPUs to be interconnected at high speed. NVLink on the A100 delivers twice the throughput of the previous generation, with up to 600 GB/s of bandwidth between GPUs.

Hopper GPUs feature fourth-generation NVLink, raising total bandwidth to 900 GB/s per GPU and further minimizing communication bottlenecks between GPUs.
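On a multi-GPU node, you can check whether two GPUs can address each other's memory directly (over NVLink when it is present, otherwise PCIe) using the CUDA peer-to-peer API. A minimal host-side sketch, assuming at least two GPUs in the system:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("Need at least two GPUs\n"); return 0; }

    // Can device 0 directly access device 1's memory (peer-to-peer)?
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    printf("GPU0 -> GPU1 peer access: %s\n", can_access ? "yes" : "no");

    if (can_access) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // route GPU0<->GPU1 copies directly
    }
    return 0;
}
```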

NVIDIA GPUs H100 vs. A100: Performance differences

What is the performance impact of all these hardware changes? NVIDIA's own benchmarks show a significant boost for the H100: up to 4x higher GPT-3 training performance and up to 30x higher inference performance compared to the A100. Non-AI dynamic programming workloads also see up to a 7x improvement thanks to the DPX instructions.

H100 vs. A100 performance benchmarks (NVIDIA)

While the results may be accurate for NVIDIA testing, they don’t necessarily translate across the board for every workload. For example, independent benchmarking by CoreWeave shows a 2-3X increase in training speed.

On average, you can expect a performance improvement of roughly 2-5x with the H100 for training and 10-20x for inference.

NVIDIA GPUs H100 vs. A100: Cost differences

The H100 costs significantly more than the A100 on paper. But does it really cost more once performance is factored in? Let's dive in. (All costs below are in USD.)

Cloud costs

If you are running your workloads in the cloud, you may have to pay anywhere from $2-$3/GPU/hour for the H100, depending on the provider. In contrast, the A100 costs $1-$1.5/GPU/hour.

Let's perform the calculations assuming an average price of $2.50/GPU/hour for the H100 and $1.25/GPU/hour for the A100.

 

GPU  | Training resources | Training time                     | Cost per hour (4 GPUs) | Total cost
A100 | 4 GPUs             | 10 hours                          | $1.25 × 4 = $5         | $50
H100 | 4 GPUs             | 4 hours (assuming a 2.5x speedup) | $2.50 × 4 = $10        | $40

As you can see, the total cost is lower for the H100 even with conservative estimates: whenever the H100's speedup exceeds its price premium (roughly 2x here), it works out cheaper. The more you optimize your workloads to exploit the H100's performance and scale, the greater the savings.
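For completeness, here is the same back-of-the-envelope model as the table, written out so you can plug in your own rates and speedup (the prices and the 2.5x speedup are the assumptions from above, not measured values):

```cuda
// Back-of-the-envelope cloud cost model using the assumed rates above.
#include <cstdio>

int main() {
    const double a100_rate = 1.25, h100_rate = 2.50;   // $/GPU/hour (assumed)
    const int gpus = 4;
    const double a100_hours = 10.0, h100_speedup = 2.5;
    const double h100_hours = a100_hours / h100_speedup;

    printf("A100 total: $%.2f\n", a100_rate * gpus * a100_hours);   // $50.00
    printf("H100 total: $%.2f\n", h100_rate * gpus * h100_hours);   // $40.00
    // The H100 wins whenever its speedup exceeds its price ratio (2.50/1.25 = 2x).
    printf("Break-even speedup: %.1fx\n", h100_rate / a100_rate);
    return 0;
}
```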

Datacenter costs

Datacenter GPUs have their own role and purpose. The upfront cost of an 80 GB A100 ranges from $15,000 to $20,000, while an 80 GB H100 ranges from $35,000 to $40,000. When purchasing hardware, investing in the newer generation is usually the better long-term choice, since older technology depreciates and becomes obsolete more quickly.

It is also worth noting that the H100's Tensor Core design is more power-efficient per unit of work, with NVIDIA citing roughly 30% lower operand delivery power for comparable workloads. Finishing the same jobs faster and more efficiently translates directly into lower power bills.

Organizations that reserve GPU cloud instances 24x7 are often better off purchasing the hardware outright: at on-demand rates, you could pay more than the hardware cost in a single year of operation.

TRG Datacenters takes care of GPU colocation for you. We specialize in high-density racks and can take care of management and setup for you to deploy remotely. We can provide air-cooled, immersion, or direct-to-chip cooling with multiple power options and 15+ carriers on site.

What does the H100 have that the A100 doesn’t?

The H100 has many features that the A100 either lacks entirely or offers only in an older form. The table below summarizes them.

Feature                    | A100                                        | H100
Tensor Cores               | Third-generation Tensor Cores               | Fourth-generation Tensor Cores: up to 6x faster chip-to-chip, 2x MMA rates per SM, 4x with FP8
FP64/FP32 processing rates | About 3x slower than the H100               | 3x faster IEEE FP64 and FP32 rates chip-to-chip
Memory subsystem           | HBM2e, up to about 2 TB/s bandwidth (80 GB model) | HBM3, up to 3.35 TB/s bandwidth
L2 cache                   | 40 MB                                       | 50 MB, able to cache large portions of models and datasets
MIG technology             | First-generation MIG                        | Second-generation MIG: 3x more compute capacity and nearly 2x more memory bandwidth per instance
NVLink                     | Third-generation NVLink, 600 GB/s total bandwidth | Fourth-generation NVLink, 900 GB/s total bandwidth (a 50% increase)
NVSwitch                   | Older NVSwitch technology                   | Third-generation NVSwitch: 13.6 Tbit/s switch throughput, hardware acceleration for collective operations
PCIe                       | PCIe Gen 4, 64 GB/s total bandwidth         | PCIe Gen 5, 128 GB/s: double the bandwidth between the GPU and the rest of the system
DPX instructions           | Not available                               | Accelerate dynamic programming algorithms (non-AI) by up to 7x
Thread block clusters      | Not available                               | Group thread blocks across multiple SMs for better control and synchronization
Distributed shared memory  | Not available                               | Allows direct SM-to-SM communication within a cluster
Asynchronous execution     | Not available                               | Tensor Memory Accelerator (TMA) for faster data transfers between global memory and SMs
Transformer Engine         | Not available                               | Mixes FP8 and 16-bit precision dynamically, for up to 9x faster training and 30x faster inference
Confidential Computing     | Not available                               | First GPU with native Confidential Computing: MIG-level TEEs and PCIe line-rate protection
NVLink Switch System       | Not available                               | Connects up to 32 nodes or 256 GPUs with 57.6 TB/s of all-to-all bandwidth

NVIDIA A100 vs. H100 comparison – how to choose?

Choosing between the NVIDIA A100 and H100 depends on your workload needs. Since the H100 is optimized for transformer models, it is a must-have for LLM workloads. Whether you are training your own LLM or fine-tuning an existing one, the H100 delivers better cost-performance in the long run.

The same goes for non-AI HPC workloads, such as scientific and medical research: the H100 offers much better performance and will help you reach your research goals faster.

However, for smaller workloads like machine learning analytics, OCR, NLP, or fraud detection with fewer users, the A100 is more cost-effective. It provides sufficient performance without the higher investment of the H100. If your workload doesn’t justify the scale and upfront cost of the H100, the A100 is a great choice.

Conclusion

For decades, NVIDIA GPUs have reliably met performance requirements for data science, gaming, and other computing-intensive projects. NVIDIA is constantly innovating its hardware architecture to keep up with changing technologies. Organizations moving to LLM/AI adoption will benefit significantly from the architectural upgrades in H100. Cut down your power consumption and increase GPU usage for various use cases with the H100.

FAQs

How much better is H100 vs A100?

The NVIDIA H100 offers significantly improved performance over the A100, with up to 3x faster AI training and 30x faster AI inference for large language models. It features fourth-generation Tensor Cores, enhanced memory bandwidth, and additional architectural improvements, making it substantially more powerful for AI and HPC workloads.

What is the difference between H100 80GB and A100 40GB?

The H100 80GB has double the memory of the A100 40GB and uses faster HBM3 memory, providing roughly twice the bandwidth (3.35 TB/s vs. 1.6 TB/s). The H100 also features fourth-generation Tensor Cores and FP8 precision, enabling faster performance for AI workloads than the A100.

How much faster is H100?

The H100 delivers up to 9x faster AI training and 30x faster AI inference on large language models compared to the A100, thanks to its advanced features like the transformer engine, improved Tensor Cores, and enhanced memory bandwidth. This makes it a superior choice for demanding AI applications.

Looking for GPU colocation?

Leverage our unparalleled GPU colocation and deploy reliable, high-density racks quickly & remotely