GPUYard


Maximizing GPU ROI: How to Partition NVIDIA A100 & H100 with MIG

July 17, 2026

Most AI teams provision GPUs the way they provision servers: one workload, one full device. But running a 7B-parameter inference endpoint or a batch embedding job doesn't require a full 80GB of HBM memory. When you run lightweight workloads on a full A100 or H100, most of the silicon sits idle while your budget burns away. NVIDIA’s Multi-Instance GPU (MIG) technology solves this by physically dividing a single GPU into up to 7 independent, hardware-isolated instances. Here is a quick breakdown of how it works and how you can implement it to maximize your hardware ROI. Why Choose MIG Over Time-Slicing? Unlike software-based sharing methods (like CUDA MPS or time-slicing), where processes cooperatively share resources, MIG offers true hardware-level isolation. Dedicated Resources: Each MIG instance gets its own assigned Streaming Multiprocessors (SMs), memory, and L2 cache. Zero Resource Contention: A massive, memory-heavy request on one instance cannot starve, slow down, or...
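As a rough illustration of the "up to 7 instances" limit, here is a minimal sketch that checks whether a requested mix of MIG profiles stays within the slice budget of an 80GB A100 or H100. The profile table reflects the commonly documented 80GB profiles; the function name `fits_slice_budget` is hypothetical, and the check is only a necessary condition. Real MIG placement rules are stricter, so treat the driver's own tooling as the authority.

```python
# Slice counts per MIG profile on an 80GB A100/H100: (compute slices, memory slices).
# Assumption: the GPU exposes 7 compute slices and 8 memory slices in total.
MIG_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def fits_slice_budget(requested):
    """Return True if the requested profiles fit the total slice counts.

    This is a simplified sanity check (hypothetical helper): it ignores the
    placement constraints that the real hardware also enforces.
    """
    compute = sum(MIG_PROFILES[name][0] for name in requested)
    memory = sum(MIG_PROFILES[name][1] for name in requested)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_SLICES


if __name__ == "__main__":
    print(fits_slice_budget(["1g.10gb"] * 7))          # seven small instances
    print(fits_slice_budget(["4g.40gb", "4g.40gb"]))   # needs 8 compute slices
```

Seven `1g.10gb` instances use all 7 compute slices, which is where the seven-instance ceiling comes from; two `4g.40gb` instances would need 8 compute slices and so cannot coexist.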

Why Network Latency is Killing Your AI App in Europe

July 10, 2026

Every millisecond between a user's request and your AI model's response is a design decision. For live applications like chatbots, recommendation engines, or real-time fraud detection, network latency is often the difference between a product that feels instant and one that feels broken. If your GPU infrastructure sits in the wrong place, you are fighting a losing battle against physics. Here is what you need to know about optimizing AI inference for the UK and Europe. 1. Training Latency vs. Inference Latency Are Not the Same It is easy to lump "AI performance" into one bucket, but training and inference have completely different tolerances for delay: Training jobs running for 12 hours do not care if a data batch takes an extra 200 milliseconds to load. Live inference is synchronous. A user is actively waiting on the other end. If your pipeline involves multiple steps (API gateway → database context → GPU compute → response), a poorly optimized network will ruin the user experience...
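To make the multi-step pipeline point concrete, here is a small sketch of a latency budget. All numbers are illustrative assumptions, not measurements: the per-stage processing times, the round-trip times, and the helper name `pipeline_latency_ms` are hypothetical. The only idea taken from the text is that every hop in the chain (API gateway → database context → GPU compute → response) pays the network cost again.

```python
def pipeline_latency_ms(stage_ms, hop_rtt_ms, hops):
    """Total request latency: processing time of every stage plus one
    network round trip per hop between stages (hypothetical model)."""
    return sum(stage_ms.values()) + hops * hop_rtt_ms


# Assumed processing time per stage, in milliseconds (illustrative only).
stage_ms = {"api_gateway": 5, "database_context": 20, "gpu_compute": 120}
hops = 3  # the three arrows in: gateway -> database -> GPU -> response

nearby = pipeline_latency_ms(stage_ms, hop_rtt_ms=5, hops=hops)
distant = pipeline_latency_ms(stage_ms, hop_rtt_ms=90, hops=hops)

if __name__ == "__main__":
    print(f"nearby region:  {nearby} ms")
    print(f"distant region: {distant} ms")
```

Under these assumed numbers the compute time is identical in both cases; the entire difference comes from the per-hop round trip, which is multiplied by the number of hops.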

The 2026 Guide to NVLink 5.0 on Blackwell GPU Servers

June 24, 2026

If you are running large-scale AI training or inference workloads in 2026, one technology separates the systems that truly scale from those that merely pretend to: NVLink 5.0 on NVIDIA Blackwell GPU servers. Most guides on this topic stop at "NVLink is fast." That is not enough. If you are provisioning, configuring, or operating a Blackwell-based GPU server, you need to understand the full picture: how the hardware topology actually works, how to configure NCCL and IMEX correctly, and how to avoid the operational pitfalls that burned early adopters. Key Takeaways You Need to Know: Massive Bandwidth: NVLink 5.0 delivers 1.8 TB/s bidirectional bandwidth per GPU. The "One Massive GPU" Topology: The GB200 NVL72 rack connects 72 GPUs in a single flat NVLink domain with 130 TB/s aggregate bandwidth. The Bandwidth Cliff: Crossing an NVLink domain boundary without proper topology-aware scheduling causes a severe bandwidth drop from more than 800 GB/s to roughly 100–200 GB/s. H...
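The figures quoted above can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers from the excerpt (1.8 TB/s per GPU, 72 GPUs, and the 800 GB/s versus 100 GB/s ends of the bandwidth cliff); the 40 GB payload size and the helper name `transfer_seconds` are assumptions chosen for illustration, and the model ignores latency and protocol overhead.

```python
PER_GPU_GBPS = 1800   # 1.8 TB/s bidirectional per GPU, expressed in GB/s
GPUS_PER_DOMAIN = 72  # GPUs in one GB200 NVL72 NVLink domain

# Aggregate bandwidth of the domain, in GB/s (129.6 TB/s, quoted as ~130 TB/s).
aggregate_gbps = PER_GPU_GBPS * GPUS_PER_DOMAIN


def transfer_seconds(payload_gb, bandwidth_gbps):
    """Idealized transfer time: payload size divided by bandwidth."""
    return payload_gb / bandwidth_gbps


payload_gb = 40  # hypothetical payload, e.g. a large gradient exchange
inside_domain = transfer_seconds(payload_gb, 800)
across_boundary = transfer_seconds(payload_gb, 100)

if __name__ == "__main__":
    print(f"aggregate: {aggregate_gbps / 1000} TB/s")
    print(f"inside domain:   {inside_domain:.3f} s")
    print(f"across boundary: {across_boundary:.3f} s")
```

With these inputs, the same payload takes eight times longer once it crosses the domain boundary at the low end of the quoted range, which is the practical meaning of the bandwidth cliff.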

The Open-Source Robotaxi Revolution: Inside NVIDIA Alpamayo 2 Super

June 04, 2026

For years, the autonomous vehicle (AV) industry operated on a simple rule: the more proprietary your AI stack, the bigger your competitive moat. NVIDIA just challenged that assumption head-on at GTC Taipei 2026. With the launch of Alpamayo 2 Super, a 32-billion-parameter open reasoning Vision Language Action (VLA) model, NVIDIA is betting that an open-source ecosystem will accelerate Level 4 autonomy faster than any closed-loop approach ever could. If you are an AV developer, machine learning researcher, or infrastructure engineer, this release completely rewrites your development roadmap. 5 Core Upgrades Under the Hood Alpamayo 2 Super isn't just a minor iteration; it triples the scale of previous 10B models and introduces deep reasoning capabilities directly into the perception loop: 3× Parameter Scale (32B): Built on NVIDIA Cosmos, delivering vastly superior 3D spatial understanding and long-tail scenario handling. 360° Surround Perception: Expands from front-focused camera c...

How to Secure AI Workloads: NVIDIA Blackwell Confidential Computing Setup

May 21, 2026

Securing enterprise artificial intelligence workloads is no longer optional. When processing sensitive financial data, healthcare records, or proprietary foundation models, encrypting data at rest and in transit is simply not enough. You must protect "data in use." NVIDIA Confidential Computing (CC) on the Blackwell architecture (like the B200) solves this by leveraging hardware-based Trusted Execution Environments (TEEs). This ensures that neither the hypervisor, the host operating system, nor the infrastructure provider can access the unencrypted weights or datasets running on the GPU. The 4 Essential Steps to Enable Hardware Isolation To shift your AI security posture from perimeter defense to hardware-enforced isolation, you need to configure your infrastructure across four main layers: Step 1: The BIOS Level You must first enable a CPU Trusted Execution Environment (AMD SEV-SNP or Intel TDX) and secure PCIe lane isolation in your server BIOS. Step 2: The...

NVIDIA H100 PCIe vs SXM: Which Multi-GPU Architecture is Best for Your AI Workload?

May 15, 2026

The AI arms race has made the NVIDIA H100 the undisputed standard for Large Language Models (LLMs). But when building a multi-GPU server, many engineering leaders make a critical, budget-draining mistake: misunderstanding the difference between PCIe and SXM architectures. Here is the quick breakdown of what you actually need to know before provisioning your AI hardware: 1. SXM & NVSwitch (The Heavyweight) Best for: Training frontier-scale foundation models (GPT-4 class) from scratch. The Tech: Fanless GPUs mounted on a custom HGX baseboard. The NVSwitch allows all 8 GPUs to communicate simultaneously at up to 900 GB/s per GPU. The Catch: It is massive architectural overkill and a huge budget drain for 95% of AI startups and mid-size enterprises. 2. PCIe + NVLink Bridge (The Smart Compromise) Best for: LLM fine-tuning (LoRA/QLoRA), RAG pipelines, and high-throughput inference. The Tech: Standard plug-in cards. By connecting pairs of PCIe GPUs with physical NVLink bridges, you bypas...
