
GTC 2026: Next Phase of GPU Infrastructure — Inference, Agentic AI, and Physical AI

VESSL AI
7 min read

Introduction

At GTC 2026, Jensen Huang opened with a line that reframed the entire week: "2025 was the year of inference." And if 2025 was when inference arrived, GTC 2026 made one thing clear: the forces accelerating it are moving even faster.

Three themes dominated the week: agentic tools that are compressing the AI development cycle from weeks to hours, Physical AI emerging as a genuinely continuous GPU workload, and a hardware roadmap built entirely around the assumption that inference demand has no ceiling.

The VESSL team spent the week at GTC 2026, splitting time between our booth, the keynotes, and the sessions. In this post, we share the key insights we took away: the hardware shift driving inference infrastructure, how agentic tools are concretely changing the velocity of AI development, the emergence of Physical AI as a continuous GPU workload, what enterprise adoption of Physical AI actually looks like on the ground, and where the neocloud landscape stands as all of this accelerates.


1. The Inference Shift, and How It Redesigned the Stack

Source: Jensen Huang Keynote — GTC 2026, March 17

The move from training to inference as the dominant GPU workload isn't just a demand shift — it's a structural one. Training is a project. Inference is a permanent operating cost that scales with every user, every query, and every agentic workflow added to the stack. The two are fundamentally different infrastructure problems:

| Criteria | Training | Inference |
| --- | --- | --- |
| Demand pattern | Project-based, one-time | 24/7, traffic-proportional |
| Bottleneck | FLOPS | Memory bandwidth, latency |
| Cost structure | Short-term CapEx | Long-term OpEx |
| Reasoning model impact | Limited | KV cache explosion → demand spike |
| Budget implication | Scoped CapEx: plan, train, done | Open-ended OpEx: scales with users, agents, and query complexity |
| Procurement decision | Capacity planning around project timelines | Ongoing vendor selection, cost optimization, and failover strategy |
| Infrastructure risk | Job fails mid-run, restart and retry | Sustained downtime = lost revenue, broken user experience |
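
To make the cost asymmetry concrete, here is a back-of-the-envelope sketch in Python. Every number in it (GPU-hour price, token counts, user figures) is an assumption chosen for illustration, not a quote from GTC or any vendor:

```python
# Illustrative only: every price, token count, and user figure below is an
# assumption made up for this sketch, not a number quoted at GTC or by a vendor.

TRAINING_GPU_HOURS = 200_000        # one-time training project (assumed)
PRICE_PER_GPU_HOUR = 3.00           # assumed $/GPU-hour

training_capex = TRAINING_GPU_HOURS * PRICE_PER_GPU_HOUR  # paid once, then done

def monthly_inference_opex(users: int, queries_per_user: int,
                           tokens_per_query: int, price_per_1m_tokens: float) -> float:
    """OpEx grows with every user, every query, and every token of reasoning trace."""
    tokens = users * queries_per_user * tokens_per_query
    return tokens / 1_000_000 * price_per_1m_tokens

print(f"training (one-time): ${training_capex:,.0f}")
for users in (10_000, 50_000, 250_000):
    cost = monthly_inference_opex(users, queries_per_user=100,
                                  tokens_per_query=2_000, price_per_1m_tokens=2.00)
    print(f"inference at {users:,} users: ${cost:,.0f}/month")
```

The point isn't the absolute figures; it's the shape. The training line is paid once, while the inference line compounds with adoption and with the length of reasoning traces.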

The most visible signal at GTC wasn't a new spec number. It was a form factor change. For years, the default was a GPU module slotted into a standard server. With Blackwell, the flagship product is now the GB200 — a rack-scale system that tightly integrates CPU and GPU. The reason is straightforward: as models get larger and reasoning traces get longer, GPU memory alone isn't enough. The architecture had to expand to match the workload.

Looking further ahead, Vera Rubin, NVIDIA's next-generation platform announced at GTC, pushes this further still, delivering a 35x improvement in throughput per megawatt over GB300. It's designed explicitly for always-on, agentic workloads: the kind of continuous, multi-model inference that doesn't have an off switch.

For cloud providers, the practical implication is a portfolio question. GB200 and GB300 are the right fit for large-scale inference and Physical AI at volume. HGX B200 — more modular, compatible with existing infrastructure — remains the right choice for the majority of fine-tuning and general inference workloads today.

The direction is clear. The hardware roadmap isn't hedging — it's betting fully on a world where inference is the dominant, always-on workload.


2. Agentic AI and the Velocity of Experimentation

Source: "Open Models: Where We Are and Where We're Headed" — GTC 2026 Panel Session

Panelists: Jensen Huang (NVIDIA), Aravind Srinivas (Perplexity), Harrison Chase (LangChain), Arthur Mensch (Mistral), Misha Laskin (Reflection AI), Robin Rombach (Black Forest Labs), Hanna Hajishirzi (AI2), and others

One of the most important shifts GTC 2026 surfaced wasn't about hardware at all. It was about what happens when the bottleneck on AI development stops being engineering hours and starts being compute.

Two new pieces of NVIDIA infrastructure make this concrete. OpenClaw is an open-source agentic operating system that lets AI agents use tools, manage files, spawn sub-agents, and complete multi-step tasks autonomously. NemoClaw layers enterprise-grade policy sandboxing and guardrail enforcement on top, so those agents can be deployed safely at scale. Together, they represent the shift from AI that responds to AI that executes.

The implication is structural: when agents can autonomously design experiments, run them, evaluate results, and iterate, the number of experiments a team runs is no longer gated by headcount. It scales with compute. Every agentic workflow added to a team's stack generates inference calls at every step, continuously, not just when a human triggers it.
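
As a rough illustration of that multiplier, consider a minimal agent loop. This is a generic sketch, not the OpenClaw or NemoClaw API; the `llm` and `tool` stubs stand in for a hosted model endpoint and a sandboxed tool runtime:

```python
# Hypothetical sketch: one human prompt fans out into many inference calls,
# because every plan/act/reflect step in the loop is itself a model call.
# `llm` and `tool` are stand-in stubs, not any real product's API.

INFERENCE_CALLS = 0

def llm(prompt: str) -> str:
    """Stand-in for a hosted model endpoint; every call is billed inference."""
    global INFERENCE_CALLS
    INFERENCE_CALLS += 1
    return "FINISH" if "step 3" in prompt else "search_docs"  # canned behavior

def tool(action: str) -> str:
    """Stand-in for sandboxed tool execution (search, file I/O, code run)."""
    return f"results of {action}"

def run_agent(task: str, max_steps: int = 5) -> None:
    history = [f"task: {task}"]
    for step in range(max_steps):
        action = llm(f"{history} step {step}: next action?")   # plan: inference call
        if action == "FINISH":
            break
        history.append(tool(action))                           # act: tool execution
        history.append(llm(f"{history}: evaluate result"))     # reflect: inference call

run_agent("summarize GTC sessions")
print(INFERENCE_CALLS)  # one human prompt, seven inference calls in this toy run
```

Scale that loop across sub-agents and always-on workflows, and inference traffic decouples from human activity entirely.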

What this looks like in practice came through clearly in conversations with healthcare AI teams at GTC. One representative example: a fully MCP-orchestrated drug discovery pipeline, triggered by a single user prompt, that automatically chained FDA database search, protein structure analysis via OpenFold 3, compound generation through BioNeMo (thousands of SMILES candidates in minutes), docking simulation, binding evaluation, and final report generation. No fine-tuning required; the agent ran on Nemotron Super via MCP tool calls alone. What previously took a research team the better part of a week was compressed into a single automated pipeline run.
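
To show the shape of that pipeline (and only the shape), here is a hedged reconstruction in Python. The tool names are hypothetical MCP tools, not the team's actual code, and `mcp_call` stands in for an MCP client invoking a named tool on some backend:

```python
# Hedged reconstruction of the pipeline shape described above, not the team's
# actual implementation. All tool names are hypothetical MCP tools.

def mcp_call(tool: str, **kwargs) -> dict:
    """Stand-in for an MCP client invoking a named tool on a remote backend."""
    return {"tool": tool, "args": list(kwargs), "result": "..."}

def discovery_pipeline(target: str) -> dict:
    prior_art  = mcp_call("fda_search", query=target)                 # FDA database search
    structure  = mcp_call("predict_structure", protein=target)       # OpenFold 3 in the example
    candidates = mcp_call("generate_compounds", structure=structure)  # BioNeMo SMILES generation
    docking    = mcp_call("dock_compounds", structure=structure,
                          compounds=candidates)                       # docking simulation
    binding    = mcp_call("score_binding", docking=docking)           # binding evaluation
    return mcp_call("write_report", prior_art=prior_art, binding=binding)

report = discovery_pipeline("example-target")  # the whole chain from one prompt
```

Each stage is an inference or simulation workload in its own right, which is exactly why a single prompt can generate hours of sustained GPU demand.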


3. Physical AI: Continuous Demand, Real Constraints

Source: "Physical AI in Enterprises: What's Real, What Scales, What's Next" — GTC 2026 Panel Session (Siemens, Volkswagen Mexico, Deloitte)

Physical AI was the most substantive new theme at GTC — not as a concept, but as a demonstrably distinct compute demand pattern.

What makes Physical AI different from conventional AI workloads is that it isn't episodic. Training is a project with a start and end. Physical AI creates continuous GPU demand: simulation, synthetic data generation, validation, and redeployment run in parallel, on an ongoing basis. NVIDIA's COSMOS platform — covering world model reasoning, prediction, and video transfer — is designed precisely for this pattern.
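
A minimal sketch of what that demand pattern looks like from a scheduler's point of view, assuming nothing about COSMOS itself: each stage below is a concurrent loop with no terminal state, unlike a training job that runs to completion and releases its GPUs.

```python
# Minimal sketch of the always-on pattern. The stages and periods are made up;
# asyncio.sleep stands in for real GPU work.

import asyncio

async def stage(name: str, period_s: float) -> None:
    """Stand-in for one GPU-backed stage of the Physical AI loop."""
    while True:                                   # no off switch by design
        print(f"running {name}")
        await asyncio.sleep(period_s)             # placeholder for real GPU work

async def physical_ai_loop() -> None:
    await asyncio.gather(                         # all stages run in parallel
        stage("simulation", 1.0),
        stage("synthetic_data_generation", 2.0),
        stage("validation", 3.0),
        stage("redeployment", 5.0),
    )

# asyncio.run(physical_ai_loop())  # left commented: it runs indefinitely by design
```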

Across teams working in robotics, healthcare, and manufacturing, a common theme emerged: the bottleneck isn't compute availability, it's integration. Physical AI development spans three distinct domains: large-scale cloud training, simulation-based validation, and edge deployment. Most teams are still stitching them together manually (sketched after the list below). A few representative examples:

  • Noble Machines (humanoid robotics): Jetson Thor for training, DGX Spark for inference — picks out objects on voice command
  • LEM Surgical (surgical robotics): Jetson Thor running Isaac for Healthcare
  • Luminary (Physical AI simulation): H100-based, using Physical NeMo rather than general-purpose VLMs
  • Digital Biology (protein structure prediction): RTX PRO 6000 for inference
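
Here is what that manual stitching looks like, written as a hypothetical pipeline spec. None of these fields correspond to a real product API; they exist only to make the seams between the three domains visible.

```python
# Hypothetical illustration of the integration gap: three domains, three
# stacks, glued together by hand. Every field name and value is made up.

pipeline = {
    "cloud_training": {                      # domain 1: large-scale training
        "cluster": "h100-cluster",
        "framework": "pytorch",
        "output_artifact": "s3://models/policy-v12",
    },
    "simulation_validation": {               # domain 2: sim-based validation
        "simulator": "warehouse-sim",
        "input_artifact": "s3://models/policy-v12",  # copied across by hand today
        "pass_threshold": 0.95,
    },
    "edge_deployment": {                     # domain 3: on-device inference
        "target": "jetson-fleet",
        "model_format": "tensorrt-engine",   # a separate, manual conversion step
        "rollback_on_failure": True,
    },
}
```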

The vertical scope is expanding well beyond robotics — autonomous vehicles, surgical simulation, factory monitoring, and drug discovery. Each vertical has a different compute profile, but all of them share the same characteristic: GPU demand that doesn't switch off.

The enterprise picture is more nuanced. A panel with Siemens and Volkswagen Mexico offered the most grounded perspective of the week, summarized in one line: "This is an evolution, not a revolution."

The opportunity is real. During the panel, Siemens pointed to hundreds of billions of dollars in potential factory productivity gains over the next decade. But the path is layered: from physical robots handling specific tasks today, through GenAI-powered scheduling and flexible automation, to real-time digital twins for operational decisions. The hardest step isn't deploying robots. It's keeping the digital twin live — factory data tends to freeze at the design stage, and making it continuously reflect real production conditions is where most teams get stuck.

For infrastructure providers, the implication is direct: Physical AI enterprise adoption creates sustained, long-term GPU demand — not a one-time training spike. Simulation, digital twin maintenance, and continuous redeployment require elastic, always-on compute.


4. The Neocloud Landscape: Marketing vs. Real Demand

Source: GTC 2026 — Conversations with hyperscaler and neocloud teams on the expo floor

A consistent pattern emerged across leading neoclouds at GTC: market with the latest GPU, sell what's actually in demand.

Booth after booth put B200, GB200, and B300 front and center. But as one hyperscaler engineer put it candidly: "The most popular resources are still H100s, then A100s. B300-scale demand remains limited to the largest foundation model labs."

The gap between marketing and actual demand is deliberate, not inconsistent:

  • Enterprise credibility: Signals cutting-edge hardware access and a close NVIDIA relationship
  • Demand preemption: When B200/GB300 demand normalizes, vendors with existing operational experience will be first in line
  • Price anchoring: A visible B200 makes H100 look like the rational, cost-effective choice

Simultaneously, leading neoclouds are building out private cloud offerings to capture on-premise enterprise demand: node-level single-tenant isolation, dedicated campus builds for hyperscale customers, full cloud stacks installed inside customer data centers, and pure-play private cloud targeting sovereign AI and data residency.

This matters particularly in regulated markets. When enterprise customers say they want "on-premise," what they often actually want is control and compliance assurance — not physical hardware ownership. Private cloud architecture, positioned as "on-premise feel, cloud-managed operations," directly addresses this without the operational burden of true on-premise.


What We Took Away

GTC 2026 didn't introduce a single breakthrough. It confirmed a direction — and made the pace of that direction hard to ignore.

Agentic tools have changed the velocity of AI development itself. Physical AI has introduced a genuinely continuous category of GPU demand — and enterprise adoption, while measured and layered, is structurally committed to the long term. Inference is no longer a workload type; it's the exhaust of an entire system that keeps accelerating. And the hardware roadmap — from GB200 to Vera Rubin — is designed for a world where that acceleration doesn't slow down.

What GTC made clear is that the breadth of this shift — across inference, agentic systems, and Physical AI — doesn't sit comfortably within a single provider. The workloads are too varied, the demand too elastic, and the pace of change too fast for any single cluster or cloud to absorb.

That's the infrastructure problem VESSL was built to solve: orchestrating GPU capacity across hyperscalers and neoclouds, matching the right compute to the right workload as demand shifts. Whether your team is scaling Physical AI simulation, running large-scale training, or managing elastic inference workloads, we'd love to show you how it works.

Try VESSL Cloud | Talk to our team

