Robotics is a distributed computing problem. Engineers must think across the stack, from basic hardware up to software, data centers, and networking. Eventually, wealth will be created with software, or, more precisely, with an app-store-like infrastructure through which robots can be enhanced with apps for various tasks. As with the iPhone, the utility of robots will be continuously improved through over-the-air software updates.
Hardware is the robot and all the silicon required to train and run the robot. Software is the abstraction layer that instructs the robot on what to do next. A comprehensive, whole-stack view of the problem is cost-efficient and will bring products to market much faster.
Tesla has the engineering expertise and focus to build high-quality cars and humanoid robots at scale and low cost. Tesla also has a data pipeline that captures scenes for real-world AI learning.
Nvidia is the world leader in data center architecture for large-scale machine learning, i.e., pattern recognition from large data sets. Nvidia also has a lead in 3D rendering and simulation.
A combination with Tesla would provide Nvidia with the industrial scale and data pipeline to feed their massive training infrastructure for the creation of foundation models for robots.
Together, Nvidia and Tesla could rapidly iterate and solve both hard physics and AI problems. Vertical integration allows for rapid iteration and accelerates time to market.
A merger between Nvidia and Tesla would create the first deep engineering company.
The case for a merger between Nvidia and Tesla:
In his keynote at GTC 24, Nvidia CEO Jensen Huang made it clear that embodied decision-making agents (robotics) are a core focus for the company. Nvidia wants to bring AI to the real world. To start with, they developed a unified representation for real-world objects, a platform they call "Omniverse". It’s a platform for connecting and building industrial metaverse applications, one that allows teams to collaborate in a shared virtual space in real time. Some key aspects of Nvidia Omniverse include:
3D Design Collaboration: It enables designers, artists, and engineers to work together on complex 3D scenes and assets concurrently from anywhere in the world.
Physics Simulation: Omniverse provides advanced physics simulation capabilities, allowing creators to accurately simulate the physical behaviors of objects and environments.
Photorealistic Rendering: It leverages Nvidia's RTX technology to generate physically accurate, photorealistic, ray-traced graphics in real time.
AI Training and Deployment: Omniverse integrates AI models and tools, enabling developers to train and deploy AI for various use cases like perception, recommendation systems, and digital twins.
Universal Scene Description: It uses Pixar's Universal Scene Description (USD) for describing and sharing 3D scene data across different applications (a minimal example follows below).
Scalable and Open Platform: Omniverse is designed to be scalable and open, supporting multiple engines, renderers, and software ecosystems through developer kits.
The core idea is to provide a virtual world simulation and collaboration platform that connects different applications, assets, and teams involved in the design and development process across industries like manufacturing, architecture, entertainment, and more.
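To make USD concrete, here is a minimal sketch using OpenUSD's Python bindings (the `pxr` package); the file name, prim paths, and geometry are illustrative choices, not anything taken from Omniverse itself:

```python
# Minimal sketch: author a tiny USD scene with OpenUSD's Python bindings.
# Assumes the open-source usd-core package (pip install usd-core).
from pxr import Usd, UsdGeom

# Create a new stage -- the container for a USD scene graph.
stage = Usd.Stage.CreateNew("factory_cell.usda")  # illustrative file name

# Define a transform ("Xform") prim as the scene root, and a cube under it.
world = UsdGeom.Xform.Define(stage, "/World")
robot_base = UsdGeom.Cube.Define(stage, "/World/RobotBase")
robot_base.GetSizeAttr().Set(0.5)  # half-meter cube as stand-in geometry

# Save the layer; any USD-aware application (e.g., Omniverse) can open it.
stage.SetDefaultPrim(world.GetPrim())
stage.GetRootLayer().Save()
```

The point is the shared format: designers, simulators, and renderers all read and write the same scene description instead of exchanging exports.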
Omniverse, according to Jensen, is the CUDA of today. Nvidia has been developing CUDA since 2006. It is a parallel computing platform and application programming interface (API) that enables software to use GPUs for accelerated general-purpose processing. It provides direct access to the GPU's virtual instruction set and parallel computational elements. By exposing that virtual instruction set, CUDA lets programmers solve higher-level software problems while exploiting adaptability and flexibility at the hardware level.
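To make that concrete, here is a minimal vector-add sketch of CUDA's programming model, written from Python via Numba's CUDA bindings; the kernel, sizes, and block configuration are illustrative:

```python
# Minimal sketch of CUDA-style parallelism from Python, via Numba's CUDA JIT.
# Assumes an NVIDIA GPU plus the numba and numpy packages.
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # global thread index across the launch grid
    if i < out.size:          # guard against threads past the array end
        out[i] = a[i] + b[i]  # each GPU thread handles one element

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)  # launch: grid x block

assert np.allclose(out, a + b)
```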
CUDA brought the GPU to the AI world primarily by making parallel computing easier for practitioners. Omniverse serves a similar purpose.
One of the core ideas for the future is the multimodal foundation model for embodied decision-making agents, a concept that has captured academia as much as industry. To quote Sherry Yang from DeepMind, “Both academia and industry are coming at it from different directions. Industry wants better, more entertaining videos, and academia wants to use videos to enable better simulation.” Sherry refers to the opportunity of generative video as a platform for robot learning: generate videos from real-world images as prompts, use the resulting vast video library to train foundation models for robots, and then solve the sim-to-real problem.
Nvidia wants to become the foundry for robotics. What does that mean? In the chip industry, a foundry is a massive factory that others use to manufacture their own designs. What Jensen wants to do in robotics is train massive multimodal models and offer them as pre-trained transformers for others to fine-tune.
Let's analyze this a bit more. What is a multimodal transformer? To answer that question, let’s first ask: What is a transformer? Since the publication of the seminal paper “Attention Is All You Need”, transformers have taken the AI world by storm. Pre-training a large language transformer means running a model that detects statistical patterns between words across internet-scale text. Once pre-trained, the transformer can be fine-tuned for certain tasks, such as chatbots. A fine-tuned chatbot is able to predict the next word in a sequence of words, i.e., a sentence. For example, if you prompt “I am walking on a … and enjoying the sunset”, the blank will be filled with the most likely word, which is “beach”. Transformers are a compression of knowledge stored in latent space.
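Here is a minimal sketch of that next-word prediction, using Hugging Face's transformers library with the small GPT-2 model; the model choice and prompt are illustrative, and a model this small may rank a different word first:

```python
# Minimal sketch of next-word prediction with a pre-trained transformer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I am walking on a"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The logits at the last position score every candidate next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10}  p={prob:.3f}")
```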
In the context of machine learning and artificial intelligence, latent space represents a compressed or low-dimensional vector space where items with similar characteristics are positioned closer together. This space is derived from the input data and captures essential features and attributes, enabling the development of efficient AI models by enhancing their capacity to comprehend and process information.
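A toy illustration of the idea, with three-dimensional vectors fabricated for readability (real models learn embeddings with hundreds or thousands of dimensions):

```python
# Toy illustration of a latent space: nearby vectors represent similar items.
# The embeddings below are made up for illustration, not learned.
import numpy as np

embeddings = {
    "beach":  np.array([0.9, 0.1, 0.3]),
    "shore":  np.array([0.8, 0.2, 0.4]),
    "server": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["beach"], embeddings["shore"]))   # high: related
print(cosine_similarity(embeddings["beach"], embeddings["server"]))  # low: unrelated
```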
Now, take what GPT or Gemini have done with text and imagine the same thing but with images, video, sound, and other sensors. All of these signals can be represented in a machine-readable form: vectors, i.e., tokens, in a shared embedding space. Omniverse’s Universal Scene Description is exactly that. It captures the real world, dissects it into machine-readable segments, i.e., tokens, and runs a model that figures out patterns between those tokens. Later, an inference engine can use this compressed knowledge of real-world patterns to generate actions that are grounded in the real world. In other words, a multimodal transformer can capture physical behavior without knowing anything about physics, the same way ChatGPT can capture the poetic nuances of Shakespeare without knowing anything about poetry.
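A toy sketch of that tokenization step, with random projections standing in for learned ones and shapes chosen purely for illustration:

```python
# Toy sketch of multimodal tokenization: turn pixels and words into one
# shared token sequence -- the form a multimodal transformer consumes.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding width for all modalities

# Text: map words to embedding vectors (random stand-ins for learned ones).
vocab = {"pick": 0, "up": 1, "the": 2, "cube": 3}
text_table = rng.normal(size=(len(vocab), d_model))
text_tokens = text_table[[vocab[w] for w in "pick up the cube".split()]]

# Vision: split a fake 32x32 image into 8x8 patches, flatten, project to d_model.
image = rng.normal(size=(32, 32))
patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)
patch_proj = rng.normal(size=(64, d_model))
image_tokens = patches @ patch_proj

# One sequence of interchangeable tokens: this is what the transformer sees.
sequence = np.concatenate([text_tokens, image_tokens])
print(sequence.shape)  # (20, 64): 4 word tokens + 16 patch tokens
```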
Nvidia wants to train such models. Multimodal foundation models require a comprehensive view of compute, storage, memory, networking, data, energy, and many other aspects of the compute stack.
But it doesn’t end here. Robotics is a distributed computing problem, and it includes the robot itself. You can’t think of robotics as a hardware product with some computation attached. The stack starts all the way at the edge with the robot and ends in the data center.
That’s where Tesla comes in. To drive robotics forward, you need both Moore’s Law and Wright's Law on your side (Wright's Law: costs fall by a roughly constant percentage with every cumulative doubling of units produced). Industrial manufacturing of robots at scale cannot be separated from the compute stack; it’s one whole. Tesla has expertise in hard engineering: solving difficult physical and technical problems at industrial scale. Whether it’s developing low-cost, high-performance EVs, batteries, FSD, or charging stations, Tesla has been at it for two decades. The DNA of the company is to solve hard engineering problems. The key feature that distinguishes Tesla from any other engineering company is that they are not just driven to solve technical problems but to do so at ever lower cost. Tesla embraces Wright's Law as much as Moore’s Law. Nvidia needs both.
"Nvidia’s soul," says Jensen, is the combination of computer graphics, physics, and AI. They have two of the three: graphics and know-how to run AI at scale. But what about physics? That’s where Nvidia falls short. So far, all the massive sales growth for Nvidia has come from customers in the virtual world. Whether it’s Open AI training GPT or Amazon training models under the Bedrock platform, Nvidia has enjoyed massive growth in data center infrastructure for generative AI training.
But Nvidia’s physics track record is rather prosaic. When we first invested in Nvidia in 2012, we thought Nvidia could become a big player in the self-driving car space. We thought autonomous electric cars would replace ICE cars much faster because they’d be able to drive an order of magnitude more miles and thus cut the cost per mile by a similar factor. Today we are happy shareholders of Nvidia, but to be honest, we held the shares for the wrong reason. We thought the company would become a powerhouse in robotics; in fact, it became the market leader in data center infrastructure for AI.
Today, we feel it’s time for that second leg to kick in. And Jensen made it clear at GTC 24 that he agrees with us. Nvidia is embarking on a massive investment cycle to drive robotics forward. For this to accelerate, they need a hard engineering company, and Tesla is the only obvious player that brings scale, expertise, and, most importantly, the mindset of a company that is willing to iterate, learn, and drive both Moore’s Law and Wright’s Law forward.