AI's Physical Intuition: How V-JEPA Learns Like a Baby (2026)

Imagine a baby watching a glass of water disappear behind a wooden board. When the board moves past the spot where the glass should be, the baby is surprised. This simple experiment reveals a profound truth: even infants grasp the concept of object permanence—the idea that things exist even when we can't see them. Now, a groundbreaking AI model is learning to do the same.

But here's where it gets fascinating: this isn't just about mimicking human behavior. Researchers at Meta have developed an AI system called Video Joint Embedding Predictive Architecture (V-JEPA) that learns to understand the physical world by watching videos. Unlike traditional AI models that get bogged down in pixel-level details, V-JEPA focuses on the bigger picture, capturing essential information and ignoring the noise. Think of it as teaching an AI to see the forest, not just the trees.

And this is the part most people miss: V-JEPA doesn’t just recognize objects; it develops a sense of “surprise” when something defies its learned understanding of physics. For instance, if a ball disappears behind an object and doesn’t reappear, V-JEPA reacts much like a curious infant—it’s surprised. This ability to intuit physical laws is a giant leap forward in AI, one that could revolutionize how machines interact with the world.
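The "surprise" idea can be made concrete. In predictive architectures like this one, surprise is typically just the gap between what the model predicted and what it then observed, measured in its internal representation space. Here is a toy sketch of that idea; the function name, vectors, and distance metric are all illustrative assumptions, not Meta's actual implementation:

```python
import numpy as np

# Hypothetical sketch: "surprise" as the distance between what a predictor
# expected to see next and what the encoder actually observed.
# The names, shapes, and numbers below are illustrative, not V-JEPA's API.

def surprise_score(predicted_embedding: np.ndarray,
                   observed_embedding: np.ndarray) -> float:
    """L2 distance in representation space: larger means more surprising."""
    return float(np.linalg.norm(predicted_embedding - observed_embedding))

# A scene that unfolds as predicted yields a small mismatch...
expected = np.array([0.2, 0.8, -0.1])
observed_normal = np.array([0.21, 0.79, -0.12])
# ...while a ball that fails to reappear yields a large one.
observed_impossible = np.array([0.9, -0.5, 0.7])

assert surprise_score(expected, observed_normal) < surprise_score(expected, observed_impossible)
```

In the published experiments, researchers measured exactly this kind of prediction error and found it spikes on physically impossible videos, which is what licenses the comparison to infant surprise.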

But let’s dive deeper. Traditional AI systems often struggle with understanding videos because they treat every pixel equally. Imagine trying to navigate a busy street while fixating on the rustling leaves instead of the traffic light. V-JEPA, however, uses latent representations—a fancy way of saying it focuses on the essential details, like the shape, size, and movement of objects, rather than getting lost in the minutiae. This approach allows it to predict what should happen next in a video, even when parts of it are hidden.
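To see why predicting in latent space differs from predicting pixels, here is a deliberately tiny sketch of the joint-embedding idea: encode patches into compact latents, hide some of them, and score the prediction of the *hidden latents* rather than the hidden pixels. Everything here (the linear "encoder", the mean-based "predictor", the dimensions) is a stand-in assumption, not the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of the JEPA idea (illustrative only): map video patches into a
# compact latent space, mask some of them, and predict the latents of the
# masked patches -- the loss lives in latent space, never in pixel space.

def encode(patches: np.ndarray, W: np.ndarray) -> np.ndarray:
    """A stand-in encoder: a linear projection from pixels to latents."""
    return patches @ W

num_patches, pixel_dim, latent_dim = 8, 16, 4
patches = rng.normal(size=(num_patches, pixel_dim))   # fake video patches
W = rng.normal(size=(pixel_dim, latent_dim)) * 0.1    # fake encoder weights

latents = encode(patches, W)

# Mask half the patches: the predictor only ever sees the visible latents.
mask = np.zeros(num_patches, dtype=bool)
mask[::2] = True                      # every other patch is hidden
visible = latents[~mask]

# A trivial "predictor": guess each hidden latent as the mean visible latent.
predicted_hidden = np.tile(visible.mean(axis=0), (int(mask.sum()), 1))

# Training would minimize this latent-space error; pixel noise like rustling
# leaves never enters the objective because it is abstracted away by encode().
loss = float(np.mean((predicted_hidden - latents[mask]) ** 2))
print(f"latent prediction loss: {loss:.4f}")
```

The point of the sketch is the shape of the objective: because the error is computed between latent vectors, the model is never penalized for failing to reproduce irrelevant pixel detail.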

Here’s where it gets controversial: While V-JEPA’s ability to mimic human intuition is impressive, some experts argue it’s still missing a key element—uncertainty. Karl Friston, a computational neuroscientist, points out that V-JEPA doesn’t quantify how uncertain its predictions are. For example, if past frames don’t provide enough information to predict future events, the model doesn’t acknowledge this uncertainty. Is this a flaw, or just the next challenge to overcome? What do you think?
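To make the critique concrete, one standard way to quantify predictive uncertainty (emphatically *not* something V-JEPA currently does) is to run an ensemble of predictors and treat their disagreement as an uncertainty estimate. This toy sketch is a generic technique, with all names and shapes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One common recipe for predictive uncertainty (NOT part of V-JEPA): run an
# ensemble of predictors and measure how much they disagree. High disagreement
# signals that the context underdetermines the future. Illustrative toy only.

def ensemble_predict(context: np.ndarray, weights: list) -> tuple:
    """Return the ensemble's mean prediction and its disagreement (variance)."""
    preds = np.stack([context @ W for W in weights])  # one prediction per model
    return preds.mean(axis=0), float(preds.var(axis=0).mean())

context = rng.normal(size=4)
# Near-identical models agree -> low uncertainty; diverse models disagree.
similar = [np.eye(4) + rng.normal(scale=0.01, size=(4, 4)) for _ in range(5)]
divergent = [rng.normal(size=(4, 4)) for _ in range(5)]

_, low_unc = ensemble_predict(context, similar)
_, high_unc = ensemble_predict(context, divergent)
assert low_unc < high_unc
```

A model with this kind of signal could say "the past frames don't pin down what happens next" instead of confidently guessing, which is roughly the capability Friston argues is missing.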

In June 2025, Meta unveiled V-JEPA 2, a more advanced version trained on 22 million videos. The team even applied it to robotics, showing how the model could plan a robot’s actions after training on roughly 60 hours of robot interaction data. But despite these strides, V-JEPA 2 still has limitations. On a tougher test of intuitive physics, it performed only slightly better than chance. Why? Because its memory is short-lived, much like a goldfish’s: it can handle just a few seconds of video at a time, forgetting anything longer.
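What does "planning with a world model" look like? A common scheme is to sample candidate actions, roll each one forward with the learned predictor, and pick whichever lands the predicted state closest to the goal. The sketch below is a generic one-step version of that idea; the fake dynamics, dimensions, and function names are assumptions for illustration, not V-JEPA 2's method:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sketch of planning with a learned world model (illustrative only):
# sample candidate actions, predict where each one leads, and choose the
# action whose predicted latent state is closest to the goal latent.

def plan_one_step(state, goal, dynamics, num_candidates=64):
    # Include the "do nothing" action so the plan is never worse than idling.
    actions = np.vstack([np.zeros_like(state),
                         rng.normal(size=(num_candidates, state.shape[0]))])
    predicted = np.array([dynamics(state, a) for a in actions])
    costs = np.linalg.norm(predicted - goal, axis=1)  # distance to goal
    return actions[np.argmin(costs)]

# Fake latent dynamics: next state = state + action (a stand-in assumption).
dynamics = lambda s, a: s + a
state = np.zeros(3)
goal = np.array([1.0, -1.0, 0.5])

best = plan_one_step(state, goal, dynamics)
# The chosen action should leave us no farther from the goal than idling.
assert np.linalg.norm(dynamics(state, best) - goal) <= np.linalg.norm(state - goal)
```

Real systems repeat this loop at every timestep (replanning as new observations arrive), but the short memory horizon the article mentions limits how far ahead such rollouts can usefully reach.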

So, here’s the big question: Can AI ever truly match human intuition, or will it always fall short in some ways? And if it does, what does that mean for the future of autonomous systems? Let us know your thoughts in the comments—this is a conversation worth having.
