
Can AI Understand the Laws of Physics? Google DeepMind Used 396 Videos to Find the Truth

This is an interesting story about artificial intelligence. As we see AI creating increasingly realistic videos, even generating cinematic-level visuals, have you ever wondered: Do these AIs truly understand the laws governing the physical world? Or are they merely skilled at imitation, like a talented painter who can create a lifelike apple but doesn’t really understand why the apple falls from the tree?

The research team at Google DeepMind recently conducted an in-depth study on this question. Led by Saman Motamed of INSAIT, Sofia University, in collaboration with Google DeepMind, the paper was first posted to arXiv in January 2025 and can be accessed via arXiv:2501.09038. The team also includes Google DeepMind researchers Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos, who jointly designed a new evaluation benchmark called “Physics-IQ.”

The core question of the research is quite simple: Do current AI models that can generate stunning videos, such as OpenAI’s Sora, Runway Gen 3, and Pika 1.0, truly understand the fundamental laws of the physical world? Or have they just learned to stitch together images that look realistic through powerful computational capabilities and vast amounts of data?

To answer this question, the research team created a test dataset containing 396 real videos, much like giving AI students a physics exam. These videos cover five main areas of physics: solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism. Each video is a carefully designed physical experiment scenario, such as what happens when a rubber duck is placed in the middle of a collapsing domino setup, or how the reactions differ when throwing a kettle and a piece of paper onto a pillow.

The research team used high-quality Sony Alpha a6400 cameras to capture each scene from three different angles: left, center, and right, with each scene filmed twice to capture the natural variations of real-world physical phenomena. This approach ensures the rigor of the testing, akin to needing a control group in medical research.

The testing method is quite clever. The research team shows the first 3 seconds of the video to the AI model as the “question” and then asks it to predict what will happen in the next 5 seconds. This is similar to showing students the initial frame of a ball rolling off the edge of a table and then asking them what will happen next. If the AI truly understands the physical laws of gravity and inertia, it should be able to accurately predict that the ball will fall in a parabolic trajectory, rather than flying towards the ceiling or suddenly stopping in mid-air.
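
To make that protocol concrete, here is a minimal sketch of the conditioning/prediction split, assuming each 8-second clip is stored as a frame array at a fixed frame rate. The frame rate, array shapes, and function names are illustrative assumptions, not taken from the released benchmark code.

```python
# Illustrative sketch of the 3-second conditioning / 5-second prediction split.
# Assumes `frames` is a (T, H, W, 3) array covering the full 8-second clip.
import numpy as np

def split_clip(frames: np.ndarray, fps: int = 30,
               condition_s: int = 3, predict_s: int = 5):
    """Return (conditioning frames shown to the model, ground-truth future)."""
    cut = condition_s * fps
    return frames[:cut], frames[cut:cut + predict_s * fps]

# Example: an 8-second clip at 30 fps yields 90 conditioning frames and a
# 150-frame ground-truth continuation that the model's output is scored against.
clip = np.zeros((8 * 30, 180, 320, 3), dtype=np.uint8)  # placeholder video
conditioning, ground_truth = split_clip(clip)
print(conditioning.shape[0], ground_truth.shape[0])  # 90 150
```

Multi-frame models can condition on the whole 3-second segment, while image-to-video models see only a single frame from it, a distinction that matters for the results discussed later.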

To assess the AI’s performance, the research team designed four evaluation criteria. The first criterion is called “spatial IoU,” which simply checks if the AI correctly predicts the location of the action, similar to testing whether students can accurately point out where the ball will land on the floor. The second is “spatiotemporal IoU,” which examines not only the correctness of the location but also the timing of when the ball should land. The third is “weighted spatial IoU,” which assesses whether the intensity of the action is reasonable; for example, the impact of a heavy object falling should be greater than that of a light object. The final one is “mean squared error” (MSE), which is the strictest standard requiring the visual details to be as close to the real situation as possible.
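
For readers who prefer code, the following is a minimal sketch of how these four scores could be computed from motion maps extracted from the real and generated continuations. It is an assumption-laden illustration rather than the authors’ released evaluation code: `real_motion` and `gen_motion` are taken to be (T, H, W) arrays of per-frame motion magnitude, the RGB arrays are the raw frames, and the threshold value is arbitrary.

```python
import numpy as np

def binarize(motion, thresh=0.05):
    """Binary mask of pixels whose motion exceeds a threshold."""
    return motion > thresh

def iou(a, b, eps=1e-8):
    """Intersection-over-union of two binary masks (any matching shape)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / (union + eps)

def spatial_iou(real_motion, gen_motion):
    # "Where": collapse time and compare the overall footprint of motion.
    return iou(binarize(real_motion).any(axis=0), binarize(gen_motion).any(axis=0))

def spatiotemporal_iou(real_motion, gen_motion):
    # "When": compare the full (T, H, W) volume, so timing matters too.
    return iou(binarize(real_motion), binarize(gen_motion))

def weighted_spatial_iou(real_motion, gen_motion, eps=1e-8):
    # "How much": weight the footprint by motion intensity instead of 0/1.
    r, g = real_motion.sum(axis=0), gen_motion.sum(axis=0)
    return np.minimum(r, g).sum() / (np.maximum(r, g).sum() + eps)

def mse(real_rgb, gen_rgb):
    # "How": the strictest criterion, raw pixel-level difference (lower is better).
    return float(np.mean((real_rgb.astype(np.float64) - gen_rgb.astype(np.float64)) ** 2))
```

Note that MSE runs in the opposite direction from the IoU-style scores (lower is better), which matters when the four are combined into an overall score.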

The research team tested eight of the most advanced AI video generation models currently available, including the well-known Sora, Runway Gen 3, Pika 1.0, as well as Lumiere, Stable Video Diffusion, and VideoPoet. The test results were both surprising and somewhat expected.

The results showed that even the best-performing multi-frame version of VideoPoet scored only 29.5% in physical understanding capability, while the theoretical maximum score is 100% (this perfect score is derived from comparing two real captures of the same scene). This indicates that even the strongest AI video models have significant room for improvement in understanding physical laws.
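
A hedged sketch of how that 100% ceiling works: a metric computed between a model’s output and the real continuation is compared against the same metric computed between the two real takes of the same scene, so take-to-take physical variance defines the upper bound. Function names and numbers below are illustrative, not the paper’s exact aggregation.

```python
def normalized_score(model_vs_real: float, real_vs_real: float) -> float:
    """Express a raw metric as a percentage of the real-vs-real reference value."""
    # Clip at 1.0 so that matching the variability between two real takes
    # counts as the 100% ceiling mentioned above.
    return min(model_vs_real / real_vs_real, 1.0) * 100

# Made-up example: a model reaching a spatiotemporal IoU of 0.12 against the
# real continuation, where two real takes of the same scene agree at 0.40,
# would land at 30% on that metric.
print(normalized_score(0.12, 0.40))  # 30.0
```

For MSE, where lower is better, the comparison would need to be inverted first, so treat this as the general idea of the normalization rather than a definitive implementation.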

Interestingly, the research team found almost no correlation between visual realism and physical understanding. Sora’s videos were the hardest for a multimodal AI judge to identify as fake (it succeeded only 55.6% of the time, close to the 50% random-guessing rate), indicating that its visual effects are indeed very realistic. However, in terms of physical understanding, Sora scored only 10%, ranking last among all tested models. It’s akin to an artist who can paint an extremely realistic bird but, if asked to predict the bird’s flight trajectory, might be completely at a loss.

The research also uncovered some interesting details. AI models that can accept multi-frame input (such as the multi-frame versions of VideoPoet and Lumiere) generally performed better than models that can only accept single images, which aligns with our intuition—seeing more information certainly allows for more accurate predictions. Additionally, the difficulty of different physical phenomena varies; all models performed relatively well in spatial localization (i.e., predicting where the action will occur) but struggled in time prediction and action intensity judgment.

In terms of performance in specific physics domains, each model had its own “strengths” and “weaknesses.” For example, in solid mechanics involving object collisions and material deformation, some models performed adequately; however, in fluid dynamics, such as liquid pouring and mixing, most models showed their limitations. Optical phenomena (like reflection and refraction), thermodynamic phenomena (like evaporation and heat transfer), and magnetic phenomena posed significant challenges for these AI models.

The research team showcased specific cases of success and failure in their paper. Successful examples include VideoPoet being able to correctly simulate the process of a rotating brush dipping into paint and applying it to a glass plate, and Runway Gen 3 accurately predicting the effect of red liquid being poured onto a rubber duck. However, the failures were equally thought-provoking: for instance, the AI could not correctly simulate the process of a ball falling into a plastic box or the physical reactions when cutting an orange with a knife.

This research reveals an important limitation of current AI technology: visual realism does not equate to a true understanding of the physical world. This finding has profound implications for the development of AI.

From a technical standpoint, the study indicates that current AI video generation models primarily rely on pattern matching and statistical learning, rather than a deep understanding of physical laws. They are like a student with an excellent memory who can recite everything from textbooks but doesn’t know how to apply this knowledge when faced with new situations.

The root of this limitation may lie in the training methods. Current AI models primarily learn through the method of “predicting the next frame.” While this approach has achieved great success in language models (like GPT), it may not be sufficient for understanding the physical world. The physical world involves complex concepts such as causality and action-reaction forces, which may require deeper reasoning abilities rather than just pattern recognition.

The research team also discussed a deeper philosophical question: Can one truly understand the world just through observation? This question has been a point of contention in the fields of artificial intelligence and cognitive science. One viewpoint argues that through extensive observation and prediction training, AI can eventually gain a deep understanding of the physical world, much like how human infants learn physical intuition by observing the world. Another viewpoint posits that true understanding requires interaction with the environment, the ability to conduct experiments, and observe causal relationships, rather than passively watching videos.

From a practical application perspective, the findings of this study have important implications for AI in various fields. For example, in autonomous driving, if AI cannot truly understand physical laws, it may fail to accurately predict the behavior of other vehicles or pedestrians. In robotics, robots lacking physical intuition may struggle with tasks requiring precise manipulation. In virtual reality and game development, this limitation could affect the realism of user experiences.

However, the research results are not entirely pessimistic. Although the overall performance of current models is unsatisfactory, they have demonstrated some degree of physical understanding in certain specific scenarios. This indicates that learning physical laws through observation is possible; the current technology just isn’t mature enough. With improvements in computational power, data set expansion, and algorithm enhancements, future AI models are likely to make breakthrough progress in physical understanding.

The research team also observed some intriguing phenomena. For instance, some more powerful models (like Runway Gen 3 and Sora) exhibited a “hallucination” phenomenon during generation, creating objects that did not originally exist. Interestingly, these hallucinations often remained consistent with the context of the scene; for example, a candle suddenly appearing in a match lighting scene indicates that the model possesses at least some contextual understanding ability.

The quality and design of the dataset are also noteworthy. Unlike many existing physical-reasoning benchmarks, Physics-IQ uses real-world videos rather than computer-generated synthetic footage. This avoids the distribution shift between real-world and synthetic data, making the evaluation results more reliable. Each scene is filmed from three different angles, with two takes per angle, which not only diversifies the data but also quantifies the natural take-to-take variation of real-world physical phenomena.
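
The arithmetic implied by that description is easy to check: with three camera angles and two takes per angle, 396 videos correspond to 66 distinct scenarios (the 66 is derived from the counts above, not quoted from the paper).

```python
# Dataset structure implied by the description above.
angles_per_scenario = 3   # left, center, right
takes_per_angle = 2       # each angle filmed twice
total_videos = 396

scenarios = total_videos // (angles_per_scenario * takes_per_angle)
print(scenarios)  # 66
```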

The innovation of the research methodology is commendable. By designing “out-of-distribution” scenarios that require deep physical understanding to solve (such as placing a rubber duck in the middle of dominoes), the research team ensured that the test could not be simply solved by memorizing training data, but required a true understanding of physical principles. This design approach is also highly relevant to other AI capability evaluation studies.

The design of the evaluation metrics is equally clever. Four different evaluation dimensions test the AI’s understanding of “where,” “when,” “to what extent,” and “how,” forming a relatively complete evaluation system. Although these metrics are all indirect measurements and cannot directly quantify physical phenomena themselves, the comprehensive information they provide is sufficient to assess the AI’s level of physical understanding.

It is worth noting that the research team’s use of a multimodal large language model (Gemini 1.5 Pro) to assess visual realism is also an interesting innovation. By having the AI determine which video is generated, the research team avoided the subjectivity issues of human evaluation while also demonstrating the current capabilities of AI technology in this area.
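
As a rough illustration of such a protocol (a two-alternative forced choice), the sketch below shows how a judge model could be queried over matched real and generated clips. `ask_which_is_generated` stands in for whatever multimodal API call is actually used (the paper used Gemini 1.5 Pro); it is a hypothetical placeholder, not a real library function.

```python
import random
from typing import Callable, List, Tuple

def two_afc_accuracy(video_pairs: List[Tuple[str, str]],
                     ask_which_is_generated: Callable[[str, str], str]) -> float:
    """video_pairs: (path_to_real, path_to_generated) for matched scenes."""
    correct = 0
    for real, gen in video_pairs:
        # Randomize presentation order so the judge cannot exploit position bias.
        if random.random() < 0.5:
            first, second, answer = real, gen, "second"
        else:
            first, second, answer = gen, real, "first"
        if ask_which_is_generated(first, second) == answer:
            correct += 1
    return correct / len(video_pairs)

# An accuracy near 0.5 means the judge is effectively guessing, which is how a
# figure like Sora's 55.6% should be read: its generated videos look real.
```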

Regarding research limitations, the team honestly acknowledged some shortcomings. For example, the evaluation metrics may be too strict for certain types of errors (like object hallucinations or camera transitions), which may have impacted the scores of certain models (particularly Sora). Additionally, while the metric design is comprehensive, it remains an indirect measurement of physical understanding and cannot directly assess the model’s grasp of physical principles.

From a broader perspective, this research touches on a core issue in the development of artificial intelligence: how to enable machines to truly understand the world, rather than merely imitate surface phenomena. This question is crucial not only in the fields of computer vision and video generation but also in many AI application areas such as natural language processing, robotics, and autonomous driving.

The research team has open-sourced the Physics-IQ dataset and evaluation code, providing valuable resources for future research. Other researchers can use this benchmark test to evaluate new models and advance the entire field. This open research attitude is vital for scientific progress.

Ultimately, this research teaches us an important lesson: surface realism and deep understanding are two different things. Although current AI video generation technology has reached an impressive level in visual effects, there is still a long way to go in understanding the physical laws that underpin these visual phenomena. This does not mean we should feel pessimistic about AI development; on the contrary, this discovery points to a clear direction for future research.

For ordinary users, this means maintaining a certain level of vigilance when using AI-generated video content, especially in applications requiring physical precision. For researchers, this work provides a clear challenge: how to enable AI not only to generate beautiful images but also to truly understand the physical world that supports these images.

Future research may need to explore new training methods, such as training in conjunction with physics simulators, introducing more interactive learning mechanisms, or developing new architectures capable of physical reasoning. Perhaps the real breakthroughs will come from interdisciplinary collaborations that integrate the latest achievements in computer science, physics, cognitive science, and neuroscience.

Regardless, Physics-IQ provides an important milestone, allowing us to quantify the true level of AI’s physical understanding and offering clear targets for future improvements. As the research team stated, while visual realism does not equal physical understanding, this finding itself is an important step towards advancing AI toward deeper intelligence. Readers interested in delving deeper into this research can access the complete paper via arXiv:2501.09038, and find related code and datasets on GitHub.

Q&A

Q1: What is the Physics-IQ test? How does it assess the physical understanding capability of AI video models?

A: Physics-IQ is an AI physical understanding capability test developed by Google DeepMind, containing 396 real physics experiment videos. The testing method involves showing the AI model the first 3 seconds of a video and asking it to predict the subsequent 5 seconds of physical changes. Four evaluation criteria (action location, timing accuracy, intensity reasonableness, and visual detail) are used to determine whether the AI truly understands physical laws such as gravity, collision, and fluid dynamics, rather than just stitching together images from memory.

Q2: Why does the Sora video look very realistic, but its physical understanding score is very low?

A: This is a key finding of the research: visual realism and physical understanding ability are two different things. Sora excels in visual effects, making it difficult for even AI assistants to recognize it as fake, but it primarily generates images through pattern matching rather than truly understanding physical laws. It’s like an artist who can paint a realistic apple but may not understand why the apple falls. This “surface skill” exposes its limitations in complex physical scenarios.

Q3: In which physical phenomena do current AI video models perform the worst? What impact does this have on practical applications?

A: AI models perform the worst in fluid dynamics, thermodynamics, and magnetism, and they also struggle with timing prediction and action intensity judgment. For example, they cannot accurately predict phenomena like liquid pouring or changes in heated objects. This means that in scenarios requiring precise physical simulation (such as autonomous driving, industrial simulation, and robotic operations), current AI may make erroneous judgments, affecting safety and reliability.

