
Editor’s note: This story has been updated to reflect that Yann LeCun’s work touches on ML research previously conducted by German computer scientist Jürgen Schmidhuber.

In the midst of the heated debate about AI sentience, conscious machines and artificial general intelligence, Yann LeCun, chief AI scientist at Meta, published a blueprint for creating “autonomous machine intelligence.”

LeCun has compiled his ideas in a paper that draws inspiration from progress in machine learning, robotics, neuroscience and cognitive science. It examines some ML work by German computer scientist and AI professor Jürgen Schmidhuber between 1990 and 2015. LeCun lays out a roadmap for creating AI that can model and understand the world, reason and plan to do tasks on different timescales.

While the paper is not a scholarly document, as pointed out by several others in the field, it does provide an interesting framework for thinking about the different pieces needed to replicate animal and human intelligence. It also shows how the mindset of LeCun, an award-winning pioneer of deep learning, has changed and why he thinks current approaches to AI will not get us to human-level AI.

A modular structure

One element of LeCun’s vision is a modular structure of different components inspired by various parts of the brain. This is a break from the popular approach in deep learning, where a single model is trained end to end. 

At the center of the architecture is a world model that predicts the states of the world. While world modeling has been discussed and attempted in different AI architectures, those models are task-specific and can’t be adapted to different tasks. LeCun suggests that, like humans and animals, autonomous systems must have a single flexible world model. 

“One hypothesis in this paper is that animals and humans have only one world model engine somewhere in their prefrontal cortex,” LeCun writes. “That world model engine is dynamically configurable for the task at hand. With a single, configurable world model engine, rather than a separate model for every situation, knowledge about how the world works may be shared across tasks. This may enable reasoning by analogy, by applying the model configured for one situation to another situation.”

LeCun’s proposed architecture for autonomous machines

The world model is complemented by several other modules that help the agent understand the world and take actions that are relevant to its goals. The “perception” module performs the role of the animal sensory system, collecting information from the world and estimating its current state with the help of the world model. In this regard, the world model performs two important tasks: First, it fills the missing pieces of information in the perception module (e.g., occluded objects), and second, it predicts the plausible future states of the world (e.g., where will the flying ball be in the next time step).

The “cost” module evaluates the agent’s “discomfort,” measured in energy. The agent must take actions that reduce its discomfort. Some of the costs are hardwired, or “intrinsic costs.” For example, in humans and animals, these costs would be hunger, thirst, pain, and fear. Another submodule is the “trainable critic,” which learns to predict the future costs of achieving a particular goal, such as navigating to a location or building a tool.

The “short-term memory” module stores relevant information about the states of the world across time and the corresponding value of the intrinsic cost. Short-term memory plays an important role in helping the world model function properly and make accurate predictions.

The “actor” module turns predictions into specific actions. It gets its input from all other modules and controls the outward behavior of the agent.

Finally, a “configurator” module takes care of executive control, adjusting all other modules, including the world model, for the specific task at hand. This is the key module that makes sure a single architecture can handle many different tasks. It adjusts the perception model, world model, cost function and actions of the agent based on the goal it wants to achieve. For example, if you’re looking for a tool to drive in a nail, your perception module should be configured to look for items that are heavy and solid; your actor module must plan actions to pick up the makeshift hammer and use it to drive the nail; and your cost module must calculate whether the object is wieldy and near enough, or whether you should be looking for something else within reach.

Interestingly, in his proposed architecture, LeCun considers two modes of operation, inspired by the dichotomy in Daniel Kahneman’s “Thinking, Fast and Slow.” The autonomous agent should have a fast and reflexive “Mode 1,” which directly links perceptions to actions, and a slower, more deliberate “Mode 2,” which uses the world model and other modules to reason and plan.
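The division of labor between these modules can be sketched in a few lines of code. Everything below is a hypothetical toy illustration — a 1-D “world” where the agent walks toward a goal position — not LeCun’s implementation; the module names simply mirror the roles described above.

```python
# Toy sketch of the modular agent described above (all names and the 1-D
# "world" are illustrative assumptions, not LeCun's actual architecture).

class Agent:
    def __init__(self, goal):
        # Configurator: sets the task (here, just a goal position)
        self.goal = goal
        self.memory = []  # short-term memory of observed states

    def perceive(self, observation):
        # Perception module: estimate the current world state
        self.memory.append(observation)
        return observation

    def world_model(self, state, action):
        # World model: predict the next state given an action
        return state + action

    def cost(self, state):
        # Cost module: "discomfort" grows with distance from the goal
        return abs(state - self.goal)

    def act(self, state):
        # Actor in "Mode 2": imagine each candidate action with the world
        # model and pick the one whose predicted state has the lowest cost
        return min((-1, 0, +1), key=lambda a: self.cost(self.world_model(state, a)))


def run(agent, start, steps=10):
    state = start
    for _ in range(steps):
        state = agent.world_model(state, agent.act(agent.perceive(state)))
    return state


print(run(Agent(goal=3), start=0))  # → 3 (reaches and stays at the goal)
```

Note that the actor here plans by querying the world model rather than reacting directly to the observation — that simulation step is what distinguishes Mode 2 from the reflexive Mode 1.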

Self-supervised learning

While the architecture that LeCun proposes is interesting, implementing it poses several big challenges. Among them is training all the modules to perform their tasks. In his paper, LeCun makes ample use of the terms “differentiable,” “gradient-based” and “optimization,” all of which indicate that he believes that the architecture will be based on a series of deep learning models as opposed to symbolic systems in which knowledge has been embedded in advance by humans. 

LeCun is a proponent of self-supervised learning, a concept he has been talking about for several years. One of the main bottlenecks of many deep learning applications is their need for human-annotated examples, which is why they are called “supervised learning” models. Data labeling doesn’t scale, and it is slow and expensive.

On the other hand, unsupervised and self-supervised learning models learn by observing and analyzing data without the need for labels. Through self-supervision, human children acquire commonsense knowledge of the world, including gravity, dimensionality and depth, object persistence and even things like social relationships. Autonomous systems should also be able to learn on their own.

Recent years have seen some major advances in unsupervised learning and self-supervised learning, mainly in transformer models, the deep learning architecture used in large language models. Transformers learn the statistical relations of words by masking parts of a known text and trying to predict the missing part.
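The masked-prediction objective can be illustrated with a toy stand-in for what transformers do at scale. The tiny corpus and bigram “model” below are illustrative assumptions, not a real transformer — they only show the idea of filling in a blanked-out word from learned co-occurrence statistics.

```python
# Toy illustration of masked prediction (not a real transformer): learn
# which words follow which from a tiny corpus, then guess a masked word
# from its left neighbor.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows each word — a crude stand-in for the
# statistical relations a transformer learns during pretraining
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_masked(prev_word):
    # "Unmask" by returning the most likely word to follow prev_word
    return following[prev_word].most_common(1)[0][0]

print(predict_masked("the"))  # → "cat", the most frequent follower of "the"
```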

One of the most popular forms of self-supervised learning is “contrastive learning,” in which a model learns the latent features of images by pulling together the representations of different augmented views of the same object — crops, color changes, different poses — while pushing apart the representations of unrelated images.

However, LeCun proposes a different type of self-supervised learning, which he describes as “energy-based models.” EBMs try to encode high-dimensional data such as images into low-dimensional embedding spaces that only preserve the relevant features. By doing so, they can compute whether two observations are related to each other or not.
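The core of that idea can be sketched briefly: a hypothetical encoder compresses each observation down to a couple of coarse features, and the distance between embeddings serves as the “energy,” low for related observations and high for unrelated ones. The encoder and data below are toy assumptions, not taken from LeCun’s paper.

```python
# Toy energy-based compatibility check (illustrative assumptions only).

def encode(obs):
    # Hypothetical encoder: keep only coarse "relevant" features of a
    # list of pixel-like values (here: their mean and their range)
    return (sum(obs) / len(obs), max(obs) - min(obs))

def energy(x, y):
    # Squared distance between embeddings: low energy = compatible pair
    ex, ey = encode(x), encode(y)
    return sum((a - b) ** 2 for a, b in zip(ex, ey))

frame_a = [0.1, 0.2, 0.3, 0.4]       # an observation
frame_b = [0.12, 0.21, 0.29, 0.41]   # a slightly changed, related observation
frame_c = [0.9, 0.1, 0.8, 0.0]       # an unrelated observation

print(energy(frame_a, frame_b) < energy(frame_a, frame_c))  # → True
```

Because the comparison happens in the low-dimensional embedding space, unpredictable pixel-level detail never has to be modeled — which is the property JEPA exploits.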

In his paper, LeCun proposes the “Joint Embedding Predictive Architecture” (JEPA), a model that uses EBM to capture dependencies between different observations. 


Joint Embedding Predictive Architecture (JEPA)

“A considerable advantage of JEPA is that it can choose to ignore the details that are not easily predictable,” LeCun writes. Basically, this means that instead of trying to predict the world state at the pixel level, JEPA predicts the latent, low-dimensional features that are relevant to the task at hand.

In the paper, LeCun further discusses Hierarchical JEPA (H-JEPA), a plan to stack JEPA models on top of each other to handle reasoning and planning at different time scales.

“The capacity of JEPA to learn abstractions suggests an extension of the architecture to handle prediction at multiple time scales and multiple levels of abstraction,” LeCun writes. “Intuitively, low-level representations contain a lot of details about the input, and can be used to predict in the short term. But it may be difficult to produce accurate long-term predictions with the same level of detail. Conversely high-level, abstract representation may enable long-term predictions, but at the cost of eliminating a lot of details.”

Hierarchical Joint Embedding Predictive Architecture (H-JEPA)

The road to autonomous agents

In his paper, LeCun admits that many questions remain unanswered, including how to configure the models to learn the optimal latent features, and what the precise architecture and function of the short-term memory module and its beliefs about the world should be. LeCun also says that the configurator module remains a mystery and that more work needs to be done to make it function correctly.

But LeCun clearly states that current proposals for reaching human-level AI will not work. For example, one argument that has gained much traction in recent months is that scale is all you need: some scientists suggest that by scaling transformer models with more layers and parameters and training them on bigger datasets, we’ll eventually reach artificial general intelligence.

LeCun disputes this theory, arguing that LLMs and transformers work only as long as they are trained on discrete values. 

“This approach doesn’t work for high-dimensional continuous modalities, such as video. To represent such data, it is necessary to eliminate irrelevant information about the variable to be modeled through an encoder, as in the JEPA,” he writes.

Another theory is “reward is enough,” proposed by scientists at DeepMind. According to this theory, the right reward function and correct reinforcement learning algorithm are all you need to create artificial general intelligence.

But LeCun counters that RL requires the agent to constantly interact with its environment, whereas much of the learning that humans and animals do happens through pure perception.

LeCun also pushes back on the hybrid “neuro-symbolic” approach, saying that the model probably won’t need explicit mechanisms for symbol manipulation, and he describes reasoning as “energy minimization or constraint satisfaction by the actor using various search methods to find a suitable combination of actions and latent variables.”

Much more needs to happen before LeCun’s blueprint becomes a reality. “It is basically what I’m planning to work on, and what I’m hoping to inspire others to work on, over the next decade,” he wrote on Facebook after he published the paper.
