Introduction
Vision—how we perceive, interpret and make sense of the visual world—is such a deeply human faculty that when we ask whether AI “sees like us,” we immediately run into profound questions. Can machines form visual understanding in the same conceptual and hierarchical way humans do? Or are they still essentially performing pattern recognition with very different internal representations?
DeepMind’s recent essay “Teaching AI to See the World More Like Humans Do” provides new insight into these questions: it analyses how current vision models organise the visual world differently from humans, proposes methods to align them more closely with human-judged structure, and shows that doing so improves performance.
In short: No, AI models do not yet “see” in the same way humans do—but DeepMind’s work shows promising steps toward bridging that gap. Let’s unpack what they found.
How humans and AI models organise visual information
In the DeepMind essay, the team begins by explaining that both humans and machine vision systems form representations of visual input, but that the organisation of these representations differs. Humans build mental representations such as “cat-ness”, “furriness”, “animalness”, “texture”, and “context”, which are richly interwoven; machines map images into high-dimensional embedding spaces where similarity is captured by distance (e.g. two sheep embed close together, while a sheep and a birthday cake embed further apart).
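To make “similarity as distance” concrete, here is a minimal sketch (not DeepMind’s code) of how cosine similarity in an embedding space captures the sheep-versus-cake intuition; the three-dimensional vectors are hand-made stand-ins for what a real vision encoder would output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher value = the model treats the two images as more alike."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, hand-made "embeddings"; in a real system these come from the vision model.
sheep_1 = np.array([0.90, 0.10, 0.05])
sheep_2 = np.array([0.85, 0.15, 0.10])
cake    = np.array([0.05, 0.10, 0.95])

print("sheep vs sheep:", round(cosine_similarity(sheep_1, sheep_2), 2))  # close together
print("sheep vs cake: ", round(cosine_similarity(sheep_1, cake), 2))     # further apart
```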
They use a classic “odd-one-out” task from cognitive science (given three images, pick the one that doesn’t belong) to compare humans and vision models. In many cases humans and models align: for an easy triplet such as a sheep, a tapir, and a birthday cake, both pick the cake. But in many other cases, especially when deeper conceptual or common-sense categories matter, they diverge: models may pick based on texture or background rather than semantic category.
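The odd-one-out decision itself can be read straight off such embeddings: the most similar pair “belongs together”, and the remaining item is the odd one out. A small illustrative sketch, again with toy vectors rather than the essay’s actual data:

```python
from itertools import combinations

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def odd_one_out(items: dict) -> str:
    """The most similar pair stays together; the third item is the odd one out."""
    best_pair = max(combinations(items, 2),
                    key=lambda p: cosine_similarity(items[p[0]], items[p[1]]))
    return next(name for name in items if name not in best_pair)

triplet = {
    "sheep":         np.array([0.90, 0.10, 0.05]),
    "tapir":         np.array([0.80, 0.25, 0.10]),
    "birthday cake": np.array([0.05, 0.10, 0.95]),
}
print(odd_one_out(triplet))  # -> birthday cake
```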
The essay provides visualisations: before alignment, the embedding map of a vision model is relatively unstructured, with animals, food, and furniture mixed together; after alignment with human-judged structure, the embedding space organises into clearly meaningful clusters (animals vs food vs objects).
One of the key observations: machine vision models, even very strong ones, often do not capture the hierarchical organisation of concepts the way humans do (e.g. sub-category → category → super-category). They might treat a cat and a dog as similar because they share visual features, but fail to group both under “mammal” or “animal” in the way humans readily do. DeepMind show that their alignment method brings the model’s internal distances into something like proportionality with human conceptual distances.
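One way to read that “proportionality” claim is as a rank-correlation check: do pairs humans judge as conceptually close also sit close in the model’s embedding space? A hedged sketch of such a check, with made-up numbers standing in for real behavioural data and real model distances:

```python
import numpy as np
from scipy.stats import spearmanr

# Each entry below refers to the pair at the same position in this list.
pairs = ["cat-dog", "cat-salmon", "cat-chair", "dog-chair", "chair-table"]

# Hypothetical human conceptual distances (0 = same concept, 1 = unrelated).
human_distance = np.array([0.20, 0.55, 0.90, 0.90, 0.30])

# Hypothetical distances read off a model's embedding space for the same pairs.
model_distance = np.array([0.25, 0.60, 0.80, 0.85, 0.40])

rho, _ = spearmanr(human_distance, model_distance)
print(f"rank correlation between human and model distances: {rho:.2f}")
```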
In short: machines “see” (in the sense of encoding) many of the same inputs, and can sort and classify them impressively—but they don’t yet see in the same structured, conceptual, hierarchical way humans do.
What DeepMind did to bring models closer to human-like vision
DeepMind propose a multi-step alignment method to improve how vision models mirror human conceptual structure. The key steps:
- Start with a powerful pretrained model (they used SigLIP-SO400M) and train a small adapter on a human-judgement dataset (the THINGS dataset of odd-one-out human decisions). The adapter is trained while the main model is frozen, preserving the pretrained skills (a minimal sketch of this adapter idea follows this list).
- Use this “teacher” model to generate a large synthetic dataset (called AligNet) of millions of human-like odd-one-out decisions across ~1 million images, far more than the human-collected set.
- Use the synthetic human-like dataset to fine-tune “student” models, so that their internal embedding spaces restructure toward human conceptual organisation.
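To make the adapter step from the first bullet more concrete, below is a minimal PyTorch sketch of the general idea: the backbone stays frozen, a small trainable head reshapes its embeddings, and a triplet-softmax objective rewards keeping the human-chosen pair closer together than either item is to the odd one out. The architecture, sizes, and loss here are illustrative assumptions, not DeepMind’s actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed width of the frozen backbone's embeddings

class Adapter(nn.Module):
    """Small trainable head applied on top of frozen backbone embeddings."""
    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(z), dim=-1)

def odd_one_out_loss(z_a, z_b, z_odd):
    """The human-kept pair (a, b) should be more similar than either is to the odd item."""
    sim_ab = (z_a * z_b).sum(dim=-1)
    sim_a_odd = (z_a * z_odd).sum(dim=-1)
    sim_b_odd = (z_b * z_odd).sum(dim=-1)
    logits = torch.stack([sim_ab, sim_a_odd, sim_b_odd], dim=-1)
    # Class 0 = "a and b belong together", matching the human decision.
    target = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, target)

adapter = Adapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Random tensors standing in for frozen-backbone embeddings of one batch
# of human-judged triplets (kept item a, kept item b, odd one out).
batch_size = 32
z_a, z_b, z_odd = (torch.randn(batch_size, EMBED_DIM) for _ in range(3))

optimizer.zero_grad()
loss = odd_one_out_loss(adapter(z_a), adapter(z_b), adapter(z_odd))
loss.backward()
optimizer.step()
print(f"one adapter training step, loss = {loss.item():.3f}")
```

In the same spirit, the student fine-tuning step would reuse this kind of triplet objective, with the teacher-generated AligNet labels standing in for the human decisions.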
The results: the aligned student models show much higher agreement with human judgments on odd-one-out and other similarity tasks, and they even show human-like uncertainty (the model’s uncertainty correlates with how long humans take to decide). Importantly, they also perform better on downstream tasks: few-shot learning (learning a new category from just a handful of examples) and robustness to distribution shift (handling images outside the training distribution) both improved.
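The two behavioural comparisons mentioned here, agreement and uncertainty, are straightforward to compute once you have the model’s choice probabilities for each triplet. A sketch with invented numbers (a real evaluation would use the actual human responses and reaction times):

```python
import numpy as np
from scipy.stats import spearmanr

# Model choice probabilities over the three items of each triplet (rows sum to 1).
model_probs = np.array([
    [0.80, 0.15, 0.05],
    [0.40, 0.35, 0.25],
    [0.05, 0.90, 0.05],
    [0.34, 0.33, 0.33],
])
human_choice = np.array([0, 0, 1, 2])           # which item each person picked
human_rt_sec = np.array([1.1, 2.6, 0.9, 3.2])   # hypothetical decision times

agreement = np.mean(model_probs.argmax(axis=1) == human_choice)
entropy = -(model_probs * np.log(model_probs)).sum(axis=1)
rho, _ = spearmanr(entropy, human_rt_sec)

print(f"odd-one-out agreement with humans: {agreement:.0%}")
print(f"model entropy vs human reaction time (rank corr): {rho:.2f}")
```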
Thus DeepMind demonstrate: aligning internal representations with human conceptual structure isn’t just philosophically interesting—it also yields more robust vision models.
Can we now say that AI “sees like humans”?
With this background, the question: Can AI see the world like humans? The answer: Not yet fully—but we’re getting closer, and DeepMind’s work makes important progress.
Where machines still fall short:
- They often rely on low-level cues (texture, background, colour) rather than high-level conceptual categories. DeepMind show examples where models pick the “odd one out” for reasons humans wouldn’t.
- Although alignment helps, it does not make model vision identical to human vision: humans draw on far more than object-category similarity, including context, purpose, embodied experience, peripheral vision, temporal continuity, intentions, and memory. Studies outside DeepMind also show that humans and machines differ on many visual tasks (e.g. out-of-distribution robustness).
- Current vision models still lack full embodiment and active perception: humans move their eyes, head, body, use memory, anticipate, simulate consequences, switch focus. Vision models typically passively classify static images rather than actively explore a scene.
- The internal semantic richness and experiential grounding of human vision (we see not just “a dog” but “a dog running in the park, excited, maybe fetching a ball, maybe hungry”) remain largely missing from standard vision models.
Where progress is strong:
- The fact that aligning the embedding structure to human-like conceptual distances improves both human-model agreement and downstream performance is a major step. DeepMind show the two go hand in hand.
- The method of generating a large synthetic dataset to capture human-like judgments (AligNet) shows a scalable way to bridge the gap between small human-annotation sets and large model needs.
- The improved robustness to distribution shift is particularly encouraging: aligning with human concepts appears to yield more generalisable vision.
- Conceptually, the idea that “seeing like humans” means more than object classification (it also includes organisation by conceptual hierarchy, context, and semantics) is gaining ground.
Implications for AI vision, robotics and human-AI interaction
- For AI systems: if a vision model is organised more like a human’s internal visual/conceptual map, it may integrate better with higher-level reasoning modules: planning, prediction, embodied action. This could enable better performance in robotics, AR/VR, autonomous vehicles, assistive vision systems.
- For human-AI alignment: when AI vision models mirror human semantic structure, their errors may feel more intuitive and predictable to humans, which can improve trust and interpretability and reduce bizarre mismatches (e.g. an AI treating a car and an airplane as unrelated where a human sees “large metal vehicles”).
- For research: The gap between human and machine vision remains a rich research frontier—particularly in embodiment, temporal perception, attention dynamics, peripheral vision, active exploration. DeepMind’s work suggests structural alignment is a fruitful axis.
- For safety/robustness: Models aligned with human structure are less brittle under distribution shift, which is crucial if vision AI is used in safety-critical domains (e.g., driver assistance).
- For philosophy/AI consciousness debates: While the work does not claim vision models “see” exactly like humans in an experiential sense, it reframes the question: rather than “can AI see like us?” the question becomes “how similar must internal representation organisation become for AI to behave and generalise like humans?”
Open questions and caveats
- How far does “alignment” go? DeepMind show improved structure, but full human-level vision includes aspects such as simulation of future states, imagination, embodied sensorimotor loops, memory of past experience, purposeful attention selection. These remain largely unaddressed in standard vision models.
- Embodiment & active perception: Humans don’t just passively look; we explore, interact, anticipate. Vision is tightly coupled with action (grasping, locomotion, social interaction). Will models ever integrate that sufficiently?
- Conceptual richness: Humans group categories not just by shape or colour, but function, causal relations, intentions, context. Can vision models extend beyond surface similarity to causal and functional similarity?
- Dataset and bias issues: Human judgments (e.g., odd-one-out) still come from limited tasks; synthetic datasets (like AligNet) approximate human judgments, but may miss cultural, experiential, individual variation.
- Interpretability & transparency: Even if embedding maps become human-like, are the internal mechanisms understandable? What does it mean for a model to “see like a human”?
- The gap in “understanding”: Some researchers argue that no matter how good embeddings become, machines still lack “common sense” or “grounded experience” that humans bring (see for example criticisms of large language models).
- Robustness and failure modes: Alignment reduces some failures, but does it eliminate brittle behaviour under adversarial or radically new environments? The real world remains messier than benchmark sets.
Conclusion
Yes, AI vision models are advancing toward seeing more like humans do. The work by DeepMind shows a meaningful path: aligning internal representations with human conceptual structure leads to better agreement with human judgments and better generalisation. But no, we cannot yet say that AI “sees the world like humans” in the full, rich sense of human perception, conceptualisation, and embodied experience.
For now, the situation is:
- machines see in their own way, largely driven by visual features, statistics, embeddings;
- humans see via richly structured, context-laden, embodied, hierarchical conceptual maps;
- by aligning the former toward the latter, we narrow the gap and improve utility—but the full gap remains.
From a practical viewpoint, this means AI vision systems are becoming more trustworthy, more integrated with human conceptual frames, and more robust. From a philosophical viewpoint, it raises rich questions about the nature of “seeing,” “understanding,” and the human-machine divide.
For you (and for anyone thinking about “can AI see like humans?”) the takeaway is: focus not just on “can the AI identify objects?” but on how its representations relate to human conceptual structure, how it handles novelty, context, purpose, and whether it can engage with active, embodied perception.
