Yimei (Bonnie) Liu, March 8th, 2026
If we lost one human ability but kept our intelligence, I think vision would be the hardest to give up.
Not because sunsets are beautiful.
But because vision organizes reality.
Before children speak, they see. Before children can reason abstractly, they track faces. Before children understand language, they distinguish objects from background. Intelligence does not start with logic. It starts with perception.
Nearly a third of the human cortex is involved in processing visual information, a hint at how central vision is to everyday reasoning for sighted people. But perception itself is broader than sight. When vision is absent or limited, other senses expand to take on a larger role, and the brain adapts. Vision simply offers one especially clear window into how perception works.
But what does it actually mean to see?
When light reflects off an object and reaches your eye, your brain does not receive the words tree or face. The retina converts photons into electrical signals. Those signals travel through the optic nerve to the visual cortex. There, perception begins — not with objects, but with edges.
In the earliest visual areas, neurons respond to simple features: lines at certain angles, contrasts between light and dark, and motion in specific directions. These are visual primitives. They are fragments.
As signals move deeper into the brain, those fragments combine. Edges come together to form contours. Contours form shapes. Shapes become recognizable structures. Only after several steps of processing do you consciously experience “that is a tree.”
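The "edge" stage of this hierarchy can be sketched with a toy convolution: a small filter that responds strongly where brightness changes from left to right, a rough analogue (not a model) of an orientation-selective neuron. Everything here is illustrative and hand-set; real filters, biological or artificial, are tuned by experience or training.

```python
# A tiny "edge detector": a 3x3 vertical-edge filter slid over a
# 5x5 grayscale patch. Illustrative only -- the values are hand-set.

def convolve3x3(image, kernel):
    """Apply a 3x3 kernel at every valid position (no padding)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            s = sum(kernel[di][dj] * image[i + di][j + dj]
                    for di in range(3) for dj in range(3))
            row.append(s)
        out.append(row)
    return out

# A dark-to-bright vertical boundary running down the patch.
patch = [
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]

# Sobel-style kernel: responds where the left and right sides differ.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = convolve3x3(patch, sobel_x)
print(response)  # → [[0, 36, 36], [0, 36, 36], [0, 36, 36]]
```

The response is zero on the flat dark region and large where the filter straddles the boundary: the filter "sees" only the edge, not the object.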
Image: Human visual pathway (eye → retina → visual cortex) [1]
This biological architecture quietly reshaped artificial intelligence.
For decades, computers were good at calculation but terrible at perception. An image was just a grid of numbers representing pixel intensities, lacking any inherent meaning. Early computer vision efforts depended on handcrafted rules — defining edges, setting thresholds, hard-coding shapes. These systems were brittle. They failed outside controlled environments.
The breakthrough came when researchers stopped programming vision directly and asked a different question: what if machines learned to see like brains do?
Modern computer vision models are built in layers. In the earliest layers, the system detects simple patterns like edges, gradients, and textures. It is not told what an edge is. Instead, it learns that certain filters help reduce prediction errors. In deeper layers, simple features combine into corners, repeated motifs, and parts of objects. Higher layers combine those parts into full objects.
The structure mirrors biology.
Hierarchical abstraction tames complex systems by organizing information into levels, each built on the one below. Both the visual cortex and modern neural networks follow this logic.
Image: Visualization of CNN layers detecting edges → shapes → objects [2]
This idea reached a wide audience with the success of large-scale visual datasets like ImageNet, led by Fei-Fei Li and her collaborators. ImageNet is a massive labelled database of millions of images. Its scale made it possible to train deep neural networks that learn visual patterns directly from data rather than from handcrafted rules.
The result was a breakthrough: error rates dropped significantly while generalization improved, and machines showed something closer to perception rather than just memorizing patterns.
But recognizing a static object is only one part of vision.
Human perception is continuous. When a ball rolls behind a chair, you do not think it disappeared. You expect it to reappear. Your brain maintains identity over time. It predicts.
Vision and anticipation are deeply connected.
Many neuroscientists describe the brain as a predictive system that constantly generates hypotheses about sensory input and updates them when reality differs. Perception is not passive recording. It is active inference.
Modern computer vision systems increasingly reflect this principle. Tracking models maintain object identity across frames. Systems trained on video learn motion consistency. Models minimize prediction error to gradually internalize statistical regularities of the visual world.
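The ball-behind-the-chair intuition can be sketched as a toy predictive tracker: it carries an estimate of position and velocity forward every frame, and corrects them only when a measurement arrives. This is a simplified alpha-beta filter, chosen here for brevity; real trackers typically use probabilistic variants such as Kalman filters.

```python
# Perception-as-prediction: keep estimating a ball's position even
# while it is occluded (measurement is None), then correct on reappearance.

def track(measurements, alpha=0.5, beta=0.2):
    pos, vel = measurements[0], 0.0
    estimates = [pos]
    for z in measurements[1:]:
        pos += vel                 # predict: carry the motion forward
        if z is not None:          # correct: blend in the measurement
            residual = z - pos     # prediction error
            pos += alpha * residual
            vel += beta * residual
        estimates.append(round(pos, 2))
    return estimates

# Ball moves +1 per frame, then vanishes behind a chair for two frames.
obs = [0.0, 1.0, 2.0, 3.0, None, None, 6.0, 7.0]
out = track(obs)
print(out)
```

During the occluded frames the estimate keeps advancing at the learned velocity, so the tracker "expects" the ball to reappear roughly where it does, and the next measurements pull the estimate back into line.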
Image: CNN feature map example [3]
If you visualize the intermediate layers of a trained neural network, the resemblance to biological vision becomes striking. Early layers act like edge detectors. Later layers capture textures and shapes. The model learns visual primitives without explicit instruction.
And yet there is an essential difference.
Human vision is embodied. When you see a cup, you naturally understand how it feels in your hand, how heavy it might be, how it tilts. Perception is grounded in physical interaction for human beings.
In contrast, computer vision has often meant mapping images to labels, without that grounding in physical interaction or context.
But that boundary is dissolving.
Vision systems now guide robots, autonomous vehicles, and interactive agents through feedback loops. Perception drives an action; that action changes the environment; the new sensory feedback informs the next decision. Seeing increasingly shapes how machines act.
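That feedback loop can be reduced to a few lines. The sketch below is a hypothetical one-dimensional "robot" (the names and gain value are invented for illustration): each step it perceives the offset to a target, acts by moving a fraction of the gap, and its own action changes what it perceives next.

```python
# Minimal perception-action loop: sense the error, act on it,
# and let the action change the next sensation.

def run_loop(robot=0.0, target=10.0, gain=0.4, steps=12):
    trace = [robot]
    for _ in range(steps):
        error = target - robot   # perceive: where the target sits in view
        robot += gain * error    # act: move a fraction of the way toward it
        trace.append(round(robot, 2))
    return trace

trace = run_loop()
print(trace)
```

Each pass shrinks the remaining error by a constant factor, so the robot closes in on the target without ever being given the target's coordinates as a plan, only as a perception refreshed by its own movement.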
From photons striking the retina to pixels flowing through neural networks, the implementations differ, but the logic converges. Both brains and machines build understanding layer by layer.
Vision does not grant intelligence. But by tracing how light becomes understanding, we can see the architecture of intelligence more clearly, and the insights might unlock far greater machine intelligence in the future.
Further Reading:
Harvard Medical School Magazine. The Limits of Computer Vision and Our Own. https://magazine.hms.harvard.edu/articles/limits-computer-vision-and-our-own
Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences 36, 181–204 (2013). https://doi.org/10.1017/S0140525X12000477
IBM Technology. (2021). What are convolutional neural networks (CNNs)? [Video]. YouTube. https://www.youtube.com/watch?v=QzY57FaENXg
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2009). https://www.image-net.org
Hubel, D.H., Wiesel, T.N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology 160, 106–154 (1962).
LeCun, Y., Bengio, Y., Hinton, G. Deep learning. Nature 521, 436–444 (2015). https://doi.org/10.1038/nature14539
Olah, C., Mordvintsev, A., Schubert, L. Feature Visualization. Distill (2017). https://distill.pub/2017/feature-visualization/
Media Credits:
[1] Visual pathway diagram. Lecturio – The Visual Pathway and Related Disorders. https://www.lecturio.com/concepts/the-visual-pathway-and-related-disorders/
[2] CNN feature visualization image. Olah, C., Mordvintsev, A., Schubert, L. Feature Visualization, Distill (2017). https://distill.pub/2017/feature-visualization/
[3] CNN activation / feature map example. Stanford CS231n – Understanding Convolutional Neural Networks. https://cs231n.github.io/understanding-cnn/