Seeing isn't for knowing what stuff looks like
If you tried to break a visual perception down into its parts, you’d probably reduce it to the kinds of elements an artist would talk about: shape, line, color, depth, angle, etc. In our minds, vision is an aggregation of these things: when you put together some lines, some colors, etc., you have a thing-that-looks-a-certain-way.
Neuroscientists have long assumed that the mind experiences things the way the brain does them. In the case of vision, this would mean that the visual system is made of parts that process different aspects of vision: neurons for encoding colors and lines, circuits for transmitting information about shapes and angles, etc. This perspective on how vision works in the brain creates a binding problem: how does the information about color, lines, etc., get added together to produce the visual images our minds perceive?
Any binding problem should get solved by economic coordination, but economic coordination doesn’t glue together existing pieces. Instead, new elements are constructed out of the coordination process. In the case of vision, this would mean that the apparent elements of vision, like lines and colors, are in fact constructed by the interactions of neurons rather than coded for by individual neurons. A visual perception consequently isn’t made out of these elements but is instead what we have in the aggregate when these “elements” are produced by a more general cognitive process.
In fact, it seems likely that you don’t construct one visual perception in a given situation but many. The thing you see is the virtual governor produced by the competing visual predictions. The visual predictions reach a consensus about what is going on, but that consensus isn’t a consensus about any particular thing; it’s a consensus about what parameters of a shared model are going to get them all to coordinate.
Instead of thinking about vision as being made of visual parts, like color and lines, psychological constructionism says that vision is made out of metabolic processes, specifically the process of allostasis. By thinking about vision as being constructed out of interoceptive signals rather than visual elements, we avoid the binding problem entirely.
The intuitive view, that vision is built out of visual elements being aggregated together, runs into the intractable binding problem and isn’t how vision works anyway. But the allostasis view of vision is counterintuitive because it says that the elements of a visual perception are produced within the brain and body rather than received from the environment. So what we see isn’t what the environment looks like—it’s something we create as a way of governing interactions inside of us.
What does the environment look like? I think this question cuts to the heart of the matter. We normally think of vision as being useful for telling us what stuff looks like: things look a certain way, and vision basically serves to report those looks to us. The existence of stuff like microscopic organisms and non-visible wavelengths of light shows that this perspective is definitely wrong, but it’s still highly intuitive. It’s related to the perception-cognition-action fallacy: the idea that perception passively receives the state of the environment and recreates it inside our brains as some kind of representation, however imperfect. For this to be true, the environment needs to already look like something, which our eyes then report at least a piece of. But there’s no reason to think that the environment has a preset visual appearance.
Relatedly, we normally think of perceptions as being about accuracy: trying to recreate the way the environment is as accurately as possible inside of us. It’s like we think there’s a photo of the world taken by Nature, and another taken by our eyes, and we try to compare them to see how closely they match. But Nature didn’t take a photo! Internal models are about action, not accuracy: transforming the environment, not recreating it. Objects are identified through the process of action selection, not visual recreation of the environment. Action affordances don’t come from the object; instead, the object is a psychological construction based on what actions are expected to be useful.
Even though we experience vision as a passive reception of the appearance of the environment, I’m unaware of any evidence that the environment looks like something when no one is looking at it, or that looking is a receipt and recreation of preexisting visual information. If there is no preexisting appearance to receive, then vision might be a novel creation instead, like a collective preference or competency that can’t be attributed to any of the parts.