Apple’s newest AI study unlocks street navigation for blind users

There’s no shortage of rumors about Apple’s plans to release camera-equipped wearables. And while it’s easy to get fatigued by yet another wave of upcoming AI-powered hardware, one powerful use case often gets lost in the shuffle: accessibility.

SceneScout, a new research prototype from Apple and Columbia University, isn’t a wearable. Yet. But it hints at what AI could eventually unlock for blind and low-vision users. As the researchers explain it:

People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people.

To try to close this gap, the researchers built SceneScout, which combines Apple Maps APIs with a multimodal large language model to provide interactive, AI-generated descriptions of street view images.

Instead of relying only on turn-by-turn directions or landmarks, users can preview an entire route or virtually explore a neighborhood block by block, with street-level descriptions tailored to their specific needs and preferences.

The system supports two main modes (roughly sketched in code after the list):

  • Route Preview, which lets users get a sense of what they’ll encounter along a specific path. That means sidewalk quality, intersections, visual landmarks, what a bus stop looks like, etc.
  • Virtual Exploration, which is more open-ended. Users describe what they’re searching for (like a quiet residential area with access to parks), and the AI helps them navigate intersections and explore in any direction based on that intent.
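To make the two modes a bit more concrete, here’s a minimal sketch of what a request to each might look like. This isn’t Apple’s code and the paper doesn’t publish an API; the type names and fields below are purely illustrative, inferred from how the modes are described.

```python
from dataclasses import dataclass

# Hypothetical request shapes for SceneScout's two modes.
# Names and fields are illustrative only; the paper publishes no API.

@dataclass
class RoutePreviewRequest:
    origin: str           # e.g. "home"
    destination: str      # e.g. "the bus stop on Main St"
    detail_level: str     # "short", "medium", or "long" descriptions

@dataclass
class VirtualExplorationRequest:
    start_location: str   # where the virtual walk begins
    intent: str           # e.g. "a quiet residential area with access to parks"
    detail_level: str = "medium"

# Route Preview walks a fixed path and describes each segment along it;
# Virtual Exploration lets the user wander intersection by intersection,
# steered by the stated intent.
```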

Behind the scenes, SceneScout grounds a GPT-4o-based agent within real-world map data and panoramic images from Apple Maps.

It simulates a pedestrian’s view, interprets what’s visible, and outputs structured text, broken into short, medium, or long descriptions. The web interface, designed with screen readers in mind, presents all of this in a fully accessible format.
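To give a rough sense of that description step, here’s a minimal sketch of prompting a multimodal model with a single street view panorama. This is not the authors’ implementation: it calls the public OpenAI chat API for GPT-4o, the image URL and prompt wording are placeholders, and the actual agent is additionally grounded in Apple Maps route and map data rather than a lone image.

```python
# Minimal sketch: ask GPT-4o to describe one street view panorama at a
# requested level of detail. The URL and prompt are placeholders; the
# real SceneScout agent is also grounded in Apple Maps map data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_panorama(image_url: str, heading: str, detail: str = "short") -> str:
    """Return a pedestrian-perspective description of a panorama.

    `detail` is "short", "medium", or "long", mirroring the tiered
    descriptions the paper describes.
    """
    prompt = (
        f"You are describing street view imagery to a blind pedestrian "
        f"facing {heading}. Describe sidewalk condition, crossings, and "
        f"landmarks in objective language. Give a {detail} description."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example (placeholder URL):
# describe_panorama("https://example.com/panorama.jpg", "north", "medium")
```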

The first tests showed promise, but also important (and dangerous) shortcomings

The research team ran a study with 10 blind or low-vision users, most of whom were proficient with screen readers and worked in tech.

Participants used both Route Preview and Virtual Exploration, and gave the experience high marks for usefulness and relevance. The Virtual Exploration mode was especially praised, as many said it gave them access to information they would normally have to ask others about.

Still, there were important shortcomings. While about 72% of the generated descriptions were accurate, some included subtle hallucinations, like claiming a crosswalk had audio signals when it didn’t, or even mislabeling street signs.

And while most of the information was stable over time, a few descriptions referenced outdated or transient details like construction zones or parked vehicles.

Participants also pointed out that the system occasionally made assumptions, both about the user’s physical abilities and about the environment itself. Several users emphasized the need for more objective language and better spatial precision, especially for last-meter navigation. Others wished the system could adapt more dynamically to their preferences over time, instead of relying on static keywords.

SceneScout obviously isn’t a shipping product, and it pairs a multimodal large language model with Apple Maps APIs rather than doing real-time, computer vision-based, in-situ navigation. But one could easily draw a line from one to the other. In fact, that possibility comes up towards the end of the study:

Participants expressed a strong desire for real-time access to street view descriptions while walking. They envisioned applications that surface visual information through bone conduction headphones or transparency mode to provide relevant details as they move. As P9 put it, “Why can’t [maps] have a built-in ability to help [provide] detailed information about what you’re walking by.”

Participants suggested using even shorter, ‘mini’ (P1), descriptions while walking, highlighting only critical details such as landmarks or sidewalk conditions. More comprehensive descriptions, i.e. long descriptions, could be triggered on demand when users pause walking or reach intersections.

Another participant (P4) suggested a new form of interaction, in which users “could point the device in a certain direction” to receive on-demand descriptions, rather than having to physically align their phone camera to capture the surroundings. This would enable users to actively survey their environment in real time, making navigation more dynamic and responsive.
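Neither suggestion exists in the prototype, but as a toy illustration, the interaction participants describe boils down to a simple policy: pick a description length from the user’s movement state, and a viewing direction from wherever the device is pointed. Everything below is hypothetical.

```python
# Toy sketch of the real-time interaction participants imagined: 'mini'
# descriptions while walking, long ones on demand when paused or at an
# intersection, and a viewing direction taken from where the device points.
# Purely hypothetical; nothing like this ships in the prototype.

def choose_description_request(is_walking: bool, at_intersection: bool,
                               device_heading_degrees: float) -> dict:
    """Map the user's current state to a description request."""
    if is_walking and not at_intersection:
        detail = "mini"   # only critical details: landmarks, sidewalk conditions
    else:
        detail = "long"   # fuller description when paused or at an intersection
    return {
        "detail": detail,
        "heading_degrees": device_heading_degrees % 360,
    }

# Example: walking mid-block with the device pointed roughly east
# -> {"detail": "mini", "heading_degrees": 92.0}
request = choose_description_request(True, False, 92.0)
```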

As with other studies published on arXiv, SceneScout: Towards AI Agent-driven Access to Street View Imagery for Blind Users hasn’t been peer-reviewed. Still, it is absolutely worth your time if you’d like to know where AI, wearables, and computer vision are inevitably heading.
