Study: Apple’s newest AI model flags health conditions with up to 92% accuracy



A new Apple-supported study argues that behavioral data (movement, sleep, exercise, etc.) can often be a stronger health signal than traditional biometric measurements like heart rate or blood oxygen. To test this, the researchers developed a foundation model trained on behavioral data collected from wearables, and it performed surprisingly well. Here are the details.

This preprint paper, Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions, comes out of the Apple Heart and Movement Study (AHMS). The researchers trained a new foundation model on more than 2.5 billion hours of wearable data, showing it can match (and even outperform) existing models built on low-level sensor data.

They call the new model WBM, which stands for Wearable Behavior Model. And while previous health-related foundation models mostly relied on raw sensor streams like the Apple Watch’s heart rate sensor (PPG, or photoplethysmograph) or its electrocardiograph (ECG), WBM learns directly from higher-level behavioral metrics: step count, gait stability, mobility, VO₂ max, and so on, all of which the Apple Watch produces in abundance.

But if the Apple Watch has these sensors, what’s the point of the new model?

Great question. And the answer is in the study:

“Consumer wearables, such as smartwatches and fitness trackers, provide rich information across diverse health domains (…). An important aspect of health monitoring is detecting a static health state – for instance, whether someone has a history of smoking, has a past diagnosis of hypertension, or is on a beta-blocker. Another crucial problem is detecting a transient health state, such as the quality of someone’s sleep or whether someone is currently pregnant. A key property of the data required for these predictions is that they are typically at the temporal resolution of human behavior (e.g., days and weeks) rather than at the lower-level time scales (e.g., seconds) at which raw sensor data is collected from wearables (…).

Though a majority of past work has considered modeling low-level sensor data (or simple features thereof), higher-level behavioral information from wearables such as physical activity, cardiovascular fitness, and mobility metrics, are the natural data type to help solve these detection tasks. Unlike raw sensors, these higher-level behavioral metrics are calculated using carefully validated algorithms derived from the raw sensors. These metrics are intentionally chosen by experts to align with physiologically relevant quantities and health states. Importantly, these data are sensitive to an individual’s behaviors, rather than being driven purely by physiology. These characteristics make behavioral data particularly promising for such health detection tasks. For example, mobility metrics that characterize walking gait and overall activity levels may be important behavioral factors to help detect a changing health state such as pregnancy.”

In other words, while the Apple Watch collects raw sensor data, that data can be noisy, overwhelming, and not always aligned with meaningful health events.

The metrics used by WBM are derived from that sensor data, but they are refined to highlight real-world behaviors and health-relevant trends. As a result, they’re more stable, easier to interpret, and better structured for modeling long-term health trends.

In practice, WBM learns from the patterns found in processed behavioral data, rather than relying directly on raw sensor signals.

The nerdy bits

WBM was trained on Apple Watch and iPhone data from 161,855 participants in AHMS. Instead of raw streams, the model was fed 27 human-interpretable behavioral metrics, such as active energy, walking pace, heart rate variability, respiratory rate, and sleep duration.

The data was broken down into weekly blocks and passed through a new architecture built on Mamba-2, which, according to the researchers, performs better than traditional Transformers (the base for GPT) for this use case.
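To make the "weekly blocks" idea concrete, here is a rough Python sketch of how daily readings could be grouped into 7-day windows. It is only an illustration: the function name, the metric names, the use of a simple mean, and the sample values are all hypothetical, and the study's actual preprocessing may differ.

```python
from collections import defaultdict
from datetime import date

def weekly_blocks(readings, start):
    """Group (day, metric, value) readings into 7-day blocks counted from `start`.

    Returns {block_index: {metric: mean value over that block}}.
    Hypothetical helper for illustration; not the study's pipeline.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for day, metric, value in readings:
        block = (day - start).days // 7  # 0 for the first week, 1 for the second, ...
        sums[block][metric] += value
        counts[block][metric] += 1
    return {
        block: {m: sums[block][m] / counts[block][m] for m in metrics}
        for block, metrics in sums.items()
    }

# Made-up readings, for illustration only.
readings = [
    (date(2025, 1, 1), "step_count", 8000.0),
    (date(2025, 1, 2), "step_count", 10000.0),
    (date(2025, 1, 8), "step_count", 6000.0),
    (date(2025, 1, 8), "sleep_hours", 7.5),
]
blocks = weekly_blocks(readings, start=date(2025, 1, 1))
print(blocks)
```

Each block ends up as a small summary of one week of behavior, which is the kind of days-and-weeks resolution the researchers argue matters for these predictions.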

When evaluated on 57 health-related tasks, WBM outperformed a strong PPG-based model in 18 of the 47 static health prediction tasks (like whether someone takes beta-blockers), and in all but one of the remaining 10 dynamic tasks (like detecting pregnancy, sleep quality, or respiratory infection). The exception was diabetes, for which PPG alone won out.

Even better: combining the WBM and PPG data representations produced the most accurate results overall. The hybrid model achieved a whopping 92% accuracy for pregnancy detection, and delivered consistent gains in sleep quality, infection, injury, and cardiovascular-related tasks like AFib detection.
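For the curious, one common way to combine two learned representations is to concatenate them and fit a simple classifier on top. The Python sketch below assumes that approach purely for illustration; the article's sources don't confirm it here, and the function names, the tiny 2-number embeddings, and the labels are all made up.

```python
import math

def combine(wbm_vec, ppg_vec):
    """Concatenate two per-participant embeddings into one feature vector."""
    return list(wbm_vec) + list(ppg_vec)

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit a plain logistic-regression classifier with batch gradient descent."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the classifier's score is positive, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0

# Made-up embeddings and labels, for illustration only.
wbm = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
ppg = [[0.7, 0.3], [0.6, 0.1], [0.2, 0.8], [0.3, 0.9]]
labels = [1, 1, 0, 0]

combined = [combine(w, p) for w, p in zip(wbm, ppg)]
weights, bias = train_logistic(combined, labels)
predictions = [predict(weights, bias, x) for x in combined]
print(predictions)
```

The point of the sketch is the shape of the idea: the classifier sees the behavioral and the physiological view of each participant side by side, so it can lean on whichever one is more informative for a given task.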

In the end, the study doesn’t try to replace sensor data with WBM, but rather to complement it. Models like WBM capture long-range behavioral signals, while PPG catches short-term physiological changes. Together, they’re better at flagging meaningful health shifts early.

If you’d like to know more about the Apple Heart and Movement Study and other studies, we’ve got you covered.