Topic & Motivation
With rapid advances in wireless and network technologies, mobile devices have evolved from phones for calls and media playback to wearables such as smart glasses and XR headsets, and to mobile robots, all capable of real-time multimodal interaction. As multimodal foundation models become more efficient to deploy across devices, edge, and cloud, media consumption is shifting toward intelligent, embodied, and context-aware interaction. Instead of passively viewing media, users can interact with embodied AI agents (avatars or assistive robots) that perceive, occupy, and act in a shared space.
For such agents to operate naturally, they must reason about the world from multimodal data, maintain a grounded and continuously updated world understanding, run computationally efficient models suitable for resource‑constrained platforms, and ensure ethical and privacy‑aware data handling.
Key Questions
- How can an embodied AI system understand multimodal sensory inputs (e.g., visual, audio, language, tactile) from diverse data sources?
- How can we build grounded, persistent, and adaptive world understanding by integrating spatial computing algorithms and multimodal foundation models?
- How can we deploy efficient algorithms across diverse platforms such as mobile devices, wearables, XR headsets, and edge networks?
- How can we ensure responsible handling of sensitive, privacy-rich data from egocentric views?
Keywords: mobile devices, embodied agents, multimodal, world models
Invited Speakers
Stephen Brewster
University of Glasgow
Stephen Brewster is a Professor of Human-Computer Interaction in the School of Computing Science at the University of Glasgow, where he leads the Multimodal Interaction Group within the GIST research section. His research focuses on multimodal HCI, combining audio, haptics, and gesture to create rich, natural human-computer interactions, with a strong emphasis on applying perceptual research to practical settings. He is a Fellow of the Royal Society of Edinburgh, a member of the ACM SIGCHI Academy, and an ACM Distinguished Speaker.
Andrea Cavallaro
EPFL & Queen Mary University of London
Andrea Cavallaro is a Full Professor at EPFL and Queen Mary University of London, where he founded the Centre for Intelligent Sensing, and a Turing Fellow at The Alan Turing Institute. He received his Ph.D. in Electrical Engineering from EPFL in 2002 and is a Fellow of IAPR for contributions to image processing and multi-sensor scene understanding. He serves as Editor-in-Chief of Signal Processing: Image Communication and Senior Area Editor for IEEE Transactions on Image Processing. His research spans privacy-aware visual analysis, person re-identification, and sensor data anonymization, and he has edited books on multi-camera networks and multimedia surveillance.
Tentative Schedule
Half‑day workshop (4 hours) – subject to minor adjustments
| Time | Event |
|---|---|
| 08:50–09:00 | Opening remarks |
| 09:00–09:30 | Invited Talk 1 |
| 09:30–10:00 | Invited Talk 2 |
| 10:00–11:00 | ☕ Coffee break and poster session for accepted contributions |
| 11:00–11:30 | Invited Talk 3 |
| 11:30–12:20 | Panel discussion, Q&A |
| 12:20–12:30 | Closing remarks |
We will also host an Embodied Reasoning Challenge based on the UNOBench benchmark for robotic grasping in cluttered scenes.
Workshop Organizers
Püren Güler
Ericsson Research
Hirokatsu Kataoka
AIST / Oxford VGG
Yoshihiro Fukuhara
AIST / CADDi
Fabio Poiesi
FBK, Trento
Hiba Alqaysi
Ericsson Research
Anastasia Grebenyuk
Ericsson Research
Haoyu Xiong
MIT
Marcus Valtonen Örnhag
Ericsson Research
Magnus Oskarsson
Lund University
Héctor Caltenco
Ericsson Research
Call for Papers
The submission portal is now open on OpenReview; follow this link.
We welcome submissions on all topics related to on-device embodied world models. The submission format and page limits follow the official ECCV 2026 template and main conference guidelines (see here). Each submission will be reviewed under a double-blind policy.
We offer two submission tracks: Archival and Non-Archival. Archival submissions follow the standard ECCV paper format with a 14-page limit, are submitted via OpenReview, and will be published in the workshop proceedings. Non-Archival submissions are extended abstracts with a 4-page limit; they will not appear in the proceedings but will be featured on the workshop website. We also welcome previously published work on topics relevant to the workshop, submitted as extended abstracts.
Topics include (but are not limited to):
- Embodied world models
- Multimodal reasoning
- Spatial understanding for XR, robotics, autonomous driving, etc.
- Deployment of AI models and spatial compute at the edge
- Privacy-preserving perception on devices
Important Dates
Tentative dates
| Milestone | Date |
|---|---|
| Workshop dates | September 8–9, 2026 (exact date, time, and place TBD) |
| Submission opens | June 2 |
| Paper submission deadline | July 9 |
| Paper acceptance notification | August 7 |
| Camera-ready version | August 14 |
Diversity & Inclusion
Our workshop emphasizes diversity across the organizing team and invited speakers. The team is gender-diverse and includes members with international backgrounds from Europe, the Middle East, and Asia. Organizers span multiple career stages and affiliations (industrial research, universities, national labs). The speakers come from globally recognized institutions, and their expertise covers interdisciplinary areas (world-model scaling, multimodal interaction, responsible AI) directly aligned with our workshop's key questions.