ECCV 2026 Workshop

On-device Embodied World Models

Grounded, efficient, and privacy‑aware world understanding for mobile devices and embodied agents

📍 ECCV 2026, Malmö, Sweden  |  🗓️ Half‑day workshop

Topic & Motivation

With rapid advances in wireless and network technologies, mobile devices have evolved from phones for calls and media playback to wearables such as smart glasses and XR headsets, and to mobile robots capable of real‑time multimodal interaction. As multimodal foundation models become more efficient to deploy across devices, edge, and cloud, media consumption is shifting toward intelligent, embodied, and context‑aware interaction. Instead of passively viewing media, users can interact with embodied AI agents (avatars or assistive robots) that perceive, occupy, and act in a shared space.

For such agents to operate naturally, they must reason about the world from multimodal data, maintain a grounded and continuously updated world understanding, run computationally efficient models suitable for resource‑constrained platforms, and ensure ethical and privacy‑aware data handling.

Key Questions

  • How can an embodied AI system understand multimodal sensory inputs (e.g., visual, audio, language, tactile) from diverse data sources?
  • How to build grounded, persistent and adaptive world understanding by integrating spatial computing algorithms and multimodal foundation models?
  • How to deploy efficient algorithms across diverse platforms such as mobile devices, wearables, XR headsets, and edge networks?
  • How to ensure responsible handling of sensitive, privacy‑rich data from egocentric views?

Keywords: mobile devices, embodied agents, multimodal, world models

Invited Speakers

Stephen Brewster

University of Glasgow

Stephen Brewster is a Professor of Human-Computer Interaction in the School of Computing Science at the University of Glasgow, where he leads the Multimodal Interaction Group within the GIST research section. His research focuses on multimodal HCI, combining audio, haptics, and gesture to create rich, natural human-computer interactions, with a strong emphasis on applying perceptual research to practical settings. He is a Fellow of the Royal Society of Edinburgh, a member of the ACM SIGCHI Academy, and an ACM Distinguished Speaker.

Andrea Cavallaro

EPFL & Queen Mary University of London

Andrea Cavallaro is a Full Professor at EPFL and Queen Mary University of London, where he founded the Centre for Intelligent Sensing, and a Turing Fellow at The Alan Turing Institute. He received his Ph.D. in Electrical Engineering from EPFL in 2002 and is a Fellow of IAPR for contributions to image processing and multi-sensor scene understanding. He serves as Editor-in-Chief of Signal Processing: Image Communication and Senior Area Editor for IEEE Transactions on Image Processing. His research spans privacy-aware visual analysis, person re-identification, and sensor data anonymization, and he has edited books on multi-camera networks and multimedia surveillance.

Tentative Schedule

Half‑day workshop (4 hours) – subject to minor adjustments

Time  |  Event
08:50–09:00  |  Opening remarks
09:00–09:30  |  Invited Talk 1
09:30–10:00  |  Invited Talk 2
10:00–11:00  |  ☕ Coffee break and poster session for accepted contributions
11:00–11:30  |  Invited Talk 3
11:30–12:20  |  Panel discussion, Q&A
12:20–12:30  |  Closing remarks

We will also host an Embodied Reasoning Challenge based on the UNOBench benchmark for robotic grasping in cluttered scenes.

Workshop Organizers

Püren Güler

Ericsson Research

Hirokatsu Kataoka

AIST / Oxford VGG

Yoshihiro Fukuhara

AIST / CADDi

Fabio Poiesi

FBK, Trento

Hiba Alqaysi

Ericsson Research

Anastasia Grebenyuk

Ericsson Research

Haoyu Xiong

MIT

Marcus Valtonen Örnhag

Ericsson Research

Magnus Oskarsson

Lund University

Héctor Caltenco

Ericsson Research

Call for Papers

The submission portal is now open on OpenReview; follow this link.

We welcome submissions on all topics related to on-device embodied world models. The exact submission format and page limits will follow the official ECCV 2026 template and main conference guidelines (see here). Each submission will be reviewed under a double-blind policy.

We offer two submission tracks: Archival and Non-Archival. Archival submissions follow the standard ECCV paper format with a 14-page limit, are submitted via OpenReview, and will be published in the workshop proceedings. Non-Archival submissions are extended abstracts with a 4-page limit; they will not appear in the proceedings but will be featured on the workshop website. Previously published work on topics relevant to the workshop may be submitted as an extended abstract.

Topics include (but are not limited to):

  • Embodied world models
  • Multimodal reasoning
  • Spatial understanding for XR, robotics, autonomous driving, etc.
  • Deployment of AI models and spatial compute at the edge
  • Privacy-preserving perception on devices

Important Dates

Tentative dates

Milestone  |  Date
Workshop dates  |  September 8–9, 2026 (exact date, time, and place TBD)
Submission opens  |  June 2
Paper submission deadline  |  July 9
Paper acceptance notification  |  August 7
Camera-ready version  |  August 14

Diversity & Inclusion

Our workshop emphasizes diversity across the organizing team and invited speakers. The team is diverse in gender and includes members with international backgrounds from Europe, the Middle East, and Asia. Organizers span multiple career stages and affiliations (industrial research, universities, national labs). Speaker institutions are globally recognized, and the speakers' expertise covers interdisciplinary areas (world‑model scaling, multimodal interaction, responsible AI) directly aligned with our workshop's key questions.

Sponsor