When 3D Models Become Rooms We Think Inside: Lessons from Apple’s visionOS Collaboration Demo

Filip Živaljić

Apple’s WWDC session on “Collaborate on structured 3D models in visionOS” looks, at first glance, like a technical tour of RealityKit APIs. Underneath, it is doing something more interesting. It is quietly sketching a grammar for how groups of people might think together inside shared, manipulable 3D models.

This grammar matters whether you are an academic studying virtual museums, a founder building design tools, or a policymaker wondering how to make complex public decisions legible. The talk is framed around AirPods and engine blocks, but the underlying ideas travel far beyond consumer hardware.

From objects on screens to rooms in which we think

The talk opens with a simple scene: three people in a SharePlay call, standing around a shared model of AirPods Pro. Everyone sees the same object, at the same fidelity, fixed in the same physical space in front of them. The case slides closer. The lid unlocks “the way it would on a workbench,” except the workbench is wherever you happen to be. One person lifts the internal assembly, rotates it, and a colleague sees exactly that pose. Someone points at the part that matters. No annotations. No screenshots. Just deictic gestures that everyone can understand.

Then clipping kicks in. A cross‑section cuts through the case and reveals the logic board “exposed in context, in a way that isn’t feasible on a 2D screen.” The clipping plane disappears. The whole device expands into an exploded view. A participant reaches in, pulls the motherboard free, and holds it up for everyone else.

The core shift is this: the 3D model stops being a file to look at and becomes a shared room to think inside. The rest of the session is essentially about what needs to be true, technically and conceptually, for that room to feel intelligible and collaborative.

Structure is behavior: why hierarchy suddenly matters

The first principle is almost philosophical: a 3D model is a part–whole relationship. If you flatten everything to the root, you keep the geometry but lose the meaning. You can still render the model, but you have destroyed the structure that would make it interactive.

In the session, the presenter contrasts two versions of an engine block. In the first, everything has been exported “without preserving its structure.” Every piece has a generic label like InteriorPart_01 or InteriorPart_47. There are no sub‑assemblies. On screen it looks fine. But if you want to highlight a single piston or animate it independently, you are effectively blind. Neither you nor your code knows where to look.

In the second version, the engine has a deep, nested hierarchy. The pistons, the crankshaft, the housing: all exist as named, organized nodes. The presenter hides the outside of the engine and all but one piston. Now the system can isolate, animate, and even let a person reach out and “pull” that one piston free.

The claim is simple but powerful: “Assets without structure are hard to reason about and hard to use in code.” Once the hierarchy is there, very small changes in component placement cause large changes in how people can act on the model.
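The difference between the two engine exports can be sketched in a few lines. This is a toy illustration in plain Python, not RealityKit code: the `Node` class, the `find` helper, and the part names in the structured tree (`Housing`, `Crankshaft`, `Piston_3`) are assumptions made up for the example. Only the `InteriorPart_NN` labels come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical scene node: a name plus child nodes."""
    name: str
    children: list[Node] = field(default_factory=list)

    def walk(self, path: tuple[str, ...] = ()) -> Iterator[tuple[tuple[str, ...], Node]]:
        """Yield (path, node) for this node and every descendant, depth first."""
        here = path + (self.name,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)

    def find(self, name: str) -> Optional[tuple[str, ...]]:
        """Return the path to the first node with this name, or None."""
        for path, node in self.walk():
            if node.name == name:
                return path
        return None


# Flattened export: 47 generically named parts hanging off the root.
flat = Node("Engine", [Node(f"InteriorPart_{i:02d}") for i in range(1, 48)])

# Structured export: named sub-assemblies (names are illustrative).
structured = Node("Engine", [
    Node("Housing"),
    Node("Crankshaft"),
    Node("Pistons", [Node(f"Piston_{i}") for i in range(1, 5)]),
])

print(flat.find("Piston_3"))        # None: the code has nothing to look for
print(structured.find("Piston_3"))  # ('Engine', 'Pistons', 'Piston_3')
```

Both trees could render identically. Only the second one gives code a name and a path to act on.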

This is particularly relevant for cultural heritage and urban studies, where hierarchical descriptions already exist. Museums maintain collection management databases with object parts, materials, and provenances. City planning departments describe buildings, parcels, networks, and utilities in layered GIS and BIM systems. The Apple talk is an implicit invitation: preserve those hierarchies in your 3D exports rather than flattening them for convenience. If you keep them, you can start to map curatorial and infrastructural knowledge onto spatial interactions.

Moving one component, changing everything

The next move is to show how interaction flows from that hierarchy. RealityKit offers a ManipulationComponent. Attach it to the root of an entity and people can grab, rotate, and scale the entire object with natural hand movements. Now move that same component down one level, onto the children of the assembly, and something qualitatively different happens.

Suddenly, the top enclosure can be pulled away while the bottom stays where it is. One person can rotate an earbud while someone else examines the other at the same time. The whole experience shifts “from thing to look at to thing to explore” purely because of where the component lives in the tree. The geometry has not changed. The file on disk has not changed. Only the locus of agency has shifted.
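A minimal sketch of that idea, again in plain Python rather than RealityKit: the `Entity` class, the `MANIPULATION` tag, and the `set_mode` and `grab_targets` helpers are hypothetical stand-ins, and the entity names are invented. The point is only that the tree stays the same while the tag moves.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MANIPULATION = "manipulation"  # hypothetical tag standing in for a manipulation component


@dataclass
class Entity:
    """Hypothetical entity: a name, children, and a set of component tags."""
    name: str
    children: list[Entity] = field(default_factory=list)
    components: set[str] = field(default_factory=set)


def grab_targets(root: Entity) -> list[str]:
    """Names of every entity carrying the manipulation tag, in tree order."""
    found: list[str] = []
    stack = [root]
    while stack:
        entity = stack.pop()
        if MANIPULATION in entity.components:
            found.append(entity.name)
        stack.extend(reversed(entity.children))
    return found


def set_mode(root: Entity, explore: bool) -> None:
    """Put the tag on the root (whole object) or on its direct children (parts)."""
    root.components.discard(MANIPULATION)
    for child in root.children:
        child.components.discard(MANIPULATION)
    for entity in (root.children if explore else [root]):
        entity.components.add(MANIPULATION)


case = Entity("AirPodsCase", [
    Entity("TopEnclosure"), Entity("BottomEnclosure"),
    Entity("LeftEarbud"), Entity("RightEarbud"),
])

set_mode(case, explore=False)
print(grab_targets(case))  # ['AirPodsCase']

set_mode(case, explore=True)
print(grab_targets(case))  # ['TopEnclosure', 'BottomEnclosure', 'LeftEarbud', 'RightEarbud']
```

Switching modes touches no geometry and rebuilds nothing. It only changes which nodes answer to a grab.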

Conceptually, this is elegant. Component placement drives behavior. It suggests a design rule: when you are building collaborative spatial experiences, ask not only “what can be manipulated?” but “at what level in the semantic hierarchy does manipulation live?” In a museum, you might attach manipulation to the entire sculpture for one mode, and then to individual fragments, restorations, or historical layers in another. In a planning context, you might let people move an entire building block, then drop down and manipulate individual apartments or utilities.

For a founder, this is an appealing abstraction. You can expose a one‑click mode switch, “open this assembly,” that simply moves a component from one level of the tree to another. There is no need to rebuild the scene or manage parallel layouts. You are reassigning the power to move.

Clipping as an interaction, not a hack

The most evocative part of the talk for me is the treatment of clipping. In many 3D tools, clipping planes are a technical feature, buried somewhere in a viewport menu. Here, clipping is reimagined as a first‑class, embodied interaction mode.

RealityKit in visionOS 27 introduces a ClippingComponent with editable bounds. At their simplest, the bounds are an axis‑aligned bounding box in the model’s local space, and anything outside that box is discarded by the renderer. A three‑state machine governs clipping: off, on, and editing. In the off state, the assembly is untouched. In the on state, clipping is active and reveals interior layers. In the editing state, six planes appear, one for each face of the bounding box, and people can grab and move them.
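The three states and the box test can be modeled in a few lines. This is a conceptual sketch in plain Python, not the ClippingComponent API: `ClipState`, `Clipper`, `is_visible`, and `handle_count` are hypothetical names. It also assumes that clipping stays active while editing, which the description above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum

Vec3 = tuple[float, float, float]


class ClipState(Enum):
    OFF = "off"
    ON = "on"
    EDITING = "editing"


@dataclass
class Clipper:
    """Hypothetical model of an axis-aligned clipping box in model-local space."""
    lo: Vec3  # minimum corner (x, y, z)
    hi: Vec3  # maximum corner (x, y, z)
    state: ClipState = ClipState.OFF

    def is_visible(self, point: Vec3) -> bool:
        """A point survives if clipping is off or the point lies inside the box."""
        if self.state is ClipState.OFF:
            return True
        # Assumption: the box keeps clipping while its faces are being edited.
        return all(l <= c <= h for l, c, h in zip(self.lo, point, self.hi))

    def handle_count(self) -> int:
        """Six draggable planes appear only in the editing state."""
        return 6 if self.state is ClipState.EDITING else 0


clipper = Clipper(lo=(-1.0, -1.0, -1.0), hi=(1.0, 1.0, 1.0))
outside = (2.0, 0.0, 0.0)

for state in ClipState:
    clipper.state = state
    print(state.value, clipper.is_visible(outside), clipper.handle_count())
```

The two corners hold exactly six scalars between them, which is what makes a one-plane-per-number editing model possible.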

Crucially, each plane controls exactly one number: one scalar value of the bounds. If you pull the +x face inward, you reveal more of the interior along that axis. If you push it back out, you hide it again. The interaction model collapses to “six planes, six numbers.”

Under the hood, the math is non‑trivial. The presenter breaks down how drag gestures live in one coordinate frame (the clipping plane), but the bounds live in another (the model). There is also the world frame and a clipping control frame. The drag vector is transformed between these spaces, then projected onto the direction that matters, such as the normal of the bounding box face. The projection step is explained as “measuring the shadow of the drag delta vector cast on the direction vector.”
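The transform-then-project step can be sketched as follows. This is a simplified illustration in plain Python, not the session's code: it collapses the four frames into two (world and model), assumes the model is rotated only about the z axis by a hypothetical `model_yaw` angle, and uses made-up helper names throughout. The scalar projection of a drag vector d onto a direction n is (d · n) / |n|, which is the "shadow" length the presenter describes.

```python
import math

Vec3 = tuple[float, float, float]
AXES = {"x": (1.0, 0.0, 0.0), "y": (0.0, 1.0, 0.0), "z": (0.0, 0.0, 1.0)}


def dot(a: Vec3, b: Vec3) -> float:
    return sum(p * q for p, q in zip(a, b))


def rotate_z(v: Vec3, angle: float) -> Vec3:
    """Rotate a direction vector about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])


def world_to_model(v_world: Vec3, model_yaw: float) -> Vec3:
    """Express a world-space direction in model space by undoing the model's yaw."""
    return rotate_z(v_world, -model_yaw)


def scalar_projection(delta: Vec3, direction: Vec3) -> float:
    """Signed length of the shadow that delta casts on direction."""
    return dot(delta, direction) / math.sqrt(dot(direction, direction))


def drag_face(lo: Vec3, hi: Vec3, face: str, drag_world: Vec3, model_yaw: float):
    """Return new (lo, hi) after dragging one face such as '+x' or '-z'.

    Only one of the six scalars changes; it is clamped so lo never passes hi.
    """
    side, axis = face[0], face[1]
    i = "xyz".index(axis)
    shift = scalar_projection(world_to_model(drag_world, model_yaw), AXES[axis])
    new_lo, new_hi = list(lo), list(hi)
    if side == "+":
        new_hi[i] = max(new_lo[i], new_hi[i] + shift)
    else:
        new_lo[i] = min(new_hi[i], new_lo[i] + shift)
    return tuple(new_lo), tuple(new_hi)


# Model turned 90 degrees about z, so its +x axis points along world +y.
# A world drag of -0.2 along y (plus some stray z motion) pulls the +x face inward.
lo, hi = drag_face((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), "+x",
                   drag_world=(0.0, -0.2, 0.05), model_yaw=math.pi / 2)
print(lo, tuple(round(v, 3) for v in hi))
```

The stray z component of the drag is ignored because it casts no shadow on the x direction, which is why an imprecise hand movement still moves the face cleanly along its own axis.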

Conceptually, this is interesting: the interaction relies on linear algebra to produce something that feels almost analog.

Source: https://www.youtube.com/watch?v=zEyH34eLRlw&list=PLjODKV8YBFHYfjjIHqYNdKjSO7INGYDWm&index=15
