
Environment understanding for mixed reality: why it is important and why it should improve

These days I’m experimenting with the development of mixed reality applications, and on Quest I’ve been playing around with environment understanding, seeing the good and the bad of Meta’s approach to this topic. In this article, I will explain where we are at, and what has to change to make mixed reality truly useful.

Mixed reality and environment understanding

Mixed reality is one of the trends of the moment and it will surely be one of the dominant technologies of our future: in “5 to 10 years” (Vitillo’s law of technology), many people will probably wear AR/MR glasses and live in a constantly augmented world. But this technology needs many advancements before that can happen.

One important feature for mixed reality, in my opinion, is environment understanding: the headset must understand what the user has around him/her. For instance, if I’m in my office, it should be able to understand that I’m in a room with a certain shape and certain walls, and that inside it there are a desk, a computer, some chairs, a window, etc… All of this should happen automatically, with as little friction as possible for the user.

Apple RoomPlan lets you scan your room and detects not only the shape of the room but also the pieces of furniture inside it (Image by Apple)

The integration of the virtual experience with the environment is what makes the real and virtual realities truly merge with each other, becoming a true mixed reality. It is what makes your mixed reality believable. For instance, the moment you throw a virtual ball against your wall and it bounces off it, your mind clicks and gets a confirmation that the virtual element and your real wall are part of the same context. When I launched the First Encounters demo on Quest 3 and saw the spaceship landing on my bed, I thought it was really cool, because it was as if that virtual spaceship had chosen to land exactly on a piece of my furniture. If a mixed reality application blends with the physical space around it, it feels like magic.

But it is not only about consistency and making our brains go wow: environment understanding can also be useful to give the application a context and tell it how it should behave. For instance, if I develop an AI NPC that should assist developers, my MR application should detect if in the room there is a desk with a PC on it, and if so, put the small character there. If I make an accessibility application for people with visual impairments, my app should detect where the door of the room is, so that it can guide the user to the exit.
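Just to make the idea concrete, here is a minimal sketch of how an app could use semantic scene data for this kind of context-aware behavior. Everything in it (the SceneAnchor type, the label strings, the fake room content) is a made-up placeholder, not an actual SDK API:

```python
# Hypothetical sketch, not a real SDK API: using semantic scene data
# to give context to an MR app. Labels and helpers are invented placeholders.

from dataclasses import dataclass

@dataclass
class SceneAnchor:
    label: str                              # e.g. "DESK", "CHAIR", "DOOR_FRAME", "WALL_FACE"
    position: tuple[float, float, float]    # center in world space (meters)
    size: tuple[float, float, float]        # bounding-box width, height, depth

def place_assistant_npc(anchors):
    """If the room contains a desk, put the assistant character on top of it."""
    desks = [a for a in anchors if a.label == "DESK"]
    if desks:
        x, y, z = desks[0].position
        top_y = y + desks[0].size[1] / 2
        print(f"Spawning NPC on the desk at ({x:.2f}, {top_y:.2f}, {z:.2f})")
    else:
        print("No desk found: spawning NPC in front of the user instead")

def guide_user_to_exit(anchors):
    """An accessibility app could look for the door and draw a path towards it."""
    doors = [a for a in anchors if a.label == "DOOR_FRAME"]
    if doors:
        print(f"Drawing guidance path towards the door at {doors[0].position}")
    else:
        print("No door in the scene model: asking the user to mark it")

# Fake scene model, standing in for what the OS would return after a room scan
room = [
    SceneAnchor("WALL_FACE", (0.0, 1.5, -2.0), (4.0, 3.0, 0.1)),
    SceneAnchor("DESK", (1.2, 0.75, -1.5), (1.4, 0.75, 0.7)),
    SceneAnchor("DOOR_FRAME", (-1.8, 1.0, -2.0), (0.9, 2.0, 0.1)),
]
place_assistant_npc(room)
guide_user_to_exit(room)
```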

It’s the mix of these two things that makes environment understanding so important for mixed reality. Without it, mixed reality risks being just a passthrough background for your application, which in some cases is still useful, but in most of the others feels like a gimmick, because the app could as well be in VR. Remember that, by definition, VR is the technology where your experience has no relation to the environment around you, while AR is the one that happens in the environment around you… so if you don’t use any environment information in your application, it is mostly a virtual experience (there are some exceptions, of course, like apps that are related to the user’s body, or that enable passthrough to keep the user aware of his/her surroundings… but the general concept holds).

When the little spaceship enters your home, it breaks a hole in the ceiling and then lands on a valid surface. This interaction with the environment makes it feel more real (Image by Meta)

Some important features for environment understanding

Environment understanding, to be as useful as possible, needs to be:

  • Fast: the user cannot wait ages for the headset to detect the room before launching an MR experience
  • Automatic: the less the user has to do, the better
  • Frictionless: this is a combination of the two factors above
  • Accurate: the more information it detects, the better: a system that detects only the walls and the floor is less interesting than one that can also detect all the pieces of furniture
  • Up-to-date: the information should reflect the current state of the room. Environments change: people enter, chairs are moved… and the model of the environment held by the MR application should reflect that, otherwise you throw your virtual ball against your real chair and it doesn’t bounce, because the MR model still has the chair in its old position, breaking the magic.

Of course, we are not at this point yet on MR headsets, because we are still in the early days. I’m experimenting a bit with Quest and I think that with Quest 3 we are halfway to the quality we aim for. Let me tell you why… but first, let me take a little dive into the past.

Scene understanding on HoloLens 1

Example of the coarse mesh of the environment created by HoloLens 1 (Image by Microsoft)

There are only two devices with which I’ve been able to play around with environment understanding for a long time: HoloLens 1 and Quest 3. On HoloLens 1, the feature seemed like black magic, because at that time no other device was doing the same thing. As I looked around, the headset started creating a mesh of the environment and was then also able to detect horizontal and vertical planes. It also tried to guess what kind of planes they were, for instance whether they were walls or tables.

It was pretty ambitious for the time, and in fact, on paper, it ticked many of the above boxes. You launched an application, and that application started the scanning process: the OS detected if you already had an existing model of that room; if yes, the new data was just used to update the old model; if not, a new model was started from scratch. That was very cool… I still believe that Microsoft was years ahead of all its competitors and threw that advantage away with the bad management of the HoloLens project.

But the reality was that it was also very buggy: the environment scanning was not that fast, not that accurate, and above all, the continuous updates of the room models made them become big and bloated, so a lot of times we had to delete them and restart from scratch. The plane detection also worked, but not perfectly, with the detected planes not always being exactly aligned with the real surfaces.

Microsoft then evolved the system with HoloLens 2, but I haven’t had enough time to play with it to express an opinion: I just remember that the meshing of the environment was more accurate.

Scene understanding on Quest 3: pros and cons

Coming to the present day, I’ve been playing around with environment understanding on Quest 3 for a few weeks, and it has been interesting to evaluate its performance, and also to compare it with the experience I had with HoloLens in the past.

Quest 3 is the first consumer Quest that can perform some sort of environment understanding. You have surely noticed it when you set up the Guardian: you look around you, and you see some fancy triangles appearing in your environment, with the system suggesting a play area specific to your space. A nice idea on paper, but the area it suggests to me is usually too restricted (the system really plays it too safe), so most of the time I have to re-draw it by hand anyway.

So I thought that Quest 3 could work as an improved version of HoloLens, letting me scan my room and then use it for mixed reality. Reading the documentation of the Meta Quest SDK, I also saw that the scene model the Quest keeps internally includes pieces of furniture like sofas, chairs, and desks, and other room features like windows and doors. This was exactly what I was looking for: a full model of my room I could use to merge realities. For instance, a virtual NPC could be put on a chair and speak with the user (if you played Fragments on HoloLens 1, you know the black magic I’m talking about).

Fragments was another experience that was really ahead of its time

Actually, the reality is a bit different. The process of creating the model of the environment around the user is mostly manual on Quest… and it ticks very few of the boxes listed above.

Meta took an approach that at its heart is similar to Microsoft’s (and different from that of other AR systems): it’s the OS, not the application, that stores a model of the room, so every application that needs to know what’s around the user doesn’t have to re-scan the environment, but simply asks the OS for the room model. If there is no model, a new one must be created. This is a very smart approach, which helps reduce the overall friction, because the big scan of the room needs to happen only once, when it is first needed by the first application. Later, the model just needs to be updated while the user uses the AR applications.
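As a rough illustration of this architecture, here is a tiny sketch of the “the OS owns the model” idea. FakeOSRuntime and all its method names are invented placeholders, not Meta’s actual API; the point is only that the scan happens once and every later app reuses it:

```python
# Conceptual sketch of the "OS owns the room model" approach.
# FakeOSRuntime stands in for the headset runtime; names are hypothetical.

class FakeOSRuntime:
    def __init__(self):
        self._room_model = None                 # scanned once, then shared by every app

    def query_scene_model(self):
        return self._room_model                 # apps read it, they never re-scan

    def run_room_capture(self):
        # In reality this is the system-level scan/markup flow, done by the user once.
        self._room_model = {"walls": 4, "floor": 1, "ceiling": 1, "furniture": []}

def app_startup(runtime):
    model = runtime.query_scene_model()
    if model is None:
        runtime.run_room_capture()              # only happens the very first time
        model = runtime.query_scene_model()
    print("App received the shared room model:", model)

runtime = FakeOSRuntime()
app_startup(runtime)    # the first app triggers the one-time capture
app_startup(runtime)    # every later app reuses the same model, with no extra friction
```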

But the big problem is that, as I’ve said, the model construction is mostly manual. On Quest 2, the user has to manually mark all the walls, the floor, the ceiling, and all the pieces of furniture. For this reason, I’ve never played around much with mixed reality on Quest 2: no user will ever do that. The process is long and boring and adds too much friction: a user would do that only to run a specific MR application he/she truly needs. But currently, there are no “killer MR applications”, so this never happens.

Quest 3 is actually able to scan your room, so I was pretty excited about finally experimenting with MR all around me. But when I dug a bit more into the Scene SDK, I got very disappointed when I discovered that this process only works automatically for detecting the main room shape. The Quest 3 is able to reconstruct a rough mesh of the room and to auto-detect walls, floor, and ceiling, but then it does nothing more. The user still has to mark by hand all the pieces of furniture, the doors, and the windows, for every room where he/she wants to play with mixed reality. Just to make you understand how boring and clunky this process is: even I, who have a clear willingness to play around with MR development, haven’t mapped all the pieces of furniture I have in my room, but just the biggest ones that let me do my tests. And I did it only for one specific room, which is my office. If I, a technical person with a specific need, am so reluctant to do that, imagine what the attitude of the average user could be.
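If you wonder what this split between auto-detected and manually marked elements means for an app, a little sketch like the following can report how complete the user’s markup actually is. The label names here are purely illustrative and only loosely mirror the semantic labels of the Quest scene model:

```python
# Sketch: checking how complete the user's scene markup actually is.
# Label strings are illustrative; they only loosely mirror real semantic labels.

AUTO_DETECTED = {"WALL_FACE", "FLOOR", "CEILING"}          # Quest 3 can find these by itself
MANUAL_ONLY = {"TABLE", "COUCH", "DOOR_FRAME", "WINDOW_FRAME", "BED", "SCREEN"}

def markup_report(labels_in_scene):
    found_manual = MANUAL_ONLY & set(labels_in_scene)
    if not found_manual:
        return "Only the auto-detected room box is available: no furniture was marked."
    missing = MANUAL_ONLY - set(labels_in_scene)
    return f"Marked by the user: {sorted(found_manual)}; never marked: {sorted(missing)}"

# My own office after a lazy setup: the room box plus only the biggest pieces of furniture
print(markup_report(["WALL_FACE", "FLOOR", "CEILING", "TABLE", "SCREEN"]))
```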

Even worse: the model is never updated by the apps. Once you, the user, scan your room and manually map your pieces of furniture, that’s your room model forever, until you manually choose to go to the system menu and update it. So if I have a chair in a specific position, map my room, and then move the chair… the MR application that is running will still believe the chair is in the previous position. The only way to change that is to go to the system menu, update my room scan, update my chair position, and save the scene model. This is next-level friction.

The creation of the room model can only happen in a specific Scene Setup feature of the Quest runtime; applications can not run it under the hood. So either the user launches Scene Setup from the system settings, or the MR application that needs scene data has to prompt the user to do the scene setup, and if the user agrees, the application is paused and Scene Setup is launched. This is quite a bummer.
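In practice, this means the app has to plan for being suspended while the system flow runs, and for the possibility that the user cancels it. A rough sketch of that lifecycle, again with invented names, could look like this:

```python
# Sketch of the pause/resume dance around the system Scene Setup flow.
# All names are invented; the point is the lifecycle, not an exact API.

class StubRuntime:
    """Stands in for the headset OS; the real capture flow is a system app."""
    def __init__(self):
        self.scene_model = None

    def request_scene_capture(self):
        # The app cannot scan by itself: control passes to the OS and the app is paused.
        print("OS: pausing the app and opening Scene Setup...")
        self.scene_model = {"walls": 4, "furniture": ["DESK"]}   # the user completed the scan

def start_mr_app(runtime):
    if runtime.scene_model is None:
        print("App: prompting the user, then handing over to the system flow")
        runtime.request_scene_capture()
    # On resume, re-query: the model may exist now, or the user may have cancelled.
    print("App resumed. Scene model available:", runtime.scene_model is not None)

start_mr_app(StubRuntime())
```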

You can see Meta’s Scene Setup at work here: Upload’s journalist spends 3+ minutes scanning his big room… without even marking a single piece of furniture. Notice that the app requesting room scanning actually opens the OS application to do that, because it can’t happen inside the application itself

The current version of Scene Setup on Quest 3 is thus, in my opinion, just good enough for us developers and some tech enthusiasts to use for some experiments. It’s absolutely not ready for consumer adoption, because it adds too much friction. Not to mention the fact that, of course, it is not even fully accurate at this stage.

Because of this friction and also because of privacy concerns (not everyone is happy with scanning his/her room on a device manufactured by Meta), many users won’t scan their room. This is why the amazing creator Lucas Rizzotto recently shared on LinkedIn the suggestion to build your MR app so that it can work also without a scene model. I think he’s right: while mixed reality needs the room model to be truly magical, at the current stage of technology your app should be able to somewhat work also without it. It should have a fallback state, maybe less “magical”, but still usable.
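A possible way to structure such a fallback, sketched with made-up names just to show the idea, is to branch the experience on how much scene data is actually available:

```python
# Sketch of the "graceful fallback" idea: the same app, with and without scene data.
# Invented names again; the branching structure is what matters.

def run_experience(scene_model):
    if scene_model and scene_model.get("furniture"):
        # Full "magic" mode: content reacts to the real walls and furniture.
        print("MR mode: the ball bounces off your walls, the NPC sits on your chair")
    elif scene_model:
        # Only the auto-detected room box: still some grounding, less magic.
        print("Room-box mode: content stays inside your walls, but ignores furniture")
    else:
        # No scan at all (friction or privacy concerns): passthrough-only fallback.
        print("Fallback mode: floating content over passthrough, fully usable anyway")

run_experience({"walls": 4, "furniture": ["CHAIR", "DESK"]})
run_experience({"walls": 4, "furniture": []})
run_experience(None)
```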

What about the future of mixed reality?

The good news is that Meta knows all of this and is already working on solving this issue. It has recently shared a research project called SceneScript, which is all about auto-detecting the objects that are in the room to provide a more believable mixed reality. We don’t know when this work will be transformed into a product running on Quest headsets, but I hope it will happen pretty soon, because it is a much-needed feature. And also because Apple is much further ahead in this game.

Apple ARKit already includes a feature called RoomPlan, which works on LiDAR-powered phones and lets you scan your room and automatically detect which pieces of furniture you have around you and what their dimensions are. But unluckily, RoomPlan doesn’t work on Vision Pro yet, so you can not use this amazing feature when building your MR application for Apple’s headset, either.


I really can’t wait for this feature to be deployed on all XR headsets… in the meantime, I’ll keep experimenting with the current manual features, to be ready to create something meaningful when the technology becomes more frictionless.

(Header image by Meta)


