
How to build an AI assistant with passthrough camera access on Quest

Two days ago, I published a video on YouTube showing a new mixed reality prototype of mine: an AI-powered interior design assistant on Quest that lets you place virtual elements in your room and then tells you whether the new arrangement of the room makes sense, giving suggestions on how to improve it. What makes it special is that the application analyzes the whole mixed environment, meaning it can access the images coming from the cameras of the Quest 3! In this post, I want to explain how I did it, to inspire you to build similar applications that exploit mixed reality.

(And I won’t stop here: next week, I’ll tell you how I did another cool demo exploiting mixed reality and artificial intelligence, but on another standalone headset, so stay tuned!)

Camera access, mixed reality, and artificial intelligence

Meta and the other headset vendors are betting a lot on mixed reality. But while the tools we developers already have let us create some interesting applications, they are, in my opinion, not enough. In particular, we need one very important thing: access to the stream of the passthrough cameras, so that we can analyze and manipulate the images coming from it. With that access, we could, for instance, run a plethora of AI/ML algorithms on the camera feed to recognize the objects or faces around the user and provide mixed reality experiences that are fully aware of the user's context, creating a true blend of real and virtual reality. As I already said in a post a few weeks ago, without camera access for developers, mixed reality will never be able to exploit its full potential.

Unfortunately, Meta is afraid that malicious developers may harm users' privacy, so it blocks camera access inside applications on Quest. This stance is overly conservative, and we developers are all demanding that it change.

The best method to access camera images on Quest

Many people tried without success to access the camera images on Quest. It seemed impossible until two weeks ago, when my friends at XR Workout (especially Michael Gschwandtner) found a very convoluted way to access the camera images from a Quest application. The solution worked, but it was very hacky: it required the user to initiate a streaming session and log in to their Meta account in a web view, so that the application could read the stream from the web view texture. They did a great job, but in my opinion, it was too complicated to be usable.

Their work anyway inspired another developer, Julian Triveri, to find a simpler solution: the application can use the Media Projection APIs (basically the ones that let you share your screen on Zoom when you are on Android) to access the content of the screen as a texture. This solution is much simpler: the user just has to approve a screen-sharing authorization, and then the application can use the screen texture however it wants. Julian shared a repository with a sample project employing this solution, and I can confirm that it works very smoothly: the popup asking for the authorization is clear and easy to dismiss, the camera frames have a decent resolution (around 1024×1024 on Quest 3), and they have almost no lag compared to what you are seeing on screen. That's amazing! Thanks to it, we have a fairly easy way to access camera frames from Unity on Quest!

The drawbacks of this solution

You may wonder why in the previous paragraph I was talking about screen sharing and not camera frame grabbing: well, as we already know, Meta blocks access to the cameras at the OS level, so there is no way to get their frames directly. The workaround that the XR Workout people and Julian Triveri found is this: since the application is in mixed reality, the screen of the headset is showing the passthrough images of what you have in front of you, so you can grab the content of the screen to get an image of the physical reality in front of the user.

This is a very smart workaround, but it has various drawbacks. The first one is that if your application is in mixed reality, your screen shows both the passthrough and the virtual elements you put in the scene. So when you copy your screen content, you don't just get the passthrough frames, you get both the real and the virtual elements. If your mixed reality experience is full of virtual elements, your passthrough image (taken from the screen) is so dirty that it is unusable.

Then, since this method is recording your screen, a recording is already active on the OS side, so you can not use the recording functions offered by Meta Horizon OS. This is the reason why, to record the video above, I had to connect my Quest to the PC via a cable and use scrcpy and OBS to capture what I was doing in the headset. This is an issue if you want content creators to easily make videos of your experience.

The third big problem is that this is a hole that has just been found, and we don't know yet what Meta's stance on it is. The company may close it in a future update of the runtime, as it has already done many times in similar situations. And even if it doesn't close the hole, it may still block the publication on the Horizon Store of any application using it, de facto relegating this solution to pure R&D contexts.

So while the solution is working, it’s still not ideal in a production environment: we still need Meta to give us proper camera access.

How to solve the problem with the mix of elements

The problem of accessing a screen copy that includes both real and virtual elements can be solved in most cases. For instance, I thought about the following ideas:

  1. Smart arrangement of virtual elements. If you are making a game and you have a HUD, the HUD can appear only in the periphery of the vision, so that the center of the screen shows pure passthrough pixels that you can analyze without problems
  2. Dedicated image analysis moment. If you need to analyze passthrough only at specific moments, you can prompt the user to take a picture and make all the virtual elements disappear while the picture is being taken. For instance, in a cooking application that analyzes the fridge, you can ask the user to take a picture of the fridge, hide all the virtual elements, and then, after the user “clicks” to take the picture, make all the UI appear again (see the sketch below this list)
  3. Non-disturbing virtual elements. You can use virtual elements that do not disturb your image analysis. For instance, if you need camera access to perform ArUco marker tracking to put a floating Earth on top of a marker, you probably shouldn't put black-and-white square elements in the scene that may confuse the marker tracker. And you wouldn't place the floating Earth so that it occludes the pixels of the marker, but floating above it. If your virtual elements do not occlude or disturb the analysis of the physical elements you want to “see”, there is no problem with having them in the image you are analyzing
The Flaivor application could use approach number 2 when you take a picture of your fridge
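
To make approach number 2 more concrete, here is a minimal sketch in Unity C# of the “hide, capture, show again” flow. The names (virtualContentRoot, GetLatestCaptureFrame) are hypothetical and not taken from any real project; I'm assuming you already have a way to read the current screen-capture frame, for example the texture exposed by Julian Triveri's sample project.

```csharp
using System.Collections;
using UnityEngine;

// Minimal sketch of a "dedicated image analysis moment" (approach 2).
// Hides all virtual content, waits for the screen to refresh so that the
// captured frame contains only passthrough pixels, then shows it again.
public class CleanCaptureHelper : MonoBehaviour
{
    [SerializeField] private GameObject virtualContentRoot; // parent of all virtual elements (hypothetical)

    public void TakeCleanPicture(System.Action<Texture2D> onCaptureReady)
    {
        StartCoroutine(CaptureRoutine(onCaptureReady));
    }

    private IEnumerator CaptureRoutine(System.Action<Texture2D> onCaptureReady)
    {
        virtualContentRoot.SetActive(false);   // hide all virtual elements
        yield return null;                     // wait at least one frame...
        yield return new WaitForEndOfFrame();  // ...so the screen now shows pure passthrough
        // Depending on the latency of the screen capture, you may need to wait a few more frames here.

        // Read the latest frame from your screen-capture source
        // (e.g. the texture exposed by the screen capture manager) and hand it over.
        Texture2D cleanFrame = GetLatestCaptureFrame(); // hypothetical accessor
        onCaptureReady?.Invoke(cleanFrame);

        virtualContentRoot.SetActive(true);    // bring the UI and virtual elements back
    }

    private Texture2D GetLatestCaptureFrame()
    {
        // Placeholder: plug in your actual screen-capture texture here.
        return null;
    }
}
```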

How to exploit the mix of elements

When making the interior design demo, I wanted to take a different approach from the ones listed above: instead of thinking about how to perform the camera image analysis despite the presence of virtual elements on the screen, I wanted to see if I could exploit the fact that I had both types of elements in the image to analyze. Having access to the whole mixed reality image meant that I could ask an AI to analyze a whole mixed reality context, which could be very interesting. The simplest idea I had was interior design: if I already have a room and I want to put a few new elements in it, I can add these elements virtually and ask an AI assistant to tell me whether the new elements fit the existing room and, if not, suggest how to improve the setup. This would be a nice example of how to exploit mixed reality analysis thanks to AI. This way, I was able to transform the weakness of this camera access solution into a strength.

In the following sections, I’ll tell you briefly how I did the demo. It won’t be a step-by-step tutorial, but I’ll give you enough hints that if you are a Unity developer, you can understand how to replicate the demo yourself.

How I set up the project

Julian Triveri made a GitHub repo where you can access a demo of his solution. The cool thing is that it is not a plugin, but a full Unity project, already configured to build properly for Quest. So the first thing I did was clone his repository and set out to create my demo by modifying his project.

I changed the name of the application and the Android package in the project settings to fit my needs. Then I changed the scene to build: the default one is about ArUco tracking, while I wanted to start from a clean scene, so I set the build to use the SampleScene that is in the ScreenCaptureTexture folder. I removed all the virtual elements from the scene (the cube, the sphere, etc…), obtaining a scene featuring only the camera rig, the passthrough manager, and the game object for camera access (“Screen Capture Manager”).

How I added furniture to the room

The first thing I wanted to do was add furniture to the room. This is standard MR development, so it's not rocket science. First of all, I imported the whole Meta All-In-One SDK to have access to all the functionalities of the Meta SDK: the demo project only includes the Core SDK, which does not bring in all the interaction features I needed to quickly prototype the addition of furniture.

I added the Controller Tracking block (from the Meta building blocks) to have the controllers in the scene. Then I added a World Space canvas to the left controller and put on it all the buttons I needed to add furniture. Then I asked the Meta SDK to make the canvas interactive in VR by right-clicking on its game object, selecting Interaction SDK -> Add Ray Interaction To Canvas, and following the instructions in the popup that appeared. At this point, I had a fully functional canvas on my left hand that I could point at and click with my right hand. I then added a little Furniture Manager script that makes a type of furniture appear in the room when I click the related button (you can find a sketch of it after the image below).

This is how you make a canvas interactive for Quest. The operation adds some weird game objects as children of the Canvas
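
For reference, here is a minimal sketch of what such a Furniture Manager could look like. The class and member names are my own invention, not the actual script of the demo: each button on the left-hand canvas simply calls SpawnFurniture with the index of the prefab it represents.

```csharp
using UnityEngine;

// Minimal sketch of a furniture spawner: each UI button calls SpawnFurniture
// with the index of the prefab it represents. Names are illustrative only.
public class FurnitureManager : MonoBehaviour
{
    [SerializeField] private GameObject[] furniturePrefabs; // one prefab per button

    // The piece currently being previewed/placed, following the right controller
    public GameObject CurrentFurniture { get; private set; }

    // Hook this method to the OnClick event of each button on the left-hand canvas
    public void SpawnFurniture(int prefabIndex)
    {
        // If we were already previewing something, remove it before spawning a new one
        if (CurrentFurniture != null)
            Destroy(CurrentFurniture);

        CurrentFurniture = Instantiate(furniturePrefabs[prefabIndex]);
    }

    // Called when the user presses the trigger to confirm the placement
    public void FinalizePlacement()
    {
        CurrentFurniture = null; // leave the object where it is and stop moving it
    }
}
```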

Now I needed a way to preview the furniture element depending on where I pointed my right controller. So I added a Scene Mesh building block to the scene, to get access to the mesh of the room. By default, the building block references a prefab in the Packages folder that tells it which room mesh game object to instantiate. I duplicated the prefab it was referencing (because you can not modify a prefab in the Packages folder), and in the new room mesh prefab I disabled the Mesh Renderer, because I only wanted the collider of the room, not a visible mesh. I also put this prefab on a dedicated RoomMesh layer.

I then modified the script that adds the furniture to cast a ray from the right controller, along its forward direction: if the forward vector of the right controller intersects the collider on the RoomMesh layer, the currently spawned piece of furniture is moved to that point, with its rotation matching the forward vector of the controller projected on the XZ plane. The final touch was intercepting the press of the trigger on the right controller, thanks to the methods of the OVRInput class, and finalizing the placement of the current piece of furniture when that happens.
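
Here is a minimal sketch of that placement logic, assuming the room mesh collider lives on a layer called “RoomMesh” and that the script has a reference to the Furniture Manager sketched above (again, the names are mine, not the ones of the actual demo):

```csharp
using UnityEngine;

// Minimal sketch of the furniture placement logic: raycast from the right
// controller against the room mesh, preview the piece at the hit point,
// and confirm the placement when the trigger is pressed.
public class FurniturePlacer : MonoBehaviour
{
    [SerializeField] private FurnitureManager furnitureManager; // hypothetical reference to the manager above
    [SerializeField] private Transform rightControllerAnchor;   // e.g. the RightHandAnchor of the camera rig

    private void Update()
    {
        GameObject current = furnitureManager.CurrentFurniture;
        if (current == null)
            return;

        // Raycast only against the RoomMesh layer, where the (invisible) room collider lives
        int roomMask = LayerMask.GetMask("RoomMesh");
        Ray ray = new Ray(rightControllerAnchor.position, rightControllerAnchor.forward);

        if (Physics.Raycast(ray, out RaycastHit hit, 10f, roomMask))
        {
            // Move the preview to the hit point on the room mesh
            current.transform.position = hit.point;

            // Rotate it to match the controller forward projected on the XZ plane
            Vector3 flatForward = Vector3.ProjectOnPlane(rightControllerAnchor.forward, Vector3.up);
            if (flatForward.sqrMagnitude > 0.001f)
                current.transform.rotation = Quaternion.LookRotation(flatForward, Vector3.up);

            // Confirm the placement when the right trigger is pressed
            if (OVRInput.GetDown(OVRInput.Button.PrimaryIndexTrigger, OVRInput.Controller.RTouch))
                furnitureManager.FinalizePlacement();
        }
    }
}
```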

How I added Artificial Intelligence analysis

At this point, I had a way to put furniture in my real environment, but I still needed the AI analysis. I decided to use OpenAI for it, because it's the GenAI platform I have the most experience with. There is already a plugin that lets you invoke OpenAI from Unity (you can find it here: https://github.com/srcnalt/OpenAI-Unity), but unluckily it does not give you the ability to send images. I moved the plugin from the Packages folder to the Assets folder and modified it so that it could take images as parameters. Images are sent to OpenAI as Base64-encoded strings: luckily, there are already various forum threads online that teach you how to do it. If you do not have the technical skills to do it, you can find another OpenAI wrapper that already supports image analysis. I won't go into much detail about this process here, since this is not a tutorial about OpenAI, but I'm happy to help via e-mail in case you get stuck.
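
To give an idea of what sending an image involves, here is a minimal sketch of how a Unity Texture2D can be turned into the Base64 data URI expected by OpenAI's vision-capable chat completions endpoint. It bypasses the plugin entirely and uses a raw UnityWebRequest, so take it as an illustration of the payload format rather than the code I actually wrote:

```csharp
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

// Sketch: send a screenshot to the OpenAI chat completions endpoint as a
// Base64 data URI. Model, prompt, and error handling are kept minimal.
public class OpenAIImageAnalyzer : MonoBehaviour
{
    [SerializeField] private string apiKey = "YOUR_API_KEY"; // don't hardcode this in a real project

    public IEnumerator AnalyzeImage(Texture2D screenshot, string prompt, Action<string> onResponse)
    {
        // Encode the texture as PNG, then as a Base64 data URI.
        // The texture must be CPU-readable; otherwise blit it to a RenderTexture and ReadPixels first.
        string base64Image = Convert.ToBase64String(screenshot.EncodeToPNG());
        string dataUri = "data:image/png;base64," + base64Image;

        // Build the JSON body with a text part and an image part in the same user message.
        // In a real app, use a JSON serializer so that the prompt gets escaped correctly.
        string body =
            "{\"model\":\"gpt-4o\",\"messages\":[{\"role\":\"user\",\"content\":[" +
            "{\"type\":\"text\",\"text\":\"" + prompt + "\"}," +
            "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + dataUri + "\"}}]}]}";

        using (UnityWebRequest request = new UnityWebRequest("https://api.openai.com/v1/chat/completions", "POST"))
        {
            request.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("Authorization", "Bearer " + apiKey);

            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                onResponse?.Invoke(request.downloadHandler.text); // raw JSON, still needs parsing
            else
                Debug.LogError("OpenAI request failed: " + request.error);
        }
    }
}
```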

Once I had my OpenAI connection, I just had to register to the OnTextureInitialized event of the Quest Screen Capture Texture Manager script (on the “Screen Capture Manager” game object), which is triggered whenever Julian Triveri's plugin has a new image of the screen, convert the data of that texture to Base64, and send it to OpenAI for analysis with a prompt I wrote, which sounded like “You are an interior designer, tell me if the virtual elements fit well with the physical environment” (again, I'm happy to share the real prompt if you are curious about it).
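
Putting the pieces together, the wiring looks roughly like the sketch below. I'm assuming the capture manager hands you the screen texture through a callback, as described above; the exact class and member names in Julian Triveri's project may differ, so treat the integration points as assumptions:

```csharp
using UnityEngine;
using UnityEngine.UI;

// Rough wiring between the screen capture manager and the OpenAI analyzer.
// The way the capture manager delivers its texture is an assumption based on
// the description above, not the exact API of the sample project.
public class InteriorDesignAssistant : MonoBehaviour
{
    [SerializeField] private OpenAIImageAnalyzer analyzer; // the sketch from the previous section
    [SerializeField] private Text answerLabel;             // where to show the AI feedback

    private Texture2D latestScreenFrame;

    // Call this from the capture manager's "new texture" callback
    public void OnScreenTextureUpdated(Texture2D screenTexture)
    {
        latestScreenFrame = screenTexture;
    }

    // Call this when the user asks for feedback (e.g. presses an "Analyze" button)
    public void RequestAnalysis()
    {
        if (latestScreenFrame == null)
            return;

        const string prompt = "You are an interior designer. Tell me if the virtual elements " +
                              "fit well with the physical environment and how to improve the arrangement.";

        StartCoroutine(analyzer.AnalyzeImage(latestScreenFrame, prompt, json =>
        {
            // For a quick demo it is enough to show the raw answer; a real app would
            // parse the JSON and extract choices[0].message.content
            answerLabel.text = json;
        }));
    }
}
```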

Interestingly enough, GPT-4o could somewhat understand which elements were virtual and which were real. I know this because I wrote a prompt asking it to count the virtual elements in the images, and it was able to do it with decent accuracy: it was not perfect, it made some errors, but for a POC the accuracy was enough. For sure, it was also helped by the fact that I chose a virtual furniture package with simple graphics that was not photorealistic at all. Thanks to this, the system was able to tell me whether the virtual elements looked good in my physical space or not.

The final step was taking the output of ChatGPT and showing it in a text element to the user. And this completed my demo.

After that, I shot a video of myself playing with it and I published it on YouTube:

Voilà, the demo is working!

Final considerations

It was fun to play with mixed reality and artificial intelligence on Quest. I was happy to be able to deliver, in a couple of hours, a demo of what the future of MR may hold for us, with artificial intelligence always giving us help that is contextual to what we are doing. This was possible only thanks to the work of the amazing XR community: without the great people at XR Workout and Julian Triveri, something like this would not have been possible.

I think the time is ripe for Meta to give us camera access: instead of relying on hacky solutions, we should have an official way to do it, with privacy and safety embedded in it, and with the user having full control over who gets access to the cameras. It would also be cool to have access to the mixed reality stream: after this test, I realized that analyzing the real and the virtual elements together is interesting as well.

Only with proper camera access for developers can we have a real evolution of our mixed reality ecosystem. Otherwise, we are stuck in the current limiting situation that zucks…


