Today I’m writing a deep dive into Visual Positioning Systems (VPS), one of the foundational technologies of the future metaverse. You will discover what a VPS service is, its characteristics, and its use cases, not only in the future but already in the present. As an example of a VPS solution, I will give you some details about Immersal, which is one of the leading companies in this field. There is a lot to say and I’m sure you will find this article super informative, so let’s go!
[Disclaimer: this is a paid article built in collaboration with Immersal. In this blog, paid articles maintain the same objectivity, passion, and detail as non-paid ones. They are also completely written by me. A company can pay for an article just to be sure that I mention its product and that I publish the post within a certain timeframe. That’s why I don’t call them “sponsored” articles, but “paid”: I’m not here to sell you anything, just to inform you, as usual.]
What is VPS?
VPS stands for Visual Positioning System. Slightly modifying Niantic’s definition, we can say that “a VPS is a cloud service that enables applications to localize a user’s device at real-world locations. Usually, this is used to let users interact with persistent AR content”.
If you want a more technical definition, Immersal has a good one for you: “A Visual Positioning System (VPS) utilizes sophisticated computer vision methods to determine a device’s position and orientation within an environment in real time. It works by processing camera images and analyzing the resulting data together with a database of spatial maps. By recognizing visual cues in the data and understanding their relationship to each other, VPS can accurately localize the device and its orientation within the environment”.
Putting it in layman’s terms, a VPS is a service that detects the exact position and rotation of your device (e.g. your phone) in relation to a physical place, so that you can correctly interact with AR content placed in that location. Let’s use an example to explain it better: imagine that you want to create an AR experience in the middle of a park in your city, so that a big virtual dragon comes out of a certain fountain. You want all users to see the dragon coming out of the middle of the fountain, no matter where they are in the park. So the devices the users are holding, either phones or AR glasses, must have a way to know their exact position and orientation with regard to the fountain, so that they can all put the dragon in exactly the same physical location. The best solution you have for this is to use a VPS service.
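Just to make the idea tangible for the developers among you, here is a minimal sketch (my own illustration, not any specific VPS API) of why knowing the device pose solves the dragon problem: once the VPS tells you where the device is in the map’s coordinate frame, placing content anchored to the fountain is a simple matrix operation. All names and numbers below are hypothetical.

```python
import numpy as np

def make_pose(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

# Pose of the device in the map (park) frame, as a VPS would return it.
device_in_map = make_pose(np.eye(3), np.array([12.0, 1.5, -3.0]))

# The dragon is authored at a fixed spot in the map frame: the fountain center.
dragon_in_map = make_pose(np.eye(3), np.array([20.0, 0.0, -3.0]))

# Where the dragon should be rendered relative to the device/camera.
dragon_in_device = np.linalg.inv(device_in_map) @ dragon_in_map
print(dragon_in_device[:3, 3])  # -> [8. -1.5  0.]: 8 m ahead along x, 1.5 m below the device
```

Every user’s device computes the same `dragon_in_map` anchor, so everyone sees the dragon in the same physical spot, each from their own point of view.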
Why do we need VPS?
I can hear some of you saying “Why do we need VPS when we have other technologies to map where the users are in a place?”. You are right, we have many tracking technologies, and each one has its own use case, with VPS being unbeatable at accurately finding the pose of your device with regard to a large physical location:
- GPS is great for giving the user coarse information about his/her geographical location. GPS, together with the sensors on the phone, is all that we need to orient ourselves on a 2D map like the Google Maps we use every day. The problem with GPS is that it gives a coarse location: the usual detection error is 1-5 m, which is irrelevant on a 2D map, but becomes a problem when, for instance, I want to put some information in AR on the window of a shop. 5 meters of error means that the info could be attached to the next-door shop instead;
- AR libraries like ARKit, ARCore, or even Meta Insight on Quest, are fantastic for local tracking. If you are playing an AR experience in a room, they are the way to go. But first of all, they usually do not know in which place they are starting (unless some cloud anchor is used): they just start in the user’s room and use some local surfaces as a reference system. Then, they are made for small spaces, and if you move very far from the initial position, the tracking starts to drift and the virtual elements start moving away from their initial positions, detaching from the physical world;
- 2D markers… I mean, they feel a bit old. If your experience is tailored to a specific planar image, they are the way to go, but this is not a common scenario outdoors, for instance. Unless you want to put a huge textured blanket in the park, you cannot use markers to show the dragon on the fountain in the above example. Furthermore, users should always keep the marker framed to see the augmentations, and this is annoying because it forces them to always look down;
- 3D markers: better than the above scenario, but they require an accurate 3D mesh reconstruction of the element to augment and then the training of some ML classifier to detect the object (which may take a lot of time). Augmentations work only if the 3D element used as a marker is currently visible. They are very useful if your purpose is to augment a specific physical object, but they are still pretty cumbersome and sometimes pretty expensive.
All the above technologies have their specific use cases, but VPS services are the best technology available to guarantee that the device detects its absolute position and orientation in a certain indoor or outdoor location, even a pretty large one. It is the technology to use when you want to augment a specific place for multiple people in a coherent way.
Use cases of VPS
Before digging into the details of how a VPS system works under the hood, let’s evaluate its use cases.
The first one that comes to mind when talking about letting users know their position in a space is building an indoor navigation system. Imagine being in a big shopping center, looking for a specific shop: personally, when I do this, if I try to follow the indications of the maps scattered around the place, I get lost 100% of the time. It would be great if you could have an AR system showing on your phone screen some arrows that tell you the way from where you are to the shop you want to reach. VPS systems can help build exactly that: since they can localize the position and rotation of every device, they know where the user is and can guide him/her to the destination. Immersal has in fact developed a similar solution for Mall of Tripla, an 85,000 m² shopping mall in Helsinki. But we can think about other situations where indoor navigation may be very helpful, like hospitals or airports.
Another use case is the superimposition of virtual elements on a building for industrial purposes. For instance, a pretty common request for AR applications is being able to see the network of pipes superimposed on the floor or ceiling of a building, or even outside in the streets, to facilitate the work of maintenance workers. The right technology for this is again VPS, because it can track the pose of your phone across a large area and so it can help in superimposing the pipe system over the physical location. Immersal powered the AR4FM app by Granlund to provide this use case. Caverion AR by FlyAR had a similar function of overlaying BIM data on top of a real building for maintenance use cases.
Talking about more fun things, we can also mention entertainment and marketing. What if every child could see their favorite cartoon character in a specific place in a city? What if you could see augmented reality information overlaid on a stadium while you watch the match, no matter what seat you are in? What if there could be some virtual show happening in the middle of a shopping center to make your shopping experience more amusing? All these experiences need VPS to make sure the virtual elements stay attached to the physical location they are augmenting.
Looking at things more long term, a clear use case of VPS services is the metaverse, the forbidden M-word that companies now like to call “large-scale spatial computing”. The metaverse requires that all our reality becomes augmented and that we all consistently see the same augmentations in the same locations of the physical world. So if I see an AR popup informing me of a discount at a shop, everyone else should see it in the exact same location. The same goes for the huge dragon in the fountain in the park: all the other people should see it in the exact same physical place, doing the exact same things. To make sure that we can all see these virtual elements in a consistent way all around our cities, we need a system that is able to accurately detect the position and rotation of our devices at city scale. And this is exactly what a VPS service does.
VPSes are already useful now for some specific use cases, but in the long term they are the foundation of our shared mixed-reality future, that is… the metaverse.
How does a VPS work?
If you are a tech guy like me, at this point you are probably thinking “Ok Tony, I got that VPS can track the pose of my phone everywhere in a location, but how is this possible?”. Let me go a bit deeper into the technical details and walk you through the whole process that makes a VPS service work.
Feature Detection
As with many modern functionalities based on computer vision, VPS relies on feature detection. According to the definition given by Immersal, a “Feature Point is a distinct, high-contrast visual feature in an image. A corner of a poster on the wall, the grain on a wooden floor or a detail in the facade of a building”. Trying to put this definition in layman’s terms too, we can say that a feature point is a point of an image depicting a little corner. The more textured an area is, the more feature points it contains, because a richer texture means more little corners depicted in the image.
There are various types of feature points and many algorithms to detect them: if you are into computer vision, you are surely familiar with terms like KLT, SIFT, and SURF. The reason why it is important to detect these “corner” features is that corners have distinctive characteristics on both the X and the Y axes. Imagine being in front of a fully white wall, in a room with no shadows and even lighting. If I show you a video recorded with a phone moving in front of this wall, you just see full white in every frame, so you have no idea how the phone is moving. Now imagine that there are vertical black stripes on the white wall: if the phone moves vertically, you see the same striped pattern in every frame, so you have no idea at what vertical speed it is moving. But if it moves horizontally, now you can detect the movement because of the vertical stripes moving in the video. If instead of stripes there is a checkerboard, you can spot both vertical and horizontal movements, but you still lack information about the absolute position of the phone. But if some of the checkers are blue, others red, others yellow, and they form a specific pattern, now you can detect exactly where the phone is, because your brain can identify specific patterns of the drawing on the wall and match them with the image portrayed in the video. This is why it is important to have features with strong components on both the X and Y axes: they are easier to uniquely identify and they help spot movement on all axes.
VPS systems work in a similar way: they memorize the unique features of your space and then localize your device by matching the features that the device’s camera is currently seeing with the features that the system knows exist in that space.
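VPS providers use their own optimized feature detectors, but if you want to get a feel for what feature detection and matching looks like in practice, here is a minimal sketch using OpenCV’s ORB detector (my choice for illustration; not necessarily what Immersal uses internally, and the image filenames are placeholders):

```python
import cv2

# Load two images of the same place taken from slightly different viewpoints.
img_a = cv2.imread("fountain_view_1.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("fountain_view_2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect corner-like feature points and compute a descriptor for each of them.
orb = cv2.ORB_create(nfeatures=2000)
kp_a, desc_a = orb.detectAndCompute(img_a, None)
kp_b, desc_b = orb.detectAndCompute(img_b, None)

# Match descriptors: each match pairs a feature in image A with the most
# similar feature in image B, i.e. (hopefully) the same physical detail.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(desc_a, desc_b), key=lambda m: m.distance)

print(f"{len(kp_a)} features in A, {len(kp_b)} in B, {len(matches)} matched")
```

Each feature comes with a descriptor, a small signature of the pixels around it, and it is these descriptors that get matched, much like your brain matching the colored checkers in the wall example above.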
Mapping
Now that it is clear what a feature is, we can examine the steps through which a VPS service functions. The first step that a VPS service must go through to work in a specific space is mapping, that is, the system must memorize which feature points are available in the space where tracking should work. To do this, you usually need a companion mobile app: Immersal, for instance, has the Immersal Mapper, which I tried in its offices in Helsinki.
Immersal Mapper looks a bit like the camera app of your phone: you have to walk around the place where navigation should happen and shoot pictures of it from different points of view, so that the system can reconstruct the whole place. The Immersal app also has an automatic mode, where you just walk around the place as if you were recording a video, and the system automatically shoots a new picture every time it thinks it is a good moment to take one. After you have shot enough pictures, you can upload the data (which is a collection of images and metadata associated with them, like the pose of the phone when each picture was shot) and let the cloud crunch it to reconstruct a point cloud of the place you were in.
The cloud will extract the feature points of every picture and then merge the data of all these feature points to create a reconstruction of the place. I’m not going to describe here how the reconstruction algorithm works to not make you fall asleep out of boredom (the more geeky readers may look for “multiview stereo” online to read more about this, though), but you can imagine that a few things happen:
- Only the feature points that are truly reliable are used for the reconstruction: all the feature points that appear in only one picture but disappear in the next ones are probably just the result of noise, so they are discarded;
- The remaining “stable” feature points are matched with one another using the overlapping regions of the various images, to reconstruct the shape of the whole place. For instance, if in one image the system detects the feature points of a door to the left of those of a desk, and in another image there are the feature points of a desk to the left of those of a bookshelf, the system can use the desk overlap between the two images to reconstruct that on that side of the room there is a door, then a desk, then a bookshelf. Applying similar reasoning to all the images, the system gradually reconstructs the whole 3D shape of the space (the sketch right after this list shows the core geometric step, called triangulation). This operation is similar to the “stitching” done with multiple flat videos that have to be merged into a 360 video.
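Real reconstruction pipelines are far more sophisticated, but the core geometric step is triangulation: once the same stable feature has been matched in two pictures whose camera poses are known (remember the pose metadata uploaded with each picture), its 3D position can be computed. Here is a minimal sketch with OpenCV and made-up camera parameters, just to show the principle:

```python
import cv2
import numpy as np

# Intrinsics of the camera that shot the mapping pictures (made-up values).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Projection matrices of two mapping shots: the first at the origin, the
# second 0.5 m to the right (the poses come from the phone's metadata).
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

# Pixel coordinates of the SAME matched feature point in the two pictures.
pt1 = np.array([[400.0], [260.0]])
pt2 = np.array([[300.0], [260.0]])

# Triangulate: returns the point in homogeneous coordinates, shape (4, 1).
point_h = cv2.triangulatePoints(P1, P2, pt1, pt2)
point_3d = (point_h[:3] / point_h[3]).ravel()
print(point_3d)  # the feature's position in the map's 3D coordinate system
```

Repeat this for thousands of matched features across hundreds of pictures, refine everything jointly, and you get the point cloud of the place.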
Usually, there is a limit on the size of the maps that can be reconstructed with this operation, but the cool thing is that multiple maps can also be stitched together using the features of their overlapping areas. Thanks to this, VPS services can also work in big environments like university campuses or shopping malls. Actually, Immersal already aims at city-scale mapping, that is, having one big map of a whole city inside which a VPS tracking system can work.
All VPS systems perform mapping in a similar fashion, but not all of them make this operation explicit for the user. For instance, Google’s Geospatial VPS system does not ask the user to map the space, because Google itself has already mapped many cities using the images it acquired for Google Maps. Niantic does the mapping under the hood using Pokémon Go players: players are encouraged to scan a new part of the city to get some reward inside the game, without being aware that they’re performing a mapping operation for a VPS system. I think that gamifying the mapping operation has been a genius idea by Niantic.
The result of the mapping operation is a point cloud of stable features that reconstructs the whole place. This can be used in the next step, which is the one of Localization.
Localization
Once the map is ready, most of the work has been done. You just have to run your VPS-powered application and have it compare the current images seen by the camera with the model of the place reconstructed during the mapping operation.
At every frame, the system grabs an image from the camera of the device, extracts its features, and then compares the found features with the features of the model. Using some trigonometry magic (I could have said “boring stuff”, but “magic” sounds more exciting), it is possible to reconstruct the rotation and position of the camera by matching the pixel positions of the features found in the current frame with the 3D positions of the same features recorded in the 3D model of the place. Once the system has this absolute pose, it knows exactly where the user is in the place, and so it can show augmentations at exact physical positions. This can also be done for every user in the same location, guaranteeing that they all see a consistent augmented reality.
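For the readers who want the magic to be a little less magic: the classical formulation of this step is the Perspective-n-Point (PnP) problem, i.e. finding the camera pose that best explains how known 3D map points project onto the current frame. Below is a minimal, generic sketch with OpenCV and made-up values; it is not Immersal’s actual implementation.

```python
import cv2
import numpy as np

# 3D positions (in the map's coordinate frame) of features that have been
# matched against features detected in the current camera image.
map_points = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [1.0, 1.0, 0.0],
                       [0.5, 0.5, 1.0],
                       [0.0, 0.5, 1.0]])

# Pixel coordinates of the same features in the current camera image.
image_points = np.array([[120.0, 40.0],
                         [520.0, 40.0],
                         [120.0, 440.0],
                         [520.0, 440.0],
                         [320.0, 240.0],
                         [186.7, 240.0]])

# Camera intrinsics (made-up; in a real app they come from the AR framework).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Solve the Perspective-n-Point problem: find the camera pose that makes the
# 3D map points project onto the observed pixels. Real systems use a RANSAC
# variant (cv2.solvePnPRansac) to be robust against wrong feature matches.
ok, rvec, tvec = cv2.solvePnP(map_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)       # rotation from map coordinates to camera coordinates
camera_position = -R.T @ tvec    # device position expressed in the map frame
print(camera_position.ravel())   # ~[0.5, 0.5, -2.0] for these sample values
```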
When I tried Immersal in its offices, I remember that after scanning the room we were in and having the cloud reconstruct its point cloud, we proceeded to visualize on the tablet the feature point cloud of the room superimposed on the room itself. This was a good way to test the localization: if the tracking was working correctly, we could see the point cloud perfectly superimposed on the physical elements it represents. And I can say that the system was working very well, because the virtual points replicated exactly the shape of the physical room.
AR tracking
Once localization works, you can superimpose virtual elements onto the room you are in, so as to offer augmented reality to the user. But performing VPS localization every frame is a very intensive operation for a mobile device, so usually the device is tracked with more lightweight standard SLAM technologies (e.g. ARKit, ARCore), and then every 1-5 seconds the tracking is corrected with the absolute pose provided by the VPS. This creates a good combination of performance and reliability.
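A very simplified sketch of this hybrid pattern, with class and method names of my own invention (this is not how any specific SDK exposes it): local SLAM provides a pose every frame, and whenever a fresh VPS result arrives, the transform between the SLAM world and the map is re-estimated and applied to all subsequent frames.

```python
import numpy as np

class HybridTracker:
    """Illustrative hybrid tracking: per-frame SLAM poses + periodic VPS fixes."""

    def __init__(self, vps_interval_s: float = 3.0):
        self.vps_interval_s = vps_interval_s
        self.last_vps_time = float("-inf")
        # Transform from the SLAM session's arbitrary origin to the map frame.
        self.map_from_slam = np.eye(4)

    def on_frame(self, slam_pose: np.ndarray, now: float) -> np.ndarray:
        """slam_pose: 4x4 device pose in SLAM coordinates; returns it in map coordinates."""
        if now - self.last_vps_time > self.vps_interval_s:
            vps_pose = self.query_vps()  # 4x4 device pose in the map frame, or None
            if vps_pose is not None:
                # Re-anchor the SLAM world so that it agrees with the fresh VPS fix.
                self.map_from_slam = vps_pose @ np.linalg.inv(slam_pose)
                self.last_vps_time = now
        # Between VPS fixes, lightweight SLAM tracking carries the pose forward.
        return self.map_from_slam @ slam_pose

    def query_vps(self):
        # Placeholder: a real app would send the current camera frame to the VPS
        # (on device or in the cloud) and receive an absolute pose back.
        return None
```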
How do you develop an application using VPS?
If you want to implement VPS in your application, you usually rely on existing VPS services like Immersal, Google Geospatial, or Niantic Lightship. These services already take care of all the heavy lifting for what concerns the mapping and reconstruction algorithms, together with all the localization logic.
You usually just have to import the SDK of the platform you have chosen, and then use its scripts to do a couple of things:
- Load the map of the place that you have recorded during the mapping operation. Usually, it is either a file that you downloaded from the mapping service, or it is a reference to a map that you have created in your user account of that VPS service;
- Place the virtual objects. These services usually show, inside the game engine you have chosen, a preview of the place you are going to augment, and they let you put the virtual 3D elements wherever you want.
Immersal, for instance, has a Unity SDK that lets you preview in the editor the point cloud of the place you have mapped, so you can position the virtual elements in the 3D scene in a visual way. Then the scripts of the SDK simply do the magic of performing localization and tracking every frame, alone or in combination with other frameworks like AR Foundation.
If you want to go more low-level and just use the map to write some custom code around it yourself, you can still do it. From the Immersal servers, it is possible to download the following things for every saved map:
- The map file with .bytes extension. This is the actual map file used by the SDK for localization.
- A sparse point cloud representation of the map as a .ply file.
- A dense triangle mesh representation of the map as a .ply file.
- A textured triangle mesh representation of the map as a .glb file.
This gives the developer the maximum flexibility to develop the experience that he/she wants.
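For instance, if you wanted to inspect the sparse point cloud outside of the game engine, a few lines of Python are enough. This is just a sketch, assuming you have downloaded the .ply file (the filename below is a placeholder) and installed the open3d package:

```python
import numpy as np
import open3d as o3d

# Load the sparse point cloud exported by the mapping service.
cloud = o3d.io.read_point_cloud("my_map_sparse.ply")
points = np.asarray(cloud.points)

print(f"{len(points)} feature points")
print("bounding box min:", points.min(axis=0))
print("bounding box max:", points.max(axis=0))

# Quick visual sanity check of the reconstructed space.
o3d.visualization.draw_geometries([cloud])
```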
VPS Systems Characteristics
There are many VPS systems out there, and all of them have their own peculiarities. Let’s see some important characteristics to watch out for when you are looking for the system you should use.
Device compatibility
Not all VPS systems are compatible with all devices, and before choosing a service, you should check whether it works with the hardware you intend to use.
Compatibility concerns both the mapping and the localization operations. Mapping may be done with different pieces of hardware: I told you about the mobile phone, but actually it can also be carried out with 360 cameras, Matterport scanners, LiDAR scanners, or drones. Immersal is compatible with all of these. It actually is also compatible with custom solutions: it is not even necessary for the client to use the official Immersal Mapper app.
As for localization, compatibility means understanding which devices may run the applications powered by VPS. Immersal here is very strong because it can work on:
- Mobile devices that run ARKit, ARCore, or Huawei AR Engine
- AR glasses like Magic Leap, HoloLens, XReal, Rokid
- Mixed reality headsets like Pico 4E (a Vision Pro version is in the works)
- All devices compatible with WebAR, including mini applications inside WeChat
Immersal’s compatibility with so many pieces of hardware is possible because the VPS servers just work with REST APIs, which are platform-independent. If a new type of glasses is released, it is just necessary to make it communicate with the Immersal servers through these REST APIs to make it compatible with the system.
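To make the idea concrete, a localization request over REST could look roughly like the snippet below. The endpoint, field names, and response format are purely hypothetical placeholders, not Immersal’s real API; check the official developer documentation for the actual calls.

```python
import base64
import requests

def localize(image_path: str, intrinsics: dict, map_id: int, token: str) -> dict:
    """Hypothetical example of a REST localization request to a VPS server."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "token": token,            # developer/API token (placeholder field name)
        "mapId": map_id,           # which previously created map to localize against
        "image": image_b64,        # the current camera frame
        "intrinsics": intrinsics,  # fx, fy, cx, cy of the camera
    }
    # Hypothetical endpoint, for illustration only.
    response = requests.post("https://vps.example.com/localize", json=payload, timeout=10)
    response.raise_for_status()
    return response.json()  # e.g. a success flag plus the estimated position and rotation
```

Since any device that can capture a camera frame and make an HTTPS call can use such an API, the approach scales naturally to new hardware.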
On-device vs on-cloud localization
Some VPS systems need a connection to the cloud to work. These systems perform all the heavy lifting on the cloud, so that the application on the client can be more lightweight. Notice that I’m not talking about the mapping, which almost always needs the cloud to be performed: I’m talking about the localization. Localization on the device is lag-free and can work even in parts of the world with a bad internet connection, but it puts the local device under heavy stress (which also means faster battery drain). Many VPS systems just work with on-cloud localization because it’s easier to manage for the provider (updates to the localization algorithms must only be deployed on the server) and allows the client to be more lightweight.
Immersal supports both of them, and in fact, when you develop an application with its SDK, you are asked how to retrieve the map of the place that must be navigated. And since industrial clients care a lot about their private data and do not want to put the data about their factories on a random server on the Internet, Immersal also offers the possibility of a private deployment of the VPS services inside the customer’s own cloud space.
Indoor vs Outdoor
Some services work better indoors, while others perform better outdoors. Some may have been optimized for gaming scenarios, i.e. for tracking elements that are close to the user, while others are more oriented toward navigation in larger spaces.
Indoor and outdoor tracking pose different challenges. Outdoor scenes are affected more by lighting, so performing localization at night when the scene was mapped during the day may present complications, because the features may appear differently under different light conditions. Indoor scenes have more uniform lighting, but they usually contain many challenging surfaces, like transparent glass or mirrors, which confuse tracking algorithms.
Map scale
Some systems may work better in small spaces, while others may be oriented towards big areas. I’ve mentioned before the “city-scale” mapping that Immersal aims at and that is obtained by stitching many smaller maps together. Of course, this is also the mission of big players like Google and Apple.
Going city-scale introduces various challenges, like the fact that the whole map of a city can’t fit on the host device, and in any case the tracking can’t be done by comparing the current features against those of the whole city at every frame. That’s why the map has to be broken into smaller chunks, which have to be quickly streamed (preferably via 5G) to the tracking device, so that the user does not perceive any disruption of the service while he/she moves from one chunk to another. Immersal demonstrated that its city-scale approach works by mapping a roughly 1,000,000 m² area of Helsinki city center with 120+ separate maps that were aligned together.
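To give an idea of how chunking can work on the client side, here is a naive sketch that uses the coarse GPS fix to decide which map chunks are worth streaming. The chunk registry, coordinates, and thresholds are made up for illustration; this is not how Immersal actually implements its map streaming.

```python
import math

# Hypothetical registry of map chunks: id -> (latitude, longitude) of the chunk center.
CHUNKS = {
    "helsinki_central_station": (60.1719, 24.9414),
    "helsinki_senate_square": (60.1695, 24.9524),
    "helsinki_kamppi": (60.1690, 24.9310),
}

def distance_m(a, b):
    """Approximate ground distance in meters between two (lat, lon) pairs."""
    lat_m = (a[0] - b[0]) * 111_320.0
    lon_m = (a[1] - b[1]) * 111_320.0 * math.cos(math.radians(a[0]))
    return math.hypot(lat_m, lon_m)

def chunks_to_stream(gps_fix, radius_m=300.0):
    """Return the chunks close enough to the coarse GPS fix to be worth loading."""
    return [chunk_id for chunk_id, center in CHUNKS.items()
            if distance_m(gps_fix, center) <= radius_m]

print(chunks_to_stream((60.1701, 24.9400)))
```

The coarse GPS position narrows the search down to a few candidate chunks, and the VPS then does the fine, centimeter-level localization against only those.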
Openness
Some VPS systems just have their own pre-made maps, while others are open to you supplying your own maps of the places by scanning the environments. Some of them also let you connect to open initiatives like the Open AR Cloud, which promotes an open, interoperable 3D map of the world.
Google Geospatial, for instance, has the drawback that you are in Google’s hands: you cannot scan a place yourself; either Google has mapped a location well, or it hasn’t.
Immersal claims to be a fairly open system, a toolbox that customers can use as they want, even mixing their own tools with Immersal’s ones.
Pricing
VPS solutions have different prices: usually, they are free to start with, but then there is a monthly fee to pay if you want to build more professional applications. Immersal is free to experiment with, but a Pro license costs $99/month and an Enterprise one requires a private negotiation. (I have also arranged for you readers to get one free month of the Pro subscription if you use the special code SKARREDGHOST at checkout!)
When evaluating the solution that fits you, you should also verify which one is ideal for your budget.
Available VPS Systems
If you want to know some names of famous VPS systems to investigate, here are a few:
- Apple ARKit Geotracking
- Google Geospatial
- Niantic Lightship VPS
- Snap Landmarker
- … and of course, Immersal!
When I asked Immersal engineers for an honest comparison of their system with the other ones available, I was told that Google Geospatial is usually very good for outdoor locations with meter-level accuracy, but its performance depends on how well Google has mapped the place where the app should run. For outdoor locations that are not mapped well, for indoor locations, or if you need to customize the map or need centimeter-level accuracy, Immersal should offer better performance.
Niantic Lightship, instead, works well for gaming use cases, and thanks to the fact that its map of the world is crowd-generated, it keeps expanding to new locations. However, industrial companies may not be very happy to see their factories mapped and inserted into the public 3D map of a gaming company. So for B2B use cases, Immersal should offer more data safety.
I have not personally verified these claims with an objective test, so take this opinion with a grain of salt. As usual, my suggestion is to try things yourself: if you need a VPS service, choose the three that on paper fit best with your needs, then try them in the field and see which one works better in your actual conditions.
Conclusion
VPS systems are foundational for our future, which will be made of a shared persistent mixed reality. The technology that powers them is not easy to develop, but luckily there are already existing SDKs that do the heavy lifting for us. Immersal is one of the companies offering these services and I have been able to verify with my own eyes that it does a pretty good job.
I hope that this article has been able to foster in you some curiosity about VPS and will entice you to use this kind of service for some applications that are useful for you. And if you have any questions, of course, you can ask them in the comments and I will do my best to support you!
(Header image by Immersal)