Kinect for App Development

The way we interact with computers is changing, as we move away from the world of the desktop PC. Part of that move is to the intimate world of mobile and touch, but there’s another set of interactions we need to consider: how we work with a new generation of wall-sized displays.
That’s where depth cameras come in: relatively low-cost devices that can track hands, bodies, and even whole groups of people in 3D, giving us the ability to collaboratively interact with large displays, whether LCD panels or projections. The most mature of these cameras is Microsoft’s Kinect sensor. Perhaps best known as a new way of interacting with games on the Xbox, it’s also a tool for adding 3D human-scale interactions to more traditional applications, bringing what can only be described as Minority Report-style gestures.
Building these new apps should be easy, and it shouldn’t be expensive. That’s the message from the recent release of a low-cost Kinect-to-PC adapter for the second-generation sensor that ships with the Xbox One. You no longer buy a separate PC version of the Kinect sensor; instead, the adapter connects your existing Xbox Kinect to a PC. Download the SDK, and you’re ready to start building interactive apps.
While it’s low cost, Kinect for Windows is still a demanding piece of technology. You’re going to need not just a hefty graphics card, but also a dedicated USB 3.0 controller in your development PC. That’s because it’s pushing a lot of data about; a 3D image is going to be considerably larger than the equivalent 2D grab. Then there’s processing the depth data to extract face points, gestures, or skeletal models.
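As a rough back-of-envelope illustration (my own arithmetic, using the commonly quoted v2 stream formats of a 512×424 16-bit depth image and a 1920×1080 color image at 30 frames per second), the raw streams alone add up to far more than USB 2.0 could realistically carry:

    #include <cstdio>

    // Rough bandwidth estimate for the Kinect v2's raw depth and color streams.
    // Figures are approximate, ignore protocol overhead and the IR stream, and
    // treat color as 2 bytes per pixel (YUY2) before any BGRA conversion.
    int main() {
        const double fps = 30.0;
        const double depthBytes = 512.0 * 424.0 * 2.0;     // 16-bit depth frame
        const double colorBytes = 1920.0 * 1080.0 * 2.0;   // YUY2 color frame
        const double mb = 1024.0 * 1024.0;

        std::printf("Approximate raw stream data: %.0f MB/s\n",
                    (depthBytes + colorBytes) * fps / mb);
        // Prints roughly 131 MB/s, comfortably beyond USB 2.0's practical
        // throughput of around 35 MB/s, hence the dedicated USB 3.0 controller.
        return 0;
    }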
Shortly after I hooked up a Kinect to a test system, I spoke with Chris White from the Kinect team to learn just what the Kinect v2 and its SDK would give developers. “You get access to the core capabilities of the sensors,” White told me, “all the way up to skeletal tracking.” The Kinect APIs are available to Windows developers working at a low level in C++, to higher-level apps built on .NET, and through WinRT to Windows Store apps.
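To give a flavor of the low-level route, here’s a minimal C++ sketch of my own (not the SDK’s sample code) that opens the default sensor and polls the body stream for skeletal data using the Kinect.h interfaces; error handling is trimmed for brevity:

    #include <Windows.h>
    #include <Kinect.h>   // Kinect for Windows SDK 2.0; link against kinect20.lib
    #include <cstdio>

    int main() {
        IKinectSensor* sensor = nullptr;
        IBodyFrameReader* reader = nullptr;

        // Open the default Kinect v2 sensor and a reader for its body (skeletal) stream.
        if (FAILED(GetDefaultKinectSensor(&sensor)) || FAILED(sensor->Open()))
            return 1;

        IBodyFrameSource* source = nullptr;
        sensor->get_BodyFrameSource(&source);
        source->OpenReader(&reader);
        source->Release();

        while (true) {
            IBody* bodies[BODY_COUNT] = { nullptr };
            IBodyFrame* frame = nullptr;
            if (SUCCEEDED(reader->AcquireLatestFrame(&frame))) {
                frame->GetAndRefreshBodyData(BODY_COUNT, bodies);
                for (IBody* body : bodies) {
                    BOOLEAN tracked = FALSE;
                    if (body && SUCCEEDED(body->get_IsTracked(&tracked)) && tracked) {
                        Joint joints[JointType_Count];
                        body->GetJoints(JointType_Count, joints);
                        const CameraSpacePoint& head = joints[JointType_Head].Position;
                        std::printf("Head at (%.2f, %.2f, %.2f) m\n", head.X, head.Y, head.Z);
                    }
                    if (body) body->Release();
                }
                frame->Release();
            }
            Sleep(15);   // the sensor delivers frames at 30 fps
        }
    }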
White notes that the Kinect API’s capabilities start with its face-recognition features. Using the Kinect v2’s color camera, the SDK overlays a 2D mesh that deforms as the scanned face moves. “Each of the meshes we use is significantly better,” he told me. “The original Kinect mesh had 93 vertices; in v2 we have 1,400.” That’s the difference between a scanned face that looks close to yours and a creepy facsimile from the far side of the uncanny valley.
It’s a technology that Microsoft is already using in games like Kinect Sports Rivals to create avatars that look like the player. Outside the game world, detailed mesh capture like this can be used for more than face detection: it can also create detailed 3D meshes of an object. Once you have a mesh, it can be used to build what White calls “animation units”, providing a frame for rendering items on a model. It’s an approach that gives online stores “magic mirrors” that let you see what clothes will look like on you, keeping the cost of returns down.
This leads into an intriguing capability built into the SDK: Kinect Fusion. By moving a sensor around an object (or by using a turntable in front of a fixed Kinect), you can quickly generate a 3D point cloud, turning a single Kinect v2 into a 3D scanner. A good graphics card is essential, and the faster the better: the fastest cards work in real time, while slower cards need some processing time to build a 3D model.
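Fusion does the hard work of registering and merging frames on the GPU, but the basic step of turning a depth frame into a 3D point cloud can be sketched with the SDK’s coordinate mapper (obtained from the sensor via get_CoordinateMapper). A rough illustration, assuming the depth pixels have already been copied into a buffer:

    #include <Windows.h>
    #include <Kinect.h>
    #include <vector>

    // Convert one Kinect v2 depth frame into a cloud of 3D points in camera space.
    // Fusion goes further, registering and merging many such frames into a single
    // model on the GPU; this sketch only shows the per-frame projection step.
    std::vector<CameraSpacePoint> DepthFrameToPointCloud(
        ICoordinateMapper* mapper,
        const std::vector<UINT16>& depthBuffer)   // 512 x 424 depth values, in mm
    {
        std::vector<CameraSpacePoint> mapped(depthBuffer.size());
        mapper->MapDepthFrameToCameraSpace(
            static_cast<UINT>(depthBuffer.size()), depthBuffer.data(),
            static_cast<UINT>(mapped.size()), mapped.data());

        // Keep only valid pixels; the mapper marks unknown depths with
        // negative-infinity coordinates.
        std::vector<CameraSpacePoint> cloud;
        for (const CameraSpacePoint& p : mapped)
            if (p.Z > 0) cloud.push_back(p);
        return cloud;
    }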
With 3D printers becoming increasingly common as prototyping tools, it’s easy to imagine using a Kinect attached to a CAD workstation to quickly capture physical objects for inclusion in rough sketches. Scanned objects can be converted into 3D printing models – and as Kinect v2 captures color information, it’s possible to print in color on the latest 3D printers. The latest version of Microsoft’s Windows 8.1 3D Builder printing tool uses Fusion to capture models that can be sent straight to a printer, or used in modelling apps. While modelling yourself as Han in carbonite may be a novelty, it’s an example of how the physical and digital are starting to merge.
One example of how the latest version of Kinect for Windows can be used comes from kitchen appliance manufacturer Amana, which uses Kinect as a key component of its store displays. In the past, display appliances often needed to be modified to add interactivity, making them unsaleable. Using Kinect to add person and gesture recognition to a display not only keeps floor appliances available for sale, it also makes it easier to attract customers to the display.
Three separate Kinect v2 sensors provide depth data: one under the unit, looking up, and two at the top, looking down. This combination creates a set of 3D hotspots that trigger events when people move through them, and by ensuring that actions are deliberate, it gives the PC behind the display a relatively simple mechanism for determining intent. If a prospective customer moves towards a dishwasher and opens its door, the display starts showing information about the device, much the way a dedicated salesperson would interact with a customer. To attract people to the display, the main screen uses a flock of butterflies along with Kinect’s skeletal tracking. Step into the edges of the hotspots, and butterflies will fly into an outline of your body, guiding you into engaging with the display.
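Amana’s own hotspot code isn’t public, but the underlying idea is simple enough to sketch: define boxes in the sensor’s camera space and fire an event when a tracked joint sits inside one. The Hotspot type and callback below are my own illustration, not part of the SDK:

    #include <Kinect.h>
    #include <functional>
    #include <string>
    #include <vector>

    // A hypothetical axis-aligned 3D hotspot, defined in camera space (meters).
    struct Hotspot {
        std::string name;
        CameraSpacePoint min, max;
        bool Contains(const CameraSpacePoint& p) const {
            return p.X >= min.X && p.X <= max.X &&
                   p.Y >= min.Y && p.Y <= max.Y &&
                   p.Z >= min.Z && p.Z <= max.Z;
        }
    };

    // Fire a callback whenever a tracked hand or spine joint sits inside a hotspot.
    void CheckHotspots(const std::vector<Hotspot>& hotspots,
                       const Joint joints[JointType_Count],
                       const std::function<void(const std::string&)>& onEnter) {
        const JointType watched[] = { JointType_HandLeft, JointType_HandRight,
                                      JointType_SpineMid };
        for (const Hotspot& spot : hotspots)
            for (JointType j : watched)
                if (joints[j].TrackingState == TrackingState_Tracked &&
                    spot.Contains(joints[j].Position))
                    onEnter(spot.name);
    }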
One of the Kinect for Windows SDK’s features is an interesting example of the way next-generation technologies are used to make developers’ lives easier. The Visual Gesture Builder uses machine learning to create gesture-recognition code: recognizing a wave, for example.
As White notes, “How people perform gestures varies, which makes it hard to detect intentional versus unintentional gestures. We’ve done a lot of research into the right way of capturing things, recording a bunch of people doing things, and then outputting code to use with events.” It’s an approach that works well for iconic gestures, like golf swings or soccer kicks. Machine learning captures the common elements of a gesture and produces generalized code that can be used with a wide cross-section of users.
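To see why Microsoft reached for machine learning here, it helps to look at what a hand-written wave detector involves. The naive sketch below (my own, not what Visual Gesture Builder produces) already depends on thresholds that won’t hold across different users, which is exactly the variation the tool’s trained detectors are meant to absorb:

    #include <Kinect.h>

    // A naive, hand-written "wave" heuristic: the right hand stays above the elbow
    // and swings from one side of the elbow to the other several times. Thresholds
    // like these are what varies from person to person, which is why Visual Gesture
    // Builder trains detectors from recordings of many people instead.
    class NaiveWaveDetector {
    public:
        bool Update(const Joint joints[JointType_Count]) {
            const CameraSpacePoint hand  = joints[JointType_HandRight].Position;
            const CameraSpacePoint elbow = joints[JointType_ElbowRight].Position;

            if (hand.Y < elbow.Y) {        // hand dropped below the elbow: reset
                swings_ = 0;
                lastSide_ = 0;
                return false;
            }
            int side = hand.X > elbow.X ? 1 : -1;
            if (lastSide_ != 0 && side != lastSide_)
                ++swings_;                 // hand crossed from one side to the other
            lastSide_ = side;
            return swings_ >= 3;           // three swings count as a wave
        }
    private:
        int lastSide_ = 0;
        int swings_ = 0;
    };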
There are other issues, among them the problem of dealing with gestures in a 3D space. It might seem logical to bring two-handed touch gestures into 3D, but they’re both uncomfortable for the user and hard for software to recognize. What’s needed is a system that’s adaptive enough to handle natural actions, which can mean that once you have a user’s attention, you need to give them feedback, like the butterflies in the Amana system. It’s an approach White characterizes as “First you establish intent, and then you pull into the ‘pit of success’”.
One example of how Kinect for Windows handles this is its hand state recognition, which uses an open hand and closed fist as a tool for gripping and grabbing objects. The SDK adds what can best be thought of as rails for pulling an object in and out of a plane once grabbed. It’s a simpler set of interactions than you might get from a touch UI, but it lets you handle the same tasks.
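The hand states themselves come straight from the body stream, so grip detection is only a few lines. A minimal sketch; what your app does with the grab is up to you:

    #include <Windows.h>
    #include <Kinect.h>

    // Report whether a tracked body is gripping (closed fist) with either hand.
    // Hand state is reported per body by the Kinect v2 skeletal pipeline.
    bool IsGripping(IBody* body) {
        BOOLEAN tracked = FALSE;
        if (body == nullptr || FAILED(body->get_IsTracked(&tracked)) || !tracked)
            return false;

        HandState left = HandState_Unknown;
        HandState right = HandState_Unknown;
        body->get_HandLeftState(&left);
        body->get_HandRightState(&right);

        // HandState_Closed grabs an object and HandState_Open releases it; a robust
        // UI would also check the per-hand tracking confidence before committing.
        return left == HandState_Closed || right == HandState_Closed;
    }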
White sees Kinect for Windows as pioneering a new set of human interface models: “For installations, it’s a pattern of attraction, then engagement and interaction – and finally a transaction.” While much of that was pioneered with the original Kinect, the new hardware and new SDK make the system more usable. “We’ve increased the resolution, and added more feedback loops”, White notes. “Take the skeletal tracking tools. We’ve improved how we model joints, which makes it better for physical therapy applications.”
Another big change with the latest release is the number of apps that can use a sensor at the same time. With Kinect v1, only one app could access the camera; that changes with v2, where any number of apps and components can work simultaneously. An interactive store display like Amana’s can use the same sensor to both analyze store traffic and run the customer interactions.
Microsoft is also taking Kinect v2 outside the traditional Windows developer ecosystem, adding support for the Unity game engine. There’s a lot of scope for using a high-level tool like Unity to build interactions; as White points out, “Many interactive experiences look like simple games.” With Unity used as a visualization engine, adding 3D interactions makes a lot of sense as a way of providing room-scale and multi-user support for a new generation of large wall screens.
Kinect isn’t just a novelty. It’s a sign of how we’re extending our development models and interfaces into the physical world. Natural user interfaces are complex things, building on cultural and individual responses. They rely on new hardware and new ways of development that mix automated code generation and familiar APIs. Microsoft CEO Satya Nadella talks about a world of “ubiquitous computing and ambient intelligence”. It’s a big vision, and one that’s driven much of the last couple of decades of computing research. Technologies like these are laying the foundations of tomorrow.