Flocking about with the Kinect

Author: Marnix Kok
Website: http://marnixkok.com

This article is about projecting objects around you in an augmented reality, using the Kinect and Microsoft's official Software Development Kit. My day job, quite randomly, gave me the opportunity to play around with the Kinect for a few days. During this time, I was able to discover a number of interesting things that I want to share with you.

If there is anything you take away from this article, let it be this: do not use the official toolkit. The open source version of the Kinect SDK is much more powerful and complete. When I tested the official toolkit, only a few days after its release, I discovered that most (if not all) of the interesting functionality beyond the basics (like skeleton detection) was not implemented yet. The sparse documentation bothered me as well, but I guess that is to be expected from a new product.

To start off, I'll tell you a little bit more about what the Kinect is, and what it is capable of. Then, I will present a cool little demo that we're going to try and implement. The rest of the article is a walkthrough of the interesting bits and pieces.

Brand spanking new

Nintendo was first to bring motion detection to gaming consoles in the winter of 2006. Their Wiimote was an instantaneous hit with the public. Accompanied by a very affordable console, it was the envy of the other big console makers: Sony and Microsoft. After Nintendo's four long years of free rein as king of this particular hill, Sony and Microsoft released their answers to Nintendo's amazing peripheral.

Sony introduced the PlayStation Move in 2010: basically nothing more than sticks with coloured light-balls on top that allow the PS3 to locate and track your movements.

Released in 2010, the Kinect is the latest and (perhaps) greatest movement detection console accessory, introduced by Microsoft. Despite Microsoft's late entrance into this market, they introduced an accessory that has some great things going for it.

The Kinect has two cameras of considerable quality. Other than the cameras, there is a mechanism inside this wonderful device that is able to project a steady grid of points into the room the Kinect is set up in.

Together, the two cameras are able to record a three dimensional image, colour components, and also a depth buffer. Combining the normal picture they receive from the cameras with an interpretation of the projected laser grid results in the depth buffer.

Internally, the Kinect packs considerable processing power. Not only is the device able to determine a depth field, it is also able to track two people (or skeletons) at the same time. By analysing the imagery that it records, it is able to extrapolate the skeletal information.

All this information can be read from the device through a set of functions that Microsoft has gracefully released into the wild through its new Kinect SDK.

An interesting idea

Because I didn't have all the time in the world (a max of three days), I had to come up with something that would show off the possibilities of the SDK in an adequate and fun way.

The core features of the device that I wanted to show off were:

- reading skeleton information;
- interpreting events in the scene;
- augmenting the camera feed with additional objects.

After some thought, I decided to create a small demo that would be able to measure the size of the person being recorded; interpret that information; and overlay a swarm of bees onto the screen. Interaction with the scene would be provided once the person in front of the Kinect raises his or her hand. This would attract the bees to swarm around the hand that was lifted.

The rest of this article describes the methods I used to bring this little project to a satisfying end.

Measuring skeleton length

The demo that we're making is going to calculate the height of the person in front of the camera and translate that height from meters into pixels. This information is then used to put a swarm of bees on the screen that is sized similarly to their real-world counterparts, simply by providing their length in meters.

Assumptions

To start out, we need to discover and analyse the environment the Kinect is pointed at. Let's scribble down some knowledge we have about our surroundings and the Kinect:

1. The camera is always the center of the scene. This assumption makes a number of our calculations easier. So, let's just say that the camera is, indeed, the center of our universe.

2. Skeleton detection information is provided as a vector. Vectors are known to have both direction and length. The cool thing about the information the Kinect returns is that it comes in real-world measurements -- the distance you can discern from a vector is measured in meters!

3. The colour buffer's dimensions deviate slightly from those of the depth buffer image. So, when we try to translate something from skeletal information, which is based on depth buffer information, to a position on the screen, it will not map 1:1. This means we need to take into account a small deviation constant.

There actually are Kinect API calls that allow you to translate from colour buffer to depth buffer coordinates and vice versa. However, these were not yet made available for public use. Later, I discovered they were quite readily accessible in the unofficial Kinect SDK.

Triangles all over the place

To determine the height of the person in front of the camera we're going to do some very basic math on the points that we can get information on. An assumption we're making is that both feet are always positioned on the floor, at equal height.

The Kinect provides us with the following skeletal rays:

- left foot position (L_ray);
- right foot position (R_ray); and
- position of the head (H_ray).

These rays are cast from the point of view of the camera. Now, you may not know this, but the Kinect also has a mechanism that allows the camera to automatically focus on elements that are in the room. Unfortunately, with the state the SDK was in, I was unable to retrieve this information from the Kinect. In the future, this information needs to be incorporated in the calculations below.

	L_pos = cast L_ray from Camera
	R_pos = cast R_ray from Camera
	H_pos = cast H_ray from Camera

Now, you might think this would be a hassle to calculate, but in reality we can leave this step behind. Remember our camera being the center of the universe? This means that a ray cast from (0, 0, 0) always ends up at the ray's own value. So in actuality:

	L_pos = L_ray
	R_pos = R_ray
	H_pos = H_ray

Then, we create an imaginary right-angled triangle, running from the left foot (L), to the point right in between the legs (M), and up to the head (H). By calculating the length of side c, which runs from M up to H, we know (exactly) how tall the person in front of the camera is.

	       H
	      /|
	  b  / |
	    /  |  c
	   /   |
	  /____|
	 L      M
	    a

	a = distance(L_pos, R_pos) * 0.5
	b = distance(L_pos, H_pos)
	c = sqrt(b^2 - a^2)
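For those who prefer real code over scribbles, here is a minimal C++ sketch of the calculation above, with c being the remaining side of the right-angled triangle. Vec3, distanceBetween and personHeight are names I made up for this illustration; they are not part of the Kinect SDK, and the joint positions are assumed to be in meters with the camera at the origin.

```cpp
#include <cmath>

// Minimal 3D point in skeleton space (meters, camera at the origin).
// Illustrative type, not part of the Kinect SDK.
struct Vec3 {
    float x, y, z;
};

// Euclidean distance between two points.
float distanceBetween(const Vec3 &p, const Vec3 &q) {
    const float dx = p.x - q.x;
    const float dy = p.y - q.y;
    const float dz = p.z - q.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Height of the person: side c of the right-angled triangle L-M-H.
float personHeight(const Vec3 &leftFoot, const Vec3 &rightFoot, const Vec3 &head) {
    const float a = distanceBetween(leftFoot, rightFoot) * 0.5f; // L to M
    const float b = distanceBetween(leftFoot, head);             // L to H
    const float cSquared = b * b - a * a;
    // Noisy joint data could push this slightly below zero; clamp in that case.
    return cSquared > 0.0f ? std::sqrt(cSquared) : 0.0f;
}
```

With the feet 40 cm apart on the floor and the head 1.8 m straight above the midpoint between them, personHeight comes out at roughly 1.8.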

Now we know the person's height in meters. We then translate this to a unit of measure that is understandable to the computer: pixels! The joint information we read from the Kinect does not solely contain the ray that was cast to detect the joint, but also its position on the colour buffer (x, y). Now that we know the height in meters, it's as simple as subtracting the head's y-position from the left foot's (on screen, y grows downwards) and dividing that by our height in meters. The result is the number of pixels that make up one meter. Neat, right!?

	Pixels_Per_Meter = (L_y - H_y) / c

After all this, we know quite a few things about the scene the bee swarm is going to be flying in. Displaying things on, or around the scene will be much easier!

During the development of my little demo, it was surprising how accurate the Kinect actually was. Taking into account a scaling factor because of the camera's lens distortion, the measurement for every person that I had stand in front of it was really close to his/her actual height. So if this demo doesn't work out, I can always sell the code as the world's most uncomfortable and most expensive measuring tape!

Flocking behaviours

Now that we are able to find the skeleton in space, and know exactly how tall the person in front of the camera is -- and therefore know at what size and position to project our cute little swarm -- it is time to delve deeper into the behaviour of our swarm. We will be implementing a simple flocking mechanism that is commonplace and documented very well around the web. I won't be explaining it from top to bottom, but the general concept follows. If you want to know how to implement it, check out the code that goes with this article.

Three rules of flight

When one talks about swarm behaviours, I always imagine a flock of birds, flying through the sky in intricate patterns that seem almost too beautiful to be real. Thankfully, smart people have disproven any such magic and narrowed it down to three rules that birds use to fly the way they do. Turns out, the same rules apply to swarms of other types, such as bees and fish.

The three rules are as follows:

1. cohesion;
2. separation; and
3. alignment.

Cohesion

It's important for our entire swarm to stay together and have the same general purpose. We can't have one bee flying in one direction while another is flying in almost the opposite direction. This is where cohesion comes into play. It's similar to separation but on the scale of the entire swarm.

Separation

To make sure that the bees are able to stay in flight without any unfortunate collisions into the Queen Bee's main-quarters, it's important they steer clear of each other. This is called 'separation', and it is based on each bee's closest neighbours.

In the code below I've made it possible for one bee to have a little bit more attractive power than other bees. This made it possible for me to create a "leader" which has a lot more attraction. This causes the rest of the swarm to follow him!

Pseudo-code:

	Total_Attraction = The sum of all neighbours' attraction scalars

	Average = [0, 0, 0]
	foreach Neighbour as Bee:
		Average += Bee.position * (Bee.attraction / Total_Attraction)

	NewDirection = Average - CurrentPosition
	Normalize NewDirection

	CurrentDirection = CurrentDirection + Weight * NewDirection
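The attraction-weighted steering above could look roughly like this in C++. Treat it as a minimal sketch rather than the demo's actual code: Vec3, Bee and steerTowardsNeighbours are names invented for this illustration, and I'm assuming the attraction weights are normalised over the neighbours only, so the weighted average needs no further division.

```cpp
#include <cmath>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Illustrative bee; attraction is the scalar that makes a "leader" stand out.
struct Bee {
    Vec3 position;
    Vec3 direction;
    float attraction;
};

// Returns the bee's new direction after being pulled towards the
// attraction-weighted centre of its neighbours.
Vec3 steerTowardsNeighbours(const Bee &self, const std::vector<Bee> &neighbours,
                            float weight) {
    float totalAttraction = 0.0f;
    for (const Bee &bee : neighbours) {
        totalAttraction += bee.attraction;
    }
    if (totalAttraction <= 0.0f) {
        return self.direction; // nobody to follow
    }

    // Weighted average position; the weights sum to one.
    Vec3 centre{0.0f, 0.0f, 0.0f};
    for (const Bee &bee : neighbours) {
        const float share = bee.attraction / totalAttraction;
        centre.x += bee.position.x * share;
        centre.y += bee.position.y * share;
        centre.z += bee.position.z * share;
    }

    // Unit vector from this bee towards the weighted centre.
    Vec3 pull{centre.x - self.position.x, centre.y - self.position.y,
              centre.z - self.position.z};
    const float length = std::sqrt(pull.x * pull.x + pull.y * pull.y + pull.z * pull.z);
    if (length <= 0.0f) {
        return self.direction; // already at the centre
    }

    return Vec3{self.direction.x + weight * pull.x / length,
                self.direction.y + weight * pull.y / length,
                self.direction.z + weight * pull.z / length};
}
```

Giving one bee a much larger attraction than the rest drags the weighted centre towards it, which is what turns that bee into the leader.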

Alignment

A rule that stops bees from killing their neighbours in mid-flight by trying to occupy the same physical space is very necessary. But it doesn't mean that our bees don't like to be cozy. They're a swarm after all!

To keep everyone in the same area, there is the rule of 'alignment'. What it essentially entails is making sure a bee flies in the same general direction as its neighbouring bees.

Pseudo-code:

	Average = [0, 0, 0]

	foreach Neighbour as Bee:
		Average += Bee.direction
		
	Average *= (1 / Number_of_Neighbours)
	Normalize(Average)

	CurrentDirection = CurrentDirection + Weight * Average
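Translated to C++, the alignment rule above might look like the sketch below. Vec3 and align are illustrative names, and returning the direction unchanged when there are no neighbours (or when their directions cancel out) is my own assumption for the edge cases.

```cpp
#include <cmath>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Nudges a bee's direction towards the average direction of its neighbours.
Vec3 align(const Vec3 &direction, const std::vector<Vec3> &neighbourDirections,
           float weight) {
    if (neighbourDirections.empty()) {
        return direction;
    }

    Vec3 average{0.0f, 0.0f, 0.0f};
    for (const Vec3 &d : neighbourDirections) {
        average.x += d.x;
        average.y += d.y;
        average.z += d.z;
    }
    const float count = static_cast<float>(neighbourDirections.size());
    average.x /= count;
    average.y /= count;
    average.z /= count;

    // Normalise so that only the heading matters, not how fast neighbours fly.
    const float length = std::sqrt(average.x * average.x + average.y * average.y +
                                   average.z * average.z);
    if (length <= 0.0f) {
        return direction; // neighbours cancel each other out
    }

    return Vec3{direction.x + weight * average.x / length,
                direction.y + weight * average.y / length,
                direction.z + weight * average.z / length};
}
```

A small weight keeps each bee's own heading dominant, so the swarm drifts into agreement gradually instead of snapping into formation.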

While coding my demo, I noticed that by just implementing the rules for separation and alignment, the swarm of bees had a near life-like behavioural pattern. So I opted to not implement the cohesion rule -- it's all about the illusion after all!

Microsoft sees right through you!

A coding article wouldn't be much without some code that is more than just pseudo-code. Below I'll show you how to read the skeleton information from the Kinect using Microsoft's SDK.

The Kinect is able to, quite successfully, read all kinds of body-parts, from feet and head, to pelvis and wrists. This broad spectrum of points gives you an amazing range of possibilities when it comes to creating interesting functionalities. Think of something as simple as creating player profiles based on the player's biometric properties. These properties could, for example, be the distances between all the different skeletal points in a resting position.

It is an awesome device, but sometimes it just can't quite track the position of your limbs from frame to frame. To remedy the lack of up-to-date information, it has built-in interpolation. It remembers the behaviour of the limb (i.e. its direction and velocity) and, when the information is not available, it tries to predict its position. Eventually it will regain the ability to track the limb and make a nice smooth transition into the new position. When you draw the complete skeleton on screen this can give a weird effect, but when used in games, this is exactly what you want and expect to happen.

The code below assumes that you have properly initialized the Kinect SDK and that you are ready to get some information from the Kinect. To get skeleton information, pass the NUI_INITIALIZE_FLAG_USES_SKELETON constant in your call to NuiInitialize and then call NuiSkeletonTrackingEnable(NULL, 0).

When you arrive in your game loop's frame handler, run the code below to get information about the skeleton. To determine the position of the hand for the bees to circle, the code gets a skeleton frame, smooths it to get the interpolation going, and then determines whether the hand is tracked. If it is, we transform the coordinates from ray-coordinates to on-screen coordinates.

bool Skeleton::nextFrame() {
	NUI_SKELETON_FRAME frame;
	HRESULT result = NuiSkeletonGetNextFrame(100, &frame);

	if (result == S_OK) {
		NuiTransformSmooth(&frame, 0);
		NUI_SKELETON_DATA *skeleton = this->findFirstActiveSkeleton(frame);

		// no active skeletons found? abort.
		if (skeleton == 0) {
			return false;
		}

		// right hand tracked?
		if (skeleton->eSkeletonPositionTrackingState[NUI_SKELETON_POSITION_HAND_RIGHT]
			 == NUI_SKELETON_POSITION_TRACKED) {

			// the joint positions live in SkeletonPositions;
			// eSkeletonPositionTrackingState only holds the tracking states
			const Vector4 &joint =
				skeleton->SkeletonPositions[NUI_SKELETON_POSITION_HAND_RIGHT];

			// transform from skeleton ray to depth image ray
			NuiTransformSkeletonToDepthImageF(
					joint,
					&d_rightHand.x, // between [0, 1]
					&d_rightHand.y  // between [0, 1]
				);

			d_rightHand.proper = true;

			// scale to screen dimensions
			d_rightHand.x *= SCREEN_WIDTH;
			d_rightHand.y *= SCREEN_HEIGHT;

			// add something to get in the middle of the pixel
			d_rightHand.x += 0.5f;
			d_rightHand.y += 0.5f;

			// store the original ray
			d_rightHand.ray.x = joint.x;
			d_rightHand.ray.y = joint.y;
			d_rightHand.ray.z = joint.z;
		}
		else {
			// hand not tracked in this frame
			d_rightHand.proper = false;
			cout << "Not all joints found" << endl;
			return false;
		}
		return true;
	}	
	return false;
}

Now that we have the hand's position, we need it to influence the position of the bees. The leader-bee (the one with the highest attraction) will be directed to the position of the hand. This in turn makes all the other bees follow him, allowing the swarm to circle the hand awesomely.
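Directing the leader could be as simple as the sketch below. It assumes the leader and the hand live in the same coordinate space and that the leader moves a fixed distance per frame; Vec3, Bee and steerLeaderTowards are names made up for this example, not the demo's real code.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
};

struct Bee {
    Vec3 position;
    Vec3 direction;
};

// Points the leader at the target and moves it `step` units closer,
// without overshooting.
void steerLeaderTowards(Bee &leader, const Vec3 &target, float step) {
    const Vec3 offset{target.x - leader.position.x,
                      target.y - leader.position.y,
                      target.z - leader.position.z};
    const float length = std::sqrt(offset.x * offset.x + offset.y * offset.y +
                                   offset.z * offset.z);

    if (length <= step) {
        // Close enough: settle on the target and keep the current heading.
        leader.position = target;
        return;
    }

    leader.direction = Vec3{offset.x / length, offset.y / length, offset.z / length};
    leader.position.x += leader.direction.x * step;
    leader.position.y += leader.direction.y * step;
    leader.position.z += leader.direction.z * step;
}
```

Calling this once per frame with the hand's position as the target is enough for the leader; the flocking rules take care of dragging the rest of the swarm along.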

The actual demo's code is a bit more involved because it tracks other limbs to perform calculations on them (discussed above).

Conclusion

You still with me? Cool.

We have gone through quite a lot of stuff: Kinect basics, creating the world's most expensive measuring tape, a simple algorithm for flocking behaviours and reading the Kinect's skeleton information. All that's left is to bring it all together. To be honest, that's the boring stuff (you know it's true), so I'll leave that as an exercise for you.

My implementation can be seen in action on YouTube at the link below. Thanks go to my dear colleagues Bart for recording it and Kevin for providing an unintended Kinect stress test.

http://www.youtube.com/watch?v=q9u0042GtvM

One of the things I wish the official SDK had offered, but didn't at the time, was a separate stencil buffer that works like a cardboard cut-out of the skeleton that is being tracked. This information would have allowed me to hide a bee whenever it disappeared behind me in space. Stuff like that really pushes the illusion of augmented reality into a whole different dimension.

Since completing this little experiment, I've seen a great number of awesome projects involving the Kinect, confirming my belief that this seemingly unimpressive and maybe even awkward gaming accessory actually has great potential!

If you have questions or feedback about the things you read in this article, don't hesitate to e-mail me: marnixkok@gmail.com. If I can, I will help you with anything you need.

Marnix Kok