Friday, 18 December 2009

Further improving gesture recognition

So far the minimal bounding box has been used to check that the whole hand is inside the viewport, but this proved unreliable when the hand orientation was horizontal/vertical, as the box would be horizontal/vertical as well. The bounding circle was used instead: only if it was completely inside the viewport would the gesture be considered recognised.
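A minimal sketch of this check (the circle itself would come from something like cvMinEnclosingCircle on the hand contour; the names below are illustrative, not the project's own):

struct Circle { float x, y, r; };

bool circleInsideViewport(const Circle& c, int frameWidth, int frameHeight)
{
    // The gesture is only considered when the whole circle lies inside the frame.
    return c.x - c.r >= 0.0f &&
           c.y - c.r >= 0.0f &&
           c.x + c.r <= (float)frameWidth &&
           c.y + c.r <= (float)frameHeight;
}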

Aside from this, pointer movement could be enabled/disabled with gestures. This would make the pointer easier to control, since the camera will quite probably not cover the whole screen, so the user might otherwise find himself moving the hand in and out of view in the desired direction in order to get the pointer to the desired location.


Wednesday, 16 December 2009

Improving gesture recognition: discarding invalid gestures

Several problems arose with our method of gesture recognition based on convexity defects with respect to the convex hull. We were able to discard some invalid convexity points based on the distance from the deepest point in the defect to the convex hull.

However, this approach still had a problem, which is shown in the next picture.

Poor segmentation can yield false positives, as is shown in the picture.

We did not find an easy way to tackle it, so in the end it was decided that this gesture would not be recognised. Instead, we decided that the only gesture with one valid convexity defect that would be accepted would be the thumb-up gesture, recognised in a rotation-invariant way.


As can be observed, this gesture has a characteristic width-to-height ratio, so candidates that do not meet a certain ratio threshold can be discarded. In particular, 1.6 was selected, the ratio being width/height or its inverse, whichever is greater than 1.
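A sketch of this filter; the ratio is computed so that it is always >= 1, and whether a candidate must stay below or above the 1.6 threshold depends on the gesture, so the comparison direction here is an assumption for illustration:

bool ratioAcceptable(float width, float height, float threshold = 1.6f)
{
    float ratio = (width > height) ? (width / height) : (height / width);
    return ratio <= threshold;   // discard candidates whose ratio exceeds the threshold
}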

Another criterion we used to validate gestures was to establish a minimum and maximum number of sides that the gesture's polygonal approximation can have. With the polygonal approximation we have chosen, the start and finish points of the convexity defect, the fingertips, can be either spikes or flat.

We chose the minimum number of sides to be 6, which we found corresponds to a closed fist. Each valid convexity defect then contributes 4 sides; invalid defects turn what would have been one side into two or more. Removing sides that have already been counted, we get the following formula:

min_sides = 6
max_sides = min_sides + (4 x valid_defects) - (valid_defects - 1) - invalid_defects
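A sketch of the check, plugging the formula above in directly; comparing the actual polygon side count against these bounds is our reading of how the check is applied:

bool sideCountAcceptable(int numSides, int validDefects, int invalidDefects)
{
    const int minSides = 6;   // observed for a closed fist
    int maxSides = minSides + 4 * validDefects - (validDefects - 1) - invalidDefects;
    return numSides >= minSides && numSides <= maxSides;
}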

Though not perfect, these two simple checks let us discard many invalid gestures caused by poor segmentation, and thus achieve a more robust recognition.

Possible improvements out of our scope

1. Other skin models so as to include other races
2. Motion recognition
3. Expand gesture set making use of the fingertips and the angles between them.
4. Recognize gestures in movement - useful for gaming, for example.
5. Improve recognition in heavy clutter
6. Improve performance
7. Multiple hands tracking for enhanced functionality.
8. Improve hand presence detection
9. Find a way to segment the hand and discard the forearm when no sleeves are worn
10. Improve dynamical model
11. Find a more cost-effective segmentation method and enhance robustness to changes in light.
12. Enhance mouse motion with acceleration, for example.

TODO list

As of today, the bulk of the project can be considered finished; what remains is polishing what we already have in order to make it more robust.

The final aim is to have at least two gestures that can be reliably recognised. For that, the following issues need to be addressed:

1. False holes in the perimeter causing false convexity defects to be recognised.
2. Detection of the correct convexity defects should be made more robust as well.
3. Think of a way to deal with borders.
4. Adaptive skin modelling.

Friday, 11 December 2009

Implementing the first gestures: left click

So far the segmentation process is not robust enough and lots of incorrect gestures are detected.

As a first test, we decided to implement the left button click, which would be triggered when no convexity defects were detected (closed fist). No convexity defects would be interpreted as 'left button down', anything else as 'left button up'. TODO: check whether the button is already up/down and call the functions only when necessary (they are currently called whenever the aforementioned conditions are met).

Firstly, it was necessary to check if the hand was completely within the viewport. In that case, convexity defects were detected and the left button functions triggered.

Gesture recognition was only allowed when the pointer was moving less than 4 pixels in either direction. This was necessary since the tracker is not completely precise.
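A sketch of this logic under the assumptions above: the send* callbacks are placeholders for whatever the platform provides to generate mouse events, and the cached button state implements the TODO of only sending events on a change.

#include <cstdlib>

static bool leftButtonDown = false;

void updateLeftButton(bool handInViewport, int numDefects, int dx, int dy,
                      void (*sendLeftButtonDown)(), void (*sendLeftButtonUp)())
{
    bool pointerSteady = std::abs(dx) < 4 && std::abs(dy) < 4;
    if (!handInViewport || !pointerSteady)
        return;                            // recognition disabled this frame

    bool wantDown = (numDefects == 0);     // closed fist -> hold the button
    if (wantDown && !leftButtonDown) {
        sendLeftButtonDown();
        leftButtonDown = true;
    } else if (!wantDown && leftButtonDown) {
        sendLeftButtonUp();
        leftButtonDown = false;
    }
}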

Wednesday, 9 December 2009

Gesture recognition

In "Learning OpenCV" a method using histograms is suggested for basic gesture recognition. They suggested computing the histogram of the picture to detect the hand region, calculate the image gradient and then compute the histograms for the gesture.

This method, however, is not rotation invariant, and we were interested in ours being so. A method similar in spirit, in the sense of counting skin pixels, was used instead. It consisted of calculating the difference between the area of the convex hull of the hand and that of its polygonal approximation; the resulting regions are the convexity defects.

Steps:
1. Contour detection of the hand.
2. Douglas-Peucker algorithm for polygonal approximation.
3. Convex hull.
4. Calculate convexity defects and consider their deepest points.

The problem with the convexity defects' deepest points, however, is that there are often points which are not of interest to us.

At first, the minimum bounding box was considered, to see if it could be used to discard the unwanted points. The idea was to fix a point of the box, join it with the estimated hand location (approximately at the centre of the hand) and compute the angle to each candidate point. The problem was that detecting the orientation of the hand, and thus fixing the reference point on the bounding box, was not easy.

Another approach was necessary, so the distance of the points to the convex hull was considered. Observing the results we had so far, we noticed that the points at the valleys between fingers lay further from the hull than the unwanted points. The maximum distance to the convex hull was calculated, and points lying at less than 0.6 of this maximum distance were considered discardable, which generally worked fairly well. Important valleys were occasionally discarded too, but those cases seemed to be caused by poor segmentation.
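A sketch of this step with the OpenCV 1.x C API, assuming a binary skin mask is already available. It takes the first external contour (a real implementation would pick the largest), approximates it with Douglas-Peucker (the accuracy value 7.0 is illustrative), and keeps only the defects whose depth is at least 0.6 of the maximum depth:

#include <opencv/cv.h>
#include <vector>

std::vector<CvConvexityDefect> filterDefects(IplImage* skinMask, CvMemStorage* storage)
{
    std::vector<CvConvexityDefect> kept;

    CvSeq* contour = NULL;
    cvFindContours(skinMask, storage, &contour, sizeof(CvContour),
                   CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE);
    if (!contour)
        return kept;

    // Polygonal approximation, then the hull as indices (required by
    // cvConvexityDefects), then the defects themselves.
    CvSeq* poly    = cvApproxPoly(contour, sizeof(CvContour), storage,
                                  CV_POLY_APPROX_DP, 7.0);
    CvSeq* hull    = cvConvexHull2(poly, storage, CV_CLOCKWISE, 0);
    CvSeq* defects = cvConvexityDefects(poly, hull, storage);

    float maxDepth = 0.0f;
    for (int i = 0; i < defects->total; ++i) {
        CvConvexityDefect* d = (CvConvexityDefect*)cvGetSeqElem(defects, i);
        if (d->depth > maxDepth) maxDepth = d->depth;
    }
    for (int i = 0; i < defects->total; ++i) {
        CvConvexityDefect* d = (CvConvexityDefect*)cvGetSeqElem(defects, i);
        if (d->depth >= 0.6f * maxDepth)   // finger valleys lie deepest
            kept.push_back(*d);
    }
    return kept;
}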

Friday, 4 December 2009

Hand detection

Initially, when there is no hand, the tracker will be in a steady state with all particles spread randomly across the image, giving an estimated object location roughly at the centre of the image. However, we cannot use this estimate because there is no actual hand. The problem we face is therefore detecting when the hand comes into the scene.

Using the median of densities

Let the probability densities be measured as the Mahalanobis distance from the colour of a pixel to the mean skin colour. At first, the median of the probability densities over an 8x8 window centred at the estimated location was used to determine whether a hand was in place. The reasoning behind this approach was that if the median was evaluated as 'skin' then most pixels within the window would be 'skin', and hence a hand was in place.

This method worked intermittently, since the tracker could not follow the hand precisely enough when it moved at varying speeds and directions. When the estimate lay close to an edge of the hand, the movement would cause the median of the window to change drastically.

Using the standard deviation

Another method tried was using the standard deviation of the particle positions: if it fell below a certain threshold, a hand was assumed to be detected.

This method proved to be very robust, even though it still had a weakness: if the noise in the image was not properly removed, the tracker could end up following the wrong object and thus mistakenly detect a hand.
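A sketch of this test: when the standard deviation of the particle positions falls below a threshold, the particles have clustered on something skin-coloured and a hand is assumed to be present. The threshold value and the Particle type are illustrative.

#include <cmath>
#include <vector>

struct Particle { float x, y; };

bool handPresent(const std::vector<Particle>& particles, float maxStdDev)
{
    if (particles.empty())
        return false;

    float mx = 0.0f, my = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        mx += particles[i].x;
        my += particles[i].y;
    }
    mx /= particles.size();
    my /= particles.size();

    float var = 0.0f;                      // spread around the mean position
    for (std::size_t i = 0; i < particles.size(); ++i) {
        float dx = particles[i].x - mx;
        float dy = particles[i].y - my;
        var += dx * dx + dy * dy;
    }
    var /= particles.size();

    return std::sqrt(var) < maxStdDev;     // tight cluster -> hand detected
}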

Motion detection

To start the tracker, basic motion detection of the hand could be used. However, one could argue that some automation and convenience would be lost.

Monday, 23 November 2009

Mouse interface

The second part of the project consists of creating a mouse interface. The first objective is to be able to move the mouse pointer with the movement of our tracked hand. Afterwards, the system has to be able to recognise basic hand gestures corresponding to mouse events or other functionality.

To achieve the first objective, the tracker has to be able to recognise when the hand is in sight. To do so, we find the median m of the Mahalanobis distances of the pixels in the window centred at the estimated location of the hand and accept it as a hand when m < 3.5.
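A sketch of this presence test, using the 8x8 window mentioned in an earlier entry; mahalanobisToSkin stands in for whatever distance function the skin model provides and is a placeholder, not the project's actual function.

#include <algorithm>
#include <vector>

bool handAtEstimate(int cx, int cy, int imgWidth, int imgHeight,
                    float (*mahalanobisToSkin)(int x, int y))
{
    std::vector<float> dists;
    for (int y = cy - 4; y < cy + 4; ++y)
        for (int x = cx - 4; x < cx + 4; ++x)
            if (x >= 0 && y >= 0 && x < imgWidth && y < imgHeight)
                dists.push_back(mahalanobisToSkin(x, y));

    if (dists.empty())
        return false;

    // Median via nth_element; a hand is accepted when the median is 'skin'.
    std::nth_element(dists.begin(), dists.begin() + dists.size() / 2, dists.end());
    return dists[dists.size() / 2] < 3.5f;
}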

Once the hand is detected, making the mouse pointer follow the movement of the hand is quite straightforward. The algorithm is described below:

Variables: hand_out_of_sight = true.

1. Detect hand.
2. If hand_detected and hand_out_of_sight = true then:
   mouse_position = new_hand_position - current_position (clamped so the values do not go out of screen bounds);
   set current_position to the new position of the hand.
3. Set hand_out_of_sight according to the result of step 1.

The movement of the pointer is therefore relative to the movement of the hand with respect to the position where it was first detected. This way we avoid using absolute positioning and hence weird pointer jumps.
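A sketch of one reading of this relative-positioning scheme: when the hand reappears only the anchor is reset, and afterwards the pointer moves by the hand's displacement, clamped to the screen. setPointer is a placeholder for the platform call that actually positions the cursor.

#include <algorithm>

struct Point { int x, y; };

static bool  handOutOfSight = true;
static Point lastHandPos;
static Point pointerPos;

void updatePointer(bool handDetected, Point handPos, int screenW, int screenH,
                   void (*setPointer)(int x, int y))
{
    if (handDetected) {
        if (!handOutOfSight) {
            // Displace the pointer by however much the hand moved since the
            // previous frame, keeping it inside the screen bounds.
            pointerPos.x = std::min(std::max(pointerPos.x + handPos.x - lastHandPos.x, 0), screenW - 1);
            pointerPos.y = std::min(std::max(pointerPos.y + handPos.y - lastHandPos.y, 0), screenH - 1);
            setPointer(pointerPos.x, pointerPos.y);
        }
        lastHandPos = handPos;   // on first detection this only sets the anchor
    }
    handOutOfSight = !handDetected;
}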

Tuesday, 3 November 2009

Hand segmentation

In order to do the tracking, the hand is first segmented so that the measurement stage can be done against a binary image. Speed was therefore a big concern, since the tracker had to be able to run in real time.

Several algorithms were considered, especially those found to have higher rates of true positives, as reported by Vezhnevets et al. [1].

...

[1] Vezhnevets, V., Sazonov, V., Andreeva, A., 2003. A survey on pixel-based skin color detection techniques, GRAPHICON03, pp. 85-92.

Tuesday, 20 October 2009

Tracking using the Condensation algorithm in OpenCV

Intel's OpenCV is a great computer vision library with high quality implementations of the most common algorithms in the field. In our project we will be using their CONDENSATION[1] algorithm implementation.

CONDENSATION stands for "Conditional Density Propagation" and is based on particle filter techniques for tracking objects. The density here is a probability distribution over the location of the object. Previously the famous Kalman filter had been used for tracking, but it proves inadequate in many cases, for example when there are several possible objects to track simultaneously, since it is based on Gaussian densities. Even with a single object to track, a cluttered background can provide false alternative hypotheses. Another drawback of the Kalman filter is that it estimates the state of a linear dynamic system, which cannot generally be assumed. The motion of a bouncing ball, for example, is not linear at the moment it hits the floor.

Another advantage of the CONDENSATION algorithm is that it is much simpler than the Kalman filter. It is based on factored sampling, which is a method that aims at transforming uniform densities into weighted densities, but applied iteratively. Factored sampling consists of the following steps:

1. Generate a sample set {s_1, ..., s_N} from a prior density p(x).
2. Choose a sample with index n in the range {1, ..., N} with probability pi_n, where pi_n is a weight calculated from the observation and normalised by the total weight sum.
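A minimal sketch of the selection step (step 2 above): an index is drawn with probability proportional to its weight. This is plain C++ for illustration, not OpenCV's internal implementation.

#include <cstdlib>
#include <vector>

int pickWeighted(const std::vector<float>& weights)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i)
        total += weights[i];

    float r = total * (std::rand() / (float)RAND_MAX);   // uniform in [0, total]
    float accumulated = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        accumulated += weights[i];
        if (accumulated >= r)
            return (int)i;                 // heavier samples win more often
    }
    return (int)weights.size() - 1;
}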

The CONDENSATION algorithm applies factored sampling iteratively to the successive frames of the sequence in which we want to track our object. Each iteration is an execution of the factored sampling algorithm, using as prior density the weighted sample set of the previous iteration. Therefore, to start, the algorithm must be given a prior density. The prior density at time t is denoted p(x_t | z_t-1).

Given the old sample set {s(n)_t-1, pi(n)_t-1, c(n)_t-1, n = 1, ..., N} at time t-1, a new sample set of N elements is constructed for time t. Each of the new samples is generated as follows:

1. We select a sample by:
(a) ...
(b) ...
(c) ...
The cumulative weights are used for efficiency. Since we are selecting a number from a uniform distribution, all values are equally likely, which means we can also take values at equal spacing over the range [total_confidence/num_samples, total_confidence]. Samples with greater weight span a greater portion of the cumulative distribution, so elements with higher probability are likely to be chosen several times, while others with lower probability might not be chosen at all.

2. Prediction stage. This consists of two steps: drift and diffusion. The first step, the drift, is a deterministic process which aims at predicting where the object will be in the next iteration, considering its motion dynamics. The second step, the diffusion, is a stochastic process which adds randomness to the density so as to mimic prediction and measurement noise, and effectively separates the elements that were chosen several times.

s(n)_t = A s(n)_t-1 + B w(n)_t, where A is the object dynamics matrix, B the process noise standard deviation and w(n)_t a random value drawn from a Gaussian distribution.

At the end of this stage the new sample has been generated by prediction and its weight in the new density has to be measured.

3. Measurement stage. In this stage the observation comes into play. The location of the sample is considered and its weight is calculated according to the observation using a defined weight function.

When defining this weight function, it should be noted that a weight of 0 must never be given; otherwise we lose randomness when generating samples, and samples whose cumulative probability is 0 could disappear altogether.
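A sketch of how OpenCV's legacy C implementation is typically driven, under the assumption of a 2-D state (x, y). The caller fills flConfidence with the measurement weights (never exactly 0, per the note above); cvConDensUpdateByTime then performs the selection, drift and diffusion internally. weightAt is a placeholder for the project's weight function over the skin-probability image, and the header location of the ConDensation API varies between OpenCV versions.

#include <opencv/cv.h>
#include <opencv/cvaux.h>   // legacy ConDensation API

void trackOneFrame(CvConDensation* condens, float (*weightAt)(float x, float y))
{
    // Measurement stage: weight every sample according to the observation.
    for (int i = 0; i < condens->SamplesNum; ++i) {
        float x = condens->flSamples[i][0];
        float y = condens->flSamples[i][1];
        condens->flConfidence[i] = weightAt(x, y) + 1e-6f;   // small floor, never 0
    }

    // Selection + prediction (drift with DynamMatr, diffusion with process noise).
    cvConDensUpdateByTime(condens);

    // condens->State now holds the new estimate of the hand location.
}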

[1] Isard, M. and Blake, A. 1998. CONDENSATION -- conditional density propagation for visual tracking, Int. J. Computer Vision, 29, 1, 5--28.