Monday, March 10, 2008

Levenshtein Distances working nicely

These are the 1-NN edit distances to the training set for each of my 9 test images. The first 4 motorbike images show a significantly lower edit distance than the last 5. The edit distances are computed by treating each image's sequence of extracted appearance patches as a word in which each patch is a single character. Here, the matching cost between two image patches was a straight SSD between the patches. The cost of inserting a gap was the matching cost (SSD) of the patch against a canonical 11x11 patch of uniform intensity 128.
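
For reference, here's a minimal sketch of that dynamic program (function and variable names are mine, not from any library). It's a standard Levenshtein-style recurrence where substituting one patch for another costs their SSD, and a gap costs the SSD against the uniform grey patch:

function d = patchEditDistance(A, B)
% A, B: cell arrays of 11x11 patches, each sorted by X-coordinate.
gapPatch = 128 * ones(11, 11);            % canonical uniform-intensity patch
m = numel(A); n = numel(B);
D = zeros(m + 1, n + 1);                  % D(i+1,j+1) = cost of matching A(1:i) to B(1:j)
for i = 1:m, D(i+1, 1) = D(i, 1) + ssd(A{i}, gapPatch); end
for j = 1:n, D(1, j+1) = D(1, j) + ssd(B{j}, gapPatch); end
for i = 1:m
    for j = 1:n
        D(i+1, j+1) = min([D(i, j)   + ssd(A{i}, B{j}), ...      % substitution
                           D(i, j+1) + ssd(A{i}, gapPatch), ...  % gap in B
                           D(i+1, j) + ssd(B{j}, gapPatch)]);    % gap in A
    end
end
d = D(m + 1, n + 1);

function c = ssd(p, q)
e = double(p) - double(q);
c = sum(e(:) .^ 2);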

Monday, March 3, 2008

Three are better than two

When I use three Gaussian components per part instead of two, things look a little better.

Throwing in more Gaussians

The variability in the appearance of a single part across different training images suggests that a single Gaussian may not be sufficient to capture the underlying data. I decided to try a mixture of Gaussians for each part (with diagonal covariances). The Netlab software for Matlab turned out to be very useful here, as it has built-in routines for learning and using Gaussian mixture models (e.g. the gmm, gmminit, gmmem and gmmprob scripts were a big help).

Here are the resulting log probabilities when using 2 mixture components for each part's appearance. In this case, the default EM initialization is used (uniform priors, random means and identity covariances).
Next, EM was initialized using the gmminit script, which initializes the centres and priors by running k-means on the data. The covariance matrices are calculated as the sample covariances of the points closest to the corresponding centres.
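
For the record, the Netlab calls look roughly like this (a sketch with assumed variable names; Xp holds the N x d reduced appearance vectors for one part). Skipping the gmminit call gives the default initialization used for the first figure:

ncentres = 2;                              % mixture components for this part
mix = gmm(size(Xp, 2), ncentres, 'diag');  % GMM with diagonal covariances
options = zeros(1, 18);
options(14) = 10;                          % k-means iterations inside gmminit
mix = gmminit(mix, Xp, options);           % k-means initialization of centres/priors
options(14) = 100;                         % max EM iterations
mix = gmmem(mix, Xp, options);             % fit the mixture by EM
logp = log(gmmprob(mix, Xtest));           % log probability of a test appearance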

Sunday, March 2, 2008

Reducing dimensionality with random projections instead of PCA

Instead of reducing the dimensionality of the appearance patches using PCA, I tried a random projection matrix (similar to the one defined in question 1 here). The matrix was generated once during training and the same one was used again during testing. This approach does not seem to work any better than the PCA approach.

Here are the total log probabilities of the same test images that were used previously. Image 1 has taken an undesirable dip, and image 4 hasn't been pulled up enough above the negative test images.

The appearance probabilities of the negative test images have gone up relative to the bike images.
Of course, the location probabilities are exactly the same as before, because these are unaffected by the method of dimensionality reduction on the appearance patches.
Here are the reconstructed patches obtained by back-projecting to 121 dimensions. For this, the reduced-dimensionality patches were multiplied by the pseudo-inverse of the random projection matrix that was used.
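
A rough sketch of what this looks like (the Gaussian entries are my assumption; whatever its form, the matrix is drawn once at training time and reused at test time):

d = 121; k = 15;                         % original and reduced dimensionality
R = randn(k, d) / sqrt(k);               % random projection matrix, kept fixed
a = R * p(:);                            % project a flattened 11x11 patch p
p_rec = reshape(pinv(R) * a, 11, 11);    % back-project for the figures above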

Wednesday, February 27, 2008

Fixing a bug

After closer examination of the extracted patches, I discovered that the sorting of the patches was actually happening by Y-coordinate instead of X. I fixed that bug, and the tests then showed better results.
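
The fix itself is just a sort on the right coordinate (variable names are mine):

[xs, order] = sort(X);         % X holds the patches' x-coordinates; this was sorting on Y before
X = X(order); Y = Y(order);
A = A(order, :);               % keep the appearance vectors in the same order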

This figure shows the patches extracted from the 47 training images. Each row shows the 10 patches extracted from a single motorbike training image, now sorted by X-coordinate.

The image below shows the 9 images used for testing (in row-major order). Here are the resulting log probabilities (for location, appearance and sum) for each of the test images.

The location probability of the fifth image, which is a car, is quite high. This can be seen easily from the locations of the extracted patches here (which also show why the location probability of the ninth image is so low, as it should be). However, its appearance probability is low. In general, the locations of the patches are doing a better job of differentiating the classes. The appearance probability of the fourth bike is very low; it is shown below along with its extracted patches.

Wednesday, February 20, 2008

Reconstructed patches

For debugging purposes, I reconstructed the patches by projecting them back to 121 dimensions and displaying them as images. The first four rows show motorbike patches and the remaining rows show patches from cars and faces. Can't see much difference.

Edit: These patches were sorted (erroneously) by Y-coordinate. The correct patches, sorted by X-coordinate, are shown below in the second figure.
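
A minimal sketch of that back-projection (names are mine; B is the 121 x k PCA basis, mu the 1 x 121 mean patch, and a a 1 x k reduced appearance vector):

p_rec = reshape(a * B' + mu, 11, 11);      % back to an 11x11 patch
imagesc(p_rec); colormap gray; axis image; % display for inspection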

Wednesday, February 13, 2008

Sorting by X-coordinate

Running the same experiment after sorting the features by X-coordinate (instead of saliency), I get these probabilities:

CombinedLogProb =

-240.8994
-206.9385
-228.6303
-249.9772
-293.5449
-261.3568
-279.4719
-255.3435
-270.4987
-296.9481

>> LogProbApp

LogProbApp =

-144.7198
-114.0909
-130.4656
-151.1481
-166.9280
-150.5252
-166.1861
-108.0655
-125.5857
-161.1468

>> LogProbLoc

LogProbLoc =

-96.1796
-92.8476
-98.1648
-98.8291
-126.6168
-110.8315
-113.2858
-147.2780
-144.9131
-135.8014

Using more clean motorbike data

Previously, I had used 20 training images for parameter estimation of the location and appearance models. I re-ran the tests with 47 training images of motorbikes (sans background clutter). I then ran the recognition procedure on 10 different test images consisting of motorbikes, cars and faces.

The results were as follows:

CombinedLogProb =

-249.1764
-233.0733
-226.2195
-293.5680
-257.1131
-304.4388
-284.6287
-251.9015
-254.5245
-297.4117

>> LogProbApp

LogProbApp =

-141.6296
-118.7043
-118.2580
-179.9357
-138.6968
-186.3235
-162.7556
-110.6465
-137.2324
-161.1289

>> LogProbLoc

LogProbLoc =

-107.5468
-114.3690
-107.9615
-113.6323
-118.4163
-118.1153
-121.8730
-141.2551
-117.2921
-136.2827

Images 1-4 were bikes, 5-7 were cars, 8-9 were faces and 10 was another bike.

Wednesday, February 6, 2008

Model learning and recognition sans clutter and occlusion (for now)

Rob Fergus has been kind enough to email me a link to his code for his CVPR '03 paper. However, it's not running for me at the moment and it seems I need to recompile some MEX files. The difficulty is that the Linux workstations in the APE Lab have a different version of the gcc compiler installed than the one required, and installing the right version first is a bit of a pain.

So I'm going ahead with this on my own at the moment. The main complications in this method arise from trying to deal with occlusion and clutter. That's what forces an exhaustive search over an exponentially large hypothesis space during both learning and recognition. For now, I'll work with clean data and assume all the features arise from the object and not from background (as is the case with most of the images in the Caltech motorbike dataset).

Using this idea, I ran an experiment with 20 training images and about 10 features (also equal to the number of parts, since all features are assumed to arise from the object for now). Since there is no hidden variable now, I estimated the parameters for the appearance of each part using plain Maximum Likelihood estimation. In addition, I estimated the ML parameters for the joint density of the locations of all parts. Then, using these parameters, I ran the recognition procedure on the following images.

The first three images were selected from within the training set of 20 images, so the probability of recognition is expected to be high for these. The last image is selected from outside the training set and is deliberately chosen to be quite dissimilar from the training images.

While running the recognition code, there were numerical issues due to the location parameters being ill-conditioned: the covariance matrix of the joint Gaussian density over the part locations was nearly singular. Perhaps this happened because I wasn't using enough data. Also, I haven't yet imposed an ordering constraint on the X-coordinates of the detected features. Looking at the log probabilities for recognition from just the appearance models, they were -50.9192, -54.2892, -57.3182 and -792.5911 for the 4 images respectively.

It's probably a good thing that the fourth image had a lower matching probability as it does seem quite different from the other motorbike images in the training data.
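
For reference, a rough sketch of the plain ML estimation and recognition described above (all names are mine; App is N x P x d with the reduced appearance of part p in training image n, and Loc is N x 2P with each image's stacked part locations). The small ridge on the location covariance is the obvious workaround for the near-singularity mentioned above:

P = 10;                                   % number of parts
for p = 1:P
    Xp = squeeze(App(:, p, :));           % N x d appearances of part p
    muA{p} = mean(Xp, 1);
    SigmaA{p} = diag(var(Xp, 1, 1));      % diagonal ML covariance
end
muL = mean(Loc, 1);                       % single joint Gaussian over all part locations
SigmaL = cov(Loc, 1) + 1e-6 * eye(size(Loc, 2));   % ridge against near-singularity

logApp = 0;                               % recognition on one test image
for p = 1:P
    logApp = logApp + logGauss(testApp(p, :), muA{p}, SigmaA{p});
end
logLoc = logGauss(testLoc, muL, SigmaL);
logTotal = logApp + logLoc;

where logGauss is the usual multivariate Gaussian log-density:

function lp = logGauss(x, mu, Sigma)
d = numel(mu);
e = x(:) - mu(:);
lp = -0.5 * (d * log(2 * pi) + log(det(Sigma)) + e' * (Sigma \ e));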

Wednesday, January 30, 2008

Extracting features from faces and cars

So the appearance extraction process seems to be working quite well for bikes with a starting scale of 23. I wasn't sure that a single scale would work well for all categories. The detected features for faces and the tiled appearance patches are shown below:

Perhaps a smaller starting scale would work better? But that would mean tweaking the starting scale for each category, which would defeat the whole purpose. So that's ruled out. Here are similar results for cars:

Improving the appearances of the parts

The features that were extracted earlier didn't seem to provide much information. It's quite difficult even for a human to look at the extracted features and say that they belong to a motorbike. So I compared the output of my feature detection phase (which looked mostly like this) with the feature detection results from Rob Fergus' paper, which look like this:

The problem seemed to be the scale of the detected features. Somehow, small local features were firing more strongly than the more important larger features. I gradually increased the smallest admissible scale for detected features and finally settled on a starting scale of 23 (earlier it was 3). Using this value for the starting scale and choosing the top 20 saliency values, the outputs on various bikes looked like this:

This seems much better and closer to the output of Fergus et al. I extracted these newly detected features, resized them and tiled them into the image shown below. The 9 rows show the features (each rescaled into an 11 x 11 patch) extracted from the 9 motorbikes shown above, in row-major order.

Now, we can at least see the tyres of the motorbike in almost all the input images. The new appearances of the parts seem to provide more information about the image's category.
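
The extraction step itself is simple (a sketch with my own names; feats has one row [x y s] per detected feature):

patches = cell(1, size(feats, 1));
for i = 1:size(feats, 1)
    x = round(feats(i, 1)); y = round(feats(i, 2)); s = feats(i, 3);
    r = round(s / 2);                                % half-width of the square mask
    patch = img(max(1, y-r):min(end, y+r), max(1, x-r):min(end, x+r));
    patches{i} = imresize(patch, [11 11]);           % rescale to the canonical size
end
tiled = cell2mat(patches);                           % tile into one row for display
imagesc(tiled); colormap gray; axis image;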

Monday, January 28, 2008

Appearance of detected features

I took a bunch of motorbike images from the Caltech dataset and ran the feature detector on them. I extracted an appearance patch around each of the top 20 features in each image. The picture below shows what those patches look like (from 10 images).

Wednesday, January 23, 2008

Feature Extraction (Appearance)

The Kadir and Brady feature detector picks out a bunch of salient features from the image and gives us their locations and scales. For notational convenience, the locations and scales of all these features are aggregated into the vectors X and S. The third key source of information is appearance, and we now need to compute the vector A for a given image, which will contain the appearances of all the features.

To compute the appearance of a single feature, it is cropped out of the image using a square mask and then scaled down to an 11 x 11 patch. This patch can be thought of as a single point in a 121-dimensional appearance space. However, 121 dimensions is too high, and we need to reduce the dimensionality of the appearance space. This is done using PCA, selecting the top 10-15 components. The best reference for PCA that I have found so far is Prof. Nuno Vasconcelos' slides (nos. 28 and 29 give an outline) from his ECE 271A course. My code for computing the principal components from training data and projecting new data onto these principal components is posted here and here.

During the learning stage, a fixed PCA basis of 10-15 dimensions is computed from the patches around all detected regions across all training images. I'm not sure whether I should compute a single basis for all the classes or a separate basis for each class.
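
Roughly, the basis computation and projection go like this (a sketch; P is an N x 121 matrix with one flattened training patch per row):

k = 15;                                    % number of principal components kept
mu = mean(P, 1);                           % mean patch
Pc = P - repmat(mu, size(P, 1), 1);        % centre the data
C = (Pc' * Pc) / size(P, 1);               % 121 x 121 sample covariance
[V, D] = eig(C);
[srtd, order] = sort(diag(D), 'descend');  % rank eigenvectors by eigenvalue
B = V(:, order(1:k));                      % 121 x k fixed PCA basis
A = Pc * B;                                % N x k reduced appearance vectors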

Wednesday, January 16, 2008

Detecting Salient Regions

There is some useful Matlab code here for running the Kadir and Brady feature detector. The detected salient regions are marked by circles in the picture. There are probably too many features detected here; the desired number is around 30. I played around a bit with the parameters in the code and was able to reduce the number of detected features. The new detections are shown in the second figure.
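
The pruning itself can be as simple as keeping the top-N detections by saliency (a sketch; I'm assuming the detector's output has been collected into a matrix feats with one row [x y scale saliency] per feature):

nKeep = 30;
[srtd, order] = sort(feats(:, 4), 'descend');   % rank features by saliency
n = min(nKeep, size(feats, 1));
topFeats = feats(order(1:n), :);                % keep the most salient regions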