In my last semester in the Bee Lab, I have focused on improving the results of the Bee Forage Mapping algorithm. The goal of this algorithm is to take an aerial map of vegetation, automatically locate the flowers within it, and determine their species. This information is important because different species of flowers have varying importance and usefulness to bees. To really understand what resources are available to bees, then, we need to know not just how many flowers there are, but also what species they are. I am accomplishing this by using image processing to extract various features from the images and machine learning techniques to automatically learn which features are associated with different types of flowers.
An important part of this process is the data used to train the algorithm. Training data is essentially a set of examples given to a machine learning algorithm to “teach” it what to look for. For example, if you want your algorithm to learn to recognize the difference between dogs and cats, you could give it examples of dogs and examples of cats and tell it which is which. The algorithm then uses all of these examples to figure out which features are important for distinguishing between the two classes, and it can look for those same features in future data to classify it. Training data is therefore extremely important, and if it is incorrect in some way it will have a huge impact on your results. In this example, say that instead of giving the algorithm the correct labels we mistakenly labeled many pictures of cats as dogs. You can imagine that the algorithm would then do a very poor job of distinguishing between the two, and you would end up with many cats classified as dogs, an undesirable result. The quality of the training data thus has a large impact on the quality of the results.
| To us one of these is clearly a dog and one is a cat, but your computer can't tell the difference until you teach it! Image source: https://www.kaggle.com/c/dogs-vs-cats/data |
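For those curious what this looks like in code, here is a minimal sketch of training a classifier on labeled examples using scikit-learn. The feature values and labels below are invented purely for illustration and are not part of the actual Bee Forage Mapping code.

```python
# Minimal sketch of supervised learning on made-up features and labels.
from sklearn.ensemble import RandomForestClassifier

# Each row is one training example (e.g., simple image features), and each
# label tells the algorithm which class that example belongs to.
X_train = [
    [0.9, 0.2],  # made-up features from a "dog" image
    [0.8, 0.3],
    [0.1, 0.7],  # made-up features from a "cat" image
    [0.2, 0.9],
]
y_train = ["dog", "dog", "cat", "cat"]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)            # "teach" the classifier with labeled examples
print(clf.predict([[0.85, 0.25]]))   # classify a new, unlabeled example
```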
Because of this large impact, the first approach I took to improving the results was to improve the training data. The algorithm uses two supervised machine learning steps to locate and classify flowers in aerial images: the first step determines where flowers are in the image, and the second determines which species each of those flowers is. A supervised machine learning algorithm is one that requires labeled training data as input, in this case data telling it which flower species is which. Our training data consists of cropped-out images of flowering plants with labels indicating the species of each plant. In addition, there are images of bare ground that contain no flowering plants. These ground images are necessary so the algorithm can distinguish not only between species, but also between a flowering plant and the ground.
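To give a rough idea of how such a two-step pipeline could be structured, here is a conceptual sketch in Python. The classifiers, feature format, and function names are placeholders of my own, not the lab's actual implementation.

```python
# Conceptual sketch of the two-step approach: detect flowers, then classify species.
from sklearn.ensemble import RandomForestClassifier

flower_detector = RandomForestClassifier()     # step 1: flower vs. ground
species_classifier = RandomForestClassifier()  # step 2: which species

def train(window_features, labels):
    """labels are strings such as "ground" or a species name."""
    is_flower = [lab != "ground" for lab in labels]
    flower_detector.fit(window_features, is_flower)
    # Only the flower windows are used to train the species classifier.
    flower_feats = [f for f, fl in zip(window_features, is_flower) if fl]
    flower_labels = [lab for lab, fl in zip(labels, is_flower) if fl]
    species_classifier.fit(flower_feats, flower_labels)

def predict(window_features):
    results = []
    for f in window_features:
        if flower_detector.predict([f])[0]:
            results.append(species_classifier.predict([f])[0])
        else:
            results.append("ground")
    return results
```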
We use two different types of training data. The first is data obtained from a specific research area in which we had observed individual plants and marked them so that they could be found in the aerial map. The second kind of training data comes from transects. This data is much easier and faster to collect because it doesn’t involve finding individually marked plants, but it is much less specific. In a transect, a 50-meter line is divided into 1-meter increments, and a box is created that extends 1 meter to each side of the line. The data collected tells us which species are present in each 1x2 meter box, but not where within the box each species is. This became an issue in combining the training data, because the two sources had different levels of spatial precision as well as different image sizes. It is also important that the size of each training image match the size of the “window” that the algorithm looks at. To process an aerial image, the algorithm divides it into overlapping 100x100 pixel sections, called “windows”. Features for each of these windows are calculated and fed into the machine learning algorithm, which determines the species present in that window. Just as it is important that the sizes of the training images be consistent, it is also important that they match the size of the windows used by the algorithm. However, this was not the case with our old training data, which varied in size and was often much larger than the algorithm's windows.
| Research area training image of Penstemon spectabilis |
| Transect training image |
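To make the windowing idea concrete, here is a rough sketch of how an aerial image might be cut into overlapping 100x100 pixel windows and turned into simple features. The step size and the toy features (per-channel mean and variance) are assumptions for illustration, not the algorithm's real feature set.

```python
# Sketch of overlapping 100x100 pixel windows over an aerial image.
import numpy as np

def sliding_windows(image, size=100, step=50):
    """Yield (row, col, window) for overlapping square windows."""
    rows, cols = image.shape[:2]
    for r in range(0, rows - size + 1, step):
        for c in range(0, cols - size + 1, step):
            yield r, c, image[r:r + size, c:c + size]

def window_features(window):
    # Toy features: per-channel mean and variance of pixel values.
    return np.concatenate([window.mean(axis=(0, 1)), window.var(axis=(0, 1))])

# Example on a fake 3-channel "aerial image"
aerial = np.random.rand(400, 600, 3)
features = [window_features(w) for _, _, w in sliding_windows(aerial)]
```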
I addressed these issues in two steps. First, all of the images from the research area data were tiled into windows of the same size as the windows used to predict species. This solves the problem of varying sizes: many of the features we use can look very different depending on the size of the image (for example, the variance of an image large enough to contain both plant and ground will look very different from that of an image containing only a plant), which makes it difficult to train the algorithm effectively on images of varying sizes. Second, the transect data had to be handled differently, because tiling it directly was not possible: any given transect image is likely to contain both flowers and ground, so if it is split into small windows, some will contain only ground and some will contain only flowers. Labeling these individual windows then becomes problematic, as the only label available is the list of all species present in the image as a whole. Because of this added difficulty, the transect images labeled as containing flowers were discarded from the training set, and only the images containing nothing but ground were kept. This allowed us to split those images into small tiles so that they match the size used in the research area data set and in the classification algorithm.
| 100x100 pixel window shown on a research area training image |
| Resulting image of the correct size |
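Sketched in code, the tiling and filtering steps might look something like the following. The function names and the label format (a species string for each research area image, a list of species for each transect image) are my own simplifications for illustration.

```python
# Sketch of building a consistent training set from the two data sources.
TILE = 100  # tile size matching the classification windows

def tile_image(image, size=TILE):
    """Split an image into non-overlapping size x size tiles (edge remainders dropped)."""
    rows, cols = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, rows - size + 1, size)
            for c in range(0, cols - size + 1, size)]

def build_training_set(research_images, transect_images):
    tiles, labels = [], []
    for img, species in research_images:          # (image, species label) pairs
        for t in tile_image(img):
            tiles.append(t)
            labels.append(species)
    for img, species_present in transect_images:  # (image, list of species) pairs
        if species_present == ["ground"]:         # keep only ground-only transect images
            for t in tile_image(img):
                tiles.append(t)
                labels.append("ground")
    return tiles, labels
```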
| Confusion matrix showing the most recent results of the algorithm |
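For readers unfamiliar with confusion matrices: each row corresponds to the true class and each column to the predicted class, so the diagonal counts correct classifications. A matrix like the one above can be computed with scikit-learn, for example (the labels here are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# True (hand) labels and algorithm predictions for a handful of test windows.
y_true = ["ground", "ground", "P. spectabilis", "P. spectabilis", "P. spectabilis"]
y_pred = ["ground", "P. spectabilis", "P. spectabilis", "ground", "P. spectabilis"]

labels = ["ground", "P. spectabilis"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows: true, columns: predicted
```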
However, testing on these images alone is not enough to determine the full performance of the algorithm. We also want to test on a larger scale by visually comparing results on an aerial map. To accomplish this we used a labeling tool called LabelMe. With this tool we were able to hand label a small portion of an aerial map, marking all of the P. spectabilis in the image as well as the ground. The classification algorithm was then run on this same portion of the map, and I developed code to compare the algorithm's results with the hand labeling. An image of this comparison is included below. There are four possible outcomes in the comparison. Black represents places where the hand labeling and the algorithm agree that the pixel is ground. White represents places where both agree that the pixel is P. spectabilis. Purple represents places where the hand labeling says ground while the algorithm says P. spectabilis (a false positive). Finally, red represents places where the hand labeling says P. spectabilis while the algorithm says ground (a false negative).
| Comparison of Penstemon spectabilis identification between algorithm results and hand-labeled images. Black: both agree on ground. White: both agree on P. spectabilis. Purple: algorithm predicts P. spectabilis, mask label indicates ground. Red: algorithm predicts ground, mask label indicates P. spectabilis. |
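In case it is useful, here is roughly what that comparison code does, sketched with NumPy under the assumption that the hand labels and the algorithm output are available as boolean masks of the same shape (True = P. spectabilis). The names and exact color values are illustrative, not the actual code.

```python
import numpy as np

def comparison_image(hand_mask, pred_mask):
    """Build an RGB image: black = both ground, white = both P. spectabilis,
    purple = false positive, red = false negative."""
    out = np.zeros(hand_mask.shape + (3,), dtype=np.uint8)  # starts all black
    out[hand_mask & pred_mask] = (255, 255, 255)   # both agree: P. spectabilis
    out[~hand_mask & pred_mask] = (128, 0, 128)    # algorithm only: false positive
    out[hand_mask & ~pred_mask] = (255, 0, 0)      # hand label only: false negative
    return out
```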
This image shows that while the algorithm does correctly label large chunks of ground as well as parts of each penstemon plant, it also mistakes many areas of ground for P. spectabilis. The image also shows that the algorithm tends to recognize only a portion of each penstemon plant, rather than the entire plant. These are issues that I hope to improve upon in the next few weeks.
In the remainder of my time in the Bee Lab I will be looking at incorporating larger-scale features into the algorithm to help it identify the entirety of a plant, as well as improving its specificity so that it doesn’t label so much of the ground as Penstemon. I will also be working on improving the visual representations of the results so that future users of the algorithm can more easily interpret their own results. For example, overlaying the mask comparison maps on the original image and including more species in the mask comparison would help provide more insight into what causes the algorithm to make mistakes. While there is still work to be done on this algorithm, and many improvements that could be made, the results so far have been encouraging, and I am hopeful that it can soon become a useful tool for the Bee Lab and other researchers!
