Our project this summer centers on identifying flower density in aerial images. Now that we have a working process for collecting and compiling these images, we are working to find an algorithm that will accurately compute flower densities. We decided to tackle this problem using machine learning.
Machine learning is well suited to this kind of problem. It is hard to say exactly how a computer should differentiate between flowers and non-flowers. With the help of machine learning, however, we don’t have to specify a precise rule. Instead, we can look for characteristics of the images that appear to differ between flowers and non-flowers, calculate these, and let the computer determine the boundaries and the full algorithm.
Creating these density maps required several steps. First, we wrote code that “tiles” an image, dividing it into small, overlapping square tiles. We then wrote several functions that calculate features on each of these tiles. These features include the average amounts of red, green, and blue per pixel, HSV thresholding, and texture analysis. They were chosen because they help to distinguish flowers from their surroundings. One of the ways we defined texture was by looking for large variations in color in small areas of a picture. By this definition, we would expect a flowering bush to have a lot of texture, whereas a homogeneous patch of grass would have very little. Similarly, thresholding for certain colors, for example the color of blooming buckwheat flowers, helps to distinguish flowers because a higher proportion of that color should appear in images of blooming buckwheat plants.
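For concreteness, here is a minimal sketch of what tiling and feature extraction like this might look like in Python. The tile size, stride, and HSV window below are illustrative placeholders, not our actual values:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def tile_image(img, tile_size=32, stride=16):
    """Yield overlapping square tiles and their top-left pixel coordinates."""
    height, width = img.shape[:2]
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            yield (x, y), img[y:y + tile_size, x:x + tile_size]

def tile_features(tile, hsv_low=(0.05, 0.2, 0.5), hsv_high=(0.15, 0.6, 1.0)):
    """Feature vector for one tile: mean R/G/B, HSV-threshold fraction,
    and a crude texture measure (local color variation)."""
    rgb = tile.reshape(-1, 3).astype(float) / 255.0
    mean_rgb = rgb.mean(axis=0)                          # average red, green, blue
    hsv = rgb_to_hsv(tile.astype(float) / 255.0)
    in_range = np.all((hsv >= hsv_low) & (hsv <= hsv_high), axis=-1)
    hsv_fraction = in_range.mean()                       # share of "flower-colored" pixels
    texture = (tile.astype(float) / 255.0).std(axis=(0, 1)).mean()  # color variation
    return np.concatenate([mean_rgb, [hsv_fraction, texture]])
```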
Unfortunately, none of these features are perfect. Images of the field station have a lot going on in them. There are many kinds of plants, many of which look alike; there are tiny flowers that can’t be seen from the air; and there are paths criss-crossing the ground that are remarkably similar in color to buckwheat. There are pebbles strewn across the ground that are shaped just like the flowers we see strewn across bushes. All of this makes it very difficult to accurately distinguish between flowers and non-flowers.
Of course, we knew starting the project that it would be difficult, but it is hard to picture just how well a flower can blend into its environment until you have seen aerial images of the Bernard Field Station.
Buckwheat and surrounding ground have similar colors and shapes.
Once we have all of the features of an image calculated, we need to give that data to a machine learning algorithm. For this purpose we chose to use Support Vector Regression (SVR) and Gaussian Processes. Both of these algorithms give a continuous output, which was important to us for calculating flower densities. We chose to test two different algorithms because we weren’t sure which would work best. While both of these have been used for similar image processing applications before, the best algorithm is very dependent on the specifics of the application. Testing both gave us a better chance of getting a good output.
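As a rough illustration, here is how the two regressors might be fit with Scikit-Learn. The array names and kernel parameters are placeholders, not our actual setup:

```python
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# features: one row of tile features per training tile
# densities: the measured flowers / sq. meter for each of those tiles
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(features, densities)
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(features, densities)

svr_pred = svr.predict(test_features)                         # continuous density estimates
gp_pred, gp_std = gp.predict(test_features, return_std=True)  # GP also gives an uncertainty
```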
Both SVR and Gaussian Processes output a number representing the density of flowers in the tile being tested. Our training data is in units of flowers per square meter, and the output is in the same units. We chose to treat this number as the density at the center of the tile, and to use interpolation between tiles to find the density at every point. Interpolating, rather than applying the calculated density to the whole tile, made a smoother map and prevented big jumps between tiles. It also makes sense because a tile can contain half of a bush and half grass. We don’t want to count everything in that tile as the same density, and the surrounding tiles can provide information on how the density is distributed.
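A sketch of that interpolation step using SciPy (the array names here are hypothetical):

```python
import numpy as np
from scipy.interpolate import griddata

# centers: (n_tiles, 2) array of tile-center pixel coordinates
# densities: (n_tiles,) predicted flowers / sq. meter, one value per tile
grid_x, grid_y = np.meshgrid(np.arange(image_width), np.arange(image_height))
density_grid = griddata(centers, densities, (grid_x, grid_y), method='linear')
```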
Finally, we were ready to create a density map. To do this we had to take all of the densities we calculated and match them up with pixel coordinates. We then used the contour map function in Python’s Matplotlib library to create a contour map. When we had written all of this code we were excited to finally see the result. We tested our program on images we had taken of the grassy quad in front of the Shanahan building on campus. However, it wasn’t quite what we had expected.
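Continuing the sketch above (aerial_image stands in for the original image array), the contour map itself can be drawn with Matplotlib:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.imshow(aerial_image)                           # the original aerial image underneath
contours = ax.contourf(density_grid, alpha=0.5)   # filled contours of interpolated density
fig.colorbar(contours, label='flowers / sq. meter')
plt.show()
```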
First attempt at a density map on the Shanahan quad. Densities do not align with the map correctly, creating a striping effect.
It looked like the machine learning was correctly calculating densities, but those densities weren’t lining up on the map at all. This error led to two weeks of combing through our code. I checked our tiling function, our plotting function, and even tried creating a map without using the contour function in case that was the error. Despite all of these efforts, the bug still eluded me. Finally, I gave up on looking through old code and tried rewriting it in a slightly different format. It was a good exercise, and the resulting code was much more organized than the original, since I already knew exactly what needed to happen. As I neared the end of the rewrite, I finally found the error: the coordinates in one for loop were switched to (y,x) instead of (x,y). Changing two characters solved all of our problems. In over 700 lines of code, it was just two switched characters that caused our bug. Checking your code early on, if at all possible, is definitely worth the effort to avoid the situation we found ourselves in! It’s almost always easier to find a bug in a few lines of code than in several hundred.
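To show the kind of bug this was, here is a reconstruction of the pattern (not our actual project code). NumPy arrays index as [row, column], which is [y, x], making this an easy mistake:

```python
# Reconstruction of the bug pattern, not our actual project code.
for (x, y), density in zip(tile_centers, densities):
    density_map[x, y] = density   # BUG: NumPy indexes as [row, col], i.e. [y, x]
    # Fix (the two-character change): density_map[y, x] = density
```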
After all of that work we were able to create our first real density maps.
(A)
(B)
Successful density maps of the Shanahan Quad, looking for green grass. (A) On a small test image and (B) on the entire map.
Now that all of our code is set up, we “just” have to find the training data and parameters that give the machine learning the most accurate density map possible. The parameters we choose are important because they tell the computer what kind of data to expect, what relationships to look for, and over what scales to look for those relationships. Because the parameters are so important in determining the structure of the final algorithm, changing them can have a huge effect on the output. While Tyler works on his bee simulation, I have been experimenting with parameters to find the best combination. The first few tests, with essentially random parameters, yielded interesting results. I tested on our first transect map, from the SE portion of the BFS.
The original map we used for testing. Two deer weed bushes are boxed. The map contains many examples of these flowering bushes.
First attempts at creating density maps on the BFS. Top: Support Vector Regression algorithm, Bottom: Gaussian algorithm, which fails to identify the bushes. The same bushes that were boxed in the original map are also boxed in both maps here.
These first density maps had three main problems that I wanted to solve:
1. The range of densities is very small compared to field data.
2. Flowering bushes are given the same density as non-flowering bushes and trees.
3. Large grassy areas with no flowers have a lot of noise in the density maps.
After using Scikit-Learn’s built-in fit scoring functions, I was able to narrow down my search a little. Fit scoring validates the trained algorithm on data it hasn’t seen. The training data I input is split in half: one half trains the algorithm, while the other half is reserved for testing it. You can then test again by swapping the roles of the two halves. Keeping part of the data aside during training lets me make sure the chosen parameters can actually generalize to new situations. One of the risks when training a machine learning algorithm is overfitting, which often happens when you both train the algorithm and tune its parameters on the same set of data. It may work perfectly on your training set but fail to generalize to any case that wasn’t in it, making it essentially useless. This is what cross validation aims to avoid.
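For example, a parameter search with two-fold cross validation might look like this in Scikit-Learn (the grid values are illustrative, not the parameters we actually searched):

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

param_grid = {'C': [0.1, 1, 10, 100],
              'epsilon': [0.01, 0.1, 1.0],
              'gamma': [0.001, 0.01, 0.1]}
search = GridSearchCV(SVR(kernel='rbf'), param_grid,
                      cv=KFold(n_splits=2, shuffle=True, random_state=0))
search.fit(features, densities)   # features and densities as in the earlier sketches
print(search.best_params_, search.best_score_)
```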
The parameters that I got from this process weren’t perfect. The main issue I ran into was that because we have so much data on grassy sections, all of which have zero flower density, the “best” algorithms the computer could find tended to correctly identify grass as having low density but misidentify flowers! While this matched the largest number of data points, it didn’t take into account that some points are more important than others. I used the parameters the computer gave me as a starting point, and by slightly changing one parameter at a time I was able to find sets of parameters that each did well in different respects.
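One way to tell the computer that the rarer flower tiles matter more is to weight the training samples. Scikit-Learn’s SVR accepts a sample_weight argument in fit; this is a sketch under that assumption, with an illustrative weighting scheme rather than anything we have settled on:

```python
import numpy as np
from sklearn.svm import SVR

densities = np.asarray(densities)
n_zero = (densities == 0).sum()             # abundant zero-density grass tiles
n_flower = max((densities > 0).sum(), 1)    # rarer flowering tiles
weights = np.where(densities > 0, n_zero / n_flower, 1.0)

model = SVR(kernel='rbf').fit(features, densities, sample_weight=weights)
```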
Each of the three problems I had at first has been largely solved individually. For the most part, I have found that the Gaussian algorithm tends to produce better results during this stage of testing. My labmates, Tessa and Clayton, have been collecting data along transects in the BFS using ground surveys, and I use their data to train my algorithm. I ran my algorithm on an image of one of the northern transects in the BFS. This image contains a transect and the surrounding area, making it ideal for testing: it includes both places where densities are known and places where they are unknown.
Transect image that I ran the subsequent tests on. The large buckwheat patch is outlined in red. There are some patches of small flowers present, but they are scattered and very tiny. The main goal is to identify the buckwheat patches.
I was able to reduce the noise in the grassy areas somewhat using the optimal parameters from the cross validation process. Unfortunately, I couldn’t get rid of all of the noise (short of producing a map of constant density for the entire image), but it is reduced to an acceptable level. However, the range on this map far exceeds the range of flower densities found at the field station. In addition, the map does a poor job of recognizing flowers.
Density map using parameters that reduce some of the noise in the grass. These parameters were chosen through the cross validation process. However, the range on this map is too large.
Using a different set of parameters, I was able to single out flowering buckwheat bushes from the surrounding trees; however, this map still has problems 1 and 3.
Density map using parameters that differentiate between buckwheat and a nearby tree. However, the range is still very small compared to the actual range in the BFS.
Using another set of parameters in the Gaussian algorithm, I got a better output range. For reference, the density of flowers in the field station ranges from 0 to about 24,000 flowers per square meter. However, this map still has problems 2 and 3.
Map with a better output range. The color legend displays the value in flowers per square meter for each color seen on the contour map. In this case the range still extends too far into negative values to be accurate; however, the upper end is much closer to the actual upper values of flower densities in the BFS.
Ultimately, the hope is that we can produce a map that solves all three of these problems at once and extends to the entire field station. In the last two weeks of research this summer, I will be working to find the set of parameters that does this. I will also be looking for more image features that might help to distinguish between flowering and non-flowering areas. Finally, as our labmates Tessa and Clayton collect more data on the transects, we will be flying the drone over each transect to create more maps, which will provide more training data for testing and validating. With all of these improvements, we hope to see the density maps become a lot more accurate! I will be testing on more images of the field station and, if all goes well, producing a flower density map of most of the southern portion of the BFS.
Further Reading:
If you are interested in our code and more details on how it works, feel free to visit our GitHub page here.

That is really amazing work! Two thoughts: have you tried filtering the image for the regions of the light spectrum that are most informative? If I remember correctly, vegetation has a different IR emittance than the ground. You might need an IR camera, though. A starting point might be to work on the 3 channels separately and see if any of them works better (maybe a two-pass analysis: the first to identify vegetation on the filtered data and the second to estimate flower density on the full spectrum). I would also suggest plotting the heatmaps with a different color palette, as the rainbow scale has some artifacts that might give the impression that the algorithm performs worse than it really does (a full explanation here http://www.research.ibm.com/people/l/lloydt/color/color.HTM and here http://earthobservatory.nasa.gov/blogs/elegantfigures/2013/08/05/subtleties-of-color-part-1-of-6/)
Thanks for reading! We were considering looking at both IR and UV when we started the project, but finding cameras small and light enough for the drone to carry was difficult. It is certainly something we would love to try in the future though. I agree that being able to first identify vegetation would be extremely helpful.
That's a really interesting aspect of colormaps that I had never read about before! I will look into different ways to display our data - having accurate heatmaps is definitely important as they are the easiest way to understand what our data says. Thanks again for reading!