Kaggle’s Yelp Restaurant Photo Classification Competition, Fast.ai Style: Part 1

Part 1 of my attempt to grapple with the Kaggle Yelp Restaurant Photo Classification competition, using the techniques (and code library) from fast.ai’s “Practical Deep Learning for Coders” course.

Lectures 3 and 4 of fast.ai’s Practical Deep Learning for Coders MOOC focus in part on multi-label image classification. Teacher Jeremy Howard uses the Understanding the Amazon from Space Kaggle competition for teaching purposes[1], and sets homework to try other similar image classification competitions.

The forums point to a template version of the Jupyter notebook used in the lecture, which suggests trying the Yelp Restaurant Photo Classification competition. On the surface it looks like a similar problem, but it actually turned out to be a pretty suboptimal match for the techniques used in the lecture and the default setup of the fast.ai library. It did, however, give me an opportunity to dig a little deeper than I might have otherwise.

Howard’s guidance is that students should aim to get an evaluation score which would put them in the top 50% of the leaderboard for the competition. Taking a look at the leaderboard I decided to aim slightly higher: I wanted to get an F1 score of at least 0.8. Ideally this score would be against the test set, but if I managed it against the validation set I’d be happy enough with that.

Specifics of the Competition

This competition has a degree of separation between the input and the expected output. To explain what I mean by that, let’s compare the training data from the Yelp competition to that from the Amazon competition. In both cases the data is supplied as jpeg images and CSV files. For the Amazon competition there is a single CSV file, mapping images to labels. It looks like this:

image_name,tags
train_0,haze primary
train_1,agriculture clear primary water
train_2,clear primary
train_3,clear primary
train_4,agriculture clear habitation primary road
train_5,haze primary water

Nice and simple. The name of the image is in the first column; the labels for the image are in the second. Mapping satellite images to appropriate labels is, after all, the point of this competition. For the Yelp competition there are two CSV files. The first maps businesses (not images) to labels:

business_id,labels
1000,1 2 3 4 5 6 7
1001,0 1 6 8
100,1 2 4 5 6 7
...
485,1 2 3 4 5 6 7
...

Again this represents the point of the competition. We’re trying to learn the right labels for a particular restaurant. The images are a data tool we use in order to do so. So, there is a second CSV file which maps images to businesses:

photo_id,business_id
204149,3034
52779,2805
278973,485
195284,485
19992,485
80748,485
...

This is the degree of separation: no direct mapping between the input data (the images) and the desired output (the labels).

This presents two main problems. The first is small: the data needs to be merged into a format which can be used to train a neural network. Solving this leads to the second, much bigger issue: many of the resulting label to image mappings are inappropriate. But there isn’t enough information in the data set to do anything other than map every label for a business to every image for that business.

Consider that there are 9 labels, numbered 0 to 8:

  0. Good for lunch;
  1. Good for dinner;
  2. Takes reservations;
  3. Outdoor seating;
  4. Restaurant is expensive;
  5. Has alcohol;
  6. Has table service;
  7. Ambience is classy;
  8. Good for kids.

Now consider business 485 from the data above. It has every label apart from 0 (good for lunch) and 8 (good for kids). Associated with it are these 4 images:

[The four photos associated with business 485]

Does each of those images demonstrate each of those labels? I can certainly see that the presence of wine glasses in the last suggests that alcoholic drinks are available. But, to my eyes, there’s nothing in any of the other three pictures which suggests booze is on the menu. Likewise I’m not sure the third image suggests any of the labels, yet in training it will be expected to match all of them.

That’s not all. From the description of the data:

Since Yelp is a community driven website, there are duplicated images in the dataset. They are mainly due to:

  1. users accidentally upload the same photo to the same business more than once (e.g., this and this)
  2. chain businesses which upload the same photo to different branches

Yelp is including these as part of the competition, since these are challenges Yelp researchers face every day.

So the same image might be in the training set multiple times, with entirely different labels each time. That’s a lot of mixed signals.

The upshot of this is that the problem is harder. This is borne out by the leaderboard results for the two competitions. The winning score for the Amazon competition is 0.93317 and 100th place has 0.92895. The winning score for the Yelp competition, however, is 0.83177, with 100th place getting 0.80087. Now, I want to stress that this is an apples to oranges comparison. The Yelp competition is graded using the F1 score, whereas the Amazon competition uses the F2 score, which punishes false negatives more harshly. Nevertheless, that’s a big difference and a larger drop off between 1st and 100th place.
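To make the scoring difference concrete, here’s a toy sketch (made-up labels, not competition data) using scikit-learn’s fbeta_score, which covers both metrics: F1 is the beta=1 case and F2 the beta=2 case. With a recall-poor prediction, F2 drops noticeably further than F1:

import numpy as np
from sklearn.metrics import fbeta_score

# One sample with three true labels, of which the prediction finds only one:
# precision is 1.0, recall is 1/3.
y_true = np.array([[1, 1, 1, 0]])
y_pred = np.array([[1, 0, 0, 0]])

f1_val = fbeta_score(y_true, y_pred, beta=1, average='samples')  # 0.50
f2_val = fbeta_score(y_true, y_pred, beta=2, average='samples')  # ~0.38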

Harder doesn’t mean impossible, though. I was curious as to whether the fast.ai techniques would work anyway. Beyond that I wondered if there was anything I could tweak to make them work better.

Processing the Input and Picking the Validation Set

I originally screwed this up and wasted a good few hours of training. The crux of the matter is this: your validation set should be based on the individual restaurants, not the individual images. I knew that the first time around, but I didn’t fully understand the way the fast.ai library would handle it. The following is how I built my second, correct validation set.

Side note: I did (and continue to do) all my work for the fast.ai course using the fast.ai template at Paperspace, which I can highly recommend. If you want to try it out you can use my referral code to get $5 credit here.

First things first, I set the paths for the input CSV files and loaded them into pandas data frames:

import pandas as pd

PATH = 'data/yelp/'

# The two CSV files supplied with the training data.
photo_to_biz = f'{PATH}/train_photo_to_biz_ids.csv'
biz_to_labels = f'{PATH}/train.csv'
photo_to_biz_data = pd.read_csv(photo_to_biz)
biz_to_labels_data = pd.read_csv(biz_to_labels)

Next I selected the businesses which will be used for validation using fast.ai’s get_cv_idxs method. This provides a random but deterministic[2] list of indices given a dataset size. I added a new column to the biz_to_labels_data data frame and set it to true for every business in the validation set.

val_biz_idxs = get_cv_idxs(biz_to_labels_data.shape[0])
val_biz_idxs_set = set(val_biz_idxs)

for index in range(biz_to_labels_data.shape[0]):
	biz_to_labels_data.loc[index, 'validation_set'] = index in val_biz_idxs_set

You specify the validation set to the fast.ai library by giving it a list of the indices in the data set which are to be used for validation. But these indices must be based on the on-disk order of the input files, not the order they appear in the input CSV. Remember above when I said that I originally messed up the validation set? This point about how the fast.ai library interprets the validation set indices is exactly where I went wrong. I didn’t look deeply enough at my original validation set, and that cost me a lot of time.

I joined the two data frames on the business_id field, then sorted the resulting data frame by photo_id. As the photo_id field corresponds to the filename of each image, sorting on it means the two orders are now the same. This done, the indices of the validation data can be found by taking the row number of each item which has the validation_set column I created above set to True.

joined = pd.merge(photo_to_biz_data, biz_to_labels_data, on='business_id')
joined.sort_values(by='photo_id', inplace=True)
val_idxs = [i for i in range(joined.shape[0])
            if joined.iloc[i, -1]]

Finally, what remains is to output just the photo_id and labels columns to a new CSV which can be read in by the fast.ai library:

photos_to_labels = f'{PATH}/train_photos_to_labels.csv'
joined.to_csv(photos_to_labels, columns=['photo_id', 'labels'],
			  index=False)

A key thing I learned here is that I need more experience with pandas. I’m pretty sure there are much more elegant and idiomatic ways of achieving the above. In the lecture, Howard recommends Python for Data Analysis which is written by the main author of pandas. That’s going on my todo list.
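In the meantime, here’s a rough sketch of a more vectorised way to do the same thing, assuming the data frames keep their default integer index. I haven’t run this against the competition data, so treat it as illustrative rather than drop-in:

# Flag validation businesses in one shot rather than looping.
biz_to_labels_data['validation_set'] = biz_to_labels_data.index.isin(val_biz_idxs)

# Merge, sort into on-disk (photo_id) order, and pull out the validation row numbers.
joined = (photo_to_biz_data
          .merge(biz_to_labels_data, on='business_id')
          .sort_values(by='photo_id'))
val_idxs = np.flatnonzero(joined['validation_set'].values)

joined.to_csv(photos_to_labels, columns=['photo_id', 'labels'], index=False)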

First Runs Through ResNet-34

I’m not going to go too deep into the nuts and bolts of actually training the neural network, nor talk about finding the learning rates. You can find pretty comprehensive notes and code samples for this in the fast.ai course forum here.

There are a few things which Andrew Ng’s Coursera Deep Learning Specialisation treats as advanced topics, but fast.ai bakes in from the outset. One of these is transfer learning. The starting point as taught by fast.ai is to use the ResNet-34 architecture with weights pre-trained against the ImageNet dataset. The trained weights are kept for the convolutional layers, but new fully connected classification layers are added to the end. Following the fast.ai recipe, I trained the new layers for 5 epochs, keeping the weights of the convolutional layers static. Then I unfroze the weights of the convolutional layers and continued training for a total of 7 epochs[3].

Something included in fast.ai from the start but not present in Andrew Ng’s course at all is one of Howard’s tricks for avoiding overfitting, and it comes in now. I increased the size of the input images from 224px to 299px, then repeated the above procedure. This makes the full regime (sketched in code after the list):

  1. 5 epochs with an image size of 224px and the convolutional layers frozen;
  2. 7 epochs with an image size of 224px and the convolutional layers unfrozen;
  3. 5 epochs with an image size of 299px and the convolutional layers frozen;
  4. 7 epochs with an image size of 299px and the convolutional layers unfrozen.
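For orientation, this is roughly what that regime looks like with the fast.ai (v0.7) API used in the course. The image folder name, batch size and learning rates here are illustrative guesses rather than my exact settings:

from fastai.conv_learner import *

def get_data(sz):
    # Standard side-on augmentation, resized to sz; labels come from the CSV built above.
    tfms = tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train_photos', photos_to_labels,
                                        tfms=tfms, val_idxs=val_idxs,
                                        suffix='.jpg', bs=64)

learn = ConvLearner.pretrained(resnet34, get_data(224))
lr = 0.01                          # in practice, found with learn.lr_find()
lrs = np.array([lr/9, lr/3, lr])   # differential rates once the conv layers are unfrozen

# Stages 1 and 2: 224px images, new head only, then the whole network.
learn.fit(lr, 5, cycle_len=1)
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)   # 1 + 2 + 4 = 7 epochs

# Stages 3 and 4: repeat at 299px.
learn.set_data(get_data(299))
learn.freeze()
learn.fit(lr, 5, cycle_len=1)
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)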

Why 224px and 299px? It’s mentioned in the lectures that these are the standard sizes of images in the ImageNet dataset, which the ResNet was trained against. When I originally started playing with the data I tried a three stage progression from 64px to 128px to 256px, but found I was getting much better results more quickly by going directly to 224px and 299px. This may or may not be the case for other datasets. Figuring it out is definitely an art.

The fast.ai library allows you to supply additional metrics when you train the network. These are entirely for the user’s feedback, and have no effect on the training itself. In order to get a better handle on how the training was actually going, I put together a function which returns the best case F1 value by picking the most effective decision boundary:

import warnings
import numpy as np
from sklearn.metrics import f1_score

def f1(preds, targs, start=0.17, end=0.50, step=0.01):

	# Ignore warnings (e.g. about ill-defined F1 for empty predictions).
	with warnings.catch_warnings():
		warnings.simplefilter("ignore")

		# Find the threshold which yields the best F1.
		# Note: np.arange(...) is essentially range(...) for floats.
		mapping = {th : f1_score(targs, (preds > th), average='samples')
				   for th in np.arange(start, end, step)}
		th = max(mapping.keys(), key=mapping.get)

		# Return the F1 generated by this threshold.
		return mapping[th]
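Hooking this in is then just a matter of passing it through the metrics argument when the learner is created, along the lines of the earlier sketch (get_data being the hypothetical helper from that sketch):

# Report the best-threshold, per-photo F1 after each epoch alongside the losses.
learn = ConvLearner.pretrained(resnet34, get_data(224), metrics=[f1])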

Running ResNet-34 with the above schedule gave me the following values for the training and validation losses, plus my highly optimistic per photo F1 metric.

[Chart: training and validation losses and the per-photo F1 metric across the four training stages]

You’ll notice that there’s a data point missing at the end of the second set of epochs. The Jupyter notebook had a bit of an issue here, and though the training finished successfully, the loss and metric output didn’t make it to the screen. Frustrating, but this is one of the dangers of using Jupyter for long lived training runs.

Processing the Output

With the training runs finished, the next step was to test against the validation set. At this stage I need a per-business, rather than per-photo, F1 score. More processing is needed.

Remember before when I said my use of pandas was far from elegant and idiomatic? Well… look away now if that bothered you, because it’s about to get a lot worse. One of the dangers of Python is that it’s really easy to use it as a write-only language. You can put a lot of power into a single line of code which makes no sense to you about an hour later.

Well... my quickly hacked-together solution for matching photos to businesses in the validation set is one of those times. It uses a series of three dictionary comprehensions to map the index of each photo in the validation photo set to the index of the appropriate business in the validation business set.

# Map the ids of businesses in the business validation set to their
# index in that set. 
val_biz_ids = {biz_to_labels_data.loc[val_biz_idxs[i], 'business_id'] : i
			   for i in range(len(val_biz_idxs))}
# Map the ids of photos in the photo validation set to their index
# in that set.
val_photo_ids = {joined.iloc[val_idxs[i], -4] : i for i in range(len(val_idxs))}

# Map index in the photo validation set to index in the business validation
# set.
photo_idx_to_val_biz_idx = {val_photo_ids[joined.iloc[i, -4]] :
							val_biz_ids[joined.iloc[i, -3]] for i in val_idxs}

With that done, I wrote a new method which first builds the per business predictions. I originally tried two approaches to this: taking the maximum of the predicted values for each class; and taking the mean of the predicted values. After a little bit of experimentation, I found that the mean[4] gave better results.

def photo_to_biz(preds, targs):
	
	# Initial storage for predictions and targets.
	biz_preds = np.zeros((len(val_biz_idxs), preds.shape[1]))
	biz_targs = np.zeros((len(val_biz_idxs), targs.shape[1]))
	# Counts of the number of photos observed for each business.
	# Used to calculate a rolling average.
	biz_counts = {}
	
	for val_idx in range(preds.shape[0]):
		biz_idx = photo_idx_to_val_biz_idx[val_idx]
		
		# Update the number of photos seen for this business.
		biz_count = biz_counts.get(biz_idx, 0) + 1
		biz_counts[biz_idx] = biz_count

		# Update the rolling mean of the predictions.
		frac = ((biz_count-1) / biz_count)
		biz_preds[biz_idx,:] = (biz_preds[biz_idx,:] * frac) + (preds[val_idx,:] / biz_count)
		
		# Use max to update the target values.
		# (Technically this only needs to be done once for each
		# business and could be precalculated).
		biz_targs[biz_idx,:] = np.maximum(biz_targs[biz_idx,:], targs[val_idx,:])

	return biz_preds, biz_targs
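For comparison, the same aggregation can be written far more compactly with a pandas groupby, assuming an array photo_biz_ids holding the business_id of each validation photo in the same row order as preds and targs. Again, this is a sketch rather than the code I actually ran:

def photo_to_biz_groupby(preds, targs, photo_biz_ids):
	# Attach the business id to each row of predictions and targets.
	preds_df = pd.DataFrame(preds)
	preds_df['business_id'] = photo_biz_ids
	targs_df = pd.DataFrame(targs)
	targs_df['business_id'] = photo_biz_ids

	# Mean prediction per business; max of the targets (which are
	# identical for every photo of a business anyway).
	biz_preds = preds_df.groupby('business_id').mean()
	biz_targs = targs_df.groupby('business_id').max()
	return biz_preds.values, biz_targs.values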

The output can then be fed into the F1 calculation above. Surprisingly (to me), the per-business F1 score actually came out higher than the per-photo score: 0.7845 vs 0.7565, which is a notable improvement.
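For completeness, that last step is just the following, where val_preds and val_targs are assumed to hold the per-photo predictions and targets for the validation set:

# Per-business predictions and targets, then the best-threshold F1 from earlier.
biz_preds, biz_targs = photo_to_biz(val_preds, val_targs)
print(f1(biz_preds, biz_targs))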

Not quite good enough to hit my goal, though.

Next time: Dirty hacks, improvements galore, submitting to Kaggle, and graphs. Lots of graphs. You can read it here.


  1. As he is fond of doing. ↩︎

  2. Meaning that it always returns the same output given the same input. ↩︎

  3. It’s actually more complicated than that, but as I noted above: that’s not important right now. ↩︎

  4. Again, this might not be the case for other datasets. I used a rolling calculation of the mean. This code could be made a little simpler by pre-counting the number of photos for each business. ↩︎