And why Periscope Should’ve Held Out for a Little Longer

Spoiler Alert: This article references a recent episode of the show Silicon Valley. It only refers to material already shown in HBO-released previews, but if you’d like to stay completely out of the know, look away now.

In a recent episode of HBO’s “Silicon Valley”, one of the characters, Mr. Jian-Yang, builds an app called “Not Hotdog”. The app tells users whether an object is or is not a hot dog. At face value, it seems to be of little use, but it turns out to have very interesting wider applicability (watch the episode to find out more).

One of the comedic quirks is that the character, Mr. Jian-Yang, insists that the app enables two different tasks:

  1. Identifies whether an object is a hot dog.
  2. Identifies whether an object is not a hot dog.

At first glance, Mr. Jian-Yang’s insistence on rigidly drawing this distinction seems to be solely a result of his poor grasp of the English language. But the geek in me got thinking: what if there’s something more that we’re missing here? Could this be a reference to the technology powering the app (which, by the way, was actually built and is available for download on the App Store)?

The Distinction

Here’s where I think the distinction lies.

  • Mr. Jian-Yang has built a supervised binary image classifier, trained on a dataset that contains both labelled images of hot dogs and labelled images of objects that are not hot dogs.
  • When Emily Chang and the rest describe the tech as “machine learning to recognize if a food is a hot dog”, the pedantic Mr. Jian-Yang thinks that they’re assuming he’s built a self-supervised autoencoder.

Supervised Binary Image Classifier

Here’s what the code might look like:

from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam

# 296 x 296 RGB images
input_img = Input(shape=(296, 296, 3))

# Three convolution + pooling stages extract image features
x = Conv2D(32, (2, 2), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)

# Fully connected head with dropout for regularisation
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.2)(x)

# A single sigmoid unit: output near 1 means hot dog, near 0 means not hot dog
output = Dense(1, activation='sigmoid')(x)

model_binarysupervised = Model(inputs=[input_img], outputs=[output])
# Older Keras releases spell the learning-rate argument as lr
model_binarysupervised.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.002), metrics=['accuracy'])

This network would have to be fed examples of images of hot dogs (with the ‘y’ value set to ‘1’ for example) and images of objects that are not hot dogs (with the ‘y’ value set to ‘0’).

After training, the network should return a number close to ‘1’ for images of hot dogs and close to ‘0’ for the rest. Simple enough?
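To make that last step concrete, here’s a minimal sketch of how the sigmoid output could be turned into a verdict. The helper name `label_from_score`, the 0.5 cut-off and the example scores are all my own assumptions for illustration; they stand in for whatever the real app does with the model’s predictions.

```python
# Hypothetical helper: maps a sigmoid score in [0, 1] to one of two labels.
# The 0.5 threshold is an assumed default, not something taken from the app.
def label_from_score(score, threshold=0.5):
    """Return 'hot dog' when the score reaches the threshold, else 'not hot dog'."""
    return "hot dog" if score >= threshold else "not hot dog"


# Made-up scores standing in for the output of model_binarysupervised.predict(...)
example_scores = [0.97, 0.03, 0.61, 0.45]
labels = [label_from_score(s) for s in example_scores]
print(labels)
```

Note that both of Mr. Jian-Yang’s “tasks” fall out of the same number here: one score, one threshold, two possible answers.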

Self-Supervised Autoencoder

Here’s an alternative approach that Mr. Jian-Yang could’ve used:

# Credit: https://blog.keras.io/building-autoencoders-in-keras.html

from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
from keras.optimizers import Adam

# 296 x 296 RGB images, with pixel values scaled to [0, 1] to match the sigmoid output
input_img = Input(shape=(296, 296, 3))

# Encoder: three convolution + pooling stages shrink 296 -> 148 -> 74 -> 37
x = Conv2D(32, (2, 2), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# Decoder: three convolution + upsampling stages grow 37 -> 74 -> 148 -> 296
x = Conv2D(32, (2, 2), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (2, 2), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
# Three output channels reconstruct the RGB input
decoded = Conv2D(3, (2, 2), activation='sigmoid', padding='same')(x)

model_autoencoder = Model(input_img, decoded)
# Older Keras releases spell the learning-rate argument as lr
model_autoencoder.compile(optimizer=Adam(learning_rate=0.002), loss='binary_crossentropy')

In this case, the network need only be fed with images of hot dogs. Over several epochs, the autoencoder will learn to encode and decode images of hot dogs with low reconstruction loss, without any effort expended on learning what objects that are not hot dogs look like.
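A side note on the architecture above: the autoencoder can only compare its output with its input if the two have the same shape, which is why a 296 x 296 input is convenient. The sketch below traces the spatial size through three pooling layers with padding set to 'same' (which round up) and three 2x upsampling layers. The helper names `pooled_size` and `round_trip_size` are hypothetical, made up purely for this illustration.

```python
import math


def pooled_size(size, times, pool=2):
    """Spatial size after `times` max-pooling layers with padding='same' (ceil division)."""
    for _ in range(times):
        size = math.ceil(size / pool)
    return size


def round_trip_size(size, times=3, factor=2):
    """Size after `times` poolings followed by `times` upsamplings by `factor`."""
    return pooled_size(size, times, factor) * factor ** times


# Compare a side length divisible by 8 with one that is not.
for side in (296, 300):
    print(side, "->", pooled_size(side, 3), "->", round_trip_size(side))
```

A side of 296 comes back as 296, whereas 300 would come back as 304, so the decoded image would no longer line up with the original.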

In production, the reconstruction loss should be low for photos of hot dogs and noticeably higher for photos of other objects. By setting an appropriate threshold between the two, voilà, the same result is achieved!
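Here’s a rough sketch of what that thresholding might look like. Everything in it is an assumption on my part: it uses mean squared error instead of the binary cross-entropy loss from the model above (simply because it’s easier to read), it picks the threshold as the 95th percentile of errors on held-out hot dog images, and it uses random arrays as stand-ins for real photos and real autoencoder outputs.

```python
import numpy as np


def reconstruction_errors(originals, reconstructions):
    """Mean squared error per image, averaged over height, width and channels."""
    diff = originals.astype(np.float64) - reconstructions.astype(np.float64)
    return np.mean(diff ** 2, axis=(1, 2, 3))


def pick_threshold(hot_dog_errors, percentile=95.0):
    """Assumed heuristic: accept errors up to this percentile of held-out hot dog errors."""
    return float(np.percentile(hot_dog_errors, percentile))


def is_hot_dog(errors, threshold):
    """True where the reconstruction error is at or below the threshold."""
    return errors <= threshold


# Synthetic stand-ins: tiny 8 x 8 "images" instead of real 296 x 296 photos.
rng = np.random.default_rng(0)
hot_dogs = rng.random((20, 8, 8, 3))
# Pretend the autoencoder reconstructs hot dogs almost perfectly...
hot_dog_recon = hot_dogs + rng.normal(0.0, 0.01, hot_dogs.shape)
# ...and reconstructs everything else badly.
others = rng.random((20, 8, 8, 3))
other_recon = rng.random((20, 8, 8, 3))

hot_dog_err = reconstruction_errors(hot_dogs, hot_dog_recon)
other_err = reconstruction_errors(others, other_recon)

threshold = pick_threshold(hot_dog_err)
print("hot dogs accepted:", int(is_hot_dog(hot_dog_err, threshold).sum()), "of 20")
print("others accepted:", int(is_hot_dog(other_err, threshold).sum()), "of 20")
```

With real data the two error distributions will overlap far more than in this toy setup, so the percentile is a knob to tune rather than a fixed answer.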

In my opinion, the latter example is much more elegant, since it doesn’t require a negative dataset to be curated. If only Periscope had held out for a little longer ;).


Supervised Binary Image Classifier (Left) and Self-Supervised Autoencoder (Right)

Credit to Abishek Bhat for helping me thrash out the distinction.

