AI Writer


My Dear Megan Tang, this is our love story.

Human or machine? Human. Why? I found some drops of Soylent on it.

Hello World, welcome to Sirajology! In today’s episode we’re going to build an AI Writer. That is, an app that can write a short story about an image just by looking at it. I swear I’m being serious, this is actually possible. Sorry Stephen King, you’re out of the game. Humans have been writing stories since day 1. When we tell stories, we synthesize our past experiences in life with our command of language to produce a narrative that we

can share with others to give them similar neural thought patterns. Isn’t it weird

how just by stringing together an exact combination of words to form sentences we can produce something of profound beauty? We have a bunch of thoughts in our head, abstractions we’ve built from just living life, representations of our experiences. When we read these stories, our brains are somehow encoding them into thoughts. When we ‘encode’ a sentence to a thought, the more semantically similar it is to an existing thought, the more we’ll

be able to relate to it. Our brains can detect semantic similarity to existing thoughts,

the essence of a piece of prose is already in us, and its similarity can unlock that essence and potentially produce a flood of emotion in any direction. This is likely why relatability is such a valued concept in marketing. So how do we get an AI to write a story if it doesn’t have any experience living life in the real world? We’re going to build an AI Writer in Python using the deep

learning library Lasagne, and we’ve got a lot of code to go over, so I’ll explain as

we go. Let’s get pythonicCCCCC. At the highest level we could code this app in just 3 lines of code, it’s a little ridiculous. We import the generate module, then call the load_all function, which will initialize all of our machine learning models. Then we call the story function with the loaded models and the image location as parameters. That’s it, it’ll output a story. But let’s dive a little deeper. The load_all function is just boilerplate

initialization so let’s take a closer look at the story function, where the real magic

happens, and see how that works. We’ll start off by loading an image into memory. This will be the image that we want to tell a story about. We’ll use the load_image function to load it, with the parameter set to the location of the image on our machine. The load_image function uses the scientific computing library NumPy to get the byte representation of the image and then resizes it so it’s smaller while preserving its aspect ratio.

Then it’ll crop it and format it for later input.
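To make the resize-and-crop step concrete, here is a minimal sketch in NumPy. It is not the app's actual load_image code: the function names, the 256-pixel short side, the 224-pixel crop, and the nearest-neighbor resize are all assumptions chosen for illustration, and it works on an in-memory array instead of reading a file from disk.

```python
import numpy as np

def resize_keep_aspect(img, short_side=256):
    """Nearest-neighbor resize so the shorter side becomes short_side pixels."""
    h, w = img.shape[:2]
    scale = short_side / min(h, w)
    new_h = max(short_side, round(h * scale))
    new_w = max(short_side, round(w * scale))
    # For every output row/column, pick the closest source row/column.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def center_crop(img, size=224):
    """Cut a size x size square out of the middle of the image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def prepare_image(img, short_side=256, crop=224):
    """Resize, crop, and reshape to a 1 x channels x height x width float batch."""
    small = resize_keep_aspect(img, short_side)
    cropped = center_crop(small, crop)
    return cropped.astype(np.float32).transpose(2, 0, 1)[np.newaxis]

# A random array stands in for a decoded 640x480 RGB photo.
fake_photo = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
batch = prepare_image(fake_photo)
print(batch.shape)  # (1, 3, 224, 224)
```

A real pipeline would typically also subtract the mean pixel values the network was trained with, which this sketch leaves out.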

Once we’ve loaded our image, it’s time to input the image into a deep convolutional neural network to retrieve its features. In a previous episode we talked about how convolutional neural nets were great for image recognition, since they roughly mimic the human visual cortex. This CNN is pre-trained. We initialize it in the build_convnet function, which is called in the boilerplate load_all function. Once we’ve specified all the layers,

we load up our pre-trained weights file for VGG-19. This network was trained

on a huge dataset of labeled images, so it will be able to recognize the objects in a novel image. Once we input our image into our CNN, it’ll return an array of features for us. These features are the highest-level features in our neural net, from the layer right before the output layer: the most abstract representation of the image, its content. Once we have our features, we’ll want to encode the image features into a multimodal

neural language model. So what is this? Well, it’s based on the paper called ‘Unifying

Visual-Semantic Embeddings’. In our code we’re using a pre-trained model that embeds images and sentences into a joint multimodal vector space. It uses an LSTM to encode the sentence and a CNN to encode the image. Then a decoder neural language model generates a novel description from the image. Since our model is pre-trained, when we embed our image into this multimodal space, our features are updated to include the weight

of the joint space. Then we compute the nearest neighbors. To do this, we first retrieve the

array of scores, that is, a closeness score between the image and each candidate caption sentence, which we then sort in order of closeness. Super close…. Then we’ll want to print out the nearest captions. Now that we have a set of caption sentences, we’ll want to compute a skip-thought vector for each sentence. A skip-thought vector is a vector representation of a sentence. This is another implementation

of the encoder-decoder model, the encoder and decoder are both recurrent neural networks.
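Stepping back for a second to the nearest-neighbor caption step from a moment ago, here is a rough sketch of how scoring and sorting could look. It assumes the image and the candidate captions have already been embedded into the same vector space and that closeness is cosine similarity; the function name, the tiny 2-dimensional vectors, and the captions are made up for illustration.

```python
import numpy as np

def nearest_captions(image_vec, caption_vecs, captions, k=3):
    """Return the k captions whose embeddings are closest to the image embedding."""
    # Normalize so that a dot product equals cosine similarity.
    image_vec = image_vec / np.linalg.norm(image_vec)
    caption_vecs = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = caption_vecs @ image_vec        # one score per caption
    order = np.argsort(scores)[::-1][:k]     # highest score first
    return [(captions[i], float(scores[i])) for i in order]

# Toy joint space with made-up embeddings.
captions = ["a dog on a beach", "a plate of food", "a man riding a wave"]
caption_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
image_vec = np.array([0.9, 0.1])

top = nearest_captions(image_vec, caption_vecs, captions, k=2)
for caption, score in top:
    print(caption, round(score, 3))
```

In this toy setup the beach caption ranks first and the wave caption second, because their vectors point most nearly in the same direction as the image vector.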

We take an input sentence and ‘encode’ it into a skip-thought vector by inputting it into the encoding recurrent neural net. Since we are modeling a sequence of words, we’ll use Gated Recurrent Units, or GRUs, as our recurrent units. GRUs consist of 2 ‘gates’: an update gate and a reset gate. The gating units modulate the flow of data inside the unit, and unlike LSTM cells there are no separate memory cells. LSTM cells control the amount

of memory content that is seen or used by other units in the network. GRU cells don’t,

they expose their full content without any control. So GRUs have a less complex structure and are thus more computationally efficient. We’re starting to see these be used more and more, they’re relatively new. So when we feed the sentences into the RNN, it’ll create an abstraction, the vector representation or skip-thought vector. Sentences that share semantic and syntactic properties will be mapped to either the same or similar skip-thought vectors.
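Here is a minimal NumPy sketch of a single GRU step and of running it over a sentence to get one vector out. It is not the pre-trained skip-thought encoder: the weights are random, the sizes are tiny, and the names init_params, gru_step, and encode are hypothetical. It also uses one common convention for the update gate; some write-ups swap the roles of z and 1 - z.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_size, hidden_size, seed=0):
    """Random stand-in weights for the update (z), reset (r), and candidate (h) parts."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(0, 0.1, (hidden_size, input_size))
        p["U" + gate] = rng.normal(0, 0.1, (hidden_size, hidden_size))
        p["b" + gate] = np.zeros(hidden_size)
    return p

def gru_step(x, h, p):
    """One GRU update: two gates, no separate memory cell."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])  # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_tilde  # blend old state and candidate

def encode(word_vecs, p, hidden_size):
    """Feed word vectors in order; the final hidden state is the sentence vector."""
    h = np.zeros(hidden_size)
    for x in word_vecs:
        h = gru_step(x, h, p)
    return h

# A made-up "sentence" of 5 words, each a 4-dimensional word vector.
params = init_params(input_size=4, hidden_size=8)
sentence = np.random.default_rng(1).normal(size=(5, 4))
thought_vector = encode(sentence, params, hidden_size=8)
print(thought_vector.shape)  # (8,)
```

Notice that the whole hidden state h is handed to the next step as-is, which is the "exposes its full content" point: there is no output gate deciding how much of it gets seen.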


The function returns these vectors as a NumPy array, which we can then modify via the style shift function. We’ll take our thought vectors and modify them to match the style of stories using a pre-trained recurrent neural network. The RNN was trained on a dataset of romance novels, where each passage was mapped to a thought vector. So we’re essentially computing a function that looks like this for a style shift: F(x) = x - c + b. F(x) is a book passage thought vector,

x is an image caption vector, c is a caption style vector, and b is a book style vector. We remove

the caption style from the caption and replace it with a book style to create a book passage vector. Once we have our book passage styled vector, we can generate the story by running the decoder function on it. The decoder is another recurrent neural network that, given a vector representation of a sentence, can predict the previous sentence and the next sentence. We’ll run the decoder on our passage vector and that will generate our story based

on the image for us. Let’s take a look at what it says about this picture. Let’s read

the first few sentences: ‘She was taking the man out of her mouth and she gave him a gentle shake of her head. oh my god, i cant wait to see what happened in the past twenty for hours, i had never met a woman before.’ This thing is a pro. For a small chunk of code, there’s a lot of machine learning going on here. We used a convolutional neural net to compute image features, an LSTM recurrent neural net to

encode our image into joint space and retrieve the sentence captions, a GRU recurrent neural

net to calculate the skip-thought vectors of those sentences, and after style shifting, an RNN to decode our passage vector to a story. That’s 4 neural nets. That’s what it takes to be cool these days 🙂 You can run this on your local machine since the necessary models are pretrained, you can also test this out at That site has a bunch of pre-trained deep learning models that you can test out, it’s super useful. For more

info, check out the links below and I just signed up for Patreon so if you guys find

my videos useful I’d really appreciate your support to help me continue doing this full-time. Please subscribe for ML videos, and for now I’ve gotta go fix a null pointer exception, so thanks for watching 🙂