This report explores the ideas presented in Deep Ensembles: A Loss Landscape Perspective by Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan.
In the paper, the authors investigate the question: why do deep ensembles work better than single deep neural networks?
In their investigation, the authors find that:
Different snapshots of the same model (i.e., the model after 1, 10, 100 epochs of the same training run) exhibit functional similarity. Hence, an ensemble of them is less likely to explore the different modes (local minima) of the optimization landscape.
Different solutions of the same model (i.e., trained with a different random initialization each time) exhibit functional dissimilarity. Hence, an ensemble of them is more likely to explore the different modes (local minima) of the optimization landscape.
Inspired by their findings, in this report, we present several different insights that are useful for understanding the dynamics of deep neural networks in general.
Training a neural network is a stochastic process, i.e., each time you train a neural network, it may not arrive at exactly the same solution as before. Neural networks are optimized using gradient-based learning, and the optimization problem is almost always non-convex. Written with Greek letters, this optimization problem looks like so:
$ \operatorname{minimize}_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell\left(h_{\theta}\left(x_{i}\right), y_{i}\right) $
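where $\theta$ denotes the parameters of the network $h_{\theta}$, $\ell$ is the loss function, and $(x_{i}, y_{i})$ for $i = 1, \ldots, m$ are the $m$ training examples and their labels.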
Consider the figure below, which shows a sample non-convex loss landscape (typical for neural networks). As we can see, there are multiple local minima. A single training run can only reach one of these local minima. The same network can end up in a different local minimum each time it is trained with a different random initialization, which shows up as high variance in predictions.
We can also see that these local minima lie at the same level in the loss landscape, which further suggests that if a network ends up in one of these local minima, it will yield the same kind of performance more or less.
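To make this concrete, here is a minimal sketch on a toy problem. The double-well function, the learning rate, and the two starting points are our own illustrative choices (not from the paper); the point is only that plain gradient descent started from different initializations settles in different minima that sit at the same loss level.

```python
def loss(x):
    # Toy non-convex "loss" with two minima, at x = -1 and x = +1 (illustrative choice)
    return x**4 - 2 * x**2

def grad(x):
    # Derivative of the toy loss
    return 4 * x**3 - 4 * x

def gradient_descent(x0, lr=0.05, steps=200):
    # Plain gradient descent from the initial point x0
    x = float(x0)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Two different "initializations" of the same one-parameter model
inits = [-0.5, 0.5]
solutions = [gradient_descent(x0) for x0 in inits]
losses = [loss(x) for x in solutions]

print(solutions)  # the two runs end in different minima
print(losses)     # but reach (numerically) the same loss value
```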
To allow a network to cover these local minima better, we often train several versions of the same model but with different initializations. During inference, we take predictions from each of these different solutions, and we average their predictions. It works quite well in practice, and this process is referred to as ensembling. Ensembling also helps to reduce the high variance that might come from the predictions of individual models (the same network trained multiple times with different random initializations).
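The averaging step described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions: `ensemble_predict` is a hypothetical helper name, and the three small probability arrays stand in for the softmax outputs of three independently trained models.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several models and take the argmax."""
    probs = np.stack(prob_list, axis=0)   # (num_models, num_examples, num_classes)
    mean_probs = probs.mean(axis=0)       # (num_examples, num_classes)
    return mean_probs.argmax(axis=-1), mean_probs

# Stand-in softmax outputs of three models on two examples with two classes
p1 = np.array([[0.6, 0.4], [0.3, 0.7]])
p2 = np.array([[0.2, 0.8], [0.4, 0.6]])
p3 = np.array([[0.3, 0.7], [0.9, 0.1]])

labels, mean_probs = ensemble_predict([p1, p2, p3])
print(labels)
print(mean_probs)
```

Averaging probabilities (rather than hard labels) is one common choice; majority voting over predicted labels would be an alternative.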
To understand why ensembles work well, we need to figure out the ingredients that make these ensembles cover the loss landscape better.
Neural networks are parameterized functions, as we saw earlier. Each time we train a network, we end up in a different region of the parameter space, leading to a different optimum. The more diverse these regions, the better the coverage of different optima. So, how do we quantify this diversity?
To investigate this systematically, the authors do the following (among other things):
They measure the cosine similarity of the weights from different runs of the same network. Cosine similarity is a widely used metric for the similarity between two vectors. It measures their orientation and not their magnitude (refer to the figure below). Formally speaking, it is the dot product of the two vectors divided by the product of their respective norms. They want to examine the functional similarity of different trajectories (weights of the same model trained with different initializations).
Practically, we can do this by training the same model with different initializations, grabbing the trainable weights of each layer (ignoring biases), flattening them, and concatenating them into a single vector per model. We then apply the cosine similarity formula (NumPy implementation) to each pair of models.
import numpy as np
from numpy.linalg import norm
# compute cosine similarity of two flattened weight vectors
cos_sim = np.dot(weights1, weights2)/(norm(weights1)*norm(weights2))
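For several models at once, the same computation can be organized as a pairwise similarity matrix. The sketch below is a hedged illustration: `flatten_weights` and `pairwise_cosine_similarity` are hypothetical helper names, and the random arrays with made-up layer shapes only stand in for the per-layer kernels you would extract from your trained models (with biases already dropped).

```python
import numpy as np
from numpy.linalg import norm

def flatten_weights(layer_weights):
    """Concatenate a list of per-layer weight arrays into one 1-D vector."""
    return np.concatenate([np.ravel(w) for w in layer_weights])

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (norm(a) * norm(b)))

def pairwise_cosine_similarity(models):
    """models: list of models, each a list of per-layer weight arrays."""
    vectors = [flatten_weights(m) for m in models]
    n = len(vectors)
    sim = np.eye(n)  # every model has similarity 1 with itself
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = cosine_similarity(vectors[i], vectors[j])
    return sim

# Stand-in "models": random kernels with hypothetical layer shapes
rng = np.random.default_rng(0)
shapes = [(3, 3, 3, 16), (3, 3, 16, 32), (128, 10)]
models = [[rng.normal(size=s) for s in shapes] for _ in range(3)]

sim = pairwise_cosine_similarity(models)
print(sim.round(3))
```

The resulting matrix can be plotted as a heatmap, with one row and column per snapshot or per initialization.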
They measure the extent to which the predictions from different runs disagree with each other. The authors want to see whether models trained with different initializations fail on the same subset (or the complete set) of the test dataset. If a model trained with different inits produces different predictions on the test dataset, we can say that its predictions are a function of its initialization.
Also, the examples that tend to confuse the model across different initializations can be called intrinsically hard examples. To find them, we first compared confusion matrices epoch-wise, i.e., the confusion matrices from individual epochs of the same init. This was followed by a solution-wise comparison, i.e., confusion matrices from different solutions (inits) of the same model.
Practically, to compute the dissimilarity in predictions, count the test points on which the predicted labels of the two models agree, divide this count by the total number of test points, and subtract the result from 1.
# compute dissimilarity: fraction of test points on which the two models' predictions differ
dissimilarity_score = 1 - np.mean(np.equal(preds1, preds2))
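Extending this to every pair of snapshots (or every pair of initializations) gives a disagreement matrix. The following is a minimal sketch with hypothetical helper names and tiny hand-written label arrays standing in for real model predictions.

```python
import numpy as np

def prediction_dissimilarity(preds_a, preds_b):
    """Fraction of examples on which two sets of predicted labels differ."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

def dissimilarity_matrix(all_preds):
    """all_preds: one array of predicted labels per model or checkpoint."""
    n = len(all_preds)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            mat[i, j] = prediction_dissimilarity(all_preds[i], all_preds[j])
    return mat

# Stand-in predicted labels from three models on four test examples
all_preds = [
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [2, 0, 1, 1],
]
mat = dissimilarity_matrix(all_preds)
print(mat)
```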
Before we dive deep into the experiments mentioned above, it is essential to review our experimental setup.
keras-idiomatic-programmer repository
Note: We did not exactly follow what is specified in section 3 of the paper. There are minor differences between our experimental setup and the one the authors followed.
For convenience, below, we specify how the learning rate schedule would look and the data augmentation pipeline we followed.
import tensorflow as tf

def augment(image, label):
    image = tf.image.resize_with_crop_or_pad(image, 40, 40) # Zero-pad to 40x40 (8 extra pixels per dimension)
    image = tf.image.random_crop(image, size=[32, 32, 3]) # Random crop back to 32x32
    image = tf.image.random_brightness(image, max_delta=0.5) # Random brightness
    image = tf.clip_by_value(image, 0., 1.) # Keep pixel values in [0, 1]
    return image, label
We used Google Colab for running all of our experiments.
Going back to our experiments, we are going to present them in two different flavors:
Note: By snapshots, we refer to models taken from epoch 0, epoch 1, and so on from the same training run (same initialization).
The functions (different checkpoints of the same model) in the same trajectory are similar, and it holds for all variants (small, medium, and large) of the model.
The cosine similarity between the weights of different snapshots of the same model becomes high as training approaches convergence. Thus, there is not much change in the weight space once the trajectory has settled into a valley of the loss landscape.
The checkpoints from the later stages of training differ the most from those of the initial stage, with only mild similarity in between (whitish region).
The models trained with different initialization (different trajectories) are entirely dissimilar. This holds for all three variants of the model.
Thus, initialization decides the weight space the model will explore.
The functions (different checkpoints of the same model) in the same trajectory tend to disagree less in their predictions, further confirming that functions in the same trajectory are similar.
From the prediction dissimilarity plot, we can see that different snapshots of the same model start showing a high degree of similarity with each other as training approaches convergence (increasing epoch). Thus, one can say that the mapping $x \rightarrow y$ stays fixed for many examples once the trajectory has settled into a valley of the loss landscape.
We also observe high dissimilarity in predictions between the checkpoints from the later stage of training and the very initial stage of training.
The predictions of the same model trained with different initializations on the same dataset and with the same hyperparameters disagree.
Obviously, there is a subset of examples on which models trained along different trajectories will agree.
There must also be a subset of intrinsically hard examples that models trained along different trajectories will misclassify similarly. We shall investigate this in the next section.
Below we see that the set of examples that confuses a model changes from epoch to epoch as optimization proceeds. We further see that this set varies when we train the model with a different initialization. Due to space constraints, we could not list the results from all the different initializations, but feel free to check them out here. This suggests that the definition of intrinsically hard examples is relative to how a model is initialized. It may also suggest that the images that cause the top losses during training (epoch-wise) are not the same when we change the initialization of a model.
Note: You can click on the little button located at the top-left corner and play with the slider to see how the confusion matrices change with epochs.
The idea of creating an epoch-wise callback is adapted from this tutorial.
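For readers who want to reproduce this kind of comparison offline, here is a minimal NumPy sketch. It builds a confusion matrix from labels and predictions, and then measures how much the sets of misclassified examples of two runs overlap. The Jaccard overlap is our own illustrative choice of summary (it is not a metric from the paper), and the helper names and toy labels are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # Add 1 to cell (true, predicted) for every example, handling repeats correctly
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def misclassified_overlap(y_true, preds_a, preds_b):
    """Jaccard overlap between the sets of examples two models get wrong."""
    y_true = np.asarray(y_true)
    wrong_a = np.asarray(preds_a) != y_true
    wrong_b = np.asarray(preds_b) != y_true
    union = np.sum(wrong_a | wrong_b)
    return float(np.sum(wrong_a & wrong_b) / union) if union else 1.0

# Toy labels and predictions from two hypothetical initializations
y_true  = [0, 0, 1, 1, 2, 2]
preds_a = [0, 1, 1, 1, 2, 0]   # wrong on examples 1 and 5
preds_b = [0, 0, 1, 2, 2, 0]   # wrong on examples 3 and 5

cm_a = confusion_matrix(y_true, preds_a, num_classes=3)
overlap = misclassified_overlap(y_true, preds_a, preds_b)
print(cm_a)
print(overlap)
```

A low overlap between two initializations would be consistent with the observation that the set of confusing examples depends on the initialization.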
We talked about different initializations of the same model and observed functional dissimilarity between them. To spice it up, let's try to visualize the paths of different trajectories. The authors do so by taking three (for simplicity) different trajectories (inits) of the same model. They then take the softmax outputs from different checkpoints along the individual training trajectories and append them to an array. The shape of the array should be (num_of_trajectories, num_of_epochs, num_of_test_examples, num_classes). They then compute a 2-component t-SNE of this array.
The predictions from all the solutions and their individual epochs were appended to a single array because they belong to the same "space". We apply 2 component t-SNE to reduce this higher dimensional space to a two-dimensional space. Below is the result of this experiment for Small and Medium sized CNN. And wow!
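Since t-SNE expects a 2-D matrix with one row per point, the 4-D array has to be flattened so that every checkpoint becomes a single row. The sketch below shows one plausible way to do this; whether the authors flatten in exactly this way is an assumption on our part, `prepare_for_tsne` is a hypothetical helper, and the random softmax values only stand in for real model outputs.

```python
import numpy as np

def prepare_for_tsne(softmax_outputs):
    """Flatten (trajectories, epochs, examples, classes) to one row per checkpoint."""
    n_traj, n_epochs, n_examples, n_classes = softmax_outputs.shape
    points = softmax_outputs.reshape(n_traj * n_epochs, n_examples * n_classes)
    # Remember which trajectory and epoch each row came from, e.g. for colouring the plot
    traj_ids = np.repeat(np.arange(n_traj), n_epochs)
    epoch_ids = np.tile(np.arange(n_epochs), n_traj)
    return points, traj_ids, epoch_ids

# Stand-in softmax outputs: 3 trajectories, 5 epochs, 20 test examples, 10 classes
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 20, 10))
softmax = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

points, traj_ids, epoch_ids = prepare_for_tsne(softmax)
print(points.shape)

# `points` can then be handed to a t-SNE implementation, for example
# sklearn.manifold.TSNE(n_components=2, perplexity=...).fit_transform(points),
# with a perplexity smaller than the number of rows.
```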
From the plots (shown below), it is evident that the models with different initializations follow different trajectories. As training approaches convergence, the checkpoints of a trajectory tend to cluster around the same valley. Even though the models reach similar accuracy, we can clearly see evidence of multiple minima which lie on the same plane.
Another interesting question the authors explore is how ensemble size affects the overall test accuracy. Below we can see that as we keep increasing the ensemble size, the model performance improves. For SmallCNN, the improvement plateaus after a certain ensemble size. We think this might be because a small-capacity model does not reach an optimum solution on the training dataset. Ensembling predictions does help improve model performance, but after reaching peak performance, the uncertainty from multiple suboptimal models outweighs the benefit of ensembling.
This suggests that an ensemble is able to cover the optimization landscape better than a single model, and indeed that seems to be the case.
Although this behavior is interesting, in deployment-related situations using a large ensemble of very heavy models might not be practically feasible.
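An accuracy-versus-ensemble-size curve can be computed by adding one model at a time to a running average of the predicted probabilities. This is a minimal sketch under our own assumptions: `ensemble_accuracy_by_size` is a hypothetical helper, and the toy probabilities stand in for the softmax outputs of independently trained models.

```python
import numpy as np

def ensemble_accuracy_by_size(prob_list, y_true):
    """Accuracy of the averaged prediction using the first k models, for k = 1..K."""
    y_true = np.asarray(y_true)
    running_sum = np.zeros_like(np.asarray(prob_list[0]), dtype=float)
    accuracies = []
    for k, probs in enumerate(prob_list, start=1):
        running_sum += np.asarray(probs)
        preds = (running_sum / k).argmax(axis=-1)
        accuracies.append(float(np.mean(preds == y_true)))
    return accuracies

# Toy example: three models, three test examples, two classes
y_true = [0, 1, 1]
m1 = np.array([[0.6, 0.4], [0.6, 0.4], [0.3, 0.7]])  # wrong on the second example
m2 = np.array([[0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
m3 = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

accuracies = ensemble_accuracy_by_size([m1, m2, m3], y_true)
print(accuracies)
```

In a real experiment the order in which models are added is arbitrary, so averaging the curve over several random orderings would give a smoother picture.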
In addition to the experiments based on the checkpoints along a trajectory, the authors also explore the subspace along an individual trajectory. The subspace along a trajectory is the set of functions (solutions) that lie in the function space around the explored solution and that could be reached when retraining with the same initialization. The authors use a representative set of four subspace sampling methods:
The authors construct their subspace around an optimized weight-space solution θ (the weights and biases of a trained neural network). Using the same experimental setup as for the t-SNE plot, they show that the sampled subspace lies in the same valley as the optimized solution, while a different solution lies in a different valley.
The authors validate two hypotheses:
The plot below summarizes these:
With simple experiments, the paper we discussed in this report gives us an excellent understanding of why (deep) ensembles are so powerful: they cover the optimization landscape better. Below we leave you with a couple of amazing papers in case you are interested in knowing more about different aspects of deep neural networks:
Thanks to Yannic Kilcher for his amazing explanation video of the paper which helped us pursue our experiments.
Thanks to Balaji Lakshminarayanan for providing feedback on the initial draft of the report and rectifying our mistake on the t-SNE projections.
Hope you have enjoyed reading this report. For any feedback reach out to us on Twitter: @RisingSayak and @ayushthakur0.
Sayak Paul and Ayush Thakur have contributed equally to this report.