Pruning ResNets for Fun and Profit

What are Neural Networks?

Neural networks are highly parameterized functions which can be trained to fit/represent/model data.

Parameterized models are ubiquitous; even simple linear regression, fitting \(f(x) = mx+b\) to find the best fit line, is parameterized by the (learnable) slope \(m\) and intercept \(b\). In practice, neural networks are typically far more complex than simple linear regression. Typical neural network models are a composition of several to many simpler parameterized functions, often referred to as layers. We tend to think of data flowing through a neural network layer-by-layer; the output of one layer becomes the input to the subsequent layer. These iterative transformations and the many learnable parameters give neural networks their expressive power: the ability to approximate functions and learn distributions of data.
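To make the contrast concrete, here is a minimal sketch (in PyTorch, which is assumed throughout these examples rather than prescribed by the text) of the two kinds of parameterized models just described: a single linear layer playing the role of \(f(x)=mx+b\), and a small layered network in which the output of each layer feeds the next. The layer sizes are arbitrary.

    import torch
    import torch.nn as nn

    # Linear regression: a single layer with one learnable slope m and intercept b.
    linear_regression = nn.Linear(in_features=1, out_features=1)  # weight ~ m, bias ~ b

    # A small layered network: data flows layer by layer; the output of each layer
    # becomes the input to the next. The nonlinearities between layers give the
    # composition more expressive power than a single linear map.
    mlp = nn.Sequential(
        nn.Linear(1, 16),
        nn.ReLU(),
        nn.Linear(16, 16),
        nn.ReLU(),
        nn.Linear(16, 1),
    )

    x = torch.randn(8, 1)                            # a batch of 8 scalar inputs
    print(linear_regression(x).shape, mlp(x).shape)  # both produce an (8, 1) output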

What is Neural Network Model Pruning?

Neural network models are often over-parameterized and frequently contain millions or even billions of learned parameters. For some of the most difficult machine learning tasks, these massive models may be required to achieve state-of-the-art accuracy (or even just acceptably good accuracy). However, in most cases, models can be greatly simplified and still maintain most of their accuracy. Neural network pruning is one method for simplifying a neural network model.

Figure 1 illustrates the pruning process. In this example, the full model’s layer has 3 inputs (\(x_1\), \(x_2\), and \(x_3\)), 3 outputs (\(y_1\), \(y_2\), and \(y_3\)), and 9 learned weights (\(w_{i,j}\) for \(i, j\in {1, 2, 3}\)). The output \(y_1\) is a weighted sum of the inputs:

\[y_1 = w_{1,1}x_1 + w_{2,1}x_2 + w_{3,1}x_3.\]

If it’s the case that \(|w_{1,1}| + |w_{2,1}| + |w_{3,1}|\) is very close to zero, then \(y_1\) will be very close to zero for most inputs (assuming none of the \(x_i\)’s are very large).

Figure 1: A simple example of pruning. Removing one output feature reduces the number of learned weights (edges) and the amount of computation required to compute all the layer outputs.

If this is the case, then \(y_1\) varies very little with the network’s input and thus has little impact on the network’s predictions. Pruning exploits this insensitivity by simply removing the output \(y_1\) and the weights used to compute it (\(w_{1,1}\), \(w_{2,1}\), and \(w_{3,1}\)); note that removing these weights is equivalent to fixing \(y_1\) at a constant value of 0.

Technical Note: If we remove the weights \(w_{1,1}\), \(w_{2,1}\), and \(w_{3,1}\) thereby removing the output \(y_1\), then we also need to adjust the layer immediately following the one pictured. This is because the subsequent layer originally expected three inputs (\(y_1\), \(y_2\), and \(y_3\)) but is now only receiving two inputs. The result will be that a few weights, which were applied solely to \(y_1\), are no longer needed and can (must) also be removed.
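The following is a hedged sketch of the bookkeeping described in the technical note, using two fully connected layers shaped like the example in Figure 1. Removing the output \(y_1\) means dropping one row of the first layer’s weight matrix (and the matching bias entry), and dropping the corresponding column of the next layer’s weight matrix. The layer sizes and the choice of which output to prune are purely illustrative.

    import torch
    import torch.nn as nn

    layer1 = nn.Linear(3, 3)   # produces y_1, y_2, y_3
    layer2 = nn.Linear(3, 2)   # consumes y_1, y_2, y_3

    keep = [1, 2]              # suppose we prune output index 0 (y_1)

    pruned1 = nn.Linear(3, len(keep))
    pruned1.weight.data = layer1.weight.data[keep].clone()     # drop the row of weights that produced y_1
    pruned1.bias.data = layer1.bias.data[keep].clone()

    pruned2 = nn.Linear(len(keep), 2)
    pruned2.weight.data = layer2.weight.data[:, keep].clone()  # drop the weights that were applied solely to y_1
    pruned2.bias.data = layer2.bias.data.clone()

    x = torch.randn(4, 3)
    out = pruned2(pruned1(x))  # still works, with fewer weights and less computation
    print(out.shape)           # torch.Size([4, 2])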

What are ResNets?

Figure 2: A high-level view of ResNet18, which comprises 8 residual blocks.

ResNet models are deep neural networks that are composed of a sequence of residual blocks. Figure 2 shows a high-level view of ResNet18, which is composed of 8 residual blocks. In a traditional convolutional neural network, the data flows along a single forward path, from one layer to the next. The residual blocks, which form the basis of ResNets, create skip connections allowing input data to bypass (or skip) some of the computation that would otherwise be done in the block, effectively creating two forward paths along which the data flows. One purpose of these skip connections is to shorten the path from input to output (or, more specifically, from output to input during the backward pass) in an attempt to mitigate the vanishing gradient problem (see the technical note below for all the gory details). As a result, ResNets maintain the expressive power of deep neural networks without suffering (as badly as other deep networks) from vanishing gradients.

Figure 3: The computational path through this residual block preserves the size and shape of the input tensor, allowing for the final point-wise addition of the block’s input tensor with the block’s output tensor.

Technical Note (vanishing gradient): The vanishing gradient problem refers to a phenomenon where updates to weights (especially in early layers of a very deep network) become infinitesimally small, preventing those layers from learning meaningful features. To illustrate the cause of the vanishing gradient, and how skip connections lessen the risk of it occurring, consider a traditional neural network with 5 layers: \(f_1, f_2, f_3, f_4,\) and \(f_5\) where for any input \(x_0\) the output is:

\[y=f_5(f_4(f_3(f_2(f_1(x_0))))).\]

For notational simplicity, let’s denote \(y_i=f_i(f_{i-1}(…f_1(x_0)))\) so that \(y=y_5=f_5(y_4),\) etc. And, to abuse notation slightly, we’ll use \(f_i\) to denote both the function representing the \(i^{\text{th}}\) layer and the learned weights (or parameters) contained in that layer. Weight updates at each layer have the form:

\[f_i\leftarrow f_i - \lambda \frac{\partial L}{\partial f_i}\]

where the partial derivative (gradient) is evaluated at that layer’s input, \(y_{i-1}\), and \(L\) represents whatever loss function is being optimized. As we update the weights in every layer, we compute the gradient for each layer (recycling any computation we can to avoid doing it twice):

\[\frac{\partial L}{\partial f_5} \rightarrow \text{directly computed}\]
\[\frac{\partial L}{\partial f_4} = \frac{\partial L}{\partial f_5}\cdot \frac{\partial f_5}{\partial f_4} \rightarrow \text{(recycled) (computed)}\]
\[\frac{\partial L}{\partial f_3} = \frac{\partial L}{\partial f_5}\cdot\frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\]
\[\frac{\partial L}{\partial f_2} = \frac{\partial L}{\partial f_5}\cdot\frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\cdot\frac{\partial f_3}{\partial f_2}\]
\[\frac{\partial L}{\partial f_1} = \frac{\partial L}{\partial f_5}\cdot\frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\cdot\frac{\partial f_3}{\partial f_2}\cdot\frac{\partial f_2}{\partial f_1}\]

If each of the terms \(\frac{\partial f_i}{\partial f_{i-1}}\) is small (e.g. smaller in absolute value than 1) then the product of many such terms quickly becomes vanishingly small and, as a result, the weights of the first layer(s) are essentially static. If we insert a skip connection between layers 1 and 5 (so that the input to layer 5 consists of the output of layer 4 added to the output of layer 1) then our network description is altered slightly (and so are the derivatives). The network becomes:

\[y=f_5(f_1(x_0) + f_4(f_3(f_2(f_1(x_0))))).\]

And the gradients (partial derivatives) become:

\[\frac{\partial L}{\partial f_5} \rightarrow \text{ (unchanged)}\]
\[\frac{\partial L}{\partial f_4} = \frac{\partial L}{\partial f_5}\cdot \frac{\partial f_5}{\partial f_4}\rightarrow \text{ (unchanged)}\]
\[\frac{\partial L}{\partial f_3} = \frac{\partial L}{\partial f_5}\cdot\frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\rightarrow \text{ (unchanged)}\]
\[\frac{\partial L}{\partial f_2} = \frac{\partial L}{\partial f_5}\cdot\frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\cdot\frac{\partial f_3}{\partial f_2}\rightarrow \text{ (unchanged)}\]
\[\frac{\partial L}{\partial f_1} = \frac{\partial L}{\partial f_5}\left(\frac{\partial f_5}{\partial f_1} + \frac{\partial f_5}{\partial f_4}\cdot\frac{\partial f_4}{\partial f_3}\cdot\frac{\partial f_3}{\partial f_2}\cdot\frac{\partial f_2}{\partial f_1}\right)\]

In this case, most of the updates are the same.
However, for layer 1, there are two terms (reflecting the two paths the data follows to the network’s output). The first term involves a product of only two factors (so the vanishing gradient is not an issue), while the second term is the same product we saw previously (which may still suffer from the vanishing gradient). The heuristic to keep in mind is that the number of steps required to get from a layer’s input to the network’s output is the number of partial derivatives that are multiplied together when making the weight update. If that number of steps is very high, then it is easily possible for the gradient at that layer to vanish. By inserting skip connections, one keeps the (minimal) number of steps required to reach the output relatively small.
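The heuristic above can be seen with a few lines of arithmetic. The per-layer derivative magnitude of 0.5 and the depth of 20 below are made-up numbers chosen only to show the effect: the long path’s contribution shrinks geometrically with depth, while the short path introduced by a skip connection does not.

    depth = 20
    per_layer_factor = 0.5   # assumed magnitude of each partial derivative d f_i / d f_{i-1}

    long_path = per_layer_factor ** (depth - 1)   # product along the full chain of layers
    short_path = per_layer_factor                 # a single step through the skip connection

    print(f"gradient via the long path only: {long_path:.2e}")               # ~1.9e-06, effectively vanished
    print(f"gradient with a skip connection: {short_path + long_path:.2e}")  # dominated by the short path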

Figure 3 shows the structure of a basic, shape-preserving residual block. In this case, the computation within the block creates an output tensor which has the same shape and size as the block’s input tensor. This allows for the coordinate-wise addition of the input tensor with the output tensor. Note that there are several types of residual blocks, and their internal structure varies somewhat; the blocks illustrated here are the simplest such blocks and are referred to as Basic Blocks. More complex blocks may contain additional Convolution-BatchNorm-ReLU groups before applying the point-wise addition. The important feature that all residual blocks share is the point-wise addition of an output tensor with the block’s input tensor.
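As a concrete reference, here is a minimal sketch of a shape-preserving Basic Block in PyTorch. It follows the Convolution-BatchNorm-ReLU ordering described above; it is illustrative rather than the exact torchvision implementation, and the channel count and input size are arbitrary.

    import torch
    import torch.nn as nn

    class BasicBlock(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out = out + x              # point-wise addition with the block's input (the skip connection)
            return self.relu(out)

    x = torch.randn(1, 64, 56, 56)
    print(BasicBlock(64)(x).shape)     # torch.Size([1, 64, 56, 56]) -- shape preserved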

Sometimes the computation within a residual block alters the shape or size of the tensor moving through the block (see Figure 4). This may happen, for example, if the number of learned kernels in the final convolution layer (the number of output channels) isn’t the same as the number of input channels to the first convolution layer; in this case, the height and width of the output tensor may match the height and width of the input tensor, but the depths of the two tensors differ. Additionally, if any convolution layer within the block has a stride other than 1, then the output tensor’s height and width will be smaller than the input tensor’s height and width. In order to make the final point-wise addition feasible, something must be done to reshape either the input tensor, the output tensor, or both.

Figure 4: Sometimes the computational path through a residual block alters the size and shape of data passing through the block. In these cases, without modification, the block’s input tensor cannot be added point-wise to the output tensor.

The solution which is typically employed by ResNet architectures is to insert, wherever necessary, a lightweight convolution layer (1x1 convolution) with a BatchNorm chaser (see Figure 5). Doing this allows for the needed flexibility to reshape the input and maintain the feasibility of the final pointwise addition.

Figure 5: By inserting a relatively lightweight resampling function (a 1x1 convolution), the network can reshape the input tensor to match the output tensor’s shape, making the point-wise addition possible.
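The sketch below extends the Basic Block from earlier with the fix shown in Figure 5: when the block changes the number of channels or the spatial size (stride greater than 1), a 1x1 convolution followed by a BatchNorm on the skip path reshapes the input so the point-wise addition remains valid. Names and sizes are illustrative, not the exact torchvision code.

    import torch
    import torch.nn as nn

    class DownsamplingBlock(nn.Module):
        def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_channels)
            self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_channels)
            self.relu = nn.ReLU(inplace=True)
            # The lightweight "through" path: 1x1 convolution (with matching stride) plus BatchNorm.
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out = out + self.downsample(x)   # shapes now match, so the point-wise addition is valid
            return self.relu(out)

    x = torch.randn(1, 64, 56, 56)
    print(DownsamplingBlock(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28])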

Preparing to Train and Prune ResNets

To prune a convolutional neural network one selects (via any number of criteria) convolutional filters/kernels to remove. Since each convolutional filter/kernel produces a channel (or slice) of the layer’s output tensor, removing a kernel has the immediate effect of reducing the number of channels in the output tensor. Thus, if we remove a kernel from the second convolutional layer in a residual block that originally preserved shapes (e.g. as in Figure 3), we suddenly find ourselves in a position where shapes have been altered (e.g. as in Figure 4). To prepare for these inevitable shape changes, we ensure that every residual block contains a 1x1 convolution layer (as shown in Figure 5) and, optionally, the batch norm layer depicted there. We use that convolutional “through” layer to resize the block’s input to match the shape of the block’s output; as we prune the block’s internal layers, we adapt the through layer so that the shapes always match. Pruning the first convolutional layer in any residual block is significantly easier. When we remove a kernel from the first convolutional layer, the only other layers affected are the subsequent batch norm layer (which maintains a per-channel mean and variance) and the subsequent convolution layer (whose kernels expect a certain number of input channels).
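Here is a hedged sketch of the simpler case just described, pruning a kernel from the first convolution of a block: the convolution loses one kernel (one output channel), the following batch norm drops the matching per-channel statistics, and the next convolution drops the matching input channel from each of its kernels. The layer sizes and the index of the pruned kernel are illustrative.

    import torch
    import torch.nn as nn

    conv1 = nn.Conv2d(64, 64, 3, padding=1, bias=False)
    bn1 = nn.BatchNorm2d(64)
    conv2 = nn.Conv2d(64, 64, 3, padding=1, bias=False)

    keep = [c for c in range(64) if c != 7]            # suppose kernel 7 of conv1 is pruned

    new_conv1 = nn.Conv2d(64, len(keep), 3, padding=1, bias=False)
    new_conv1.weight.data = conv1.weight.data[keep].clone()      # drop that kernel

    new_bn1 = nn.BatchNorm2d(len(keep))                          # per-channel mean/variance shrink too
    new_bn1.weight.data = bn1.weight.data[keep].clone()
    new_bn1.bias.data = bn1.bias.data[keep].clone()
    new_bn1.running_mean = bn1.running_mean[keep].clone()
    new_bn1.running_var = bn1.running_var[keep].clone()

    new_conv2 = nn.Conv2d(len(keep), 64, 3, padding=1, bias=False)
    new_conv2.weight.data = conv2.weight.data[:, keep].clone()   # drop the matching input channels

    x = torch.randn(1, 64, 56, 56)
    print(new_conv2(torch.relu(new_bn1(new_conv1(x)))).shape)    # torch.Size([1, 64, 56, 56])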

Ideally, when pruning a neural network, the kernels selected for removal are those which have no impact on the model’s ability to make predictions. In practice, however, those kernels usually have a small (rather than zero) impact on model performance, and pruning many kernels simultaneously can have a noticeably larger impact. As a result, after pruning a model, it is important to allow for a modest amount of additional training, often called fine-tuning. Compared to the cost of training a full model to convergence, the cost of fine-tuning a pruned model is very small. Viewed as a whole, the training and pruning process follows this pattern (sketched in code after the list):

  1. Train the full model (as usual, typically long).
  2. Prune a modest amount of the model.
  3. Resume training (pruned model, typically short).
  4. If the pruned model recovers sufficient accuracy, repeat from step 2.
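A high-level sketch of that loop appears below. The helpers train, prune_smallest_kernels, and validate are hypothetical placeholders for whatever training, pruning, and validation code you use; the step counts and pruning fraction mirror the experiment described later in this article.

    # Hypothetical helpers (not defined here): train(model, steps),
    # prune_smallest_kernels(model, fraction), and validate(model) -> accuracy.
    def train_and_prune(model, target_accuracy, prune_fraction=0.05,
                        full_train_steps=40_000, finetune_steps=8_000):
        train(model, steps=full_train_steps)                           # 1. train the full model (long)
        while True:
            candidate = prune_smallest_kernels(model, prune_fraction)  # 2. prune a modest amount (~5%)
            train(candidate, steps=finetune_steps)                     # 3. resume training (short fine-tune)
            if validate(candidate) < target_accuracy:                  # 4. repeat while accuracy recovers...
                return model                                           #    ...otherwise keep the last good model
            model = candidate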

Why not just train a pre-pruned network?

At first glance, this process (train a full model, prune, fine-tune, repeat) may seem unnecessarily complicated. Why not just start with the smaller model and train it from scratch?

The first challenge one encounters is simple: how do you know what starting size the model should be? If one initially chooses a model which is too small, the accuracy may simply never be good enough and one would have to retrain with a larger network (again becoming an iterative process which doesn’t simplify the original one). If the initial model is still too large, one could again prune (again becoming iterative). In practice, and for reasons which are not fully understood, training a smaller model from scratch tends to produce worse models than training a very large model and subsequently pruning it. In other words, if we applied the above pruning process to find the right starting size, then randomly initialized a model of that size and trained it from scratch, it would be very unlikely that we would achieve comparable accuracy. One conjecture as to the source of this problem is the challenge of getting good weight initializations.

In [3] the authors refer to this as the lottery ticket hypothesis. Roughly, the idea is that by starting with a large (fat) model, i.e. one with many kernels in each layer, one effectively buys many lottery tickets (each kernel being a lottery ticket). Before training starts, it is impossible to know which will be the winners, but once the winners have been determined, the rest are no longer needed and can be discarded. Trying to train a small model from scratch is then akin to trying to win the lottery by purchasing only a minimal number of tickets. The authors showed that if they stored the original weights, trained a model to convergence, determined which kernels were winners, pruned the model (to only include the winners), reset the winners to their original weights, and re-trained, they could recover the highly accurate model. However, their result is not conclusive, as it was later shown to not always be effective, and in practical terms it wasn’t very useful (because learning which tickets were winners required training the full model). So, the short answer as to why we follow this iterative approach is simple: in practice, not only does it work, it works better than any other strategy currently known.

How to Select Kernels to Prune

There is a wide body of research around how to select which kernels to prune from any given neural network model; in this section we describe very briefly some of that research. Pruning strategies roughly fall into two categories: data-agnostic pruning and data-aware pruning.

One of the earliest proposed pruning strategies [5] involved locally approximating the loss function and estimating the impact on the loss if any particular neuron were to be pruned, then pruning the neurons with the smallest impact. A more recent data-aware approach [2] was initially proposed as a means to explain neural network predictions: convolutional kernels were removed one-by-one and the impact on the predictive ability for each class was measured. The hope was that by identifying a small number of kernels essential to accurately predicting a class, one could explain what it was about those kernels that made the prediction possible. While the method did not lead to easy explanations, a novel side-effect was that pruning could be done aggressively and in a single shot by keeping only the few kernels that were most important for each class, e.g. at each layer one might keep the three most important (critical) kernels for each class (frequently with a lot of overlap, since kernels were often critical for more than one class). With a small amount of fine-tuning the models were shown to recover most of their accuracy. The real limitation of this approach, however, was the computational expense of determining which filters were critical: the model has to be re-validated once for each kernel in the network (removing them one at a time and validating), and for deep models with many, many kernels (and for large datasets) this computation may be prohibitively expensive.

Data-agnostic pruning strategies are very popular because they tend to be sufficiently effective and relatively inexpensive. We previously hinted that if the sum of the absolute values of a kernel’s weights is small, then that kernel produces an output that is almost always close to 0 and is therefore not needed. This is indeed a common pruning strategy, and there are (at least) two ways to measure “how small” a kernel’s weights are; pruning is accomplished either by removing all the kernels that are smaller than a threshold, or by removing the smallest kernels regardless of their nominal size. The first way to “size” a kernel is to compute its max norm, also called the \(\ell_\infty\) norm, or simply the largest absolute value of any weight in the kernel. This strategy appears in several papers including [6]; it is easy to implement and typically quite effective. A second method for measuring the size of a kernel is to compute its \(\ell_1\) norm, or the sum of the absolute values of its weights. Again, after computing the sizes of the kernels, one either prunes kernels whose size falls below a threshold or just the smallest few kernels in the network. This strategy is (obviously) very similar to the first and is also frequently used, e.g. in [4]. Less common are data-agnostic strategies which do not rely on kernel size. One such strategy [1] revolves around identifying and removing redundant kernels: redundancy is identified by computing the cosine similarity of the outputs of pairs of kernels given random inputs. If the cosine similarity is high, then for many inputs the outputs of the two kernels are nearly identical, and therefore both kernels are not needed. Surprisingly, or maybe not, the work of [7] demonstrated that simply randomly pruning kernels is comparably effective to any of these data-agnostic strategies, a testament to how plastic (adaptable) these neural network models are.
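As a minimal sketch of the size-based, data-agnostic criteria just described, the snippet below sizes every kernel of a single convolution layer by its \(\ell_1\) norm (the \(\ell_\infty\) variant is shown alongside) and marks the smallest 5% as pruning candidates. The layer dimensions and pruning fraction are arbitrary.

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

    # One "size" per output kernel: sum of absolute weights (L1) or max absolute weight (L-infinity).
    l1_sizes = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    linf_sizes = conv.weight.detach().abs().amax(dim=(1, 2, 3))

    prune_fraction = 0.05
    num_to_prune = int(prune_fraction * conv.out_channels)
    _, candidates = torch.topk(l1_sizes, k=num_to_prune, largest=False)  # indices of the smallest kernels
    print(sorted(candidates.tolist()))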

So, how should you choose which kernels to prune? In light of [7] one might be tempted to simply prune randomly; however, the real take-away is that all of these strategies, at least on standard benchmark datasets, perform their desired function. In all these cases, model sizes were dramatically shrunk with minimal loss in accuracy. So, in reality, the answer is: choose whichever method works for your data and your model, keeping in mind that it is a good idea to be flexible.

Real Impact of Pruning

Neural network pruning has (at least) two impacts.

First, a pruned model is necessarily a subset of its un-pruned parent. This means that the pruned model has a strict subset of the weights of its parent and therefore less expressive capacity. As an analogy, recall the function \(f(x)=mx+b\) for simple linear regression. If we prune one of the parameters, for example the intercept, we are left with \(g(x)= mx\). Anything expressible by \(g(x)\) is equally expressible by \(f(x)\) (setting \(b=0\)). But, there are some things which can be expressed by \(f(x)\) that cannot be expressed by \(g(x)\). While a pruned model must be less expressive than its parent, it is not necessarily the case that a pruned model must be less accurate than its parent. In our analogy, for example, if the data we were representing naturally had an intercept of 0, then both models would be equally accurate. In practice it is hard (maybe impossible) to know how accurate a pruned model can (or will) be until it is pruned and tested.

Second, since a pruned model contains a strict subset of the weights of its un-pruned parent, less computation is required to compute the pruned network’s output than its parent’s. As a result, the inference speed of a pruned model is at least as fast as (and usually faster than) that of its parent. There are many subtle factors which impact how much (if any) speed-up will be seen, including: how much of the network was pruned (the more that is pruned, the more likely a larger speed-up becomes), how well the network’s computation is parallelized (pruning can affect parallelization efficiency positively or negatively), I/O speeds (if it takes longer to load data than to do the computation, then doing less computation will not produce a speed-up because it doesn’t affect the I/O speed), etc.
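One practical way to check for a speed-up is to measure throughput directly. The sketch below times how many images per second a model processes on synthetic batches; full_model and pruned_model are placeholders for the models being compared, and the batch size, image size, and iteration counts are arbitrary (on a GPU you would also synchronize the device before reading the clock).

    import time
    import torch

    @torch.no_grad()
    def images_per_second(model, batch_size=64, iters=50, device="cpu"):
        model = model.to(device).eval()
        x = torch.randn(batch_size, 3, 224, 224, device=device)
        for _ in range(5):                 # warm-up iterations (not timed)
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
        return batch_size * iters / elapsed

    # Example usage (placeholders):
    # speedup = images_per_second(pruned_model) / images_per_second(full_model)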

In the following sections, we demonstrate the real effects of training and iteratively pruning a ResNet18 model on a dataset containing images of vehicles, the task being to identify the predominant color of each vehicle.

Minimal Loss of Model Accuracy

First, we examine the impacts on the accuracy of our model as we train and iteratively prune.

Figure 6 shows the validation accuracy over time as we perform this iterative procedure. The full model is updated through 40,000 iterations (batches of data), achieving an accuracy of just over 95% on the validation dataset. We then prune approximately 5% of all the convolutional kernels in the model and continue training. After the initial pruning little or no accuracy is lost, and the model continues to learn. We continue to prune approximately 5% of the model (removing approximately 5% of the convolutional kernels) and immediately fine-tune the model for 8,000 batches. Note that, after the initial pruning operations, any lost accuracy is recovered in significantly fewer than 8,000 iterations, indicating that we could have achieved similar results with less fine-tuning. By the final few pruning iterations, the initial drop in accuracy becomes more pronounced, although in every case the model is able to recover to at least 95% accuracy.

Figure 6: The validation accuracy of a ResNet18 model as it is trained. The first 40,000 batches train the full model. Then, pruning iterations begin. Each pruning iteration consists of removing approximately 5% of the convolutional kernels from the existing ResNet18 model and then fine-tuning (re-training) for 8,000 additional batches.

Reduction in Model Size

In Figure 6 we estimated the percentage of convolutional kernels removed from the model. Note, this estimate is rough for two reasons: first, we can only remove an integer number of kernels (so we almost never remove precisely 5% of the model), and second, we are actually attempting to remove 5% of the current model at each step, so as the model shrinks, 5% of the current model is a decreasing fraction of the original model. But removing entire kernels is not the only impact on the model. As previously discussed, when we remove some kernels, the remaining kernels in subsequent layers shrink (to accept inputs with fewer channels). In Figure 7 we illustrate the relative impact on model size as the pruning occurs. Model size is measured by examining the memory (disk space) required to store the model (its learned parameters). Since the exact model size is less important, we show here the size of pruned models relative to the full model (a ratio of pruned model to full model). This demonstrates that we can remove almost 80% (79.1%) of the model’s learned parameters without negatively impacting predictive performance.

Figure 7: The relative model size as the model is initially trained (full model) and then iteratively pruned. Note, the size referred to here is the relative size of the learned weights and is directly proportional to the number of learned weights in the pruned model. Specifically, the final model has 20.9% of the learned parameters of the full model; in other words, almost 80% of the learned weights were removed (pruned) from the full model.
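For reference, the two size measurements used here (parameter count and the disk space needed to store the learned weights) can be computed with a few lines; the model variables below are placeholders.

    import os
    import tempfile
    import torch

    def num_parameters(model):
        # Total number of learned weights in the model.
        return sum(p.numel() for p in model.parameters())

    def size_on_disk_bytes(model):
        # Size of the saved state_dict, i.e. the disk space needed to store the learned weights.
        with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
            path = f.name
        torch.save(model.state_dict(), path)
        size = os.path.getsize(path)
        os.remove(path)
        return size

    # Example usage (placeholders):
    # relative_size = num_parameters(pruned_model) / num_parameters(full_model)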

Improvement in Inference Time

While the direct impacts of a smaller model size may be less obvious (e.g., a smaller memory footprint requiring less power), pruning a model also has a positive effect on the speed at which the model can make predictions. Figure 8 illustrates how much additional throughput the pruned model can produce: specifically, our model, which has been pruned by almost 80% with no loss in accuracy, can make almost 65% more predictions per second than the full model.

Figure 8: As the model shrinks, inference time improves and more images can be processed in a fixed amount of time. With nearly 80% of the model pruned (and no noticeable loss in validation accuracy), we observe almost a 65% increase in the number of images that can be processed per second.

Conclusion

Pruned models are (almost always) better, in about every way that actually matters, than their full, over-parameterized counterparts. Pruned models are less likely to overfit data; pruned models (may) maintain all or most of the accuracy of the full model; and pruned models require less memory, use less energy, and run inference faster.

References

  1. Babajide O. Ayinde and Jacek M. Zurada. 2018. Building efficient convnets using redundant feature pruning. arXiv preprint arXiv:1802.07653 (2018).
  2. Mihaela Dimovska and Travis Johnston. 2019. A Novel Pruning Method for Convolutional Neural Networks Based off Identifying Critical Filters. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (PEARC19). Association for Computing Machinery, New York, NY, USA, Article 63, 1–7.
  3. Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. International Conference on Learning Representations (ICLR), Available: https://arxiv.org/abs/1803.03635.
  4. Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250 (2016).
  5. Yann Le Cun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. Advances in neural information processing systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 598–605.
  6. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
  7. Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. 2018. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 848–857.


Travis Johnston

Travis Johnston is a Senior Data Scientist at Striveworks. He holds a PhD in Mathematics from the University of South Carolina. Before joining Striveworks in the beginning of 2021, he was a Postdoctoral Researcher at the University of Delaware, and a Staff Research Scientist at Oak Ridge National Lab. He is the author/co-author of many scientific publications in mathematics, machine learning, and high performance computing and has mentored many undergraduate and graduate students as well as several postdocs.