Understanding Compression of Convolutional Neural Nets: Part 5
Convolution Layer Compression using CP Decomposition
This is the final post in the series on compression of convolutional neural networks. In the previous post, Part 4, I introduced the CP decomposition method for factorizing a tensor into a linear combination of rank-one tensors. In this post, I will show how it can be applied to compress convolution layers.
Compression Example
I will use a convolution layer specified as Conv2d(3, 2, kernel_size=(5, 5), bias=False). This layer has three input channels and two output channels. Corresponding to each output channel, there is a three-dimensional block of weights consisting of one 5x5 mask per input channel, so the full weight tensor has shape (2, 3, 5, 5). In our example, we will take the following tensor as the weight tensor of our convolution layer.
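(The specific weight values appear in the figure in the original post. To follow along with the code, a random array of the right shape can serve as a stand-in; note that with such a stand-in the exact error percentages reported later will differ.)

import numpy as np

# Placeholder for the weight tensor shown in the figure. PyTorch stores Conv2d
# weights as (out_channels, in_channels, kernel_height, kernel_width), so for
# Conv2d(3, 2, kernel_size=(5, 5)) the shape is (2, 3, 5, 5).
wt = np.random.randn(2, 3, 5, 5)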
Now, let's apply the CP decomposition steps to our example layer. We will perform the tensor decomposition using the Python library Tensorly.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
np.random.seed(1234)
np.set_printoptions(precision=2, suppress=True)
W = tl.tensor(wt)  # wt is the weight array shown earlier, shape (2, 3, 5, 5)
weights, factors = parafac(W, rank=2)  # CP decomposition with R = 2 rank-one components
[f.shape for f in factors]
[(2, 2), (3, 2), (5, 2), (5, 2)]
Thus, we have the factor matrices needed to implement the convolution layer through the four steps mentioned above. Let's look at the factor matrices obtained via the CP decomposition; Tensorly gives the following output for factors:
Let's designate the factor matrices as:
first_wt = factors[1]   # shape (3, 2): input-channel factor
vert_wt = factors[2]    # shape (5, 2): vertical (kernel row) factor
horz_wt = factors[3]    # shape (5, 2): horizontal (kernel column) factor
last = factors[0]       # shape (2, 2): output-channel factor
Now we can implement the convolution layer through the above factor matrices as shown below. Note the use of the groups parameter, which applies the vertical and horizontal convolutions to each of the R intermediate channels separately.
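The original post shows this implementation as a code figure; here is a minimal PyTorch sketch of the same four-step idea, assuming the factors are used as follows: a 1x1 convolution built from first_wt maps the 3 input channels to R = 2 intermediate channels, a depthwise 5x1 convolution built from vert_wt and a depthwise 1x5 convolution built from horz_wt filter each intermediate channel spatially (this is where groups comes in), and a final 1x1 convolution built from last produces the 2 output channels. The helper name approx_conv is just for illustration.

import torch
import torch.nn as nn

R, S, T, d = 2, 3, 2, 5   # rank, input channels, output channels, kernel size

# Fold the parafac weights (all ones by default) into the output-channel factor.
last_w = last * weights.reshape(1, -1)

# 1) 1x1 convolution mixing the S input channels into R intermediate maps
conv_first = nn.Conv2d(S, R, kernel_size=1, bias=False)
conv_first.weight.data = torch.tensor(first_wt.T.reshape(R, S, 1, 1), dtype=torch.float32)

# 2) depthwise vertical (5x1) convolution, one filter per intermediate map
conv_vert = nn.Conv2d(R, R, kernel_size=(d, 1), groups=R, bias=False)
conv_vert.weight.data = torch.tensor(vert_wt.T.reshape(R, 1, d, 1), dtype=torch.float32)

# 3) depthwise horizontal (1x5) convolution, one filter per intermediate map
conv_horz = nn.Conv2d(R, R, kernel_size=(1, d), groups=R, bias=False)
conv_horz.weight.data = torch.tensor(horz_wt.T.reshape(R, 1, 1, d), dtype=torch.float32)

# 4) 1x1 convolution mixing the R intermediate maps into the T output channels
conv_last = nn.Conv2d(R, T, kernel_size=1, bias=False)
conv_last.weight.data = torch.tensor(last_w.reshape(T, R, 1, 1), dtype=torch.float32)

def approx_conv(x):
    # x: (batch, S, H, W) -> (batch, T, H - d + 1, W - d + 1)
    return conv_last(conv_horz(conv_vert(conv_first(x))))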
We can now check how well the CP-based approximation of the convolution layer performs. We will do so by creating an arbitrary input tensor, shown below:
Using the above tensor as input, let's calculate the output first by the convolution layer without any approximation, and then by the approximate implementation. The norm of the difference between the two outputs is then calculated and expressed as a percentage of the norm of the unapproximated output. This comes out to 26.2571%, a rather large number. To understand why the error is so large, let's look at the approximated tensor. We can get the approximated tensor by carrying out the outer products between the components of every rank-one tensor and then adding them up, as we did at the very beginning of this post. Tensorly also has a function to do this.
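Here is a sketch of both checks in code, assuming the PyTorch pieces defined above; conv and x are stand-ins for the original layer and the input tensor shown in the post, and cp_to_tensor (called kruskal_to_tensor in older Tensorly releases) performs the reconstruction:

# Stand-ins for the original layer and the input tensor shown in the post.
conv = nn.Conv2d(3, 2, kernel_size=(5, 5), bias=False)
conv.weight.data = torch.tensor(wt, dtype=torch.float32)
x = torch.randn(1, 3, 8, 8)   # any 3-channel input works for this check

with torch.no_grad():
    out_exact = conv(x)           # original convolution
    out_approx = approx_conv(x)   # CP-based four-convolution approximation

rel_err = torch.linalg.norm(out_exact - out_approx) / torch.linalg.norm(out_exact)
print(f"output error: {100 * rel_err.item():.4f}%")   # the post reports 26.2571% for its weights

# Reconstruct the rank-2 approximation of the weight tensor itself.
W_approx = tl.cp_to_tensor((weights, factors))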
You can see that the approximated tensor is quite different from the original tensor. This is because R equal to 2 is not sufficient; perhaps R needs to be higher. In any case, to get an idea of the tensor approximation error, I calculated the norm of the difference between the original tensor values and those of the approximated tensor. After normalizing, this error turns out to be 35.6822%. Thus, the error in the convolution output due to the approximation is in line with the approximation error in the tensor decomposition itself.
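The tensor-level error can be computed in the same way, for example:

# Relative error between the original weight tensor and its rank-2 CP approximation.
tensor_err = tl.norm(W - W_approx) / tl.norm(W)
print(f"tensor approximation error: {100 * tensor_err:.4f}%")   # the post reports 35.6822%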
You might be wondering about the amount of compression achieved in this example. We had 150 weight values defining the convolution masks in the original implementation. With the decomposition, the total number of elements in the factor matrices is 30. Thus we are able to achieve an 80% reduction. Of course, the approximation is too crude. Assuming that R equal to 3 would yield an acceptable approximation, we would then need 45 values to define the different mask elements; that is still a significant reduction. As an example of the compression that can be achieved in a real network, let's look at the first convolution layer of the VGG-19 network. The layer is defined as Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)). Thus, the layer has 64 output channels, 3 input channels, and convolution masks of size 3x3. The layer weight tensor is therefore of size [64, 3, 3, 3], meaning we need 1728 values to specify the masks. It turns out that R equal to 16 yields a fairly good approximation. In that case, the factor matrices are of sizes (64, 16), (3, 16), (3, 16), and (3, 16). The total number of elements in these matrices is 1168, about two-thirds of the number of elements without decomposition. In practice, a suitable value for R has to be found that optimizes the compression ratio while keeping the approximation error at an acceptable level.
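The parameter counts above follow directly from the factor shapes; a small helper (the name cp_parameter_counts is just for illustration) makes the bookkeeping explicit:

def cp_parameter_counts(t_out, s_in, kh, kw, rank):
    # Elements in the original 4-way weight tensor vs. the four CP factor matrices.
    original = t_out * s_in * kh * kw
    factored = (t_out + s_in + kh + kw) * rank
    return original, factored

print(cp_parameter_counts(2, 3, 5, 5, rank=2))     # (150, 30)   -> 80% reduction
print(cp_parameter_counts(2, 3, 5, 5, rank=3))     # (150, 45)
print(cp_parameter_counts(64, 3, 3, 3, rank=16))   # (1728, 1168)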
Before wrapping up this post, let me mention that there is another decomposition method, known as the Tucker decomposition. It is very efficient in terms of approximation quality and allows a different number of components in each mode, which is not possible with the CP decomposition. Similar steps can be followed to reformulate convolution layers using the Tucker decomposition. For those interested in seeing how the Tucker decomposition can be applied, I highly recommend visiting this site.
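As a pointer, Tensorly also implements the Tucker decomposition; a minimal sketch of decomposing the same weight tensor is shown below (the per-mode ranks are only illustrative, not tuned):

from tensorly.decomposition import tucker

# Tucker allows a different rank in each mode; the ranks below are illustrative.
core, tucker_factors = tucker(W, rank=[2, 2, 3, 3])
print(core.shape)                          # (2, 2, 3, 3)
print([f.shape for f in tucker_factors])   # [(2, 2), (3, 2), (5, 3), (5, 3)]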
This series of posts on compression was originally published at https://iksinc.tech.