Shivam, regarding conv2d: say we have a 5x5 image with 3 channels, i.e. the image is 5x5x3, and you have 2 kernels, each of size 3x3x3. The first kernel is convolved with each of the three RGB channels, and the per-channel results are added together. At a single position, the 3x3x3 patch under the kernel therefore gives you one scalar; sliding the kernel over the image (stride 1, no padding) gives 3 positions along each axis, so you get a 3x3 feature map. Then you do the same thing with the 2nd kernel. Now you have two 3x3 feature maps, and you stack them together. So the result of applying the 2 kernels to the 5x5x3 image is a 3x3x2 feature map.
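
If it helps, here is a minimal NumPy sketch of the arithmetic above, assuming stride 1 and no padding. The function name conv2d_valid, the channels-last layout, and the example values are made up for illustration; this is not how any particular library implements conv2d, and real libraries may order the dimensions differently.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Naive multi-channel convolution (hypothetical helper, illustration only).

    image:   array of shape (H, W, C)
    kernels: array of shape (K, kh, kw, C), i.e. K kernels that each span all C channels
    Assumes stride 1 and no padding. Like most deep learning libraries,
    it does not flip the kernel (strictly speaking, cross-correlation).
    """
    H, W, C = image.shape
    K, kh, kw, kc = kernels.shape
    assert kc == C, "each kernel must have as many channels as the image"

    out_h, out_w = H - kh + 1, W - kw + 1   # 5 - 3 + 1 = 3 in the example
    out = np.zeros((out_h, out_w, K))

    for k in range(K):                       # one feature map per kernel
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + kh, j:j + kw, :]      # 3x3x3 patch
                # Multiply elementwise and sum over height, width and
                # channels: each patch collapses to a single scalar.
                out[i, j, k] = np.sum(patch * kernels[k])
    return out

# Example values are arbitrary, chosen only to make the shapes visible.
image = np.arange(5 * 5 * 3, dtype=float).reshape(5, 5, 3)   # 5x5x3 image
kernels = np.ones((2, 3, 3, 3))                              # 2 kernels, 3x3x3 each
kernels[1] *= 2.0                                            # make the 2nd kernel differ

feature_maps = conv2d_valid(image, kernels)
print(feature_maps.shape)  # (3, 3, 2)
```

The last axis of the output indexes the kernels, so feature_maps[:, :, 0] is the 3x3 map from the first kernel and feature_maps[:, :, 1] is the one from the second.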