FAQs

February 11, 2023

Q: If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?

A: This is wrong. Let us consider a very simple case,

$$y = wx + b$$

and we have

$$\partial y/\partial w = x, \quad \partial y/\partial x = w, \quad \partial y/\partial b = 1$$

and if $w = 0$ and $x \neq 0$, then

$$\partial y/\partial w \neq 0, \quad \partial y/\partial x = 0, \quad \partial y/\partial b \neq 0$$

which means that as long as $x \neq 0$, one gradient descent iteration will make $w$ non-zero. Then

$$\partial y/\partial x \neq 0$$

so the zero convolutions progressively become ordinary conv layers with non-zero weights.
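
The argument can be checked numerically with a minimal sketch in plain Python. It treats a single scalar weight as a stand-in for a zero-initialized conv layer. The squared-error loss, the learning rate, the input and target values, and the helper name `train_zero_init` are all illustrative assumptions and not part of any library. The weight also only moves if the upstream gradient of the loss with respect to $y$ is non-zero, which holds here because the initial output differs from the target.

```python
# Sketch: a scalar "zero conv" y = w*x + b trained with plain gradient descent.
# Loss, learning rate, and data values are assumptions chosen for illustration.

def train_zero_init(x=2.0, target=1.0, lr=0.05, steps=3):
    w, b = 0.0, 0.0  # zero initialization, as in a zero convolution
    history = []
    for _ in range(steps):
        y = w * x + b
        dL_dy = 2.0 * (y - target)  # L = (y - target)^2
        grad_w = dL_dy * x          # dy/dw = x, non-zero when x != 0
        grad_b = dL_dy              # dy/db = 1
        grad_x = dL_dy * w          # dy/dx = w, zero while w == 0
        history.append((w, grad_w, grad_x))
        w -= lr * grad_w            # first update makes w non-zero
        b -= lr * grad_b
    return history

history = train_zero_init()
for step, (w, grad_w, grad_x) in enumerate(history):
    print(f"step {step}: w={w}, dL/dw={grad_w}, dL/dx={grad_x}")
```

At step 0 the weight and the gradient with respect to $x$ are both zero while the gradient with respect to $w$ is not. From step 1 onward, $w$ is non-zero and gradient flows back through $x$ as well.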