I am confused by these two structures. In theory, the outputs of both are connected to their inputs. What magic makes the self-attention mechanism more powerful than a fully connected layer?
1 Answer
Ignoring details like normalization, biases, and so on, fully connected networks have fixed weights:
f(x) = σ(Wx)
where W is learned in training and fixed at inference.
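As a rough illustration, here is a minimal NumPy sketch (the dimension d, the tanh nonlinearity standing in for σ, and the name fc are my own illustrative choices, not from the answer):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                                # feature dimension (arbitrary for the sketch)
W = rng.standard_normal((d, d))      # learned during training, then frozen

def fc(x):
    # Fully connected layer: the same W multiplies every input x.
    return np.tanh(W @ x)            # tanh stands in for the nonlinearity σ

x = rng.standard_normal(d)
print(fc(x).shape)                   # (4,)
```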
Self-attention layers are dynamic, changing the weight as it goes:
attn(x) = σ(Wx)
f(x) = σ(attn(x) * x)
Again, this is ignoring a lot of details; there are many different implementations for different applications, so you should really check a paper for the specifics.
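For contrast, here is a minimal NumPy sketch of a single-head self-attention layer in the same spirit (the separate query/key/value projections, the softmax, and the 1/sqrt(d) scaling follow the standard Transformer formulation rather than the simplified equations above, and all names and dimensions are illustrative). The point is that the mixing matrix A is recomputed from the input X on every forward pass, whereas W above never changes:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4          # feature dimension
n = 3          # number of tokens in the sequence
Wq = rng.standard_normal((d, d))   # learned projections, fixed at inference
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (n, d) sequence of token vectors
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    # A is recomputed for every input: these are the "dynamic" weights.
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (n, n)
    return A @ V

X = rng.standard_normal((n, d))
print(self_attention(X).shape)   # (3, 4)
```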
- i.e. f(x) = σ(σ(Wx) * x) in self-attention. Anyway, f(x) is a function of x, so theoretically speaking, multiple FC layers are able to simulate the same behavior as an attention layer. – tom_cat, Oct 6, 2020 at 3:47
- @tom_cat Theoretically speaking, multiple FC layers can simulate any function. – Oct 6, 2020 at 3:50
- Is it right to say that, to some extent, attention is a special type of FC whose weights are dynamically and indirectly determined by some other weights? @hkchengrex – tom_cat, Oct 6, 2020 at 5:50
- @tom_cat It is a matter of interpretation, but I wouldn't say that. I would say both FC and self-attention are cases of "connections" whose weights are determined by either a fixed or an input-dependent scheme. – Oct 6, 2020 at 5:54
- @hkchengrex Could you please explain what you mean by "dynamic, changing the weight as it goes" in this context? – Mar 4 at 14:17