
I am confused by these two structures. In theory, the outputs of both are connected to their inputs, so what magic makes the 'self-attention mechanism' more powerful than a fully-connected layer?

1 Answer


Ignoring details like normalization, biases, and such, a fully connected layer has fixed weights:

f(x) = Wx

where W is learned during training and fixed at inference.
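
As a concrete (if trivial) sketch of what "fixed weights" means, assuming a plain linear layer with no bias, in NumPy (the names here are just placeholders):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # "learned" during training, then frozen

def fully_connected(x):
    # The same fixed matrix W mixes every input it ever sees.
    return W @ x

x = rng.standard_normal(4)
print(fully_connected(x))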

Self-attention layers are dynamic, changing the weight as it goes:

attn(x) = Wx
f(x) = attn(x) * x

Again, this ignores a lot of details, and there are many different implementations for different applications, so you should really check a paper for the specifics.
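
To make the contrast concrete, here is a rough, minimal sketch of single-head scaled dot-product self-attention in NumPy. The query/key/value projections and the softmax are standard ingredients that the simplified notation above leaves out, and the function and variable names are my own placeholders:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d) matrix of token vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # The mixing weights A are recomputed from the input itself on every
    # forward pass -- this is the "dynamic" part.
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
d = 4
X = rng.standard_normal((3, d))                        # 3 tokens of dimension 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (3, 4)

Note that Wq, Wk, Wv are fixed after training, just like W in the fully connected case, but the weights A that actually mix the tokens are recomputed from each input.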

  • i.e. f(x) = (Wx) * x in self-attention. Either way, f(x) is a function of x, so theoretically speaking, multiple FC layers should be able to simulate the same behaviour as attention.
    – tom_cat
    Oct 6, 2020 at 3:47
  • @tom_cat Theoretically speaking, multiple FC layers can simulate any function.
    – hkchengrex
    Oct 6, 2020 at 3:50
  • Is it right to say that, to some extent, attention is a special type of FC whose weights are dynamically and indirectly determined by some other weights? @hkchengrex
    – tom_cat
    Oct 6, 2020 at 5:50
  • @tom_cat It is a matter of interpretation, but I wouldn't say that. I would say both FC and self-attention are cases of "connections" with the weights determined by either a fixed or an input-dependent scheme.
    – hkchengrex
    Oct 6, 2020 at 5:54
  • @hkchengrex Could you please explain what you mean by "dynamic, changing the weight as it goes" in this context?
    Mar 4 at 14:17
