Questions tagged [attention-model]
Questions about the attention mechanism in deep learning models
396
questions
36
votes
5
answers
37k
views
What is the difference between Luong attention and Bahdanau attention?
These two attention mechanisms are used in seq2seq models. They are introduced as multiplicative and additive attention in this TensorFlow documentation. What is the difference?
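For orientation, a minimal PyTorch sketch of the two score functions; tensor names and sizes are illustrative assumptions, not taken from the question:
import torch
import torch.nn as nn
d = 8                                   # hidden size, illustrative
h_t = torch.randn(1, d)                 # current decoder state
h_s = torch.randn(5, d)                 # encoder states (source length 5)
# Luong ("multiplicative", general form): score = h_t W h_s
W_a = nn.Linear(d, d, bias=False)
luong_score = h_t @ W_a(h_s).T          # shape (1, 5)
# Bahdanau ("additive"): score = v^T tanh(W1 h_t + W2 h_s)
W1, W2 = nn.Linear(d, d), nn.Linear(d, d)
v = nn.Linear(d, 1, bias=False)
bahdanau_score = v(torch.tanh(W1(h_t) + W2(h_s))).T   # shape (1, 5)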
32
votes
3
answers
26k
views
How to understand masked multi-head attention in the Transformer
I'm currently studying the code of the Transformer, but I cannot understand the masked multi-head attention in the decoder. The paper says it is there to prevent you from seeing the word being generated, but I cannot ...
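A minimal sketch of the causal (look-ahead) mask the decoder applies, assuming a PyTorch-style additive mask; the sizes are illustrative:
import torch
seq_len = 5
# Upper-triangular mask: position i may only attend to positions <= i.
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
scores = torch.randn(seq_len, seq_len)           # raw attention scores (made up)
weights = torch.softmax(scores + mask, dim=-1)   # future positions get zero weight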
22
votes
2
answers
2k
views
Attention Layer throwing TypeError: Permute layer does not support masking in Keras
I have been following this post in order to implement an attention layer on top of my LSTM model.
Code for the attention layer:
INPUT_DIM = 2
TIME_STEPS = 20
SINGLE_ATTENTION_VECTOR = False
...
21
votes
2
answers
10k
views
What is the difference between attn_mask and key_padding_mask in MultiheadAttention?
What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention:
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. ...
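A minimal sketch contrasting the two arguments of torch.nn.MultiheadAttention; the shapes follow the documented API, the data is made up:
import torch
import torch.nn as nn
L, N, E = 5, 2, 16                        # target length, batch, embedding dim
mha = nn.MultiheadAttention(embed_dim=E, num_heads=4)
x = torch.randn(L, N, E)                  # default layout is (seq, batch, embed)
# attn_mask (L, S): True marks query/key pairs that may not attend (e.g. causal mask)
attn_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
# key_padding_mask (N, S): True marks padding tokens per batch element
key_padding_mask = torch.tensor([[False, False, False, False, False],
                                 [False, False, False, True, True]])
out, w = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)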
17
votes
3
answers
14k
views
How to build an attention model with Keras?
I am trying to understand attention models and also build one myself. After many searches I came across this website, which had an attention model coded in Keras and also looks simple. But when I tried ...
17
votes
1
answer
14k
views
Adding Attention on top of simple LSTM layer in Tensorflow 2.0
I have a simple network of one LSTM and two Dense layers, as follows:
model = tf.keras.Sequential()
model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(layers.Dense(20, ...
17
votes
2
answers
7k
views
Does attention make sense for Autoencoders?
I am struggling with the concept of attention in the context of autoencoders. I believe I understand the usage of attention with regard to seq2seq translation - after training, the combined ...
16
votes
5
answers
11k
views
RuntimeError: "exp" not implemented for 'torch.LongTensor'
I am following this tutorial: http://nlp.seas.harvard.edu/2018/04/03/attention.html
to implement the Transformer model from the "Attention Is All You Need" paper.
However, I am getting the following ...
16
votes
2
answers
5k
views
Why must embed_dim be divisible by num_heads in MultiheadAttention?
I am learning the Transformer. Here is the PyTorch documentation for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "...
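The constraint comes from splitting the embedding width evenly across heads; a quick sketch of the arithmetic (numbers are illustrative):
embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads         # 64: each head works on one slice of the embedding
assert head_dim * num_heads == embed_dim  # fails whenever embed_dim is not divisible by num_heads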
14
votes
2
answers
5k
views
Why is the embedding vector multiplied by a constant in the Transformer model?
I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.
As the section Positional encoding says:
...
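The constant in question is sqrt(d_model); a minimal TensorFlow sketch of the scaling step, with illustrative variable names that are not taken from the tutorial:
import tensorflow as tf
d_model = 512                                      # illustrative model width
x = tf.random.uniform((1, 10, d_model))            # embedded tokens, shape (batch, seq, d_model)
x *= tf.math.sqrt(tf.cast(d_model, tf.float32))    # scale before adding the positional encoding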
14
votes
1
answer
8k
views
How to visualize attention in an LSTM using the keras-self-attention package?
I'm using keras-self-attention to implement an attention LSTM in Keras. How can I visualize the attention part after training the model? This is a time series forecasting case.
from keras.models ...
13
votes
2
answers
21k
views
Keras - Add attention mechanism to an LSTM model [duplicate]
With the following code:
model = Sequential()
num_features = data.shape[2]
num_samples = data.shape[1]
model.add(
LSTM(16, batch_input_shape=(None, num_samples, num_features), return_sequences=...
12
votes
2
answers
2k
views
Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?
To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented many places, for example: here. I am not referring ...
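A minimal NumPy sketch of one common answer, masking the scores before the softmax so padded steps never receive weight; this is an illustration, not the linked implementation:
import numpy as np
scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])    # raw attention scores, last two steps are padding
mask = np.array([1, 1, 1, 0, 0], dtype=bool)
scores = np.where(mask, scores, -np.inf)         # padded positions get -inf
weights = np.exp(scores - scores[mask].max())
weights = weights / weights.sum()                # already sums to 1 over the real steps only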
12
votes
1
answer
9k
views
Visualizing attention activation in Tensorflow
Is there a way to visualize the attention weights on some input, like the figure in the link above (from Bahdanau et al., 2014), in TensorFlow's seq2seq models? I have found TensorFlow's github issue ...
11
votes
2
answers
5k
views
How can LSTM attention have variable length input
The attention mechanism of an LSTM is a straight softmax feed-forward network that takes in the hidden states of each time step of the encoder and the decoder's current state.
These 2 steps seem to ...
10
votes
1
answer
6k
views
MultiHeadAttention attention_mask [Keras, Tensorflow] example
I am struggling to mask my input for the MultiHeadAttention Layer. I am using the Transformer Block from Keras documentation with self-attention. I could not find any example code online so far and ...
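A minimal sketch of passing attention_mask to tf.keras.layers.MultiHeadAttention for self-attention; the padding pattern and sizes are made up:
import tensorflow as tf
B, T, D = 2, 4, 8
x = tf.random.uniform((B, T, D))
# True = attend, False = ignore; mask is broadcastable to (B, num_heads, T, T)
padding = tf.constant([[True, True, True, False],
                       [True, True, False, False]])
attention_mask = tf.logical_and(padding[:, tf.newaxis, :], padding[:, :, tf.newaxis])  # (B, T, T)
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=D)
out = mha(query=x, value=x, key=x, attention_mask=attention_mask)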
9
votes
1
answer
10k
views
Inputs to the nn.MultiheadAttention?
I have n vectors which need to be influenced by each other and output n vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects ...
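A minimal sketch of calling nn.MultiheadAttention with those n vectors as query, key and value; n, d and the batch size are illustrative:
import torch
import torch.nn as nn
n, d, batch = 6, 32, 1
x = torch.randn(n, batch, d)              # default layout is (seq_len, batch, embed_dim)
mha = nn.MultiheadAttention(embed_dim=d, num_heads=4)
out, attn_weights = mha(x, x, x)          # self-attention: out has shape (n, batch, d)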
9
votes
2
answers
12k
views
Outputting attention for bert-base-uncased with huggingface/transformers (torch)
I was following a paper on BERT-based lexical substitution (specifically trying to implement equation (2) - if someone has already implemented the whole paper that would also be great). Thus, I wanted ...
8
votes
1
answer
2k
views
Different `grad_fn` for similar looking operations in Pytorch (1.0)
I am working on an attention model, and before running the final model, I was going through the tensor shapes which flow through the code. I have an operation where I need to reshape the tensor. The ...
7
votes
2
answers
16k
views
How to visualize attention weights?
Using this implementation
I have included attention in my RNN (which classifies the input sequences into two classes) as follows.
visible = Input(shape=(250,))
embed=Embedding(vocab_size,100)(visible)
...
7
votes
2
answers
3k
views
Why use multi-headed attention in Transformers?
I am trying to understand why transformers use multiple attention heads. I found the following quote:
Instead of using a single attention function where the attention can
be dominated by the actual ...
7
votes
2
answers
4k
views
Sequence to Sequence - for time series prediction
I've tried to build a sequence to sequence model to predict a sensor signal over time based on its first few inputs (see figure below)
The model works OK, but I want to 'spice things up' and try to ...
7
votes
2
answers
2k
views
How can I add tf.keras.layers.AdditiveAttention in my model?
I am working on a machine language translation problem. The Model I am using is:
Model = Sequential([
Embedding(english_vocab_size, 256, input_length=english_max_len, mask_zero=True),
...
7
votes
0
answers
3k
views
Implementing attention in Keras classification
I would like to add attention to a trained image classification CNN model. For example, there are 30 classes and, with the Keras CNN, I obtain the predicted class for each image. However, to ...
6
votes
2
answers
16k
views
Is there any way to convert a PyTorch tensor to a TensorFlow tensor?
https://github.com/taoshen58/BiBloSA/blob/ec67cbdc411278dd29e8888e9fd6451695efc26c/context_fusion/self_attn.py#L29
I need to use mulit_dimensional_attention from the above link which is implemented ...
6
votes
1
answer
7k
views
Implementing Luong Attention in PyTorch
I am trying to implement the attention described in Luong et al. 2015 in PyTorch myself, but I couldn't get it to work. Below is my code; I am only interested in the "general" attention case for now. I ...
6
votes
1
answer
502
views
Keras, model trains successfully but generating predictions gives ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor
I created a Seq2Seq model for text summarization. I have two models, one with attention and one without. The one without attention was able to generate predictions but I can't do it for the one with ...
6
votes
0
answers
294
views
Getting error while converting a code in tf1 to tf2
Where the values are
rnn_size: 512
batch_size: 128
rnn_inputs: Tensor("embedding_lookup/Identity_1:0", shape=(?, ?, 128), dtype=float32)
sequence_length: Tensor("inputs_length:0", ...
6
votes
0
answers
854
views
Where should we put attention in an autoencoder?
In this tutorial on the TensorFlow site we can see code for the implementation of an autoencoder whose Decoder is as follows:
class Decoder(tf.keras.Model):
def __init__(self, vocab_size, ...
6
votes
1
answer
1k
views
How to add attention layer to seq2seq model on Keras
Based on this article, I wrote this model:
enc_in=Input(shape=(None,in_alphabet_len))
lstm=LSTM(lstm_dim,return_sequences=True,return_state=True,use_bias=False)
enc_out,h,c=lstm(enc_in)
dec_in=Input(...
6
votes
0
answers
3k
views
Attention in Tensorflow (tf.contrib.rnn.AttentionCellWrapper)
How exactly is tf.contrib.rnn.AttentionCellWrapper used? Can someone give a piece of example code?
Specifically, I only managed to make the following
fwd_cell = tf.contrib.rnn....
6
votes
0
answers
195
views
How to load a matrix to change the attention layer in seqToseq demo? - Paddle
While attempting to replicate section 3.1 of Incorporating Discrete Translation Lexicons into Neural MT in PaddlePaddle,
I tried to have a static matrix that I'll need to load into the seqToseq ...
5
votes
1
answer
16k
views
TransformerEncoder with a padding mask
I'm trying to implement torch.nn.TransformerEncoder with a src_key_padding_mask not equal to none. Imagine the input is of the shape src = [20, 95] and the binary padding mask has the shape src_mask = ...
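A minimal sketch of a TransformerEncoder call with src_key_padding_mask, assuming src is already embedded to shape (seq_len, batch, d_model) and True marks padded positions; the sizes and padding pattern are illustrative:
import torch
import torch.nn as nn
d_model, seq_len, batch = 32, 20, 95
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)
src = torch.randn(seq_len, batch, d_model)            # (S, N, E)
src_key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)  # (N, S)
src_key_padding_mask[:, 15:] = True                   # e.g. last 5 positions are padding
out = encoder(src, src_key_padding_mask=src_key_padding_mask)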
5
votes
1
answer
4k
views
Cannot parse GraphDef file in function 'ReadTFNetParamsFromTextFileOrDie' in OpenCV-DNN TensorFlow
I want to wrap the attention-OCR model with OpenCV-DNN to speed up inference. I am using the TF code from the official TF models repo.
For wrapping TF model with OpenCV-DNN, I am referring to ...
5
votes
0
answers
409
views
Retrieving attention weights for sentences? Most attentive sentences are zero vectors
I have a document classification task, that classifies documents as good (1) or bad (0), and I use some sentence embeddings for each document to classify the documents accordingly.
What I would like to do ...
5
votes
1
answer
628
views
Is there a way to use the native tf Attention layer with keras Sequential API?
Is there a way to use the native tf Attention layer with keras Sequential API?
I'm looking to use this particular class. I have found custom implementations such as this one. What I'm truly looking ...
5
votes
1
answer
1k
views
Differences between different attention layers for Keras
I am trying to add an attention layer for my text classification model. The inputs are texts (e.g. movie review), the output is a binary outcome (e.g. positive vs negative).
model = Sequential()
...
5
votes
0
answers
1k
views
How to access the attention weights from the attention class
class AttLayer(Layer):
def __init__(self, **kwargs):
self.init = initializations.get('normal')
#self.input_spec = [InputSpec(ndim=3)]
super(AttLayer, self).__init__(** ...
4
votes
3
answers
6k
views
Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only
My code was running perfectly in Colab, but today it's not running. It says
Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only @property of ...
4
votes
1
answer
3k
views
How can I pre-compute a mask for each input and adjust the weights according to this mask?
I want to provide a mask, the same size as the input image and adjust the weights learned from the image according to this mask (similar to attention, but pre-computed for each image input). How can I ...
4
votes
1
answer
462
views
AttentionDecoderRNN without MAX_LENGTH
From the PyTorch Seq2Seq tutorial, http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder
We see that the attention mechanism is heavily reliant on the ...
4
votes
1
answer
5k
views
How do I implement this attention layer in PyTorch?
I already did the implementation of the CNN part and everything seems to be working just fine. Afterwards I started to implement the LSTM part and, if I understood it right, the output shape should be (...
4
votes
1
answer
2k
views
Number of learnable parameters of MultiheadAttention
While testing (using PyTorch's MultiheadAttention), I noticed that increasing or decreasing the number of heads of the multi-head attention does not change the total number of learnable parameters of ...
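A quick sketch showing why: the projection matrices stay (embed_dim x embed_dim) regardless of how many heads that width is split into, so the count is constant (sizes are illustrative):
import torch.nn as nn
def n_params(m):
    return sum(p.numel() for p in m.parameters())
for heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=64, num_heads=heads)
    print(heads, n_params(mha))   # same count every time: 4 * (64*64 + 64) weights and biases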
4
votes
1
answer
4k
views
Implementation details of positional encoding in transformer model?
How exactly is this positional encoding calculated?
Let's assume a machine translation scenario and these are input sentences,
english_text = [this is good, this is bad]
german_text = [das ...
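For reference, a minimal NumPy sketch of the sinusoidal encoding from the paper; d_model and the maximum length are illustrative:
import numpy as np
d_model, max_len = 8, 50
pos = np.arange(max_len)[:, np.newaxis]                   # (max_len, 1)
i = np.arange(d_model)[np.newaxis, :]                     # (1, d_model)
angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even indices use sine
pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd indices use cosine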
4
votes
1
answer
600
views
Hierarchical Attention Network - model.fit generates error 'ValueError: Input dimension mis-match'
For background, I am referring to the Hierarchical Attention Network used for sentiment classification.
For code: my full code is posted below, but it is just a simple revision of the original code ...
4
votes
1
answer
1k
views
Why is my attention model worse than non-attention model
My task was to convert English sentences to German sentences. I first did this with a normal encoder-decoder network, on which I got fairly good results. Then I tried to solve the same task with the same ...
4
votes
1
answer
116
views
How do attention networks work?
Recently I was going through the Attention Is All You Need paper; while going through it, I had trouble understanding the attention network if I ignore the maths behind it.
Can anyone make me ...
4
votes
2
answers
2k
views
Why does softmax get a small gradient when the values are large, in the paper 'Attention Is All You Need'?
This is a screenshot of the original paper: the screen of the paper. I understand the paper to mean that when the value of the dot product is large, the gradient of the softmax will get very small.
...
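A quick numerical illustration of that saturation (the values are made up): as the logits are scaled up, the softmax becomes nearly one-hot and the Jacobian entries p_i * (delta_ij - p_j) shrink toward zero.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
for scale in (1, 10, 100):
    p = softmax(scale * np.array([1.0, 0.5, 0.1]))
    jac = np.diag(p) - np.outer(p, p)      # d softmax_i / d z_j
    print(scale, np.abs(jac).max())        # max gradient magnitude shrinks as the scale grows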
4
votes
1
answer
1k
views
What does the "source hidden state" refer to in the Attention Mechanism?
The attention weights are computed as:
I want to know what h_s refers to.
In the tensorflow code, the encoder RNN returns a tuple:
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(...)
As I ...
4
votes
1
answer
3k
views
tf.keras.layers.MultiHeadAttention's argument key_dim sometimes does not match the paper's example
For example, I have input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)), which means the sequence length is 1000 and the dimension is 10. Then:
This works (random num_heads and key_dim)...
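A minimal sketch of the point at issue: key_dim sets the per-head projection size and need not equal d_model / num_heads, since the output is projected back to the query's dimension; num_heads and key_dim below are arbitrary choices:
import tensorflow as tf
src = tf.random.uniform((1, 1000, 10))      # (batch, seq_len=1000, dim=10)
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=7)   # key_dim != 10 / 4
out = mha(query=src, value=src)
print(out.shape)                            # (1, 1000, 10): projected back to dim 10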