Questions tagged [attention-model]

Questions about the attention mechanism in deep learning models

36 votes
5 answers
37k views

What is the difference between Luong attention and Bahdanau attention?

Both attentions are used in seq2seq models; they are introduced as multiplicative and additive attention in this TensorFlow documentation. What is the difference?
Shamane Siriwardhana
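In short, Luong attention scores the encoder states with a bilinear (multiplicative) form, while Bahdanau attention uses a small feed-forward (additive) network. A minimal PyTorch sketch of the two score functions (shapes and variable names are illustrative, not taken from either paper's code):

    import torch
    import torch.nn as nn

    hidden = 128
    H = torch.randn(10, hidden)   # encoder hidden states, one per source position
    d = torch.randn(hidden)       # current decoder state

    # Luong "general" (multiplicative): score_i = d^T W h_i
    W = nn.Linear(hidden, hidden, bias=False)
    luong_scores = H @ W(d)                                      # (10,)

    # Bahdanau (additive): score_i = v^T tanh(W1 h_i + W2 d)
    W1 = nn.Linear(hidden, hidden, bias=False)
    W2 = nn.Linear(hidden, hidden, bias=False)
    v = nn.Linear(hidden, 1, bias=False)
    bahdanau_scores = v(torch.tanh(W1(H) + W2(d))).squeeze(-1)   # (10,)

    # Either way, the attention weights are a softmax over the scores.
    print(torch.softmax(luong_scores, dim=0))
    print(torch.softmax(bahdanau_scores, dim=0))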
32 votes
3 answers
26k views

How to understand masked multi-head attention in transformer

I'm currently studying the code of the Transformer, but I cannot understand the masked multi-head attention in the decoder. The paper says it is there to prevent a position from seeing the word that is being generated, but I cannot ...
Neptuner • 321
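The decoder mask is just a lower-triangular constraint on the attention scores, so position i cannot attend to positions after i. A minimal PyTorch sketch of single-head masked attention (not the paper's code; shapes are illustrative):

    import torch

    seq_len, d_k = 5, 64
    q = torch.randn(seq_len, d_k)
    k = torch.randn(seq_len, d_k)

    scores = q @ k.T / d_k ** 0.5                               # (5, 5) raw attention scores

    # Causal mask: True above the diagonal marks "future" positions to hide.
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float('-inf'))

    weights = torch.softmax(scores, dim=-1)                     # zeros above the diagonal, rows sum to 1
    print(weights)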
22 votes
2 answers
2k views

Attention Layer throwing TypeError: Permute layer does not support masking in Keras

I have been following this post in order to implement an attention layer on top of my LSTM model. Code for the attention layer: INPUT_DIM = 2 TIME_STEPS = 20 SINGLE_ATTENTION_VECTOR = False ...
Saurav-- • 1,589
21 votes
2 answers
10k views

What is the difference between attn_mask and key_padding_mask in MultiheadAttention?

What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention? key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. ...
one • 2,455
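Roughly: key_padding_mask is per batch element and hides PAD tokens in the keys, while attn_mask is shared across the batch and restricts which query positions may look at which key positions (e.g. a causal mask). A small sketch with torch.nn.MultiheadAttention (sizes are illustrative):

    import torch
    import torch.nn as nn

    embed_dim, num_heads, batch, src_len = 16, 4, 2, 5
    mha = nn.MultiheadAttention(embed_dim, num_heads)   # default layout: (seq, batch, embed)
    x = torch.randn(src_len, batch, embed_dim)

    # key_padding_mask: (batch, src_len), True = "this key is padding, ignore it".
    key_padding_mask = torch.tensor([[False, False, False, True, True],
                                     [False, False, False, False, True]])

    # attn_mask: (tgt_len, src_len), True = "this query may not attend to this key".
    attn_mask = torch.triu(torch.ones(src_len, src_len), diagonal=1).bool()

    out, weights = mha(x, x, x,
                       key_padding_mask=key_padding_mask,
                       attn_mask=attn_mask)
    print(out.shape, weights.shape)   # torch.Size([5, 2, 16]) torch.Size([2, 5, 5])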
17 votes
3 answers
14k views

How to build an attention model with Keras?

I am trying to understand attention models and also build one myself. After many searches I came across this website, which has an attention model coded in Keras that also looks simple. But when I tried ...
Eka • 14.6k
17 votes
1 answer
14k views

Adding Attention on top of simple LSTM layer in Tensorflow 2.0

I have a simple network of one LSTM and two Dense layers as such: model = tf.keras.Sequential() model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2]))) model.add(layers.Dense(20, ...
greco.roamin
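One way to do this is to switch to the functional API, since tf.keras.layers.Attention takes a [query, value] pair rather than a single input. A sketch under that assumption (layer sizes and the input shape are illustrative, not taken from the question):

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = layers.Input(shape=(30, 8))                        # (timesteps, features)
    lstm_out = layers.LSTM(20, return_sequences=True)(inputs)   # keep every timestep for attention
    context = layers.Attention()([lstm_out, lstm_out])          # self-attention over LSTM outputs
    pooled = layers.GlobalAveragePooling1D()(context)           # collapse the time dimension
    x = layers.Dense(20, activation='relu')(pooled)
    outputs = layers.Dense(1)(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')
    model.summary()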
17 votes
2 answers
7k views

Does attention make sense for Autoencoders?

I am struggling with the concept of attention in the context of autoencoders. I believe I understand the usage of attention with regard to seq2seq translation - after training the combined ...
user3641187
16 votes
5 answers
11k views

RuntimeError: "exp" not implemented for 'torch.LongTensor'

I am following this tutorial: http://nlp.seas.harvard.edu/2018/04/03/attention.html to implement the Transformer model from the "Attention Is All You Need" paper. However, I am getting the following ...
noob • 6,314
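In that tutorial the error usually comes from the positional-encoding snippet: torch.arange used to return a LongTensor, and torch.exp is not defined for integer tensors. Casting to float first avoids it; a sketch, assuming that is indeed the failing line:

    import math
    import torch

    d_model, max_len = 512, 100

    # Cast the integer ranges to float before exp/sin/cos to avoid
    # RuntimeError: "exp" not implemented for 'torch.LongTensor'.
    position = torch.arange(0, max_len).float().unsqueeze(1)           # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                         -(math.log(10000.0) / d_model))               # (d_model / 2,)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    print(pe.shape)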
16 votes
2 answers
5k views

Why must embed_dim be divisible by num_heads in MultiheadAttention?

I am learning about the Transformer. Here is the PyTorch documentation for MultiheadAttention. In its implementation, I saw there is a constraint: assert self.head_dim * num_heads == self.embed_dim, "...
jason • 2,108
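The constraint exists because the embed_dim-wide projections are reshaped into num_heads slices of head_dim = embed_dim // num_heads each; if the division is not exact, the reshape is impossible. A small PyTorch illustration (sizes are arbitrary):

    import torch

    embed_dim, num_heads = 512, 8
    head_dim = embed_dim // num_heads                    # 64
    assert head_dim * num_heads == embed_dim             # the constraint from the question

    x = torch.randn(2, 10, embed_dim)                    # (batch, seq, embed)

    # Split the embedding into per-head slices, then put it back together.
    per_head = x.view(2, 10, num_heads, head_dim).transpose(1, 2)    # (batch, heads, seq, head_dim)
    recombined = per_head.transpose(1, 2).reshape(2, 10, embed_dim)  # concatenate heads again
    print(torch.equal(x, recombined))                    # True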
14 votes
2 answers
5k views

Why is the embedding vector multiplied by a constant in the Transformer model?

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding. As the Positional encoding section says: ...
giser_yugang • 6,138
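A common reading is that the embedding is scaled by sqrt(d_model) so it is not drowned out by the positional encoding (whose values lie in [-1, 1]) when the two are added. A TensorFlow sketch of what the tutorial's line does (vocabulary size and tokens are made up):

    import tensorflow as tf

    d_model, vocab_size = 512, 8000
    embedding = tf.keras.layers.Embedding(vocab_size, d_model)

    tokens = tf.constant([[5, 42, 7]])
    emb = embedding(tokens)                                        # small values at initialization
    emb_scaled = emb * tf.math.sqrt(tf.cast(d_model, tf.float32))  # the line in question

    # After scaling, the token signal is comparable in magnitude to the
    # sinusoidal positional encoding that gets added next.
    print(float(tf.math.reduce_std(emb)), float(tf.math.reduce_std(emb_scaled)))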
14 votes
1 answer
8k views

How to visualize attention in an LSTM using the keras-self-attention package?

I'm using keras-self-attention to implement an attention LSTM in Keras. How can I visualize the attention part after training the model? This is a time-series forecasting case. from keras.models ...
Eghbal • 3,763
13 votes
2 answers
21k views

Keras - Add attention mechanism to an LSTM model [duplicate]

With the following code: model = Sequential() num_features = data.shape[2] num_samples = data.shape[1] model.add( LSTM(16, batch_input_shape=(None, num_samples, num_features), return_sequences=...
Shlomi Schwartz
12 votes
2 answers
2k views

Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?

To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented in many places, for example here. I am not referring ...
t-flow • 123
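Setting the padded positions' scores to -inf before the softmax is equivalent to re-normalizing afterwards, and is the usual way to keep zero-padding from receiving attention. A short PyTorch sketch (lengths and sizes are illustrative):

    import torch

    scores = torch.randn(2, 6)                     # attention scores, 2 sequences padded to length 6
    lengths = torch.tensor([4, 6])                 # true lengths before zero-padding

    # True on real timesteps, False on padding.
    mask = torch.arange(6).unsqueeze(0) < lengths.unsqueeze(1)     # (2, 6)

    masked_scores = scores.masked_fill(~mask, float('-inf'))
    weights = torch.softmax(masked_scores, dim=1)  # padding gets exactly zero weight
    print(weights.sum(dim=1))                      # tensor([1., 1.])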
12 votes
1 answer
9k views

Visualizing attention activations in TensorFlow

Is there a way to visualize the attention weights on some input, like the figure in the link above (from Bahdanau et al., 2014), in TensorFlow's seq2seq models? I have found TensorFlow's GitHub issue ...
reiste • 123
11 votes
2 answers
5k views

How can LSTM attention have variable-length input?

The attention mechanism for an LSTM is a straightforward softmax feed-forward network that takes in the hidden states of each encoder time step and the decoder's current state. These two steps seem to ...
Andrew Tu • 258
10 votes
1 answer
6k views

MultiHeadAttention attention_mask [Keras, Tensorflow] example

I am struggling to mask my input for the MultiHeadAttention Layer. I am using the Transformer Block from Keras documentation with self-attention. I could not find any example code online so far and ...
R. Giskard
9 votes
1 answer
10k views

Inputs to the nn.MultiheadAttention?

I have n vectors which need to be influenced by each other and to output n vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects ...
angryweasel
9 votes
2 answers
12k views

Outputting attention for bert-base-uncased with huggingface/transformers (torch)

I was following a paper on BERT-based lexical substitution (specifically trying to implement equation (2) - if someone has already implemented the whole paper that would also be great). Thus, I wanted ...
Björn • 674
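With a reasonably recent transformers version, asking the model for attentions is enough; a sketch (the example sentence is made up):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

    inputs = tokenizer("the cat sat on the mat", return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # One tensor per layer, each of shape (batch, num_heads, seq_len, seq_len).
    attentions = outputs.attentions
    print(len(attentions), attentions[0].shape)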
8 votes
1 answer
2k views

Different `grad_fn` for similar looking operations in Pytorch (1.0)

I am working on an attention model, and before running the final model, I was going through the tensor shapes which flow through the code. I have an operation where I need to reshape the tensor. The ...
abkds • 1,774
7 votes
2 answers
16k views

How to visualize attention weights?

Using this implementation, I have added attention to my RNN (which classifies the input sequences into two classes) as follows. visible = Input(shape=(250,)) embed=Embedding(vocab_size,100)(visible) ...
Stupid420 • 1,387
7 votes
2 answers
3k views

Why use multi-headed attention in Transformers?

I am trying to understand why transformers use multiple attention heads. I found the following quote: Instead of using a single attention function where the attention can be dominated by the actual ...
SomeDutchGuy • 2,339
7 votes
2 answers
4k views

Sequence to Sequence - for time series prediction

I've tried to build a sequence-to-sequence model to predict a sensor signal over time based on its first few inputs (see figure below). The model works OK, but I want to 'spice things up' and try to ...
Roni Gadot
7 votes
2 answers
2k views

How can I add tf.keras.layers.AdditiveAttention in my model?

I am working on a machine language translation problem. The Model I am using is: Model = Sequential([ Embedding(english_vocab_size, 256, input_length=english_max_len, mask_zero=True), ...
7 votes
0 answers
3k views

Implementing attention in Keras classification

I would like to add attention to a trained image-classification CNN model. For example, there are 30 classes, and with the Keras CNN I obtain the predicted class for each image. However, to ...
TheJokerAEZ
6 votes
2 answers
16k views

Is there any way to convert a PyTorch tensor to a TensorFlow tensor?

https://github.com/taoshen58/BiBloSA/blob/ec67cbdc411278dd29e8888e9fd6451695efc26c/context_fusion/self_attn.py#L29 I need to use multi_dimensional_attention from the above link, which is implemented ...
waleed hamid
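There is no direct bridge, but going through NumPy works for eager values (note this detaches the value from either framework's autograd graph); a minimal sketch:

    import tensorflow as tf
    import torch

    t = torch.randn(3, 4)

    # PyTorch -> NumPy -> TensorFlow
    tf_tensor = tf.convert_to_tensor(t.detach().cpu().numpy())

    # And back again if needed.
    back = torch.from_numpy(tf_tensor.numpy())
    print(tf_tensor.shape, back.shape)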
6 votes
1 answer
7k views

Implementing Luong Attention in PyTorch

I am trying to implement the attention described in Luong et al. 2015 in PyTorch myself, but I couldn't get it to work. Below is my code; I am only interested in the "general" attention case for now. I ...
zyxue • 8,270
6 votes
1 answer
502 views

Keras, model trains successfully but generating predictions gives ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor

I created a Seq2Seq model for text summarization. I have two models, one with attention and one without. The one without attention was able to generate predictions but I can't do it for the one with ...
BlueMango • 503
6 votes
0 answers
294 views

Getting an error while converting code from TF1 to TF2

Where the values are rnn_size: 512 batch_size: 128 rnn_inputs: Tensor("embedding_lookup/Identity_1:0", shape=(?, ?, 128), dtype=float32) sequence_length: Tensor("inputs_length:0", ...
Args • 73
6 votes
0 answers
854 views

Where should we put attention in an autoencoder?

In this tutorial on the TensorFlow site we can see code for the implementation of an autoencoder whose Decoder is as follows: class Decoder(tf.keras.Model): def __init__(self, vocab_size, ...
Marzi Heidari
6 votes
1 answer
1k views

How to add attention layer to seq2seq model on Keras

Based on this article, I wrote this model: enc_in=Input(shape=(None,in_alphabet_len)) lstm=LSTM(lstm_dim,return_sequences=True,return_state=True,use_bias=False) enc_out,h,c=lstm(enc_in) dec_in=Input(...
Osm • 81
6 votes
0 answers
3k views

Attention in Tensorflow (tf.contrib.rnn.AttentionCellWrapper)

How exactly is tf.contrib.rnn.AttentionCellWrapper used? Can someone give a piece of example code? Specifically, I only managed to make the following fwd_cell = tf.contrib.rnn....
user3373273
6 votes
0 answers
195 views

How to load a matrix to change the attention layer in seqToseq demo? - Paddle

While attempting to replicate Section 3.1 of Incorporating Discrete Translation Lexicons into Neural MT in PaddlePaddle, I tried to have a static matrix that I'll need to load into the seqToseq ...
alvas • 119k
5 votes
1 answer
16k views

TransformerEncoder with a padding mask

I'm trying to implement torch.nn.TransformerEncoder with a src_key_padding_mask not equal to none. Imagine the input is of the shape src = [20, 95] and the binary padding mask has the shape src_mask = ...
Pourya Vakilipourtakalou
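The key detail is the layout: by default the encoder expects src as (seq_len, batch, d_model), while src_key_padding_mask is (batch, seq_len) with True on PAD positions. A small sketch with made-up sizes:

    import torch
    import torch.nn as nn

    d_model, nhead, seq_len, batch = 32, 4, 7, 2
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    src = torch.randn(seq_len, batch, d_model)               # (seq_len, batch, d_model)

    # (batch, seq_len); True marks padded timesteps to be ignored.
    src_key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    src_key_padding_mask[0, 5:] = True                       # first sequence has 2 padded steps

    out = encoder(src, src_key_padding_mask=src_key_padding_mask)
    print(out.shape)                                         # torch.Size([7, 2, 32])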
5 votes
1 answer
4k views

Cannot parse GraphDef file in function 'ReadTFNetParamsFromTextFileOrDie' in OpenCV-DNN TensorFlow

I want to wrap the attention-OCR model with OpenCV-DNN to speed up inference. I am using the TF code from the official TF models repo. For wrapping a TF model with OpenCV-DNN, I am referring to ...
Chintan • 503
5 votes
0 answers
409 views

Retrieving attention weights for sentences? Most attentive sentences are zero vectors

I have a document classification task that classifies documents as good (1) or bad (0), and I use some sentence embeddings for each document to classify the documents accordingly. What I'd like to do ...
Felix • 323
5 votes
1 answer
628 views

Is there a way to use the native tf Attention layer with keras Sequential API?

Is there a way to use the native tf Attention layer with keras Sequential API? I'm looking to use this particular class. I have found custom implementations such as this one. What I'm truly looking ...
Wajd Meskini
5 votes
1 answer
1k views

Differences between different attention layers for Keras

I am trying to add an attention layer for my text classification model. The inputs are texts (e.g. movie review), the output is a binary outcome (e.g. positive vs negative). model = Sequential() ...
Dr. Who • 151
5 votes
0 answers
1k views

How to access the attention weights from the attention class?

class AttLayer(Layer): def __init__(self, **kwargs): self.init = initializations.get('normal') #self.input_spec = [InputSpec(ndim=3)] super(AttLayer, self).__init__(** ...
prashant ranjan
4 votes
3 answers
6k views

Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only

My code was running perfectly in Colab, but today it's not running. It says Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only @property of ...
Rohan kumar Yadav
4 votes
1 answer
3k views

How can I pre-compute a mask for each input and adjust the weights according to this mask?

I want to provide a mask, the same size as the input image and adjust the weights learned from the image according to this mask (similar to attention, but pre-computed for each image input). How can I ...
dusa • 830
4 votes
1 answer
462 views

AttentionDecoderRNN without MAX_LENGTH

From the PyTorch Seq2Seq tutorial, http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder We see that the attention mechanism is heavily reliant on the ...
alvas • 119k
4 votes
1 answer
5k views

How do I implement this attention layer in PyTorch?

I already did the implementation of the CNN part and everything seems to be working just fine. Afterwards I started to implement the LSTM part and, if I understood it right, the output shape should be (...
deadknxght
4 votes
1 answer
2k views

Number of learnable parameters of MultiheadAttention

While testing (using PyTorch's MultiheadAttention), I noticed that increasing or decreasing the number of heads of the multi-head attention does not change the total number of learnable parameters of ...
Elidor00 • 1,502
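This is expected: the Q/K/V and output projections are always embed_dim x embed_dim, and the heads only determine how the projected vectors are split, so the parameter count does not depend on num_heads. A quick check:

    import torch.nn as nn

    def count_params(module):
        return sum(p.numel() for p in module.parameters())

    embed_dim = 512
    for num_heads in (1, 4, 8):
        mha = nn.MultiheadAttention(embed_dim, num_heads)
        print(num_heads, count_params(mha))   # same total for every head count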
4 votes
1 answer
4k views

Implementation details of positional encoding in the Transformer model?

How exactly is this positional encoding calculated? Let's assume a machine translation scenario, and these are the input sentences: english_text = [this is good, this is bad] german_text = [das ...
Sai Kumar • 685
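A compact NumPy version of the sinusoidal encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which is then added to the token embeddings (sizes here are tiny for readability):

    import numpy as np

    def positional_encoding(max_len, d_model):
        pos = np.arange(max_len)[:, None]          # (max_len, 1)
        i = np.arange(d_model)[None, :]            # (1, d_model)
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angle[:, 0::2])       # even indices get sin
        pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd indices get cos
        return pe

    print(positional_encoding(max_len=4, d_model=8).round(3))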
4 votes
1 answer
600 views

Hierarchical Attention Network - model.fit generates error 'ValueError: Input dimension mis-match'

For background, I am referring to the Hierarchical Attention Network used for sentiment classification. For code: my full code is posted below, but it is just a simple revision of the original code ...
Ziqi • 2,514
4 votes
1 answer
1k views

Why is my attention model worse than the non-attention model?

My task was to convert English sentences to German sentences. I first did this with a normal encoder-decoder network, with which I got fairly good results. Then I tried to solve the same task with the same ...
4 votes
1 answer
116 views

How do attention networks work?

Recently I was going through the Attention Is All You Need paper; while going through it, I had trouble understanding the attention network if I ignore the maths behind it. Can anyone make me ...
Kumar Mangalam
4 votes
2 answers
2k views

Why does softmax get a small gradient when the value is large, in the paper 'Attention Is All You Need'?

This is a screenshot from the original paper. I understand the paper to mean that when the value of the dot product is large, the gradient of the softmax gets very small. ...
Richard. Zhu
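A tiny numerical illustration of that passage: when the logits (the unscaled dot products q^T k) are large, the softmax saturates and its gradient vanishes, which is why the paper divides by sqrt(d_k). The numbers below are made up:

    import torch

    small = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)     # e.g. scaled dot products
    large = torch.tensor([10.0, 20.0, 30.0], requires_grad=True)  # e.g. unscaled dot products

    torch.softmax(small, dim=0)[0].backward()
    torch.softmax(large, dim=0)[0].backward()

    print(small.grad)   # clearly non-zero
    print(large.grad)   # ~0: the softmax has saturated, so the gradient vanishes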
4 votes
1 answer
1k views

What does the "source hidden state" refer to in the Attention Mechanism?

The attention weights are computed as: [formula image]. I want to know what h_s refers to. In the TensorFlow code, the encoder RNN returns a tuple: encoder_outputs, encoder_state = tf.nn.dynamic_rnn(...) As I ...
imhuay • 281
4 votes
1 answer
3k views

tf.keras.layers.MultiHeadAttention's argument key_dim sometimes does not match the paper's example

For example, I have an input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)), which means the sequence length is 1000 and the dimension is 10. Then: This works (random num_heads and key_dim)...
EthanJiang
