Questions tagged [attention-model]
Questions about the attention mechanism in deep learning models
396
questions
36
votes
5
answers
37k
views
What is the difference between Luong attention and Bahdanau attention?
These two attention mechanisms are used in seq2seq models. They are introduced as multiplicative and additive attention in this TensorFlow documentation. What is the difference?
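For orientation, a minimal PyTorch sketch of the two score functions; tensor names and sizes are illustrative assumptions, not taken from the question:
import torch
import torch.nn as nn
d = 8                                   # hidden size, illustrative
h_t = torch.randn(1, d)                 # current decoder state
h_s = torch.randn(5, d)                 # encoder states (source length 5)
# Luong ("multiplicative", general form): score = h_t W h_s
W_a = nn.Linear(d, d, bias=False)
luong_score = h_t @ W_a(h_s).T          # shape (1, 5)
# Bahdanau ("additive"): score = v^T tanh(W1 h_t + W2 h_s)
W1, W2 = nn.Linear(d, d), nn.Linear(d, d)
v = nn.Linear(d, 1, bias=False)
bahdanau_score = v(torch.tanh(W1(h_t) + W2(h_s))).T   # shape (1, 5)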
32
votes
3
answers
26k
views
How to understand masked multi-head attention in the Transformer
I'm currently studying the code of the Transformer, but I cannot understand the masked multi-head attention in the decoder. The paper says it is there to prevent you from seeing the word being generated, but I cannot ...
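A minimal sketch of the causal (look-ahead) mask the decoder applies, assuming a PyTorch-style additive mask; the sizes are illustrative:
import torch
seq_len = 5
# Upper-triangular mask: position i may only attend to positions <= i.
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
scores = torch.randn(seq_len, seq_len)           # raw attention scores (made up)
weights = torch.softmax(scores + mask, dim=-1)   # future positions get zero weight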
22
votes
2
answers
2k
views
Attention Layer throwing TypeError: Permute layer does not support masking in Keras
I have been following this post in order to implement an attention layer on top of my LSTM model.
Code for the attention layer:
INPUT_DIM = 2
TIME_STEPS = 20
SINGLE_ATTENTION_VECTOR = False
...
21
votes
2
answers
10k
views
What is the difference between attn_mask and key_padding_mask in MultiheadAttention?
What is the difference between attn_mask and key_padding_mask in PyTorch's MultiheadAttention:
key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. ...
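A minimal sketch contrasting the two arguments of torch.nn.MultiheadAttention; the shapes follow the documented API, the data is made up:
import torch
import torch.nn as nn
L, N, E = 5, 2, 16                        # target length, batch, embedding dim
mha = nn.MultiheadAttention(embed_dim=E, num_heads=4)
x = torch.randn(L, N, E)                  # default layout is (seq, batch, embed)
# attn_mask (L, S): True marks query/key pairs that may not attend (e.g. causal mask)
attn_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
# key_padding_mask (N, S): True marks padding tokens per batch element
key_padding_mask = torch.tensor([[False, False, False, False, False],
                                 [False, False, False, True, True]])
out, w = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)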
17
votes
3
answers
14k
views
How to build an attention model with Keras?
I am trying to understand attention models and also build one myself. After many searches I came across this website, which had an attention model coded in Keras and also looks simple. But when I tried ...
17
votes
1
answer
14k
views
Adding Attention on top of simple LSTM layer in Tensorflow 2.0
I have a simple network of one LSTM and two Dense layers, as follows:
model = tf.keras.Sequential()
model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(layers.Dense(20, ...
17
votes
2
answers
7k
views
Does attention make sense for Autoencoders?
I am struggling with the concept of attention in the context of autoencoders. I believe I understand the usage of attention with regard to seq2seq translation - after training, the combined ...
16
votes
5
answers
11k
views
RuntimeError: "exp" not implemented for 'torch.LongTensor'
I am following this tutorial: http://nlp.seas.harvard.edu/2018/04/03/attention.html
to implement the Transformer model from the "Attention Is All You Need" paper.
However, I am getting the following ...
16
votes
2
answers
5k
views
Why must embed_dim be divisible by num_heads in MultiheadAttention?
I am learning the Transformer. Here is the PyTorch documentation for MultiheadAttention. In their implementation, I saw there is a constraint:
assert self.head_dim * num_heads == self.embed_dim, "...
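The constraint comes from splitting the embedding width evenly across heads; a quick sketch of the arithmetic (numbers are illustrative):
embed_dim, num_heads = 512, 8
head_dim = embed_dim // num_heads         # 64: each head works on one slice of the embedding
assert head_dim * num_heads == embed_dim  # fails whenever embed_dim is not divisible by num_heads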
14
votes
2
answers
5k
views
Why is the embedding vector multiplied by a constant in the Transformer model?
I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.
As the section Positional encoding says:
...
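The constant in question is sqrt(d_model); a minimal TensorFlow sketch of the scaling step, with illustrative variable names that are not taken from the tutorial:
import tensorflow as tf
d_model = 512                                      # illustrative model width
x = tf.random.uniform((1, 10, d_model))            # embedded tokens, shape (batch, seq, d_model)
x *= tf.math.sqrt(tf.cast(d_model, tf.float32))    # scale before adding the positional encoding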
14
votes
1
answer
8k
views
How to visualize attention in an LSTM using the keras-self-attention package?
I'm using keras-self-attention to implement an attention LSTM in Keras. How can I visualize the attention part after training the model? This is a time series forecasting case.
from keras.models ...
13
votes
2
answers
21k
views
Keras - Add attention mechanism to an LSTM model [duplicate]
With the following code:
model = Sequential()
num_features = data.shape[2]
num_samples = data.shape[1]
model.add(
LSTM(16, batch_input_shape=(None, num_samples, num_features), return_sequences=...
12
votes
2
answers
2k
views
Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?
To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented many places, for example: here. I am not referring ...
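A minimal NumPy sketch of one common answer, masking the scores before the softmax so padded steps never receive weight; this is an illustration, not the linked implementation:
import numpy as np
scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])    # raw attention scores, last two steps are padding
mask = np.array([1, 1, 1, 0, 0], dtype=bool)
scores = np.where(mask, scores, -np.inf)         # padded positions get -inf
weights = np.exp(scores - scores[mask].max())
weights = weights / weights.sum()                # already sums to 1 over the real steps only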
12
votes
1
answer
9k
views
Visualizing attention activation in Tensorflow
Is there a way to visualize the attention weights on some input, like the figure in the link above (from Bahdanau et al., 2014), in TensorFlow's seq2seq models? I have found TensorFlow's github issue ...
11
votes
2
answers
5k
views
How can LSTM attention have variable length input
The attention mechanism of an LSTM is a straight softmax feed-forward network that takes in the hidden states of each time step of the encoder and the decoder's current state.
These 2 steps seem to ...
10
votes
1
answer
6k
views
MultiHeadAttention attention_mask [Keras, Tensorflow] example
I am struggling to mask my input for the MultiHeadAttention Layer. I am using the Transformer Block from Keras documentation with self-attention. I could not find any example code online so far and ...
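A minimal sketch of passing attention_mask to tf.keras.layers.MultiHeadAttention for self-attention; the padding pattern and sizes are made up:
import tensorflow as tf
B, T, D = 2, 4, 8
x = tf.random.uniform((B, T, D))
# True = attend, False = ignore; mask is broadcastable to (B, num_heads, T, T)
padding = tf.constant([[True, True, True, False],
                       [True, True, False, False]])
attention_mask = tf.logical_and(padding[:, tf.newaxis, :], padding[:, :, tf.newaxis])  # (B, T, T)
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=D)
out = mha(query=x, value=x, key=x, attention_mask=attention_mask)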
9
votes
1
answer
10k
views
Inputs to the nn.MultiheadAttention?
I have n vectors which need to be influenced by each other and output n vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects ...
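A minimal sketch of calling nn.MultiheadAttention with those n vectors as query, key and value; n, d and the batch size are illustrative:
import torch
import torch.nn as nn
n, d, batch = 6, 32, 1
x = torch.randn(n, batch, d)              # default layout is (seq_len, batch, embed_dim)
mha = nn.MultiheadAttention(embed_dim=d, num_heads=4)
out, attn_weights = mha(x, x, x)          # self-attention: out has shape (n, batch, d)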
9
votes
2
answers
12k
views
Outputting attention for bert-base-uncased with huggingface/transformers (torch)
I was following a paper on BERT-based lexical substitution (specifically trying to implement equation (2) - if someone has already implemented the whole paper that would also be great). Thus, I wanted ...
8
votes
1
answer
2k
views
Different `grad_fn` for similar looking operations in Pytorch (1.0)
I am working on an attention model, and before running the final model, I was going through the tensor shapes which flow through the code. I have an operation where I need to reshape the tensor. The ...
7
votes
2
answers
16k
views
How to visualize attention weights?
Using this implementation
I have included attention in my RNN (which classifies the input sequences into two classes) as follows.
visible = Input(shape=(250,))
embed=Embedding(vocab_size,100)(visible)
...
7
votes
2
answers
3k
views
Why use multi-headed attention in Transformers?
I am trying to understand why transformers use multiple attention heads. I found the following quote:
Instead of using a single attention function where the attention can
be dominated by the actual ...
7
votes
2
answers
4k
views
Sequence to Sequence - for time series prediction
I've tried to build a sequence to sequence model to predict a sensor signal over time based on its first few inputs (see figure below)
The model works OK, but I want to 'spice things up' and try to ...
7
votes
2
answers
2k
views
How can I add tf.keras.layers.AdditiveAttention in my model?
I am working on a machine language translation problem. The Model I am using is:
Model = Sequential([
Embedding(english_vocab_size, 256, input_length=english_max_len, mask_zero=True),
...
7
votes
0
answers
3k
views
Implementing attention in Keras classification
I would like to add attention to a trained image classification CNN model. For example, there are 30 classes and, with the Keras CNN, I obtain the predicted class for each image. However, to ...
6
votes
2
answers
16k
views
Is there any way to convert a PyTorch tensor to a TensorFlow tensor?
https://github.com/taoshen58/BiBloSA/blob/ec67cbdc411278dd29e8888e9fd6451695efc26c/context_fusion/self_attn.py#L29
I need to use mulit_dimensional_attention from the above link which is implemented ...
6
votes
1
answer
7k
views
Implementing Luong Attention in PyTorch
I am trying to implement the attention described in Luong et al. 2015 in PyTorch myself, but I couldn't get it to work. Below is my code; I am only interested in the "general" attention case for now. I ...
6
votes
1
answer
502
views
Keras, model trains successfully but generating predictions gives ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor
I created a Seq2Seq model for text summarization. I have two models, one with attention and one without. The one without attention was able to generate predictions but I can't do it for the one with ...
6
votes
0
answers
294
views
Getting error while converting a code in tf1 to tf2
Where the values are
rnn_size: 512
batch_size: 128
rnn_inputs: Tensor("embedding_lookup/Identity_1:0", shape=(?, ?, 128), dtype=float32)
sequence_length: Tensor("inputs_length:0", ...
6
votes
0
answers
854
views
Where should we put attention in an autoencoder?
In this tutorial on the TensorFlow site we can see code for the implementation of an autoencoder whose Decoder is as follows:
class Decoder(tf.keras.Model):
def __init__(self, vocab_size, ...
6
votes
1
answer
1k
views
How to add attention layer to seq2seq model on Keras
Based on this article, I wrote this model:
enc_in=Input(shape=(None,in_alphabet_len))
lstm=LSTM(lstm_dim,return_sequences=True,return_state=True,use_bias=False)
enc_out,h,c=lstm(enc_in)
dec_in=Input(...
6
votes
0
answers
3k
views
Attention in Tensorflow (tf.contrib.rnn.AttentionCellWrapper)
How exactly is tf.contrib.rnn.AttentionCellWrapper used? Can someone give a piece of example code?
Specifically, I only managed to make the following
fwd_cell = tf.contrib.rnn....
6
votes
0
answers
195
views
How to load a matrix to change the attention layer in seqToseq demo? - Paddle
While attempting to replicate section 3.1 of Incorporating Discrete Translation Lexicons into Neural MT in PaddlePaddle,
I tried to have a static matrix that I'll need to load into the seqToseq ...
5
votes
1
answer
16k
views
TransformerEncoder with a padding mask
I'm trying to implement torch.nn.TransformerEncoder with a src_key_padding_mask not equal to none. Imagine the input is of the shape src = [20, 95] and the binary padding mask has the shape src_mask = ...
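A minimal sketch of a TransformerEncoder call with src_key_padding_mask, assuming src is already embedded to shape (seq_len, batch, d_model) and True marks padded positions; the sizes and padding pattern are illustrative:
import torch
import torch.nn as nn
d_model, seq_len, batch = 32, 20, 95
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(layer, num_layers=2)
src = torch.randn(seq_len, batch, d_model)            # (S, N, E)
src_key_padding_mask = torch.zeros(batch, seq_len, dtype=torch.bool)  # (N, S)
src_key_padding_mask[:, 15:] = True                   # e.g. last 5 positions are padding
out = encoder(src, src_key_padding_mask=src_key_padding_mask)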
5
votes
1
answer
4k
views
Cannot parse GraphDef file in function 'ReadTFNetParamsFromTextFileOrDie' in OpenCV-DNN TensorFlow
I want to wrap the attention-OCR model with OpenCV-DNN to speed up inference. I am using the TF code from the official TF models repo.
For wrapping TF model with OpenCV-DNN, I am referring to ...
5
votes
0
answers
409
views
Retrieving attention weights for sentences? Most attentive sentences are zero vectors
I have a document classification task, that classifies documents as good (1) or bad (0), and I use some sentence embeddings for each document to classify the documents accordingly.
What I would like to do ...
5
votes
1
answer
628
views
Is there a way to use the native tf Attention layer with keras Sequential API?
Is there a way to use the native tf Attention layer with keras Sequential API?
I'm looking to use this particular class. I have found custom implementations such as this one. What I'm truly looking ...
5
votes
1
answer
1k
views
Differences between different attention layers for Keras
I am trying to add an attention layer for my text classification model. The inputs are texts (e.g. movie review), the output is a binary outcome (e.g. positive vs negative).
model = Sequential()
...
5
votes
0
answers
1k
views
How to access the attention weights from the attention class
class AttLayer(Layer):
def __init__(self, **kwargs):
self.init = initializations.get('normal')
#self.input_spec = [InputSpec(ndim=3)]
super(AttLayer, self).__init__(** ...
4
votes
3
answers
6k
views
Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only
My code was running perfectly in Colab, but today it's not running. It says
Can't set the attribute "trainable_weights", likely because it conflicts with an existing read-only @property of ...
4
votes
1
answer
3k
views
How can I pre-compute a mask for each input and adjust the weights according to this mask?
I want to provide a mask, the same size as the input image and adjust the weights learned from the image according to this mask (similar to attention, but pre-computed for each image input). How can I ...
4
votes
1
answer
462
views
AttentionDecoderRNN without MAX_LENGTH
From the PyTorch Seq2Seq tutorial, http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#attention-decoder
We see that the attention mechanism is heavily reliant on the ...
4
votes
1
answer
5k
views
How do I implement this attention layer in PyTorch?
I already did the implementation of the CNN part and everything seems to be working just fine. Afterwards I started to implement the LSTM part and, if I understood it right, the output shape should be (...
4
votes
1
answer
2k
views
Number of learnable parameters of MultiheadAttention
While testing (using PyTorch's MultiheadAttention), I noticed that increasing or decreasing the number of heads of the multi-head attention does not change the total number of learnable parameters of ...
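A quick sketch showing why: the projection matrices stay (embed_dim x embed_dim) regardless of how many heads that width is split into, so the count is constant (sizes are illustrative):
import torch.nn as nn
def n_params(m):
    return sum(p.numel() for p in m.parameters())
for heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=64, num_heads=heads)
    print(heads, n_params(mha))   # same count every time: 4 * (64*64 + 64) weights and biases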
4
votes
1
answer
4k
views
Implementation details of positional encoding in transformer model?
How exactly is this positional encoding calculated?
Let's assume a machine translation scenario and these are input sentences,
english_text = [this is good, this is bad]
german_text = [das ...
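For reference, a minimal NumPy sketch of the sinusoidal encoding from the paper; d_model and the maximum length are illustrative:
import numpy as np
d_model, max_len = 8, 50
pos = np.arange(max_len)[:, np.newaxis]                   # (max_len, 1)
i = np.arange(d_model)[np.newaxis, :]                     # (1, d_model)
angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angle[:, 0::2])                      # even indices use sine
pe[:, 1::2] = np.cos(angle[:, 1::2])                      # odd indices use cosine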
4
votes
1
answer
600
views
Hierarchical Attention Network - model.fit generates error 'ValueError: Input dimension mis-match'
For background, I am referring to the Hierarchical Attention Network used for sentiment classification.
For code: my full code is posted below, but it is just a simple revision of the original code ...
4
votes
1
answer
1k
views
Why is my attention model worse than non-attention model
My task was to convert English sentences to German sentences. I first did this with a normal encoder-decoder network, on which I got fairly good results. Then I tried to solve the same task with the same ...
4
votes
1
answer
116
views
How do attention networks work?
Recently I was going through the Attention Is All You Need paper; while going through it, I had trouble understanding the attention network if I ignore the maths behind it.
Can anyone make me ...
4
votes
2
answers
2k
views
Why does softmax get a small gradient when the values are large, in the paper 'Attention Is All You Need'?
This is a screenshot of the original paper: the screen of the paper. I understand the paper to mean that when the value of the dot product is large, the gradient of the softmax will get very small.
...
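A quick numerical illustration of that saturation (the values are made up): as the logits are scaled up, the softmax becomes nearly one-hot and the Jacobian entries p_i * (delta_ij - p_j) shrink toward zero.
import numpy as np
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
for scale in (1, 10, 100):
    p = softmax(scale * np.array([1.0, 0.5, 0.1]))
    jac = np.diag(p) - np.outer(p, p)      # d softmax_i / d z_j
    print(scale, np.abs(jac).max())        # max gradient magnitude shrinks as the scale grows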
4
votes
1
answer
1k
views
What does the "source hidden state" refer to in the Attention Mechanism?
The attention weights are computed as:
I want to know what h_s refers to.
In the tensorflow code, the encoder RNN returns a tuple:
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(...)
As I ...
4
votes
1
answer
3k
views
tf.keras.layers.MultiHeadAttention's argument key_dim sometimes does not match the paper's example
For example, I have input with shape (1, 1000, 10) (so src.shape will be (1, 1000, 10)), which means the sequence length is 1000 and the dimension is 10. Then:
This works (random num_heads and key_dim)...
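A minimal sketch of the point at issue: key_dim sets the per-head projection size and need not equal d_model / num_heads, since the output is projected back to the query's dimension; num_heads and key_dim below are arbitrary choices:
import tensorflow as tf
src = tf.random.uniform((1, 1000, 10))      # (batch, seq_len=1000, dim=10)
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=7)   # key_dim != 10 / 4
out = mha(query=src, value=src)
print(out.shape)                            # (1, 1000, 10): projected back to dim 10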