Unexpected Attention dimension [nbr_layers, seq_length, hidden_layer_dim]


I’m working on extracting attention weights from a modified BERT model that originally did NOT output any attention. When I extract them from the BertEncoder (threading them up through all the intermediate classes: ModelLayer, SelfUnpaddedAttention…), the result seems to have shape [nbr_layers, seq_length, hidden_layer_dim].

But if I understand correctly, I should have, somewhere, a [seq_length, seq_length] matrix from which I can visualize the attention map.
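
For comparison, this is the kind of shape I was expecting, based on how a stock Hugging Face BertModel exposes attention when output_attentions=True (a minimal sketch; bert-base-uncased is used here only as an illustration, not the model from my repository):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# [batch_size, num_heads, seq_length, seq_length], so the last two axes give
# the seq_length x seq_length map I want to visualize.
print(outputs.attentions[0].shape)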

Here is the repository where I try to extract the attention: HuggingFace Model
The code where I extract the attention is below.

I’m not sure whether I made a mistake in how I extract the attention, or whether I just need to change something to get the expected shape.

class BertEncoder(nn.Module):
    """A stack of BERT layers providing the backbone of Mosaic BERT.
    This module is modeled after the Hugging Face BERT's :class:`~transformers.model.bert.modeling_bert.BertEncoder`,
    but with substantial modifications to implement unpadding and ALiBi.
    Compared to the analogous Hugging Face BERT module, this module handles unpadding to reduce unnecessary computation
    at padded tokens, and pre-computes attention biases to implement ALiBi.
    """

...
...
# OTHER CODE HERE (SEE SOURCE LINK)
...
...
        # PART WHERE I EXTRACT ATTENTION
        all_encoder_layers = []
        all_attention_weights = []  # List to store attention weights
    
        if subset_mask is None:
            for layer_module in self.layer:
                # The layer now returns attention as well, so unpack 2 elements instead of 1.
                hidden_states, attention_weights = layer_module(hidden_states,
                                                                cu_seqlens,
                                                                seqlen,
                                                                None,
                                                                indices,
                                                                attn_mask=attention_mask,
                                                                bias=alibi_attn_mask)
                
                all_attention_weights.append(attention_weights)  # Store attention weights
                if output_all_encoded_layers:
                    all_encoder_layers.append(hidden_states)
            # Pad inputs and mask. It will insert back zero-padded tokens.
            # Assume ntokens is total number of tokens (padded and non-padded)
            # and ntokens_unpad is total number of non-padded tokens.
            # Then padding performs the following de-compression:
            #     hidden_states[ntokens_unpad,hidden] -> hidden_states[ntokens,hidden]
            hidden_states = pad_input(hidden_states, indices, batch, seqlen)
        else:
            for i in range(len(self.layer) - 1):
                layer_module = self.layer[i]
                # The layer now returns attention as well, so unpack 2 elements instead of 1.
                hidden_states, attention_weights = layer_module(hidden_states,
                                                                cu_seqlens,
                                                                seqlen,
                                                                None,
                                                                indices,
                                                                attn_mask=attention_mask,
                                                                bias=alibi_attn_mask)
                all_attention_weights.append(attention_weights)  # Store attention weights
                if output_all_encoded_layers:
                    all_encoder_layers.append(hidden_states)
            subset_idx = torch.nonzero(subset_mask[attention_mask_bool],
                                       as_tuple=False).flatten()
            # The layer now returns attention as well, so unpack 2 elements instead of 1.
            hidden_states, attention_weights = self.layer[-1](hidden_states,
                                                              cu_seqlens,
                                                              seqlen,
                                                              subset_idx=subset_idx,
                                                              indices=indices,
                                                              attn_mask=attention_mask,
                                                              bias=alibi_attn_mask)
            all_attention_weights.append(attention_weights)  # Store attention weights of the last layer as well
        if not output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)

        # Since we now return both, we need to handle them wherever BertEncoder forward is called.
        return all_encoder_layers, all_attention_weights  # Return both hidden states and attention weights
        # return all_encoder_layers  # original return.
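
On the caller’s side, after the modified forward returns, I just stack the per-layer weights and inspect them; this is where I see [nbr_layers, seq_length, hidden_layer_dim] rather than anything containing a seq_length x seq_length map (a minimal sketch; all_attention_weights stands in for the list returned by the modified BertEncoder.forward above):

import torch

def inspect_attention(all_attention_weights):
    # all_attention_weights: one tensor per layer, as returned by the
    # modified BertEncoder.forward above.
    for i, attn in enumerate(all_attention_weights):
        print(f"layer {i}: {tuple(attn.shape)}")
    # Stacking along a new leading axis is where the
    # [nbr_layers, seq_length, hidden_layer_dim] shape shows up.
    stacked = torch.stack(all_attention_weights)
    print("stacked:", tuple(stacked.shape))
    return stacked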
