BERT MLM model fine-tune on small data bad results

Question

I want to fine tune a BERT MLM model from Hugging Face. I have only a small amount of data (train.csv) like this:

text
בראשית ברא אלהים את השמים ואת הארץ
והארץ היתה תהו ובהו וחשך על פני תהום ורוח אלהים מרחפת על פני המים
ויאמר אלהים יהי אור ויהי אור
וירא אלהים את האור כי טוב ויבדל אלהים בין האור ובין החשך
ויקרא אלהים לאור יום ולחשך קרא לילה ויהי ערב ויהי בקר יום אחד
ויאמר אלהים יהי רקיע בתוך המים ויהי מבדיל בין מים למים
ויעש אלהים את הרקיע ויבדל בין המים אשר מתחת לרקיע ובין המים אשר מעל לרקיע ויהי כן
ויקרא אלהים לרקיע שמים ויהי ערב ויהי בקר יום שני
ויאמר אלהים יקוו המים מתחת השמים אל מקום אחד ותראה היבשה ויהי כן
ויקרא אלהים ליבשה ארץ ולמקוה המים קרא ימים וירא אלהים כי טוב

Below is my script in doing the fine-tuning:

from huggingface_hub import login
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer
import torch
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
import collections
import numpy as np
from transformers import default_data_collator
from transformers import TrainingArguments
from transformers import Trainer
import math
from transformers import EarlyStoppingCallback, IntervalStrategy

access_token_write = "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
login(token = access_token_write)

model_checkpoint = "dicta-il/BEREL_2.0"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

sam_dataset = load_dataset("johnlockejrr/samv2")

def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = sam_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)

chunk_size = 128
tokenized_samples = tokenized_datasets["train"][:3]

concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["text"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

samples = [lm_datasets["train"][i] for i in range(2)]

train_size = 1_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)

batch_size = 64
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-sam-v1",
    overwrite_output_dir=True,
    evaluation_strategy=IntervalStrategy.STEPS, # "steps"
    eval_steps = 50, # Evaluation and Save happens every 50 steps
    save_total_limit = 5, # Only last 5 models are saved. Older ones are deleted.
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    metric_for_best_model = 'f1',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer.push_to_hub()

All went well, model trained, pushed to Hugging Face. Problems:

Loss: 6.2156
Model work for Fill-Mask but with my new data seems something happened in the process and some words are masked as tokens and not full words: Here is a script as an example of Fill-Mask and the output:

from transformers import pipeline

mask_filler = pipeline( "fill-mask", model="johnlockejrr/BEREL_2.0-finetuned-sam-v1" )

text = "ואלין שמהת בני [MASK] דעלו למצרים עם יעקב גבר וביתה עלו" preds = mask_filler(text)

for pred in preds: print(f">>> {pred['sequence']}")

Output:

>>> ואלין שמה ##ת בני יעקב דעלו למצרים עם יעקב גבר ובית ##ה עלו
>>> ואלין שמה ##ת בני ישראל דעלו למצרים עם יעקב גבר ובית ##ה עלו
>>> ואלין שמה ##ת בני לאה דעלו למצרים עם יעקב גבר ובית ##ה עלו
>>> ואלין שמה ##ת בני ראובן דעלו למצרים עם יעקב גבר ובית ##ה עלו
>>> ואלין שמה ##ת בני דן דעלו למצרים עם יעקב גבר ובית ##ה עלו

I was expecting this output:

>>> ואלין שמהת בני יעקב דעלו למצרים עם יעקב גבר וביתה עלו
>>> ואלין שמהת בני ישראל דעלו למצרים עם יעקב גבר וביתה עלו
>>> ואלין שמהת בני לאה דעלו למצרים עם יעקב גבר וביתה עלו
>>> ואלין שמהת בני ראובן דעלו למצרים עם יעקב גבר וביתה עלו
>>> ואלין שמהת בני דן דעלו למצרים עם יעקב גבר וביתה עלו

What am I doing wrong?

Sakthivel R · Accepted Answer · 2024-04-18 07:02:50Z

1

I know this sounds ridiculous, but the output generated by the model seems to be correct. Because the models are dealing with tokens, the continuation of a word will be mentioned using ##. In which means ובית is followed by ה.

Hope, this example helps you better. Here the word lamb has been split into la and ##mb mentioning lamb. This image is from the hugging face documentation for token classification.

You can solve this by writing a simple python code to concatenate the those words. Or try giving a kwarg is_split_into_words=True in the Tokenizer Class. Docs: is_split_into_words

Hope this helps you :)

edited Apr 18 at 7:02

answered Apr 18 at 6:55

Sakthivel R

265 bronze badges

I'm not sure that works. I quote: "We can tell the tokenizer that we’re dealing with ready-split tokens rather than full sentence strings by passing is_split_into_words=True", I'm afraid I'm not trainig a wordlist.
– bsteo
2 hours ago

Add a comment |

Collectives™ on Stack Overflow

BERT MLM model fine-tune on small data bad results

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged
python
huggingface-transformers
bert-language-model
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged pythonhuggingface-transformersbert-language-model or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
python
huggingface-transformers
bert-language-model
or ask your own question.