My current task is to classify the association between CVEs and CWEs. I've noticed that when I first continue pretraining on additional CVE-related descriptions and then fine-tune from the resulting model.pt, the accuracy is lower than when I fine-tune directly from BertModel.from_pretrained('bert-base-uncased'). I don't understand why this is happening, as I have ruled out compatibility issues with the model. Note that I only carry the pretrained model weights over into fine-tuning; the tokenizer is consistently BertTokenizer.from_pretrained('bert-base-uncased'). I did not retrain or expand the tokenizer during pretraining because that would be very time-consuming.
For pretraining, I am using data obtained from the NVD: around 210,000 CVE entries from the years 2000 to 2023. I clean the descriptions with spaCy:
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    # Drop stop words and punctuation, keep the remaining token text
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)
import pandas as pd

def read_data(file_path, column_name):
    # Try common encodings in turn (note: 'ISO-8859-1' and 'latin1' are the same codec)
    try:
        df = pd.read_csv(file_path, encoding='utf-8')
    except UnicodeDecodeError:
        try:
            df = pd.read_csv(file_path, encoding='ISO-8859-1')
        except UnicodeDecodeError:
            df = pd.read_csv(file_path, encoding='latin1')
    print(f"file_name: {file_path}")
    print("columns_name:", df.columns)
    print("len:", len(df))
    texts = df[column_name].dropna().tolist()
    cleaned_texts = [clean_text(text) for text in texts]
    return cleaned_texts
Here are the hyperparameters I am using:
batch_size = 16
num_epochs = 10
learning_rate = 1e-4
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01
total_steps = num_epochs * len(train_loader)
warmup_steps = total_steps // 10
early_stopping_patience = 2
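For reference, this is roughly how I wire those values together: AdamW with the betas and weight decay above, plus a linear warmup/decay schedule. This is a minimal sketch; the model and steps_per_epoch are stand-ins, since my real model and train_loader are not shown here, and I approximate the warmup schedule with a plain LambdaLR.

import torch

model = torch.nn.Linear(768, 2)   # stand-in for the BERT model
steps_per_epoch = 100             # stand-in for len(train_loader)

batch_size = 16
num_epochs = 10
learning_rate = 1e-4

optimizer = torch.optim.AdamW(
    model.parameters(), lr=learning_rate, betas=(0.9, 0.99), weight_decay=0.01)

total_steps = num_epochs * steps_per_epoch
warmup_steps = total_steps // 10  # 10% linear warmup

def lr_lambda(step):
    # Linear warmup to the peak LR, then linear decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)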
Additionally, the settings for masked language modeling (MLM) are:
mask_prob=0.15
replace_mask_prob=0.8
random_replace_prob=0.10
keep_original_prob=0.10
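These settings follow the standard BERT masking recipe: 15% of tokens are selected, and of those, 80% become [MASK], 10% become a random token, and 10% are kept unchanged. A minimal sketch of how such masking can be applied (the function name and special-token ids are illustrative, input_ids is modified in place, and 103 is the [MASK] id for bert-base-uncased):

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mask_prob=0.15, replace_mask_prob=0.8,
                random_replace_prob=0.10,
                special_ids=(0, 101, 102)):  # [PAD], [CLS], [SEP]
    labels = input_ids.clone()

    # Select mask_prob of the non-special positions
    prob = torch.full(labels.shape, mask_prob)
    special = torch.zeros_like(labels, dtype=torch.bool)
    for sid in special_ids:
        special |= labels == sid
    prob.masked_fill_(special, 0.0)
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100  # loss is computed only on masked positions

    # replace_mask_prob of the masked positions -> [MASK]
    use_mask = torch.bernoulli(torch.full(labels.shape, replace_mask_prob)).bool() & masked
    input_ids[use_mask] = mask_token_id

    # random_replace_prob overall -> random token (so half of the remainder)
    rand_frac = random_replace_prob / (1.0 - replace_mask_prob)
    use_rand = torch.bernoulli(torch.full(labels.shape, rand_frac)).bool() & masked & ~use_mask
    random_tokens = torch.randint(vocab_size, labels.shape)
    input_ids[use_rand] = random_tokens[use_rand]

    # remaining masked positions (keep_original_prob) keep their original token
    return input_ids, labels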
I hope someone can answer my question. If more detailed pretrain/fine-tune code is needed, I can provide it. Thank you.
I have experimented with various combinations of hyperparameters, and the current settings are the best I have found. However, performance still falls slightly short of the no-pretraining baseline. The image shows the training loss and validation loss during my pretraining phase: the validation loss reaches its lowest point at epoch 1 and then increases from epoch 2 onwards.
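With early_stopping_patience = 2, my stopping rule is simply "stop after two consecutive epochs without a new best validation loss", which is why the run ends shortly after the epoch-1 minimum. A minimal sketch of that rule (the function name is illustrative):

def early_stop_epochs(val_losses, patience=2):
    """Given per-epoch validation losses, return (last epoch trained,
    best epoch), both 0-based."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch, best_epoch  # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch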