My current task is to classify the association between CVEs and CWEs. However, I've noticed that fine-tuning directly from BertModel.from_pretrained('bert-base-uncased') gives higher accuracy than first continuing pretraining on additional CVE-related descriptions and then fine-tuning from those pretrained model weights. I don't understand why this is happening, as I have ruled out compatibility issues with the model. It's worth mentioning that only the model weights from the pretraining phase are carried into fine-tuning; the tokenizer is always BertTokenizer.from_pretrained('bert-base-uncased'). I did not retrain or expand the tokenizer during pretraining because that is very time-consuming.
For pretraining, I am using data obtained from the NVD, encompassing around 210,000 CVE entries from 2000 to 2023. I also clean the descriptions with spaCy, removing stop words and punctuation:

    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def clean_text(text):
        # Drop stop words and punctuation, then rejoin the remaining tokens.
        doc = nlp(text)
        tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
        return " ".join(tokens)

    def read_data(file_path, column_name):
        # Try UTF-8 first. ISO-8859-1 (also known as latin1) accepts any byte
        # sequence, so a single fallback is enough.
        try:
            df = pd.read_csv(file_path, encoding='utf-8')
        except UnicodeDecodeError:
            df = pd.read_csv(file_path, encoding='ISO-8859-1')

        print("column_names:", df.columns)
        print("len:", len(df))
        texts = df[column_name].dropna().tolist()
        cleaned_texts = [clean_text(text) for text in texts]
        return cleaned_texts

Here are the hyperparameters I am using:

batch_size = 16 
num_epochs = 10 
learning_rate = 1e-4 
beta1 = 0.9 
beta2 = 0.99 
weight_decay = 0.01 
total_steps = num_epochs * len(train_loader) 
warmup_steps = total_steps // 10 
early_stopping_patience = 2
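
As a rough illustration of what the schedule values above imply, here is a minimal sketch of linear warmup followed by linear decay to zero, the shape produced by schedulers such as get_linear_schedule_with_warmup in transformers. That the decay after warmup is linear is an assumption made for illustration, linear_warmup_lr is a hypothetical helper, and steps_per_epoch = 1000 is a made-up stand-in for len(train_loader).

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical stand-in for len(train_loader); the real value is not given.
steps_per_epoch = 1000
num_epochs = 10
learning_rate = 1e-4

total_steps = num_epochs * steps_per_epoch
warmup_steps = total_steps // 10

# Print the learning rate at a few representative points of the run.
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_lr(step, learning_rate, warmup_steps, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With num_epochs = 10 and warmup_steps = total_steps // 10, the warmup covers exactly the first epoch, so under this schedule the learning rate reaches its peak of 1e-4 at the end of epoch 1.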

Additionally, the settings for masked language modeling (MLM) are:
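
As a reference point for what such settings control, the sketch below implements the standard masking rule from the original BERT recipe: 15% of tokens are selected for prediction, and of those 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. These percentages are the usual BERT defaults and are not necessarily the settings used in this run; mask_tokens and the toy vocabulary are hypothetical names introduced only for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=None):
    """Apply BERT-style masking; return (masked_tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss would be computed only on the selected positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        # Leave most tokens untouched and unlabeled.
        if rng.random() >= mlm_probability:
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            masked.append(token)               # 10%: keep the original token
    return masked, labels


# Toy example; a real pipeline would operate on tokenizer IDs, not words.
tokens = "buffer overflow in the parser allows remote attackers to execute arbitrary code".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```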


I hope someone can answer my question. If more detailed pretrain/fine-tune code is needed, I can provide it. Thank you.

I have experimented with various combinations of hyperparameters, and the current settings gave the best results. However, the performance still falls slightly short of fine-tuning directly from bert-base-uncased without the additional pre-training. The image shows the training loss and validation loss during my pre-training phase. The validation loss reaches its lowest point at epoch 1 and then increases from the next epoch onwards.
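
With early_stopping_patience = 2 and a validation loss that bottoms out at epoch 1, it matters which checkpoint is carried into fine-tuning: the weights after the final epoch are not the ones with the lowest validation loss. The sketch below shows one common way to track the best epoch. The EarlyStopping class and the loss values are hypothetical; they only mimic the shape of the curve described above and are not taken from the actual training code.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0  # a real loop would also save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: lowest at epoch 1, rising afterwards.
val_losses = [1.80, 1.85, 1.93, 2.01]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped after epoch: {stopped_at}")
# prints "best epoch: 1, stopped after epoch: 3"
```

In this toy run, training stops after epoch 3, while the checkpoint with the lowest validation loss is the one from epoch 1.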

