I am building an NER model with two labels: ACQUIREE_COMPANY and ACQUIROR_COMPANY. The training data comes from press releases announcing mergers and acquisitions, and I annotated roughly 18,000 examples using ChatGPT-4. I trained the model with Prodigy using an 80/20 train/eval split, once with en_core_web_lg as the base model and once without a base model. I am not getting above roughly 73% F-score with the base model and 67% without one.
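For context, the train/eval split was a simple shuffled 80/20 partition of the annotated examples, along these lines (a minimal sketch; the placeholder dicts stand in for the JSONL-style records Prodigy exports, not my actual data):

```python
import random

# Placeholder records in the JSONL-style shape Prodigy exports;
# the real dataset has roughly 18,000 annotated press-release examples.
examples = [{"text": f"Press release {i}", "spans": []} for i in range(100)]

# Deterministic shuffle so the split is reproducible across runs.
rng = random.Random(0)
rng.shuffle(examples)

# 80% for training, the remaining 20% held out for evaluation.
split = int(len(examples) * 0.8)
train, dev = examples[:split], examples[split:]

print(len(train), len(dev))  # 80 20
```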

The stats for the training run without a base model were:

E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
...
  0    3400        236.99    870.02   67.13   68.91   65.44    0.67
...
  0    4600      31159.08    946.75   67.07   73.73   61.52    0.67
...
  0    5000        581.26    919.93   64.44   62.43   66.58    0.64
✔ Saved pipeline to output directory

The stats for the training run with en_core_web_lg as base model were:

E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SPEED   SCORE
---  ------  ------------  --------  ------  ------  ------  ------  ------
...
  3   19000          0.00   3564.10   72.53   74.67   70.50  6875.54    0.73
  3   20000          0.00   3647.85   72.67   74.46   70.96  7190.40    0.73
...
  5   25000          0.00   3639.24   72.75   74.55   71.03  7433.97    0.73
  5   26000          0.00   3409.12   72.74   74.67   70.91  7425.77    0.73
✔ Saved pipeline to output directory

Some guidance on how to improve this accuracy would be greatly appreciated.

Thanks.