Text chunking with tokenization and spaces (some words were dropped)
I’m trying to break a text into chunks using a tokenization-aware approach that attempts to split the text at spaces or endlines when possible. The goal is to avoid breaking words or, if feasible, lines. However, I’m encountering an issue where some words are missing in the final output, particularly when the size of the text chunk equals the maximal chunk size.
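For context, here is a minimal sketch of the kind of tokenization-aware, word-boundary chunking described above (the function name, the greedy packing strategy, and the use of a Hugging Face tokenizer are assumptions for illustration, not the asker's actual code). It also shows the spot where words are commonly lost: the word that overflows a full chunk has to be carried into the next chunk rather than discarded.

```python
from transformers import AutoTokenizer

def chunk_text(text: str, tokenizer, max_tokens: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks whose
    token count stays within max_tokens (hypothetical sketch)."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(tokenizer.encode(candidate, add_special_tokens=False)) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)   # close the chunk that is full
            current = word               # carry the overflowing word forward,
                                         # otherwise it is silently dropped
    if current:
        chunks.append(current)           # flush the final, possibly short, chunk
    return chunks

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(chunk_text("Some long text to be split into chunks ...", tokenizer, max_tokens=32))
```

This sketch only splits at whitespace and does not prefer line breaks over spaces; it is meant to illustrate the approach, not replace the original implementation.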
BertWordPieceTokenizer ignores multi-character Chinese vocabulary entries from vocab.txt during encoding
I use BertWordPieceTokenizer (source) and am trying to train the tokenizer with a predefined vocabulary list. Each line of the list contains 1–5 Chinese characters, and the first 1000 lines are special tokens such as [PAD] and [CLS].
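To make the setup easier to reproduce, here is a minimal sketch of loading BertWordPieceTokenizer from a predefined vocab.txt and encoding a Chinese string (the file path and the sample sentence are placeholders, not taken from the question):

```python
from tokenizers import BertWordPieceTokenizer

# Load the predefined vocabulary; "vocab.txt" is a placeholder path.
# lowercase is disabled so vocabulary entries are not altered.
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=False)

# Encode a sample Chinese sentence; the multi-character vocab entries
# are the ones reported as being ignored in the output.
encoding = tokenizer.encode("今天天气很好")
print(encoding.tokens)
print(encoding.ids)
```

Note that the constructor's `handle_chinese_chars` option is left at its default here; it controls whether CJK text is pre-split into single characters before WordPiece is applied.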