I am getting this error during inference and, after days of debugging, I am desperate and hoping for any help. Thank you!
what(): index out of bounds: 0 <= tmp30 < 1L
Top of the stack trace of the error:
0 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
I am training on
Ubuntu 22.04
nvidia-smi:
NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
env:
cuda 12.1
cudnn 9.1
datasets 2.15.0
transformers 4.36.2
torch 2.4
Document question answering, custom dataset.
Model repo being trained: https://huggingface.co/Sharka/CIVQA_DVQA_LayoutXLM
Tokenizer:
copied from: microsoft/layoutxlm-base on Hugging Face
I am getting this error during evaluation (prior to training), always with the same sample, as far as I can tell:
Code part:
...
outputs = model(input_ids=input_ids, attention_mask=attention_mask,
token_type_ids=token_type_ids, bbox=bbox, image=image,
start_positions=start_positions, end_positions=end_positions)
...
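Because the abort happens inside the compiled kernel and kills the process with SIGABRT, a Python try/except never fires. To narrow down the failing sample, I have been writing each batch to disk just before the forward call (a minimal sketch; `forward_with_breadcrumb` and the pickle format are my own, not part of any library):

```python
import os
import pickle
import tempfile

def forward_with_breadcrumb(step_fn, batch, step, dump_path):
    # Persist the batch *before* the forward pass: since the c10 abort
    # is not catchable from Python, the last file written to disk still
    # identifies the offending sample after the crash.
    with open(dump_path, "wb") as f:
        pickle.dump({"step": step, "batch": batch}, f)
    return step_fn(batch)
```

In the eval loop I call it as `forward_with_breadcrumb(lambda b: model(**b), batch, step, "last_batch.pkl")`; `torch.save` works just as well for tensor batches.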
Error message:
5%|▌ | 5901/109877 [56:31<16:12:41, 1.78it/s]terminate called after throwing an instance of 'c10::Error'
what(): index out of bounds: 0 <= tmp30 < 1L
Exception raised from kernel at /tmp/torchinductor_aiteam/li/cliz2c63uoa3repoiaztoizrjecjxefsfbjltc6wzfp7p6brqesb.cpp:155 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f486a6d0f86 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f486a67fdd9 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x432b (0x7f47aa16032b in /tmp/torchinductor_aiteam/li/cliz2c63uoa3repoiaztoizrjecjxefsfbjltc6wzfp7p6brqesb.so)
frame #3: <unknown function> + 0x16405 (0x7f48b92aa405 in /home/aiteam/miniconda3/envs/hf_layoutLM_test/lib/python3.9/site-packages/torch/lib/libgomp-a34b3233.so.1)
frame #4: <unknown function> + 0x8609 (0x7f48ba6e4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #5: clone + 0x43 (0x7f48ba4af353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called recursively
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1725357799 (unix time) try "date -d @1725357799" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x3ea0001d000) received by PID 118784 (TID 0x7f47ff7ab700) from PID 118784 ***]
I strongly suspect it is something in the data, with that particular sample.
However, I have been debugging for days and cannot see a systematic difference (obviously I am just missing it) between that sample and any other.
My features:
“input_ids” → in range [0, 250002], which is the tokenizer’s vocab
“attention_mask” → in {0, 1}
“start_positions” → 0 for this sample (the subfinder didn’t find the answer in the context)
“end_positions” → 0 for this sample
“bbox” → normalized, all in [0, 1000]
“image” → uint8, all in range [0, 255]
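These ranges can be checked mechanically; this is the per-sample sanity check I am running (a sketch — the function name and exact bounds are my own, derived from the list above):

```python
def validate_sample(sample, vocab_size=250002):
    """Check the invariants listed above; returns a list of violations.
    Bounds are my assumptions, not the model's actual checks."""
    errors = []
    if not all(0 <= i < vocab_size for i in sample["input_ids"]):
        errors.append("input_ids out of vocab range")
    if not all(m in (0, 1) for m in sample["attention_mask"]):
        errors.append("attention_mask not in {0, 1}")
    if not all(0 <= c <= 1000 for box in sample["bbox"] for c in box):
        errors.append("bbox coordinate outside [0, 1000]")
    n = len(sample["input_ids"])
    for key in ("start_positions", "end_positions"):
        if not (0 <= sample[key] < n):
            errors.append(f"{key} outside [0, n)")
    return errors
```

The sample that crashes passes this check, which is why I am stuck.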
Some (naive) direct questions:
- Can the problem stem from the start and end positions being 0?
- For some reason, the tokenizer uses token id == 6 to tokenize the context of that sample. For this tokenizer, id 6 corresponds to “▁” (from tokenizer.json: “vocab”: […, [“▁”, -3.9299705028533936], …]), which decodes to an empty string ''. Why is an empty string represented by this symbol? There is no empty string in the original context to begin with.
- Can the problem stem from using the fast tokenizer while the model config says “LayoutXLMTokenizer”?
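Regarding the first question: from reading the HF QA heads, I believe label positions are clamped into [0, seq_len] before the loss, so a label of 0 (pointing at the CLS token) should be safe by itself. A pure-Python restatement of that clamp pattern (my paraphrase, not the model's actual code):

```python
def clamp_position(pos, seq_len):
    # QA heads clamp labels to [0, ignored_index], where
    # ignored_index == seq_len; CrossEntropyLoss(ignore_index=seq_len)
    # then skips the clamped-out labels. 0 itself is a valid target.
    ignored_index = seq_len
    return max(0, min(pos, ignored_index))
```

So if 0 were the trigger, I would expect it to fail on many samples, not just this one.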
Any idea appreciated! Thank you in advance!