Description
Current Behavior
After the RNG revert in PR #4357 (commit 5af6cac, included in 5.5.1), tesseract hallucinates extra characters in OCR output on certain inputs.
On the attached test image (page3_from_container.png), the first line is output as:
Page 3 0of 3
A spurious 0 is inserted between 3 and of. The rest of the text is correct.
A second image with the same text content (tesseract_repro_page3.png, also attached) does not trigger the bug: both 5.5.0 and 5.5.2 produce correct output on it. The two images differ in DPI and rendering:
| Property | page3_from_container.png (triggers bug) | tesseract_repro_page3.png (no bug) |
|---|---|---|
| DPI | 150 | 96 |
| Dimensions | 1275×1651 | 1275×1650 |
| File size | 31 KB | 23 KB |
The "bad" image came from a PDF-to-PNG container conversion at 150 dpi, which produces slightly different sub-pixel anti-aliasing and glyph spacing. That pixel pattern, combined with the changed padding noise from the Knuth LCG, appears to push the LSTM beam search past the confidence threshold for the spurious 0.
Expected Behavior
Page 3 of 3
Tesseract 5.5.0 produces this correct output on the same image.
Suggested Fix
The root cause is the RNG revert in #4357, which changed TRand from std::minstd_rand back to the Knuth LCG. This changes the padding noise generated by NetworkIO::Randomize() during LSTM inference, which in turn shifts beam search results.
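To illustrate why swapping the generator changes the noise, here is a minimal sketch comparing the two kinds of PRNG. It is not Tesseract's code: KnuthLcgSketch, FirstMinstdDraw and ToSignedNoise are hypothetical names, and the sketch assumes the restored generator is a 64-bit LCG using Knuth's MMIX constants that returns the top 31 bits of its state. The scaling in ToSignedNoise is likewise an assumed mapping from a draw to a noise value.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical stand-in for a Knuth-style 64-bit LCG (MMIX constants).
// Assumption: this mirrors the restored generator; it is not copied from Tesseract.
struct KnuthLcgSketch {
  uint64_t state;
  explicit KnuthLcgSketch(uint64_t seed) : state(seed) {}
  int32_t Next() {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    // Keep the top 31 bits, so the result lies in [0, INT32_MAX].
    return static_cast<int32_t>(state >> 33);
  }
};

// First draw of std::minstd_rand for a given seed.
// The C++ standard fixes its output range to [1, 2147483646].
inline int32_t FirstMinstdDraw(uint64_t seed) {
  std::minstd_rand engine(static_cast<std::minstd_rand::result_type>(seed));
  return static_cast<int32_t>(engine());
}

// Assumed mapping from a non-negative draw to noise in [-range, range].
inline double ToSignedNoise(int32_t draw, double range) {
  assert(draw >= 0);
  return range * 2.0 * draw / INT32_MAX - range;
}
```

With the same seed, the two generators yield unrelated draws, so any noise derived from them differs as well. That is enough to perturb inputs that sit close to a decision boundary.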
We understand #4357 was necessary to fix segfaults (#4146, #4148, #4270) caused by IntRand() returning values outside [0, INT32_MAX] under std::minstd_rand. However, the standard eng traineddata was distributed and used with std::minstd_rand for the entire 5.0.1 through 5.5.0 era (roughly 4 years), and the revert introduces accuracy regressions for users of those models.
Could std::minstd_rand be retained but with IntRand() corrected to properly constrain its output range, rather than reverting the entire PRNG?
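A rough sketch of what that could look like follows. RangeCheckedRand is a hypothetical class, not the actual TRand, and the member names are only modeled on the ones mentioned above. The idea is to keep std::minstd_rand as the engine and mask the result to 31 bits before narrowing. Since the standard specifies the engine's range as [1, 2147483646], the mask is expected to be a no-op that leaves the sequence unchanged while making the [0, INT32_MAX] guarantee explicit. Whether this alone would address the segfaults is not established here.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical wrapper (not Tesseract's TRand): keeps std::minstd_rand
// and explicitly constrains IntRand() to [0, INT32_MAX].
class RangeCheckedRand {
public:
  void set_seed(uint64_t seed) {
    engine_.seed(static_cast<std::minstd_rand::result_type>(seed));
  }

  // Returns a value in [0, INT32_MAX].
  int32_t IntRand() {
    const auto raw = engine_();
    // Mask to 31 bits before narrowing; a no-op for minstd_rand's
    // documented range, but it keeps the bound independent of the engine.
    const auto masked = raw & 0x7FFFFFFFu;
    assert(masked <= static_cast<std::uint_fast32_t>(INT32_MAX));
    return static_cast<int32_t>(masked);
  }

  // Returns a value in [-range, range].
  double SignedRand(double range) {
    return range * 2.0 * IntRand() / INT32_MAX - range;
  }

  // Returns a value in [0, range].
  double UnsignedRand(double range) {
    return range * IntRand() / INT32_MAX;
  }

private:
  std::minstd_rand engine_{1};
};
```

Because the mask does not alter any value std::minstd_rand can produce, a generator like this should emit the same sequence as the unwrapped engine for a given seed, which is the property that matters for models used during the 5.0.1 through 5.5.0 releases.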
tesseract -v
tesseract 5.5.2
leptonica-1.87.0
libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.1.3) : libpng 1.6.55 : libtiff 4.7.1 : zlib 1.3.1.2-audit : libwebp 1.6.0 : libopenjp2 2.5.4
Found NEON
Found libarchive 3.8.5 zlib/1.3.2 liblzma/5.8.2 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.7 expat/expat_2.7.4 openssl/3.6.0 libb2/bundled libacl/2.3.2 libattr/2.3.2
Found libcurl/8.18.0-DEV OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 libpsl/0.21.5 nghttp2/1.68.0 nghttp3/1.15.0 mit-krb5/1.22.2 OpenLDAP/2.6.12
Operating System
Linux (Wolfi-based container image)
uname -a
Linux 4720c2ceefca 6.12.69-linuxkit #1 SMP Mon Feb 16 11:19:06 UTC 2026 aarch64 Linux
Compiler
GCC 15.2.0
CPU
Reproduced on both aarch64 (Apple Silicon M-series via Docker) and x86_64.
Virtualization / Containers
Docker container (linuxkit 6.12.69)
Other Information
Comparison across versions:
| tesseract version | page3_from_container.png output | tesseract_repro_page3.png output |
|---|---|---|
| 5.5.0 | Page 3 of 3 (correct) | Page 3 of 3 (correct) |
| 5.5.2 | Page 3 0of 3 (bug) | Page 3 of 3 (correct) |
Both test images and output text files are attached.
The bug is deterministic for a given image and version combination, but input-dependent. The triggering factor appears to be DPI: images rendered at 150 dpi (e.g., from PDF-to-PNG conversion) hit it, while the same content at 96 dpi does not. This suggests the LSTM is sensitive to the interaction between glyph rendering at certain resolutions and the specific padding noise pattern produced by the restored Knuth LCG.
Attachments: tesseract-rng-regression.zip -> page3_from_container.png, tesseract_repro_page3.png, output_bad.txt, output_good.txt