OCR accuracy regression after RNG revert in 5.5.1: LSTM hallucinates extra characters #4523

@jamie-albert

Current Behavior

After the RNG revert in PR #4357 (commit 5af6cac, included in 5.5.1), Tesseract hallucinates extra characters in its OCR output on certain inputs.

On the attached test image (page3_from_container.png), the first line is output as:

Page 3 0of 3

A spurious 0 is inserted between 3 and of. The rest of the text is correct.

A second image with the same text content (tesseract_repro_page3.png, also attached) does not trigger the bug: both 5.5.0 and 5.5.2 produce correct output on it. The two images differ in DPI and rendering:

| Property | page3_from_container.png (triggers bug) | tesseract_repro_page3.png (no bug) |
| --- | --- | --- |
| DPI | 150 | 96 |
| Dimensions (px) | 1275×1651 | 1275×1650 |
| File size | 31 KB | 23 KB |

The "bad" image came from a PDF-to-PNG container conversion at 150 dpi, which produces slightly different sub-pixel anti-aliasing and glyph spacing. That pixel pattern, combined with the different padding noise generated by the restored Knuth LCG, appears to push the LSTM beam search past the confidence threshold for the spurious 0.

Expected Behavior

Page 3 of 3

Tesseract 5.5.0 produces this correct output on the same image.

Suggested Fix

The root cause appears to be the RNG revert in #4357, which changed TRand from std::minstd_rand back to the Knuth LCG. This changes the padding noise generated by NetworkIO::Randomize() during LSTM inference, which in turn shifts the beam search results.
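To illustrate why swapping the generator changes the noise, here is a minimal standalone sketch. It does not use Tesseract code: `KnuthLcgSketch`, `MinstdSketch`, `SignedNoise` and `CountDifferingDraws` are hypothetical names, the LCG constants are the widely published Knuth MMIX multiplier and increment (assumed, not verified, to match what TRand uses), and the scaling formula is an assumed example of mapping an integer draw to signed noise.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical stand-in for a Knuth-style 64-bit LCG.
// Constants are the MMIX ones; assumed, not checked against TRand.
struct KnuthLcgSketch {
  uint64_t state;
  explicit KnuthLcgSketch(uint64_t seed) : state(seed) {}
  int32_t IntRand() {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    // Top 31 bits of the state, so the result is always in [0, INT32_MAX].
    return static_cast<int32_t>(state >> 33);
  }
};

// Hypothetical stand-in for a generator backed by std::minstd_rand.
struct MinstdSketch {
  std::minstd_rand engine;
  explicit MinstdSketch(uint64_t seed)
      : engine(static_cast<std::minstd_rand::result_type>(seed)) {}
  int32_t IntRand() { return static_cast<int32_t>(engine()); }
};

// Assumed scaling: maps a draw in [0, INT32_MAX] to [-range, range].
inline double SignedNoise(int32_t draw, double range) {
  return range * 2.0 * draw / INT32_MAX - range;
}

// Counts how many of the first n draws differ between the two generators
// when both start from the same seed.
inline int CountDifferingDraws(uint64_t seed, int n) {
  KnuthLcgSketch knuth(seed);
  MinstdSketch minstd(seed);
  int differing = 0;
  for (int i = 0; i < n; ++i) {
    if (knuth.IntRand() != minstd.IntRand()) ++differing;
  }
  return differing;
}
```

With the same seed the two generators produce unrelated streams, so any padding values derived from them differ as well. That is consistent with the regression being deterministic per image and version.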

We understand #4357 was necessary to fix segfaults (#4146, #4148, #4270) caused by IntRand() returning values outside [0, INT32_MAX] under std::minstd_rand. However, the standard eng traineddata was distributed and used with std::minstd_rand for the entire 5.0.1 through 5.5.0 era (roughly 4 years), and the revert introduces accuracy regressions for users of those models.

Could std::minstd_rand be retained but with IntRand() corrected to properly constrain its output range, rather than reverting the entire PRNG?
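A rough sketch of what that could look like, under stated assumptions: `ClampedMinstdRand` and `MatchesRawEngine` are hypothetical names, the method names only mirror the ones mentioned in this issue, and the 31-bit mask is one possible way to constrain the range. Whether this would actually address the segfaults is not established here.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical wrapper: keeps std::minstd_rand as the engine but guarantees
// that IntRand() never leaves [0, INT32_MAX].
class ClampedMinstdRand {
 public:
  void set_seed(uint64_t seed) {
    engine_.seed(static_cast<std::minstd_rand::result_type>(seed));
  }
  int32_t IntRand() {
    // Mask to 31 bits. In-range values pass through unchanged, so the
    // stream stays identical to the raw engine for those values.
    return static_cast<int32_t>(engine_() & 0x7fffffffu);
  }
  double SignedRand(double range) {
    return range * 2.0 * IntRand() / INT32_MAX - range;
  }
  double UnsignedRand(double range) {
    return range * IntRand() / INT32_MAX;
  }

 private:
  std::minstd_rand engine_;
};

// Checks that the wrapper reproduces the raw engine's first n draws.
inline bool MatchesRawEngine(uint64_t seed, int n) {
  ClampedMinstdRand wrapped;
  wrapped.set_seed(seed);
  std::minstd_rand raw(static_cast<std::minstd_rand::result_type>(seed));
  for (int i = 0; i < n; ++i) {
    if (wrapped.IntRand() != static_cast<int32_t>(raw())) return false;
  }
  return true;
}
```

The point of masking instead of rescaling is that the noise sequence seen during inference would stay the same as in 5.5.0 for every value that was already in range. Note that the C++ standard specifies std::minstd_rand output as lying in [1, 2147483646], so on a conforming library the mask should be a no-op; if the out-of-range values came from somewhere else (for example seeding or a later arithmetic step), the fix would need to go there instead.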

tesseract -v

tesseract 5.5.2
 leptonica-1.87.0
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.1.3) : libpng 1.6.55 : libtiff 4.7.1 : zlib 1.3.1.2-audit : libwebp 1.6.0 : libopenjp2 2.5.4
 Found NEON
 Found libarchive 3.8.5 zlib/1.3.2 liblzma/5.8.2 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.7 expat/expat_2.7.4 openssl/3.6.0 libb2/bundled libacl/2.3.2 libattr/2.3.2
 Found libcurl/8.18.0-DEV OpenSSL/3.6.1 zlib/1.3.2 brotli/1.2.0 libpsl/0.21.5 nghttp2/1.68.0 nghttp3/1.15.0 mit-krb5/1.22.2 OpenLDAP/2.6.12

Operating System

Linux (Wolfi-based container image)

uname -a

Linux 4720c2ceefca 6.12.69-linuxkit #1 SMP Mon Feb 16 11:19:06 UTC 2026 aarch64 Linux

Compiler

GCC 15.2.0

CPU

Reproduced on both aarch64 (Apple Silicon M-series via Docker) and x86_64.

Virtualization / Containers

Docker container (linuxkit 6.12.69)

Other Information

Comparison across versions:

| Tesseract version | page3_from_container.png output | tesseract_repro_page3.png output |
| --- | --- | --- |
| 5.5.0 | Page 3 of 3 (correct) | Page 3 of 3 (correct) |
| 5.5.2 | Page 3 0of 3 (bug) | Page 3 of 3 (correct) |

Both test images and output text files are attached.

The bug is deterministic for a given image+version combination but input-dependent. The triggering factor appears to be DPI. Images rendered at 150 dpi (e.g., from PDF-to-PNG conversion) hit it, while the same content at 96 dpi does not. This suggests the LSTM is sensitive to the interaction between glyph rendering at certain resolutions and the specific padding noise pattern produced by the restored Knuth LCG.

Attachments: tesseract-rng-regression.zip -> page3_from_container.png, tesseract_repro_page3.png, output_bad.txt, output_good.txt
