Description
Current Behavior
Tesseract fails to run when a provided text file (containing a list of images to process) includes file paths enclosed in double quotes.
C:\Users\eduardo.oliveira\Downloads>tesseract C:\Users\eduardo.oliveira\Downloads\tesseract.txt output-prefix -l eng pdf
Leptonica Error in fopenReadStream: file not found: "10.png"
Leptonica Error in pixRead: image file not found: "10.png"
Image file "10.png" cannot be read!
Error during processing.
Failing Examples:
Absolute path (Fails):
"C:\Users\eduardo.oliveira\Downloads\10.png"
Relative path (Fails):
"10.png"
Working Examples:
Tesseract processes the files correctly if the quotes are manually removed, even if the path contains spaces.
Relative path without quotes (Works):
10.png
Absolute path without quotes (Works):
C:\Users\eduardo.oliveira\Downloads\10.png
Absolute path with spaces, without quotes (Works):
C:\Users\eduardo.oliveira\Downloads\1 0.png
Expected Behavior
Tesseract should accept quoted file paths in the input text file: it should strip the surrounding double quotes and process the file as usual. Quoting paths is standard practice on most operating systems, particularly for paths that contain spaces, and some clipboard shortcuts add the quotes automatically.
Suggested Fix
It seems this might be a regression in Leptonica. However, as a robust workaround within Tesseract itself, it would be highly beneficial to automatically strip leading and trailing double quotes (") from image names/paths during the text file parsing phase before passing them along to Leptonica.
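The stripping step could look roughly like the sketch below. `StripSurroundingQuotes` is a hypothetical helper name, not an existing Tesseract or Leptonica function, and I have not checked where exactly the list file is parsed, so treat this as an illustration of the idea rather than a patch. It also trims surrounding whitespace and stray CR/LF, which is an assumption on my part about what the parser should tolerate.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of Tesseract's API): returns `line` with
// surrounding whitespace trimmed and one pair of enclosing double quotes
// removed. Quotes that are not paired at both ends are left untouched.
std::string StripSurroundingQuotes(std::string_view line) {
  // Trim spaces, tabs and line-ending characters on both sides.
  constexpr std::string_view kWhitespace = " \t\r\n";
  const auto first = line.find_first_not_of(kWhitespace);
  if (first == std::string_view::npos) {
    return std::string();  // Blank line: nothing to process.
  }
  const auto last = line.find_last_not_of(kWhitespace);
  line = line.substr(first, last - first + 1);

  // Remove exactly one pair of enclosing quotes, if present.
  if (line.size() >= 2 && line.front() == '"' && line.back() == '"') {
    line = line.substr(1, line.size() - 2);
  }
  return std::string(line);
}
```

With this, a list entry such as `"10.png"` would be passed on as `10.png`, while unquoted entries (including ones with spaces) are unchanged. On Windows a double quote cannot appear in a file name, so stripping should be safe there; on other systems a file name may legitimately start and end with a quote, so maintainers may want to make this Windows-only or fall back to the stripped name only if the quoted one is not found.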
tesseract -v
tesseract v5.4.0.20240606
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
Operating System
Windows 11
Other Operating System
No response
uname -a
No response
Compiler
No response
CPU
12th Gen Intel(R) Core(TM) i5-1235U (1.30 GHz)
Virtualization / Containers
No response
Other Information
This bug significantly impacts Windows users because selecting multiple files in Windows Explorer and clicking "Copy as path" automatically wraps the absolute paths in quotes. Manually removing these quotes from large lists of files is a tedious extra step.