How to preprocess a CAPTCHA to improve text recognition accuracy with Tesseract OCR


I’m trying to improve text recognition accuracy using Tesseract OCR by preprocessing images with Sharp, but I’m still struggling to get accurate results. The text in the images often contains noise or distortion, and despite various preprocessing attempts, Tesseract is not able to recognize the text correctly.

import sharp from 'sharp';
import Tesseract from 'tesseract.js';
import path from 'path';
import { fileURLToPath } from 'url';

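// Preprocess the image with Sharp and write the result to processedImagePath
async function processImage(imagePath, processedImagePath) {
    try {
        await sharp(imagePath)
            .resize(800)      // scale to a width of 800 px
            .grayscale()      // drop color information
            .normalize()      // stretch contrast to the full range
            .threshold(228)   // binarize: pixels >= 228 become white
            .sharpen()        // sharpen edges
            .toFile(processedImagePath);
    } catch (error) {
        console.error("Error processing image:", error);
    }
}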

// Run Tesseract OCR on the image and log the recognized text
async function recognizeText(imagePath) {
    try {
        const { data: { text } } = await Tesseract.recognize(
            imagePath,
            "eng", // language of the trained data
            {
                logger: info => console.log(info) // log progress updates
            }
        );
        console.log("Text:", text);
    } catch (error) {
        console.error("Error recognizing text:", error);
    }
}

// Paths for the image files
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const tempImagePath = path.join(__dirname, "downloaded-image.jpg");
const processedImagePath = path.join(__dirname, "processed-captcha.png");

// Process and recognize the image
await processImage(tempImagePath, processedImagePath);
await recognizeText(processedImagePath);

This is the original image:

What I’ve Tried:

  • Grayscale Conversion: I converted the image to grayscale to simplify the color channels.

  • Normalization: I normalized the image to enhance contrast.

  • Binarization: I applied a threshold to binarize the image.

  • Sharpening: I sharpened the image to make the text more distinct.
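
For comparison with the binarization step, the hard-coded threshold of 228 could be replaced by a value derived from the image itself. The following is only a minimal sketch of Otsu's method, not something taken from the code above: the function name `otsuThreshold` is hypothetical, and it assumes the input is a flat array (or Buffer) of single-channel 8-bit grayscale values, such as what Sharp's `.raw().toBuffer()` would be expected to return after `.grayscale()`.

```javascript
// Otsu's method: choose the threshold that maximizes the between-class
// variance of an 8-bit grayscale image.
// `pixels` is assumed to be an array-like of integers in the range 0..255.
function otsuThreshold(pixels) {
    // Build a 256-bin histogram of the pixel values
    const histogram = new Array(256).fill(0);
    for (const value of pixels) {
        histogram[value]++;
    }

    const total = pixels.length;
    let sumAll = 0;
    for (let i = 0; i < 256; i++) {
        sumAll += i * histogram[i];
    }

    let sumDark = 0;       // sum of values in the dark class (<= t)
    let weightDark = 0;    // number of pixels in the dark class
    let bestVariance = -1;
    let bestThreshold = 0;

    for (let t = 0; t < 256; t++) {
        weightDark += histogram[t];
        if (weightDark === 0) continue;      // no dark pixels yet
        const weightLight = total - weightDark;
        if (weightLight === 0) break;        // all pixels are in the dark class

        sumDark += t * histogram[t];
        const meanDark = sumDark / weightDark;
        const meanLight = (sumAll - sumDark) / weightLight;
        const variance = weightDark * weightLight * (meanDark - meanLight) ** 2;

        if (variance > bestVariance) {
            bestVariance = variance;
            bestThreshold = t;
        }
    }
    // Pixels <= bestThreshold belong to the dark class
    return bestThreshold;
}

// Small bimodal sample: four dark values and four light values
const sample = [10, 10, 10, 12, 200, 200, 205, 210];
console.log("Otsu threshold:", otsuThreshold(sample));
```

Because Sharp's `threshold()` sets pixels greater than or equal to the given value to white, the value passed to it would be `otsuThreshold(pixels) + 1`. Whether this helps on a given CAPTCHA depends on how well the text and the noise separate in the histogram.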

Despite these efforts, the text recognition is still inaccurate. Could anyone suggest additional preprocessing techniques or improvements to this approach? Are there specific settings or methods I should consider to enhance OCR results with Tesseract?

This is the result:

1
