C# OCR (Optical Character Recognition)

OCR, as the title says, stands for Optical Character Recognition: the ability to extract characters as they appear in an image.


We will be using the MODI (Microsoft Office Document Imaging) type library, which .NET code accesses through COM Interop.


The MODI library is available within the Microsoft Office suites (2003 to 2007). Unfortunately, it is not included in the 2010 version.




Add a reference to the MODI type library (COM Interop) and convert image(s) to text like this:
 
using MODI;
using System;
 
class Program
{
    static void Main(string[] args)
    {
        // Load the source image; this example uses a .tiff file.
        DocumentClass myDoc = new DocumentClass();
        myDoc.Create(@"theDocumentName.tiff");

        // Run English OCR with auto-orientation and straightening enabled.
        myDoc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);
 
        // Each page of the document is an Image; print its recognized text.
        foreach (Image anImage in myDoc.Images)
        {
            Console.WriteLine(anImage.Layout.Text);
        }
    }
}
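
The example above assumes the OCR call always succeeds. A slightly more defensive variant might look like the sketch below. It is only a sketch under a few assumptions: that your interop assembly lets you create the document through the Document interface (useful when the reference is added with embedded interop types, where DocumentClass cannot be instantiated), that a failed recognition surfaces as a COMException, and that Close(false) closes the document without saving. The class name SafeOcr and the command-line argument handling are made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;
using MODI;

class SafeOcr
{
    static void Main(string[] args)
    {
        // Hypothetical input handling: take the path from the first argument,
        // otherwise fall back to the file name used in the post.
        string path = args.Length > 0 ? args[0] : @"theDocumentName.tiff";

        Document doc = new Document();
        try
        {
            doc.Create(path);
            doc.OCR(MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Qualify Image explicitly so it cannot be confused with
            // System.Drawing.Image if that namespace is imported later.
            foreach (MODI.Image page in doc.Images)
            {
                Console.WriteLine(page.Layout.Text);
            }

            // Close without saving changes back to the source file.
            doc.Close(false);
        }
        catch (COMException ex)
        {
            // Assumption: OCR problems (for example, no recognizable text)
            // are reported through COM as a COMException.
            Console.Error.WriteLine("OCR failed: " + ex.Message);
        }
        finally
        {
            // Release the underlying COM object so the file is not kept locked.
            Marshal.ReleaseComObject(doc);
        }
    }
}
```

Run it with the path to a TIFF file as the only argument. If recognition fails, the error message goes to standard error instead of crashing the program.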






Leave me a comment if you need help with it.

6 comments:

  1. MODI is no longer packaged with MS Office starting with MS Office 2010. What is the alternative solution for now?

  2. @subi, Office 2010 has MODI http://support.microsoft.com/kb/982760. Don't lie..

  3. I want to convert images of Arabic-language text to text. MODI can't convert this. Is there any solution that doesn't need a third-party tool?

    Replies
    1. But MODI can convert it into English, Hindi, Gujarati and even Sanskrit..But MODI can convert PAKISTAN To INDIA....Are not these enough for you, PAK PM nawaz shareef...

  4. How do I use a URL as the image source?
    Please help!

  5. I really enjoyed reading your post. Well-written and insightful articles like yours are well worth my time. Thanks for your efforts. The scanned images of both Arabic and English documents can now be converted into fully searchable and editable text files. The accuracy and reliability of RDI's OCR engine, as well as our character recognition software, make this possible.
