The Tesseract OCR Open Source Project

Author: admin  Category: Screen Word Capture  Published: 2013-04-18 12:56

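Recently a project of mine needed image-recognition-based CAPTCHA solving, so I did some initial experiments with the open source Tesseract OCR project. The project describes itself as follows: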

This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado.

The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.

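Tesseract is an open source optical character recognition (OCR) project that can be used to recognize CAPTCHA images. For example, given a picture of text in TIF format, Tesseract recognizes the text in the picture and writes it to a text file, and the results are quite good. To recognize text in other languages, you need to download the corresponding language packs so that Tesseract can handle images in those languages.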

The Tesseract project lives at http://code.google.com/p/tesseract-ocr/, where you can download the release packages or find more information.

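I downloaded the fairly recent version 2.04 from http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz. I am not sure whether my network was at fault and lost data during the download, or whether something else went wrong, but following the instructions on the project site did not give me a working installation. After carefully reading the documentation and the FAQ on the project site, I finally found the cause. Here is a brief record of the setup process.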

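The downloaded archive is tesseract-2.04.tar.gz. I unpacked it as root in the /root directory on Fedora Core 7 Linux, which produced the directory tesseract-2.04. That directory contains many files and is rather cluttered. The installation steps follow: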

1. Compiling Tesseract

After unpacking tesseract-2.04.tar.gz, the files under the tesseract-2.04 directory appear to be all read-only, so the file permissions need to be changed first:

[root@bogon tesseract-2.04]# chmod 777 -R *

Then run the following three commands with their defaults to configure, compile, and install:

[root@bogon tesseract-2.04]# ./configure
[root@bogon tesseract-2.04]# make
[root@bogon tesseract-2.04]# make install

This may take a little while to complete.

2. Configuring the language pack

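By default the step above installs the language data into /usr/local/share/tessdata. Check that directory first: if the files in it (not counting the configs and tessconfigs directories) are all 0 bytes, something has gone wrong, and starting the Tesseract OCR engine produces the following error: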

Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset

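This is bound to fail: the file /usr/local/share/tessdata/eng.unicharset is empty and cannot be loaded. Next, check /root/tesseract-2.04/tessdata in the same way. If the files there (again not counting the configs and tessconfigs directories) are also all 0 bytes, the language data has to be downloaded separately. My guess is that the files in /usr/local/share/tessdata are empty because the files in /root/tesseract-2.04/tessdata were already invalid, so the install step simply copied empty files into /usr/local/share/tessdata.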
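The check for 0-byte files can be done with a single find command rather than by eye. The following is a minimal sketch, assuming GNU find (as shipped with Fedora); the function name list_empty_files is my own and not part of Tesseract.

```shell
# List regular files directly inside the directory "$1" that are 0 bytes long.
# -maxdepth 1 keeps find out of subdirectories such as configs and tessconfigs.
list_empty_files() {
    find "$1" -maxdepth 1 -type f -size 0
}
```

For example, `list_empty_files /usr/local/share/tessdata` prints one path per empty file; no output means no empty files were found at the top level of that directory.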

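So download the language pack separately: http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz unpacks into a directory named tessdata. Copy the 8 non-empty files in it into /usr/local/share/tessdata, overwriting the empty files there, and the problem is solved.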
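The copy step can be scripted so that only non-empty files are copied. This is a rough sketch under my own assumptions: the function name copy_nonempty_files is hypothetical, and it only looks at the top level of the source directory.

```shell
# Copy every non-empty regular file from the top level of SRC ($1) into
# DEST ($2), overwriting files of the same name. Empty files are skipped.
copy_nonempty_files() {
    src=$1
    dest=$2
    for f in "$src"/*; do
        # -f: regular file, -s: size greater than zero
        if [ -f "$f" ] && [ -s "$f" ]; then
            cp -f "$f" "$dest"/
        fi
    done
}
```

Run from the directory where the language pack was unpacked, the call would look like `copy_nonempty_files tessdata /usr/local/share/tessdata`.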

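3. Starting the Tesseract OCR engine and recognizing an image
Now prepare the image file to recognize. I used a TIF image file from the Tesseract distribution, eurotext.tif.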

The command format for recognizing an image is:
tesseract <imagename> <outputbase> [-l lang] [configfile [[+|-]varfile]…]
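Here tesseract is the command; <imagename> is the image to recognize, for example eurotext.tif; <outputbase> is the base name of the output text file, to which the .txt extension is appended; [-l lang] is optional and specifies the language of the text in the image.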
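To apply this command format to many images at once, a small loop is enough. The sketch below assumes the images end in .tif and that the eng language data is installed; the function name ocr_all_tifs is hypothetical, and only the arguments described above are used.

```shell
# Run tesseract on every .tif file in a directory.
# $1: directory containing the images
# $2: language code passed to -l (defaults to eng)
ocr_all_tifs() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.tif; do
        [ -f "$img" ] || continue   # skip an unmatched glob or a non-file
        base=${img%.tif}            # eurotext.tif -> eurotext
        tesseract "$img" "$base" -l "$lang"
    done
}
```

Calling `ocr_all_tifs .` in a directory containing eurotext.tif would invoke tesseract with the output base eurotext, so the text ends up in eurotext.txt next to the image.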

For example, to start the Tesseract OCR engine and recognize the text image eurotext.tif, run:

[root@bogon tesseract-2.04]# tesseract eurotext.tif eurotext
Tesseract Open Source OCR Engine
[root@bogon tesseract-2.04]#

In the tesseract-2.04 directory you can now see eurotext.txt, the text file produced from the image eurotext.tif. Its contents are:

The (quick) [brown] {fox} jumps! Over the $43,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam. Der ,,schnelle” braune Fuchs springt uber den faulen Hund. Le renard brun <<rapide» saute par-dessus le chien paresseux. La volpe marrone rapida salta sopra il cane pigro. El zorro marron répido salta sobre el perro perezoso. A raposa marrom répida salta sobre o cio preguicoso.

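As you can see, the recognition accuracy is quite high. With phototest.tif, the other image shipped in the distribution, the result should be 100% correct. However, these images contain very little noise, so it would be rash to judge the engine's accuracy from them; further testing is needed.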
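For that further testing, it helps to put a number on the accuracy instead of judging by eye. The following is only a rough sketch, not a standard OCR metric: it assumes you have typed the correct text of the image into a reference file yourself, and it compares the two files word by word using diff. The function name word_accuracy is hypothetical.

```shell
# Print the integer percentage (rounded down) of reference words that the
# OCR output reproduced unchanged.
# $1: hand-typed reference text   $2: text file written by tesseract
# Extra words that appear only in the OCR output are not penalized.
word_accuracy() {
    ref_words=$(mktemp)
    ocr_words=$(mktemp)
    # Put one word per line and drop empty lines.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' > "$ref_words"
    tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' > "$ocr_words"
    total=$(wc -l < "$ref_words" | tr -d ' ')
    # In diff's default output, lines starting with "<" are reference
    # words that are missing from or changed in the OCR output.
    wrong=$(diff "$ref_words" "$ocr_words" | grep -c '^<')
    rm -f "$ref_words" "$ocr_words"
    if [ "$total" -eq 0 ]; then
        echo "reference text is empty" >&2
        return 1
    fi
    echo $(( (total - wrong) * 100 / total ))
}
```

With a hand-typed reference.txt, `word_accuracy reference.txt eurotext.txt` prints a single number between 0 and 100. Images with heavier noise can then be compared on the same scale.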