Study Diary » ocr

How to Recognize Advanced CAPTCHAs

admin — Wed, 22 Jan 2014 05:20:28 +0000

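I. CAPTCHA Basics

1. The main purpose of a CAPTCHA is to force human interaction and thereby fend off automated attacks by machines.

2. Most CAPTCHA designers miss the point: they are unfamiliar with the basic concepts of image processing, machine vision, pattern recognition, and artificial intelligence.

3. Breaking CAPTCHAs can make money, though only by committing a crime. For example, China Merchants Bank passwords have only 6 digits and its CAPTCHA is all but useless, so a computer can quickly crack a well-funded account, and many accounts allow online transactions.

4. Some CAPTCHAs are well designed, such as those of Yahoo, Google, and Microsoft. Tencent's Chinese-character CAPTCHA is hard, but it does not count as good.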

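II. Basics of Artificial Intelligence, Pattern Recognition, Machine Vision, and Image Processing

1) Main workflow:

Suppose we want to recognize a CAPTCHA in an image, or detect and recognize a face in an image. What are the rough steps?

1. Image acquisition: for a CAPTCHA, fetch the HTML over HTTP, parse out the image URL, then download and save the image. For face detection and recognition, frames are usually captured with a video capture device, passed through A/D conversion, and stored as digital images or video.

2. Preprocessing: check that the image format is correct, convert to a suitable format, compress, crop the ROI (region of interest), remove noise, convert to grayscale, convert the color space, and so on.

3. Detection: a license-plate recognition system must first find the approximate position of the plate; a face detection system must find every face in the image (including suspected faces); for CAPTCHA recognition, the job is mainly to find the main region containing the text.

4. Pre-recognition processing: face detection and recognition apply corrections before recognition, such as in-plane and out-of-plane rotation or warping. For the CAPTCHA recognition discussed here, this step "usually" means segmenting the characters.

5. Training: use pattern recognition and machine learning algorithms to select and train on a training set of suitable size. More training samples are not always better; overfitting and poor generalization can appear at this stage. This step is optional, since some recognition algorithms need no training.

6. Recognition: take the processed image to be recognized, convert it to the input format the classifier expects, then use the output class and confidence to judge which letter it most likely is. Recognition is essentially classification.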

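2) Key concepts:

Image processing: generally, some mathematical processing applied to a digital image, such as projection, blurring, sharpening, thinning, edge detection, binarization, compression, and various data transforms.

1. Binarization: most images are in color, possibly with many levels depending on fidelity. To lower computational complexity and ease later processing, it is ideal to reduce the image to just black and white, provided no key information is lost.

2. Thinning: find the skeleton of the image. Lines in an image can be wide; thinning reduces their width to 1 pixel, although some spots may stay wider than 1. Thinning algorithms differ, for example in whether the result stays close to the center of the line and whether connectivity is preserved.

3. Edge detection: the key is to understand what an edge is. An edge is a place in the image where pixel attributes change sharply. It can be judged with a fixed threshold or an adaptive one, and the threshold can be global to the image or local. Neither is always better, but most of the time an adaptive local threshold works somewhat better. The attribute analyzed may be color, or the gray level of a grayscale image.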
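To make the notion of an edge concrete, here is a minimal sketch in portable C++. It is not taken from the article's code: the `GrayImage` type and the `detectEdges` function are names invented for this illustration, the change in gray level is measured with simple central differences, and the threshold is the fixed, global kind described above.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper type for this sketch: an 8-bit grayscale image, row-major.
struct GrayImage {
    int width;
    int height;
    std::vector<unsigned char> pixels;  // size == width * height
    unsigned char at(int x, int y) const { return pixels[y * width + x]; }
};

// Marks a pixel as an edge (1) when the gray level changes sharply around it.
// The change is |horizontal difference| + |vertical difference|, compared
// against one fixed threshold for the whole image. Border pixels stay 0.
std::vector<unsigned char> detectEdges(const GrayImage& img, int threshold) {
    std::vector<unsigned char> edges(img.pixels.size(), 0);
    for (int y = 1; y + 1 < img.height; ++y) {
        for (int x = 1; x + 1 < img.width; ++x) {
            int gx = img.at(x + 1, y) - img.at(x - 1, y);
            int gy = img.at(x, y + 1) - img.at(x, y - 1);
            if (std::abs(gx) + std::abs(gy) >= threshold)
                edges[y * img.width + x] = 1;
        }
    }
    return edges;
}
```

An adaptive local variant would replace the single `threshold` argument with a value computed from the neighborhood of each pixel, for example a multiple of the local mean difference.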

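Machine vision: using computers to simulate human vision, for example object detection, localization, and recognition. Depending on how deeply the image is understood, it is divided into high-level and low-level understanding.

Pattern recognition: taking some representation of things or phenomena (numeric or textual; here we mainly mean numeric) and, through processing and analysis, describing, classifying, understanding, and explaining those things, phenomena, and their abstractions.

Artificial intelligence: a broad concept; everything above falls under this general direction. Put simply and without being too academic, it means simulating the "intelligent" abilities of humans, especially on computers, to help people solve problems.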

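III. Analysis of Breaking Common CAPTCHAs

The analysis below uses the material from the PWNtcha project at http://libcaca.zoy.org/wiki/PWNtcha to examine how various CAPTCHAs can be broken. (There are many methods; I only analyze approaches that look feasible to me at first glance.)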

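1) Authimage

Anti-cracking techniques used:

1. Characters are made of disconnected dots
2. A certain degree of slant

Design weaknesses:

1. Horizontal and vertical projection histograms locate the letter region
2. A Hough transform with suitable parameters finds the approximately horizontal lines, which allows skew correction
3. The slant of the string is in-plane, so it adds little difficulty
4. Letter width and size are fixed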

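2) Clubic

Anti-cracking techniques used:

1. Characters are handwritten

Design weaknesses:

1. The detection and segmentation stage involves no technical difficulty; this is one of the poorer designs
2. Digits only, and the handwriting varies little
3. Recognition looks hard at first, but on closer analysis it needs almost no advanced trained recognizer: checking whether certain fixed pixels are colored is enough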

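3) linuxfr.org

Anti-cracking techniques used:

1. Background color blocks
2. Horizontal lines or rectangles in the foreground

Design weaknesses:

1. The background consists of single-color blocks with definite shapes, which region growing removes easily
2. The foreground consists of regular lines in a single color
3. Letters do not touch
4. All printed typefaces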

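4) Ourcolony

Anti-cracking techniques used:

1. The design is too primitive to be worth evaluating

Design weaknesses:

1. This is the poorest design of the lot, yet it still stops novices: few people study computing and fewer still work on breaking CAPTCHAs. As the saying goes, different trades are worlds apart.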

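5) LiveJournal

Anti-cracking techniques used:

1. A slightly better design: it uses random noise, and places it in the foreground
2. Letter position and stroke thickness both vary

Design weaknesses:

1. Letters do not touch
2. Only one type of noise
3. A projection histogram along the X axis segments the letters accurately
4. A projection histogram along the Y axis then locates the height accurately
5. In the recognition stage everything is a printed typeface, which is very easy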
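The projection-histogram segmentation mentioned for Authimage and LiveJournal can be sketched as follows. This is an illustrative example rather than PWNtcha's code: `BinaryImage`, `columnProjection`, and `splitByProjection` are names made up here, and the sketch assumes the noise has already been removed so that gaps between letters contain no ink at all.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Binary image for this sketch: img[y][x] is true for ink (foreground).
typedef std::vector<std::vector<bool> > BinaryImage;

// Counts the ink pixels in every column (the X-axis projection histogram).
std::vector<int> columnProjection(const BinaryImage& img) {
    std::vector<int> counts(img.empty() ? 0 : img[0].size(), 0);
    for (std::size_t y = 0; y < img.size(); ++y)
        for (std::size_t x = 0; x < img[y].size(); ++x)
            if (img[y][x]) ++counts[x];
    return counts;
}

// Each maximal run of non-empty columns becomes one span [first, last],
// which is taken to be one character when letters do not touch.
std::vector<std::pair<int, int> > splitByProjection(const std::vector<int>& counts) {
    std::vector<std::pair<int, int> > spans;
    int start = -1;  // -1 means "currently inside a gap"
    for (int x = 0; x < static_cast<int>(counts.size()); ++x) {
        if (counts[x] > 0 && start < 0) start = x;
        if (counts[x] == 0 && start >= 0) {
            spans.push_back(std::make_pair(start, x - 1));
            start = -1;
        }
    }
    if (start >= 0)
        spans.push_back(std::make_pair(start, static_cast<int>(counts.size()) - 1));
    return spans;
}
```

Running the same idea over rows instead of columns gives the Y-axis projection used to locate the height of each letter.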

IV. Some Advanced CAPTCHAs Found Online

1)ICQ

2)IMDb

3)MS MVPS

4)MVN Forum

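Many people consider these types hard, but analysis shows that character detection, localization, and segmentation are not difficult. Only IMDb and MVPS affect the recognition rate, because their fonts are distorted somewhat more.

Overall, breaking these types is not hard either; a recognition rate above 50% is easy to reach.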

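V. Analysis of Breaking Advanced CAPTCHAs

Because time is short, I will only briefly describe how to use image processing and pattern recognition to automatically recognize fairly advanced CAPTCHAs.
(Taking Google, currently in the spotlight, as the example.)

1) At least at the current level of AI, no simple method can automatically handle all kinds of CAPTCHAs; a system capable of that would necessarily be complex and powerful. So to break a fairly advanced CAPTCHA with a simple algorithm, you must analyze the characteristics of each specific CAPTCHA algorithm:

General image processing and computer vision consider direct features such as color, texture, and shape; statistical features such as histograms and gray levels; and features obtained after transforms such as the FFT and wavelets. The final goal in every case is dimensionality reduction to make recognition easier, not merely to gain speed. From the image point of view, many systems convert to grayscale or even black-and-white images.

Google's images show that the color variation is a feint and poses no processing difficulty. The difficulty lies in the font distortion and the touching characters.

If the characters can be segmented successfully, later recognition is comparatively easy, whether it uses a classification algorithm such as SVM or brute-force recognition by analyzing stroke order and direction.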

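2) Image processing and segmentation of touching characters

The part1 directory of the code mainly performs image preprocessing and segmentation of touching characters:
001: Convert the image from JPG and similar formats to a bitmap for easier processing.
002: Binarize the image with a fixed or adaptive threshold algorithm.
(Algorithm 003 can be used instead.)
003: Binarize the image with Otsu's method, which picks the threshold automatically from the gray-level histogram.
(More general, and gives better results most of the time.)
005: Extract the ROI (region of interest).
006: Edge tracing.
007: Edge detection.
008: Thinning to extract the skeleton.
009: Some tidying up.
(This usually has to be tuned to the specific CAPTCHA algorithm.)
010: Perform the cuts; note the red intersection points in the image.
011: Merge the edge-detection image with the image of detected skeleton intersection points.
(The merge step can include analysis, such as X-offset threshold analysis, texture analysis around intersection points, stroke-direction analysis, and other methods, to find the more likely cut points and to manage the combination of the separated parts.)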
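Step 003 is the one most easily shown in isolation. The following is a minimal sketch of Otsu's method written for this page, not the article's part1 code: it builds the gray-level histogram and returns the threshold that maximizes the between-class variance. The function names `otsuThreshold` and `binarize` are assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Returns the threshold t that maximizes the between-class variance when
// pixels <= t form one class and pixels > t form the other (Otsu's method).
int otsuThreshold(const std::vector<unsigned char>& gray) {
    long hist[256] = {0};
    for (std::size_t i = 0; i < gray.size(); ++i) ++hist[gray[i]];

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += static_cast<double>(i) * hist[i];

    double sumLow = 0.0, weightLow = 0.0, bestVariance = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightLow += hist[t];
        sumLow += static_cast<double>(t) * hist[t];
        if (weightLow == 0) continue;       // no pixels in the low class yet
        double weightHigh = total - weightLow;
        if (weightHigh == 0) break;         // all pixels are in the low class
        double meanLow = sumLow / weightLow;
        double meanHigh = (sumAll - sumLow) / weightHigh;
        double between = weightLow * weightHigh * (meanLow - meanHigh) * (meanLow - meanHigh);
        if (between > bestVariance) {
            bestVariance = between;
            best = t;
        }
    }
    return best;
}

// Applies the threshold: 1 for pixels brighter than t, 0 otherwise.
std::vector<unsigned char> binarize(const std::vector<unsigned char>& gray, int t) {
    std::vector<unsigned char> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i) out[i] = gray[i] > t ? 1 : 0;
    return out;
}
```

Because the threshold is derived from the image itself, the same code can be reused across CAPTCHAs with different contrast, which is presumably why the list calls 003 the more general choice.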

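Code: (The code quality is not high; it was copied from other projects and lightly modified.)

View code (./pstzine_09_01.txt)

Note: here we can see that the basic parts (letters) have been separated, but a single letter may have been cut into several components. One approach is to segment using prior knowledge. Another is to combine segmentation with the recognition in the next part: going from left to right, try adding components and recognizing the group; if recognition fails and the group's total width and total area are still small, keep adding. Rejection, of course, remains possible.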
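The second approach in the note, growing a group of components from left to right until it is recognized, can be sketched like this. Everything here is hypothetical scaffolding: the `Component` struct, the `Recognizer` callback (which stands in for the SVM classifier and returns 0 to reject), and the `maxCharWidth` limit are assumptions of this example, and only the total width is checked, not the total area.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One connected component from segmentation, reduced to its horizontal extent.
struct Component {
    int left;
    int right;
};

// Hypothetical recognizer interface: given the columns [left, right], it
// returns the recognized character, or 0 when it rejects the region.
typedef std::function<char(int left, int right)> Recognizer;

// Greedy left-to-right grouping: keep adding components to the current group
// while the group is still narrower than maxCharWidth and not yet recognized.
std::string combineAndRecognize(const std::vector<Component>& parts,
                                const Recognizer& recognize,
                                int maxCharWidth) {
    std::string text;
    std::size_t i = 0;
    while (i < parts.size()) {
        const int left = parts[i].left;
        char found = 0;
        std::size_t j = i;
        for (; j < parts.size() && parts[j].right - left + 1 <= maxCharWidth; ++j) {
            found = recognize(left, parts[j].right);
            if (found != 0) break;  // group [i..j] is a character
        }
        if (found != 0) {
            text += found;
            i = j + 1;              // continue after the consumed components
        } else {
            text += '?';            // rejection: no grouping was recognized
            ++i;
        }
    }
    return text;
}
```

A real implementation would probably also use the classifier's confidence, so that a low-confidence match on a small group does not block a better match on a larger one.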

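3) Combining character parts and recognition

The code in part2 shows how segmented letter parts are combined, and the training and recognition process of SVM-based character recognition. Detection.cpp shows some simple techniques from image-spam detection for character segmentation and combination and for analyzing and using layout. Recognizing Google's CAPTCHA does not need these at all; they are for reference only.

SVM and its use:

In essence, an SVM is a classifier; the original SVM is a two-class classifier. Multi-class classifiers can be built by combining them one-vs-one (1:1) or one-vs-rest (1:n). Through kernel functions it naturally supports classifying high-dimensional data. Geometrically, it finds the vectors that best characterize the class boundary (the support vectors, SV) and then finds a separating line (a hyperplane in general) that maximizes the classification margin.

libSVM is a good implementation.

Data preparation and normalization must be identical in the training stage and the recognition stage. A simple approach here is:

First: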

#define SVM_MAX +0.999
#define SVM_MIN +0.001

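Next:

Scan every pixel of the black-and-white character image to be recognized. If the pixel is 0 (black, a pixel on the letter), set the SVM input at that position to SVM_MAX; otherwise set it to SVM_MIN.

Finally:

In the training stage, put the class label, that is, which letter it is, in front of the SVM input.
In the recognition stage, the class label is of course what the SVM outputs.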
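The three steps above can be sketched as one function that turns a binarized glyph into a training line in the LIBSVM text format (`<label> <index>:<value> ...`, with indices starting at 1). This is an illustration, not the article's part2 code: `toSvmLine` is a made-up name, the two constants mirror the `#define`s above, and the mapping from letters to integer labels is left to the caller.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

const double SVM_MAX = 0.999;  // value for ink pixels
const double SVM_MIN = 0.001;  // value for background pixels

// Builds one sample line. pixels holds the binarized glyph in row-major
// order, where 0 means black (a pixel on the letter).
std::string toSvmLine(int label, const std::vector<unsigned char>& pixels) {
    std::ostringstream line;
    line << label;
    for (std::size_t i = 0; i < pixels.size(); ++i)
        line << ' ' << (i + 1) << ':' << (pixels[i] == 0 ? SVM_MAX : SVM_MIN);
    return line.str();
}
```

For a three-pixel glyph `{0, 255, 0}` with label 3 this yields `3 1:0.999 2:0.001 3:0.999`. At recognition time the same function can be used with a placeholder label, since the label is then the output of the classifier; every glyph must first be scaled to the same pixel dimensions so that feature indices line up.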

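Note:

If you are new to SVM, it is best to find a tool that wraps the SVM and handles sample selection, cross-validation, kernel selection, and the like, so that the program selects and analyzes automatically.

Code: extract the characters one by one with region growing, then start recognition.

View code (./pstzine_09_02.txt)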
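For readers without access to the linked file, region growing can be sketched as a flood fill that labels each connected blob of ink. This is a generic illustration under stated assumptions, not the code in pstzine_09_02.txt: `labelComponents` is a name chosen here, and 4-connectivity is assumed.

```cpp
#include <utility>
#include <vector>

// Labels every 4-connected blob of ink. ink[y][x] is true for foreground.
// On return labels[y][x] is 0 for background or 1..n for the blob id;
// the function returns n, the number of blobs found.
int labelComponents(const std::vector<std::vector<bool> >& ink,
                    std::vector<std::vector<int> >& labels) {
    const int h = static_cast<int>(ink.size());
    const int w = h ? static_cast<int>(ink[0].size()) : 0;
    labels.assign(h, std::vector<int>(w, 0));
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    int count = 0;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (!ink[y][x] || labels[y][x] != 0) continue;
            // New seed: grow the region outward from (x, y).
            ++count;
            labels[y][x] = count;
            std::vector<std::pair<int, int> > stack(1, std::make_pair(x, y));
            while (!stack.empty()) {
                std::pair<int, int> p = stack.back();
                stack.pop_back();
                for (int k = 0; k < 4; ++k) {
                    int nx = p.first + dx[k];
                    int ny = p.second + dy[k];
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    if (!ink[ny][nx] || labels[ny][nx] != 0) continue;
                    labels[ny][nx] = count;
                    stack.push_back(std::make_pair(nx, ny));
                }
            }
        }
    }
    return count;
}
```

Each label then corresponds to one candidate character (or character fragment) whose bounding box can be cropped and passed to the classifier.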

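VI. Some Suggestions for CAPTCHA Design

1. When using noise and similar devices, try to make the characters hard to distinguish from the confusing foreground and background. Make the bad guys (noise) look as much like the good guys (letters) as possible.

2. A really good CAPTCHA design should exploit what humans are good at and AI algorithms are not, such as segmenting touching characters and reading handwriting (specially distorted printed type also works). Do not just pile on noise or other flashy elements that merely look complicated. Even if you make it complicated enough, a CAPTCHA that humans also struggle to read will only make people think you are asking for trouble.

3. From a professional machine-vision standpoint, a CAPTCHA must force the attacker to go back and forth between low-level and high-level vision several times during recognition. This greatly raises the difficulty of breaking it and lowers the attacker's accuracy.

VII. Personal Statement

1. This problem is itself a hard one in artificial intelligence, computer vision, and pattern recognition. I am a small fry, about as much of a beginner as one can be. The attacker is at a disadvantage, and doing this really well is hard. Overall I take a fairly academic route, which can genuinely break CAPTCHAs of fairly high difficulty, unlike many of the cruder cracking methods found online. All I can do is use my limited knowledge to offer a starting point for others. My understanding of many OCR techniques, especially offline handwritten Chinese recognition, is very limited, so I dare not write about them here.

2. Please do not use this technique for illegal purposes.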

Image preprocessing and requirements when training Tesseract OCR on a new font

admin — Wed, 22 Jan 2014 03:34:38 +0000


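Just as Tesseract OCR places requirements on images for recognition, it also places requirements on images when training a new character set or font. Images that meet them make training much more efficient.

On the image-processing side, remove noise and make the training character images as continuous and clear as possible.

Other common requirements are:

1. Use a single font within one image. Never mix several fonts in one training image. This condition is easy to overlook when the character images do not come from scanned text.

2. Ideally, collect character images of the same font into one large training image, on a single page.

3. Keep reasonable character spacing and line spacing.

4. Character height (size) only needs to meet the minimum: the lowercase letter x must be more than 10 pixels tall.

5. Non-letter characters, such as !@#$%^&(),.{}<>/?, should not appear bunched together, because that makes it hard for Tesseract to find the text-line baseline and to estimate text height and size. Baseline detection is the first step of the Tesseract engine.

6. In general each character needs 10 samples; high-frequency characters need at least 20, and rare characters need 5.

7. For multiple pages of training images in the same font, merge the tr files and the box files in the same way during training, keeping the characters in the same order in both kinds of files; this helps improve training results.

Training character images need not be collected from the images to be recognized. You can pick the matching font in Word, type out the character set, print it, and scan it to obtain training images, which is simple and convenient. Apply this as the situation requires.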

The Tesseract documentation contains this sentence:

but note that there is no incremental training mode that allows you to add new training data to existing sets.

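In other words, there is no incremental training mode for adding new training data to an existing data set.

Some have suggested combining several trained data sets with the -l parameter, as in tesseract input.tif output -l eng+newfont; the effect remains to be tested.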

HBITMAP Grayscale

admin — Tue, 21 Jan 2014 08:39:16 +0000

// Grayscale conversion (handles 32-bpp bitmaps only).
// GetBitmapBits returns each pixel as B,G,R,X bytes, so a DWORD read from the
// buffer is 0x00RRGGBB. The channels are extracted by shifting, because
// GetRValue/GetBValue expect COLORREF order (0x00BBGGRR) and would swap them.
#define GET_GRAY_VALUE(x) ((BYTE)(0.110*((x)&0xFF)+0.588*(((x)>>8)&0xFF)+0.302*(((x)>>16)&0xFF)))
HBITMAP CCatchScreenDlg::GetGrayBitmap( HBITMAP hResBitmap ,int& nWhiteCount,int& nBackCount )
{
	nWhiteCount=0;
	nBackCount=0;
	ASSERT(hResBitmap);
	HBITMAP hDesBitmap=NULL;
	BITMAP bm;
	GetObject(hResBitmap,sizeof(bm),&bm);
	if ( bm.bmBitsPixel!=32 )	// the loop below reads 4 bytes per pixel
		return NULL;
	// bmWidthBytes already includes any row padding
	LONG lSize=bm.bmWidthBytes*bm.bmHeight;
	HLOCAL hMem=LocalAlloc(LHND,lSize);
	if ( !hMem )
		return NULL;
	byte*  pData=(byte*)LocalLock(hMem);
	::GetBitmapBits(hResBitmap,lSize,pData);
	// Visit every pixel once: count dark (<128) and light pixels, then
	// overwrite the pixel with its gray value.
	for ( byte* p=pData; p<pData+lSize; p+=4 )
	{
		DWORD dwColor=0;
		memcpy(&dwColor,p,4);
		byte bGray=GET_GRAY_VALUE(dwColor);
		if ( bGray<128 )
			nBackCount++;
		else
			nWhiteCount++;
		dwColor=RGB(bGray,bGray,bGray);
		memcpy(p,&dwColor,4);
	}
	// Copy the gray pixels into a new bitmap compatible with the window DC
	HDC hDC=::GetDC(m_hWnd);
	hDesBitmap=CreateCompatibleBitmap(hDC,bm.bmWidth,bm.bmHeight);
	if ( hDesBitmap )
		::SetBitmapBits(hDesBitmap,lSize,pData);
	LocalUnlock(hMem);
	LocalFree(hMem);
	::ReleaseDC(m_hWnd,hDC);
	return hDesBitmap;
}

write DIB

admin — Tue, 21 Jan 2014 08:11:33 +0000

static BOOL WriteDIB( LPCTSTR szFile, HANDLE hDIB)
 {
 BITMAPFILEHEADER hdr;
 LPBITMAPINFOHEADER lpbi;

 if (!hDIB)
 return FALSE;

 CFile file;
 if( !file.Open (szFile, CFile::modeWrite | CFile::modeCreate))
 {
 return FALSE;
 }

 lpbi = (LPBITMAPINFOHEADER) hDIB;

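 // A color table exists only for bit depths of 8 or less. Testing the bit
 // count first also avoids the undefined shift 1 << 32 for 32-bpp bitmaps.
 int nColors = (lpbi->biBitCount <= 8) ? (1 << lpbi->biBitCount) : 0;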

 // Fill in the fields of the file header 
 hdr.bfType = ((WORD) ('M' << 8) | 'B'); // is always "BM"
 hdr.bfSize = GlobalSize (hDIB) + sizeof( hdr );
 hdr.bfReserved1  = 0;
 hdr.bfReserved2  = 0;
 hdr.bfOffBits = (DWORD) (sizeof( hdr ) + lpbi->biSize +
 nColors * sizeof(RGBQUAD));

 // Write the file header 
 file.Write( &hdr, sizeof(hdr) );

 // Write the DIB header and the bits 
 file.Write( lpbi, GlobalSize(hDIB) );

 return TRUE;
 }

DDB To DIB

admin — Tue, 21 Jan 2014 08:05:52 +0000

HANDLE CGraphView::DDBToDIB( CBitmap& bitmap, DWORD dwCompression ) 
 {
     BITMAP                bm;
     BITMAPINFOHEADER    bi;
     LPBITMAPINFOHEADER  lpbi;
     DWORD                dwLen;
     HANDLE                hDIB;
     HANDLE                handle;
     HDC                    hDC;
     HPALETTE            hPal;

     CWindowDC            dc( this );
     CPalette            pal;
     // Create a palette if the device supports one
    if( dc.GetDeviceCaps( RASTERCAPS ) & RC_PALETTE )
     {
         UINT        nSize   = sizeof(LOGPALETTE) + ( sizeof(PALETTEENTRY) * 256 );
         LOGPALETTE* pLP     = (LOGPALETTE*)new BYTE[nSize];
         pLP->palVersion     = 0x300;
         // Request all 256 entries (a count of 255 would skip the last one)
         pLP->palNumEntries = (unsigned short)GetSystemPaletteEntries( dc, 0, 256, 
         pLP->palPalEntry );

         pal.CreatePalette( pLP );

         // Free the temporary buffer
        delete[] pLP;
     }

     ASSERT( bitmap.GetSafeHandle() );

     // The BI_BITFIELDS compression type is not supported
    if( dwCompression == BI_BITFIELDS )
         return NULL;

     // Fall back to the default palette if none was created
    hPal = (HPALETTE) pal.GetSafeHandle();
     if (hPal==NULL)
         hPal = (HPALETTE) GetStockObject(DEFAULT_PALETTE);

     // Get the bitmap information
    bitmap.GetObject(sizeof(bm),(LPSTR)&bm);

     // Initialize the bitmap info header
    bi.biSize        = sizeof(BITMAPINFOHEADER);
     bi.biWidth        = bm.bmWidth;
     bi.biHeight         = bm.bmHeight;
     bi.biPlanes         = 1;
     bi.biBitCount        = (unsigned short)(bm.bmPlanes * bm.bmBitsPixel) ;
     bi.biCompression    = dwCompression;
     bi.biSizeImage        = 0;
     bi.biXPelsPerMeter    = 0;
     bi.biYPelsPerMeter    = 0;
     bi.biClrUsed        = 0;
     bi.biClrImportant    = 0;

     // Compute the size of the info header plus the color table
    int nColors = 0;
     if(bi.biBitCount <= 8)
         {
         nColors = (1 << bi.biBitCount);
         }
     dwLen  = bi.biSize + nColors * sizeof(RGBQUAD);

     hDC = ::GetDC(NULL);
     hPal = SelectPalette(hDC,hPal,FALSE);
     RealizePalette(hDC);

     // Allocate memory for the info header and the color table
    hDIB = GlobalAlloc(GMEM_FIXED,dwLen);

     if (!hDIB){
         SelectPalette(hDC,hPal,FALSE);
         ::ReleaseDC(NULL,hDC);
         return NULL;
     }

     lpbi = (LPBITMAPINFOHEADER)GlobalLock(hDIB);

     *lpbi = bi;

     // Call GetDIBits with a NULL buffer so that it fills in biSizeImage
    GetDIBits(hDC, (HBITMAP)bitmap.GetSafeHandle(), 0L, (DWORD)bi.biHeight,
             (LPBYTE)NULL, (LPBITMAPINFO)lpbi, (DWORD)DIB_RGB_COLORS);

     bi = *lpbi;

     // If the driver did not fill it in, compute it: each row is padded to a 32-bit boundary
    if (bi.biSizeImage == 0){
         bi.biSizeImage = ((((bi.biWidth * bi.biBitCount) + 31) & ~31) / 8) 
                         * bi.biHeight;

         if (dwCompression != BI_RGB)
             bi.biSizeImage = (bi.biSizeImage * 3) / 2;
     }

     // Grow the allocation so that it can hold the pixel data as well
    dwLen += bi.biSizeImage;
     handle = GlobalReAlloc(hDIB, dwLen, GMEM_MOVEABLE) ;
     if (handle != NULL)
         hDIB = handle;
     else
         {
         GlobalFree(hDIB);

         // Reselect the original palette
        SelectPalette(hDC,hPal,FALSE);
         ::ReleaseDC(NULL,hDC);
         return NULL;
         }

     // Fetch the bitmap bits
    lpbi = (LPBITMAPINFOHEADER)hDIB;

     // Fetch the final DIB
     BOOL bGotBits = GetDIBits( hDC, (HBITMAP)bitmap.GetSafeHandle(),
                 0L,                      // first scan line
                 (DWORD)bi.biHeight,      // number of scan lines
                 (LPBYTE)lpbi             // address of the bitmap bits
                 + (bi.biSize + nColors * sizeof(RGBQUAD)),
                 (LPBITMAPINFO)lpbi,      // address of the bitmap info
                 (DWORD)DIB_RGB_COLORS);  // color table holds RGB values

     if( !bGotBits )
     {
         GlobalFree(hDIB);

         SelectPalette(hDC,hPal,FALSE);
         ::ReleaseDC(NULL,hDC);
         return NULL;
     }

     SelectPalette(hDC,hPal,FALSE);
     ::ReleaseDC(NULL,hDC);
     return hDIB;
 }

tesscallback.h(1011): error C2872: "remove_reference": ambiguous symbol

admin — Mon, 20 Jan 2014 08:41:27 +0000

The real cause has been found for \tesseract-ocr\include\tesseract\tesscallback.h(1011): error C2872: "remove_reference": ambiguous symbol. The culprit is the order in which
#include "baseapi.h" and
using namespace std;
appear in the source file. Putting #include "baseapi.h" first and
using namespace std; after it works, and avoids the name clash.
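The mechanism can be reproduced without Tesseract. The sketch below assumes, as the error message suggests, that tesscallback.h declares its own remove_reference in the global namespace; the templates here are stand-ins written for this example, and `valueOf` is a made-up function playing the role of the header code that uses the name unqualified.

```cpp
#include <type_traits>  // declares std::remove_reference

// Stand-in for the library header: its own remove_reference, declared in the
// global namespace (an assumption about what tesscallback.h does).
template <class T> struct remove_reference { typedef T type; };
template <class T> struct remove_reference<T&> { typedef T type; };

// The "header" uses the name unqualified. No using-directive is in effect
// yet, so only ::remove_reference is visible and the lookup is unambiguous.
template <class T>
typename remove_reference<T>::type valueOf(T& ref) {
    return ref;
}

// Safe position for the using-directive: after the header. If this line came
// before the code above, the unqualified name would match both
// ::remove_reference and std::remove_reference, and the compiler would
// report an ambiguous symbol.
using namespace std;
```

Avoiding `using namespace std;` at file scope altogether, or qualifying names as `std::...`, sidesteps the problem regardless of include order.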

Capture2Text

admin — Fri, 19 Apr 2013 09:47:04 +0000


What is Capture2Text?

Capture2Text enables users to do the following:

Optical Character Recognition (OCR): Allows the user to quickly snapshot a small portion of the screen, OCR it and (by default) save the result to the clipboard.
Speech Recognition: Using speech recognition the user can speak into their microphone and Capture2Text will convert the speech to text. If the speech recognition technology is not 100% sure, Capture2Text will present the user with a list of the most likely transcriptions. The selected result will (by default) be copied to the clipboard.


Download

The latest version can be found on the Capture2Text download page hosted by SourceForge. Source code is included.

How to Install

Unzip the contents of the zip file. Make sure that there are no Asian or other non-ASCII characters in the path where you unzipped it. Also, if you are on Windows 7, don’t unzip it to the Program Files directory (this will avoid issues related to write privileges).
Double-click on Capture2Text.exe. You should see the Capture2Text icon on the bottom-right of your screen (though it might be hidden in which case you will have to click on the “Show hidden icons” arrow).

OCR

Capture2Text can OCR the following languages:

Afrikaans	Frankish	Maltese
Albanian	French	Norwegian
Ancient Greek	Galician	Polish
Arabic	German	Portuguese
Azerbaijani	Greek	Romanian
Basque	Hebrew	Russian
Belarusian	Hindi	Serbian
Bengali	Hungarian	Slovakian
Bulgarian	Icelandic	Slovenian
Catalan	Indonesian	Spanish
Cherokee	Italian	Swahili
Chinese	Japanese	Swedish
Croatian	Kannada	Tagalog
Czech	Korean	Tamil
Danish	Latvian	Telugu
Dutch	Lithuanian	Thai
English	Macedonian	Turkish
Esperanto	Malay	Ukrainian
Estonian	Malayalam	Vietnamese
Finnish	Maltese

By default only Chinese, English, French, German, Japanese, and Spanish are installed.

To acquire other languages:

Download the appropriate OCR language dictionaries from http://code.google.com/p/tesseract-ocr/downloads/list. These files end in “.tar.gz” (ex. tesseract-ocr-3.02.rus.tar.gz).
Open the “.tar.gz” file you just downloaded with 7-Zip or similar decompression software and navigate to the directory that has the file that ends in “.traineddata”.
Drag the “.traineddata” file (and any other file in this directory) to this path in the Capture2Text directory: Capture2Text\Utils\tesseract\tessdata
Restart Capture2Text

Note: Arabic and Hindi are more CPU intensive and will thus be slower to OCR.

OCR Usage

Press the OCR capture key (default: Windows Key + Q) to start the capture. Now, using your mouse, resize the capture box over the area of the screen that you want to OCR. A preview of the captured OCR’d text will appear in the top-left corner of the screen. Press the capture key again or the left mouse button to complete the capture. The captured screen area will be OCR’d and the textual result will be stored in the clipboard by default.

To cancel an OCR capture, press Esc.

To move the capture box, hold down the right mouse button and drag the mouse.

To nudge the capture box, use the arrow keys.

To toggle the active capture box corner, press the space bar.

To change the OCR language, right-click the Capture2Text tray icon, select the OCR Language option and then select the desired language.

To quickly switch between 3 languages, use the OCR language quick access keys: Windows Key + 1, Windows Key + 2, and Windows Key + 3.

When the Tesseract version of Chinese or Japanese is selected, you should specify the text direction (vertical or horizontal) using the text direction key: Windows Key + W. The text direction will not have any effect on the NHocr Chinese or NHocr Japanese dictionaries.

Using the Preferences dialog, you can change the following OCR settings:

OCR Hotkeys.
Current OCR Language.
The 3 Quick-Access OCR Languages.
Capture Box color and opacity.
Enable/Disable the preview box and change its colors, font and opacity.
Change the text direction (used for Chinese and Japanese).

Speech Recognition

Capture2Text can perform speech recognition for the following languages:

Afrikaans	French	Polish
Chinese	German	Portuguese
Czech	Italian	Russian
Dutch	Japanese	Spanish
English	Korean	Turkish

Speech Recognition Usage

Press the speech recognition capture key (default: Windows Key + A) to start the capture. You will see a box that says “Recording…” in the top-left corner of your screen. Speak a word or phrase or sentence into your microphone. Capture2Text will automatically recognize when you are done speaking and will display a box that says “Analyzing…”. The speech recognition will take a couple of seconds. When the speech recognition is complete you will see a list of possible transcriptions to choose from. When you choose a transcription, it will be stored in the clipboard by default.

When the results window is displayed, you can press Enter to select the first transcription or use the number keys (1-9) to select the corresponding transcription.

To cancel a speech recognition capture, press Esc.

To change the speech recognition language, right-click the Capture2Text tray icon, select the Speech Recognition Language option and then select the desired language.

To quickly toggle between 2 languages, use the speech recognition language hotkey: Windows Key + 4.

Using the Preferences dialog, you can change the following speech recognition settings:

Speech recognition Hotkeys.
Current speech recognition Language.
The 2 speech recognition languages to toggle between.
The properties of the Results window (font, color, number of results).
How much silence to wait for before recording stops.

Output Options

By default, the OCR’d or speech recognized text will be placed in the clipboard.

You also have 3 more ways to output the text.

To send the text to a pop-up window you can right-click the Capture2Text tray icon and select Show Popup Window.

To send the text to whichever textbox currently contains the blinking cursor/I-beam, right-click the Capture2Text tray icon and select Send to Cursor.

Advanced: To send the text directly to a window/control (for example, Notepad++), first fill in the Send to Control settings in the Preferences dialog. Once this is done you may enable/disable the option by right-clicking the Capture2Text tray icon and selecting Send to Control.

Using the Preferences dialog, you can change the following output settings:

Text to prepend/append to the captured text.
Enable/Disable outputting to the clipboard.
Enable/Disable outputting to a popup window.
Popup window properties (default width and height).
Enable/Disable sending the output text to the cursor.
Enable/Disable outputting to a control.
Additional command to send to the output control.

Configuration

Right-click the Capture2Text tray icon in the bottom-right of your screen and then select the “Preferences…” option to bring up the Preferences dialog.

Substitutions

Sometimes Capture2Text consistently makes the same OCR mistakes such as recognizing an “M” as “I\/|”.

By editing the substitutions.txt file in the Capture2Text directory, you may tell Capture2Text to substitute one text string for another text string.

Just find the appropriate language section and add one substitution per line in this format: from_text = to_text

Example (adding 3 substitutions to the English section):

English:
I\/| = M
>< = X
some%space%text = some_text

To create a substitution regardless of language, add the substitution to the “All:” section.

Special tokens and escape characters:

%space%	Space character
%tab%	Tab character
%eq%	Equals (=)
%perc%	Percent sign (%)
%lf%	Linefeed character (\n)
%cr%	Carriage return character (\r)

You may disable a substitution by adding a “#” in front.

When done editing substitutions.txt, either restart Capture2Text or switch language for the substitutions to take effect.

Command Line Options

You may OCR the screen via command line by calling Capture2Text in this format:

Capture2Text.exe x1 y1 x2 y2 [output_file]

Required Arguments:
x1 – X1-Coordinate of the screen
y1 – Y1-Coordinate of the screen
x2 – X2-Coordinate of the screen
y2 – Y2-Coordinate of the screen

Optional Arguments:
output_file – The OCR’d text will be written to this file if specified.

Capture2Text will read settings.ini to determine settings such as OCR language and output options (clipboard, popup, etc.).

Examples:
Capture2Text.exe 10 152 47 321 output.txt
Capture2Text.exe 10 152 47 321

The Tesseract OCR Open Source Project

admin — Thu, 18 Apr 2013 04:56:37 +0000

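Recently a project required image-based CAPTCHA recognition, so I did some initial exploration of the open source Tesseract OCR project. The project describes itself as follows: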

This package contains the Tesseract Open Source OCR Engine. Originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado.

The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. A tiff reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.

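Tesseract is an open source optical character recognition (OCR) project that can recognize image CAPTCHAs. Given, say, a TIF image of text, Tesseract recognizes the text in the image and writes it to a text file, and the results are quite good. To recognize text images in other languages, you need to download the corresponding language packs before Tesseract can recognize those languages.

The Tesseract project is at http://code.google.com/p/tesseract-ocr/; you can download the open source release there or visit the site to learn more.

I downloaded version 2.04, fairly recent at the time, from http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz. Following the instructions on the project site did not give a working installation; I am not sure whether my network dropped packets during the download or something else went wrong. After carefully reading the documentation and the FAQ on the project site, I finally found the cause. Here is a brief record of the setup process.

The downloaded archive is tesseract-2.04.tar.gz. I extracted it directly on Fedora Core 7 Linux, as root, in the /root directory. The extracted directory is tesseract-2.04, which contains a rather messy assortment of files. The installation goes as follows: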

1. Compile Tesseract

The files in the tesseract-2.04 directory extracted from tesseract-2.04.tar.gz appear to all be read-only, so their permissions need changing:

[root@bogon tesseract-2.04]# chmod 777 -R *

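Then run the following three commands with default settings to configure, compile, and install:

[root@bogon tesseract-2.04]# ./configure
[root@bogon tesseract-2.04]# make
[root@bogon tesseract-2.04]# make install

This may take a little while to complete.

2. Set up the language data

By default the steps above install the language data into /usr/local/share/tessdata. Check that directory first: if the files in it (excluding the configs and tessconfigs directories) are all 0 bytes, something went wrong, and starting the Tesseract OCR engine produces the following error: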

Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset

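The failure is inevitable: the file /usr/local/share/tessdata/eng.unicharset is empty and cannot be loaded. Next check /root/tesseract-2.04/tessdata; if the files there (excluding the configs and tessconfigs directories) are also 0 bytes, the data must be downloaded separately. My guess is that the files in /usr/local/share/tessdata are empty because the files in /root/tesseract-2.04/tessdata were invalid, so the install step copied empty files into /usr/local/share/tessdata, and loading then failed.

So download the language pack separately: fetch http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz and extract it to get a tessdata directory, then copy its 8 non-empty files into /usr/local/share/tessdata, overwriting the empty ones.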

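3. Start the Tesseract OCR engine and recognize an image

Now prepare an image file to recognize. I used a TIF image from the Tesseract release.

The command format for recognizing an image is:
tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]…]
Here tesseract is the command; imagename is the image to recognize, for example eurotext.tif; outputbase is the name of the output text file, to which the .txt extension is appended; and the optional [-l lang] specifies the language of the text in the image.

For example, to start the Tesseract OCR engine and recognize the text image eurotext.tif, run:

[root@bogon tesseract-2.04]# tesseract eurotext.tif eurotext
Tesseract Open Source OCR Engine
[root@bogon tesseract-2.04]#

The tesseract-2.04 directory now contains the text file eurotext.txt produced from the image eurotext.tif, with the following content: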

The (quick) [brown] {fox} jumps! Over the $43,456.78 #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam. Der ,,schnelle” braune Fuchs springt uber den faulen Hund. Le renard brun <

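As you can see, the recognition accuracy is quite high; with the phototest.tif image bundled in the release, accuracy should reach 100%. However, these images contain very little interference, so no firm conclusion about accuracy can be drawn yet; further testing is needed.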
