OCR stands for Optical Character Recognition: recognizing characters by optical means. It is an important area in the research and application of automatic recognition technology. OCR is a software technology that automatically recognizes characters and enters them into a computer, and it is the main software bundled with scanners. It is a form of non-keyboard input and depends on image input equipment, chiefly the scanner. Today the term OCR mainly refers to character recognition software. Before 1996, when Thunis began bundling Chinese recognition software with its scanners, scanners and OCR software were always sold separately. OCR software has been upgraded continually since then, and scanner manufacturers now sell their scanners together with professional OCR software. The rapid development of OCR technology is closely tied to the widespread use of scanners. Over the past two years, as scanners have become more common and OCR technology has improved, OCR has become a capable assistant for most scanner users.
First, the development of OCR technology
Since the first generation of OCR products appeared in the early 1960s, more than 30 years of continuous development and improvement have produced remarkable results in research on OCR technologies of all kinds, including handwriting recognition. The requirements placed on OCR products have shifted as well: where the recognition rate was once the main concern, users now also expect high recognition speed, a friendly interface, simple operation, stability, adaptability, reliability, easy upgrades, and good pre-sales and after-sales service.
The first OCR product was developed by IBM, which exhibited its IBM 1287 at the New York World's Fair in 1965. At that time the product could recognize only printed digits, English letters and some symbols, and only in a designated font. In the late 1960s, Hitachi and Fujitsu also developed their own OCR products. The world's first automatic letter sorting system to recognize handwritten postal codes was developed by Toshiba of Japan, and NEC introduced a similar system two years later. By 1974 the automatic sorting rate for letters had reached about 92%, and such systems were widely and successfully used in postal services. In 1983, Toshiba released the OCRV595, an OCR system for printed Japanese text, with a recognition speed of 70 to 100 Chinese characters (kanji) per second and a recognition rate of 99.5%. Toshiba later began research on recognizing handwritten Japanese kanji.
China's research on OCR technology started relatively late. Work on recognizing digits, English letters and symbols began in the 1970s, and work on Chinese character recognition began in the late 1970s. In 1986, Tsinghua University, the Beijing Institute of Information Technology and the Shenyang Institute of Automation jointly developed Chinese OCR software. In 1989, Tsinghua University was the first in China to launch a Chinese OCR package, Tsinghua Wentong TH-OCR 1.0, and Chinese OCR formally moved from the laboratory to the market. Tsinghua later released TH-OCR 92, a high-performance, practical, multi-font, multi-function recognition system for printed simplified and traditional Chinese characters, which marked a large advance in printed Chinese character recognition. TH-OCR 94, a high-performance system for recognizing printed text with mixed Chinese and English, was introduced in 1994 and was appraised by experts as "the first recognition system for printed mixed Chinese-English text at home or abroad, at an internationally leading level overall". In the mid-to-late 1990s, the Department of Electronic Engineering of Tsinghua University proposed and carried out comprehensive research on Chinese character recognition, with important results in printed text recognition, online handwritten Chinese character recognition, offline handwritten Chinese character recognition and offline handwritten digit and symbol recognition. The representative result is the TH-OCR 97 integrated Chinese character recognition system, which can recognize and input printed text, online handwritten Chinese characters, offline handwritten Chinese characters and handwritten digits in multiple languages (Chinese, English and Japanese).
In recent years, besides Tsinghua Wentong TH-OCR, other OCR packages with their own strengths, such as Shangshu SH-OCR, have appeared one after another, and the Chinese OCR market has expanded steadily, with users all over the world.
Recognition of printed text can be said to have reached a high level. OCR products have developed from early systems that could recognize only designated printed digits, English letters and some symbols into powerful tools for rapid data entry that analyze page layout automatically, recognize tables, and handle text that mixes languages, fonts, font sizes, and horizontal and vertical layout. The recognition rate for printed Chinese characters is over 98%, and even for poorly printed text it is over 95%. These products recognize both simplified and traditional characters in many typefaces, such as Song, Hei (boldface), Kai (regular script) and Fangsong (imitation Song), including text that mixes several typefaces and font sizes. The recognition rate for handwritten Chinese characters is over 70%. In particular, after more than ten years of effort, Chinese character OCR in China has overcome the difficulties of a late start and a very large character set, and single-character recognition speed (the number of characters processed per unit time, from feature extraction to output of the recognition result) can exceed 70 characters per second. Because printed Chinese character recognition is mature, OCR products are widely used in news, printing, publishing, libraries, office automation and other fields.
Professional OCR products are mostly aimed at specific industries, that is, at departments that must enter large volumes of form data every day, such as postal services, taxation, customs and statistics. An industry-specific OCR system of this kind works with a relatively fixed format and a relatively small character set, and is often combined with dedicated input devices, so it is fast and efficient. An automatic mail sorting system is one example.
Products for recognizing handwritten manuscripts did not appear until 1996 and 1997, and they were offered as an additional function of products for printed text. Because people's writing habits differ so widely, unconstrained handwriting recognition is quite difficult. The main application of handwriting OCR is therefore online handwriting recognition, in which the computer recognizes characters in real time as the person writes them.
Second, the basic principle of OCR
Simply put, the basic principle of OCR is to feed an image of a manuscript into a computer through a scanner, after which the computer isolates the image of each character and converts it into a character code. In detail, the scanner uses a charge-coupled device (CCD) to convert the optical signal from the manuscript into an electrical signal, which an analog-to-digital converter turns into a digital signal and transmits to the computer. The computer receives the digital image of the manuscript, whose characters may be printed or handwritten, and then recognizes the characters in that image. For printed text, the document is first converted by optical means into a raw black-and-white bitmap image file, and the recognition software then converts the characters in the image into text for further handling by word processing software. Character recognition is the key technology in this process.
1. Two methods of OCR recognition
Like all other data in a computer, the image captured by a scanner is recorded as 0s and 1s: it is simply a series of sample points stored as binary values. An OCR program recognizes the characters on a page mainly through two methods, pattern matching and feature extraction.
Pattern matching compares each character, with some tolerance, against stored bitmaps of standard fonts and font sizes. If the application has a large database of stored characters, it selects the entry that best matches the input. The software must use some processing to find the most similar match, usually by repeatedly comparing the input with different versions of the same character. Some software can scan a page of text and use each character it identifies to define a new font. Other software recognizes as many characters on the page as it can with its own techniques and then has the user select or type in the characters it could not recognize.
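As a rough illustration of pattern matching, the sketch below compares a small binary glyph with stored templates and picks the one with the fewest differing pixels. The 3x5 templates, the `TEMPLATES` table and the function names are invented for this example; a real product would use a far larger template database and a more tolerant similarity measure.

```python
# Hypothetical 3x5 bitmap templates: '#' is ink, '.' is background.
TEMPLATES = {
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}

def mismatch(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def match_character(glyph, templates=TEMPLATES):
    """Return (label, mismatch_count) for the closest stored template."""
    scored = ((label, mismatch(glyph, t)) for label, t in templates.items())
    return min(scored, key=lambda pair: pair[1])

# A "7" with one stray ink pixel in the fourth row.
noisy_seven = ["###", "..#", ".#.", "##.", ".#."]
print(match_character(noisy_seven))  # ('7', 1)
```

The mismatch count doubles as a confidence value: a large count for the best template suggests the character should be handed to the user rather than accepted.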
Feature extraction decomposes each character into a number of distinct features, such as diagonal lines, horizontal lines and curves, and then matches those features against the features of known characters. As a simple example, if the application finds two horizontal lines, it will "think" that the character may be "two" (二). The advantage of feature extraction is that it can recognize a wide variety of fonts. Recognition of Chinese calligraphy, for instance, is done by feature extraction.
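The two-horizontal-lines example can be sketched as follows. The code counts horizontal strokes in a binary glyph and guesses a character from that single feature. The glyphs, the `min_fraction` threshold and the lookup table are assumptions made for illustration; real feature extraction uses many features at once.

```python
def count_horizontal_strokes(glyph, min_fraction=0.5):
    """Count groups of adjacent rows whose longest ink run spans
    at least min_fraction of the glyph width."""
    width = len(glyph[0])
    strokes, in_stroke = 0, False
    for row in glyph:
        longest = max(len(run) for run in row.split("."))
        is_stroke_row = longest >= min_fraction * width
        if is_stroke_row and not in_stroke:
            strokes += 1
        in_stroke = is_stroke_row
    return strokes

# Hypothetical mapping from the stroke-count feature to a guess.
GUESS_BY_STROKES = {1: "一", 2: "二", 3: "三"}

def guess_character(glyph):
    """Guess a character from its horizontal stroke count, or '?' if unknown."""
    return GUESS_BY_STROKES.get(count_horizontal_strokes(glyph), "?")

thin_two = [".....", ".###.", ".....", ".....", "#####", "....."]
bold_two = ["......", ".####.", ".####.", "......", "######", "######"]
print(guess_character(thin_two), guess_character(bold_two))  # 二 二
```

Both the thin and the bold glyph yield the same feature value, which is why this approach copes with different fonts better than pixel-by-pixel matching does.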
Most OCR applications now include intelligent grammar checking, which further improves the recognition rate. It corrects spelling and grammar mainly by checking context. During recognition the application performs many checks of contextual coherence, comparing the words in a character string against the phrases and fixed word orders stored in the program. More advanced applications automatically replace a wrong word with the word they "think" is right, so that the sentence makes sense.
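A minimal sketch of this kind of correction, assuming a small word list, is shown below. It uses `difflib.get_close_matches` from the Python standard library to replace a misrecognized word with its closest lexicon entry; the lexicon and the `cutoff` value are illustrative choices, not settings taken from any OCR product.

```python
import difflib

# Hypothetical lexicon; a real application would ship a full dictionary.
LEXICON = ["character", "recognition", "scanner", "optical", "software"]

def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Replace a word with its closest lexicon entry, or keep it if none is close."""
    if word in lexicon:
        return word
    close = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return close[0] if close else word

def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.split())

# '1' misread for 'l' and 'c' misread for 'e' are typical OCR confusions.
print(correct_text("optica1 charactcr recognition"))
# optical character recognition
```

Words with no sufficiently close entry are left unchanged, which avoids "correcting" names and numbers into dictionary words.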
2. Several steps of character recognition
Character recognition includes the following steps: image input, preprocessing, single-character recognition and post-processing.
(1) Image input
This means feeding a document into the computer through an input device, that is, digitizing the manuscript. The device in widest use today is the scanner. The scan quality of the document image is a prerequisite for correct recognition by the OCR software. Choosing a suitable scan resolution and related parameters is the key to keeping characters clear and preserving their features. The document should also be placed as straight as possible, so that the skew angle detected during preprocessing is small and the text image is distorted little by skew correction. These simple steps improve the system's recognition accuracy. Conversely, with improper scan settings, too many broken strokes can cause a character image to be split in half during segmentation. Broken strokes and strokes that stick together both lose features, so when the features are compared with the feature database the feature distance grows and the recognition error rate rises.
(2) Preprocessing
Image preprocessing is the process of taking the scanned image of a simple printed document, separating out the image of each character, and passing it to the recognition module. It covers the preparatory work done before character recognition, including cleaning the image to remove obvious noise (interference) from the original. Its main tasks are to measure the skew angle of the document, analyze the page layout, confirm the layout of the selected text regions, segment text lines in horizontal and vertical layouts, separate the character images within each line, and distinguish punctuation marks. This stage is very important, because its results directly affect the accuracy of character recognition.
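As one small example of image cleaning, the sketch below removes isolated ink pixels, that is, ink pixels with no ink among their eight neighbours. This is only one simple noise filter chosen for illustration, and the function name and image representation are assumptions; it would also erase legitimate one-pixel marks, so practical systems apply it with care.

```python
def remove_isolated_pixels(image):
    """Clear ink pixels ('#') that have no ink among their 8 neighbours."""
    height, width = len(image), len(image[0])

    def has_ink_neighbour(r, c):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < height and 0 <= cc < width:
                    if image[rr][cc] == "#":
                        return True
        return False

    return [
        "".join(
            "#" if image[r][c] == "#" and has_ink_neighbour(r, c) else "."
            for c in range(width)
        )
        for r in range(height)
    ]

# The speck in the top-left corner is noise; the connected stroke is kept.
noisy = ["#....", "..##.", "..#..", "....."]
for row in remove_isolated_pixels(noisy):
    print(row)
```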
Layout analysis examines the text image as a whole: it finds all the text blocks in the document and distinguishes text paragraphs and their reading order from image and table regions. The boundary of each text block (the coordinates of its start and end points in the image), its attributes (horizontal or vertical layout) and the links between text blocks are passed to the recognition module as a data structure for automatic recognition. Text regions are recognized directly, table regions receive special analysis and recognition, and image regions are compressed or simply stored. Line and character segmentation cuts the large image into lines first and then separates the individual characters from each line image.
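Line and character segmentation is commonly illustrated with projection profiles: rows without ink separate lines, and columns without ink separate characters within a line. The sketch below assumes horizontal text and a clean binary image, and its function names are invented for the example. Characters whose left and right parts are separated by a gap would be split in two by this simple version, so a practical system would also merge pieces using the expected character width.

```python
def ink_runs(profile):
    """Return (start, end) index pairs of consecutive non-zero profile entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment(image):
    """Split a binary page into lines, then each line into character boxes.

    Returns a list of lines; each line is a list of (top, bottom, left, right)
    boxes, with bottom and right exclusive.
    """
    row_profile = [row.count("#") for row in image]
    lines = []
    for top, bottom in ink_runs(row_profile):
        band = image[top:bottom]
        col_profile = [sum(row[c] == "#" for row in band)
                       for c in range(len(band[0]))]
        lines.append([(top, bottom, left, right)
                      for left, right in ink_runs(col_profile)])
    return lines

page = [
    "##.#..",
    "##.#..",
    "......",
    ".#..##",
]
print(segment(page))
# [[(0, 2, 0, 2), (0, 2, 3, 4)], [(3, 4, 1, 2), (3, 4, 4, 6)]]
```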
(3) Single-character recognition
Single-character recognition is the core technology of OCR. It is what lets the computer "read": it converts the character images found in the scanned text into the standard codes of those characters. The human brain recognizes characters because it has stored their various features, such as their structure and strokes. For a computer to recognize characters, information about them must likewise be stored in the computer first. Deciding what information to store and how to obtain it is a very complicated matter, and the result must reach a very high recognition rate to be useful. The usual practice is to analyze characters by their strokes, feature points, projection information and the regional distribution of their pixels.
Several thousand Chinese characters are in common use, and the recognition technique is feature comparison: by comparing the input with a recognition feature database, the system finds the character with the most similar features and outputs its standard code as the recognition result. Comparison is a basic way in which people come to know things, and Chinese character recognition likewise works by comparing characters to find their similarities and differences. Because the Chinese character set is large, systems generally use multi-level classification, multiple features and dynamic matching to find a set of similar candidates, which gives a high classification rate, strong adaptability and good stability. Fine classification then relies on similarity matching, weighting, structural discrimination, quantitative and qualitative analysis, and the relationship with the preceding and following characters to make the final decision. Chinese character recognition is essentially an application of cognitive science within artificial intelligence, and its key component is the recognition feature database. Only with such a database can a computer recognize characters.
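The coarse-then-fine comparison described above can be sketched as follows. The feature vector (horizontal stroke count, vertical stroke count, ink density), the two coarse classes and the tiny `FEATURE_DB` are all hypothetical values made up for illustration; they are not the features used by any actual product.

```python
import math

# Hypothetical feature database: label -> (coarse class, feature vector).
# Features: (horizontal strokes, vertical strokes, ink density).
FEATURE_DB = {
    "一": ("few_strokes", (1.0, 0.0, 0.1)),
    "二": ("few_strokes", (2.0, 0.0, 0.2)),
    "十": ("few_strokes", (1.0, 1.0, 0.2)),
    "日": ("many_strokes", (3.0, 2.0, 0.5)),
    "目": ("many_strokes", (4.0, 2.0, 0.6)),
}

def coarse_class(features):
    """Coarse classification: bucket by the total number of strokes."""
    return "few_strokes" if features[0] + features[1] <= 3 else "many_strokes"

def recognize(features, db=FEATURE_DB):
    """Fine classification: nearest entry (Euclidean distance) within the
    coarse class. Returns (label, distance)."""
    bucket = coarse_class(features)
    candidates = {label: vec for label, (cls, vec) in db.items() if cls == bucket}
    best = min(candidates, key=lambda label: math.dist(features, candidates[label]))
    return best, math.dist(features, candidates[best])

label, distance = recognize((3.1, 2.0, 0.45))
print(label)  # 日
```

The coarse step keeps the fine comparison from scanning the whole database, which matters when the database holds thousands of characters rather than five.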
The layout of a document image may contain tables as well as text and pictures. To digitize a recognized table, the table regions need special treatment during layout analysis: extracting the structure of the table lines, ordering the text fields within the table, recognizing the table lines and the text fields, and generating output in different file formats from the digitized table lines. Tables in documents are arbitrary and varied, may be closed or open, and may contain diagonal lines, all of which makes table analysis difficult.
(4) Post-processing
Post-processing matches the recognized character, or its several candidate results, against its neighbours to form phrases. That is, the single-character recognition results are segmented into words and compared with the phrases in a lexicon, which raises the system's recognition rate and lowers its error rate.
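A minimal sketch of this idea, assuming the recognizer returns a ranked list of candidates for each character position, is given below. It tries combinations of candidates and keeps the first one found in a phrase lexicon. The `PHRASES` set and the candidate lists are invented for the example, and trying every combination is practical only for short phrases.

```python
from itertools import product

# Hypothetical phrase lexicon.
PHRASES = {"识别", "文字", "扫描"}

def pick_phrase(candidates, phrases=PHRASES):
    """Choose among per-character candidate lists using a phrase lexicon.

    candidates[i] lists the recognizer's guesses for position i, best first.
    Falls back to the top guess at every position if no phrase matches.
    """
    for combo in product(*candidates):
        phrase = "".join(combo)
        if phrase in phrases:
            return phrase
    return "".join(position[0] for position in candidates)

# The top guess for the second character is wrong, but the lexicon fixes it.
print(pick_phrase([["识", "织"], ["刖", "别"]]))  # 识别
```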
Chinese character recognition is the most difficult problem in the field of character recognition. It is a comprehensive technology that draws on pattern recognition, image processing, digital signal processing, natural language understanding, artificial intelligence, fuzzy mathematics, information theory, computer science, Chinese information processing and other disciplines. In recent years the correct recognition rate of printed Chinese character recognition systems has exceeded 95%. To raise the overall recognition rate further, image scanning, image preprocessing and post-processing techniques have also been studied in depth, and considerable progress has been made, effectively improving the overall performance of these systems. Tsinghua University has made outstanding achievements in this field and has become one of the most authoritative institutions in the world. At present, all Thunis scanners come with Tsinghua OCR Millennium Edition software, which has reached a high level in recognition rate, table recognition and even recognition of neat handwriting.
Third, OCR text recognition tips
In recent years OCR technology has developed rapidly as scanners have become common, and scanning and recognition software has grown steadily more capable and more intelligent. Even so, getting correct scans quickly and entering text efficiently requires studying the relevant knowledge carefully and combining it with practical experience to work out a complete approach of your own. Sometimes the recognition rate we get is very low and falls well short of the 95% or more that the software claims. Do not rush to blame the hardware or software. The usual reason is that we have not yet mastered the techniques of scanning and OCR.
The following are some methods and techniques commonly used in character recognition operations.
1. The resolution setting is an important prerequisite for character recognition. Generally speaking, the more image information the scanner provides, the more easily the recognition software obtains a result. However, a higher scan resolution does not always mean higher recognition accuracy. A resolution of 300 dpi or 400 dpi suits most documents. When scanning an original for recognition, do not set the scan resolution above the optical resolution of the scanner, or the result will not be worth the cost. The following typical settings are for reference only.
(1) For text in size No. 1, No. 2 or No. 3 type, 200 dpi is recommended.
(2) For text in small No. 4 and No. 5 type, 300 dpi is recommended.
(3) For text in small No. 5 and No. 6 type, 400 dpi is recommended.
(4) For text in No. 7 and No. 8 type, 600 dpi is recommended.
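The reference settings above can be written as a small lookup that also respects the rule about not exceeding the scanner's optical resolution. The size labels used as keys are one reading of the list and should be checked against it; the function name and the 300 dpi fallback for unlisted sizes are assumptions for this sketch.

```python
# Recommended scan resolution in dpi by Chinese type size.
# The size labels are an assumed reading of the reference list.
RECOMMENDED_DPI = [
    ({"1", "2", "3"}, 200),
    ({"small 4", "5"}, 300),
    ({"small 5", "6"}, 400),
    ({"7", "8"}, 600),
]

def scan_resolution(type_size, optical_dpi):
    """Return the recommended dpi for a type size, capped at the
    scanner's optical resolution."""
    for sizes, dpi in RECOMMENDED_DPI:
        if type_size in sizes:
            return min(dpi, optical_dpi)
    # Unlisted size: fall back to 300 dpi, which suits most documents.
    return min(300, optical_dpi)

print(scan_resolution("2", 1200))  # 200
print(scan_resolution("8", 400))   # 400, capped by the optical resolution
```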
2. Adjust the brightness and contrast properly when scanning so that the scanned document is cleanly black and white. This is key to the recognition rate. The guiding principle is that the thin strokes of the Chinese characters in the scanned image should be visible but unbroken. Before recognition, check the quality of the characters in the scanned image. If the image contains black specks or blotches, or the strokes are so thick and dark that they cannot be told apart, the brightness is too low, so raise it and scan again. If the strokes are uneven or broken, or the outlines of the characters are badly incomplete, the brightness is too high, so lower it and scan again.
3. Choose your software. Choosing good OCR software that suits you is the basis of good text recognition. As a rule, do not use the OEM software that comes with the scanner: OEM OCR software has few functions and gives poor results, and some versions cannot recognize Chinese at all. After comparing several packages, I find that Thunis OCR2003 Professional Edition and the Shangshu OCR6.0 automatic text recognition and input system stand out in recognition ability and features. Also choose an image editing program. OCR software has its own scanning interface, so why look for image software as well? First, OCR software does not support every scanner. Second, and most importantly, images scanned through the interface of an image editing program are easy to process afterwards. PHOTOSHOP is the usual choice.
4. If the text to be scanned contains formatting, such as bold, italics or first-line indents, some OCR software will not recognize it, and the formatting will be lost or turned into garbled text. If you must scan formatted text, confirm in advance that your recognition software supports formatted text. You can also turn off format recognition so that the software concentrates on finding the correct characters and ignores the typeface and font formatting.