Merge 618e130ec8
into 59bafb8d4d
This commit is contained in:
commit
3c301920a1
|
@ -0,0 +1,423 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"*Note: the above pictures are from the Internet*\n",
|
||||
"\n",
|
||||
"# 1. OCR Technical Background\n",
|
||||
"## 1.1 Application scenarios of OCR technology\n",
|
||||
"\n",
|
||||
"* **<font color=red>What is OCR?</font>**\n",
|
||||
"\n",
|
||||
"OCR(Optical Character Recognition)is one of the important directions of computer vision. Traditionally defined OCR is generally oriented to scanned document objects. Now we often say OCR generally refers to scene Text Recognition (Scene Text Recognition, STR), which is mainly oriented to natural scenes, such as the text visible in various natural scenes such as the plaque shown in the figure below.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure1 Document scene character recognition vs. natural scene character recognition</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"* **<font color=red>What are the application scenarios of OCR?</font>**\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"OCR technology has rich application scenarios. A typical scenario is vertical oriented structured Text Recognition widely used in daily life, such as license plate recognition, bank card information recognition, ID card information recognition, train ticket information recognition and so on. The common feature of these small vertical classes is that the format is fixed, so it is very suitable to use OCR technology for automation, which can greatly reduce labor cost and improve efficiency.\n",
|
||||
"\n",
|
||||
"This vertical class oriented structured Text Recognition is the scene where OCR is most widely used and the technology is relatively mature.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 2 Application scenario of OCR technology</center>\n",
|
||||
"\n",
|
||||
"In addition to vertical class oriented structured Text Recognition, general OCR technology is also widely used, and often combined with other technologies to complete multimodal tasks. For example, in video scenes, OCR technology is often used for automatic subtitle translation, content security monitoring, etc, or combined with visual features to complete video understanding, video search and other tasks.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 3 General OCR in multimodal scenario</center>\n",
|
||||
"\n",
|
||||
"## 1.2 1.2 OCR technical challenges\n",
|
||||
"The technical difficulties of OCR can be divided into two aspects: algorithm layer and application layer.\n",
|
||||
"\n",
|
||||
"* **<font color=red>Algorithm Layer</font>**\n",
|
||||
"\n",
|
||||
"The rich application scenarios of OCR determine that it will have many technical difficulties. Here are 8 common questions:\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 4 Technical difficulties of OCR algorithm layer</center>\n",
|
||||
"\n",
|
||||
"These problems have brought great technical challenges to text detection and Text Recognition. It can be seen that these challenges are mainly oriented to natural scenes. At present, the research in academic circles mainly focuses on natural scenes, and the commonly used datasets in OCR field are also natural scenes. There are many researches on these problems. Relatively speaking, recognition faces greater challenges than detection.\n",
|
||||
"\n",
|
||||
"* **<font color=red>Application Layer</font>**\n",
|
||||
"\n",
|
||||
"In practical applications, especially in a wide range of general scenarios, in addition to the technical difficulties at the algorithm level such as affine transformation, scale problem, insufficient illumination and shooting blur summarized in the previous section, OCR technology also faces two landing difficulties:\n",
|
||||
"\n",
|
||||
"1. **Massive data requires OCR to process in real time** OCR applications often connect with massive data. We require or hope that the data can be processed in real time. It is a big challenge to achieve real-time model speed.\n",
|
||||
"\n",
|
||||
"2. **End side application requires OCR model to be light enough and recognition speed to be fast enough** OCR applications are often deployed on mobile terminals or embedded hardware. There are generally two modes for end-side OCR applications: upload to the server vs. end-side direct identification. Considering that the way of uploading to the server requires the network, low real-time performance, high pressure on the server when the amount of requests is too large, and the security of data transmission, we hope to complete OCR identification directly on the end side, The storage space and computing power on the end side are limited, so there are high requirements for the size and prediction speed of OCR model.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 5 technical difficulties of OCR application layer</center>\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 2.OCR Frontier Algorithm\n",
|
||||
"Although OCR is a relatively specific task, it involves many technologies, including text detection, Text Recognition, end-to-end Text Recognition, document analysis and so on. Academic researches on various related technologies of OCR emerge one after another. The following will briefly introduce the related work of several key technologies in OCR task.\n",
|
||||
"\n",
|
||||
"## 2.1 Text Detection\n",
|
||||
"\n",
|
||||
"The task of text detection is to locate the text area in the input image. In recent years, the academic research on text detection is very rich. One kind of methods regard text detection as a specific scene in target detection, and improve the adaptation based on the general target detection algorithm. For example, textboxes [1] adjusts the target frame to fit the text line with extreme aspect ratio based on the one-stage target detector SSD [2], Ctpn [3] is improved based on the fast RCNN [4] architecture. However, there are still some differences between text detection and target detection in target information and task itself. For example, text is generally large in length and width, often in a \"strip\", and text lines may be dense, bending text, etc. Therefore, many algorithms dedicated to text detection have been derived, such as East [5], psenet [6], dbnet [7], etc.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/548b50212935402abb2e671c158c204737c2c64b9464442a8f65192c8a31b44d\" width=\"500\"></center>\n",
|
||||
"<center>Figure 6 Example of text detection task</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"At present, the more popular text detection algorithms can be roughly divided into two categories: **based on regression** and **based on segmentation**, and some algorithms combine the two. The regression based algorithm draws lessons from the general object detection algorithm. By setting the anchor regression detection box or directly doing pixel regression, this kind of method has better detection effect on regular shaped text, but relatively poor detection effect on irregular shaped text. For example, ctpn [3] has better detection effect on horizontal text, but poor detection effect on inclined and curved text, Seglink [8] is good for long text, but poor for text with sparse distribution; The segmentation based algorithm introduces mask RCNN [9], which can achieve a higher level in the detection effect of various scenes and texts of various shapes, but the disadvantage is that the post-processing is generally complex, so there are often speed problems and can not solve the detection problem of overlapping texts.\n",
|
||||
"\n",
|
||||
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/4f4ea65578384900909efff93d0b7386e86ece144d8c4677b7bc94b4f0337cfb\" width=\"800\"></center>\n",
|
||||
"<center>Figure 7 Overview of text detection algorithm</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"||\n",
|
||||
"|---|---|---|\n",
|
||||
"<center>Figure 8 (left) ctpn [3] algorithm optimization based on regression anchor (middle) DB [7] algorithm optimization based on segmentation post-processing (right) SAST [10] algorithm based on regression + segmentation</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The technologies related to text detection will be explained and practiced in detail in Chapter 2.\n",
|
||||
"\n",
|
||||
"## 2.2 Text Recognition\n",
|
||||
"\n",
|
||||
"The task of Text recognition is to recognize the text content in the image. Generally, input the image text area cut from the text box obtained by text detection. Text Recognition can generally be divided into **regular text recognition** and **irregular text recognition** according to the shape of the text to be recognized. Regular text mainly refers to printed font, scanned text, etc., and the text is roughly in the horizontal line position; Irregular text is often not in the horizontal position, and there are problems such as bending, occlusion, blur and so on. Irregular text scene is very challenging, and it is also the main research direction in the field of text tecognition.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 9 (left) regular text vs. (right) irregular text</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"According to different decoding methods, the algorithms of regular Text Recognition can be roughly divided into two types: CTC based and sequence2sequence based. The processing methods of transforming the sequence features learned by the network into the final recognition results are different. The CTC based algorithm is represented by the classical crnn [11].\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 10 CTC based recognition algorithm vs. attention based recognition algorithm</center>\n",
|
||||
"\n",
|
||||
"Irregular Text Recognition algorithms are more abundant than others. For example, star net [12] and other methods correct irregular text into regular rectangles by adding TPS and other correction modules; Rare [13] and other attention based methods have enhanced the attention to the correlation between various parts of the sequence; The segmentation based method takes each character of the text line as an independent individual, which is easier to recognize the segmented single character than the recognition after correcting the whole text line; In addition, with the rapid development of transfomer [14] and its effectiveness verification in various tasks in recent years, a number of Text Recognition algorithms based on transformer have emerged. These methods use transformer structure to solve the limitations of CNN in long dependency modeling, and have achieved good results.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure. 11 recognition algorithm based on character segmentation [15]</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The technologies related to Text Recognition will be explained and practiced in detail in Chapter 3.\n",
|
||||
"\n",
|
||||
"## 2.3 Document Structure Identification\n",
|
||||
"\n",
|
||||
"OCR technology in the traditional sense can meet the requirements of text detection and recognition, but in practical application scenarios, the final information to be obtained is often structured information, such as ID card and invoice information formatting and extraction, structured identification of forms, etc., mostly in express document extraction, contract content comparison, financial factoring sheet information comparison It is applied in scenarios such as document identification in logistics industry. OCR result + post-processing is a common structured scheme, but the process is often complex, and the post-processing needs fine design and poor generalization. With the gradual maturity of OCR technology and the increasing demand for structured information extraction, various technologies on intelligent document analysis, such as layout analysis, table recognition and key information extraction, have attracted more and more attention and research.\n",
|
||||
"\n",
|
||||
"* **Layout Analysis**\n",
|
||||
"\n",
|
||||
"Layout analysis is mainly used to classify the content of document images. Generally, the categories can be divided into plain text, title, table, picture, etc. Existing methods generally detect or segment different plates in the document as different targets. For example, Soto Carlos [16] improves the region detection performance by combining the context information and using the inherent location information of the document content on the basis of the target detection algorithm fast r-cnn; Sarkar mausoom [17] and others proposed a priori based segmentation mechanism to train the document segmentation model on very high-resolution images, which solves the problem that different structures of dense regions cannot be distinguished and then merged due to excessive reduction of the original image.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 12 Schematic Diagram of Layout Analysis Task</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"* **Table Recognition**\n",
|
||||
"\n",
|
||||
"The task of table recognition is to recognize and convert the table information in the document into excel file. The types and styles of tables in text images are complex and diverse, such as different row and column merging, different content text types, etc. in addition, the style of documents and the lighting environment during shooting have brought great challenges to table recognition. These challenges make table recognition always a research difficulty in the field of document understanding.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 13 Schematic diagram of table recognition task</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"There are many kinds of methods for table recognition. The early traditional algorithms based on heuristic rules, such as t-rect and other algorithms proposed by kieninger [18] and others, are generally processed through manual design rules and connected domain detection and analysis; In recent years, with the development of deep learning, some CNN based table structure recognition algorithms have begun to emerge, such as deep tabstr proposed by Siddiqui Shoaib Ahmed [19] and tabstruct net proposed by Raja Sachin [20]; In addition, with the rise of graph neural network, some researchers try to apply graph neural network to table structure recognition. Based on graph neural network, table recognition is regarded as a graph reconstruction problem, such as tgrnet proposed by Xue Wenyuan [21]; The end-to-end method directly uses the network to complete the HTML representation output of the table structure. Most end-to-end methods use the seq2seq method to complete the prediction of the table structure, such as some methods based on attention or transformer, such as tablemaster [22].\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Fig. 14 Schematic diagram of table recognition method</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"* **Key Information Extraction**\n",
|
||||
"\n",
|
||||
"Key information extraction (KIE) is an important task in document VQA. It mainly extracts the required key information from the image, such as the name and ID number information extracted from the ID card. The types of such information are often fixed under specific tasks, but they are different between different tasks.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 15 Schematic diagram of docvqa tasks</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"KIE is usually divided into two sub tasks:\n",
|
||||
"\n",
|
||||
"-Ser: semantic entity recognition, which classifies each detected text, such as name and ID card. See the black box and red box in the figure below.\n",
|
||||
"-Re: relation extraction, which classifies each detected text, such as dividing it into questions and answers. Then find the corresponding answer to each question. As shown in the figure below, the red box and black box represent the question and answer respectively, and the yellow line represents the corresponding relationship between the question and answer.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 16 Ser and re tasks</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"General KIE methods are based on named entity recognition (NER) [4], but these methods only use the text information in the image and lack the use of visual and structural information, so the accuracy is not high. On this basis, the methods in recent years have begun to integrate visual and structural information with text information. According to the principle adopted in the fusion of multimodal information, these methods can be divided into the following four types:\n",
|
||||
"\n",
|
||||
"-Grid based method\n",
|
||||
"-Token based method\n",
|
||||
"-GCN based method\n",
|
||||
"-End to end based method\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The related technologies of document analysis will be explained and practiced in detail in Chapter 6.\n",
|
||||
"\n",
|
||||
"## 2.4 Other related technologies\n",
|
||||
"Three key technologies in the field of OCR are mainly introduced: text detection, Text Recognition and document structured recognition. More other cutting-edge technologies related to OCR are introduced, including end-to-end Text Recognition, image preprocessing technology in OCR, OCR data synthesis, etc. Please refer to Chapter 7 and Chapter 8 of the tutorial.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 3.Industrial practice of OCR technology\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
">You are Xiao Wang. What should I do?\n",
|
||||
"> 1. I won't, I can't, I quit 😭\n",
|
||||
"> 2. Suggest the boss find an outsourcing company or a commercialization scheme. Anyway, it will cost the boss's money 😊\n",
|
||||
"> 3. Find similar projects online and program for GitHub 😏\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"OCR technology will eventually fall into industrial practice. Although there are many academic studies on OCR technology, and the commercial application of OCR technology has been relatively mature compared with other AI technologies, there are still some difficulties and challenges in the actual industrial application. The following will be analyzed from the perspectives of technology and industrial practice.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## 3.1 Difficulties in industrial practice\n",
|
||||
"In actual industrial practice, developers often need to rely on open source community resources to start or promote projects, and developers often face three major problems when using open source models:\n",
|
||||
"\n",
|
||||
"<center>Figure 17 three problems in OCR technology industry practice</center>\n",
|
||||
"\n",
|
||||
"**1. Not found or selected**\n",
|
||||
"\n",
|
||||
"The open source community is rich in resources, but the information asymmetry leads to developers' inability to effectively solve the pain point problem. On the one hand, the resources of the open source community are too rich, and developers can't quickly find projects matching business requirements from a large number of code warehouses, that is, there is a problem of \"not finding\"; On the other hand, in algorithm selection, the indicators on the English public dataset can not provide a direct reference to the Chinese scenes often faced by developers. Algorithm by algorithm verification requires a lot of time and manpower, and it can not guarantee to select the most appropriate algorithm, that is, \"can not be selected\".\n",
|
||||
"\n",
|
||||
"**2.Not applicable to industrial scenarios**\n",
|
||||
"\n",
|
||||
"开源社区中的工作往往更多地偏向效果优化,如学术论文代码开源或复现,一般更侧重算法效果,平衡考虑模型大小和速度的工作相比就少很多,而模型大小和预测耗时在产业实践中是两项不容忽视的指标,其重要程度不亚于模型效果。无论是移动端和服务器端,待识别的图像数目往往非常多,都希望模型更小,精度更高,预测速度更快。GPU太贵,最好使用CPU跑起来更经济。在满足业务需求的前提下,模型越轻量占用的资源越少。\n",
|
||||
"\n",
|
||||
"**3. Difficult optimization and many training deployment problems**\n",
|
||||
"\n",
|
||||
"The open source community is rich in resources, but the information asymmetry leads to developers' inability to effectively solve the pain point problem. On the one hand, the resources of the open source community are too rich, and developers can't quickly find projects matching business requirements from a large number of code warehouses, that is, there is a problem of \"not finding\"; On the other hand, in algorithm selection, the indicators on the English public dataset can not provide a direct reference to the Chinese scenes often faced by developers. Algorithm by algorithm verification requires a lot of time and manpower, and it can not guarantee to select the most appropriate algorithm, that is, \"can not be selected\".\n",
|
||||
"\n",
|
||||
"## 3.2 产业级OCR开发套件PaddleOCR\n",
|
||||
"\n",
|
||||
"OCR industry practice needs a set of complete and whole process solutions to speed up R & D progress and save valuable R & D time. In other words, the ultra lightweight model and its whole process solution are just needed, especially for mobile terminals and embedded devices with limited computing power and storage space.\n",
|
||||
"\n",
|
||||
"In this context, the industrial OCR development kit [paddleocr]( https://github.com/PaddlePaddle/PaddleOCR )came into being.\n",
|
||||
"\n",
|
||||
"The construction idea of paddleocr starts from the user portrait and needs, relying on the core framework of the propeller, selects and replicates rich cutting-edge algorithms, develops PP characteristic models more suitable for industrial landing based on the reproduced algorithms, integrates training and promotion, and provides a variety of prediction deployment methods to meet the different demand scenarios of practical application.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 18 Panorama of PaddleOCR Development Kit</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"As can be seen from the panorama, paddleocr relies on the core framework of the propeller, provides rich solutions at the levels of model algorithm, pre training model library and industrial deployment, and provides data synthesis and semi-automatic data annotation tools to meet the data production needs of developers.\n",
|
||||
"\n",
|
||||
"**At the model algorithm level**, paddleocr provides solutions for **text detection and recognition** and **document structural analysis**. In terms of text detection and recognition, paddleocr reproduces or opens source four text detection algorithms, eight Text Recognition algorithms and one end-to-end Text Recognition algorithm, and develops PP-OCR series general text detection and recognition solutions on this basis; In the aspect of document structure analysis, paddleocr provides algorithms such as layout analysis, table recognition, key information extraction and named entity recognition, and puts forward the PP structure document analysis solution. Rich selection algorithms can meet the needs of developers in different business scenarios. The unification of code framework also facilitates developers to optimize and compare the performance of different algorithms.\n",
|
||||
"\n",
|
||||
"**At the model algorithm level**, paddleocr provides solutions for **text detection and recognition** and **document structural analysis**. In terms of text detection and recognition, paddleocr reproduces or opens source four text detection algorithms, eight Text Recognition algorithms and one end-to-end Text Recognition algorithm, and develops PP-OCR series general text detection and recognition solutions on this basis; In the aspect of document structure analysis, paddleocr provides algorithms such as layout analysis, table recognition, key information extraction and named entity recognition, and puts forward the PP structure document analysis solution. Rich selection algorithms can meet the needs of developers in different business scenarios. The unification of code framework also facilitates developers to optimize and compare the performance of different algorithms.\n",
|
||||
"\n",
|
||||
"**At the industrial deployment level**, paddleocr provides a server-side prediction scheme based on paddle inference, a service-oriented deployment scheme based on paddle serving, and an end-side deployment scheme based on paddle Lite to meet the deployment requirements in different hardware environments. At the same time, it provides a model compression scheme based on paddleslim, which can further compress the model size. The above deployment methods have completed the whole process of training and promotion, so as to ensure that developers can deploy efficiently, stably and reliably.\n",
|
||||
"\n",
|
||||
"**At the data tool level**, paddleocr provides a semi-automatic data annotation tool ppocrlabel and a data synthesis tool style text to help developers more conveniently train the data sets and annotation information required for production models. As the first open source semi-automatic OCR data annotation tool in the industry, ppocrlabel has built-in PP-OCR model to realize the annotation mode of pre annotation + manual verification, which can greatly improve the annotation efficiency and save labor cost. The data synthesis tool style text mainly solves the problem that the real data of the actual scene is seriously insufficient, and the traditional synthesis algorithm can not synthesize the text style (font, color, spacing and background). Only a few target scene images are needed, and a large number of text images similar to the style of the target scene can be synthesized in batches.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 19 Schematic diagram of ppocrlabel</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 20 Example of style text synthesis effect</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"### 3.2.1 PP-OCR and PP-Structrue\n",
|
||||
"\n",
|
||||
"PP series characteristic model is a model that deeply optimizes each visual development kit of the propeller according to the needs of industrial practice, striving to balance speed and precision. PP series feature models in paddleocr include PP-OCR series models for text detection and recognition tasks and PP structure series models for document analysis.\n",
|
||||
"\n",
|
||||
"**(1) PP-OCR Chinese English Model**\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 21 example of PP-OCR Chinese and English model recognition results</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The typical two-stage OCR algorithm used in PP-OCR Chinese and English model is the composition of detection model + recognition model. The specific algorithm framework is as follows:\n",
|
||||
"\n",
|
||||
"<center>Figure 22 pipeline diagram of PP-OCR system</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"It can be seen that in addition to input and output, PP-OCR core framework includes three modules: text detection module, detection box correction module and Text Recognition module.\n",
|
||||
"-Text detection module: the core is based on [DB]( https://arxiv.org/abs/1911.08947 )The text detection model trained by the detection algorithm detects the text region in the image;\n",
|
||||
"-Detection box correction module: input the detected text box into the detection box correction module. At this stage, the text box represented by four points is corrected into a rectangular box to facilitate subsequent Text Recognition. On the other hand, it will judge and correct the text direction. For example, if it is judged that the text line is inverted, it will become a positive, This function is realized by training a text direction classifier;\n",
|
||||
"-Text Recognition module: finally, the Text Recognition module performs Text Recognition on the corrected detection box to obtain the text content in each text box. The classical Text Recognition algorithm [crnn] used in PP-OCR( https://arxiv.org/abs/1507.05717 )。\n",
|
||||
"\n",
|
||||
"Paddleocr successively launched PP-OCR [23] and PP-OCRv2 [24] models.\n",
|
||||
"\n",
|
||||
"PP-OCR model is divided into mobile version (lightweight version) and server version (general version). The mobile version model is mainly optimized based on the lightweight backbone network mobilenetv3. The size of the optimized model (detection model + text direction classification model + recognition model) is only 8.1m. The average prediction time of a single image on CPU is 350ms, and about 110ms on T4 GPU. After cutting and quantization, It can be further compressed to 3.5m with the same accuracy, which is convenient for end-to-side deployment. The test and prediction time on Xiaolong 855 is only 260ms. More PP-OCR evaluation data can be referred to[benchmark](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.2/doc/doc_ch/benchmark.md)。\n",
|
||||
"\n",
|
||||
"PP-OCRv2 maintains the overall framework of PP-OCR and mainly makes further strategy optimization in effect. The improvement includes three aspects:\n",
|
||||
"- In terms of model effect, compared with PP-OCR mobile version, it has been improved by more than 7%;\n",
|
||||
"- In terms of speed, compared with the PP-OCR server version, it has been improved by more than 220%;\n",
|
||||
"- In terms of model size, the total size of 11.6m can be easily deployed on the server side and mobile side.\n",
|
||||
"\n",
|
||||
"The specific optimization strategies of PP-OCR and PP-OCRv2 will be explained in detail in Chapter 4.\n",
|
||||
"In addition to Chinese and English models, paddleocr also trains and opens source English digital models and multilingual recognition models based on different data sets. The above models are ultra lightweight and suitable for different language scenarios.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 23 schematic diagram of recognition effect of English digital model and multilingual model of PP-OCR</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"**(2)PP-Structure Document analysis Model**\n",
|
||||
"\n",
|
||||
"PP-Structure Support layout analysis 、 table recognition 、 DocVQA Three sub tasks.\n",
|
||||
"\n",
|
||||
"The core function points of PP structure are as follows:\n",
|
||||
"- It supports layout analysis of documents in the form of pictures, and can be divided into five areas: text, title, table, picture and list (used in combination with layout parser)\n",
|
||||
"- Support text, title, picture and list area extraction as text fields (used in combination with PP-OCR)\n",
|
||||
"- Support structured analysis in the table area, and output the final result to excel file\n",
|
||||
"- It supports Python WHL package and command line, which is simple and easy to use\n",
|
||||
"- It supports two types of task customization training: layout analysis and table structure\n",
|
||||
"- Support VQA tasks - Ser and re\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 24 schematic diagram of PP structure system (this figure only includes layout analysis + table identification)</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The specific scheme of PP structure will be explained in detail in Chapter 6.\n",
|
||||
"\n",
|
||||
"### 3.2.2 Industrial Deployment Scheme\n",
|
||||
"\n",
|
||||
"The flying propeller supports the whole process and the whole scene reasoning and deployment. The source of the model is mainly divided into three kinds. The first one is to use PaddlePaddle API to build the network structure for training, the second is based on the flying propeller assembly series, and the flying propeller kit provides a rich model library, simple and easy to use API, and has the open box, namely the visual model library PaddleCV, Intelligent speech library paddlespeech and natural language processing library paddlenlp, etc. the third model is produced from the third-party framework (pytorh, onnx, tensorflow, etc.) using x2paddle tool.\n",
|
||||
"\n",
|
||||
"A Paddle model can be compressed, quantized, and distilled with the PaddleSlim tool. Five deployment options are supported: Paddle Serving for service deployment, Paddle Inference for servers and the cloud, Paddle Lite for mobile and edge devices, and Paddle.js for web front ends; for hardware that Paddle does not support, such as MCUs, Horizon, Kunyun, and other domestic chips, Paddle2ONNX can convert the model to ONNX for use with third-party frameworks that support it.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 25 Deployment options supported by PaddlePaddle</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"Paddle Inference supports server-side and cloud deployment with high performance and versatility, and it has been deeply adapted and optimized for different platforms and application scenarios. As PaddlePaddle's native inference library, it ensures that a trained model can be used directly on the server side and deployed quickly. It is suited to deploying complex models on high-performance hardware from applications written in various languages. Supported hardware includes x86 CPUs, NVIDIA GPUs, Baidu Kunlun XPUs, Huawei Ascend, and other AI accelerators.\n",
|
||||
"Paddle Lite is a lightweight, high-performance inference engine for end devices that has been deeply adapted and optimized for them and for a range of application scenarios. It currently supports Android, iOS, embedded Linux devices, macOS, and other platforms. Supported hardware includes ARM CPUs and GPUs, x86 CPUs, and newer hardware such as Baidu Kunlun, Huawei Ascend and Kirin, and Rockchip.\n",
|
||||
"Paddle Serving is a high-performance serving framework designed to help users deploy models in the cloud in a few steps. It currently supports custom pre-processing and post-processing, model composition, hot loading and updating of models, multi-machine, multi-card, and multi-model deployment, distributed inference, Kubernetes deployment, security gateways, and encrypted model deployment, and it supports access from clients in multiple languages. The Paddle Serving project also provides deployment examples for more than 40 models, including PaddleOCR, to help users get started quickly.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"<center>Figure 26 Deployment options supported by PaddlePaddle</center>\n",
|
||||
"\n",
|
||||
"<br>\n",
|
||||
"\n",
|
||||
"The deployment options above are explained and practiced in detail in Chapter 5, based on the PP-OCRv2 model."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# 4. Summary\n",
|
||||
"This section first introduces the application scenarios and cutting-edge algorithms of OCR technology, and then analyzes the difficulties and three challenges of OCR technology in industrial practice.\n",
|
||||
"\n",
|
||||
"The contents of subsequent chapters of this tutorial are arranged as follows:\n",
|
||||
"\n",
|
||||
"* Chapters 2 and 3 introduce text detection and text recognition techniques and practice, respectively;\n",
|
||||
"* Chapter 4 introduces the PP-OCR optimization strategies;\n",
|
||||
"* Chapter 5 covers inference deployment in practice;\n",
|
||||
"* Chapter 6 introduces document structuring;\n",
|
||||
"* Chapter 7 introduces other OCR-related algorithms, such as end-to-end recognition, data preprocessing, and data synthesis;\n",
|
||||
"* Chapter 8 introduces OCR-related datasets and data synthesis tools.\n",
|
||||
"\n",
|
||||
"# Reference\n",
|
||||
"\n",
|
||||
"[1] Liao, Minghui, et al. \"Textboxes: A fast text detector with a single deep neural network.\" Thirty-first AAAI conference on artificial intelligence. 2017.\n",
|
||||
"\n",
|
||||
"[2] Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//European conference on computer vision. Springer, Cham, 2016: 21-37.\n",
|
||||
"\n",
|
||||
"[3] Tian, Zhi, et al. \"Detecting text in natural image with connectionist text proposal network.\" European conference on computer vision. Springer, Cham, 2016.\n",
|
||||
"\n",
|
||||
"[4] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems, 2015, 28: 91-99.\n",
|
||||
"\n",
|
||||
"[5] Zhou, Xinyu, et al. \"East: an efficient and accurate scene text detector.\" Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.\n",
|
||||
"\n",
|
||||
"[6] Wang, Wenhai, et al. \"Shape robust text detection with progressive scale expansion network.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
|
||||
"\n",
|
||||
"[7] Liao, Minghui, et al. \"Real-time scene text detection with differentiable binarization.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.\n",
|
||||
"\n",
|
||||
"[8] Deng, Dan, et al. \"Pixellink: Detecting scene text via instance segmentation.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.\n",
|
||||
"\n",
|
||||
"[9] He K, Gkioxari G, Dollár P, et al. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.\n",
|
||||
"\n",
|
||||
"[10] Wang P, Zhang C, Qi F, et al. A single-shot arbitrarily-shaped text detector based on context attended multi-task learning[C]//Proceedings of the 27th ACM international conference on multimedia. 2019: 1277-1285.\n",
|
||||
"\n",
|
||||
"[11] Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.\n",
|
||||
"\n",
|
||||
"[12] Star-Net Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.\n",
|
||||
"\n",
|
||||
"[13] Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4168-4176).\n",
|
||||
"\n",
|
||||
"[14] Sheng, F., Chen, Z., & Xu, B. (2019, September). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 781-786). IEEE.\n",
|
||||
"\n",
|
||||
"[15] Lyu P, Liao M, Yao C, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83.\n",
|
||||
"\n",
|
||||
"[16] Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.\n",
|
||||
"\n",
|
||||
"[17] Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.\n",
|
||||
"\n",
|
||||
"[18] Kieninger T, Dengel A. A paper-to-HTML table converting system[C]//Proceedings of document analysis systems (DAS). 1998, 98: 356-365.\n",
|
||||
"\n",
|
||||
"[19] Siddiqui S A, Fateh I A, Rizvi S T R, et al. Deeptabstr: Deep learning based table structure recognition[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1403-1409.\n",
|
||||
"\n",
|
||||
"[20] Raja S, Mondal A, Jawahar C V. Table structure recognition using top-down and bottom-up cues[C]//European Conference on Computer Vision. Springer, Cham, 2020: 70-86.\n",
|
||||
"\n",
|
||||
"[21] Xue W, Yu B, Wang W, et al. TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition[J]. arXiv preprint arXiv:2106.10598, 2021.\n",
|
||||
"\n",
|
||||
"[22] Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.\n",
|
||||
"\n",
|
||||
"[23] Du Y, Li C, Guo R, et al. PP-OCR: A practical ultra lightweight OCR system[J]. arXiv preprint arXiv:2009.09941, 2020.\n",
|
||||
"\n",
|
||||
"[24] Du Y, Li C, Guo R, et al. PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System[J]. arXiv preprint arXiv:2109.03144, 2021.\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "p2",
|
||||
"language": "python",
|
||||
"name": "p2"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.11"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
|
@ -0,0 +1,180 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Text Detection FAQ\n",
|
||||
"\n",
|
||||
"This section lists problems that developers often encounter when using the text detection model of PaddleOCR, and gives corresponding solutions or suggestions.\n",
|
||||
"\n",
|
||||
"The FAQ is divided into two parts:\n",
|
||||
"- Questions about text detection training\n",
|
||||
"- Questions about text detection prediction"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1. FAQs related to text detection training\n",
|
||||
"\n",
|
||||
"**1.1 What are the text detection algorithms provided by PaddleOCR?**\n",
|
||||
"\n",
|
||||
"**A**: PaddleOCR provides several text detection models, including the regression-based methods EAST and SAST and the segmentation-based methods DB and PSENet.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.2 What datasets are used for the Chinese ultra-lightweight and general models in the PaddleOCR project? How many samples were used, what was the GPU configuration, how many epochs were run, and how long did training take?**\n",
|
||||
"\n",
|
||||
"**A**: For the ultra-lightweight DB detection model, the training data includes the open-source datasets LSVT, RCTW, CASIA, CCPD, MSRA, MLT, BornDigit, iFLYTEK, and SROIE, plus synthetic data, about 100,000 samples in total. The data is divided into five parts, and a random sampling strategy is used during training. Training ran for about 500 epochs on four V100 GPUs and took 3 days.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.3 Does the text detection training label need specific text annotation? What does the \"###\" in the label mean?**\n",
|
||||
"\n",
|
||||
"**A**: Text detection training only needs the coordinates of the text regions. Each annotation can have four or fourteen points, listed in the order top left, top right, bottom right, bottom left. The label files provided by PaddleOCR contain a text field; if the text in a region is unclear, ### is used in its place. The text field is not used when training the detection model.\n",
|
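The answer above can be illustrated with a small parsing sketch. It assumes a label line made of an image path, a tab, and a JSON list whose items carry `transcription` and `points` keys; that layout, the sample file name, and the helper name `parse_det_label` are assumptions for this example rather than something this FAQ specifies, so check them against your own label files.

```python
import json


def parse_det_label(line):
    """Split one label line into the image path and a list of boxes.

    Assumed layout: "<image path>\t<JSON list of annotations>".
    """
    image_path, annotations = line.rstrip("\n").split("\t", 1)
    boxes = []
    for item in json.loads(annotations):
        boxes.append({
            "points": item["points"],
            # "###" marks text that is too unclear to transcribe.
            "unreadable": item.get("transcription") == "###",
        })
    return image_path, boxes


# Hypothetical sample line with one readable box and one "###" box.
sample = (
    'imgs/demo.jpg\t'
    '[{"transcription": "PaddleOCR", "points": [[10, 10], [110, 10], [110, 40], [10, 40]]}, '
    '{"transcription": "###", "points": [[10, 60], [90, 60], [90, 85], [10, 85]]}]'
)
path, boxes = parse_det_label(sample)
print(path, len(boxes), [b["unreadable"] for b in boxes])
```

Only `points` matters for detection training in this sketch; the `unreadable` flag is kept just to show where ### appears.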
||||
" \n",
|
||||
"**1.4 Why does the trained text detection model perform poorly when text lines are close together?**\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**A**: When using a segmentation-based method such as DB to detect dense text lines, it is best to collect a batch of such data for training and to reduce [shrink_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/ppocr/data/imaug/make_shrink_map.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L37), the parameter used to generate the binary map during training. In addition, [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L59) can be reduced appropriately at prediction time; the larger the unclip_ratio value, the larger the detection box.\n",
|
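To see why these two ratios matter for dense text, it helps to look at the offset applied to a text polygon. The sketch below assumes the formulas given in the DB paper: the training label is shrunk by D = A(1 - r^2)/L and the predicted region is expanded by D' = A'r'/L', where A is the polygon area and L its perimeter. The function names and the sample box are made up for illustration, and the actual PaddleOCR code may differ in detail.

```python
import math


def polygon_area(points):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of the polygon."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def shrink_distance(points, shrink_ratio):
    """Offset used to shrink a ground-truth box when building the training map."""
    return polygon_area(points) * (1 - shrink_ratio ** 2) / polygon_perimeter(points)


def unclip_distance(points, unclip_ratio):
    """Offset used to expand a predicted region back to a full text box."""
    return polygon_area(points) * unclip_ratio / polygon_perimeter(points)


box = [(0, 0), (200, 0), (200, 20), (0, 20)]  # a 200 x 20 text line
for ratio in (1.5, 2.0):
    print("unclip_ratio", ratio, "-> expand by", round(unclip_distance(box, ratio), 2), "px")
for ratio in (0.4, 0.6):
    print("shrink_ratio", ratio, "-> shrink by", round(shrink_distance(box, ratio), 2), "px")
```

Under these assumptions, a smaller shrink_ratio shrinks the training region more and a smaller unclip_ratio expands the predicted box less, which is the direction the answer recommends for tightly packed lines.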
||||
"\n",
|
||||
"\n",
|
||||
"**1.5 DB misses many detections on some large document images. How can this be avoided?**\n",
|
||||
"\n",
|
||||
"**A**: First, determine whether the misses come from an under-trained model or from how the image is handled at prediction time. If the model is under-trained, add more training data or more data augmentation during training.\n",
|
||||
"\n",
|
||||
"If the input image is too large, you can increase [det_limit_side_len](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L47), the parameter that limits the longest side of the image at prediction time; its default is 960.\n",
|
||||
"\n",
|
||||
"Second, visualize the segmentation map used in post-processing and check whether the missed text has a segmentation result. If it has none, the model is under-trained. If the region is fully segmented, the problem lies in prediction post-processing, and it is recommended to adjust the [DB post-processing parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L51-L53).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.6 Why does the DB model miss curved text (for example, in slightly deformed document images)?**\n",
|
||||
"\n",
|
||||
"**A**: When DB post-processing computes the average score of a text box, it averages over the bounding rectangle, which easily causes curved text to be missed. Averaging over the polygon region has since been added; it is more accurate but slower, so choose according to your needs. You can see a [visual comparison](https://github.com/PaddlePaddle/PaddleOCR/pull/2604) in the related PR. The mode is selected with the [det_db_score_mode](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/tools/infer/utility.py#L51) parameter, which accepts `fast` (default) or `slow`: `fast` is the original rectangle mode and `slow` is the polygon mode. Thanks to user [buptlihang](https://github.com/buptlihang) for the [PR](https://github.com/PaddlePaddle/PaddleOCR/pull/2574) that helped solve this problem.\n",
|
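The difference between the two score modes can be reproduced on a toy score map. The following is a simplified pure-Python re-creation of the idea, not PaddleOCR's implementation: the helper names, the slanted sample region, and the 0.6 threshold are all assumptions chosen for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon [(x, y), ...]?"""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def box_scores(score_map, polygon):
    """Return (rectangle_mean, polygon_mean) of the scores under a polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    rect_values, poly_values = [], []
    for row in range(int(min(ys)), int(max(ys))):
        for col in range(int(min(xs)), int(max(xs))):
            value = score_map[row][col]
            rect_values.append(value)  # every pixel of the bounding rectangle
            if point_in_polygon(col + 0.5, row + 0.5, polygon):
                poly_values.append(value)  # only pixels whose center is inside
    return sum(rect_values) / len(rect_values), sum(poly_values) / len(poly_values)


# A slanted text region: high scores inside the parallelogram, low scores outside.
region = [(0, 0), (4, 0), (8, 6), (4, 6)]
score_map = [
    [0.9 if point_in_polygon(col + 0.5, row + 0.5, region) else 0.1 for col in range(8)]
    for row in range(6)
]

rect_mean, poly_mean = box_scores(score_map, region)
box_thresh = 0.6  # assumed threshold, for illustration only
print("rectangle mean:", round(rect_mean, 3), "kept:", rect_mean >= box_thresh)
print("polygon mean:  ", round(poly_mean, 3), "kept:", poly_mean >= box_thresh)
```

In this toy case the rectangle average is pulled down by background pixels and falls below the threshold, while the polygon average does not, which mirrors why the polygon mode recovers curved or slanted text at the cost of extra computation.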
||||
"\n",
|
||||
"\n",
|
||||
"**1.7 Roughly how much data needs to be prepared for an OCR task with low accuracy requirements?**\n",
|
||||
"\n",
|
||||
"**A**: (1) The amount of training data depends on the complexity of the problem. The harder the task and the higher the accuracy requirement, the more data is needed; in practice, more training data generally gives better results.\n",
|
||||
"\n",
|
||||
"(2) For scenarios with low accuracy requirements, detection and recognition need different amounts of data. For detection, 500 images are enough for a basic result. For recognition, every character in the recognition dictionary should appear in more than 200 text images across different scenes (for example, if the dictionary has 5 characters and each must appear in more than 200 images, the minimum number of images is between 200 and 1,000), which ensures a basic recognition result.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.8 How can more data be obtained when there is little training data?**\n",
|
||||
"\n",
|
||||
"**A**: When training data is scarce, you can try the following three ways to obtain more:\n",
|
||||
"\n",
|
||||
"(1) Manually collect more training data. This is the most direct and effective way.\n",
|
||||
"\n",
|
||||
"(2) Apply basic image processing or transformations with PIL and OpenCV. For example, use the ImageFont, Image, and ImageDraw modules in PIL to write text onto a background, or use OpenCV for rotation, affine transformation, Gaussian filtering, and so on.\n",
|
||||
"\n",
|
||||
"(3) Synthesize data with a data generation algorithm, such as pix2pix.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.9 How do I replace the backbone for text detection or recognition?**\n",
|
||||
"\n",
|
||||
"**A**: For both text detection and text recognition, the choice of backbone is a trade-off between accuracy and speed. A larger backbone, such as ResNet101_vd, generally gives more accurate detection or recognition but takes longer to predict. A smaller backbone, such as MobileNetV3_small_x0_35, predicts faster, but accuracy drops considerably. Fortunately, how well a backbone performs on detection or recognition is positively correlated with how well it performs on the ImageNet 1000-class image classification task. PaddleClas, the PaddlePaddle image classification suite, covers 23 series of classification networks, including ResNet_vd, Res2Net, HRNet, MobileNetV3, and GhostNet, and lists their top-1 accuracy on that task, their prediction time on GPU (V100 and T4) and CPU (Snapdragon 855), and download links for the corresponding 117 pre-trained models.\n",
|
||||
"\n",
|
||||
"(1) Replacing the text detection backbone mainly requires identifying four stages similar to those of ResNet, so that the subsequent FPN-like detection head can be attached. In addition, for text detection, starting from a classification model pre-trained on ImageNet speeds up convergence and improves results.\n",
|
||||
"\n",
|
||||
"(2) When replacing the text recognition backbone, pay attention to where the network applies stride along the width and the height. Because text recognition inputs usually have a large width-to-height ratio, the two dimensions are not downsampled the same number of times. You can refer to the changes made to the [MobileNetV3 backbone in PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/ppocr/modeling/backbones/rec_mobilenet_v3.py).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.10 How do I fine-tune the detection model with a small learning rate for some layers, or freeze the earlier layers or certain layers?**\n",
|
||||
"\n",
|
||||
"**A**: To freeze some layers, set the stop_gradient attribute of a variable to True; the parameters used to compute that variable will then not be updated. Refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#id4\n",
|
||||
"\n",
|
||||
"Using a smaller learning rate for some layers is not very convenient in static graph mode. One option is to set a fixed learning rate in the weight attribute when initializing the parameters. Refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/fluid/param_attr/ParamAttr_cn.html#paramattr\n",
|
||||
"\n",
|
||||
"In practice, our experiments show that directly loading the model and fine-tuning it, without setting different learning rates for different layers, also works well.\n",
|
||||
"\n",
|
||||
"**1.11 Why does DB preprocessing resize the height and width of the image to multiples of 32?**\n",
|
||||
"\n",
|
||||
"**A**: This follows from the network's downsampling factor. Taking the ResNet backbone used for detection as an example, the input image goes through five 2x downsampling steps, 32x in total. It is therefore recommended that the input image size be a multiple of 32.\n",
|
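The arithmetic behind this answer can be sketched in a few lines. The helper names below are hypothetical, and the rounding rule (cap the longest side at 960, then round each side to the nearest multiple of 32) is an assumption made for illustration; the actual preprocessing in PaddleOCR may round differently.

```python
def det_input_shape(height, width, limit_side_len=960, stride=32):
    """Cap the longest side, then round both sides to a multiple of `stride`."""
    ratio = min(1.0, limit_side_len / max(height, width))
    new_height = max(int(round(height * ratio / stride)) * stride, stride)
    new_width = max(int(round(width * ratio / stride)) * stride, stride)
    return new_height, new_width


def halving_steps(size, steps=5):
    """Sizes after each 2x downsampling; None marks a step that does not divide evenly."""
    sizes = []
    for _ in range(steps):
        if size % 2:
            sizes.append(None)
            break
        size //= 2
        sizes.append(size)
    return sizes


print(det_input_shape(1080, 1920))  # both sides become multiples of 32
print(halving_steps(960))           # five clean halvings
print(halving_steps(1000))          # stops early: 125 cannot be halved evenly
```

A side length that is a multiple of 32 survives all five halvings with an integer size, so feature maps from different stages line up when they are later merged.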
||||
"\n",
|
||||
"\n",
|
||||
"**1.12 In the PP-OCR series of models, why does the text detection backbone not use the SE block?**\n",
|
||||
"\n",
|
||||
"**A**: The SE module is an important part of the MobileNetV3 network. It estimates the importance of each feature channel of the feature map and weights each channel accordingly, which improves the network's representational power. For text detection, however, the input resolution is relatively large, generally 640\\*640, which makes it hard for the SE module to estimate channel importance, so the gain is limited while the module remains time-consuming. The text detection backbone in PP-OCR therefore does not use the SE module. Experiments also show that removing it reduces the size of the ultra-lightweight model by 40% with almost no effect on detection accuracy. For details, see the PP-OCR technical report: https://arxiv.org/abs/2009.09941.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**1.13 PP-OCR detection results are poor. How can they be improved?**\n",
|
||||
"\n",
|
||||
"**A**: It depends on the specific problem:\n",
|
||||
"- If the detection results are unusable in your scenario, the first choice is to fine-tune on your own data;\n",
|
||||
"- If the image is large and the text is dense, avoid compressing the image too much. You can modify the resize logic in detection preprocessing to prevent over-compression;\n",
|
||||
"- If the detection box fits the text too tightly or is too large, adjust the db_unclip_ratio parameter: increasing it enlarges the detection box, and decreasing it shrinks the box;\n",
|
||||
"- If many boxes are missed, lower det_db_box_thresh, the threshold parameter in DB post-processing, so that fewer detection boxes are filtered out. You can also try setting det_db_score_mode to 'slow';\n",
|
||||
"- You can also set use_dilation to True, which dilates the feature map output by detection and generally improves results.\n",
|
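The effect of lowering det_db_box_thresh can be pictured with a toy filter. The candidate boxes, their scores, and the helper name `filter_boxes` are invented for this sketch, and the comparison rule (keep a box when its score is at least the threshold) is an assumption about how the filtering works.

```python
def filter_boxes(candidates, box_thresh):
    """Keep candidate boxes whose mean score reaches box_thresh."""
    return [c for c in candidates if c["score"] >= box_thresh]


# Hypothetical candidates produced by post-processing, with their mean scores.
candidates = [
    {"name": "clear line", "score": 0.92},
    {"name": "blurred line", "score": 0.55},
    {"name": "background noise", "score": 0.20},
]

for thresh in (0.6, 0.5):
    kept = [c["name"] for c in filter_boxes(candidates, thresh)]
    print("box_thresh", thresh, "->", kept)
```

Lowering the threshold recovers the blurred line here, but a threshold that is too low would also let background noise through, so it is worth adjusting in small steps.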
||||
"\n",
|
||||
"\n",
|
||||
"## 2. FAQs related to text detection prediction\n",
|
||||
"\n",
|
||||
"**2.1 Some DB boxes fit the text too tightly and cut off edges and corners of the text, which hurts recognition. Is there a way to alleviate this?**\n",
|
||||
"\n",
|
||||
"**A**: You can increase the post-processing parameter [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/d80afce9b51f09fd3d90e539c40eba8eb5e50dd6/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L52) appropriately. The larger the parameter, the larger the text box.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**2.2 Why does PaddleOCR detection prediction only support testing one image at a time (test_batch_size_per_card=1)?**\n",
|
||||
"\n",
|
||||
"**A**: During prediction, each image is scaled proportionally so that its longest side is 960. Different images end up with different heights and widths after this scaling, so they cannot form a batch, and test_batch_size is therefore set to 1.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**2.3 How can PaddleOCR text detection prediction be accelerated on CPU?**\n",
|
||||
"\n",
|
||||
"**A**: On x86 CPUs, prediction can be accelerated with MKL-DNN (oneDNN): on CPUs that support it, turn on the [enable_mkldnn](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L105) parameter. In addition, increasing [num_threads](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L106), the number of threads used for prediction on the CPU, effectively speeds up prediction.\n",
|
||||
"\n",
|
||||
"**2.4 How can PaddleOCR text detection prediction be accelerated on GPU?**\n",
|
||||
"\n",
|
||||
"**A**: TensorRT is recommended for accelerating GPU prediction.\n",
|
||||
"- 1. Download a Paddle installation package or inference library built with TensorRT from this [link](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html).\n",
|
||||
"- 2. Download TensorRT from NVIDIA's official website. Make sure the downloaded TensorRT version matches the TensorRT version that the Paddle installation package was compiled with.\n",
|
||||
"- 3. Set the environment variable LD_LIBRARY_PATH so that it includes the lib folder of TensorRT:\n",
|
||||
"```\n",
|
||||
"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>\n",
|
||||
"```\n",
|
||||
"- 4. Enable the [TensorRT option](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L38) for PaddleOCR prediction.\n",
|
||||
"\n",
|
||||
"**2.5 How to deploy the PaddleOCR model on the mobile terminal?**\n",
|
||||
"\n",
|
||||
"**A**: PaddlePaddle provides [Paddle Lite](https://github.com/PaddlePaddle/Paddle-Lite), a tool designed for mobile deployment. In addition, PaddleOCR provides Android ARM deployment code that uses DB + CRNN as a demo. Refer to this [link](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/deploy/lite/readme.md).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**2.6 How do I use multi-process prediction in PaddleOCR?**\n",
|
||||
"\n",
|
||||
"**A**: PaddleOCR recently added [parameters that control multi-process prediction](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L111): `use_mp` indicates whether to use multiple processes, and `total_process_num` is the number of processes to use. See the [documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/inference.md#1-%E8%B6%85%E8%BD%BB%E9%87%8F%E4%B8%AD%E6%96%87ocr%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86) for usage details.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"**2.7 How can running out of GPU memory or memory leaks during prediction be handled?**\n",
|
||||
"\n",
|
||||
"**A**: When predicting with a training model, if the model or the input image is too large and GPU memory runs out, you can follow the code and add paddle.no_grad() before the main function runs to reduce GPU memory usage. If GPU memory consumption is too high when predicting with an inference model, you can add [config.enable_memory_optim()](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L267) when setting up the config to reduce memory consumption.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"In addition, if you run into a memory leak when predicting with Paddle, it is recommended to install the latest version of Paddle, in which the leak has been fixed."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.11"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|