{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Detection FAQ\n",
"\n",
"This section lists problems that developers often encounter when using the text detection model of PaddleOCR, and gives corresponding solutions or suggestions.\n",
"\n",
"The FAQ is divided into two parts:\n",
"- Text detection training related\n",
"- Text detection prediction correlation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. FAQs related to text detection training\n",
"\n",
"**1.1 What are the text detection algorithms provided by PaddleOCR?**\n",
"\n",
"**A**PaddleOCR contains a variety of text detection models, including regression based text detection methods East and SAST, and segmentation based text detection methods dB and psenet.\n",
"\n",
"\n",
"**1.2What data sets are used for the Chinese ultra lightweight and general model in the PaddleOCR project? How many samples are trained, what configuration is the GPU, how many epochs have been run, and how long?**\n",
"\n",
"**A**For the ultra lightweight DB detection model, the training data includes open source data sets LSVT, rctw, CASIA, CCPD, MsrA, MLT, borndigit, iFLYTEK, sroie and synthetic data sets. The total data volume is 10W, and the data set is divided into five parts. The random sampling strategy is adopted during training. About 500epoch is trained on 4-card v100gpu, which takes 3 days.\n",
"\n",
"\n",
"**1.3 Does the text detection training label need specific text annotation? What does the \"###\" in the label mean?**\n",
"\n",
"**A**Text detection training only needs the coordinates of the text area. The annotation can be four or fourteen points, arranged in the order of top left, top right, bottom right and bottom left. The label file provided by PaddleOCR contains text fields. If the text in the text area is not clear, it will be used ### instead. When training the detection model, the text field in the label will not be used.\n",
" \n",
"**1.4 When the text lines are close, the trained text detection model has poor effect?**\n",
"\n",
"\n",
"\n",
"**A**When using segmentation based methods, such as DB, to detect dense text lines, it is best to collect a batch of data for training, and reduce the parameters of generating binary images[shrink_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/ppocr/data/imaug/make_shrink_map.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L37) during training. In addition, during prediction, the parameter can be appropriately reduced[unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L59), unclip_ The larger the ratio parameter value, the larger the detection box.\n",
"\n",
"\n",
"**1.5 For some document images with large size, DB will have more missed detection. How to avoid this problem?**\n",
"\n",
"**A**First, it is necessary to determine whether the model is not well trained or handled during prediction. If the model is not well trained, it is recommended to add more data for training, or add more data enhancement during training.\n",
"\n",
"If the predicted image is too large, you can increase the longest edge setting parameter[det_limit_side_len](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L47) entered during prediction, which is 960 by default.\n",
"\n",
"Secondly, we can observe whether the missing text has segmentation results through the visual post-processing segmentation map. If there is no segmentation result, it indicates that the model is not well trained. If there is a complete divided area, it indicates that it is a problem of prediction post-processing, and it is recommended to adjust [DB post-processing parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L51-L53).\n",
"\n",
"\n",
"**1.6 Missed detection of curved text (such as slightly deformed document images) in DB model?**\n",
"\n",
"**A**:When calculating the average score of the text box in DB post-processing, the average score of the rectangle area is calculated, which is easy to cause missed detection of curved text. The average score of the polygon area has been added, which will be more accurate, but the speed will be reduced. You can select as needed. You can view the [visual comparison effect](https://github.com/PaddlePaddle/PaddleOCR/pull/2604) in the relevant pr. This function is selected through parameters [det_db_score_mode](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/tools/infer/utility.py#L51). The parameter values can be [` fast '(default), ` slow'], ` fast 'corresponds to the original rectangle mode, and ` slow' corresponds to polygon mode. Thank user [buptlihang](https://github.com/buptlihang) for [pr](https://github.com/PaddlePaddle/PaddleOCR/pull/2574) help in solving this problem.\n",
"\n",
"\n",
"**1.7 Simply, for OCR tasks with low accuracy requirements, how many pieces of dataset need to be prepared?**\n",
"\n",
"**A**(1) The amount of training data is related to the complexity of the problem to be solved. The greater the difficulty and the higher the accuracy requirements, the greater the demand for data sets, and in general, the more training data in practice, the better the effect.\n",
"\n",
"(2) For scenes with low accuracy requirements, the amount of data required for detection task and recognition task is different. For the detection task, 500 images can ensure the basic detection effect. For the recognition task, it is necessary to ensure that the number of text images of each character appearing in different scenes in the recognition dictionary needs to be greater than 200 (for example, if there are 5 words in the dictionary and each word needs to appear in more than 200 pictures, the minimum required number of images should be between 200-1000), so as to ensure the basic recognition effect.\n",
"\n",
"\n",
"**1.8 When the amount of training data is small, how to obtain more data?**\n",
"\n",
"**A**When the amount of training data is small, you can try the following three ways to obtain more data: \n",
"\n",
"(1) manually collecting more training data is the most direct and effective way.\n",
"\n",
"(2) Basic image processing or transformation based on PIL and OpenCV. For example, the three modules of imagefont, image and ImageDraw in pil write text into the background, opencv rotation, affine transformation, Gaussian filtering, etc. \n",
"\n",
"(3) Use data generation algorithms to synthesize data, such as pix2pix and other algorithms.\n",
"\n",
"\n",
"**1.9 How to replace the backbone of text detection / recognition?**\n",
"\n",
"**A**Whether it is text detection or text recognition, the choice of backbone network is the trade-off between prediction effect and prediction efficiency. Generally, choose a larger backbone network, such as ResNet101_vd, the detection or recognition is more accurate, but the prediction time will increase accordingly. Choose a smaller backbone network, such as MobileNetV3_small_x0_35, the prediction is faster, but the accuracy of detection or recognition will be greatly reduced. Fortunately, the detection or recognition effect of different backbone networks is positively related to the effect of image 1000 classification task in Imagenet dataset. Paddle image classification kit paddleclas summarizes ResNet_vd, ReS2Net, HRNet, MobileNetV3, GhostNet and other 23 series of classification network structures, in the top 1 recognition accuracy of the above image classification tasks, the prediction time of GPU (V100 and T4) and CPU (Xiaolong 855) and the corresponding 117 pre training model download addresses.\n",
"\n",
"(1) The replacement of text detection backbone network is mainly to determine four stages similar to RESNET, so as to facilitate the integration of subsequent FPN like detection heads. In addition, for the text detection problem, the classification pre training model trained by Imagenet can accelerate the convergence and improve the effect.\n",
"\n",
"(2) When replacing the backbone network of character recognition, we need to pay attention to the falling position of network width and height stripe. Due to the large proportion of width to height in text recognition, the height drop frequency is less and the width drop frequency is more. You can refer to the changes of [MobileNetV3backbone network in PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/ppocr/modeling/backbones/rec_mobilenet_v3.py)\n",
"\n",
"\n",
"**1.10 How to use a small learning rate for the detection model finetune, such as freezing the previous layers or some layers?**\n",
"\n",
"**A**If some layers are frozen, you can stop the variable_ Set the gradient property to true, so that all parameters before calculating this variable will not be updated. Refer to:https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#id4\n",
"\n",
"If you use a smaller learning rate for some layers, it is not very convenient in the static graph. One way is to set a fixed learning rate for the weight attribute during parameter initialization. Refer to:https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/fluid/param_attr/ParamAttr_cn.html#paramattr\n",
"\n",
"In fact, our experiments show that the effect is good when we directly load the model to fine tune without setting different learning rates of some layers.\n",
"\n",
"**1.11 Why should the preprocessing part of DB, the length and width of the picture be processed into a multiple of 32?**\n",
"\n",
"**A**It is related to the multiple of sampling under the network. Taking the RESNET backbone network under detection as an example, after the image is input into the network, it needs to undergo 5 times of 2x down sampling, a total of 32 times. Therefore, it is recommended that the input image size be a multiple of 32.\n",
"\n",
"\n",
"**1.12 In the model of PP-OCR series, why does the backbone network of text detection not use seblock?**\n",
"\n",
"**A**SE module is an important module of MobileNetV3 network. Its purpose is to estimate the importance of each feature channel of the feature map, assign weight to each feature of the feature map, and improve the expression ability of the network. However, for text detection, the resolution of the input network is relatively large, generally 640\\*640. It is difficult to estimate the importance of each feature channel of the feature map using the se module, and the network promotion capacity is limited, but this module is time-consuming. Therefore, in the PP-OCR system, the backbone network of text detection does not use the se module. Experiments also show that when the se module is removed, the size of the ultra lightweight model can be reduced by 40%, and the text detection effect is basically not affected. Please refer to PP-OCR technical articles for details,https://arxiv.org/abs/2009.09941.\n",
"\n",
"\n",
"**1.13 PP-OCR detection effect is not good, how to optimize it?**\n",
"\n",
"A Specific analysis of specific problems:\n",
"- If the detection effect is not available in your scene, the first choice is to do finetune training on your data;\n",
"- If the image is too large and the text is too dense, it is recommended not to over compress the image. You can try to modify the resize logic of detection preprocessing to prevent the image from being over compressed;\n",
"- The size of the detection box is too close to the text or the detection box is too large. You can adjust dB_unclip_ratio parameter, increasing the parameter can expand the detection frame, and decreasing the parameter can reduce the size of the detection frame;\n",
"- There are many missed detection problems in the detection frame, which can reduce the threshold parameter det of DB detection post-processing db_box_thresh to prevent some detection frames from being filtered out. You can also try to set det_db_score_mode is' slow ';\n",
"- Other methods can be used_ If the division is true, the feature map of the detection output will be expanded. Generally, the effect will be improved;\n",
"\n",
"\n",
"## 2. FAQs related to text detection and prediction\n",
"\n",
"**2.1 Some DB boxes are too text pasted, but some edges and corners of the text are removed, which affects the recognition. Is there any way to alleviate this problem?**\n",
"\n",
"**A**You can appropriately increase the post-processing parameters [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/d80afce9b51f09fd3d90e539c40eba8eb5e50dd6/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L52). The larger the parameter, the larger the text box.\n",
"\n",
"\n",
"**2.2 Why does PaddleOCR detection prediction only support one picture test? test_batch_size_per_card=1**\n",
"\n",
"**A**During prediction, the image is scaled in equal proportion, and the longest side is 960. The length and width of different images are inconsistent after scaling in equal proportion, so it cannot form a batch, so it is set to test_batch_size is 1.\n",
"\n",
"\n",
"**2.3 Accelerate the text detection model prediction of PaddleOCR on CPU?**\n",
"\n",
"**A**x86 CPU can be accelerated using mkldnn (onednn); Enable on CPUs that support mkldnn acceleration_ mkldnn [enable_mkldnn](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L105) parameter. In addition, increase the number of threads predicted to be used on the CPU [num_threads](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L106), which can effectively speed up the prediction speed on the CPU.\n",
"\n",
"**2.4 Accelerate the text detection model prediction of PaddleOCR on GPU?**\n",
"\n",
"**A**TensorRt is recommended for GPU acceleration prediction.\n",
"- 1.Download the paste installation package or prediction library with tensorrt from the [link](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html).\n",
"- 2.Download the TensorRT version from NVIDIA's official website. Note that the downloaded TensorRT version is consistent with the tensorrt version compiled in the paddle installation package.\n",
"- 3.Setting environment variable LD_LIBRARY_PATH, pointing to the Lib folder of TensorRT\n",
"```\n",
"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>\n",
"```\n",
"- 4.Enable PaddleOCR prediction [TensorRT option](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L38).\n",
"\n",
"**2.5 How to deploy the PaddleOCR model on the mobile terminal?**\n",
"\n",
"**A**: The propeller paddle has tools [PaddleLite](https://github.com/PaddlePaddle/Paddle-Lite) specifically for mobile deployment, In addition, PaddleOCR provides Android arm deployment code with DB + CRNN as demo. Refer to [link](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/deploy/lite/readme.md).\n",
"\n",
"\n",
"**2.6 How to use PaddleOCR multi process prediction?**\n",
"\n",
"**A**: Recently, PaddleOCR added [multi process predictive control parameter](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L111), ` use_ MP ` indicates whether to use multiple processes, ` total_ process_ Num ` indicates the number of processes when using multiple processes. Please refer to [document](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/inference.md#1-%E8%B6%85%E8%BD%BB%E9%87%8F%E4%B8%AD%E6%96%87ocr%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86) for specific usage.\n",
"\n",
"\n",
"**2.7 Video memory explosion and memory leakage during prediction?**\n",
"\n",
"**A**: For the prediction of the training model, if the model is too large or the input image is too large, resulting in insufficient video memory, you can refer to the code and add a pad before the main function runs no_ Grad(), which can reduce the occupation of video memory. If the consumption of video memory is too high when predicted by the information model, you can add [config. Enable_memory_optim()](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L267) to reduce the memory consumption when configuring config.\n",
"In addition, it is recommended to install the latest version of pad for memory leakage when using pad prediction. The memory leakage has been repaired.\n",
"\n",
"\n",
"In addition, it is recommended to install the latest version of pad for memory leakage when using pad prediction. The memory leakage has been repaired."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}