Zejiang Shen is currently a data science fellow at the Institute for Quantitative Social Science (IQSS) at Harvard University. He uses deep learning tools to distill important economic information from historical scans and analyze the extracted text. His research interests cover representation learning and computer vision.
Non-trivial layouts in book scans disrupt Optical Character Recognition (OCR) output by misjoining text from different sections. This limits the usability of documents extracted from historical scans, and of the humanities research that follows. Even when the text is recognized correctly, the sheer bulk of natural-language text prevents subsequent analysis without appropriate processing. We propose a framework that extracts structured information from such scans by combining computer vision and natural language processing approaches. With the help of our deep learning-based layout analysis method, the framework understands the complicated layouts in the books and merges the extracted characters correctly. Moreover, we use named entity recognition and dependency parsing to break down the original text into structured feature tables for further analysis. The framework is validated on a 1953 Japanese Who's Who book with high accuracy.
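To illustrate the final step of the pipeline, the sketch below turns one free-text biography entry into a row of a structured feature table. It is a toy stand-in, not the talk's actual method: the real framework uses named entity recognition and dependency parsing where this sketch uses hand-written regular expressions, and the entry text and field names here are invented for illustration.

```python
import re

def parse_entry(entry: str) -> dict:
    """Toy extractor: map one Who's Who-style entry to a feature-table row.

    A hypothetical example of the kind of structuring the framework performs;
    an NER model and dependency parser would replace these regexes.
    """
    name = entry.split(",")[0].strip()
    year = re.search(r"born (\d{4})", entry)
    place = re.search(r"born \d{4} in (\w+)", entry)
    title = re.search(r",\s*([^,]+)\.$", entry)
    return {
        "name": name,
        "birth_year": int(year.group(1)) if year else None,
        "birth_place": place.group(1) if place else None,
        "title": title.group(1) if title else None,
    }

# Hypothetical entry text, in the style of a biographical directory.
row = parse_entry("Taro Yamada, born 1900 in Tokyo, president of Example Bank.")
# row -> {"name": "Taro Yamada", "birth_year": 1900,
#         "birth_place": "Tokyo", "title": "president of Example Bank"}
```

Collecting one such dictionary per entry yields the structured feature table that downstream economic analysis can consume.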