IJMLC 2019 Vol.9(5): 586-591 ISSN: 2010-3700
DOI: 10.18178/ijmlc.2019.9.5.844

Discovery of Structured Data Using Unsupervised Spatial Clustering and Human Supervision

Nikitha Rachapudi, Lakshmipathy Ganesh, Abinaya Sekar, Anand KS, and Rajkumar Sakthibalan

Abstract—Commercial data is commonly preserved in Portable Document Format (PDF) because it easily encapsulates multiple data formats. In this era of digitization, there is a need to capture and store this data in a structured format to facilitate access by automated B2B services and business intelligence. In this paper, we propose a framework that automates the discovery and extraction of tabular data by combining artificial and human intelligence. The framework uses clustering and heuristics to group the Cartesian locations of text and spaces on a page in order to identify a table. The user then validates the discovered table through a user interface designed for adjusting the detected boundaries, and the result is fed back to the layout knowledge repository. The extracted table data is output as JSON key-value pairs, which can then be loaded into any database. The framework thus provides enhanced accuracy and continuous human-assisted learning for the automated document digitization process. The knowledge repository is further used to train the machine to generate document templates for processing unseen documents. However, this paper concentrates on the discovery of structured data alone.
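As a rough illustration of the idea summarized in the abstract (grouping the page coordinates of text to find rows and columns, then emitting JSON key-value pairs), the following minimal sketch may help. It is not the authors' algorithm: it substitutes a simple gap-based one-dimensional clustering for whatever clustering and heuristics the paper uses, and every name in it (`cluster_1d`, `table_to_records`, the `x_gap` and `y_gap` thresholds, the sample `words`) is hypothetical. It also assumes that each word already comes with an (x, y) position, that y increases down the page, and that the first row holds the column headers.

```python
import json


def cluster_1d(values, gap):
    """Group 1-D values; a new cluster starts when the distance
    to the previous sorted value exceeds `gap` (assumed threshold)."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def label_of(value, clusters):
    """Return the index of the cluster whose range contains `value`."""
    for i, c in enumerate(clusters):
        if c[0] <= value <= c[-1]:
            return i
    raise ValueError(f"{value} not in any cluster")


def table_to_records(words, x_gap=20.0, y_gap=5.0):
    """Arrange positioned words into a grid and return one dict per
    body row, keyed by the header row (assumed to be the first row)."""
    cols = cluster_1d([w["x"] for w in words], x_gap)
    rows = cluster_1d([w["y"] for w in words], y_gap)
    grid = [[""] * len(cols) for _ in rows]
    # Fill cells in reading order so multi-word cells stay ordered.
    for w in sorted(words, key=lambda w: (w["y"], w["x"])):
        r = label_of(w["y"], rows)
        c = label_of(w["x"], cols)
        grid[r][c] = (grid[r][c] + " " + w["text"]).strip()
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


# Hypothetical word positions, as a PDF text extractor might report them.
words = [
    {"x": 50, "y": 100, "text": "Item"},
    {"x": 200, "y": 100, "text": "Qty"},
    {"x": 52, "y": 120, "text": "Widget"},
    {"x": 203, "y": 121, "text": "4"},
    {"x": 51, "y": 140, "text": "Gadget"},
    {"x": 201, "y": 139, "text": "7"},
]

records = table_to_records(words)
print(json.dumps(records, indent=2))
```

In this sketch the thresholds are fixed by hand; in the framework described above, the detected boundaries would instead be reviewed and adjusted by a user before being stored.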

Index Terms—Clustering algorithm, spatial analysis, PDF table extraction, heuristics, human interaction.

The authors are with CloudIX Inc, Suite 301, 15446, Bel Red Road, Redmond, WA 94052 USA (e-mail: {nikhitha, ganeshl, abinayas, anandks, rajs}@cloudix.io).


Cite: Nikitha Rachapudi, Lakshmipathy Ganesh, Abinaya Sekar, Anand KS, and Rajkumar Sakthibalan, "Discovery of Structured Data Using Unsupervised Spatial Clustering and Human Supervision," International Journal of Machine Learning and Computing, vol. 9, no. 5, pp. 586-591, 2019.

Copyright © 2019 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

 

General Information

  • E-ISSN: 2972-368X
  • Abbreviated Title: Int. J. Mach. Learn.
  • Frequency: Quarterly
  • DOI: 10.18178/IJML
  • Editor-in-Chief: Dr. Lin Huang
  • Executive Editor:  Ms. Cherry L. Chen
  • Abstracting/Indexing: Inspec (IET), Google Scholar, Crossref, ProQuest, Electronic Journals Library, CNKI.
  • E-mail: ijml@ejournal.net

