An intelligent and unified text and non-text object extraction from PDF using Support Vector Machine
Abstract
E-books now play an important role in learning across all fields, whether they are read on personal computers, laptops, or mobile phones. Of the several e-book formats, PDF is the most widely used because it preserves the original layout of the document. Segmentation makes document content reusable, but existing systems segment only the textual content and ignore non-text elements such as graphics, tables, and images. In this work, text objects and non-text objects are extracted from a PDF document and segmented separately using Support Vector Machine (SVM) classifiers, so that the output consists of text objects and non-text objects as separate sets. The method uses a bottom-up approach to extract text lines and a top-down approach that splits the tree generated by Kruskal's algorithm into subgraphs according to the Euclidean distance between adjacent vertices. Text and non-text objects are then classified using SVM techniques: for each segmented text and non-text object, features of different dimensions are extracted for labeling. Several e-book PDF documents are tested, and sample input and output PDF documents are shown in the experimental results.
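The top-down splitting step can be illustrated with a minimal sketch. It assumes that each vertex is the 2D centroid of a page object and that a tree edge is cut when its Euclidean length exceeds a fixed threshold. The abstract states neither the vertex definition nor the cut criterion, so the names `kruskal_mst`, `split_mst`, and `max_gap` are hypothetical and the code is not the authors' implementation.

```python
import math
from itertools import combinations


class UnionFind:
    """Disjoint-set structure used by Kruskal's algorithm."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already connected; the edge would form a cycle
        self.parent[ra] = rb
        return True


def kruskal_mst(points):
    """Return the minimum spanning tree edges as (distance, i, j) tuples.

    Assumption: the graph is complete, with Euclidean edge weights.
    """
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    uf = UnionFind(len(points))
    return [(d, i, j) for d, i, j in edges if uf.union(i, j)]


def split_mst(points, max_gap):
    """Cut tree edges longer than max_gap and return groups of point indices."""
    uf = UnionFind(len(points))
    for d, i, j in kruskal_mst(points):
        if d <= max_gap:  # keep short edges, drop long ones
            uf.union(i, j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(uf.find(idx), []).append(idx)
    return sorted(groups.values())


# Hypothetical centroids: three objects close together, two farther away.
centroids = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
print(split_mst(centroids, max_gap=2.0))
```

Each resulting group is a candidate page segment. In a pipeline like the one described, a feature vector would then be computed per segment and passed to an SVM classifier to label it as text or non-text. A fixed `max_gap` is the simplest possible cut rule; a threshold derived from the distribution of edge lengths would be a natural alternative.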