Step 1.1 Auto Extract PDF

May 2009
Tip of the month:  How to use automatic extraction to load PDF data to a database



Visit SiMX.com to download and install TextConverter.  TextConverter can perform Extraction, Transformation and Loading (ETL) from popular document formats including PDF, DOC, RTF, XLS, HTML, CSV and TXT.  It uses Artificial Intelligence (AI) to extract and transform data from the input data source.
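For readers new to ETL, the three stages can be sketched in a few lines of Python.  This is only an illustration of the general idea, not how TextConverter works internally: the sample text, the row pattern, the "invoices" table and all function names are assumptions made up for this example, and the text is assumed to have already been pulled out of a PDF.

```python
import re
import sqlite3

# Hypothetical text, as it might look after being pulled out of a PDF page.
SAMPLE_TEXT = """\
Invoice  Date        Amount
1001     2009-05-01  250.00
1002     2009-05-03   75.50
"""

# Assumed row layout: invoice number, ISO date, amount.
ROW = re.compile(r"^(\d+)\s+(\d{4}-\d{2}-\d{2})\s+([\d.]+)$")


def extract(text):
    """Extract: keep only the lines that match the row pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groups())
    return rows


def transform(rows):
    """Transform: convert the captured strings into typed values."""
    return [(int(number), date, float(amount)) for number, date, amount in rows]


def load(records, conn):
    """Load: write the typed records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(number INTEGER, date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text):
    """Run all three stages against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling run_etl(SAMPLE_TEXT) skips the header line, because it does not match the row pattern, and loads the two data rows.  TextConverter's templates take the place of the hand-written pattern in this sketch.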

Use either of the sample input PDFs provided.  Set "Open File As" to "Pdf".  Load the sample file by dragging and dropping it into TextConverter.  Under Templates (in either the Options Pane or the Input Pane), select "Generate templates...".  The template that best suits the input data will be chosen by default.

In the example provided, the pattern was easily recognized by TextConverter's AI.  Sometimes a single file does not have enough information to allow the AI to set all the fields in the data dictionary correctly.  When this occurs, you can make changes to the fields in the output data dictionary, or extract all or any portion of the data manually.

Each time you change a setting in the conversion options, the input data source will reload, but the output dictionary is NOT reset.  To reset the output dictionary, click the reset dictionary icon on the toolbar.
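The difference between reloading and resetting can be pictured with a small model.  This is a hypothetical sketch of the behaviour described above, not TextConverter's API: the Session class, its attributes and the option name are invented for illustration.

```python
class Session:
    """Hypothetical model of a conversion session (not TextConverter's API)."""

    def __init__(self, default_dictionary):
        # Remember the generated defaults so they can be restored later.
        self._defaults = dict(default_dictionary)
        self.output_dictionary = dict(default_dictionary)
        self.options = {}
        self.reload_count = 0

    def set_option(self, name, value):
        """Changing an option reloads the input; the dictionary is untouched."""
        self.options[name] = value
        self.reload_count += 1

    def reset_dictionary(self):
        """Explicit reset: discard manual edits and restore the defaults."""
        self.output_dictionary = dict(self._defaults)
```

In this model, a manual edit to output_dictionary survives any number of set_option calls and is discarded only by reset_dictionary, which mirrors the role of the reset dictionary icon.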

Save your project - TextConverter stores all of the current settings in a project file.  To save your project, choose "Save Project" from the File menu or click the disc icon on the toolbar.  Elements of a project include:

  • path to the input data source
  • path to the output data source
  • paths to all other files and databases used in the project
  • the complete script
  • the mapping of the input dictionary to the output dictionary
  • all output dictionary settings
  • all "options" settings
  • any other ETL settings

Note: the workspace layout is NOT saved as part of the project; it is instead retained with Windows.
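
The project elements listed above can be pictured as a single record that is written out and read back as a unit.  The sketch below is an assumption for illustration only: the actual TextConverter project file format is not described here, and the field names and the use of JSON are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Project:
    """Hypothetical record of what a project stores (field names are assumed)."""

    input_path: str
    output_path: str
    other_paths: list = field(default_factory=list)
    script: str = ""
    field_mapping: dict = field(default_factory=dict)  # input field -> output field
    output_dictionary: dict = field(default_factory=dict)
    options: dict = field(default_factory=dict)
    etl_settings: dict = field(default_factory=dict)
    # The workspace layout is deliberately absent: it is not part of a project.


def save_project(project):
    """Serialize every project element to one JSON string."""
    return json.dumps(asdict(project), indent=2)


def load_project(text):
    """Rebuild the project from its serialized form."""
    return Project(**json.loads(text))
```

Saving and then loading a Project returns an equal record, which is the practical point of a project file: reopening it restores the paths, script, mapping and settings together.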
This is a beginner-level tutorial.  Many more tutorials and samples are available at help.SiMX.com.