Step 6. Extract second level data

In this exercise, we extract data from the header and footer of the PDF and add it to each extracted record along with the input file name.  VBScript is used including a buffer.  The buffer is used because the footer data is at the end of the document but must be applied to all the preceding records.  The skill level for this exercise is intermediate

In VBScript a proprietary TextConverter method named IsDelimited is used to separate the patterned data from the additional data shown when Show All Lines is checked.  Show All Lines allows you to include lines, in your ETL project, that do not conform to the selected automatic template in the conversion.

Computer generated documents, subject to extraction transformation and loading (ETL), often contain multiple hierarchical levels of data.  A report with a title header at the top of each section is a typical example of hierarchical data in a computer generated document.  TextConverter's artificial intelligence (AI) extracts just one level of data per template automatically.  TextConverter 2.2.3.26 only allows the use of one template per project.

The range of files supported by TextConverter varies from well structured delimited files like CSV to tag based text like QuickBooks or MS Money to loosely structured heterogeneous input containing multiple level data. Use the Input pane to load a file or just drag and drop a text file from Windows Explorer into TextConverter. TextConverter 2.2 can extract, transform, and load (ETL) data from popular document formats including PDF, DOC, RTF, XLS, HTML, CSV and free formatted text files. 

TextConverter uses Artificial Intelligence (AI) to to extract and transform data from the input data source.  Sometimes a single file does not have enough information to allow the AI to set all the fields in the data dictionary correctly.  When this occurs, you can make changes to the fields in the output data dictionary or extract the data manually.

location:  InstallDir/Samples/TextConverter/Automatic Mode/2 With script/

project file:  InstallDir/Samples/TextConverter/Automatic Mode/2 With script/6 ExtractSecondLevel.ConverterX

input file:  InstallDir/Samples/TextConverter/Automatic Mode/dataquick.pdf

Step by step:

Section A

1.  Clear the project by pressing the New button () in the tool bar or by selecting "New" from the "File" menu.

2.  Set the "Open File As" option to "Pdf".

3.  Open the sample file provided with the software named "dataquick.pdf"

4.  Select "Generate Templates ..." from either the Options or Input Pane. TextConverter selects a template that best fits the input file's structure by default.

5.  Click the "Reset Output Dictionary" icon to create a linked input dictionary using the new template.

6.  In the options pane, check the box for "Show all lines".



Three additional lines will appear in the input preview area.  Two at the top (lines 2 and 3) and one at the bottom (line 24).  In the output preview, field 1 "Parcel" will become much wider.  If you scroll the output preview to the right, you will see that the rest of the output preview is still the same.

7.  We will create a brief VBScript that will keep the data from the header and footer available for our use but will not treat these lines as records.  The main distinction between the lines suppressed by default (including in this case the header and footer lines) and the rest of the document is that for the rest of the document TextConverter's AI was able to identify implied delimiters.  This script will separate the data that IsDelimited from the data that is not.  Another example of IsDelimited is provided on Working in Automatic Mode.  The script will use SkipRecord to keep the non-delimited records from the automated extraction.  We will extract these records manually.

a.  Click on the "Script" tab next to the "Preview/Options" tab about halfway down the left side of the user interface (UI).

b.  Declare a variable to hold the title that you will extract from the header by typing the following line:

Dim title

c.  Right click in the scripting area and choose "OnRecord" from the fly menu.  This will be pasted into the script area below your global variables:

Dim title

'----------------- OnRecord -----------------
Function OnRecord
End Function

d.  After "Function OnRecord" hit return then copy and paste the script below for dividing the delimited data from the non-delimited data:

Dim title

'----------------- OnRecord -----------------
Function OnRecord
If this.IsDelimited() then
Else
    this.SkipRecord
End If
End Function


e.  Click the "Check Syntax" button in the lower right corner to check for errors in syntax.  Click the preview tab to see that the non-delimited records are no longer shown.

8.  Create output fields to hold the second level of data.

a.  Go to the output dictionary (upper right pane) and scroll down the list to the bottom.

b.  Right click in the space below the last field and select "New Field" form the fly menu.

c.  Name the first new field "PDF_Title", set the type to "string", and set the width to 60.

d.  Create a second new field and name it "PDF_date", and set the width to 20.

e.  Create a third new field and name it "PDF_filename", and set the width to 200. 

Note:  In the steps above, feel free to choose names and widths for the output dictionary that fit your particular database.

f.  Right click on all three new fields (use ctrl+click or shift+click) and choose "Create script conversion" from the fly menu.  The script tab will open and the following text will be inserted automatically just above "End Function":

Dim title

'----------------- OnRecord -----------------
Function OnRecord
If this.IsDelimited() then
Else
    this.SkipRecord
End If
DictOut.PDF_Title.value = value
DictOut.PDF_date.value = value
DictOut.PDF_filename.value = value
End Function


9.  Move these three lines up under "If this.IsDelimited() then" using ctrl+x to cut and ctrl+v to paste.  Replace the place holder "value" with the variable "title" for the first output field.  For the "PDF_filename" field, use the TextConverter method "this.GetInputFile ()" to aquire the name of the input data source file.   The script will now look like this:

Dim title

'----------------- OnRecord -----------------
Function OnRecord
If this.IsDelimited() then
    DictOut.Title.value = title
    DictOut.PDF_date.value = value
    DictOut.PDF_filename.value = this.GetInputFile()
Else
    this.SkipRecord
End If
End Function

a.  Get the data for the "PDF_Title" field by trimming line 2 (record 2) using the Data Object Method this.GetRecordNumber.  Add this line of VB Script between "Else" and "this.SkipRecord"

Dim title

'----------------- OnRecord -----------------
Function OnRecord
If this.IsDelimited() then
    DictOut.Title.value = title
    DictOut.PDF_date.value = pdf_date
    DictOut.PDF_filename.value = this.GetInputFile()
Else
    If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value)
    this.SkipRecord
End If
End Function

10.  Now we have everything we need except for the file date.  The file date in the footer occurs in the last record.  We need to add the date to all of the records.  To finish this section we will get the input data file's DateLastModified from VBScript's built in FileSystemObject.  In Section B (below) we will get the date from the file footer and add it to each record.

a.  Add two new variables  filesys for the FileSystemObject and demofile for the input data source file name.
b.  Add the following code to use the GetFile method of the FileSystemObject object to create a File object and pass the DateLastModified value into the output field.

Dim title, filesys, demofile

'----------------- OnRecord -----------------
Function OnRecord
If this.IsDelimited() then
DictOut.PDF_Title.value = title
Set filesys = CreateObject("Scripting.FileSystemObject")
Set demofile = filesys.GetFile(this.GetInputFile())
DictOut.PDF_date.value = demofile.DateLastModified
DictOut.PDF_filename.value = this.GetInputFile()
Else
If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value)
this.SkipRecord
End If
End Function

Click the "Check Syntax" button in the lower right corner to check for errors in syntax.  Click the preview tab to see the results.

11.  Now you are ready to connect the output data source and run the extraction process.

Save your project - TextConverter stores all of the current settings in a project file.  To save your project, choose "Save Project" form the file menu or click the disc icon () on the toolbar.  Elements of a project include:

  • path to the input data source
  • path to the output data source
  • paths to all other files and databases used in the project
  • the complete script
  • the mapping of the input dictionary to the output dictionary
  • All output dictionary settings
  • All "options" settings
  • Any other ETL settings
  • The workspace layout is NOT saved as part of the project but is instead retained with TextConverter

Section B


Start from the end of Step 7 in section A (above).

Extracting the date from the footer seems fairly straight forward.  Why not do it the same way that we extracted the title in section A (above)?  The reason we can't just parse the data and feed it to the output field object is that the date value occurs in the very last record.  By the time you extract the date value, all of the records have already been processed.  In order to extract data from the end of a file that will be applied to all of the preceding records, we will put all of the delimited records into a buffer with TextConverter's method "this.buffer.AddRecord" function and then process them at the end of the run with the "OnFinishProcess" function. 

1.  Here we get all the data we need ready.  We store the delimited records in a buffer and the extra data in variables.

a.  Declare a global variable pdf_date to hold the date that you will extract from the footer - Line 1
b.  Declare a local x to look for the date, then use VBScript's built in Trim, Mid, and InStr to grab the data for testing from the relevant part of the footer - Line 10
c.  Use VBScript's IsDate to test x for datehood and update the pdf_date variable as applicable - Line 11
d.  Use TextConverter's Buffer method to put all of the records (other than the header and the footer) into a buffer - Line 7

2.  Whan all of the records have been processed we unload the buffer and the variables into the output fields.

a.  Declare two local variables, i to cycle through the records, and j to cycle through all of the fields. - Line 19
b.  Create a nested set of for loops.  The outer loop (Line 21, 31) cycles through and records and the inner loop (Line 24, 26) cycles through the fields.
c.  In the outer loop, update the three new fields with the data from the non-delimited header and footer records - Lines 27-29

3.  Here is the completed script with all of the comments:

Before you use the script below for cut and paste, you will need to remove the line numbers.

1  Dim title, pdf_date 'declare the global variables for the new fields
2
3  '----------------- OnRecord -----------------
4  Function OnRecord
5   Dim x 'declare local variable to look for the date
6  If this.IsDelimited() then 'test for delimited record
7      this.buffer.AddRecord 'put the delimited records in a buffer
8  Else
9      If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value) 'Grab the title from record 2
10     x = Trim(Mid(DictIn.Parcel.value, InStr(DictIn.Parcel.value, " "), 13)) 'Grab data for x
11     if IsDate(x) Then pdf_date = x 'if x is a date feed it to pdf_date
12 End If
13 this.SkipRecord 'but don't do anything yet
14 End Function
15
16
17 '----------------- OnFinishProcess -----------------
18 Function OnFinishProcess 'when you have looked at all the records
19 Dim i, j 'declare two index variables i to count records and j to count fields
20
21 For i = 0 To this.GetBufferCount()-1 'for loop to process all the records
22     this.AppendStart 'Prepares a record set for insertion of a new record
23     this.buffer.GetRecord(i) 'grab each record
24     For j = 1 to DictIn.GetFieldCount() 'for loop to process all fields
25        DictOut.SetFieldValue j, DictIn.GetFieldValue(j) 'move data from input to output
26     Next 'end of field processing for loop
27     DictOut.PDF_Title.value = title 'put title into PDF_Title field
28     DictOut.PDF_date.value = pdf_date 'put pdf_date into PDF_Date field
29     DictOut.PDF_filename.value = this.GetInputFile() 'put file name into PDF_filename field
30     this.AppendRecord
31 Next
32 End Function

Click the "Check Syntax" button in the lower right corner to check for errors in syntax.  Click the preview tab to see the results.

4.  Now you are ready to connect the output data source and run the extraction process.

Save your project - TextConverter stores all of the current settings in a project file.  To save your project, choose "Save Project" form the file menu or click the disc icon () on the toolbar.  Elements of a project include:

  • path to the input data source
  • path to the output data source
  • paths to all other files and databases used in the project
  • the complete script
  • the mapping of the input dictionary to the output dictionary
  • All output dictionary settings
  • All "options" settings
  • Any other ETL settings
  • The workspace layout is NOT saved as part of the project but is instead retained with TextConverter
Learn more with Step 7. Transform data with a look up
(or go back to Step 5. Split one field into three)

Step 6 Section B Video


Step 6 Section A Video


Comments