In this exercise, we extract data from the header and footer of the PDF and add it to each extracted record along with the input file name. VBScript is used including a buffer. The buffer is used because the footer data is at the end of the document but must be applied to all the preceding records. The skill level for this exercise is intermediate. In VBScript a proprietary TextConverter method named IsDelimited is used to separate the patterned data from the additional data shown when Show All Lines is checked. Show All Lines allows you to include lines, in your ETL project, that do not conform to the selected automatic template in the conversion. Computer generated documents, subject to extraction transformation and loading (ETL), often contain multiple hierarchical levels of data. A report with a title header at the top of each section is a typical example of hierarchical data in a computer generated document. TextConverter's artificial intelligence (AI) extracts just one level of data per template automatically. TextConverter 2.2.3.26 only allows the use of one template per project. The range of files supported by TextConverter varies from well structured delimited files like CSV to tag based text like QuickBooks or MS Money to loosely structured heterogeneous input containing multiple level data. Use the Input pane to load a file or just drag and drop a text file from Windows Explorer into TextConverter. TextConverter 2.2 can extract, transform, and load (ETL) data from popular document formats including PDF, DOC, RTF, XLS, HTML, CSV and free formatted text files. TextConverter uses Artificial Intelligence (AI) to to extract and transform data from the input data source. Sometimes a single file does not have enough information to allow the AI to set all the fields in the data dictionary correctly. When this occurs, you can make changes to the fields in the output data dictionary or extract the data manually. location: InstallDir/Samples/TextConverter/Automatic Mode/2 With script/ project file: InstallDir/Samples/TextConverter/Automatic Mode/2 With script/6 ExtractSecondLevel.ConverterX input file: InstallDir/Samples/TextConverter/Automatic Mode/dataquick.pdf Step by step: Section A 1. Clear the project by pressing the New button ( 2. Set the "Open File As" option to "Pdf". 3. Open the sample file provided with the software named "dataquick.pdf" 6. In the options pane, check the box for "Show all lines". ![]() Three additional lines will appear in the input preview area. Two at the top (lines 2 and 3) and one at the bottom (line 24). In the output preview, field 1 "Parcel" will become much wider. If you scroll the output preview to the right, you will see that the rest of the output preview is still the same.
a. Click on the "Script" tab next to the "Preview/Options" tab about halfway down the left side of the user interface (UI).
b. Declare a variable to hold the title that you will extract from the header by typing the following line: Dim title
c. Right click in the scripting area and choose "OnRecord" from the fly menu. This will be pasted into the script area below your global variables: Dim title
'----------------- OnRecord ----------------- Function OnRecord End Function d. After "Function OnRecord" hit return then copy and paste the script below for dividing the delimited data from the non-delimited data: Dim title
e. Click the "Check Syntax" button in the lower right corner to check for errors in syntax. Click the preview tab to see that the non-delimited records are no longer shown.'----------------- OnRecord ----------------- Function OnRecord If this.IsDelimited() then Else this.SkipRecord End If End Function 8. Create output fields to hold the second level of data. a. Go to the output dictionary (upper right pane) and scroll down the list to the bottom.
b. Right click in the space below the last field and select "New Field" form the fly menu. c. Name the first new field "PDF_Title", set the type to "string", and set the width to 60. d. Create a second new field and name it "PDF_date", and set the width to 20. e. Create a third new field and name it "PDF_filename", and set the width to 200. Note: In the steps above, feel free to choose names and widths for the output dictionary that fit your particular database.
f. Right click on all three new fields (use ctrl+click or shift+click) and choose "Create script conversion" from the fly menu. The script tab will open and the following text will be inserted automatically just above "End Function": Dim title
'----------------- OnRecord ----------------- Function OnRecord If this.IsDelimited() then Else this.SkipRecord End If DictOut.PDF_Title.value = value DictOut.PDF_date.value = value DictOut.PDF_filename.value = value End Function Dim title
'----------------- OnRecord ----------------- Function OnRecord If this.IsDelimited() then DictOut.Title.value = title DictOut.PDF_date.value = value DictOut.PDF_filename.value = this.GetInputFile() Else this.SkipRecord End If End Function a. Get the data for the "PDF_Title" field by trimming line 2 (record 2) using the Data Object Method this.GetRecordNumber. Add this line of VB Script between "Else" and "this.SkipRecord" Dim title
'----------------- OnRecord ----------------- Function OnRecord If this.IsDelimited() then DictOut.Title.value = title DictOut.PDF_date.value = pdf_date DictOut.PDF_filename.value = this.GetInputFile() Else If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value) this.SkipRecord End If End Function a. Add two new variables filesys for the FileSystemObject and demofile for the input data source file name.
b. Add the following code to use the GetFile method of the FileSystemObject object to create a File object and pass the DateLastModified value into the output field. Dim title, filesys, demofile
'----------------- OnRecord ----------------- Function OnRecord If this.IsDelimited() then DictOut.PDF_Title.value = title
ElseSet filesys = CreateObject("Scripting.FileSystemObject") Set demofile = filesys.GetFile(this.GetInputFile()) DictOut.PDF_date.value = demofile.DateLastModified DictOut.PDF_filename.value = this.GetInputFile() If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value)
End Ifthis.SkipRecord End Function Click the "Check Syntax" button in the lower right corner to check for errors in syntax. Click the preview tab to see the results.
11. Now you are ready to connect the output data source and run the extraction process. Save your project - TextConverter stores all of the current settings in a project file. To save your project, choose "Save Project" form the file menu or click the disc icon (
Section B Start from the end of Step 7 in section A (above). Extracting the date from the footer seems fairly straight forward. Why not do it the same way that we extracted the title in section A (above)? The reason we can't just parse the data and feed it to the output field object is that the date value occurs in the very last record. By the time you extract the date value, all of the records have already been processed. In order to extract data from the end of a file that will be applied to all of the preceding records, we will put all of the delimited records into a buffer with TextConverter's method "this.buffer.AddRecord" function and then process them at the end of the run with the "OnFinishProcess" function. 1. Here we get all the data we need ready. We store the delimited records in a buffer and the extra data in variables. a. Declare a global variable pdf_date to hold the date that you will extract from the footer - Line 1
b. Declare a local x to look for the date, then use VBScript's built in Trim, Mid, and InStr to grab the data for testing from the relevant part of the footer - Line 10 c. Use VBScript's IsDate to test x for datehood and update the pdf_date variable as applicable - Line 11 d. Use TextConverter's Buffer method to put all of the records (other than the header and the footer) into a buffer - Line 7 2. Whan all of the records have been processed we unload the buffer and the variables into the output fields. a. Declare two local variables, i to cycle through the records, and j to cycle through all of the fields. - Line 19
b. Create a nested set of for loops. The outer loop (Line 21, 31) cycles through and records and the inner loop (Line 24, 26) cycles through the fields. c. In the outer loop, update the three new fields with the data from the non-delimited header and footer records - Lines 27-29 3. Here is the completed script with all of the comments: Before you use the script below for cut and paste, you will need to remove the line numbers.
1 Dim title, pdf_date 'declare the global variables for the new fields
2
3 '----------------- OnRecord -----------------
4 Function OnRecord 5 Dim x 'declare local variable to look for the date 6 If this.IsDelimited() then 'test for delimited record 7 this.buffer.AddRecord 'put the delimited records in a buffer 8 Else 9 If this.GetRecordNumber() = 2 Then title = Trim(DictIn.Parcel.value) 'Grab the title from record 2 10 x = Trim(Mid(DictIn.Parcel.value, InStr(DictIn.Parcel.value, " "), 13)) 'Grab data for x 11 if IsDate(x) Then pdf_date = x 'if x is a date feed it to pdf_date 12 End If 13 this.SkipRecord 'but don't do anything yet 14 End Function 15 16 17 '----------------- OnFinishProcess ----------------- 18 Function OnFinishProcess 'when you have looked at all the records 19 Dim i, j 'declare two index variables i to count records and j to count fields 20 21 For i = 0 To this.GetBufferCount()-1 'for loop to process all the records 22 this.AppendStart 'Prepares a record set for insertion of a new record 23 this.buffer.GetRecord(i) 'grab each record 24 For j = 1 to DictIn.GetFieldCount() 'for loop to process all fields 25 DictOut.SetFieldValue j, DictIn.GetFieldValue(j) 'move data from input to output 26 Next 'end of field processing for loop 27 DictOut.PDF_Title.value = title 'put title into PDF_Title field 28 DictOut.PDF_date.value = pdf_date 'put pdf_date into PDF_Date field 29 DictOut.PDF_filename.value = this.GetInputFile() 'put file name into PDF_filename field 30 this.AppendRecord 31 Next 32 End Function Click the "Check Syntax" button in the lower right corner to check for errors in syntax. Click the preview tab to see the results.
4. Now you are ready to connect the output data source and run the extraction process.
Save your project - TextConverter stores all of the current settings in a project file. To save your project, choose "Save Project" form the file menu or click the disc icon (
|
