Step 7.1. Extract data from web pages

In this exercise we extract data from a web page into a database with TextConverter. This specific example takes a website with a list of cities, zip codes, and counties and uses TextConverter to transform the information into the database found in Step 7. We will first open the web page through TextConverter and set the options to skip any unwanted introductory text. Then we will create the desired output fields in the Output Dictionary and link the fields to those in the input. The information from the website then becomes a functional database.
 
location: InstallDir/Samples/TextConverter/Automatic Mode/2 With script/
 
project file: InstallDir/Samples/TextConverter/Automatic Mode/2 With script/7.1 ZIPextract
 
input file: InstallDir/Samples/TextConverter/Automatic Mode/2 With script/california-county-lookup.html
 
Step by step:

1. Clear the project by pressing the "New" button () in the tool bar or by selecting "New" from the "File" menu.

2. Set the "Open File As" option to "HTML".

3. Load the sample input file provided with the software (california-county-lookup.html) by either dragging and dropping it into TextConverter, choosing "Load Input File" from the file menu, or clicking the input icon.
Note: Although we provide a web page for the purpose of this sample, any website can be used by simply saving the page to your hard drive (File -> Save As...) and then loading or dragging the file into TextConverter.
 
4. Once the file is loaded, scroll down the input text table until you reach the data that will be used (the cities, zip codes, and counties in the example file). Enter the line number preceding the desired data (line 60 in this exercise) in the Options Pane under "Number of Lines to Skip".
 
 
5. Separate the output by City, Zip, and County.
 
a. Right click under "Field_1" in the Output Dictionary and select "New Field" from the fly menu. Name this field "City." Do this twice more and name the fields "Zip" and "County."
 
b. Select the three new fields (SHIFT+click to select multiple fields), right click, and select "Create Script Conversion" from the fly menu. The Script tab will open and the following will be inserted automatically:

'----------------- OnRecord -----------------
Function OnRecord
DictOut.City.value = value
DictOut.Zip.value = value
DictOut.County.value = value
End Function

c. After the second line (Function OnRecord), enter the script highlighted below to determine the input field to be separated and to skip lines without information.
Note: This command allows headers like "City Zip County" (line 61) to stay visible in the input text table while being omitted in the exported database.

'----------------- OnRecord -----------------
Function OnRecord
Dim s
s=DictIn.Field_1.value
If Len(Trim(Left(s, 10))) = 0 Then this.SkipRecord(): Exit Function
DictOut.City.value = value
DictOut.ZIP.value = value
DictOut.County.value = value
End Function

d. Determine the starting positions and lengths of the three new output fields by entering the highlighted code in place of "value" on the right of the outputs' equal signs:

'----------------- OnRecord -----------------
Function OnRecord
Dim s
s=DictIn.Field_1.value
If Len(Trim(Left(s, 10))) = 0 Then this.SkipRecord(): Exit Function
DictOut.City.value = Trim(Mid(s, 1, 28))
DictOut.ZIP.value = Trim(Mid(s, 29, 6))
DictOut.County.value = Trim(Mid(s, 36))
End Function

e. Click the "Check syntax" button on the bottom right to check for any errors in the script's syntax. Click the Preview/Options tab to see your results.
 
 
6. Now you are ready to connect the output data source and run the extraction process.

Save your project - TextConverter stores all of the current settings in a project file.  To save your project, choose "Save Project" form the file menu or click the disc icon () on the toolbar.  Elements of a project include:

  • path to the input data source
  • path to the output data source
  • paths to all other files and databases used in the project
  • the complete script
  • the mapping of the Input Dictionary to the Output Dictionary
  • All Output Dictionary settings
  • All Options settings
  • Any other ETL settings
  • The workspace layout is NOT saved as part of the project but is retained with TextConverter
 
Comments