Step 5. Split one field into three

In this exercise we use a single automatically extracted input data field to populate three output data fields.  The example uses city, state, and ZIP code for the target extracted data to be separated. The techniques shown here are applicable to many other ETL tasks that require trimming, separating, or identifying data for extraction and loading to an output database or file.

The range of files supported by TextConverter varies from structured delimited files like CSV, to tag based text like QuickBooks or MS Money, to loosely structured heterogeneous input with multilevel data. Use the Input pane to load a file or drag and drop a text file from Windows Explorer into the TextConverter frame. TextConverter can perform ETL by extracting data from popular document formats including PDF, DOC/DOCX, RTF, XLS,/XLSX, HTML, CSV and free formatted text files. 

TextConverter uses Artificial Intelligence (AI) to to extract and transform data from the input data source.  Sometimes a single file does not have enough information to allow the AI to set all the fields in the data dictionary correctly.  When this occurs, you can make changes to the fields in the output data dictionary.

location:  InstallDir/Samples/TextConverter/Automatic Mode/

project file:  InstallDir/Samples/TextConverter/Automatic Mode/2 With script/5 SplitFields.ConverterX

1. Clear the project by pressing the New button () in the tool bar or by selecting "New" from the "File" menu.
 
2.  Make sure the "Open file as" option is set to "Pdf".
 
3.  Open the sample file provided with the software named "orange_county.pdf"

4.  Select "Generate Templates ..." from either the Options or Input Pane. TextConverter selects a template that best fits the input file's structure by default.

5.  Click the "Reset Output Dictionary" icon to create a linked input dictionary using the new template.

 
Notice the "Field_12" contains the city, state, and ZIP code.  For this example we will assume that your database requires the city, state, and ZIP code in three separate fields.

 
 
6.  First create three fields in the output dictionary to contain the city, state, and ZIP code.

a.  Click on the name "Field_12" in the outout dictionary (upper left pane) and change it to "Site CityStateZIP".  Note:  This step is not required but it will make the fields easier to keep track of during the exercise.

b.  Right click on field 12 "Site CityStateZIP" in the output dictionary and select "copy" from the fly menu.

c.  With field 12 "Site CityStateZIP" still highlighted, press Ctrl + V three times to paste 3 copies of the field above the original field in the output dictionary.

d.  Rename the three new fields (from top to bottom) "Site City", "Site State", and "Site ZIP".  Note: This step is also not required but it will make the fields easier to keep track of during the exercise.

e.  Now set the width of field 13 "Site State" to 2, since we will be working with 2 letter state abbreviations.  You may leave the other fields at a width of 25 or adjust them as you see fit.  Note:  Setting the field widths is not required for splitting the data from one field in to many.

7.  Now create the script that will load the new fields.

a.  Highlight the three new fields and right click.  To highlight multiple fields, use Ctrl + click or Shift + click.  Choose "Create Script Conversion" from the fly menu.  The script tab will open and a script will be created that looks like this:

'----------------- OnRecord -----------------
Function OnRecord
DictOut.SetFieldValue "Site City", value
DictOut.SetFieldValue "Site State", value
DictOut.SetFieldValue "Site ZIP", value
End Function

b.  Below this automatic script, copy and paste in this function (below) for splitting up address fields:

Dim st

'-----------------
Function CityStateZip(s, ByRef f1, ByRef f2, ByRef f3)
Dim p1
p1 = FindStatePos(s)
If p1 > 0 Then
    f1 = Trim(Replace(Left(s, p1-1), ",", Empty))
    f2 = Trim(Mid(s, p1, 3))
    f3 = Trim(Mid(s, p1+3 ))
End If
End Function


'-----------------
Function FindStatePos( s )
Dim i, p
For i = 0 To UBound(st) - 1
    p = InStr(s, " " & st(i) & " ")
    If p = 0 And Right(s, 2) = st(i) Then p = Len(s) - 2
    If p > 0 Then Exit For
Next
FindStatePos = p
End Function

st = Array( "AB","AK","AL","AR","AZ","BC","CA","CO","CT","DC","DE","FL","GA","HI", _
            "IA","ID","IL","IN","KS","KY","LA","LB","MA","MB","MD","ME","MI","MN", _
            "MO","MS","MT","NB","NC","ND","NE","NF","NH","NJ","NM","NS","NT","NV", _
            "NY","OH","OK","ON","OR","PA","PE","PQ","PR","RI","SC","SD","SK","TN", _
            "TX","UT","VA","VT","WA","WI","WV","WY","YT")

c.  Now go to the end of line two "Function OnRecord", hit return, and declare three variables like this:

'----------------- OnRecord -----------------
Function OnRecord
Dim city, state, zip

d.  Hit return again, call the new function "CityStateZip", and give it three parameters, the input field and the three output variables.  You can either type the function name "CityStateZip", or copy and paste it from below.  Add a space then drag "Field_12" from the input dictionary, then add a comma, and then type "city, state, zip".  It will look like this:

'----------------- OnRecord -----------------
Function OnRecord
Dim city, state, zip
CityStateZip DictIn.Field_12.value, city, state, zip

e.  Finally, replace the word "value" (a placeholder) after each of the new output fields with the variables city, state, and zip as follows (this is the full script):

'----------------- OnRecord -----------------
Function OnRecord
Dim city, state, zip
CityStateZip DictIn.Field_12.value, city, state, zip

DictOut.SetFieldValue "Site City",   city
DictOut.SetFieldValue "Site State",  state
DictOut.SetFieldValue "Site ZIP",    zip
End Function

Dim st

'-----------------
Function CityStateZip(s, ByRef f1, ByRef f2, ByRef f3)
Dim p1
p1 = FindStatePos(s)
If p1 > 0 Then
 f1 = Trim(Replace(Left(s, p1-1), ",", Empty))
 f2 = Trim(Mid(s, p1, 3))
 f3 = Trim(Mid(s, p1+3 ))
End If
End Function


'-----------------
Function FindStatePos( s )
Dim i, p
For i = 0 To UBound(st) - 1
 p = InStr(s, " " & st(i) & " ")
 If p = 0 And Right(s, 2) = st(i) Then p = Len(s) - 2
 If p > 0 Then Exit For
Next
FindStatePos = p
End Function

st = Array( "AB","AK","AL","AR","AZ","BC","CA","CO","CT","DC","DE","FL","GA","HI", _
 "IA","ID","IL","IN","KS","KY","LA","LB","MA","MB","MD","ME","MI","MN", _
 "MO","MS","MT","NB","NC","ND","NE","NF","NH","NJ","NM","NS","NT","NV", _
 "NY","OH","OK","ON","OR","PA","PE","PQ","PR","RI","SC","SD","SK","TN", _
 "TX","UT","VA","VT","WA","WI","WV","WY","YT")

f.  Click the "Check Syntax" button in the lower right corner to check for errors in syntax.  Click the preview tab to see the results.  Also, feel free to delete the original combined "Site CityStateZIP" field from the output dictionary.


8.  Now you are ready to connect the output data source and run the extraction process.

Save your project - TextConverter stores all of the current settings in a project file.  To save your project, choose "Save Project" form the file menu or click the disc icon () on the toolbar.  Elements of a project include:

  • path to the input data source
  • path to the output data source
  • paths to all other files and databases used in the project
  • the complete script
  • the mapping of the input dictionary to the output dictionary
  • All output dictionary settings
  • All "options" settings
  • Any other ETL settings
  • The workspace layout is NOT saved as part of the project but is instead retained with TextConverter
Learn more with Step 6. Extract second level data
Comments