Data Conversion with Microsoft Access Part Three

 

In this part we look in more detail at how to perform some specific transformation processes using Microsoft Access.

Advanced Transformation Topics

The database in this example was relatively simple, and you could argue that a single query written on the load file could complete all of the transformation and formatting changes required. However, breaking the process into steps with staging tables makes the process more transparent, and some transformations require staging tables in any case.

Removing Duplicate Entries

One transformation that is often required is the removal of duplicate entries. In our example database, if we were receiving multiple extracts from multiple legacy systems we could end up with duplicate entries.

The easiest way to remove duplicate entries is to create a table with the same structure as the data you have, with a primary key defined on the combination of fields that needs to be unique. In our example we will create a copy of tbl:Load:MCHB:Generic, using the cut and paste technique to create a table tbl:Load:MCHB:Unique.

Then, in the design view, select all the fields that in combination need to be unique.

clip_image002

In the example above we are saying that the combination of material, plant, Sloc and batch represents a unique entry, so if we have multiples of this combination then there are duplicate entries that need to be removed.

Now all we need to do to use this table is create an append query that populates it; the appending process will eliminate the duplicate entries.
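For reference, in SQL view such an append query might look something like the sketch below. The table names are the ones used in this example, and SELECT * assumes the two tables have identical structures:

INSERT INTO [tbl:Load:MCHB:Unique]
SELECT *
FROM [tbl:Load:MCHB:Generic];

Because the target table has a primary key on the unique combination of fields, any duplicate rows are rejected during the append.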

When running the append query in the foreground you get a message similar to the one below, indicating that records were removed due to key violations.

clip_image004

Cleaning up Spaces

Data elements from the source system may often have multiple spaces or leading and trailing spaces. Leading and trailing spaces are simple to remove using the Trim function. Create an update query that includes the fields that you want, but leave out the field from which you want to remove the leading and trailing spaces. Add a name for your cleaned field followed by a colon, then select the Expression Builder button.

clip_image006

Then insert the Trim$ function with the name of the field to be cleaned up in square brackets.

clip_image008
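For example, assuming a source field called description (an illustrative name), the cleaned field might be defined as:

clean_desc: Trim$([description])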

Concatenating Data

Another common task is to concatenate fields together. This can be done within the Expression Builder using the ampersand symbol (&).

In the example below I have concatenated the material and description fields with an underscore symbol

clip_image009
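The expression behind this would be along the lines of the following (the output column name mat_desc is just an illustration):

mat_desc: [material] & "_" & [description]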

Parsing Data

Just as data sometimes needs to be concatenated, it is often necessary to parse data into separate columns; a common example of this might be an address held in one field that needs to be split into multiple fields. Parsing can be achieved by using certain characters that represent splits between data elements (such as spaces or commas), or it can be done based on the number of characters (fixed width), i.e. the first 10 characters represent the first element, the next 5 represent another element and the last 20 represent a third element.

In the example below I have an address that is defined in one single field using commas as the delimiter.

clip_image010

The transformation required is to parse this into the table structure below with separate fields.

clip_image011

The first step is to create a query with some intermediate fields that identify where the commas are. To do this you can use the InStr function, which identifies the position of a specific character in a text string.

The InStr function has four parameters.

clip_image012
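In outline the parameters are the start position, the string to search, the string to search for, and the compare method (1 = text, i.e. case-insensitive):

InStr(start, string1, string2, compare)

So InStr(1,[address],",",1) returns the position of the first comma in the address field, searching from the first character.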

The InStr function has a start position, so once the first comma has been found the search for the second has to start after the first. In our example there are five fields separated by four commas, so four fields need to be defined to identify where the commas are. For this we create four fields, firstcom, secom, thircom and fourcom; each one uses the InStr function to identify where the next comma is.

clip_image014

The formulas are

firstcom: InStr(1,[address],",",1)

secom: InStr([firstcom]+1,[address],",",1)

thircom: InStr([secom]+1,[address],",",1)

fourcom: InStr([thircom]+1,[address],",",1)

Each one starts searching at the previous comma position plus one character.

Additionally, it is necessary to determine the length of the address field. This can be done using the Len function.

length: Len([address])

Once all the commas have been identified, the Left$ and Mid$ functions can be used to parse out each field.

clip_image015

These use the following formulas:

street_num: Left$([address],[firstcom]-1)

street: Mid$([address],[firstcom]+1,[secom]-[firstcom]-1)

city: Mid$([address],[secom]+1,[thircom]-[secom]-1)

state: Mid$([address],[thircom]+1,[fourcom]-[thircom]-1)

zip: Mid$([address],[fourcom]+1,[length]-[fourcom])

Running this as an append query that updates the parsed table will convert the data.

clip_image017

If this were a fixed-width file the task could be completed with the Mid$ function, using constants to separate the fields.
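As a sketch, using the fixed widths mentioned earlier (10, 5 and 20 characters) and an assumed source field called textfield, the expressions might be:

first_element: Mid$([textfield],1,10)

second_element: Mid$([textfield],11,5)

third_element: Mid$([textfield],16,20)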

If there are many fields to parse this process can get quite involved, with many intermediate fields to create, so it can be simplified by writing a custom function that parses out separate words.

Custom Parsing function

To create a custom function you need to use VBA in a module.

This function will have three parameters,

· The string to parse

· The delimiting character

· And the word number you want.

In our example the address has five words, so if I want to pull out word three I could use my custom function like so:

PARSE([address],",",3)

This simplifies our query significantly, as it will now look like the one below.

clip_image019

And this parse function can parse any string with any delimiter, without the need for complex queries.

To create a custom function you must first create a module from the Create tab.

clip_image021

Within the module create a custom function called PARSE

So what kind of VBA code do you need to perform this function? Below is the function code.

clip_image023

The code basically does a couple of things. First, it loops through every character in the string looking for the delimiter character. When it finds one, it notes the position of that character and stores it as the end point of a word. It then knows it is looking for the end point of the next word, so it continues looping until the next delimiter is found, which represents the end of the second word, and so on until the last delimiter is found. The end of the last word is simply the total length of the string. These positions are stored in an array called words, which, when filled with our example, looks like the one below.

clip_image024

From this we can see that word one ends at position 3, word two at position 17, word three at 29 and word four at 38. There are in fact five words, so the last one ends at the end of the string, position 43.

Once it has a word count and the end positions, it uses the standard Left$ and Mid$ functions to separate out the words, with a Select Case block to handle the different choices depending on whether it is the first or last word. In my particular case, if someone asks for a word number higher than the number of words it returns a blank.
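Since the screenshot can be hard to read, below is a minimal sketch of a function along these lines. It is not necessarily the exact code shown above (the variable names and out-of-range handling are illustrative), but it follows the same approach of recording delimiter positions and then using Left$ and Mid$ inside a Select Case:

Public Function PARSE(strText As String, strDelim As String, intWord As Integer) As String
    Dim delimPos() As Long      ' end point (delimiter position) of each word
    Dim i As Long
    Dim delimCount As Long
    Dim wordCount As Long

    If Len(strText) = 0 Then
        PARSE = ""
        Exit Function
    End If

    ReDim delimPos(0 To Len(strText))

    ' Loop through every character looking for the delimiter
    For i = 1 To Len(strText)
        If Mid$(strText, i, 1) = strDelim Then
            delimCount = delimCount + 1
            delimPos(delimCount) = i
        End If
    Next i

    wordCount = delimCount + 1              ' the last word ends at the end of the string

    If intWord < 1 Or intWord > wordCount Then
        PARSE = ""                          ' requested word does not exist
    ElseIf wordCount = 1 Then
        PARSE = strText                     ' no delimiter found, return the whole string
    Else
        Select Case intWord
            Case 1                          ' first word
                PARSE = Left$(strText, delimPos(1) - 1)
            Case wordCount                  ' last word runs to the end of the string
                PARSE = Mid$(strText, delimPos(wordCount - 1) + 1)
            Case Else                       ' any middle word
                PARSE = Mid$(strText, delimPos(intWord - 1) + 1, _
                             delimPos(intWord) - delimPos(intWord - 1) - 1)
        End Select
    End If
End Function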

You could easily modify this function to simply return the word count, which would be a useful function in data conversion: in our example you could quickly analyze thousands of addresses to see how many words each contains. This would tell you whether they would all transform into the format you are expecting. For example, in some cities there may be a postal district, giving six words instead of five and immediately highlighting a problem.
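As a sketch, such a word-count variant (here called WORDCOUNT, a hypothetical name) could look like this:

Public Function WORDCOUNT(strText As String, strDelim As String) As Long
    ' Counts the number of delimiter-separated words in strText
    Dim i As Long
    Dim delimCount As Long

    If Len(strText) = 0 Then
        WORDCOUNT = 0
        Exit Function
    End If

    For i = 1 To Len(strText)
        If Mid$(strText, i, 1) = strDelim Then delimCount = delimCount + 1
    Next i

    WORDCOUNT = delimCount + 1
End Function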

Replacing Data

Replacing data based on its value can easily be achieved using the standard Replace function. In a similar way to the familiar Ctrl+H find and replace, you can find certain strings in a source string and replace them. The Replace function has six parameters.

clip_image025
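In outline, the six parameters are the source expression, the string to find, the replacement string, an optional start position, an optional count of replacements to make, and the compare method:

Replace(expression, find, replace, start, count, compare)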

Using the address table as an example, if I wanted to replace the word Way with Street I could create a query using the following formula:

clnadd: Replace([address],"way","Street")

clip_image026

Giving the results below comparing the original field with the transformed field.

clip_image028

Again, using staging tables and update queries allows you to “show your working” and gives you the option of going back a step if a destructive transformation has been done incorrectly.

Removing non ASCII characters

Another very frustrating data conversion issue arises when dealing with data extracted from mainframe systems that may or may not be Unicode compliant. In some extracts, characters may be embedded that do not show up in a text editor; you can look at the data and just see spaces. In many cases these are not real spaces (as defined by ASCII character 32) but characters that may be displayed quite differently in a Unicode system.

Below is a grid of the standard ASCII character set, showing characters 1-128.

clip_image029

You may find that the extracted text contains characters outside of this range that need to be limited to the 128 characters in the standard set.

In this case we need to create a custom function that loops through each character in a text string to find its associated ASCII code; if the code is higher than 127 we replace it with a safe character such as a space (ASCII 32).

In my example I have a table tbl:non Unicode that contains some characters outside of the normal 0-127 ASCII range. In a simple ASCII system these will display as spaces, and it may never have been known that there were additional characters embedded; however, in an extended ASCII or Unicode environment they will display.

In my example below the legacy system displayed a description

Rectangular blank 3 mm by 6

However, there are additional ASCII characters higher than 127, which in our Unicode compliant system will display as

Rectangular blank 3 †œÇmm byàõ6

Using a custom clean-ASCII function, CLascii, I can limit the text to a maximum ASCII value of 127 and replace anything higher with 32 (a space).

The query below uses the custom function CLascii to clean the text

clip_image030

Giving the result

clip_image031

The custom function is very similar to the parsing function but even simpler

clip_image033
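Here is a minimal sketch of what such a function might look like. It is not necessarily the exact code in the screenshot; it uses AscW so that characters beyond the extended range are also caught:

Public Function CLascii(strText As String) As String
    ' Replaces any character outside the standard ASCII range (0-127)
    ' with a space (ASCII 32)
    Dim i As Long
    Dim charCode As Long
    Dim strOut As String

    strOut = strText
    For i = 1 To Len(strOut)
        charCode = AscW(Mid$(strOut, i, 1))
        ' AscW can return negative values for high Unicode code points
        If charCode > 127 Or charCode < 0 Then
            Mid(strOut, i, 1) = " "
        End If
    Next i
    CLascii = strOut
End Function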

Depending on how the text files were imported into Access you may or may not see the characters. I have been confounded for hours trying to work out what was going on, because the text files were encoded in a way that showed these characters as spaces even though they were not.

Validations

Field Validations

In a similar way to creating a staging table to remove duplicates, a staging table can also be used to perform field-level validations. At a simple level these could be just format checks, such as dates or numeric values; at a more detailed level they could limit the values to predefined lists. Field validations can be added to any field in a table: within the validation area you define a formula that needs to be true for the data to pass the validation. One thing to note is that if you want to validate that a field only contains values from another table, you should not use validations for this purpose; instead, create a query with an outer join and look for the missing entries in the second table. Although validations can check against a list of values, you have to key the values in (as opposed to referencing a table).

clip_image035

In the example above the validation checks that the first character of the batch is not equal to B; batches that violate this rule are eliminated from the load process when running the append query.
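As an illustration (not necessarily the exact rule in the screenshot), a field-level validation rule that rejects batches starting with B could be written as:

Not Like "B*"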

clip_image037

Creating a validation staging table allows a query to be created against the generic table to identify which records fell out during validation.

Lookup Table validations

If a certain data element cannot be automatically transformed then unfortunately it will need to be manually transformed by a user going through the data line by line. In many cases the user will be selecting values from a configuration table, and you can assist them by loading the data into a staging table for manual transformation, where you limit the allowed values that can be entered in a field using lookup table validations.

For example, if the material group mapping is complex and is being performed manually by a user, you might want to extract the material group configuration table and use it as a validation table. To set a lookup validation against a field, select the Lookup tab and select the list box option. You will then have the option to select the lookup table you want to validate against.

We have a table tbl:Material:manual_transform which has been loaded with some material data; however, assigning the material group will be a manual task.

To prepare for this there is a material group table that shows the allowed values

clip_image038

As you can see, it may not be possible to programmatically determine whether something is a paint, a shellac or a finish, or whether a fixture is a bolt or a nut. So in this case this will be a manual transformation.

To set up the material group field on tbl:Material:manual_transform, select the Lookup tab, then select Combo Box and the table that holds the values.

clip_image039

You can select the number of columns to show from the lookup table and also restrict entries to the list values using the Limit To List option.

Now in the manual transform table the user can enter values from the drop down

clip_image040

Splitting Data Files

We have looked at some of the transformation tasks that an Access database may be required to do, but another important task Access can help with is preparing multiple load files quickly and efficiently. We have already seen that when creating a database we start with the end in mind, creating a table in the output format so we can simply export it as a text file to drive our load program. On a recent project, however, I was required to create a load file for batch-level inventory where each material / batch combination had to be loaded separately into the target system. Some materials had so many batches that they would run for a great deal of time, and as the inventory conversion must happen during the blackout we wanted to run the load from multiple files in parallel to reduce the overall run time.

It was decided that we could spare 21 background processes during the inventory conversion, so to maximize the load we would create 21 input files: one for each of the top 20 materials in terms of number of batches, and one for the rest.

Without any automation this would be a daunting task. You could create 21 queries, each specifying a different material and the 21st excluding those materials. However, the top 20 materials were changing on a daily basis, so a more flexible approach was required.

Establishing the Top 20 materials

The first step is to dynamically generate a top 20 list. In my example I needed the count of batches by material, which can be achieved with a simple aggregation query.

Starting with my output format table it is a simple task to create an aggregate count of batches via a query

Create a query on the output format table and add the two fields Material and Batch, then select the totals button (sigma).

clip_image042

Now we want to group by material and select a count of the batches

clip_image043

In my sample database I only have 24 materials defined with the batch counts below

clip_image044

So in this case we'll create a top 10 rather than a top 20.

To limit the result of this query to the top 10, back in the design view we set the return values to 10.

clip_image045

This limits the result to the top 10; adding a descending sort gives the result below.

clip_image046
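In SQL view the finished query would be roughly the following (the table name tbl:Output:MCHB is an assumption for illustration):

SELECT TOP 10 [material], Count([batch]) AS CountOfBatch
FROM [tbl:Output:MCHB]
GROUP BY [material]
ORDER BY Count([batch]) DESC;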

I can now use this table to drive my individual outputs.

Parameter Queries

If I were trying to avoid using VBA, the only viable way to create separate extracts for each of these would be to use a parameter query. This is a query where you specify a parameter value when you run it.

Create a normal query for the output format, then press the Parameters button.

clip_image048

In the pop up create a parameter mat_t10

clip_image049

Now you can specify this as a criteria for the material field in the query using [mat_t10]

clip_image051

Now when running the query it will ask you for the parameter

clip_image052
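In SQL view the parameter query looks roughly like this (again the table name is an assumption):

PARAMETERS [mat_t10] Text ( 255 );
SELECT *
FROM [tbl:Output:MCHB]
WHERE [material] = [mat_t10];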

You could create a macro to run this query and then output the file, but you would have to cut and paste the material number each time, which is not very automated. A much better way is to use a little VBA to completely automate the task, as in the next section.

VBA approach using Record Sets

In a similar way to the parameter query above, you can create a query that specifies a custom function as its criteria; in this way you can use VBA to control the material number being passed.

Below is a query based on our output format table

clip_image054

You can see that the criteria for material is Top_20_mat(). It is not in square brackets because it is not a parameter; it is a Visual Basic function.

In the same way that we made custom functions for parsing and cleaning up data, we can create a VB function to represent the material value.

Before we dive into the function, let's take a step back and think about how we will automate our task.

Firstly we want to run our top 10 query and load the result into a top 10 table; we can use standard update queries and a macro to perform these steps.

Secondly, for each material in the top ten table we want to run our parameter query and then export a text file named using the material number.

Then we want to loop through the remainder of the top 10, performing the same two steps.

For this we will need to set each material in the top 10 as a parameter and perform the above two steps.

To achieve this we will need to use a global variable that is available to all objects in the database

Within the Visual Basic editor create a new module. (You don't have to create a new module, you could use the one already created, but to keep things simple I'm creating a new module for these output functions.)

clip_image056

clip_image058

The first thing we need to do is declare a global variable that can be seen by multiple objects in the database.

clip_image059

This is a public variable and it is outside of any functions so it can be seen by multiple functions.
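The declaration itself is a single line at the top of the module, something like:

Public GMAT As String    ' the material currently being exported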

The next step is to create the function top_20_mat() which will be used in our parameter query

clip_image060

All this function does is return whatever GMAT is currently set to.
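In outline it is simply something like:

Public Function Top_20_mat() As String
    ' Returns the material currently held in the global variable
    Top_20_mat = GMAT
End Function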

Now we create a routine that loops through the top 10 list and, for each record, sets GMAT to the current material.

It then runs the query and transfers it to a text file using the “output” specification we set up in a previous section.

This function uses a record set to be able to loop through the top 10 table.

clip_image062
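The screenshot above shows the routine; a minimal sketch along the same lines is below. The table name tbl_Top10, the query name qry_Output_Top10, the field name Material and the output path are illustrative assumptions, and the export uses the "Output" specification mentioned earlier. It is written as a Function so it can be called from a macro with RunCode:

Public Function OutputT10()
    ' Loops through the top 10 table; for each material it sets the global
    ' variable GMAT and exports the parameter-driven query to its own text file
    Dim db As DAO.Database
    Dim rs As DAO.Recordset

    Set db = CurrentDb()
    Set rs = db.OpenRecordset("tbl_Top10")      ' assumed name of the top 10 staging table

    Do While Not rs.EOF
        GMAT = rs!Material                      ' drives the Top_20_mat() criteria
        DoCmd.TransferText acExportDelim, "Output", "qry_Output_Top10", _
            "C:\Conversion\MCHB_" & GMAT & ".txt", False
        rs.MoveNext
    Loop

    rs.Close
    Set rs = Nothing
    Set db = Nothing
End Function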

Now all we need to do is create a macro that runs this function

To achieve this we use the runcode command

clip_image064

Then specify our OutputT10 function.

Running this macro will create 10 output files instantly

clip_image066

Summary

So far we have seen how data conversion as a process fits into an ERP implementation. We have also taken a look at the ubiquitous Microsoft Access and how it can be used to develop a repeatable conversion process. There are many specialized ETL tools available that provide a full suite of features to simplify the conversion process, but they usually come with a hefty price tag and learning curve. Microsoft Access 2010 comes as part of Office 2010 Professional, which is usually already available, so it can generally be considered free (or almost free at around $140), or at least a sunk cost. So before you jump straight into Excel, consider a more appropriate tool such as MS Access.
