Data Parsing Gets at the Roots of Your Legacy Data

Get at the roots of your data conversion problems by designing a custom data parsing algorithm. This is an iterative process that requires a knowledgeable programmer and a user who is well versed in the actual data: what it looks like, its anomalies and how they came about. The business decisions involved depart from the usual standards applied to programming work, because the data to be parsed is generally passed from one system to another only once. For that reason, some parsing will be done manually wherever that is the least costly solution.

A successful data migration project involves transforming data from the format and data integrity rules (or lack of rules) of the source system into a different format and a different set of rules: those that make up the database integrity of your new system. The process of evaluating, standardizing and interpreting the source data so that it can be properly reformatted and stored in the target system is sometimes referred to as “parsing”. All data in the source system must be looked at before migrating it to the new system. Nothing should be taken for granted, regardless of how well defined the data entry procedures for the old system may have been. Because of the nature of data, and the flexibility of use that is designed into systems, parsing is required for most data elements.

Perhaps the most widely used data items requiring a parsing algorithm are names and addresses. Name and address parsing is based on the concept that this information is composed of numerous components that have common, identifiable characteristics.

Although the process is not infallible, a high degree of success can be achieved in parsing out names and addresses so that they can be reformatted for use in a system with different formatting requirements than the system that originally captured the data. One of the most common problems that parsing cannot fully resolve is inconsistent data entry.

A name and address block is made up of three main components: name lines, address lines and city/state/zip code lines. Any of these may occur multiple times or may be absent. Each has particular characteristics that can be identified, and each is made up of its own set of components. Alternatively, any two or all three may be combined into one data field.
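As a rough illustration of identifying the three line types by their characteristics, here is a minimal Python sketch. It assumes US-style data in which a city/state/zip line ends with a two-letter state and a ZIP code, and an address line starts with a house number or a PO box. The patterns and the names `classify_line` and `classify_block` are hypothetical, not taken from any particular parsing product, and real data will need more rules than this.

```python
import re

# Assumed pattern: "<city>[,] <2-letter state> <5-digit ZIP or ZIP+4>"
CSZ_PATTERN = re.compile(
    r"^(?P<city>.+?),?\s+(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
# Assumed pattern: line starts with a house number or a PO box
ADDRESS_PATTERN = re.compile(r"^(\d+\s|p\.?\s*o\.?\s*box\b)", re.IGNORECASE)


def classify_line(line):
    """Label one line as 'city_state_zip', 'address' or 'name'."""
    text = line.strip()
    if CSZ_PATTERN.match(text):
        return "city_state_zip"
    if ADDRESS_PATTERN.match(text):
        return "address"
    # Anything that matches neither pattern is treated as a name line.
    return "name"


def classify_block(lines):
    """Classify every non-blank line of a name and address block."""
    return [(classify_line(line), line.strip()) for line in lines if line.strip()]


block = ["Mr. John Smith", "123 Main St", "Springfield, IL 62704"]
for label, text in classify_block(block):
    print(label, "|", text)
```

Lines that fall through to "name" only because no rule matched (a suite number on its own line, for example) are the kind of anomaly that the programmer and the data-savvy user would review together and either add a rule for or fix by hand.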

The name lines come first and are made up of a name prefix (e.g. Mr., Mrs., Ms., Dr.), a first name, a middle name, a last name and a name suffix (e.g. Jr., DDS). Compound names can be recognized by key words or characters such as “and” or “&”. Business names can be recognized by keywords such as “Company” or “Inc.”. To be successful, the name parser must take into account such things as misspellings, plural forms, hyphenated names and abbreviations. To allow flexibility for the unique characteristics of a particular region or business, the identification of these components is built into tables.
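The table-driven approach described above might look something like the following Python sketch. The table contents, the function names `normalize` and `parse_name_line`, and the output fields are all assumptions made for illustration; a real project would load its tables from region- or business-specific configuration, and this sketch makes no attempt to handle misspellings or plural forms.

```python
# Hypothetical lookup tables; entries are illustrative only.
PREFIXES = {"mr", "mrs", "ms"}
SUFFIXES = {"jr", "sr", "dds"}
BUSINESS_KEYWORDS = {"company", "inc", "corp", "llc"}
COMPOUND_MARKERS = {"and", "&"}


def normalize(token):
    """Lowercase a token and drop surrounding periods/commas for lookup."""
    return token.lower().strip(".,")


def parse_name_line(line):
    """Split one name line into labeled components using the tables."""
    tokens = [t.strip(",") for t in line.split()]
    tokens = [t for t in tokens if t]
    keys = [normalize(t) for t in tokens]

    # Business and compound names are flagged and left whole for review.
    if any(k in BUSINESS_KEYWORDS for k in keys):
        return {"type": "business", "name": line.strip()}
    if any(k in COMPOUND_MARKERS for k in keys):
        return {"type": "compound", "name": line.strip()}

    parsed = {"type": "person", "prefix": "", "first": "",
              "middle": "", "last": "", "suffix": ""}
    if keys and keys[0] in PREFIXES:
        parsed["prefix"] = tokens.pop(0)
        keys.pop(0)
    if keys and keys[-1] in SUFFIXES:
        parsed["suffix"] = tokens.pop()
        keys.pop()
    # Whatever remains is read as first, middle(s), last.
    if tokens:
        parsed["last"] = tokens.pop()
    if tokens:
        parsed["first"] = tokens.pop(0)
    parsed["middle"] = " ".join(tokens)
    return parsed


print(parse_name_line("Mr. John Q. Smith, Jr."))
print(parse_name_line("Acme Supply Company"))
```

Because the parser splits only on whitespace, a hyphenated last name such as "Smith-Jones" stays in one piece. Adding a new prefix, suffix or business keyword for a particular data set means editing a table, not the parsing logic.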
