Informatica AddressDoctor
Product Documentation
Version 5.6.0

AddressDoctor GmbH
Roentgenstr. 9
67133 Maxdorf
Germany
+49 (6237) 9774 0

USA
208 S WILMINGTON ST STE 200
Raleigh NC 27601-1434
United States
+1 (866) 402 2800

UK FreeCall 0800-0328-276
France Numéro Vert 0800 917113
India FreeCall 000800 1003486
Singapore FreeCall 800 1301756

[email protected]
www.AddressDoctor.com

Released: November 5, 2014

Foreword

This documentation explains the features and functions of Informatica AddressDoctor, previously known as the AddressDoctor Software Library, for postal address validation. You have selected a leading data quality product that provides superb address quality for postal addresses from all around the world.

This documentation is meant to serve beginners and advanced users alike. It covers all platforms supported by Informatica AddressDoctor and all available interfaces.

The Introduction chapter (chapter 2) provides a general overview of the Informatica AddressDoctor components and concepts and looks at them from a business perspective. Chapter 3 describes the installation process. Chapter 4 helps you get started right away, whether you are new to Informatica AddressDoctor or migrating from the previous version. The Concepts chapter (chapter 5) helps you understand important features of Informatica AddressDoctor. We recommend this chapter for all user groups. Advanced users will find chapter 6, “How do I…”, helpful. It provides sample code for common tasks.

This PDF document provides embedded bookmarks for fast access to document chapters. Open the bookmark view of your PDF viewer if it is not opened by default. Additionally, all chapter number references in the text provide hyperlink access to the reference targets.

Always check the release note section included with the API documentation, see chapter 10.2.
The Informatica AddressDoctor documentation is a work in progress, and we always appreciate user feedback that helps us improve it. Email your suggestions and comments to [email protected].

Informatica AddressDoctor Documentation – Last Revision: 5-Nov-14 @ 12:26

Contents

1. Document Conventions
2. Introduction
   2.1 Functional Overview
   2.2 Supported Platforms
   2.3 System Requirements
3. Setup
   3.1 General Remarks
   3.2 Installing the Library Files
   3.3 Installing the Reference Databases
4. Quick Start Guide
   4.1 First-Time Use of Informatica AddressDoctor 5
   4.2 New Features and Enhancements in Informatica AddressDoctor
5. Concepts
   5.1 Character Set Mapping
   5.2 Transliteration
   5.3 Address Element Abstraction
   5.4 Address Parsing
   5.5 Address Validation
   5.6 Informatica AddressDoctor
   5.7 AddressObjects
   5.8 Input and Output Encoding
   5.9 AddressElement Items and AddressLines
   5.10 Address Item Types
   5.11 Process Modes
   5.12 Process Parameters
   5.13 Output Formatting
   5.14 Output Standardization
   5.15 Alternative Names and Aliases
   5.16 AliasStreet Option Examples
   5.17 Process Status Values
   5.18 Mailability Scores
   5.19 Geocoding Status Values
   5.20 CAMEO Status Values
   5.21 CASS Status Values
   5.22 SERP Status Values
   5.23 SNA Status Values
   5.24 AMAS Status Values
   5.25 SendRight Status Values
   5.26 Country Specific Enrichment
   5.27 Element Status and Relevance Values
   5.28 Extended Element Result Status Fields
   5.29 ResultPercentage Values
   5.30 Language ISO Code Output
   5.31 Address Types
   5.32 Return Codes
   5.33 OptimizationLevel
   5.34 Preloading
   5.35 Caching
   5.36 Multithreading
   5.37 Memory Management
6. How do I…
   6.1 …initialize Informatica AddressDoctor?
   6.2 …determine Informatica AddressDoctor version?
   6.3 …specify processing or input parameters and a result format?
   6.4 …handle unlock codes?
   6.5 …configure reference databases?
   6.6 …determine the current engine settings?
   6.7 …assign an address to the AddressObject?
   6.8 …validate an address?
   6.9 …parse an address?
   6.10 …check the process mode?
   6.11 …retrieve a suggested correction?
   6.12 …retrieve the result status and additional information?
   6.13 …retrieve address enrichments?
   6.14 …analyze error conditions?
   6.15 …assign and process addresses in non-Latin script?
   6.16 …use Informatica AddressDoctor with multiple processor cores?
   6.17 …produce valid Informatica AddressDoctor XML?
   6.18 …use Informatica AddressDoctor XML for flexible Business Processes?
   6.19 …use Informatica AddressDoctor for Master Data Management?
   6.20 …use Informatica AddressDoctor in an eBusiness Environment?
   6.21 …use the Quick Address Entry Feature?
   6.22 …use Informatica AddressDoctor in a multi-tenant hosted environment?
   6.23 …use Informatica AddressDoctor for Web Services?
   6.24 …validate an address in CERTIFIED mode?
   6.25 …optimize performance?
7. Demonstration Applications
   7.1 ConsoleDemo Application
   7.2 AddressCheck (Windows only)
8. Sample Address Data for Testing
   8.1 Addresses with Status Code Vx
   8.2 Addresses with Status Code Cx
9. Miscellaneous Topics
   9.1 Background on the (Postal) Reference Database
   9.2 Postal Certifications
   9.3 Support Information
   9.4 Recommended Database Layout for International Addresses
10. Appendix
   10.1 API Document Type Definitions
   10.2 API Reference
   10.3 Schematic Representation of Informatica AddressDoctor Processing Flow
   10.4 AddressElement Output Examples
   10.5 Province Output
   10.6 Reference Data Copyright Notices
11. Glossary

1. Document Conventions

This document uses icons in the margin to indicate whether an explanation is specific to a certain version of the interface. While the functionality of the API is the same for all interfaces, different syntax may be required. The icons have the following meanings:

• Applies to the C interface (C wrapper) on all platforms
• Applies to the Java interface (Java wrapper) on all platforms

Sample program code is identified by its fixed-space typeface, for example:

    For i = 1 to 5
        j = i + 1
    Next

In some cases, this document abbreviates the Informatica AddressDoctor product name to AddressDoctor. For example, the document may use AddressDoctor as a sample address element name.

2. Introduction

Informatica AddressDoctor provides a powerful software library with functions to enhance and ensure postal address data quality.

With a world population of more than 6.5 billion people and increasingly global trade relationships, more and more people face the challenge of handling addresses from all around the world. At the same time, taking care of customer relationships is more important than ever, especially in today’s rushed world economy. We receive letters generated by computers, talk to computers on the telephone, and check out in supermarkets by ourselves.

Data, once in a computer system, is often considered to be correct. In many cases, it serves as the foundation for numerous business processes. The data in the system is rarely questioned. This can lead to dangerous situations, as we could all see in the movie “The Net”, where a young woman loses her identity because of deleted data.

The following example illustrates the situation:

    Data input by hotel staff: (shown as an image in the original document)

    Correct address needed for delivery:
        Sven Schreiber
        Feuerbergstr. 1
        67134 Birkenheide
        Germany

The Informatica AddressDoctor product line was introduced in 1994 by Platon Data Technology GmbH (now AddressDoctor GmbH), a German software company that has become the innovation leader in data quality tools for postal addresses. From the very beginning, Informatica AddressDoctor has specialized in international addresses.

Here is an overview of the Informatica AddressDoctor product line:

• Online Applications
• Web Services
• Software Library
• Reference Data

Reference data and the software library are the foundation for all Informatica AddressDoctor product offerings.

With its global launch, Informatica AddressDoctor has set new standards in terms of flexibility, ease of use, and processing power. Covering more than 240 countries and territories, the software encompasses knowledge about postal addresses from virtually anywhere.

The global approach of Informatica AddressDoctor is a direct and major benefit to its customers. Compared to a multi-vendor approach, it yields cross-country synergy effects and cost savings in:

   o Contract Management
   o Integration
   o Deployment
   o Licenses
   o Support and Maintenance
   o …

[Figure: price comparison between many different regional vendors (Vendor 1 for Country 1, Vendor 2 for Country 2, Vendor 3 for Country 3, …, Vendor n for Country n) and one vendor for one world (AddressDoctor).]

2.1 Functional Overview

Informatica AddressDoctor features several stages of address processing, namely Transliteration, Parsing, Validation and Formatting, which interact with each other.

2.1.1 Character Set Mapping and Transliteration

Informatica AddressDoctor incorporates functionality to handle international strings and their complexities. It uses fully Unicode-enabled string processing, which enables the transliteration of non-Roman characters into the Latin character set and mapping between different character sets.
• Storing data in and mapping between over 30 different character sets, including UTF-8, ISO 8859-1, GBK, BIG5, JIS, and EBCDIC
• Proper “elimination” of diacritics according to language rules
• Transliteration of various alphabets into Latin script:
   o Greek (BGN/PCGN 1962, ISO 843 – 1997)
   o Cyrillic (BGN/PCGN 1947, ISO 9 – 1995)
   o Hebrew
   o Japanese Katakana, Hiragana and Kanji
   o Chinese Pinyin (Mandarin, Cantonese)
   o Korean Hangul

For example:

    Input (data is in a foreign alphabet):
        ΑΘΗΝΑΣ 63
        105 52 ΑΘΗΝΑ
        GREECE

    After transliteration (data is in Latin script):
        ATHINAS 63
        105 52 ATHINA
        GREECE

2.1.2 Address Parsing, Formatting and Standardization

Restructuring incorrectly fielded address data is a complex and difficult task, especially for international addresses. People introduce many ambiguities as they enter address data into computer systems. Among the problems are misplaced elements (such as company or personal names in street address fields) and varying abbreviations that are not only language specific but also country specific.

The Informatica AddressDoctor Parser component identifies address elements in totally unfielded or partially fielded addresses and assigns them to the proper fields. This is an important precursor to the actual validation. Without restructuring, “no match” situations might result.

Properly identified address elements are also important when addresses have to be truncated or shortened to fit specific field length requirements. With the proper information in the right fields, specific truncation rules can be applied.

The Parser component has the following characteristics:

• Parses and analyzes free-form addresses and identifies individual address elements
• Detects countries (names, ISO codes, big cities, and so on)
• High processing speed
• Ideal pre-processing stage before validation
• Processes over 30 different character sets
• Formats addresses according to the postal rules of the country of destination
• Standardizes address elements (such as Avenue to Ave, Street to St, or vice versa)
• Identifies “address trash” elements such as telephone numbers and puts them into the proper fields

For example:

    Input (data is unstructured or in incorrect fields):
        7031 Columbia Gateway Dr, Suite 101 Columbia MD 21046 USA

    After parsing (all elements are stored in proper fields):
        House number:  7031
        Street:        Columbia Gateway Dr
        Sub-Building:  Suite 101
        City:          Columbia
        State:         MD
        ZIP:           21046
        Country:       USA

2.1.3 Global Address Validation

Address validation is the correction process in which properly fielded address data is compared against reference tables supplied by postal organizations or other data providers. Informatica AddressDoctor has to deal with improperly truncated data, incomplete data, missing address elements, ambiguous names and many other challenges. The Informatica AddressDoctor validation is designed to provide the best possible matches while minimizing incorrect modifications to address elements.

In many cases, it is not possible to fully validate an address. For these cases, Informatica AddressDoctor has a unique deliverability assessment feature that classifies addresses according to their probable deliverability.

The address validation feature of Informatica AddressDoctor has the following advantages:

• Leverages the world’s largest reference database of postal data.
• Validates individual address elements and checks for correctness using sophisticated fuzzy matching technology.
• Provides a batch validation mode for bulk address validation. Batch validation checks elements for correctness and changes them if necessary.
• Provides an interactive mode that enables you to check elements for correctness and improve them if possible. In the interactive mode, Informatica AddressDoctor provides pick lists of alternatives for ambiguous input data records.
• Provides a Fast Completion mode that automatically completes truncated or incomplete address elements to facilitate fast data entry.
• Provides a single-line address validation option in the Fast Completion mode. The single-line address validation option enables you to validate addresses that are entered as a single line in the AddressComplete element.
• Produces standardized and formatted output based on postal standards and user preferences.
• Uses an internal performance-optimized data storage mechanism for the reference data. No third-party database software is required.

For example:

    Incorrect input address:
        7031 Golumbia Gateway Dr.
        Suite 101
        Columbia MD 21044
        USA

    Corrected output address after validation:
        7031 COLUMBIA GATEWAY DR STE 101
        COLUMBIA MD 21046
        USA

2.1.4 CAMEO Socio-Demographic Encoding

In cooperation with the Callcredit Information Group, Informatica AddressDoctor provides address enrichment with socio-demographic characteristics (CAMEO) through our product offerings.

CAMEO offers a highly detailed system through which you'll learn all you need to know about your customers and markets, helping you make the best of every marketing opportunity.
Whether you're managing a customer database, searching for prospects or conducting market analysis, CAMEO will provide you with the latest socio-demographics and lifestyle data at Micro-cell level using a wide range of data variables, including:

Child Presence and Age; Adult Age; Single Households; Retired Households; Movement; Property Age; Urbanicity; Land Values; Qualifications; Further Education; Interest in Fashion; Cars Bought New; Previous Owners; Interest in Family Goods; Interest in Books and CDs; Arts and Culture; Culinary Interests; Interest in Business; Interest in Finance; Interest in the Lottery; Computer Literacy; Concern with Self Image; Activity Levels; Interest in Home and Garden; Mail Order Responsiveness; Car Age and Type; Car Manufacturer Origin; Engine Power and Size; Prestige Car; Purchase of Luxury Goods.

Use CAMEO to:

• Enhance and segment consumer databases
• Better understand your customers and responders
• Locate more prospects by finding look-alikes
• Perform area and site location analysis
• Understand market potential
• Perform advanced statistical analysis and modelling

CAMEO is available for the following countries: Australia, Austria, Belgium, Brazil, Canada, Czech Republic, Denmark, Estonia, Finland, France, Germany, Hong Kong, Hungary, Indonesia, Ireland, Italy, Japan, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Romania, Russian Federation, Singapore, Slovakia, South Africa, Spain, Sweden, Switzerland, the United Kingdom, and the United States of America.

As with all enrichment options, Informatica AddressDoctor validates each address before adding the CAMEO information. This improves the result, because the socio-demographic information is attached to a validated address.

Accessing CAMEO

For CAMEO codes to be available for a country, a customer must first be subscribed to the address validation reference data for that country and have a valid unlock code.
Informatica AddressDoctor always performs address validation prior to any enrichment process such as Geocoding and CAMEO encoding. Second, the customer must have a valid CAMEO unlock code (as with Geocoding) and a subscription to the CAMEO database for the selected countries. See your Informatica AddressDoctor Sales representative for CAMEO pricing and availability.

Output Details

The CAMEO enrichment offering provides multiple new output fields:

• CAMEOStatus helps you troubleshoot issues with the enablement of CAMEO in your environment. It tells you whether the databases are missing, whether the address cannot be encoded, or whether the encoding was successful.
• CATEGORY provides a code and description for the Age and Affluence of the address per country.
• GROUP provides a code and description for the neighborhood of the address per country.
• INTERNATIONAL provides a code and description of the Age and Affluence at an international level.
• MVID is a match key that can be used to link your CAMEO encoded addresses to another Callcredit Information Group product called CAMEO Analysis. CAMEO Analysis is a separate product offering that can be licensed directly from Callcredit Information Group.

2.1.5 Informatica AddressDoctor in a Nutshell

Informatica AddressDoctor provides a single, unified library (a DLL on Windows or a .so/.sl shared library on Unix systems) with both C function calls ("C API") and a Java API. The library accesses *.MD database files, which contain the postal reference data.

The software consists of a single engine which, after initialization, processes input addresses contained in AddressObjects. An AddressObject is the data structure that stores an input address, the parameter settings and the processing result (for details, see the “Concepts” chapter 5 below).

2.2 Supported Platforms

Informatica AddressDoctor version 5 is developed using the C++ programming language.
The resulting API is available for C and Java, provided by a single combined software library. The Informatica AddressDoctor documentation provides examples only for these two API flavors, which reflect the most common implementation languages, but the examples may be used to guide implementation in any programming language, such as C++, C#, VB.Net, PHP, Perl, Ruby or Python. Note that Informatica AddressDoctor can only provide support at the API level and does not provide support for implementation-specific questions.

While the primary development platform is Windows with Microsoft Visual Studio 2005, the Informatica AddressDoctor package is available for many hardware and software platforms, including Windows, AIX, Solaris, Linux, and HP-UX. Some of the packages are available on request if you cover the full cost of porting, build, test, and support. Contact Informatica AddressDoctor Support (see chapter 9.3) about the availability of specific platform package versions.

The following table lists the supported platforms and system configurations.

    Operating System                        Processor Architecture   Java Development Kit
    --------------------------------------  -----------------------  --------------------
    Windows XP Pro SP3                      x86 (32-bit)             Sun SE 7
    Windows Server 2008 SP2

    Windows XP Pro x64 Edition SP2          x64 (64-bit)             Sun SE 7
    Windows XP Pro SP3
    Windows Server 2008 R2
    Windows Server 2008 SP2

    SUSE Linux Enterprise Server 10 and 11  x86 (32-bit)             Sun SE 7
                                            x64 (64-bit)

    RedHat Enterprise 5 and 6               x86 (32-bit)             Sun SE 7
                                            x64 (64-bit)

    RedHat Enterprise 5 and 6               System z (64-bit)        IBM SE 7

    AIX 5.3, 6, and 7                       POWER (64-bit)           IBM SE 7

    Solaris 10 and 11                       Intel (64-bit)           Sun SE 7
                                            SPARC (64-bit)

    HP-UX 11                                Intel Itanium (64-bit)   HP SE 5

2.3 System Requirements

Informatica AddressDoctor has been designed to achieve the best possible performance while being highly efficient in its memory and resource usage.
To ensure the best possible performance, a fast I/O system and sufficient memory are recommended. At the time of writing, the entire worldwide postal reference database requires around 15 to 20 GB of disk space. Additional disk space is needed when the United States certified databases are also used.

As with most applications, the engine performs better with more memory and a faster processor. The minimum requirements are 512 MB of memory for validation operations and 128 MB of memory if only parsing is required. To optimize performance, the most commonly used databases should reside in memory (see chapter 6.25 for details). We therefore recommend having at least 1 GB of RAM available; loading the full reference database set into memory requires at least 16 GB. As this exceeds the maximum of 3 GB of memory that a 32-bit operating system can typically address (see chapter 6.25), Informatica AddressDoctor strongly recommends using 64-bit operating systems in production.

3. Setup

Informatica AddressDoctor has been designed to be independent of other software modules, which eases the setup process. All components have been linked in such a way that no external dependencies on libraries or DLL files (on Windows) exist, apart from absolutely necessary core system libraries like KERNEL32.DLL (on Windows) or libc.so (on Linux). In some cases, runtime files or software development environments may be required for the demo applications.

3.1 General Remarks

Informatica AddressDoctor is provided as a download in separate ZIP files (packages). There are packages for:

• Software libraries
• Documentation
• Postal reference data (see section 9.1)

3.1.1 Software packages

The file names of the software packages contain the release date in the format YYMMDD as well as the release version in the format 5.0.x.yyyy.
In this naming convention, x is the major build version and yyyy is the minor build version, which is three to four digits long. In addition, the file names include the platform (PPP) and the architecture (32 or 64 bit). File names may also contain compiler information if different compiler versions are available for one platform. The file names are thus as follows (all examples are without compiler information):

    AD5_PPP_32/64_YYMMDD_(5.0.x.yyyy).zip

The platform codes (PPP) for the C/Java packages are:

• WIN: Windows
• RHT_SUSE: Linux (Red Hat and SUSE)
• AIX: AIX
• SOS: Solaris SPARC
• HPU: HP-UX

For example, a file could be named AD5_SOS_32_090210_(5.0.11.384).zip, which contains Informatica AddressDoctor with the C and Java API for Solaris SPARC 32 bit. A file name with extra compiler information would look like AD5_SOS_32_090410_(5.0.11.392)-sun.studio.11.zip. This package contains Informatica AddressDoctor with the C and Java API for Solaris SPARC 32 bit, compiled using Sun Studio 11.

3.1.2 Documentation package

The documentation for all wrappers and platforms is contained in a ZIP file that follows this naming convention:

    AD5_DOC_YYMMDD_(5.0.x.yyyy).zip

3.1.3 Postal reference data packages

The postal reference data is available in individual ZIP files for each supported country and territory. These files are named as follows:

    DB5_XXX5ZZ_YYMMDD.zip

Again, YYMMDD stands for the release date of the database. XXX is replaced by the ISO-3 alpha code (ISO 3166) as found in the Informatica AddressDoctor country list online (http://www.addressdoctor.com/en/countries-data/country-list.html), and ZZ denotes the type of reference data: currently BI for Batch/Interactive, FC for Fast Completion, Cx for CERTIFIED and GC for Geocoding. An example of a file name is DB5_DZA5BI_091210.zip for an Algerian Batch/Interactive database released on December 10, 2009.
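The naming conventions in sections 3.1.1 and 3.1.3 can be checked mechanically, for example when sorting downloaded packages. The following is a minimal sketch and not part of the Informatica AddressDoctor API: the class PackageNames and its methods are hypothetical names chosen for illustration, the regular expressions are derived only from the patterns and examples given above, and the two-digit year is assumed to lie in the 2000s.

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not shipped with the product): splits package file names into parts. */
final class PackageNames {

    // AD5_PPP_32/64_YYMMDD_(5.0.x.yyyy).zip, optionally followed by "-compiler" before ".zip"
    private static final Pattern SOFTWARE = Pattern.compile(
        "AD5_([A-Z_]+?)_(32|64)_(\\d{6})_\\((5\\.0\\.\\d+\\.\\d{3,4})\\)(?:-(.+))?\\.zip");

    // DB5_XXX5ZZ_YYMMDD.zip, where XXX is the ISO-3 country code and ZZ the database type
    private static final Pattern DATABASE = Pattern.compile(
        "DB5_([A-Z]{3})5([A-Z0-9]{2,4})_(\\d{6})\\.zip");

    /** Converts YYMMDD to a date. Assumption: all release years are 20YY. */
    static LocalDate toDate(String yymmdd) {
        int year = 2000 + Integer.parseInt(yymmdd.substring(0, 2));
        int month = Integer.parseInt(yymmdd.substring(2, 4));
        int day = Integer.parseInt(yymmdd.substring(4, 6));
        return LocalDate.of(year, month, day);
    }

    /** Parses a software package name; throws if the name does not follow the convention. */
    static Map<String, String> parseSoftware(String fileName) {
        Matcher m = SOFTWARE.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a software package name: " + fileName);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("platform", m.group(1));
        parts.put("bits", m.group(2));
        parts.put("released", toDate(m.group(3)).toString());
        parts.put("version", m.group(4));
        parts.put("compiler", m.group(5) == null ? "" : m.group(5));
        return parts;
    }

    /** Parses a reference data package name; throws if the name does not follow the convention. */
    static Map<String, String> parseDatabase(String fileName) {
        Matcher m = DATABASE.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a reference data package name: " + fileName);
        }
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("country", m.group(1));
        parts.put("type", m.group(2));
        parts.put("released", toDate(m.group(3)).toString());
        return parts;
    }

    public static void main(String[] args) {
        // File names taken from the examples in sections 3.1.1 and 3.1.3
        System.out.println(parseSoftware("AD5_SOS_32_090410_(5.0.11.392)-sun.studio.11.zip"));
        System.out.println(parseDatabase("DB5_DZA5BI_091210.zip"));
    }
}
```

The platform group is matched lazily so that codes containing an underscore, such as RHT_SUSE, are captured whole before the 32/64 marker.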
3.2 Installing the Library Files

3.2.1 C Installation

The C .lib file requires no special setup. Simply unpack the ZIP file, preserving the directory structure. The following directories (folders) will be created:

    /bin
    /etc
    /include
    /lib
    /src

The bin directory contains executable sample applications like the ConsoleDemo (see chapter 7 for details). At this time, the sub-directories under etc contain XML configuration file examples that must be copied to your working directory if you want to adjust the default behaviour of the ConsoleDemo application. The include directory contains all required header files. The .dll file (.so/.sl on Unix) is contained in the lib directory and must be copied to your shared library path (echo %PATH% on Windows or echo $LD_LIBRARY_PATH on Unix, see also chapter 7.1). The code of the sample applications (see chapter 7.1) is located in the src directory and its sub-directories.

Take extra care to remove any prior versions of the Informatica AddressDoctor shared library from your shared library path to avoid confusion and unnecessary support effort.

If you use configuration XML files, make sure that they are present in the directory from which your application references them (for the Informatica AddressDoctor ConsoleDemo, this is the working directory from which the executable is called).

On many UNIX platforms, the thread stack size must be increased to at least 1 MB, for example by using export AIXTHREAD_STK=1000000 on AIX or export PTHREAD_DEFAULT_STACK_SIZE=1000000 on HP-UX. Furthermore, we recommend setting ulimit -s unlimited.

3.2.2 Java Installation

To use Informatica AddressDoctor Version 5.5.0 and later through the accompanying JAR archive, you must have the Java Runtime Environment (JRE) version 7 set up on the device on which you install Informatica AddressDoctor. Previous versions of Informatica AddressDoctor work with the JRE version 6.
If you want to develop your own applications, you must have the Java platform (JDK) SE 7 installed on the device. However, on HP-UX, you can continue to use the HP SE 5 version of the JDK.

You can download the JRE package from the Sun Java website. Note that Informatica AddressDoctor does not officially support installations of Informatica AddressDoctor Version 5.5.0 and later that run on JRE versions earlier than Version 7.

The Informatica AddressDoctor ZIP archive contains several files that should be extracted to a directory on your computer, preserving the stored folder names (for an explanation of the archive structure, see chapter 3.2.1 above). After extraction, files might need to be copied as follows:

Windows

Copy AddressDoctor5.jar and AddressDoctor5.dll to the classpath of your Java Runtime. For instance, C:\Program Files\Java\jre\lib\ext typically resides in the system-wide classpath. However, the recommended practice is to set an application-specific classpath explicitly using the -cp switch, for example after unpacking the ZIP archive to the present working directory (see also chapter 7.1):

    java -Xss2048k -cp bin;lib/AddressDoctor5.jar -Djava.library.path=lib ConsoleDemoJava

Solaris and other Unix versions

Copy AddressDoctor5.jar and libAddressDoctor5.so (or libAddressDoctor5.sl for the HP-UX version of the Java wrapper) to the classpath of your Java Runtime.
For instance, /usr/j2se/jre/lib/ext typically resides in the system-wide classpath. However, the recommended practice is to set an application-specific classpath explicitly using the -cp switch, for example after unpacking the ZIP archive to the present working directory (see also chapter 7.1):

    java -Xss2048k -cp bin:lib/AddressDoctor5.jar -Djava.library.path=lib ConsoleDemoJava

Take extra care to remove any prior versions of Informatica AddressDoctor from your system-wide classpath to avoid confusion and unnecessary support effort.

If you use configuration XML files, make sure that they are present in the directory from which your application references them (for the Informatica AddressDoctor ConsoleDemoJava, this is the working directory from which the executable is called).

Ensure that your application can allocate sufficient memory. At this time, we recommend a thread stack size of 2048k if you intend to use the validation functionality, as well as a minimum of 512m of heap space. Assuming that the main class of your application is named MyApp and that you are using Informatica AddressDoctor on Linux from the lib sub-directory, compile and start it as follows:

    javac -cp .:lib/AddressDoctor5.jar MyApp.java
    java -Xss2048k -Xms512m -Xmx2048m -cp .:lib/AddressDoctor5.jar -Djava.library.path=lib MyApp

The Java Virtual Machine may encounter a limit on the amount of heap memory it can assign to an application, typically between 1.5 and 2.5 GB on 32-bit operating systems. This effectively limits the number of databases that can be preloaded.

On many UNIX platforms, the thread stack size must be increased to at least 1 MB, for example by using export AIXTHREAD_STK=1000000 on AIX or export PTHREAD_DEFAULT_STACK_SIZE=1000000 on HP-UX. Furthermore, we recommend setting ulimit -s unlimited.

Additionally, the IBM J9 JVM requires the JVM call java -Xmso2048k to increase the OS stack size.

3.3 Installing the Reference Databases

The postal reference databases are named XXX5YY.MD, where XXX stands for the ISO-3 alpha code (ISO 3166) of the country and YY for the database type. The postal reference databases are read-only and platform independent. The same database files may be used on all supported platforms. In the following list, ISO stands for the country code:

• Batch/Interactive – ISO5BI.MD
• Fast Completion – ISO5FC.MD
• Certified – ISO5Cx.MD with x={1,…,n}
• Address Code Lookup – ISO5AC.MD
• Standard Geocoding – ISO5GC.MD
• High Precision Arrival Point Geocoding – ISO5GCAP.MD
• Parcel Centroid Geocoding – ISO5GCPC.MD
• CAMEO – ISO5CA.MD
• Supplementary – ISO5Ex.MD with x={1,…,n}

To use any of these databases, you must have a valid unlock code that unlocks the database. For example, the following databases are presently available for Germany:

    DEU5BI.MD
    DEU5FC.MD
    DEU5GC.MD
    DEU5CA.MD
    DEU5AC.MD

3.3.1 Installation Notices

All reference database ZIP files are typically unpacked into the same directory, but storing them on different storage devices is also supported (via SetConfig.xml, see the respective DTD in Appendix 10.1 and chapter 6.5). Several applications may share a common set of reference databases on a shared read-only drive, although performance might suffer in such a setup (unless all databases are fully preloaded into memory, see chapter 6.25).

When full preloading of all the database files you require is not an option, Informatica AddressDoctor strongly recommends using SSDs (solid state drives) instead of mechanical HDDs (hard disk drives) for the database files. This significantly improves performance, especially under multithreading conditions (see chapter 6.25 for more details).

Postal reference database files carry expiration date information to honour data provider license terms and to help customers determine whether they are using current and relevant reference data.
This ExpirationDate information is accessible via GetConfig.xml (see the respective DTD in Appendix 10.1). In certain cases, a database file is no longer accessible after its ExpirationDate has passed, for example when data provider license terms require this.

Depending on the number of database files in use, it may be necessary to raise the number of file handles or descriptors available to a process, for example by setting ulimit -n 8192 on UNIX-type operating systems. Note that internal limitations of the libc version in use might also interfere, in which case an upgrade to the latest patch version or fix pack of the operating system is required.

3.3.2 Special Remarks for USA CASS Certified Mode

US CASS certified processing requires additional databases that contain CASS-related information such as Carrier Route codes, EWS, ZIPMOVE, LACSLink, DPV, DFS2, SuiteLink, and so on. This information is contained in files named USA5C1.MD to USA5C25.MD (at the time of writing, subject to additions). These CASS-specific database files must be present in the database directory for their respective information to be available in the CERTIFIED process mode.

The ZIPMOVE file provides information about past ZIP code changes, whereas the Early Warning System (EWS) file contains information about upcoming changes to ZIP codes. Both are required for CASS certification. The regular EWS and ZIPMOVE database updates by USPS are made available through Informatica AddressDoctor and need to be placed in the database directory.

For DPV, LACSLink or SuiteLink information, the additional files mentioned above are needed, but USPS licensing terms do not allow this data to be stored outside the US. Therefore, this data is available only to US customers.
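Because the CERTIFIED process mode depends on the CASS-specific files being present in the database directory, a quick pre-flight check before starting the engine can save troubleshooting time. The following is a minimal sketch that uses only the Java standard library and no Informatica AddressDoctor calls: the class CassDatabaseCheck and its method are hypothetical, and the range 1 to 25 reflects only the file names quoted above, so it is passed in as a parameter and should be adjusted to the set of files you are actually entitled to.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check (not part of the product API). */
final class CassDatabaseCheck {

    /**
     * Returns the names USA5C<first>.MD .. USA5C<last>.MD that do not exist
     * as regular files in the given database directory.
     */
    static List<String> missingCertifiedFiles(Path dbDir, int first, int last) {
        List<String> missing = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            String name = "USA5C" + i + ".MD";
            if (!Files.isRegularFile(dbDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the database directory is given as the first argument,
        // otherwise the current working directory is checked.
        Path dbDir = Paths.get(args.length > 0 ? args[0] : ".");
        // 1..25 is the range named in this section at the time of writing.
        List<String> missing = missingCertifiedFiles(dbDir, 1, 25);
        if (missing.isEmpty()) {
            System.out.println("All listed CASS database files were found in " + dbDir);
        } else {
            System.out.println("Not found in " + dbDir + ": " + missing);
        }
    }
}
```

A missing file is not necessarily an error: whether a given file is needed depends on the features and licenses in use, so the output is best treated as a checklist.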
Since CASS Cycle L (2007 - 2009), DPV and LACSLink processing have been mandatory for achieving CASS certification; with CASS Cycle M (2009 - 2011), SuiteLink became mandatory as well. Even if the certified database files are missing, Informatica AddressDoctor adds ZIP+4 Codes as long as USA5BI.MD is available in the database folder (see also chapter 6.24).

SuiteLink contains suite numbers for business addresses in selected high-rise buildings and targets high-rise addresses with high-volume default mail. SuiteLink also improves business addressing information through the assignment of suite numbers. The new data provided by the USPS is made available to Informatica AddressDoctor end users through an updated USA5C18.MD file. Informatica AddressDoctor also allows records whose input suite data did not match the ZIP4 file to go through the SuiteLink process, ignoring the input suite data. If a match is found during SuiteLink processing, the input suite data is retained in the residue component and output on DAL2, as required by the USPS to retain the extraneous data.

Residential Delivery Indicator (RDI) processing was added to Informatica AddressDoctor for CASS Cycle N (2011 - 2012) processing and is available as of the 5.2.7 release. This component is optional and is not required to get the postal discounts. The new databases needed for RDI processing are USA5C22.MD and USA5C23.MD. RDI processing is intended for parcel shippers, their agents or analysts. The result of the processing is a single flag with the value “Y” or “N”. “Y” represents a residential delivery and is determined by the zip9 or zip11 not being found in the database. “N” represents a business delivery and is determined by the zip9 or zip11 being found in the database.

Starting with version 5.2.9, eLOT (Enhanced Line of Travel) processing has been added to Informatica AddressDoctor. The feature is enabled automatically as soon as the USA5C24.MD and USA5C25.MD databases are available.
The databases are created and distributed by Informatica AddressDoctor. The USPS is the official licensor of the eLOT database. Customers requiring eLOT for use with Informatica AddressDoctor need to obtain a license to the eLOT data from the USPS and pay the necessary license fees to the USPS prior to receiving the Informatica AddressDoctor eLOT database file. Upon proof of license, Informatica AddressDoctor provides the customer with the necessary entitlements and instructions for obtaining the Informatica AddressDoctor formatted eLOT database for use with the product. Contact the USPS National Customer Support Center at 1-800-238-3150 for information on how to obtain a license, or see https://ribbs.usps.gov/index.cfm?page=elot for additional information on eLOT licensing.

3.3.3 Special Remarks for Canadian SERP Certified Mode

The Canada SERP processing requires an additional database file, CAN5C1.MD, containing the PoCAD (Point of Call Address Data), which was introduced with the 2011 SERP cycle. The file has to be placed in the database directory specified in the SetConfig.xml file. SERP processing is not possible without this database.

Starting with version 5.4.2, Informatica AddressDoctor is SERP 2014-compliant. SERP 2014 compliance ensures that Informatica AddressDoctor versions 5.4.2 and later adhere to the following changes to the postal rules and regulations set by Canada Post:

o When the range-based Point of Call Address Database (PoCAD) has only one suite range available for a given address and the suite number in the input address is outside the available range, Informatica AddressDoctor marks that address as invalid.
However, if the input contains a postal code that maps to a Large Volume Receiver (LVR), Informatica AddressDoctor copies the input suite number to the output even when the input suite number does not match any of the corresponding database entries that contain the correct single suite-civic number combination.
o When the range-based PoCAD has only one address associated with a civic street and the input address does not match the address available in the database, Informatica AddressDoctor marks that address as invalid or non-correctable.
o When the range-based PoCAD has a Type 2 record that does not have a route identifier and delivery mode identifier available for a rural address, Informatica AddressDoctor handles that address in the same way it handles Type 1 addresses. However, the following conditions apply to the handling of rural civic addresses:
    o If the input address does not have a match in the range-based PoCAD and the postal code of the input address has a corresponding Type 4 address in the range-based PoCAD, Informatica AddressDoctor marks the address as VQ (Valid but questionable) in the SERP category enrichment.
    o If the input address is a rural address with a street having no civic street number, Informatica AddressDoctor adds the civic street number when a unique correction is possible. If no unique civic street can be added to the input address, that address is rejected.

Note that Post Office box numbers from 99900 through 99905 in Canada denote Deliver to Post Office (DTPO) addresses for retail outlet locations. Addresses with Post Office box numbers from 99900 through 99905 are specifically meant for parcel delivery and should not be used for other mail items such as letters, publications, etc.

For more information about SERP 2014 certification, contact Canada Post.
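The DTPO box number range described above lends itself to a simple pre-check before a mailing is assembled. The following is a minimal sketch in C, assuming the Post Office box number is available as a plain decimal string; the function name is_dtpo_box() is hypothetical and not part of the Informatica AddressDoctor API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true if the given Post Office box number
 * lies in the range 99900 through 99905, which the text above describes
 * as Deliver to Post Office (DTPO) addresses in Canada. */
bool is_dtpo_box(const char *box)
{
    char *end;
    long n;

    assert(box != NULL);
    errno = 0;
    n = strtol(box, &end, 10);

    /* Reject empty input, trailing non-digits and out-of-range values. */
    if (errno != 0 || end == box || *end != '\0')
        return false;

    return n >= 99900 && n <= 99905;
}
```

A caller might use such a check to route matching addresses to parcel delivery only, in line with the restriction stated above.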
3.3.4 Special Remarks for Australian AMAS Certified Mode

The Australian AMAS certified processing requires two additional databases: AUS5C1.MD and AUS5C2.MD. These database files need to be placed in the database directory specified in SetConfig.xml in the <DataBase> section with Type="CERTIFIED" for CountryISO3="AUS" or "ALL". Without the additional files, no AMAS processing is possible. The databases contain Postal Address File (PAF) data, which includes Australia Post’s Delivery Point Identifiers (DPIDs).

4. Quick Start Guide

To install Informatica AddressDoctor, unpack the ZIP file for the selected platform in such a way that the directory structure is preserved. For more information about the ZIP file and directory structure, see chapter 3. Similarly, unpack the postal reference database files to a destination directory of your choice.

Informatica AddressDoctor consists of a single engine which, after initialization, processes input addresses contained in AddressObjects, the data structure for storing an input address, parameter settings and the processing result (for details see the "Concepts" chapter 5 below).

4.1 First-Time Use of Informatica AddressDoctor 5

The engine needs to be initialized by a specific sequence:

o AD_Initialize() must be called to actually initialize the engine. It evaluates the settings and configures the engine accordingly. Only after this function has returned successfully may AD_GetAddressObject() or any other function be called.
o AD_DeInitialize() must be called last to de-initialize the engine; the engine is then ready to be initialized again. All AddressObjects must have been released by calling AD_ReleaseAddressObject() before calling AD_DeInitialize().
Consequently, include the following minimal C example code (the program flow is similar for Java) for correcting a single address from Singapore in your application (also refer to the latest API documentation, see Appendix 10.2):

    AD_AOHandle hAOHandle;
    char sResultXML[ 16 * 1024 ];

    /* Initialize the engine with a minimal configuration */
    AD_Initialize(
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n"
        "<SetConfig>\n"
        "<General />\n"
        "<UnlockCode>(Enter Code here)</UnlockCode>\n"
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n"
        "</SetConfig>\n",
        NULL, NULL, NULL );

    /* Obtain an AddressObject and assign the input address */
    AD_GetAddressObject( &hAOHandle );
    AD_SetInputDataXML( hAOHandle,
        "<?xml version='1.0' encoding='ISO-8859-1'?>\n"
        "<!DOCTYPE InputData SYSTEM 'InputData.dtd'>\n"
        "<InputData>\n"
        "<AddressElements>\n"
        "<Country Item='1' Type='NAME'>SGP</Country>\n"
        "<Locality Item='1' Type='COMPLETE'>Singapore</Locality>\n"
        "<PostalCode Item='1' Type='FORMATTED'>048624</PostalCode>\n"
        "<Street Item='1' Type='COMPLETE'>Raffles Place</Street>\n"
        "<Number Item='1' Type='COMPLETE'>80</Number>\n"
        "<Building Item='1' Type='COMPLETE'>#50-01 UOB Plaza 1</Building>\n"
        "<Organization Item='1' Type='NAME'>AddressDoctor GmbH</Organization>\n"
        "</AddressElements>\n"
        "</InputData>\n" );

    /* Process the address and fetch the result as XML */
    AD_Process( hAOHandle );
    AD_GetResultXML( hAOHandle, sResultXML, sizeof( sResultXML ) );

    /* Release the AddressObject before de-initializing the engine */
    AD_ReleaseAddressObject( hAOHandle );
    AD_DeInitialize();

Ensure that the minimal configuration XML passed to AD_Initialize() (see SetConfig.dtd in chapter 10.1 for configuration setting details) contains the valid unlock code you received when purchasing the Informatica AddressDoctor library and the correct destination path that your reference database files have been unpacked to. See InputData.dtd in chapter 10.1 for more details on the structure of data input as XML using AD_SetInputDataXML().
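Because the input address is passed as XML, values that contain XML special characters (for instance an organization name with an ampersand) would have to be escaped before they are embedded in the string handed to AD_SetInputDataXML(). The following C sketch shows one way to do this. The helper xml_escape() is hypothetical and not part of the Informatica AddressDoctor API, and it assumes single-byte or UTF-8 encoded input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst and replaces the five XML
 * special characters with their predefined entities.
 * Returns 0 on success, -1 if dst is too small for the escaped text. */
int xml_escape(const char *src, char *dst, size_t size)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; ++src) {
        char single[2];
        const char *rep;
        size_t len;

        switch (*src) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '\'': rep = "&apos;"; break;
        case '"':  rep = "&quot;"; break;
        default:
            /* Ordinary character: copy it unchanged. */
            single[0] = *src;
            single[1] = '\0';
            rep = single;
            break;
        }

        len = strlen(rep);
        if (used + len + 1 > size)   /* keep room for the terminator */
            return -1;
        memcpy(dst + used, rep, len);
        used += len;
    }

    if (used + 1 > size)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

The escaped value can then be placed between the element tags, for example inside <Organization>, when the InputData XML string is assembled. Escaping the apostrophe matters in particular for attribute values, since the examples in this chapter use single-quoted attributes.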
Depending on your requirements, you can also use 16-bit input and output (which is the default for Java); see chapter 5.8 for details.

Now compile your application as usual, making sure that the Informatica AddressDoctor dependencies are met. How to achieve this varies greatly between platforms and compilers. For example, on Linux with gcc, the following command builds the ConsoleDemo C++ example code (see chapters 3.2.1 and 7.1):

    gcc -Iinclude -Llib -lAddressDoctor5 -lpthread -o bin/ConsoleDemo src/ConsoleDemo.cpp

The output of Informatica AddressDoctor processing is provided in sResultXML in XML format; see Result.dtd (chapter 10.1) for the structure of the XML result from AD_GetResultXML():

    <?xml version="1.0" encoding="UTF-16"?>
    <Result ProcessStatus="V2" ModeUsed="BATCH" Count="1" CountOverflow="NO" CountryISO3="SGP" PreferredScript="DATABASE" PreferredLanguage="DATABASE">
      <ResultData ResultNumber="1" MailabilityScore="4" ResultPercentage="100.00" ElementResultStatus="F0F000F0F000404440E0" ElementInputStatus="60600060600020222060" ElementRelevance="10100010100000000010">
        <AddressElements>
          <Country Type="NAME_EN" Item="1">SINGAPORE</Country>
          <Locality Item="1">SINGAPORE</Locality>
          <PostalCode Item="1">048624</PostalCode>
          <Street Item="1">RAFFLES PLACE</Street>
          <Number Item="1">80</Number>
          <Building Item="1">UOB PLAZA 1</Building>
          <SubBuilding Item="1"># 50</SubBuilding>
          <SubBuilding Item="2">01</SubBuilding>
          <Organization Item="1">ADDRESSDOCTOR GMBH</Organization>
        </AddressElements>
        <AddressLines>
          <RecipientLine Line="1">ADDRESSDOCTOR GMBH</RecipientLine>
          <DeliveryAddressLine Line="1">80 RAFFLES PLACE</DeliveryAddressLine>
          <DeliveryAddressLine Line="2">#50-01 UOB PLAZA 1</DeliveryAddressLine>
          <CountrySpecificLocalityLine Line="1">SINGAPORE 048624</CountrySpecificLocalityLine>
          <FormattedAddressLine Line="1">ADDRESSDOCTOR GMBH</FormattedAddressLine>
          <FormattedAddressLine Line="2">80 RAFFLES PLACE</FormattedAddressLine>
          <FormattedAddressLine Line="3">#50-01 UOB PLAZA 1</FormattedAddressLine>
          <FormattedAddressLine Line="4">SINGAPORE 048624</FormattedAddressLine>
        </AddressLines>
        <AddressComplete>ADDRESSDOCTOR GMBH 80 RAFFLES PLACE #50-01 UOB PLAZA 1 SINGAPORE 048624 </AddressComplete>
      </ResultData>
    </Result>

Finally, an example for Java. Note that, in contrast to the C example, the “Encoding” attribute of the “Input” and “Result” elements must be set explicitly to UTF-16 or UCS-2 via Parameters.xml, as must WriteXMLEncoding for both SetConfig.xml and Parameters.xml, because Java defaults to its native 16-bit string handling (see chapter 5.8):

    private static AddressObject m_oAO;

    public static void main(String[] args) {
        int iLastError = 0;
        String sResultXML = "";

        // Initialize the engine with the SetConfig and Parameters XML
        try {
            AddressDoctor.initialize(
                "<?xml version='1.0' encoding='UTF-16' ?>"+
                "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>"+
                "<SetConfig><General WriteXMLEncoding='UTF-16' />"+
                " <UnlockCode>(Enter Code here)</UnlockCode>"+
                " <DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE'"+
                " Path='/ADDB' PreloadingType='NONE' />"+
                "</SetConfig>",
                null,
                "<?xml version='1.0' encoding='UTF-16' ?>"+
                "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>"+
                "<Parameters WriteXMLEncoding='UTF-16'>"+
                " <Input Encoding='UTF-16' />"+
                " <Result Encoding='UTF-16' />"+
                "</Parameters>",
                null);
            iLastError = AddressDoctor.getLastError();
            System.out.println("Using AddressDoctor version: " + AddressDoctor.getVersion());
            System.out.println("Init returned " + iLastError);
        } catch (AddressDoctorException ex) {
            System.out.println("Exception while initializing AddressDoctor: " + ex.toString());
            System.out.println("Further processing not possible, application ends!");
            return;
        }

        // Obtain an AddressObject; de-initialize the engine if this fails
        try {
            m_oAO = AddressDoctor.getAddressObject();
        } catch (AddressDoctorException ex) {
            System.out.println("Exception while trying to get an AddressObject: " + ex.toString());
            System.out.println("Further processing not possible, application ends!");
            try {
                AddressDoctor.deinitialize();
            } catch (AddressDoctorException ex2) {}
            return;
        }

        // Assign the input address
        try {
            m_oAO.setInputDataXML(
                "<?xml version='1.0' encoding='UTF-16'?>"+
                "<!DOCTYPE InputData SYSTEM 'InputData.dtd'>"+
                "<InputData>"+
                "<AddressElements>"+
                " <Key>4711</Key>"+
                " <Country Item='1' Type='NAME'>SGP</Country>"+
                " <Locality Item='1' Type='COMPLETE'>Singapore</Locality>"+
                " <PostalCode Item='1' Type='FORMATTED'>048624</PostalCode>"+
                " <Street Item='1' Type='COMPLETE'>Raffles Place</Street>"+
                " <Number Item='1' Type='COMPLETE'>80</Number>"+
                " <Building Item='1' Type='COMPLETE'>#50-01 UOB Plaza 1</Building>"+
                " <Organization Item='1' Type='NAME'>AddressDoctor GmbH</Organization>"+
                "</AddressElements>"+
                "</InputData>");
        } catch (Exception ex) {
            System.out.println("Data could not be assigned! Closing application: " + ex.toString());
            try {
                AddressDoctor.releaseAddressObject(m_oAO);
                AddressDoctor.deinitialize();
            } catch (AddressDoctorException ex2) {}
            return;
        }

        // Process the address
        try {
            AddressDoctor.process(m_oAO);
            iLastError = AddressDoctor.getLastError();
            System.out.println("Process returned " + iLastError);
        } catch (AddressDoctorException ex) {
            System.out.println("Exception during process: " + ex.toString());
        }

        // Fetch and print the result only if processing succeeded
        if (iLastError == 0) {
            try {
                sResultXML = m_oAO.getResultXML();
            } catch (AddressDoctorException ex) {
                System.out.println("Exception while trying to get ResultXML: " + ex.toString());
                return;
            }
            System.out.println(sResultXML);
        }

        // Release the AddressObject before de-initializing the engine
        try {
            AddressDoctor.releaseAddressObject(m_oAO);
            AddressDoctor.deinitialize();
        } catch (AddressDoctorException ex) {
            System.out.println("Exception while releasing the AO and de-initializing AddressDoctor: " + ex.toString());
        }
    }
}

Refer also to the C and Java source code provided for
Informatica AddressDoctor 5 ConsoleDemo described in chapter 7.1.

4.2 New Features and Enhancements in Informatica AddressDoctor

Informatica AddressDoctor adds features and enhancements in each product release. The following lists describe the features and enhancements in the current release and in earlier releases. For complete information on the product changes in any release, consult the release notes for the release.

4.2.1 What’s new in version 5.2.8

Informatica AddressDoctor introduces the following features and enhancements in version 5.2.8:

CAMEO social and demographic analysis
Informatica AddressDoctor returns CAMEO code values for the following countries: Australia, Austria, Belgium, Brazil, Canada, Czech Republic, Denmark, Estonia, Finland, France, Germany, Hong Kong, Hungary, Italy, Japan, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Romania, Singapore, Slovakia, Spain, Sweden, Switzerland, the United Kingdom, and the United States.

Australian localities and vanity names
Informatica AddressDoctor carries a valid vanity name from the input field over to the output field when you run the engine in Certified mode.

Australian Incremental Change File
Informatica AddressDoctor supports the Incremental Changes File through the AMAS Certified Software Program. The file contains a list of delivery point IDs that have changed between releases of the Postal Address File.

SERP and AMAS Certification
The software library has passed the respective requirements for SERP and AMAS 2012 certification.

Czech Republic Address Validation Enhancements
Reference data for the Czech Republic includes district numbers.

AddressType Output Field
Informatica AddressDoctor populates the AddressType output field for all countries whose reference data supports the identification of the address type.
Geocoding Enhancements
You can perform geocoding without first cleansing and standardizing the address data. As always, Informatica AddressDoctor recommends cleansing all addresses prior to geocoding in order to provide the most accurate coordinates possible.

4.2.2 What’s new in version 5.2.9

Informatica AddressDoctor introduces the following features and enhancements in version 5.2.9:

India Enhancements
Informatica AddressDoctor provides enhanced parsing and validation operations and also provides enhancements to the India reference database. Note that Informatica AddressDoctor does not support the Parse-Only mode for India address validation. Note also that older versions of the database are incompatible with version 5.2.9. Use the older database for version 5.2.8 and earlier versions, and use the new database for version 5.2.9 and later versions.

Country Improvements
Informatica AddressDoctor offers improved address validation for the following countries: Italy, Netherlands, Singapore, Hong Kong, Malaysia, Great Britain, Germany, and the United States. United States improvements include support for eLOT sequence numbers and street name aliases.

Enhancements to Transliteration, Parsing, and Formatting for Japan Addresses
Informatica AddressDoctor has made significant improvements to the way Japan addresses are processed and validated. Due to these changes, the format of the Japanese database has changed. To obtain the new functionality, download the new database for Japan. Note: if you use a previous version of the API with the new database, you will not benefit from the enhancements in version 5.2.9.

4.2.3 What’s new in version 5.3.0

Informatica AddressDoctor introduces the following features and enhancements in version 5.3.0:

Address Resolution Code
The Address Resolution Code is a new twenty-character output string that is similar to the Element Result Status field. It is populated for all non-valid (process status = Ix) records.
The Address Resolution Code explains why an address is rejected and directs you to possible solutions.

Extended Element Result Status Code
The Extended Element Result Status code is a new twenty-character output string that is similar to the Element Result Status field. It is populated for valid or corrected addresses. The code indicates that additional information may be available in the reference database for the given address.

Standardization of Non-valid Address Elements
Informatica AddressDoctor can standardize address elements for non-valid (process status = Ix) addresses. Standardized addresses can improve downstream business processes such as matching and de-duplication.

Dual Addresses
You can specify the address type against which to validate an address.

Additional Results in Fast Completion and Interactive Modes
Informatica AddressDoctor has increased the upper limit of the suggestion list from twenty to one hundred results. In addition, house number ranges can be expanded for countries where individual house numbers exist.

Support for Ireland
Informatica AddressDoctor introduces support for Ireland.

British Forces Postal File
Informatica AddressDoctor implements the Royal Mail British Forces Post Office (BFPO) data.

Multi-Language Support for Belgium
Informatica AddressDoctor introduces multi-language support for Belgian addresses. You can specify the language of the output, or you can preserve the language of the input address. Use the PreferredLanguage parameter to write the output address in French, Flemish or German.

Language ISO Code Output
Informatica AddressDoctor can write the ISO language code as output when the output address contains data from the reference database. The output is an ISO 639 3-letter code, for example, “DEU” for German. For transliterated output, the original language is reported, for example “JPN” for romanized Japanese output.
Austrian Postal Changes
Informatica AddressDoctor supports the latest address format for Austria. Austrian Post changed its address format in 2011. Informatica AddressDoctor 5.3.0 reflects the changes.

New and Removed Databases
Informatica AddressDoctor introduces the databases for the following countries:
o Curacao (CUW)
o Sint Maarten (SXM)
o Bonaire, Sint Eustatius and Saba (BES)
o Montenegro (MNE)
o Serbia (SRB)
o South Sudan (SSD)
The following databases are no longer supported:
o Serbia and Montenegro (SCG)
o Netherlands Antilles (ANT)

Australian Sub-Building Changes
Informatica AddressDoctor adds sub-building data to the Australian Database as of August 2012. Version 5.3.0 validates addresses with sub-building information for Australian addresses in Batch and Interactive modes. Informatica AddressDoctor supports two versions of the Australian databases to ensure compatibility with previous versions of the software library.

Singapore Updates
The reference data from Singapore Post includes floor, suite and door values. The Software Library supports the information in all processing modes.

Japan Updates
Japan reference data includes house numbers and address codes for Japan. Due to these changes, a new format of the Japanese database has been introduced. Informatica AddressDoctor supports two versions of the Japan data to ensure compatibility with previous versions of the software library.

Certifications
Informatica AddressDoctor is officially certified by five postal organizations around the globe: the United States Postal Service, Canada Post, Australia Post, New Zealand Post, and La Poste of France.

4.2.4 What’s new in version 5.3.1

Informatica AddressDoctor introduces the following features and enhancements in version 5.3.1:

Address Resolution Code Updates
Informatica AddressDoctor adds output values to the Address Resolution Code.
Extended Element Result Status
Informatica AddressDoctor adds output values to the Extended Element Result Status.

Country Improvements and Enrichments
o Informatica AddressDoctor offers improved address validation logic for the following countries: France, South Africa, China, Japan, the United Kingdom, and Serbia.
o Informatica AddressDoctor supports the Choumei Aza code as an enrichment for Japan.
o Informatica AddressDoctor supports the Unique Delivery Point Reference Number as an enrichment for the United Kingdom.
o Informatica AddressDoctor supports the Postal Address Code as an enrichment for Serbia.

Canada Enhancements
Informatica AddressDoctor adds enhancements in the following areas:
o Multi-language support
o Thirteen-character abbreviation for localities
o Rural Route information

4.2.5 What’s new in version 5.4.0

Informatica AddressDoctor introduces the following features and enhancements in version 5.4.0:

Point Address Geocoding
Point Address Geocoding enables highly accurate and precise “to the premise” geocoding points for properties and premises. Point address geocoding includes the following types of geocoding:
o Arrival Point geocoding. The geo coordinates are calculated for a point that is placed in the center of a street segment in front of the house.
o Parcel Centroid geocoding. The geo coordinates are calculated for a point that is at the geographic center of the parcel of land.

4.2.6 What’s new in version 5.4.1

Informatica AddressDoctor introduces the following features and enhancements in version 5.4.1:

Address Code Lookup
Address Code Lookup enables you to enter a country-specific address code and retrieve the complete or partial address for the code. Address Code Lookup is currently available for the following countries:
o Germany
o Great Britain
o Japan
o South Africa
o Serbia

Country Improvements and Enrichments
o The parser handles fields that are unique to Turkish addresses.
o Informatica AddressDoctor now supports native parsing for China. The parsing improvements enable better-quality address validation for China.
o Informatica AddressDoctor supports the two characters that Swiss Post has added to the Swiss postal codes.
o Informatica AddressDoctor can now return the new address code for deprecated or outdated addresses for Japan.
o Informatica AddressDoctor has improved the parsing and validation of Japan addresses, including the following:
    o Support for the transliteration of the JIS-2004 Japanese character set into the Latin character set.
    o Support for the Preserve Input Script parameter for Japan addresses.
o Informatica AddressDoctor supports additional variations for Post Office Box values in Australian addresses.
o Informatica AddressDoctor has improved the performance of address validation for India. Informatica AddressDoctor also returns suggestions for partial or incomplete Indian addresses in Interactive mode.
o Informatica AddressDoctor provides the National Address Database ID as an enrichment output field for South African addresses.
o Informatica AddressDoctor provides the Brazilian Institute of Geography and Statistics (IBGE) code as an enrichment output field for Brazilian addresses.
o Informatica AddressDoctor provides enrichment output fields for the Amtliche Gemeindeschlüssel (AGS), the locality ID, and the street ID in German addresses.

Sort Order for House Numbers
Informatica AddressDoctor returns a list of house numbers in logical order instead of alphanumeric order. For example, Informatica AddressDoctor now sorts and returns numbers in the following logical order: 1, 2, 3, 11, 12, 13, 14, 21, 22

Unlock Codes and Engine Expiration
Informatica AddressDoctor includes improvements and changes that relate to unlock codes and engine expiration. Starting with release 5.4.1, new unlock codes are needed for supplementary databases.
You must reinitialize the 5.4.1 engine with the new unlock codes in order to enable the supplementary databases. In addition, the GetConfig.xml file reflects the status of the engine at the time of the AD_Initialize call.

SendRight Certification
Informatica AddressDoctor has passed the 2014 Cycle of the SendRight Certification.

4.2.7 What’s New in Version 5.4.2

Informatica AddressDoctor introduces the following features and enhancements in version 5.4.2:

SERP 2014 Compliance
Informatica AddressDoctor is SERP 2014-compliant.

Extended Coverage for Point Geocoding
Informatica AddressDoctor extends point geocoding support to addresses in Austria, Denmark, Germany, the Netherlands, and Sweden.

4.2.8 What’s new in version 5.5.0

Informatica AddressDoctor introduces the following features and enhancements in version 5.5.0:

Support for Single Line Address Validation
Informatica AddressDoctor can parse and validate addresses that are entered in a single line in Fast Completion mode. Single line address validation is available for the following countries:
o Australia
o Canada
o Germany
o Great Britain
o New Zealand
o United States

Support for Taiwan
Informatica AddressDoctor database coverage includes Taiwanese (Republic of China) addresses. Note that Informatica AddressDoctor currently supports Taiwanese addresses in the Latin script.

Support for Locality Aliases (Vanity Names)
You can retain locality aliases, also known as vanity names, in the validated output.

Support for Java Version 7
Informatica AddressDoctor 5.5.0 uses version 7 of the Java Run-Time Environment. To develop your own applications, you must install the Java Development Kit SE 7 on the development machine. Note: You can continue to use the HP SE 5 version of the Java Development Kit on a machine that runs the HP-UX operating system.
Country Improvements
o Informatica AddressDoctor supports the new street name-based address system applied in South Korea.
o Informatica AddressDoctor supports the Postal Address Code as an enrichment for addresses in Austria.
o Informatica AddressDoctor supports the INSEE code as an enrichment for addresses in France.
o Informatica AddressDoctor supports Gmina codes, Locality TERYT IDs, and Street TERYT IDs as enrichments for addresses in Poland.

Support for Preserving Input Scripts
You can preserve the input script of addresses from Belarus, China, Greece, Kazakhstan, Macedonia, Russia, and Ukraine.

Cyrillic Support for Belarus and Macedonia
Informatica AddressDoctor extends Cyrillic transliteration support to Belarus and Macedonia addresses. You can enter and validate Belarus and Macedonia addresses in the native script.

Enhancements to United States Address Validation
Informatica AddressDoctor introduces the following improvements to United States address processing:
o Support for default unique ZIP code assignments
o Support for locality name override
o Improved handling of delivery instructions
o Improved handling of leading zeros in sub-building number elements

4.2.9 What’s New in Version 5.6.0

Informatica AddressDoctor introduces the following features and enhancements in version 5.6.0:

Country Improvements
o Informatica AddressDoctor adds support for Taiwan addresses in the Mandarin Traditional Chinese script, the native and official script in Taiwan.
o Informatica AddressDoctor adds an attribute, GlobalPreferredDescriptor, to specify the output format for street, building, and sub-building element descriptors in addresses from Australia and New Zealand.
o Informatica AddressDoctor adds support for Kilometer information as additional street information in valid Brazil addresses.
o Informatica AddressDoctor adds support for county and sub-building information in the Fast Completion output for United States addresses.
o Informatica AddressDoctor adds support for the new seven-digit postal codes in Israel.

Enhancements for Countries in the DACH Region
Informatica AddressDoctor adds support for keywords such as Zimmer and App in the house number field of addresses from Germany, Austria, and Switzerland. Informatica AddressDoctor parses the Zimmer and App information in the House Number field as sub-building information.

Japan Enhancements
o Informatica AddressDoctor adds support for Ban or block information in Japan addresses.
o Informatica AddressDoctor adds support for Gaiku code in Japan addresses.
o Informatica AddressDoctor now provides old and new Choumei Aza codes and the Gaiku code in Japan address output and supports a combination of the Choumei Aza code and the Gaiku code in Address Code Lookup for Japan addresses.

Spain Enhancements
Informatica AddressDoctor provides improved reference address data and the following validation improvements for Spain addresses:
o Identification of the building name and street name in the Delivery Address Line 1 field.
o Addition of a slash symbol (/) between a building element and a sub-building element when the sub-building element is a number.

United Kingdom Enhancements
o Informatica AddressDoctor adds support for rooftop geocoding for United Kingdom addresses.
o Informatica AddressDoctor adds support for Address Key values in United Kingdom addresses.

4.2.10 New Parameters Added in Version 5.6.0

Informatica AddressDoctor adds the following new parameters and values in Version 5.6.0.

o GlobalPreferredDescriptor attribute for the Result element in Parameters.DTD. Configures the output format for street, building, and sub-building element descriptors in Australia and New Zealand addresses, and the Strasse element in Germany addresses. Supported values are DATABASE, LONG, SHORT, and PRESERVE_INPUT.
• ADDRESS_KEY attribute for the SupplementaryGB element in Result.DTD. Provides the address key as an address enrichment for United Kingdom addresses.
• GAIKU_CODE attribute for the SupplementaryJP element in Result.DTD. Provides the Gaiku code as an address enrichment for Japan addresses.

4.2.11 Setting up Informatica AddressDoctor
Informatica AddressDoctor processes input addresses (AD_Process()) contained in the AddressObjects. As there is only one engine per process, there is no Informatica AddressDoctor handle. The engine must be initialized and de-initialized in a specific sequence:
• AD_Initialize() must be called first to initialize the engine. It evaluates the settings and configures the engine accordingly. Only after this function has returned successfully may AD_GetAddressObject() or any other function be called.
• AD_DeInitialize() must be called last to de-initialize the engine; the engine is then ready to be initialized again. All AddressObjects must have been released by calling AD_ReleaseAddressObject() before calling AD_DeInitialize().
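The call-order rules above can be pictured as a small state machine. The following Java sketch is only a model of those rules. The class EngineLifecycleSketch and all of its methods are hypothetical stand-ins and are not part of the Informatica AddressDoctor API. How the real engine reacts to a violation (for example, which return code it reports) is not modeled here; see chapter 5.32 for the actual return codes.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical model of the call-order rules only.
 * NOT the AddressDoctor API: every name in this class is illustrative.
 */
final class EngineLifecycleSketch {
    private boolean initialized = false;
    private final Set<Integer> openObjects = new HashSet<>();
    private int nextHandle = 1;

    /** Models AD_Initialize(): must be the first call of a cycle. */
    void initialize() {
        if (initialized) {
            throw new IllegalStateException("engine is already initialized");
        }
        initialized = true;
    }

    /** Models AD_GetAddressObject(): only valid after initialize(). */
    int getAddressObject() {
        requireInitialized();
        int handle = nextHandle++;
        openObjects.add(handle);
        return handle;
    }

    /** Models AD_ReleaseAddressObject(). */
    void releaseAddressObject(int handle) {
        requireInitialized();
        if (!openObjects.remove(handle)) {
            throw new IllegalArgumentException("unknown handle: " + handle);
        }
    }

    /** Models AD_DeInitialize(): refused while any AddressObject is still held. */
    void deinitialize() {
        requireInitialized();
        if (!openObjects.isEmpty()) {
            throw new IllegalStateException(
                    openObjects.size() + " AddressObject(s) have not been released");
        }
        initialized = false; // the engine may now be initialized again
    }

    boolean isInitialized() {
        return initialized;
    }

    private void requireInitialized() {
        if (!initialized) {
            throw new IllegalStateException("call initialize() first");
        }
    }
}
```

In this model, calling deinitialize() while a handle is still held throws an exception. After releaseAddressObject() has been called for every handle, deinitialize() succeeds and initialize() may be called again.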
The engine stores the following data for its internal use (see SetConfig.dtd):
• General engine configuration, for example the maximum amount of memory the engine may request from the OS
• The access codes; at least one valid access code must be supplied when calling AD_Initialize()
• Optional preloading parameters for the databases

In addition, the engine stores the following parameter data as a default for the AddressObjects (see Parameters.dtd in chapter 10.1):
• Process parameters, for example the processing mode to be used
• Input parameters, for example which input encoding is to be used
• Format specifications for the result, for example casing specifications

This configuration data has default values as specified by the corresponding SetConfig.dtd. The defaults can be changed by passing a corresponding config XML as a parameter to AD_Initialize(). The configuration cannot be changed after AD_Initialize() has been called. The parameter configuration data is used by the AddressObjects by default, when no alternative setting is made for a specific AddressObject.

4.2.12 AddressObjects
The AddressObject is a data structure for storing an input address, parameter settings and a result. AddressObjects store the following (configuration) data:
• Parameter settings (see Parameters.dtd)
• An input address (see InputData.dtd)
• A result (see Result.dtd)
• The last return code

For performance reasons, AddressObjects should be reused rather than created and destroyed frequently. Specifically, the parameter settings should be reused to avoid repeated settings overhead.

Informatica AddressDoctor manages all AddressObjects. Only a specific number of AddressObjects can be created. This number can be set in the initialization phase of the engine (using the “MaxAddressObjectCount” attribute).
The recommended setting is between one and two times the number of threads. The number of threads must be set using the “MaxThreadCount” attribute and is currently limited to a practical maximum of 32 threads (see chapter 5.36).

Note that the default parameter settings differ between Informatica AddressDoctor 4 and 5, for instance the standard settings for script and casing (see chapters 5.12 and 5.14 for details).

4.2.13 Direct API
The engine input and output in Informatica AddressDoctor 5 is based on an XML API. See the corresponding DTDs in Appendix 10.1 and chapter 5 for details. To ease the transition from Version 4, the engine partly supports setting and getting some of the XML values and attributes directly. For example, AD_SetInputAddressElement( hAOHandle, "PostalCode", 1, NULL, "67133" ) sets the item 1 postal code to "67133".

To set a street and a dependent street using the direct API in C, the “Item” parameter has to be 1 for the first and 2 for the second:

    AD_SetInputAddressElement( hAOHandle, "Street", 1, NULL, "Main St 5" );
    AD_SetInputAddressElement( hAOHandle, "Street", 2, NULL, "Dependent St 8" );

For example, to set 3 formatted address lines, the “Line” parameter has to be set from 1 to 3:

    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 1, "AddressDoctor GmbH" );
    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 2, "Roentgenstr. 9" );
    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 3, "D-67133 Maxdorf" );

Similarly, setting street and dependent street in Java:

    m_oAO.setInputAddressElement("Street", 1, "COMPLETE", "Main St 5");
    m_oAO.setInputAddressElement("Street", 2, "COMPLETE", "Dependent St 8");

And setting 3 formatted address lines in Java:

    m_oAO.setInputAddressLine("FormattedAddressLine", 1, "AddressDoctor GmbH");
    m_oAO.setInputAddressLine("FormattedAddressLine", 2, "Roentgenstr. 9");
    m_oAO.setInputAddressLine("FormattedAddressLine", 3, "D-67133 Maxdorf");

Both kinds of API functions may be intermixed, although this is not recommended. Specifically, note that calling AD_SetInputDataXML() first clears any existing input data, because input may not be assigned using both the direct and the XML API (see also the respective return code in chapter 5.32).

Complete direct API examples in C and Java follow (also refer to the latest API documentation, see Appendix 10.2):

    AD_AOHandle hAOHandle;
    char sCompleteAddress[ 4096 ];
    AD_U32 ulNumResults;

    // Initialize the engine with a SetConfig XML
    AD_Initialize(
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n"
        "<SetConfig>\n"
        "<General />\n"
        "<UnlockCode>(Enter Code here)</UnlockCode>\n"
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n"
        "</SetConfig>\n",
        NULL, NULL, NULL );

    // Get an AddressObject and set the address elements
    AD_GetAddressObject( &hAOHandle );
    AD_SetInputAddressElement( hAOHandle, "Country", 1, NULL, "SGP" );
    AD_SetInputAddressElement( hAOHandle, "Locality", 1, NULL, "Singapore" );
    AD_SetInputAddressElement( hAOHandle, "PostalCode", 1, NULL, "048624" );
    AD_SetInputAddressElement( hAOHandle, "Street", 1, NULL, "Raffles Place" );
    AD_SetInputAddressElement( hAOHandle, "Number", 1, NULL, "80" );
    AD_SetInputAddressElement( hAOHandle, "Building", 1, NULL, "#50-01 UOB Plaza 1" );
    AD_SetInputAddressElement( hAOHandle, "Organization", 1, NULL, "AddressDoctor GmbH" );

    // Process the address and fetch the first result, if any
    AD_Process( hAOHandle );
    AD_GetResultCount( hAOHandle, &ulNumResults );
    if( ulNumResults > 0 )
        AD_GetResultAddressComplete( hAOHandle, 1, sCompleteAddress, sizeof( sCompleteAddress ) );

    // Not necessary here, only if hAOHandle were to be filled with another input address
    AD_ClearData();

    // Release the AddressObject, then de-initialize the engine
    AD_ReleaseAddressObject( hAOHandle );
    AD_DeInitialize();

Or, alternatively, in Java:

    private static AddressObject m_oAO;

    public static void main(String[] args) {
        try {
            // Initialize the engine
            AddressDoctor.initialize(
                "<?xml version='1.0' encoding='UTF-16' ?>"+
                "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>"+
                "<SetConfig><General WriteXMLEncoding='UTF-16' />"+
                " <UnlockCode>(Enter Code here)</UnlockCode>"+
                " <DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE'"+
                " Path='/ADDB' PreloadingType='NONE' />"+
                "</SetConfig>",
                null,
                "<?xml version='1.0' encoding='UTF-16' ?>"+
                "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>"+
                "<Parameters WriteXMLEncoding='UTF-16'>"+
                " <Input Encoding='UTF-16' />"+
                " <Result Encoding='UTF-16' />"+
                "</Parameters>",
                null);

            // Get an AddressObject to use
            m_oAO = AddressDoctor.getAddressObject();

            // Set the address elements
            m_oAO.setInputAddressElement("Country", 1, "ISO_3", "SGP");
            m_oAO.setInputAddressElement("Locality", 1, null, "Singapore");
            m_oAO.setInputAddressElement("PostalCode", 1, null, "048624");
            m_oAO.setInputAddressElement("Street", 1, "NAME", "Raffles Place");
            m_oAO.setInputAddressElement("Number", 1, null, "80");
            m_oAO.setInputAddressElement("Building", 1, null, "#50-01 UOB Plaza 1");
            m_oAO.setInputAddressElement("Organization", 1, null, "AddressDoctor GmbH");

            // Process the AddressObject
            AddressDoctor.process(m_oAO);

            // If there is at least one result, print the address on the screen
            if (m_oAO.getResultCount() > 0)
                System.out.println(m_oAO.getResultAddressComplete(1));

            // Clear the AddressObject so that it may be filled with another input address
            m_oAO.clearData();

            // Release the AddressObject, all AddressObjects must be released to deinitialize
            AddressDoctor.releaseAddressObject(m_oAO);

            // Deinitialize the engine
            AddressDoctor.deinitialize();
        } catch (AddressDoctorException e) {
            System.exit(1);
        }
    }

Compare these examples with the native XML example shown in chapter 4.1 to understand the differences between direct and XML API usage and to evaluate which API mode better suits your needs.
Also, the example given above pertains to 8-bit data handling; see chapter 5.8 “Input and Output Encoding” for the differences when handling 16-bit data (which is the default for Java).

4.2.14 Transliteration (formerly UniString Object)
The “Transliteration only” process mode that Version 4 offered via the UniString object has been superseded. To obtain transliterated address elements without validation, set the “Mode” attribute of the “Process” element in Parameters.xml (see the DTD in chapter 10.1) to PARSE using AD_SetParametersXML() before submitting your AddressObject to AD_Process(). For this specific use case, an “OptimizationLevel” of NARROW is also recommended (see chapter 5.33).

4.2.15 Unlock Code Mechanism
The Informatica AddressDoctor unlock code mechanism has been slightly redesigned. Multiple unlock codes must be passed as separate XML elements (see chapter 6.4), and information about which databases have been unlocked can be queried using AD_GetConfigSettingsXML(); see chapter 6.6 for details.

4.2.16 Unlock Codes and Engine Expiration
Informatica AddressDoctor includes the following improvements and changes regarding unlock codes and engine expiration.

4.2.17 Not Yet Valid Unlock Codes
Previous versions of the engine reported unlock codes that were not yet valid with a status of “EXPIRED” in the GetConfig.xml file. Informatica AddressDoctor now reports these codes with a status of “NOT_VALID_YET”.
In addition, Informatica AddressDoctor no longer reports the following warning for these codes when calling AD_Initialize or AD_InitializeW:

    AD_SC_WRN_INIT_UNLOCKCODE_EXPIRED = 2 = “The SetConfig.xml contained at least one expired or not yet valid unlock code”

4.2.18 Adjacent Unlock Codes
Unlock codes that are adjacent to or overlap the currently valid unlock codes are now handled correctly by extending the internally computed engine expiration date accordingly. In the GetConfig.xml file, adjacent unlock codes are reported with a status of “NOT_VALID_YET”.

4.2.19 Engine Expiration
In order to use the library, valid unlock codes of type VALIDATION are required. If no valid codes are present in the SetConfig.xml file, the engine goes into an “expired” state. The engine does not accept any process calls in the expired state. Instead, it returns the following critical error code:

    AD_SC_CERR_EXPIRED = -1601 = “The engine usage period has expired or is not activated yet”

The same situation occurs with a call to AD_Initialize and AD_InitializeW. In Version 5.4.1 and later, the engine is initialized so that a GetConfig.xml file can be retrieved to get details about the unlock codes. Previous versions did not allow any calls to the engine. However, it should be noted that no process calls can succeed when the engine is in an “expired” state.

If the expiration date is reached and a call to AD_Process is made, the engine goes into an “expired” state and returns the above error code. The engine must be re-initialized with valid unlock codes.

5. Concepts
This chapter explains the major functions of Informatica AddressDoctor and shows how Transliteration, Formatting and the two major address processing stages, Parsing and Correction, interact. The entire functionality is implemented through two objects, Informatica AddressDoctor (frequently referred to as “Engine”) and AddressObject.
For the C and Java interfaces these objects have been mapped to functions. The following figure may act as a general guideline in understanding the sequence of the different processing stages an address is subjected to by Informatica AddressDoctor. A more detailed discussion is given in chapter 5.6. Informatica AddressDoctor supports address parsing and address verification for more than 240 countries and territories through one API. Consequently, Informatica AddressDoctor scales easily from a single country setup to multi country or even global scenarios. 5.1 Character Set Mapping In today’s computer environments we encounter numerous character sets. In the early days of computing most systems used either EBCDIC or ASCII character sets. Programmers and system designers used the concept of code pages to cope with the limited characters that were available on these computers. Several years later Unicode was introduced to address the problems associated with the large number of different character sets that are used around the world. With room for more than 65000 characters it seemed like a sufficient solution at first. Now even this character set has become too small to represent all characters from around the world, thus newer versions of Unicode support well over a million. When data is transferred or transported between different computer systems, character set mapping problems frequently occur. These problems result from different numerical values that are assigned to the “same” character in different character sets. While the basic ASCII characters of the Latin alphabet like A, B, and C are usually represented with the same numerical values, the problems often start once accented or other non-standard characters are used. These are often encoded differently in each character set. 
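The effect of differing numerical values can be reproduced with the Java standard library alone. The sketch below is not Informatica AddressDoctor code, and the class CharsetMappingSketch and its method names are hypothetical. It uses ISO-8859-1 and UTF-8 because every Java runtime provides them. It shows how the same character is stored as different byte values, what happens when bytes are read with the wrong character set, and how a character without a representation in the target character set can be replaced by a space.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper class; uses only the Java standard library. */
final class CharsetMappingSketch {

    /** Encodes text in the target charset; unmappable characters become a space. */
    static byte[] encodeWithSpaceFallback(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { ' ' });
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected, because both error actions are set to REPLACE.
            throw new IllegalStateException(e);
        }
    }

    /** Writes text in one charset and reads the same bytes back in another. */
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }
}
```

For example, the character Å (U+00C5) is the single byte 197 in ISO-8859-1 but two bytes in UTF-8. Calling misread("\u00C5", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1) therefore yields two unrelated characters, and encodeWithSpaceFallback("A\u3055", StandardCharsets.ISO_8859_1) yields the bytes 65 and 32 because さ has no ISO-8859-1 representation.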
The following table shows a comparison of the decimal values for some characters in the Latin and Unicode character set:

    Character   Latin   Unicode
    A           65      65
    B           66      66
    Å           143     197
    ß           225     223
    さ          n/a     12373

While some characters have the same numerical representation in both character sets, others have different values. If a file is created using one character set and then displayed using another character set, mapping problems will occur that lead to an illegible text at best. Taking the text ABÅß from Latin to Unicode without any mapping will lead to the output text AB•á, which is clearly different from the original input. Other characters, such as Japanese ones, cannot even be represented in Latin and would be lost if a Unicode file were viewed with a Latin interpretation.

Informatica AddressDoctor’s transliteration stage offers functionality¹ to address these issues. String data is internally stored in the Unicode UCS-2 format. Strings can be assigned in any of the more than 30 supported character sets. If data is retrieved in another character set, a mapping takes place to ensure that the characters are properly represented in the other character set. Provided that each character has a representation in the other character set, no information is lost. Characters that have no representation in a particular character set (such as さ in Latin) will be mapped to a space.

5.2 Transliteration
Transcription and Transliteration are processes of changing characters of one character set into characters of another character set, such as converting from Greek to Latin, or Japanese Katakana to Latin. A transliteration uses invertible mapping, so that a transliteration can be reversed without information loss.
In contrast, a transcription aims to provide non-native speakers with an approximate pronunciation of a word, based on the pronunciation rules of their own language. In practice, transliterations are consistent with transcriptions for many character sets, while no real (i.e. invertible) transliterations exist for most syllabic or ideographic scripts. Thus, for the rest of this document, transliteration is used to denote both transliteration and transcription.

Transliteration surpasses mere character set mapping, which is limited to the mapping between different numerical representations of a character (see the example in chapter 5.1). A language such as Japanese with the Katakana, Hiragana and Kanji characters has no direct representation in the English language. However, each Japanese character has a certain associated sound that can be approximated using phonetic Latin characters. Numerous transliteration schemes have been introduced for different languages. The following examples show how transliteration works for different languages:

    Ä → AE (Latin → Latin)
    ĝ → g (Latin → Latin)
    个 → ka (Japanese → Latin)
    Ж → ZH (Cyrillic → Latin)

¹ Note for users of the previous version: By virtue of the AD_UniString object. (See chapter 4.2.14.)

We can see that even within the Latin alphabet transliteration can be useful when certain extended characters cannot be represented in the target character set.

Most languages use only a subset of the sounds a human can produce, and these subsets differ from language to language. If a sound used by one language cannot be represented correctly in a different script, it must be approximated. This approximation may be quite inadequate if the sounds used in the target language for transliteration differ significantly from the sounds in the original language.
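A table-driven lookup is one simple way to picture the Latin and Cyrillic examples listed above. The following Java sketch is only an illustration and does not show how Informatica AddressDoctor implements transliteration. The class TransliterationSketch is hypothetical, its table contains just three of the mappings from this chapter, and passing unmapped characters through unchanged is an assumption made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical three-entry lookup table; not the engine's transliteration scheme. */
final class TransliterationSketch {
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        TABLE.put('\u00C4', "AE"); // A with diaeresis (Latin to Latin)
        TABLE.put('\u011D', "g");  // g with circumflex (Latin to Latin)
        TABLE.put('\u0416', "ZH"); // Cyrillic letter ZHE (Cyrillic to Latin)
    }

    /** Replaces every character found in the table; all others pass through unchanged. */
    static String transliterate(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

A mapping such as Ä to AE cannot be reversed reliably, because an AE in the output may also stem from an original AE. This is the kind of information loss that distinguishes an approximation from an invertible transliteration.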
The approximation problem is especially relevant when transliterating languages with very few syllables, such as Japanese (much less so for Chinese). Here are some examples of circular transliteration (i.e. English to kana to English) leading to dramatic changes:

    Original      Japanese            Transliterated
    Philippines   フィリピン           Firipin
    Düsseldorf    ヂュッセルドルフ      Dyusserudorufu
    Beethoven     ベートーベン         Betoben

These transliterated words provide challenges when working with transliterated place names for non-Asian countries that were previously represented in an Asian language. Examples using character set mapping and transliteration may be found in chapter 6.15.

One known limitation with transliteration of Japan addresses from Kanji script is that certain characters, when they are part of the first name of the contact, are incorrectly transliterated into Arabic numerals instead of the corresponding Latin letters. The following table shows the Kanji numerals that could cause this issue, their Arabic equivalents, and the preferred Latin transliterations.

    Kanji Numeral   Arabic Equivalent   Latin Transliteration
    一              1                   ichi
    二              2                   ni
    三              3                   san
    四              4                   yon
    五              5                   go
    六              6                   roku
    七              7                   nana
    八              8                   hachi
    九              9                   kyū
    十              10                  jū

5.2.1 Cyrillic Transliteration
Informatica AddressDoctor supports Cyrillic transliteration for the following countries:
• Belarus
• Kazakhstan
• Russia
• Macedonia
• Ukraine

5.3 Address Element Abstraction
Addresses have developed differently in different countries and cultures. Informatica AddressDoctor uses mapping and a hierarchical approach to describe the various address elements. At the foundation of the address “pyramid” is a country. Currently there are 193 United Nations member countries as well as several dependent and independent territories around the world.
Informatica AddressDoctor covers a total of over 240 countries and territories in its postal reference databases. A country is often subdivided into provinces or regions. These regions are not always required in postal addresses, however. Cities or localities belong to a region and buildings are typically assigned to streets. The buildings themselves can often be subdivided, which is done through the concept of a sub-building. An example of a sub-building would be a floor or suite number. Organizations then reside in sub-buildings and are subdivided into departments that in turn employ people (known as contacts). The following figure visualizes this abstraction model graphically:

Note that not all elements are present (or required) in all cases. Non-business addresses, for instance, will lack the Organization and Department elements. Also, different names may be in use for similar elements. As an example, the territorial subdivision in the USA is called a state. Canada calls it a province, while Switzerland has given the name Canton to this subdivision. Informatica AddressDoctor has defined general terms that allow mapping these concepts globally to the standardized “Item” fields of the AddressObject (see chapters 5.7 and 5.9).

As an example, the AddressObject contains an attribute with the name “Province”. Depending on the country, this field may contain the state (USA), the county (UK), the province (Canada), the prefecture (Japan), the Canton (Switzerland), the Bundesland (Austria or Germany) and so on. The following figure illustrates this mapping:

    Province: County (e.g. UK), State (e.g. USA), Province (e.g. Canada), Prefecture (Japan), Kanton (Switzerland)

Another example shows the Japan address system, which is divided into several address levels from the biggest entities down to blocks and buildings.
Informatica AddressDoctor can validate Japan addresses from the postal code and province level down to the street level, which are part of the reference data. The house number level is not included in the reference data and is only copied from the input.

5.4 Address Parsing
Addresses stored in computer systems were often entered by humans. Frequently, people entering data do not understand the nature of the information that they enter. Quite often the fields for storing the data are not sufficient, because they are either not long enough or because there are too few fields to store addresses from countries all over the world. See section 9.4 on page 159 for a recommended database layout to store addresses from all around the world.

Computer programs that validate postal addresses often rely on the information provided by the field names to identify address elements. While this is sufficient in a few cases, most of the time this information can be misleading because information was entered in the wrong fields. Consignee information such as names is frequently placed in address fields that are designed to store street information. Postal codes are entered in city fields and building numbers are placed in wrong locations. These are just some of the easier challenges found in international addresses where incorrectly fielded data is omnipresent. Analyzing address elements and assigning them to the proper fields is one of the most difficult challenges of handling postal addresses.

Informatica AddressDoctor implements a parsing engine that is independent of postal reference data. The parsing engine parses Japanese Kanji addresses natively. As a consequence, the parser as implemented in Informatica AddressDoctor can be used without any postal reference data present and is especially suited for OEM integration scenarios where the reference data can be added as needed.
The parsing functionality is implemented by the PARSE process mode (and is implicitly included with the other process modes described later in chapter 5.11). It can work either on fielded data that is retrieved from address element fields or on totally unfielded data, as found in databases that have just a line-by-line layout for address data. While structuring an unfielded address seems more difficult at first, handling the potentially conflicting information in a seemingly fielded address can be an even more difficult challenge. Here the name of the field might indicate that a street should be expected, but the software has to recognize that this “hint” is possibly wrong and decide to ignore it.

Depending on the “OptimizationLevel” attribute set in Parameters.xml (see the DTD in chapter 10.1), the Informatica AddressDoctor parser behaves differently with respect to fielded input element assignment (see chapter 5.33 for more detail):
• NARROW: The parser will honor input assignment strictly, with the exception of separation of House Number from Street information.
• STANDARD: The parser will separate address elements more actively, for example:
    o Province will be separated from Locality information
    o PostalCode will be separated from Locality information
    o House Number will be separated from Street information
    o SubBuilding will be separated from Street information
    o DeliveryService will be separated from Street information
    o SubBuilding will be separated from Building information
    o Locality will be separated from PostalCode information
• WIDE: Parser separation will happen similarly to STANDARD, but additionally up to 10 parsing candidates will be passed to validation for processing. Validation will widen its search tree and take additional reference data entries into account for matching.
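As an illustration of how such a setting could be assembled, the following Java sketch builds a Parameters XML string for a chosen process mode and optimization level. It is a sketch under assumptions: the class OptimizationLevelSketch and its method are hypothetical, and placing the “OptimizationLevel” attribute on the Process element is an assumption that should be verified against Parameters.dtd in chapter 10.1. The sketch only builds the string; it does not pass it to the engine.

```java
/** Hypothetical helper; builds a Parameters XML string but does not call the engine. */
final class OptimizationLevelSketch {

    /** The three optimization levels described in this chapter. */
    enum Level { NARROW, STANDARD, WIDE }

    /**
     * Returns a Parameters XML document as a string.
     * ASSUMPTION: OptimizationLevel is an attribute of the Process element;
     * verify the exact attribute location against Parameters.dtd.
     */
    static String parametersXml(String mode, Level level) {
        return "<?xml version='1.0' encoding='UTF-16' ?>"
             + "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>"
             + "<Parameters>"
             + "<Process Mode='" + mode + "' OptimizationLevel='" + level.name() + "' />"
             + "</Parameters>";
    }
}
```

For the parsing-only use case, parametersXml("PARSE", OptimizationLevelSketch.Level.NARROW) produces a document whose Process element carries Mode='PARSE' and OptimizationLevel='NARROW'.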
It is very important to note that, apart from the case of NARROW, such mixed input will naturally result in information being separated out into different AddressElement Items by the parser. For example, while Street and SubBuilding were jointly assigned as Street on input, that SubBuilding will be located in a SubBuilding Item on output (and not in Street), provided it could be identified as such.

5.5 Address Validation
When validating an address, each component is compared against a postal reference data set that is stored in Informatica AddressDoctor reference databases (see section 9.1). All elements of an address may be correct, thus resulting in positive validation. On the other hand, each individual element may match against the reference data, but the components do not make sense when looked at in their combination. Consider the following example:

    City: Wilmington
    ZIP: 90210
    State: CA

In this example each component is correct by itself. However, the components do not match, as the ZIP code does not belong to Wilmington and Wilmington is not in the state of California.

Whenever it is possible, Informatica AddressDoctor will attempt to correct such errors. To do this without endangering potentially correct data elements and creating “false positives”, great care is taken and very sophisticated algorithms are used to analyze and potentially correct the data. The algorithms used by Informatica AddressDoctor include fuzzy matching and heuristics to predict the best possible correction for an address.

It is always Informatica AddressDoctor’s intention to correct or improve an address if at all possible. Here it differs from most postal certification schemes such as the Coding Accuracy Support System (CASS) as introduced by the US Postal Service. These certification schemes intend to prevent poorly addressed mail from entering the postal mail stream, thus easing the work of the postal organization.
Informatica AddressDoctor’s intent, however, is to improve as many addresses as possible (see chapter 5.33 for more detail on “OptimizationLevel”).

5.6 Informatica AddressDoctor
Informatica AddressDoctor processes input addresses (AD_Process()) contained in the AddressObjects. As there is only one engine per process, there is no Informatica AddressDoctor handle. The engine must be initialized and de-initialized in a specific sequence:
• AD_Initialize() must be called first to initialize the engine. It evaluates the settings and configures the engine accordingly. Only after this function has returned successfully may AD_GetAddressObject() or any other function be called.
• AD_DeInitialize() must be called last to de-initialize the engine; the engine is then ready to be initialized again. All AddressObjects must have been released by calling AD_ReleaseAddressObject() before calling AD_DeInitialize().

The engine stores the following data for its internal use (see SetConfig.dtd):
• General engine configuration, for example the maximum amount of memory the engine may request from the OS
• The access codes; at least one valid access code must be supplied when calling AD_Initialize()
• Optional preloading parameters for the databases

In addition, the engine stores the following parameter data as a default for the AddressObjects (see Parameters.dtd in chapter 10.1):
• Process parameters, for example the processing mode to be used
• Input parameters, for example which input encoding is to be used
• Format specifications for the result, for example casing specifications

This configuration data has default values as specified by the corresponding SetConfig.dtd. The defaults can be changed by passing a corresponding config XML as a parameter to AD_Initialize(). The configuration cannot be changed after AD_Initialize() has been called. The parameter configuration data is used by the AddressObjects by default, when no alternative setting is made for a specific AddressObject.

5.7 AddressObjects
The AddressObject serves as a container object for a postal address. It has several properties that can store individual components of an address such as postal (ZIP) code, street name and building number, but also company or contact names. AddressObjects store the following (configuration) data:
• Parameter settings (see Parameters.dtd)
• An input address (see InputData.dtd)
• A result (see Result.dtd)
• The last return code

For performance reasons, AddressObjects should be reused rather than created and destroyed frequently. Specifically, the parameter settings should be reused to avoid repeated settings overhead.

Informatica AddressDoctor manages all AddressObjects. Only a specific number of AddressObjects can be created. This number can be set in the initialization phase of the engine (using the “MaxAddressObjectCount” attribute). The recommended setting is between one and two times the number of threads. The number of threads must be set using the “MaxThreadCount” attribute and is currently limited to a practical maximum of 32 threads (see chapter 5.36).

For setting the address elements in an AddressObject (as described in chapter 5.3, examples are Organization, Department, Street, Province, PostalCode and Locality), see chapter 6.7. For retrieving them, see chapter 6.11. Retrieving address elements individually is especially useful when the data to be processed originates from a database that has individual fields for address elements.

Alternatively, the AddressObject may be assigned unfielded FormattedAddressLine data (see chapter 6.7.4), where the address representation is only structured by delimiters such as linefeeds. The FormattedAddressLine representation is also helpful when retrieving processed address data: It will return the processed data according to the country specific formatting rules.
When assigning AddressObject values, either the address elements or the FormattedAddressLine representation should be used, while both representations are provided on output (see the Result.xml example at the end of chapter 3).

5.8 Input and Output Encoding
The XML encoding is passed within the XML header <?xml ?>; if none is explicitly set, UTF-8 or UTF-16 is the default (as defined by the XML standard and depending on the bit width chosen as described below). The different encodings for XML input and output may be specified via attributes (see chapter 6.3 and SetConfig.dtd/Parameters.dtd in chapter 10.1). For the direct API, the engine default encoding is ISO-8859-1.

Input and result data may be 8-bit or 16-bit. To deal with both character sizes, there are two sets of functions for any Set…() or Get…() functionality:
• The 8-bit versions have no special naming (e.g. AD_SetInputDataElement() or AD_GetResultDataParameter())
• The 16-bit versions end in W (for word, e.g. AD_SetInputDataElementW() or AD_GetResultDataParameterW())

When using the 16-bit API functions it is crucial to have set a corresponding 16-bit encoding, otherwise an encoding error code is returned. The 16-bit input functions Set...W() also support an additional parameter for the string length, thereby making it possible to pass non-zero-terminated strings. To enable passing zero-terminated strings of unknown length, the special value AD_AUTOLEN can be passed as string length; the engine then automatically determines the length.

The currently active encoding (see Parameters.dtd) must match the function used: When a 16-bit function is called, the encoding must also be 16-bit (e.g. UTF-16). This is specifically the case for the Java API (see the example in chapter 4.1), which supports 16-bit input and output only, in line with internal Java string handling.

5.9 AddressElement Items and AddressLines
Many of the direct API functions have an item or line parameter.
The same applies to the XML API, where XML element attributes are used for that purpose. These parameter numbers refer to the index or hierarchical level of an address element or line: items and lines are counted from 1 on, and the default for XML is 1 (see the DTDs in chapter 10.1).

To set a street and a dependent street using the direct API in C, the “Item” parameter has to be 1 for the first and 2 for the second:

    AD_SetInputAddressElement( hAOHandle, "Street", 1, NULL, "Main St 5" );
    AD_SetInputAddressElement( hAOHandle, "Street", 2, NULL, "Dependent St 8" );

For example, to set 3 formatted address lines, the “Line” parameter has to be set from 1 to 3:

    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 1, "AddressDoctor GmbH" );
    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 2, "Roentgenstr. 9" );
    AD_SetInputAddressLine( hAOHandle, "FormattedAddressLine", 3, "D-67133 Maxdorf" );

Similarly, setting street and dependent street in Java:

    m_oAO.setInputAddressElement("Street", 1, "COMPLETE", "Main St 5");
    m_oAO.setInputAddressElement("Street", 2, "COMPLETE", "Dependent St 8");

And setting 3 formatted address lines in Java:

    m_oAO.setInputAddressLine("FormattedAddressLine", 1, "AddressDoctor GmbH");
    m_oAO.setInputAddressLine("FormattedAddressLine", 2, "Roentgenstr. 9");
    m_oAO.setInputAddressLine("FormattedAddressLine", 3, "D-67133 Maxdorf");

Refer to chapter 6.7 for the valid combinations of AddressElement Items and AddressLines for address data input.
In the XML API case, an example for InputData.xml with two items assigned for sub-elements of the PostalCode (known as ZIP+4, see chapter 6.24) would be:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE InputData SYSTEM "InputData.dtd">
    <InputData>
      <AddressElements>
        <Country Item="1" Type="NAME">USA</Country>
        <Locality Item="1" Type="COMPLETE">Raleigh</Locality>
        <PostalCode Item="1" Type="UNFORMATTED">27601</PostalCode>
        <PostalCode Item="2" Type="UNFORMATTED">1356</PostalCode>
        <Province Item="1" Type="COUNTRY_STANDARD">NC</Province>
        <Street Item="1" Type="COMPLETE">Fayetteville Street</Street>
        <Number Item="1" Type="COMPLETE">133</Number>
        <SubBuilding Item="1" Type="COMPLETE">Suite 201</SubBuilding>
        <Organization Item="1" Type="COMPLETE">AddressDoctor</Organization>
      </AddressElements>
    </InputData>

See chapter 6.7 for details on how to process XML input using AD_SetInputDataXML(). In this XML example all “Type” attributes (see chapter 5.10) correspond to their default values, so omitting them would yield the same processing results.

The following table gives examples of what certain AddressElement items would typically contain:

Legend:
• Typically needed for correct global address representation
• Typically not populated for a correct address*
• Not available through the Informatica AddressDoctor 5 API
• Will be needed for future country support
• Not address relevant (will be copied over to output)
*but may contain certain input elements copied over to output

The sequence of AddressElement Items is hierarchical, as defined by reference data. Depending on reference data detail, some empty Items may thus be followed by ones filled again. Consequently, the item hierarchy on output is only really meaningful for AddressElements for which reference data is available (typically the ones that are postally relevant, like locality).
Make sure to check the “ElementResultsStatus” described in chapter 5.27.3 and “ElementRelevance” in chapter 5.27.4 to decide whether the hierarchical sequence of the output has been retained from input (parsing only, see chapter 5.4) or was adjusted based on reference data (parsing and validation, see chapter 5.5).

For information on AddressElement Item output from different countries, see Appendix 10.4. For a more complete introduction to international addresses and their address elements, see “The Global Source Book for Name and Address Data Management” by Graham Rhind: http://www.grcdi.nl/book2.htm

5.10 Address Item Types

Normally, each AddressElement Item number should only occur once, although some address elements may contain several logically separate sub-elements called Item Types. For example, items of the type TITLE, FIRST_NAME, MIDDLE_NAME, LAST_NAME and FUNCTION may be assigned to the sub-elements of each “Contact” address element at the same time, while there is little sense in assigning both the FORMATTED and the UNFORMATTED type for a “PostalCode” address element (also see the InputData.xml examples in chapter 5.9 and chapter 6.7.2).

For the majority of address elements the default item type is COMPLETE, which may be conveniently used whenever no separation of address elements into logically separate sub-elements is available for the input data (see chapter 6.7.2 on fielded assignment of address elements and the InputData.xml DTD in chapter 10.1 for valid item types). For instance, you may assign “Paris Cedex 11” either in one piece, as Locality Item 1, Type “COMPLETE”, or separately, with “Paris” as Locality Item 1, Type “NAME” and “Cedex 11” as Locality Item 1, Type “SORTING_CODE”.
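The two ways of assigning “Paris Cedex 11” can be sketched as follows. This is only an illustration, not part of the Informatica AddressDoctor API: the helper build_input_data() is a hypothetical name, Python's standard xml.etree.ElementTree module is used to assemble the XML, and the XML declaration and DOCTYPE line shown in the InputData.xml example of chapter 5.9 are omitted.

```python
import xml.etree.ElementTree as ET


def build_input_data(elements):
    """Build an <InputData> tree from (name, item, type, text) tuples.

    Hypothetical helper for illustration; element and attribute names
    follow the InputData.xml example in chapter 5.9.
    """
    root = ET.Element("InputData")
    container = ET.SubElement(root, "AddressElements")
    for name, item, item_type, text in elements:
        node = ET.SubElement(container, name, Item=str(item), Type=item_type)
        node.text = text
    return root


# Variant 1: the locality in one piece, using the default type COMPLETE.
complete = build_input_data([("Locality", 1, "COMPLETE", "Paris Cedex 11")])

# Variant 2: the same locality split into two item types of Item 1.
split = build_input_data([
    ("Locality", 1, "NAME", "Paris"),
    ("Locality", 1, "SORTING_CODE", "Cedex 11"),
])

print(ET.tostring(complete, encoding="unicode"))
print(ET.tostring(split, encoding="unicode"))
```

Both variants use Item 1; only the Type attribute differs. The resulting string would still need the XML header before it is passed on as described in chapter 6.7.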
The following list describes the valid Item/Type input combinations when the respective default type is used:
• Key: RECORD_ID and TRANSACTION_KEY may be set concurrently
• Organization: COMPLETE and DEPARTMENT may be set concurrently
• Contact: COMPLETE and FUNCTION and GENDER may be set concurrently (when NAME is used instead, FIRST_NAME, MIDDLE_NAME and LAST_NAME may not be set; see chapter 6.7.2 for more detail)
• Province: COUNTRY_STANDARD, ABBREVIATION and EXTENDED may be set concurrently
• PostalCode: FORMATTED and UNFORMATTED may be set concurrently
• Street: COMPLETE and ADD_INFO* may be set concurrently
• Locality: COMPLETE only
• Number: COMPLETE and ADD_INFO* may be set concurrently
• Building: COMPLETE only
• SubBuilding: COMPLETE only
• DeliveryService: COMPLETE and ADD_INFO* may be set concurrently

It is recommended to refrain from setting “Type” attributes explicitly on input, apart from the examples given above and in chapter 6.7.2: omitting them corresponds to their default values, which yields decent processing results in most practical situations. Under special circumstances, input item types might need to be adjusted under direction from Informatica AddressDoctor support (see chapter 9.3). Note that the majority of types documented in the InputData.xml DTD (see chapter 10.1) are only listed for reasons of symmetry with the Result.xml DTD and are not intended for actual use on input. For an overview and explanation of what item types are available on output, refer to the Result.xml DTD (see chapter 10.1).

5.11 Process Modes

Informatica AddressDoctor supports several validation types. Most of them are country independent and work for all supported countries. An exception is the CERTIFIED validation type that offers country specific logic and does not work for all countries.
Similarly, the single-line address validation, which is available in the fast completion mode, is currently available only for select countries. Each validation type is designed for a specific task.

* While ADD_INFO could in principle be used to provide additional information on AddressElement input that is supposed to be passed through validation without change, it is really intended to be filled on output, containing portions (provided such a split-off could be determined) of not postally relevant AddressElement input that could not be validated against reference data.

The validation process modes are:
• Correction Only (BATCH)
• Suggestions (INTERACTIVE)
• Fast Completion (FASTCOMPLETION)
• Certified (CERTIFIED)
• Address Code Lookup (ADDRESSCODELOOKUP)

The process mode is a parameter to the AD_Process() function of Informatica AddressDoctor. When calling that function to process an AddressObject, the validation type that was supplied (see chapter 6.3) via AD_SetParametersXML() before the call determines the processing that takes place. Each AddressObject and thus each individual call of the AD_Process() function may use a different validation type.

Additionally, two more process modes bypassing validation are available for special pre-processing purposes:
• Obtain separate tokens (possibly including transliteration) from parsed input data without corrections (PARSE)
• Identification of records missing country information, including correction where possible (COUNTRYRECOGNITION)

See the figure in Appendix 10.3 for more details on the Informatica AddressDoctor processing flow. Note that process modes might fall back to others as described below, so it is recommended practice to check that the process mode used was the one intended, both by interpreting the process status value (see chapter 5.17) and by checking directly (see chapter 6.10).
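The recommended check can be sketched as below. This is a minimal illustration under stated assumptions: the fallback table only restates the fallbacks named in chapters 5.11.1 to 5.11.5, the function classify() is a hypothetical name, and how the mode actually used is obtained from the engine (see chapter 6.10) is not shown.

```python
from enum import Enum


class ProcessMode(Enum):
    """Process mode identifiers as listed in this chapter."""
    BATCH = "BATCH"
    INTERACTIVE = "INTERACTIVE"
    FASTCOMPLETION = "FASTCOMPLETION"
    CERTIFIED = "CERTIFIED"
    ADDRESSCODELOOKUP = "ADDRESSCODELOOKUP"
    PARSE = "PARSE"
    COUNTRYRECOGNITION = "COUNTRYRECOGNITION"


# Direct fallbacks named in chapters 5.11.1 to 5.11.5.
DOCUMENTED_FALLBACK = {
    ProcessMode.BATCH: ProcessMode.PARSE,
    ProcessMode.INTERACTIVE: ProcessMode.PARSE,
    ProcessMode.FASTCOMPLETION: ProcessMode.PARSE,
    ProcessMode.CERTIFIED: ProcessMode.BATCH,
}


def classify(requested, used):
    """Compare the requested process mode with the one actually used."""
    if requested == used:
        return "as requested"
    if DOCUMENTED_FALLBACK.get(requested) == used:
        return "documented fallback"
    return "unexpected"


print(classify(ProcessMode.CERTIFIED, ProcessMode.BATCH))
```

A record classified as a documented fallback was processed, but not with the validation type that was requested, so its result should be interpreted accordingly.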
5.11.1 Batch

The Correction Only (also known as “BATCH”) type is intended to be used in batch processing environments when no human input or selection is possible. It is optimized for speed and will terminate its attempts to correct an address when ambiguous data is encountered that cannot be corrected automatically. The Batch processing mode will fall back to Parse Only (see chapter 5.11.7) when the respective database is missing for a specific country.

5.11.2 Interactive

When working in interactive environments, it is often useful to generate suggestions when an address input is ambiguous. This can be achieved by using the suggestions validation type, which is known as “INTERACTIVE”. This validation type is especially useful in Web based data entry environments when capturing data from customers or prospects. It requires the input of an almost complete address and will attempt to validate or correct the data provided. If ambiguities are detected, this validation type will generate up to 100 suggestions that can be used for pick lists (the maximum number of suggestions can be controlled by the MaxResultCount parameter in SetConfig.xml and Parameters.xml). The Interactive processing mode will fall back to Parse Only (see chapter 5.11.7) when the respective database is missing for a specific country.

5.11.3 Fast Completion

The Fast Completion validation type is used in quick address entry applications. It allows input of truncated data in several address fields and will generate suggestions for this input. Due to its fast response time, the engine can also be used to create suggestions while users type. The Fast Completion type is best suited when users are aware that they can purposely truncate input data.
However, the “FASTCOMPLETION” validation type does not support extended parsing and thus can only be used when assigning fielded input data (see chapter 6.7) using the “AddressElement” Items (see chapter 5.9), as opposed to AddressLine input using “FormattedAddressLine” or “DeliveryAddressLine”. The Fast Completion processing mode will fall back to Parse Only (see chapter 5.11.7) when the respective database is missing for a specific country.

Effective in version 5.3.0 of Informatica AddressDoctor, the upper limit of the suggestion list has been increased from twenty to one hundred results. You can specify the upper limit of the results returned in the MaxResultCount parameter in SetConfig.xml as well as in Parameters.xml. The default value is “20”; it can be overridden by the user, for example with “100”. The maximum value for this parameter is one hundred, and the minimum value is one. Specifying a value greater than the maximum will result in an error. Specifying a greater value in Parameters.xml than in SetConfig.xml will result in a reduction of the value in Parameters.xml to the value in SetConfig.xml; this is reported back by the warning AD_SC_WRN_MAXRESULTCOUNT_REDUCED.

5.11.4 Single-Line Address Validation

The single-line address validation feature is a new addition to the Fast Completion mode starting with Version 5.5.0. You can use single-line address validation to validate addresses entered into the AddressComplete element as a single line and receive suggestions to complete the address. Informatica AddressDoctor Version 5.5.0 supports single-line address validation for the following countries:
• Australia
• Canada
• Germany
• Great Britain
• New Zealand
• United States

To activate single-line address validation, you need a separate unlock code of type SINGLE_LINE_VALIDATION.
Contact your sales representative for more information about obtaining the unlock code.

Informatica AddressDoctor identifies address elements in a single-line address input based on their position in the sequence in which the elements are entered. It is therefore imperative that you follow the order shown in Table 1 when you enter single-line addresses. When you enter an address in a single line, ensure that you do not mix Delivery Address Line (DAL) elements and Country-Specific Location Line (CSLLN) elements.

Table 1 Country-Specific Order of Address Elements

Country | Order of Address Elements
Australia | Sub-building, House Number, Street, Main Locality, Province, Postal Code
Canada | Sub-building, House Number, Street, Delivery Service, Main Locality, Province, Postal Code
Germany | Street, House Number, Postal Code, Locality, Province
Great Britain | Sub-building, House Number, Street, Main Locality, Sub-Locality, Postal Code
New Zealand | Sub-building, House Number, Street, Delivery Service, Locality, Postal Code
United States | Sub-building, House Number, Street, Locality, Province, Postal Code

If you did not enable single-line address validation, Informatica AddressDoctor returns the process status code N7 (Feature Not Unlocked). If the input maps to a country that is not supported for single-line address validation, Informatica AddressDoctor returns the process status code N6, denoting that single-line address validation is not supported for the specified country.

As you can see from Table 1, the typical sequence of address elements is from the specific to the generic. You must enter the elements in the specified sequence even if you leave out some of the elements from the input. However, for optimum results, we recommend that you provide as many details as possible in the input.
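The ordering rule of Table 1 can be sketched as a small helper that assembles a single-line input from fielded values. This is an illustration only: build_single_line() and the dictionary layout are hypothetical, the element labels are taken verbatim from Table 1, and the comma used between elements is an assumption based on this chapter's statement that a comma or semicolon is treated as an element separator.

```python
# Element order per country, copied from Table 1.
SINGLE_LINE_ORDER = {
    "Australia": ("Sub-building", "House Number", "Street",
                  "Main Locality", "Province", "Postal Code"),
    "Canada": ("Sub-building", "House Number", "Street", "Delivery Service",
               "Main Locality", "Province", "Postal Code"),
    "Germany": ("Street", "House Number", "Postal Code", "Locality", "Province"),
    "Great Britain": ("Sub-building", "House Number", "Street",
                      "Main Locality", "Sub-Locality", "Postal Code"),
    "New Zealand": ("Sub-building", "House Number", "Street",
                    "Delivery Service", "Locality", "Postal Code"),
    "United States": ("Sub-building", "House Number", "Street",
                      "Locality", "Province", "Postal Code"),
}


def build_single_line(country, fields, separator=", "):
    """Join the given fields in the order Table 1 prescribes for the country.

    Missing elements are skipped, but the relative order is preserved.
    """
    if country not in SINGLE_LINE_ORDER:
        raise ValueError(f"no single-line element order known for {country!r}")
    order = SINGLE_LINE_ORDER[country]
    unknown = set(fields) - set(order)
    if unknown:
        raise ValueError(f"elements not listed for {country}: {sorted(unknown)}")
    return separator.join(fields[name] for name in order if fields.get(name))


line = build_single_line("Germany", {
    "Locality": "Maxdorf",
    "Street": "Roentgenstr.",
    "House Number": "9",
    "Postal Code": "67133",
})
print(line)
```

The input dictionary may list the fields in any order; the helper emits them in the country-specific sequence and leaves out elements that were not provided.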
Even though delimiters are not mandatory in a single-line address input, a comma or semicolon in the input is treated as an element separator and may yield better suggestions. Note that Informatica AddressDoctor currently does not support country, organization, building, or contact information in the single-line address input.

If the single-line address input is purely numeric, Informatica AddressDoctor considers it as the Postal Code and returns suggestions accordingly. For countries where the house number appears on the left side of the street name or locality, if the single-line address input begins with a number that is followed by a string, Informatica AddressDoctor considers the number as a house number and the following string as the street name or locality. If no match is found for this combination, Informatica AddressDoctor attempts to interpret the input as a street name without house number or as a combination of postal code and locality.

When there is no perfect match for an input, Informatica AddressDoctor returns multiple suggestions to help you choose the most appropriate result. The maximum number of suggestions that Informatica AddressDoctor returns is determined by the value configured for MaxResultCount in the Parameters.xml and SetConfig.xml files.

5.11.5 Certified

A number of countries have special requirements for the processing of addresses from their countries. An example of such a special processing requirement is the CASS certification of the United States Postal Service (USPS). In order to process addresses compliant with a certification scheme, the validation type “CERTIFIED” is available. Informatica AddressDoctor now supports all certifications required by major postal administrations of the world, namely:
• CASS certification for the USA
• SERP certification for CAN
• AMAS certification for AUS
• SNA certification for FRA
• SendRight certification for NZL
The Certified processing mode will fall back to Batch if it is not supported for a specific country. Note that extended parsing of unfielded data is not supported in the “CERTIFIED” processing mode (see chapters 5.5 and 6.24 for the differences between certified and normal processing).

5.11.6 Address Code Lookup

In Version 5.4.1 and later, Informatica AddressDoctor offers Address Code Lookup. Address Code Lookup is a new process mode that has its own unlock code. With Address Code Lookup, you can enter a country specific address code and retrieve the complete or partial address for the code. Address Code Lookup is currently supported for the following countries:
• Germany
• Great Britain
• Japan
• South Africa
• Serbia

For example, the Choumei Aza code is an eleven-digit code that defines a unique delivery point for Japan addresses. You can now search on the Choumei Aza code to find the associated address. You can also use a combination of the Choumei Aza code and the Gaiku code, which is a four-digit code that identifies a city block in Japan, to find more precise results.

To use address code lookup, you must download the Address Code Lookup database, <XYZ>5AC.MD, and specify the value ADDRESS_CODE_LOOKUP for the Type attribute of the Database parameter in the SetConfig.xml file. In addition, you must specify the value ADDRESS_CODE_LOOKUP for the UnlockCode attribute Type in the GetConfig.xml file to indicate that the Address Code Lookup database should be unlocked.

The following table describes the values that you can specify for the Type attribute of the AddressCode parameter in the InputData.xml file.

Address Code Type (Country): Description

DEU_AGS (Germany): The Amtliche Gemeindeschlüssel (AGS) is a variable length code that uniquely identifies a locality in Germany. There may be more than one locality for a given AGS code.
For example, a DEU_AGS code with a value of 07338018 returns the following output:
    Locality: Maxdorf
    Province: Rheinland-Pfalz

DEU_LOCALITY_ID (Germany): The Locality ID is a variable length code that uniquely identifies a German locality. For example, a DEU_LOCALITY_ID code with a value of 68015519 returns the following output:
    Locality: Maxdorf
    Province: Rheinland-Pfalz

DEU_STREET_ID (Germany): The Street ID is a variable length code that uniquely identifies a German street address. For example, a DEU_STREET_ID code with a value of 100560690 returns the following output:
    Röntgenstr. 67133 Maxdorf Germany

GBR_UDPRN (Great Britain): The Unique Delivery Point Reference Number (UDPRN) is an eight character code that uniquely identifies each postal address of the Royal Mail PAF database. For example, a GBR_UDPRN code with a value of 15511432 returns the following output:
    Flat 16 Haden Court Lennox Road London N4 3HS United Kingdom

JPN_CHOUMEI_AZA_CODE (Japan): The Choumei Aza code is an eleven-digit code that defines a unique delivery point for Japan addresses. For example, a JPN_CHOUMEI_AZA_CODE of 28201160001 returns the following output:
    〒670-0081 兵庫県姫路市田寺東1丁目
Or:
    01 Chome Taderahiga-shi Himeji-shi Hyogo-ken 670-0081 Japan

SRB_PAK (Serbia): The Postal Address Code (PAK) is a six digit code that defines a unique Serbian address to the street level. For example, a SRB_PAK code with a value of 251133 returns the following output:
    Majora Ilica 1 14000 Valjevo Serbia

ZAF_NADID (South Africa): The National Address Database (NAD) ID is a unique numeric ID assigned to each South African street address.
For example, a ZAF_NADID code with a value of 2170232 returns the following output:
    4 Balmoral Road Vincent East London 5247 South Africa

5.11.7 Parse Only

For separating address input into tokens for subsequent processing in other systems, bypassing Informatica AddressDoctor validation, the special process mode “PARSE” can be used (see chapter 5.4 for a general introduction on parsing and chapter 6.9 for an example of address parsing). A typical use case for this mode is address data of already high quality that simply needs to be tokenized quickly for export to an external system, possibly including transliteration (see the “PreferredScript” parameter in chapter 5.12.1), formatting (see chapter 5.13) and standardization (see chapter 5.14) of the output.

5.11.8 Country Recognition

Sometimes input data lacks country information, which is crucial for successful Informatica AddressDoctor processing. To identify such problematic records quickly, without having to run the data set through full validation, a special process mode “COUNTRYRECOGNITION” is provided. This functionality is the first step of the Informatica AddressDoctor processing flow and thus part of all process modes (see Appendix 10.3).

This process mode will also attempt to amend missing country information where possible, based on characteristic information like major locality or territory names (see chapter 5.17 for the possible Rx process status values). Note that such attempts at adding country information can only succeed where the information identified is absolutely unambiguous: for example, there is a Berlin in Germany as well as in numerous US states, South Africa, Colombia and El Salvador. In addition, “MA” might refer to the ISO2 code for Morocco or to the US state of Massachusetts.

5.12 Process Parameters

Numerous parameters pertaining to processing may be specified through Parameters.xml, see the DTD definition in the Appendix (chapter 10.1).
These parameters may usually be defined with a global Informatica AddressDoctor scope or a per AddressObject scope; an example is given in chapter 6.3.

5.12.1 The PreferredScript Parameter

The “PreferredScript” attribute of the “Result” element is used to specify in which alphabet the output should be returned (see the Character Set Mapping and Transliteration chapters 5.1 and 5.2):
• DATABASE: All Unicode characters (as per reference database standard)
• POSTAL_ADMIN_PREF: All Unicode characters (as preferred by local postal administration)
• POSTAL_ADMIN_ALT: All Unicode characters (local postal administration alternative)
• ASCII_SIMPLIFIED: ASCII characters
• ASCII_EXTENDED: ASCII characters with expansion of special characters (for example: Ö = OE)
• LATIN: Latin characters
• LATIN_1: Latin I characters
• LATIN_ALT: Latin characters (alternative transliteration)
• PRESERVE_INPUT: Same characters as the input address (available only for Belarus, China, Greece, Japan, Kazakhstan, Macedonia, Russia, and Ukraine)

The default setting for the “PreferredScript” attribute is “DATABASE”. The alphabet in which the data is returned differs from country to country. For most countries the output will be Latin I or ASCII regardless of the selected preferred language.

NOTE: If the input contains address elements that are not in the corresponding database, Informatica AddressDoctor copies such elements to the output in the same script in which the address was input, irrespective of the value set for the PreferredScript parameter.

If the PreferredScript parameter is set to “PRESERVE_INPUT”, Informatica AddressDoctor preserves the alphabet of the input address. If the input contains more than one script, Informatica AddressDoctor overrides the PRESERVE_INPUT configuration and returns the address in the default script in the reference database.
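The PRESERVE_INPUT rule can be sketched as follows. This is a rough illustration, not the engine's actual script detection, which is not documented here: scripts_used() is a hypothetical helper that takes the first word of each letter's Unicode character name as its script, a heuristic that is adequate for Latin versus Cyrillic input but too crude for Japanese text, where several Unicode blocks are normally mixed.

```python
import unicodedata


def scripts_used(values):
    """Return the set of script labels found in the given input strings.

    Heuristic for illustration: the first word of the Unicode character
    name (for example LATIN or CYRILLIC) is treated as the script.
    Digits, punctuation and spaces are ignored.
    """
    scripts = set()
    for value in values:
        for ch in value:
            if ch.isalpha():
                scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def preserve_input_outcome(values):
    """Describe which script PRESERVE_INPUT would lead to under this rule."""
    if len(scripts_used(values)) <= 1:
        return "input script preserved"
    return "default script of the reference database"


latin_only = ["Majma", "649100", "Altaj", "ul. Celinnaâ", "1"]
mixed = latin_only + ["й"]

print(preserve_input_outcome(latin_only))
print(preserve_input_outcome(mixed))
```

The two sample inputs correspond to the Russian address examples given later in this chapter: pure Latin input keeps its script, while a single Cyrillic organization value makes the input mixed.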
For example, if a Japan address contains fields with both Kanji and Latin characters and if the PreferredScript parameter is set to PRESERVE_INPUT, Informatica AddressDoctor returns all address fields using Kanji characters because that is the default script for Japan addresses in the reference database. For countries that use an alphabet other than Latin I, the returned alphabet differs from country to country. Informatica AddressDoctor Documentation – Last Revision: 5-Nov-14 @ 12:26 57 The following table shows how the output is returned for specific countries: Country DATABASE POSTAL_ADMIN_PREF POSTAL_ADMIN_ALT BLR Cyrillic Cyrillic Cyrillic KAZ Cyrillic Cyrillic Cyrillic KOR LAT MDA ASCII Latin-7 Latin-2 ASCII Latin-7 Latin-2 ASCII Latin-7 Latin-2 LATIN Latin-2 (transliterated by ISO standard) Latin (Mandarin transliteration) ASCII ASCII Latin-1 (transliterated by ISO standard) ASCII ASCII ASCII Latin-7 Latin-2 (transliterated by ISO standard) ASCII ASCII ASCII CHN Hanzi Hanzi Hanzi CRI CZE Latin-1 Latin-2 Latin-1 Latin-2 Latin-1 Latin-2 GRC Greek Greek Greek HKG HUN ISR JPN ASCII Latin-2 ASCII Kanji ASCII Latin-2 ASCII Kanji ASCII Latin-2 ASCII Kana MKD Cyrillic Cyrillic Cyrillic ASCII POL ROM Latin-2 Latin-3 Latin-2 Latin-3 Latin-2 Latin-3 RUS Cyrillic Cyrillic Cyrillic SVK TWN Latin-2 ASCII Latin-2 ASCII Latin-2 ASCII ASCII ASCII Latin-2 (transliterated by ISO standard) ASCII ASCII UKR Cyrillic Cyrillic Cyrillic ASCII LATIN_ALT ASCII (transliterated by BGN standard) Latin (Cantonese transliteration) Latin-1 Latin-2 ASCII (transliterated by BGN standard) ASCII Latin-2 ASCII Latin-7 ASCII (transliterated by BGN standard) ASCII Latin-7 Latin-2 ASCII (transliterated by Macedonian BGN standard) Latin-2 Latin-3 ASCII (transliterated by BGN standard) Latin-2 ASCII ASCII (transliterated by Ukrainian BGN standard) LATIN_1 ASCII (transliterated by BGN standard) ASCII_SIMPLIFIED ASCII_EXTENDED Yes Yes Latin-1 Yes Yes Latin-1 Latin-1 Yes Yes Yes Yes ASCII Yes Yes ASCII 
Latin-1 ASCII ASCII ASCII (transliterated by BGN standard) ASCII ASCII Latin-1 ASCII (transliterated by Macedonian BGN standard) Latin-1 Latin-1 ASCII (transliterated by BGN standard) Latin-1 ASCII ASCII (transliterated by Ukrainian BGN standard) Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Countries not listed in the table use the default output setting described previously. Examples using different scripts can be found in chapter 6.15.

Here are some examples that show output based on the PRESERVE_INPUT setting.

Example 1 – Japan address in Kana (Katakana) script:

    <InputData>
      <AddressElements>
        <Country Item="1" Type="NAME">JAPAN</Country>
        <Locality Item="1" Type="COMPLETE">オオサカシ</Locality>
        <Locality Item="2" Type="COMPLETE">ミヤコジマク</Locality>
        <Locality Item="3" Type="COMPLETE">ウチンダイチョウ</Locality>
        <PostalCode Item="1" Type="UNFORMATTED">〒534-0013</PostalCode>
        <Province Item="1" Type="COUNTRY_STANDARD">オオサカフ</Province>
        <Street Item="1" Type="COMPLETE">02 チョウメ</Street>
      </AddressElements>
    </InputData>

With PreferredScript set to PRESERVE_INPUT, the output is in the same Kana script:

    <FormattedAddressLine Line="1">〒5340013オオサカフオオサカシミヤコジマクウチンダイチョウ2チョウメ</FormattedAddressLine>

Example 2 – Japan address in mixed input, Kana and Latin:

    <InputData>
      <AddressElements>
        <Country Item="1" Type="NAME">JAPAN</Country>
        <Locality Item="1" Type="COMPLETE">ŌSAKA-SHI</Locality>
        <Locality Item="2" Type="COMPLETE">MIYAKOJIMA-KU</Locality>
        <Locality Item="3" Type="COMPLETE">UCHINDAI-CHŌ</Locality>
        <PostalCode Item="1" Type="UNFORMATTED">534-0013</PostalCode>
        <Province Item="1" Type="COUNTRY_STANDARD">ŌSAKA-FU</Province>
        <Street Item="1" Type="COMPLETE">2 CHŌME</Street>
        <Organization Item="1" Type="COMPLETE">オ</Organization>
      </AddressElements>
    </InputData>

The output in this case, where the input contains both Latin and Kana scripts, is in Kanji script,
which is the default script of the Japan address database.

    <FormattedAddressLine Line="1">〒534-0013 大阪府大阪市都島区内代町2丁目 オ </FormattedAddressLine>

Example 3 – Russian address input in Latin script:

    <InputData>
      <AddressElements>
        <Country Item="1" Type="NAME">RUSSIAN FEDERATION</Country>
        <Locality Item="1" Type="COMPLETE">Majma</Locality>
        <PostalCode Item="1" Type="UNFORMATTED">649100</PostalCode>
        <Province Item="1" Type="COUNTRY_STANDARD">Altaj</Province>
        <Street Item="1" Type="COMPLETE">ul. Celinnaâ</Street>
        <Number Item="1" Type="COMPLETE">1</Number>
      </AddressElements>
    </InputData>

The output in this case is in Latin script:

    <FormattedAddressLine Line="1">ul. Celinnaâ 1</FormattedAddressLine>
    <FormattedAddressLine Line="2">Majma</FormattedAddressLine>
    <FormattedAddressLine Line="3">Altaj</FormattedAddressLine>
    <FormattedAddressLine Line="4">649100</FormattedAddressLine>

Example 4 – Russian address in mixed input that contains Cyrillic and Latin scripts:

    <InputData>
      <AddressElements>
        <Country Item="1" Type="NAME">RUSSIAN FEDERATION</Country>
        <Locality Item="1" Type="COMPLETE">Majma</Locality>
        <PostalCode Item="1" Type="UNFORMATTED">649100</PostalCode>
        <Province Item="1" Type="COUNTRY_STANDARD">Altaj</Province>
        <Street Item="1" Type="COMPLETE">ul. Celinnaâ</Street>
        <Number Item="1" Type="COMPLETE">1</Number>
        <Organization Item="1" Type="COMPLETE">й</Organization>
      </AddressElements>
    </InputData>

The output in this case, where the input contains both Latin and Cyrillic scripts, contains only Cyrillic because that is the default script of the Russian address database.

    <FormattedAddressLine Line="1">Й</FormattedAddressLine>
    <FormattedAddressLine Line="2">ул.
Целинная 1</FormattedAddressLine>
    <FormattedAddressLine Line="3">Майма</FormattedAddressLine>
    <FormattedAddressLine Line="4">Алтай</FormattedAddressLine>
    <FormattedAddressLine Line="5">649100</FormattedAddressLine>

5.12.2 The PreferredLanguage Parameter

The “PreferredLanguage” attribute of the “Result” element is used to specify the language in which the output should be returned. The default setting for “PreferredLanguage” is “DATABASE”. The alphabet in which the data is returned differs from country to country (see 5.12.1), but for most countries the output will be Latin, regardless of the selected preferred language:

Value | Description
DATABASE | Language derived from reference data for each address.
ENGLISH | English locality and province name output, if available.
ALTERNATIVE_1,2,3 | Alternative languages for multi-language countries. See the table below. If no alternative is provided as part of the postal reference data, this setting will revert to the default, which is “DATABASE”.
PRESERVE_INPUT | Return output in the same language the input was provided in.

Alternative Language Options

Customers can specify the language of the output address or preserve the language of the input address for Belgium and Canada. For example, users can output a French-language address in Flemish or preserve the input language. The following table describes the PreferredLanguage values that you can define for Belgium and Canada addresses:

Value | Language Output for Belgium | Language Output for Canada
ALTERNATIVE_1 | Flemish | English
ALTERNATIVE_2 | French | French
ALTERNATIVE_3 | German | [no language]

5.12.1 Multi-Language Support for Belgium

Customers in Belgium can specify the language of the output or preserve the language of the input address.
Belgium Address Example 1:

For the following French-language address, the user specifies “PRESERVE_INPUT” as the PreferredLanguage value:

    PreferredLanguage = PRESERVE_INPUT
    Street/HNO = Rue Royale 4
    Locality = Bruxelles

Therefore, the address output is in French. If the option is set to “DATABASE”, the official language from the reference database is used for the output.

Belgium Address Example 2:

For the following French-language address, the user specifies German as the output language:

    PreferredLanguage = ALTERNATIVE_3
    Street/HNO = Rue Royale 4
    Locality = Bruxelles

Street element: “Rue Royale” is the French value and also the database value; “Koningsstraat” is the Flemish value of this particular street. The input street is not available in German in the database; therefore the database value “Rue Royale” is also returned as the German value of the street element.

Locality element: The locality for this address is available in French, Flemish and German. French is also the default value.

Therefore the resulting address will be:

    Street/HNO = Rue Royale 4
    Locality = Brüssel (German)

Street defaults to DATABASE because PreferredLanguage = ALTERNATIVE_3 is not available for this record. The resulting formatted address is:

    Rue Royale 4
    1000 Brüssel

Note that if the preferred language is not available for an element in the reference database, the resulting language for that element defaults to “DATABASE”. Your final address may therefore appear in multiple languages (e.g. Street in French and Locality in German). This issue will be addressed in a future release by providing additional information about the language via new status codes.

5.12.2 Multi-Language Support for Canada

Customers in Canada can specify the language of the output or preserve the language of the input address. This implies that customers can output an English address in French in Québec for example.
Note that only street descriptors and provinces are available in multiple languages.

Translations for Street Descriptors and Types

The following are the only translations of street types recognized by Canada Post:

  Descriptor   English Symbol   French Symbol
  STREET       ST               RUE
  AVENUE       AVE              AV
  BOULEVARD    BLVD             BOUL

The PreferredLanguage parameter is used to output the address in one of the two languages supported in Canada:

  DATABASE: The official language of the region in Canada, which is English for all provinces except Québec. DATABASE is the default option.
  ALTERNATIVE_1: English
  ALTERNATIVE_2: French

Customers may use the “PRESERVE_INPUT” value to preserve the language of the input address.

Canada Address Example: PreferredLanguage = ALTERNATIVE_1 (English)

  Input:                Output:
  615 Av Monique        615 Monique Ave
  Québec QC G1B 2A8     Québec QC G1B 2A8
  Canada                Canada

5.12.3 The ForceCountryISO3 and DefaultCountryISO3 Parameters

The “ForceCountryISO3” and “DefaultCountryISO3” attributes of the “Input” element allow a certain degree of influence on country recognition. “ForceCountryISO3” causes address records to always be treated as originating from the country set here, overriding any explicitly assigned country element. “DefaultCountryISO3” applies only to records that lack such explicit country information.

5.12.4 The CountryType and CountryOfOriginISO3 Parameters

The “CountryOfOriginISO3” and “CountryType” attributes of the “Result” element are used to control country information output. “CountryOfOriginISO3” suppresses country information output for address records originating from the country set here, while “CountryType” determines the format in which country information is output.
Some possible values for “CountryType” are listed below (see the DTD in chapter 10.1 for a complete list; the default is “NAME_EN”):

  ABBREVIATION, ISO_2, ISO_3, ISO_NUMBER, NAME_CN, NAME_DA, NAME_DE, NAME_EN, NAME_ES, NAME_FI, NAME_FR, NAME_GR, NAME_HU, NAME_IT, NAME_JP, NAME_KR, NAME_NL, NAME_PL, NAME_PT, NAME_RU, NAME_SA, NAME_SE

5.12.5 The MatchingAlternatives and MatchingScope Parameters

The “MatchingAlternatives” and “MatchingScope” attributes of the “Process” element are used to influence the matching of address elements during validation.

“MatchingAlternatives” allows suppressing the use of historical and synonym (or, more precisely, exonym; see http://wikipedia.org/wiki/Exonym) data for matching address elements. Valid values are NONE, SYNONYM_ONLY and ARCHIVE_ONLY, with a default of ALL.

Setting “MatchingScope” to a value other than the default “ALL” reduces the granularity of address elements (see chapter 5.3) for which matching must succeed: “LOCALITY_LEVEL” only considers matches on province, locality and postcode level, “STREET_LEVEL” extends matching to streets, and “DELIVERYPOINT_LEVEL” finally adds house number matching.

Refer to the DTD in chapter 10.1 for a complete list of valid attribute values. These attribute settings may not have an effect for countries lacking the necessary level of detail in the postal reference data.

5.12.6 The MatchingExtendedArchive Parameter

In version 5.4.1 and later, the MatchingExtendedArchive parameter can return the new address code for deprecated or outdated addresses for Japan. If the input address is an outdated address and the process parameter MatchingExtendedArchive = ON, Informatica AddressDoctor validates the old address against the archived addresses in the reference database. If MatchingExtendedArchive = OFF, the outdated input address is likely to be rejected, or to be corrected to some other address.
If the address is an outdated address, then Informatica AddressDoctor returns the address with the following new Extended Element Result Status (EERS) code:

  EERS=F (output address is outdated)

If the supplementary enrichment for Japan is activated, Informatica AddressDoctor returns the validated outdated address with the old Choumei Aza code and the new Choumei Aza code as enrichment values. The new Choumei Aza code can then be used as an input for the ADDRESS_CODE_LOOKUP processing mode to retrieve the corresponding new address. Note that both the JPN5E1.MD and JPN5AC.MD database files are needed with their respective unlock codes in order to search for the new address using the new Choumei Aza code.

5.12.7 The StandardizeInvalidAddresses Parameter

Version 5.3.0 provides the ability to standardize address elements for invalid (Ix Process Status Code) addresses. Standardized addresses can improve downstream business processes such as matching and de-duplication. Address elements that may be standardized are:

  Street Types
  Pre and Post Directional
  Delivery Service Item
  Sub-building descriptors
  State/Province/Region; for example, California to CA

The standardization of invalid address elements can be controlled by setting the StandardizeInvalidAddresses parameter of the Result element in the Parameters.xml to “ON”. The default is “OFF”, ensuring compatibility with previous versions.

5.12.8 The DualAddressPriority Parameter

Starting in version 5.3.0, users can specify which address type they would like to validate against. For example, when a single address record contains both a PO BOX/Rural Route address and a Street address, users can select the address they would like validated.
Users can validate against the following address types:

  POSTAL_ADMIN
  DELIVERY_SERVICE
  STREET

The handling of dual addresses can be controlled by setting the DualAddressPriority parameter of the Result element in the Parameters.xml to one of the values above. The default is “POSTAL_ADMIN”, ensuring compatibility with previous versions.

5.12.9 The RangesToExpand Parameter

Starting in version 5.3.0, the “RangesToExpand” parameter determines whether house number ranges should be expanded for countries where individual house numbers exist. RangesToExpand can have the following values:

  NONE: Do not expand ranges (default value).
  ALL: House number ranges are expanded for all addresses where individual house numbers exist.
  ONLY_WITH_VALID_ITEMS: Only those ranges are expanded for which all expandable items are known to exist in the reference data.

Example:

  Option = ONLY_WITH_VALID_ITEMS
  HNO range: 5-25

For countries such as the United Kingdom, where individual house numbers exist in the reference database, the Engine will expand the house number range and list the individual house numbers in the suggestion list. For countries where the data provider supplies only house number ranges, the Engine cannot expand the range because the individual house numbers do not exist in the reference database, and it will only output house number ranges in the suggestion list. To summarize, when the “ONLY_WITH_VALID_ITEMS” option is active, the Engine will only expand house number ranges if individual house numbers exist in the reference database; otherwise the behavior will be similar to “NONE”.

The RangesToExpand parameter is used in conjunction with another parameter, “FlexibleRangeExpansion”, to control range expansion and to give the optimum results. FlexibleRangeExpansion accepts the values ON and OFF.
When set to “ON” (default), the Engine limits the expansion of ranges in such a manner that those at the end of the result list are not expanded. The Engine’s logic determines the number of results to expand and how many to keep as ranges without exceeding the MaxResultCount limit. Therefore, a suggestion list could contain both expanded and unexpanded ranges for house numbers and/or buildings, depending on the values specified for MaxResultCount, RangesToExpand, and FlexibleRangeExpansion.

5.12.10 The GlobalPreferredDescriptor Parameter

You can configure Informatica AddressDoctor Version 5.6.0 to specify the output format for street, building and sub-building element descriptors. To specify the output format for element descriptors, configure one of the following values for the GlobalPreferredDescriptor parameter in parameters.dtd:

  DATABASE. Returns the element descriptor available in the reference address database. This is the default value. If there is no matching entry in the database, Informatica AddressDoctor copies the input to the output.
  LONG. Returns the expanded form of the element descriptor.
  SHORT. Returns the abbreviated form of the element descriptor.
  PRESERVE_INPUT. Copies the element descriptor in the input to the output. If the input element descriptor is not an official synonym, Informatica AddressDoctor returns the corresponding value from the database in the output.

In Informatica AddressDoctor Version 5.6.0 the GlobalPreferredDescriptor parameter works only for address element descriptors in Australia and New Zealand addresses and the Strasse element descriptor in Germany addresses.

5.12.11 The EnrichmentGeoCoding Parameter

To enable the Geocoding Enrichment, set the EnrichmentGeoCoding parameter in the Parameters.xml file to “ON”.
Starting in version 5.4.0, if you enable the Geocoding Enrichment, then you also must specify the type of geocoding to use in the Process attribute EnrichmentGeoCodingType in the Parameters.xml file. By default, the arrival point geocoding type is enabled.

In Informatica AddressDoctor version 5.6.0, you can include the rooftop geocoordinates in validated address output for the United Kingdom. To include the rooftop geocoordinates for the U.K. addresses, set EnrichmentGeoCodingType to ARRIVAL_POINT.

The following table describes the values that you can specify for EnrichmentGeoCodingType:

  Value             Description
  NONE              Uses the Standard Geocode database.
  ARRIVAL_POINT     Uses the High Precision Arrival Point database. If the database finds the arrival point geocoordinates, the geocode is returned with the EGC9 status code. If the arrival point geocoordinates do not exist or if the Arrival Point database cannot be connected to, then Informatica AddressDoctor uses the Standard Geocode database as a fallback to interpolate the geocoordinates. Effective in Version 5.6.0, Informatica AddressDoctor returns rooftop geocoordinates for the United Kingdom addresses.
  PARCEL_CENTROID   Uses the Parcel Centroid database. If the database finds the parcel centroid geocoordinates, the geocode is returned with the EGCA status code. If the parcel centroid geocoordinates do not exist, then Informatica AddressDoctor returns the EGC0 (no geocode available) status code. If the Parcel Centroid database cannot be connected to, then Informatica AddressDoctor returns one of the error status codes (EGCU, EGCN, or EGCC).
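The EnrichmentGeoCodingType table can also be read as a decision procedure. The following Python sketch restates that table for illustration only; it does not call the Informatica AddressDoctor API. The function name geocode_outcome and its arguments point_found and database_reachable are hypothetical. Because the table does not state which status code the Standard Geocode fallback returns, the sketch reports that case as a descriptive string instead of a code.

```python
def geocode_outcome(geocoding_type, point_found, database_reachable=True):
    """Restate the EnrichmentGeoCodingType table as a decision procedure.

    geocoding_type      -- "NONE", "ARRIVAL_POINT" or "PARCEL_CENTROID"
    point_found         -- hypothetical flag: the point database holds
                           geocoordinates for the address
    database_reachable  -- hypothetical flag: the point database can be
                           connected to
    """
    if geocoding_type == "NONE":
        # Only the Standard Geocode database is used.
        return "standard geocode database"

    if geocoding_type == "ARRIVAL_POINT":
        if database_reachable and point_found:
            return "EGC9"
        # Missing coordinates or an unreachable database both lead to
        # the interpolating fallback.
        return "fallback to standard geocode database"

    if geocoding_type == "PARCEL_CENTROID":
        if not database_reachable:
            # The table names three possible error codes without
            # saying which one applies in which situation.
            return "EGCU, EGCN, or EGCC"
        return "EGCA" if point_found else "EGC0"

    raise ValueError("unknown EnrichmentGeoCodingType: %r" % geocoding_type)
```

The main asymmetry the sketch makes visible is that ARRIVAL_POINT degrades to interpolated coordinates, whereas PARCEL_CENTROID has no fallback and returns EGC0 or an error status.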
5.13 Output Formatting

The AddressComplete multi-line output format that Informatica AddressDoctor generates during processing can be adjusted from the standard behavior by setting parameters provided in the “Result” element of Parameters.xml (see the DTD in Appendix 10.1) passed via AD_SetParametersXML(). See also chapter 6.7.4. For a global reference of postal address formats, see http://www.addressdoctor.com/en/countries_data/addressformats.asp.

The “Result” element allows formatting control over:

  The type of format through “FormatType”
  The delimiter used in output formatting through “FormatDelimiter”
  Whether the country is included through “FormatWithCountry”
  The number of lines through “FormatMaxLines”

Valid settings for “FormatType” are (default is “ALL”): ALL, ADDRESS_ONLY, WITH_ORGANIZATION, WITH_CONTACT, WITH_ORGANIZATION_CONTACT or WITH_ORGANIZATION_DEPARTMENT.

“FormatDelimiter” is one of (default is “CRLF”): CRLF, LF, CR, SEMICOLON, COMMA, TAB, PIPE or SPACE.

“FormatMaxLines” determines the maximum number of overall address lines returned, in a range of 1-19 (the default is 19), and “FormatWithCountry” may be switched “ON” from the default “OFF”.

These formatting parameters are available both as attributes of the Input element (if the input is provided in multi-line fashion), as well as attributes of the “Result” element that are applied unless the “AddressComplete” attribute of the “Result” element is set to “OFF” (from the default “ON”).

5.14 Output Standardization

The output generated by Informatica AddressDoctor when processing addresses follows the rules of the postal administrations and the Universal Postal Union (UPU). It is possible to modify and adjust the standard output behavior by setting attributes provided in the “Result” element of Parameters.xml (see the DTD in Appendix 10.1) passed via AD_SetParametersXML().
The “Result” element allows standardization control over:

  Element length (by means of abbreviation) through “GlobalMaxLength”
  Casing through “GlobalCasing”
  Abbreviation through “ElementAbbreviation”

While the “GlobalMaxLength” attribute determines the default maximum number of characters per line for all address elements, “FormatMaxLines” (see the previous chapter 5.13) determines the maximum number of overall address lines returned in the case of multi-line “AddressComplete” output. Note that the default value for “GlobalMaxLength” is 1024.

The “GlobalCasing” attribute can be used to influence the casing of the output. The five possible options are:

  NATIVE: native casing as per the reference database standard (default)
  UPPER: upper casing
  LOWER: lower casing
  MIXED: mixed casing
  NOCHANGE: unchanged

Upper casing and lower casing create output that is independent of the country the data is in. Mixed casing considers country-specific rules, and the default native casing is based on the reference database content. Setting the casing to NOCHANGE returns the data the way it was entered for the output of PARSE process mode, while validated results are provided as found in the reference data and according to postal rules. Result address elements that could not be checked against the reference data retain their input casing when NOCHANGE is set.

Additionally, standardization may also be defined per address element via the “MaxLength” and “Casing” attributes of the “AddressElementStandardize” element, thus overriding the global settings. “MaxLength” is then used to set the maximum characters per line for each address element. Setting “MaxLength” to 0 will inherit the length configured globally. Each address element has a sensible allowed minimum length.
Valid minimum values for “MaxLength” are as follows:

  Address Element   Minimum Length   Address Element               Minimum Length
  Organization      25               Street                        20
  Department        25               Number                        5
  Contact           25               DeliveryService               25
  First Name        20               Locality                      20
  Middle Name       20               PostalCode                    5
  Last Name         20               Province                      2
  Title             20               Country                       2
  Function          20               CountrySpecificLocalityLine   25
  Salutation        20               DeliveryAddressLine           25
  Gender            1                RecipientLine                 25
  Building          25               FormattedAddressLine          25
  Sub-Building      25               AddressComplete               25

Note that, depending on the data and the country, return values might still exceed the configured “MaxLength” value. This happens if there is no useful way to abbreviate the values further. An example would be abbreviating the United Kingdom postal code “AB123AD” to 5 characters: the return value will still contain 7 characters, “AB123AD”.

Release 5.2.7 introduces the new parameter “ElementAbbreviation”. At the moment this parameter will only influence data from the USA.

In CERTIFIED mode, you can set the ElementAbbreviation parameter ON if you want Informatica AddressDoctor to abbreviate street and locality names when a validated or corrected address has more characters than the maximum character length (13 characters) defined by the USPS.

Note: You must have the relevant CASS databases set up for the ElementAbbreviation function to work.

If you set the ElementAbbreviation parameter OFF, Informatica AddressDoctor returns the address output based on the input, field length setting, and database entries.

In Version 5.4.1 and later, Informatica AddressDoctor abbreviates the locality and street information in Batch and Interactive modes if the locality and street names exceed the maximum allowable character length defined by the USPS. Informatica AddressDoctor uses the Alias database (USAC12.MD) to ensure accuracy of the abbreviation.
In Version 5.2.9 and later, this parameter also has an impact on the output of German and Dutch addresses. If the parameter is set to ON, the output street name in German addresses will be abbreviated to 22 characters if the reference database includes the short name for the street. Similarly, for addresses in the Netherlands, the output street name will be abbreviated to 24 characters if the reference database includes the short name for the street. This parameter also influences the output of CHOME addresses in Japan. Usually the word “CHOME” will be output in the street field together with the number of the CHOME. If the parameter is switched to “ON” the word CHOME will not be output. The CHOME number will be inside the Number field. 5.15 Alternative Names and Aliases Informatica AddressDoctor recognizes alias names for streets, localities and provinces around the world. The setting “MatchingAlternatives” in the ParametersXML (see chapter 5.12.5) is used to influence whether aliases should be used for matching or not. Per default, Informatica AddressDoctor replaces any alias with the official or preferred name of the address item. With Informatica AddressDoctor Version 5.2.7 this was partially changed for the locality field. A new sub item has been introduced (PREFERRED_NAME). As a result the alias names will be retained in the locality NAME and COMPLETE fields in certain process modes (for example, CERTIFIED for Australia or the USA). The locality PREFERRED_NAME field always contains the official or preferred name for the locality. In Version 5.2.9, a new option is available in Certified mode called AliasStreet with the option values of “OFFICIAL” and “PRESERVE”. OFFICIAL will change the input alias street name to the USPS official street name, and PRESERVE will retain the input alias street name, unless it is a corrected alias, in which case it will be converted to the USPS official street name. 
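The AliasStreet rule described above for Version 5.2.9 can be modelled as a small decision function. The following Python sketch is an illustration under stated assumptions, not the engine's implementation: the function name resolve_street_name, the alias_to_official dictionary and the input_was_corrected flag are hypothetical, and the single alias entry is taken from the Kissimmee example in chapter 5.16.

```python
# Hypothetical alias table with one entry, taken from the example in
# chapter 5.16 (407 W BRONSON HWY is output as 407 W VINE ST).
ALIAS_TO_OFFICIAL = {"W BRONSON HWY": "W VINE ST"}


def resolve_street_name(matched_street, alias_to_official,
                        alias_street="OFFICIAL",
                        input_was_corrected=False):
    """Model the AliasStreet option for a street matched in Certified mode.

    matched_street       -- street name the input was matched to
    alias_to_official    -- assumed mapping of alias names to official names
    alias_street         -- "OFFICIAL" or "PRESERVE"
    input_was_corrected  -- assumed flag: the input alias had to be
                            corrected before it matched
    """
    if alias_street not in ("OFFICIAL", "PRESERVE"):
        raise ValueError("unsupported AliasStreet value: %r" % alias_street)

    official = alias_to_official.get(matched_street)
    if official is None:
        # Not an alias, so there is nothing to convert.
        return matched_street

    if alias_street == "PRESERVE" and not input_was_corrected:
        # The alias is retained only when it was entered correctly.
        return matched_street

    # OFFICIAL, or PRESERVE with a corrected alias.
    return official
```

For example, resolve_street_name("W BRONSON HWY", ALIAS_TO_OFFICIAL, "PRESERVE") keeps the alias, while the same call with input_was_corrected=True returns the official name.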
5.16 AliasStreet Option Examples

The following examples show the address outputs when the AliasStreet option is set to OFFICIAL and PRESERVE.

5.16.1 AliasStreet Option = OFFICIAL

  Input                   Output
  407 W BRONSON HWY       407 W VINE ST
  KISSIMMEE FL 34741      KISSIMMEE FL 34741-4154
  USA                     USA

Note: The output street name differs from the input street name because the database contains the Alias record and the AliasStreet option is OFFICIAL.

5.16.1 AliasStreet Option = PRESERVE

  Input                   Output
  407 W BRONSON HWY       407 W BRONSON HWY
  KISSIMMEE FL 34741      KISSIMMEE FL 34741-4154
  USA                     USA

Note: The street name is unchanged because the AliasStreet option is PRESERVE.

Starting with Version 5.5.0, you can choose to retain locality aliases, also known as vanity names, in the validated output. Informatica AddressDoctor, Version 5.5.0, also gives you more control over the way street aliases are handled in the output. You can set AliasStreet and AliasLocality values in parameters.xml to define the handling of aliases for streets and localities. The following table shows the parameters and supported values.

  AliasStreet
    PRESERVE             Retains the alias for the street in the output.
    OFFICIAL (Default)   Returns the street name (the alias or the postal name) as mandated by the postal regulations of the country.
    OFF                  Returns the postal name for the street in the output.

  AliasLocality
    PRESERVE             Retains the alias for the locality in the output.
    OFFICIAL (Default)   Returns the locality name (the vanity name or the postal name) as mandated by the postal regulations of the country.
    OFF                  Returns the postal name for the locality in the output.

If you want to validate addresses in the Certified mode and generate output that conforms to the postal regulations of the country, you must ensure that the AliasStreet and AliasLocality parameters are set to the default value, OFFICIAL.
If you want Informatica AddressDoctor to preserve the vanity name for the locality or the alias for the street in the validated output, you must set the respective parameters to PRESERVE. If you want Informatica AddressDoctor to return the postal name of the locality or street in the validated output, set the respective parameters to OFF.

The following examples show different outputs for PRESERVE and OFFICIAL settings for the same U.S. address.

5.16.2 Input address:

<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">USA</Country>
    <Locality Item="1" Type="COMPLETE">SHILOH</Locality>
    <PostalCode Item="1" Type="UNFORMATTED">62269</PostalCode>
    <Province Item="1" Type="COUNTRY_STANDARD">IL</Province>
  </AddressElements>
  <AddressLines>
    <DeliveryAddressLine Line="1">9468 RIEDER RD</DeliveryAddressLine>
  </AddressLines>
</InputData>

5.16.3 Output when AliasLocality is set to PRESERVE:

<AddressElements>
  <Country Type="NAME_EN" Item="1">UNITED STATES</Country>
  <Locality Item="1">SHILOH</Locality>
  <PostalCode Item="1">62269</PostalCode>
  <Province Item="1">IL</Province>
  <Province Item="2">SAINT CLAIR</Province>
  <Street Item="1">RIEDER RD</Street>
  <Number Item="1">9468</Number>
</AddressElements>

In the preceding output example, you can see that the Locality 1 information (SHILOH) is the same as the one provided in the input.
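Whether a locality alias survived validation can be checked by comparing the Locality element of the input with that of the output. The following is a minimal sketch in Python that uses only the standard library. It parses XML fragments copied from 5.16.2 and 5.16.3 and does not call the Informatica AddressDoctor API; the helper names first_locality and alias_preserved are hypothetical.

```python
import xml.etree.ElementTree as ET

# Input fragment from 5.16.2 (AddressLines omitted for brevity).
INPUT_XML = """<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">USA</Country>
    <Locality Item="1" Type="COMPLETE">SHILOH</Locality>
    <PostalCode Item="1" Type="UNFORMATTED">62269</PostalCode>
    <Province Item="1" Type="COUNTRY_STANDARD">IL</Province>
  </AddressElements>
</InputData>"""

# Output fragment from 5.16.3 (AliasLocality set to PRESERVE).
OUTPUT_XML = """<AddressElements>
  <Country Type="NAME_EN" Item="1">UNITED STATES</Country>
  <Locality Item="1">SHILOH</Locality>
  <PostalCode Item="1">62269</PostalCode>
  <Province Item="1">IL</Province>
  <Province Item="2">SAINT CLAIR</Province>
  <Street Item="1">RIEDER RD</Street>
  <Number Item="1">9468</Number>
</AddressElements>"""


def first_locality(xml_text):
    """Return the text of the Locality element with Item="1", or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter("Locality"):
        if element.get("Item") == "1":
            return (element.text or "").strip()
    return None


def alias_preserved(input_xml, output_xml):
    """True if Locality 1 is present and identical in input and output."""
    entered = first_locality(input_xml)
    returned = first_locality(output_xml)
    return entered is not None and entered == returned
```

With the fragments above, alias_preserved(INPUT_XML, OUTPUT_XML) is true because both carry SHILOH as Locality 1. The comparison is a plain string match, so it would also report a change when only the casing differs.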
5.16.4 Output when validated in the Certified mode with AliasLocality set to OFFICIAL:

<AddressElements>
  <Country Type="NAME_EN" Item="1">UNITED STATES</Country>
  <Locality Item="1">O FALLON</Locality>
  <PostalCode Item="1">62269</PostalCode>
  <Province Item="1">IL</Province>
  <Province Item="2">SAINT CLAIR</Province>
  <Street Item="1">RIEDER RD</Street>
  <Number Item="1">9468</Number>
</AddressElements>

In the preceding output example, you can see that the Locality 1 information (O FALLON) is different from the one provided in the input (SHILOH). Shiloh is an alias for the locality that has O Fallon as its postal name, and Informatica AddressDoctor corrects the locality name to the postal name because of the OFFICIAL value set to AliasLocality.

5.17 Process Status Values

The process status values returned by AD_GetResultXML() or AD_GetResultParameter() summarize the result output quality of a Process() call to Informatica AddressDoctor. For more detailed information on the results, consult the “ElementResultStatus” also (see chapter 5.27.3). The following table describes the process status values:

  Value   Description
  A1      Address code lookup found a partial address or a complete address for the input code.
  A0      Address code lookup found no address for the input code.
  C4      Corrected. All postally relevant elements are checked.
  C3      Corrected. Some elements cannot be checked.
  C2      Corrected, but the delivery status is unclear due to absent reference data.
  C1      Corrected, but the delivery status is unclear because user standardization introduced errors.
  I4      Data cannot be corrected completely, but there is a single match with an address in the reference data.
  I3      Data cannot be corrected completely, and there are multiple matches with addresses in the reference data.
  I2      Data cannot be corrected. Batch mode returns partial suggested addresses.
  I1      Data cannot be corrected. Batch mode cannot suggest an address.
  N7      Validation error. Validation did not take place because single-line validation is not unlocked.
  N6      Validation error. Validation did not take place because single-line validation is not supported for the destination country.
  N5      Validation error. Validation did not take place because the reference database is out of date.
  N4      Validation error. Validation did not take place because the reference data is corrupt or badly formatted.
  N3      Validation error. Validation did not take place because the country data cannot be unlocked.
  N2      Validation error. Validation did not take place because the required reference database is not available.
  N1      Validation error. Validation did not take place because the country is not recognized or not supported.
  Q3      Suggestion List mode. Address validation can retrieve one or more complete addresses from the address reference data that correspond to the input address.
  Q2      Suggestion List mode. Address validation can combine the input address elements and elements from the address reference data to create a complete address.
  Q1      Suggestion List mode. Address validation cannot suggest a complete address. To generate a complete address suggestion, add data to the input address.
  Q0      Suggestion List mode. There is insufficient input data to generate a suggestion.
  RB      Country recognized from abbreviation. Recognizes ISO two-character and ISO three-character country codes. Can also recognize common abbreviations such as "GER" for Germany.
  RA      Country recognized from the ForceCountryISO3 setting.
  R9      Country recognized from the DefaultCountryISO3 setting.
  R8      Country recognized from the country name.
  R7      Country recognized from the country name, but Informatica AddressDoctor identified errors in the country data.
  R6      Country recognized from territory data.
  R5      Country recognized from province data.
  R4      Country recognized from major town data.
  R3      Country recognized from the address format.
  R2      Country recognized from a script.
  R1      Country not recognized because multiple matches are available.
  R0      Country not recognized.
  S4      Parse mode. The address was parsed perfectly.
  S3      Parse mode. The address was parsed with multiple results.
  S1      Parse mode. There was a parsing error due to an input format mismatch.
  V4      Verified. The input data is correct. Address validation checked all postally relevant elements, and inputs matched perfectly.
  V3      Verified. The input data is correct, but some or all elements were standardized, or the input contains outdated names or exonyms.
  V2      Verified. The input data is correct, but some elements cannot be verified because of incomplete reference data.
  V1      Verified. The input data is correct, but user standardization has negatively impacted deliverability. For example, the post code length is too short.

The Vx, Cx, and Ix Process Status values may be returned by “BATCH”, “INTERACTIVE” or “CERTIFIED” Process() calls, while Qx is only returned for “FASTCOMPLETION”, Sx for “PARSE”, Rx for “COUNTRYRECOGNITION”, and Ax for “ADDRESSCODELOOKUP” (see chapter 5.11 for details on the different Process Modes). Nx Process Status values may be returned for any Process() call.

Processing the same input address in Batch or Interactive mode will usually yield the same process status, except for I4 / I3 in the case of wrong numeric inputs. In this case, Batch might return I4, while Interactive gives I3.

Note that for BATCH processing it is strongly recommended to accept only records with Vx or Cx status for automated data updates. Ix records need to be reviewed manually before their results are used for any data update.

When N1 is returned (because the country was not recognized or is fundamentally unsupported), recognized but fundamentally unsupported countries will be reported in the Result parameter CountryISO3.
This is the case for former countries such as the Soviet Union (SUN) or the Netherlands Antilles (ANT). Unrecognized countries will leave this parameter empty.

5.18 Mailability Scores

Informatica AddressDoctor provides an estimate of how likely it is that mail can be delivered successfully to an address. This is a simplification of the process status values (see chapter 5.17) and gives a measure to determine whether an address is worth using for mailing in a specific usage scenario:

  Value   Description
  5       Completely confident
  4       Almost certain
  3       Should be fine
  2       Fair chance
  1       Risky
  0       Undeliverable

Addresses with a mailability of 5 and 4 may always be considered for sending mail, while 0 or 1 should not be used, independent of the scenario. Addresses marked with 2 or 3 may be used, but should be treated with caution:

  2 indicates that the results are not corrected and therefore may still contain an incorrect address component.
  3 indicates a correction which may require a review before sending the mail piece.

If there is a requirement to understand exactly what was validated or corrected in the address, the ProcessStatusValue, ElementInputStatus, and ElementResultStatus fields should be used instead of the Mailability Score.

5.18.1 5: Completely Confident

All relevant elements of the address that have been entered were checked and verified in the process.

5.18.2 4: Almost Certain

An address is considered to be Almost Certain when one of the following two scenarios is present:

Scenario 1: Some of the relevant elements of the address could not be checked because of limitations in the reference data, and the rest of the address has been verified in the process.

Scenario 2: All relevant elements have been entered and some of the relevant elements of the address have been corrected in the process with a very high confidence.
This only happens if the match was unique and the number of discrepancies was very low.

5.18.3 3: Should Be Fine

Some of the relevant elements of the address have been corrected in the process. A correction only happens if the match was unique and the number of discrepancies was acceptable.

5.18.4 2: Fair Chance

The address could not be corrected or validated in the process, in one of two scenarios:

Scenario 1: A candidate match with sufficient confidence could not be made.

Scenario 2: The address matching ended with multiple candidates with similar confidence levels (multi-match situation).

The input address, therefore, has a Fair Chance to be mailable as the relevant elements exist.

5.18.5 1: Risky

The address entered could only generate a partial match.

5.18.6 0: Undeliverable

The address entered is either missing too many components or a majority of the components could not be verified as they generate no matches against the reference data.

5.19 Geocoding Status Values

Informatica AddressDoctor 5 enables geocoding for selected countries: the Version 5 API provides the option to enrich a validated address with the respective geo-coordinates in the WGS84 (http://wikipedia.org/wiki/WGS84) format. The quality of coverage varies from country to country. Informatica AddressDoctor strives to provide geo-coordinates on house number or building level, but depending on data availability, only street or even locality level geo-coordinates might be available.

With version 5.4.0 of Informatica AddressDoctor, point address geocoding has been added. Point address geocoding enables accurate and precise geocoordinates for a specific point at an address without interpolating the values. The point address geocoding product includes the following types of geocoding:

Arrival Point geocoding.
The geocoordinates are calculated for a point that is placed in the center of a street segment in front of the house. To use arrival point geocoding, you must download the High Precision Arrival Point database. If the arrival point geocoordinates do not exist, then Informatica AddressDoctor uses the Standard Geocode database as a fallback to interpolate the geocoordinates, if the Standard Geocode database is loaded. Otherwise, Informatica AddressDoctor returns the EGC0 (no geocode available) status code for the given address.

Parcel Centroid geocoding. The geocoordinates are calculated for a point that is at the geographic center of the parcel of land. To use parcel centroid geocoding, you need to download only the Parcel Centroid database. If the parcel centroid geocoordinates do not exist, then Informatica AddressDoctor returns the EGC0 (no geocode available) status code.

Version 5.4.2 extends the point address geocoding support to the following European countries: Austria, Denmark, Germany, the Netherlands, and Sweden. In earlier versions, the support for point address geocoding was limited to North America. Informatica AddressDoctor will extend support for other countries as and when the need arises.

The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are as follows:

  Value   Description
  EGCN    Informatica AddressDoctor cannot find the geocoding database.
  EGCU    The geocoding database is not unlocked.
  EGCC    The geocoding database is corrupt.
  EGC0    Informatica AddressDoctor could not append geocoordinates to the input address because no geocode is available for the address.
  EGC4    Geocoordinates are only partially accurate to the postal code level. For example, 795xx.
  EGC5    Geocoordinates are accurate to the postal code level.
  EGC6    Geocoordinates are accurate to the locality level.
  EGC7    Geocoordinates are accurate to the street level.
  EGC8    Geocoordinates are accurate to the house number level.
       (Estimated location of the parcel of land with street-side offset.)
EGC9   High-precision arrival point geocoordinates. (Measured entryway to the parcel of land.)
EGCA   High-precision parcel centroid geocoordinates. (Measured center of the parcel of land.)

To use point geocoding for any of the supported countries, you must download the corresponding Point Address Geocoding database. The High Precision Arrival Point database provides geocoordinates for a point that is placed in the center of a street segment in front of the given address, whereas the Parcel Centroid database provides geocoordinates for a point that is at the geographic center of the parcel of land.

Arrival Point and Parcel Centroid Databases for Point Geocoding
The following table lists the High Precision Arrival Point databases and the Parcel Centroid databases for the supported countries.

Country      Arrival Point Database  Parcel Centroid Database
Austria      AUT5GCAP.MD             AUT5GCPC.MD
Canada       CAN5GCAP.MD             CAN5GCPC.MD
Denmark      DNK5GCAP.MD             DNK5GCPC.MD
Finland      FIN5GCAP.MD             FIN5GCPC.MD
Germany      DEU5GCAP.MD             DEU5GCPC.MD
Hungary      HUN5GCAP.MD             HUN5GCPC.MD
Latvia       LVA5GCAP.MD             LVA5GCPC.MD
Luxembourg   LUX5GCAP.MD             LUX5GCPC.MD
Mexico       MEX5GCAP.MD             Not available
Netherlands  NLD5GCAP.MD             NLD5GCPC.MD
Norway       NOR5GCAP.MD             NOR5GCPC.MD
Slovenia     SVN5GCAP.MD             SVN5GCPC.MD
Sweden       SWE5GCAP.MD             SWE5GCPC.MD
UK           GBR5GCAP.MD             Not available
USA          USA5GCAP.MD             USA5GCPC.MD

5.20 CAMEO Status Values
With version 5.2.8 of Informatica AddressDoctor, a new enrichment type ‘CAMEO’ has been introduced. The CAMEO Status values indicate if CAMEO codes are available for the input address or the reason why no codes are available.

Value  Description
ECON   No CAMEO codes provided because no CAMEO database for the country is available.
ECOI   No CAMEO codes provided. No CAMEO lookup was performed, as the address could not be corrected and has an Ix ProcessStatus.
ECO0   No CAMEO codes provided because no CAMEO code was found for the input address.
ECO1   CAMEO codes available.

5.21 CASS Status Values
Informatica AddressDoctor 5 provides the output required by the USPS CASS Standard, see chapter 6.24 for details on the actual output available. The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are:

Value    Description
ECA0     CASS output not available (for this address)
ECA1     CASS attributes only partially provided (some databases are missing)
ECA2..4  Reserved for future use
ECA5     CASS attributes provided

5.22 SERP Status Values
Informatica AddressDoctor 5 provides the output required by the Canada Post SERP Standard, see chapter 6.24 for details on the actual output available. The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are:

Value  Description
ESE0   SERP output not available (for this address)
ESE1   SERP attributes provided

If the Validation type is CERTIFIED and the SERP Enrichment Status is ON, two enrichments are provided: CATEGORY and EXCLUDED_FLAG. For details, see chapter 6.24.2.

5.23 SNA Status Values
Informatica AddressDoctor 5 provides the output required by the La Poste SNA Standard, see chapter 6.24 for details on the actual output available. The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are:

Value  Description
ESN0   SNA output not available (for this address)
ESN1   SNA attributes provided

5.24 AMAS Status Values
Informatica AddressDoctor 5 provides the output required by the Australia Post AMAS Standard, see chapter 6.24 for details on the actual output available.
The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are:

Value  Description
EAM0   AMAS output not available (for this address)
EAM1   AMAS output is provided: the address is corrected or validated and the DPID is delivered
EAM2   AMAS output is not provided: no correction or validation is possible, and no DPID can be returned

5.25 SendRight Status Values
The SendRightStatus parameter of the EnrichmentData may contain the following values:

Value  Description
ESR0   SendRight output not available (for this address)
ESR1   SendRight output is provided

5.26 Country Specific Enrichment
Informatica AddressDoctor provides additional enrichment output required by the local markets for the following countries:

Austria
Brazil
France
Germany
Japan
Poland
Serbia
South Africa
Switzerland
United Kingdom
USA

See chapter 6.13 for details on the actual output available. You must use a valid unlock code to use the supplementary databases for these countries.
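The status values for the supplementary databases, listed in the sections that follow, share one pattern: the letter E, a two-letter country code, and a final character (0, 1, C, N, or U). As a minimal sketch, assuming you already hold the status string returned with the processing result, the pattern can be decoded as shown below. The function and type names are hypothetical helpers and are not part of the Informatica AddressDoctor API.

```python
import re
from typing import NamedTuple

# Two-letter prefixes used by the country-specific status values in this chapter.
SUPPLEMENTARY_COUNTRIES = frozenset(
    {"US", "GB", "JP", "RS", "BR", "CH", "DE", "ZA", "AT", "FR", "PL"}
)

# Meaning of the final character, identical for every country listed above.
SUFFIX_MEANING = {
    "0": "output not available for this address",
    "1": "attributes provided (not necessarily all populated)",
    "C": "database is corrupt",
    "N": "database not found",
    "U": "database not unlocked",
}


class SupplementaryStatus(NamedTuple):
    """Hypothetical container for a decoded status value."""
    country: str
    suffix: str
    meaning: str
    has_data: bool


_STATUS_PATTERN = re.compile(r"E([A-Z]{2})([01CNU])")


def parse_supplementary_status(code: str) -> SupplementaryStatus:
    """Split a value such as 'EGBU' into country code and meaning."""
    match = _STATUS_PATTERN.fullmatch(code.strip())
    if match is None or match.group(1) not in SUPPLEMENTARY_COUNTRIES:
        # Other enrichment families (for example EGC0 or ECON) are rejected.
        raise ValueError(f"not a country-specific enrichment status: {code!r}")
    country, suffix = match.groups()
    return SupplementaryStatus(country, suffix, SUFFIX_MEANING[suffix], suffix == "1")


if __name__ == "__main__":
    for sample in ("EUS1", "EGBU", "ECH0"):
        print(sample, parse_supplementary_status(sample))
```

Checking the country prefix against a fixed set keeps values from other enrichment types, such as the geocoding or CAMEO status values, from being misread as country-specific ones.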
5.26.1 Country Specific Enrichment Status Values
The corresponding status values returned with the processing result via AD_GetResultXML() or AD_GetResultParameter() are:

5.26.2 For USSupplementary:
EUS0: US country specific output not available (for this address)
EUS1: US country specific attributes provided (not necessarily all attributes are populated)
EUSC: Database is corrupt
EUSN: Database not found
EUSU: Database not unlocked

5.26.3 For GBSupplementary:
EGB0: GB country specific output not available (for this address)
EGB1: GB country specific attributes provided (not necessarily all attributes are populated)
EGBC: Database is corrupt
EGBN: Database not found
EGBU: Database not unlocked

5.26.4 For JPSupplementary:
EJP0: JP country specific output not available (for this address)
EJP1: JP country specific attributes provided (not necessarily all attributes are populated)
EJPC: Database is corrupt
EJPN: Database not found
EJPU: Database not unlocked

5.26.5 For RSSupplementary:
ERS0: RS country specific output not available (for this address)
ERS1: RS country specific attributes provided (not necessarily all attributes are populated)
ERSC: Database is corrupt
ERSN: Database not found
ERSU: Database not unlocked

5.26.6 For BRSupplementary:
EBR0: BR country specific output not available (for this address)
EBR1: BR country specific attributes provided (not necessarily all attributes are populated)
EBRC: Database is corrupt
EBRN: Database not found
EBRU: Database not unlocked

5.26.7 For CHSupplementary:
ECH0: CH country specific output not available (for this address)
ECH1: CH country specific attributes provided (not necessarily all attributes are populated)
ECHC: Database is corrupt
ECHN: Database not found
ECHU: Database not unlocked

5.26.8 For DESupplementary:
EDE0: DE country specific output not available (for this address)
EDE1: DE country specific attributes provided (not necessarily
all attributes are populated)
EDEC: Database is corrupt
EDEN: Database not found
EDEU: Database not unlocked

5.26.9 For ZASupplementary:
EZA0: ZA country specific output not available (for this address)
EZA1: ZA country specific attributes provided (not necessarily all attributes are populated)
EZAC: Database is corrupt
EZAN: Database not found
EZAU: Database not unlocked

5.26.10 For ATSupplementary
EAT0: AT country-specific output not available (for this address)
EAT1: AT country-specific attributes provided (not necessarily all attributes are populated)
EATC: Database is corrupt
EATN: Database not found
EATU: Database not unlocked

5.26.11 For FRSupplementary
EFR0: FR country-specific output not available (for this address)
EFR1: FR country-specific attributes provided (not necessarily all attributes are populated)
EFRC: Database is corrupt
EFRN: Database not found
EFRU: Database not unlocked

5.26.12 For PLSupplementary
EPL0: PL country-specific output not available (for this address)
EPL1: PL country-specific attributes provided (not necessarily all attributes are populated)
EPLC: Database is corrupt
EPLN: Database not found
EPLU: Database not unlocked

5.26.13 Country Specific Enrichment Output Fields
The following output fields are currently supported:

5.26.14 For USSupplementary:

COUNTY_FIPS_CODE
3 digit number identifying a county in the United States. The United States Federal Information Processing Standard (FIPS) maintains a set of codes that identify states, counties, and other territorial possessions. The two-digit state code identifies each state. The three-digit county code identifies a county within a state. The five digits of the state and county code can uniquely identify any county.

STATE_FIPS_CODE
2 digit number identifying states in the United States.
The Federal Information Processing Standard (FIPS) controls the numerical and alphabetical codes that identify states and other territories of the United States.

MSA_ID
The Metropolitan Statistical Area identification number (MSAID) is a 4 digit number that identifies an urban area with a population greater than 50,000.

CBSA_ID
Represents a Core-Based Statistical Area (CBSA) identification number. A CBSA identifies an urban area with a population greater than 10,000. A CBSA can be a Metropolitan Statistical Area or Micropolitan Statistical Area. A Metropolitan Statistical Area has over 50,000 inhabitants. A Micropolitan Statistical Area has between 10,000 and 50,000 inhabitants. A CBSA_ID is a 5 digit number.

FINANCE_NUMBER
A finance number has six digits. The output is a code assigned to United States post offices and other postal facilities to enable collection of cost and statistical data. The first two digits of the finance number identify the state. The final four digits identify the USPS post office or postal facility.

RECORD_TYPE
A single-character code that describes the type of a mailbox or delivery. For example, the code can indicate if the address is in a high-rise building (value H) or a post office box (value P).

CMSA_ID
Represents a Consolidated Metropolitan Statistical Area (CMSA) identification number. A PMSA becomes a CMSA if local opinion favors the designation. The CMSA_ID is a 4 digit unique number.

TIME_ZONE_CODE
A numerical value of 1 to 3 characters that identifies the difference to GMT. For example, “-5” for Eastern Standard Time.

TIME_ZONE_NAME
3 characters identifying the time zone the address is in, for example “EST” for Eastern Standard Time.

CENSUS_TRACT_NO
Census Tract is a statistical subdivision of a county. The CENSUS_TRACT_NO is a 6 digit number.

CENSUS_BLOCK_NO
Census Block is the smallest entity for which the Census bureau collects census information. The CENSUS_BLOCK_NO is a 4 digit number.
CENSUS_BLOCK_GROUP
A Census Block Group is a group of Census blocks sharing the same first digit.

PMSA_ID
Represents a Primary Metropolitan Statistical Area (PMSA) identification number. Two or more PMSAs are created if an MSA reaches a size of 1 million or more people. The PMSA_ID is a 4 digit unique number.

MCD_ID
Represents a Minor Civil Division, which is a primary legal subdivision of a county defined by the Government. The MCD_ID is a 5 digit number.

PLACE_FIPS_CODE
5 digit number identifying localities in the United States. The Federal Information Processing Standard (FIPS) controls the numerical codes that identify localities in the United States.

5.26.15 For BRSupplementary:
Informatica AddressDoctor provides the Brazilian Institute of Geography and Statistics (IBGE) code as an enrichment output field for Brazilian addresses. For ecommerce, a government agency in Brazil publishes a list of cities/states and their official numeric seven digit code called the IBGE code. This code is used for taxation and auditing purposes. Every order that gets placed is eventually cross-referenced with the city and state to get the associated IBGE code.

You must have the new supplementary data for Brazil, BRA5E1.MD, as well as version 5.4.1 or later of Informatica AddressDoctor to leverage this functionality.

For example, if you enter the following address:

Rua da Matriz 9
Centro
Glória do Goitá-pe
55620-000
Brazil

Along with the validated output, Informatica AddressDoctor returns the following enrichment value:

IBGE_CODE: 2606101

5.26.16 For DESupplementary:
Informatica AddressDoctor now provides the following enrichment output fields for German addresses:

DEU_AGS.
The Amtliche Gemeindeschlüssel (AGS) is a variable length code that uniquely identifies a locality in Germany. There may be more than one locality for a given AGS code.

DEU_LOCALITY_ID.
The Locality ID is a variable length code that uniquely identifies a German locality.

DEU_STREET_ID.
The Street ID is a variable length code that uniquely identifies a German street address.

You must have the new German database as well as Version 5.4.1 or later of Informatica AddressDoctor to leverage this functionality.

For example, if you enter the following address:

Röntgenstr. 9
67133 Maxdorf
Germany

Along with the validated output, Informatica AddressDoctor returns the following enrichment values:

DEU_AGS: 07338018
DEU_LOCALITY_ID: 68015519
DEU_STREET_ID: 100560690

5.26.17 For CHSupplementary:
Swiss Post has introduced an additional two characters to the postal codes. Informatica AddressDoctor has updated its engine to allow the output of the additional postal code characters as an enrichment field. The new field is named POCO_EXT.

You must have the new supplementary data for Switzerland, CHE5E1.MD, as well as version 5.4.2 or later of Informatica AddressDoctor to leverage this functionality. To use the new enrichment field, you must set the EnrichmentSupplementaryCH parameter to ON in the Parameters.xml file.

For example, if you enter the following address in Batch mode:

Hohlen 1
3800 Sundlauenen
Switzerland

Along with the validated output, Informatica AddressDoctor returns the following enrichment values:

Status: ECH1
POCO_EXT: 05

5.26.18 For GBSupplementary:

DELIVERY_POINT_SUFFIX
The Royal Mail assigns a two-character suffix to every mailbox in a UK post code area. It uses the post code and delivery point suffix to identify every mailbox. The delivery point suffix format is a digit followed by a letter.

UDPRN (Unique Delivery Point Reference Number)
The Unique Delivery Point Reference Number, or UDPRN, is an eight character code that uniquely identifies each postal address of the Royal Mail PAF database.
The UDPRN keeps a constant reference that remains uniquely tied to the physical delivery point regardless of any changes in the address.

ADDRESS_KEY
Informatica AddressDoctor Version 5.6.0 extends the U.K. address enrichment to include Address Keys provided by Royal Mail. Address Keys are 8-digit numeric codes that map to addresses in the Postcode Address File (PAF) from Royal Mail. You can use Address Keys in conjunction with Organization Keys and the PostCode Type to uniquely identify an address.

5.26.19 For JPSupplementary:
Informatica AddressDoctor provides the old Choumei Aza Code, the new Choumei Aza Code, and the Gaiku code enrichments to Japan Addresses. The Choumei Aza code is an eleven-digit code defining a unique delivery point for Japan addresses. The Gaiku code is a four-digit code that denotes a city block in a Japan address.

CHOUMEI_AZA_CODE
Returns the old Choumei Aza Code. To include the old Choumei Aza code in the output, you must set the MatchingExtendedArchive attribute of the Process element to ON.

NEW_CHOUMEI_AZA_CODE
Returns the new Choumei Aza Code.

GAIKU_CODE
Returns the Gaiku code.

5.26.20 For RSSupplementary:
Post Serbia has introduced an additional six-digit Postal Address Code (PAK) which goes down to the street level. The PAK ensures that mail is delivered correctly and promptly to recipients in Serbia. For items that are addressed to a P.O. Box, “poste restante” or to a military address, the PAK is not needed in the address.

POSTAL_ADDRESS_CODE
The postal address code (PAK).

5.26.21 For ZASupplementary:
Informatica AddressDoctor provides the National Address Database (NAD) ID as an enrichment output field for South African addresses. The NAD ID is a unique numeric ID assigned to each street address.
You must have the new South African database as well as version 5.4.2 or later of Informatica AddressDoctor to leverage this functionality.

For example, if you enter the following address:

4 Balmoral Road
Vincent
East London
5247
South Africa

Along with the validated output, Informatica AddressDoctor returns the following enrichment value:

NAD_ID: 2170232

5.26.22 For ATSupplementary

POSTAL_ADDRESS_CODE
Informatica AddressDoctor provides the Postal Address Code as an enrichment to Austrian addresses. You must have the AUT5E1.MD database installed and EnrichmentSupplementaryAT in the parameter.xml set to ON. This is supported only in Informatica AddressDoctor versions 5.5.0 and later.

For example:

<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">AUT</Country>
    <Locality Item="1" Type="COMPLETE">Perchtoldsdorf</Locality>
    <PostalCode Item="1" Type="UNFORMATTED">2380</PostalCode>
    <Province Item="1" Type="COUNTRY_STANDARD">Niederösterreich</Province>
    <Street Item="1" Type="COMPLETE">Plättenstraße</Street>
    <Number Item="1" Type="COMPLETE">7</Number>
  </AddressElements>
</InputData>

Along with the validated output, Informatica AddressDoctor returns the following enrichment values:

Status: EAT1
PAC: 105176447

5.26.23 For FRSupplementary

INSEE_CODE
Informatica AddressDoctor provides the INSEE code and the INSEE-9 code as enrichments to French addresses. The INSEE code is a numerical indexing code used by the French National Institute for Statistics and Economic Studies (INSEE) to identify various entities including French communes and departments. INSEE codes are particularly helpful in uniquely identifying French communes that share the same name, spelling, and pronunciation. In a five-digit INSEE code for a commune, the first two digits represent the department and the last three denote the commune. INSEE codes are also used as National Identification Numbers for French citizens.
The INSEE-9 code is also known as the IRIS code. IRIS stands for aggregated units for statistical information in French, and represents a demographic group that contains a maximum of 2000 people. France is composed of around 16,100 IRIS units including 650 units in overseas departments.

To use this enrichment, you must have the FRA5E1.MD database installed and EnrichmentSupplementaryFR in parameter.xml set to ON. This is supported only in Informatica AddressDoctor versions 5.5.0 and later.

For example:

<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">FRA</Country>
    <Locality Item="1" Type="COMPLETE">AGEN</Locality>
    <PostalCode Item="1" Type="UNFORMATTED">47000</PostalCode>
    <Street Item="1" Type="COMPLETE">RUE DU PUITS DU SAUMON</Street>
    <Number Item="1" Type="COMPLETE">6</Number>
  </AddressElements>
</InputData>

Along with the validated output, Informatica AddressDoctor returns the following information:

STATUS: EFR1
INSEE_CODE: 47001
INSEE_9_CODE: 470010115

5.26.24 For PLSupplementary

GMINA_CODE, LOCALITY_TERYT_ID, STREET_TERYT_ID
Informatica AddressDoctor provides Gmina code, Locality and Street TerytIDs as enrichments for addresses in Poland. National Official Register of the Territorial Division of the Country (TERYT) is the official agency of Poland that is responsible for identifiers and names of territories, localities, roads, buildings, and so on. Gmina is the Polish equivalent of communes or municipalities. Gmina code and TerytIDs are assigned and managed by TERYT.

To use these enrichments, you must have the POL5E1.MD database installed and EnrichmentSupplementaryPL in parameter.xml set to ON. This is supported only in Informatica AddressDoctor versions 5.5.0 and later.
For example:

<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">POL</Country>
    <Locality Item="1" Type="COMPLETE">Wrocław</Locality>
    <PostalCode Item="1" Type="UNFORMATTED">50510</PostalCode>
    <Province Item="1" Type="COUNTRY_STANDARD">dolnośląskie</Province>
    <Street Item="1" Type="COMPLETE">ul. Laskowa</Street>
    <Number Item="1" Type="COMPLETE">1</Number>
  </AddressElements>
</InputData>

Along with the validated output, Informatica AddressDoctor returns the following enrichment values:

Status: EPL1
GMINA_CODE: 2183
LOCALITY_TERYT_ID: 0986544
STREET_TERYT_ID: 10666

5.27 Element Status and Relevance Values
Element status values give a detailed explanation of the outcome of the validation operation. They are only meaningful after a validation operation has been performed, even though some information is available after a parsing operation for the “ElementInputStatus” value.

In Informatica AddressDoctor 5, 20 address elements are covered in both “ElementInputStatus” and “ElementResultStatus”. The former provides per-element information on the matching of input elements to reference data. The latter categorizes the result in more detail than the overview process status values described in section 5.17, by indicating if and how the output fields have been changed from the input fields.
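Because both status values are strings with one character per element position, they can be split into named positions for logging or reporting. The following is a minimal sketch that assumes the position order listed in section 5.27.1. The function names and the sample string are hypothetical and do not come from the Informatica AddressDoctor API or from a real processing result.

```python
from typing import Dict

# Position order taken from section 5.27.1 (left to right, positions 1 to 20).
ELEMENT_POSITIONS = [
    "PostalCode level 0", "PostalCode level 1",
    "Locality level 0", "Locality level 1",
    "Province level 0", "Province level 1",
    "Street level 0", "Street level 1",
    "Number level 0", "Number level 1",
    "Delivery service level 0", "Delivery service level 1",
    "Building level 0", "Building level 1",
    "SubBuilding level 0", "SubBuilding level 1",
    "Organization level 0", "Organization level 1",
    "Country level 0", "Country level 1",
]


def decode_element_status(status: str) -> Dict[str, str]:
    """Map each character of a 20-character status string to its position name."""
    if len(status) != len(ELEMENT_POSITIONS):
        raise ValueError(
            f"expected {len(ELEMENT_POSITIONS)} characters, got {len(status)}"
        )
    return dict(zip(ELEMENT_POSITIONS, status))


def populated_positions(status: str) -> Dict[str, str]:
    """Keep only the positions whose value is not '0' (no data at this position)."""
    return {name: value
            for name, value in decode_element_status(status).items()
            if value != "0"}


if __name__ == "__main__":
    # Hypothetical ElementResultStatus: five level-0 elements with value F,
    # eight empty positions, then the country recognized from its name (E).
    sample = "F0F0F0F0F0" + "00000000" + "E0"
    for name, value in populated_positions(sample).items():
        print(f"{name}: {value}")
```

The same decoding applies to “ElementInputStatus”. Only the interpretation of the individual characters differs, as described in sections 5.27.2 and 5.27.3.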
5.27.1 Element Positions
The element positions, from left to right, are listed below. Level 0 pertains to the Item 1 status information, while level 1 summarizes the status information on Items 2-6 (see chapter 5.9 on address element items):

Position  Description
1         PostalCode level 0
2         PostalCode level 1 (for example, ZIP+4, the Plus 4 addition)
3         Locality level 0
4         Locality level 1 (for example, Urbanisation, Dependent Locality)
5         Province level 0
6         Province level 1 (for example, Sub Province)
7         Street level 0
8         Street level 1 (for example, Dependent street)
9         Number level 0
10        Number level 1
11        Delivery service level 0 (for example, PO Box, GPO, Packstation, Private Bags)
12        Delivery service level 1
13        Building level 0
14        Building level 1
15        SubBuilding level 0
16        SubBuilding level 1
17        Organization level 0
18        Organization level 1
19        Country level 0 (Mother country)
20        Country level 1 (for example, Territory)

See chapter 5.3 for more in-depth information on address elements.

5.27.2 ElementInputStatus
The possible values for validation are:

Value  Description
0      The input address contains no data at this position.
1      The data at this position cannot be found in the reference data.
2      The position cannot be checked because reference data is missing.
3      The data is incorrect. The reference database suggests that the Number or DeliveryService value is outside the range expected by the reference data. In batch and certified modes, the input data at this position is passed uncorrected as output. In suggestion list modes, Informatica AddressDoctor can provide alternatives.
4      The data at this position matches the reference data, but with errors.
5      The data at this position matches the reference data, but the data element was corrected or standardized.
       For example:
       Parsing: Splitting of house number for “MainSt 1”
       Validation: Replacing an input that is an exonym, or dropping superfluous fielded input that is not valid according to the country reference database
6      The data at this position matches the reference data without any error.

For parsing, the following values are possible:

Value  Description
0      The input address contains no data at this position.
1      The element at this location was moved to another position.
2      The element at this position matched the reference data value but needed to be normalized.
3      The data at this position is correct.

5.27.3 ElementResultStatus
ElementResultStatus is set after validation to indicate whether verification (“verified”) or correction (“changed”) was possible.

The following table describes the possible values for the address elements in positions 1 through 18:

Value  Description
0      The output address contains no data at this position.
1      The data at this position cannot be found in the reference data. The input data is copied to the output data.
2      Data at this position is not checked but is standardized.
3      Data at this position is checked but does not match the expected reference data. The reference data suggests that the number data is not in the valid range. The input data is copied to the output. The status value applies in batch mode only.
4      Data at this position is validated but not changed because reference data is missing.
5      Data at this position is validated but not changed because multiple matches exist in the reference data. The status value applies in batch mode only.
6      Data validation deleted the input value at this position.
7      Data at this position is validated but contained a spelling error. Validation corrected the error by copying the value from the reference data.
8      Data at this position is validated and updated by adding a value from the reference data.
       It can also mean that the reference database contains additional data for the input element. For example, validation can add a building or sub-building number if a perfect match is found for the street name or building name.
9      Data at this position is validated but not changed, and the delivery status is not clear. For example, the DPV value is wrong.
C      Data at this position is validated and verified, but the name data is out of date. Validation changed the name data.
D      Data at this position is validated and verified but changed from an exonym to an official name.
E      Data at this position is validated and verified. However, data validation standardized the character case or the language. Address validation can change the language if the value fully matches a language alternative. For example, address validation can change "Brussels" to "Bruxelles" in a Belgian address.
F      Data at this position is validated, verified, and not changed, due to a perfect match with reference data.

Positions 19 and 20 in the output string relate to country data. The country data values apply to the COUNTRYRECOGNITION process mode also. For more information, see chapters 5.11.8 and 5.12.3.

The following table describes the possible values for the address elements in positions 19 through 20:

Value  Description
0      The output address contains no data at this position.
1      The country is not recognized.
4      The country is recognized from the DefaultCountryISO3 setting.
5      The country is not recognized because multiple matches are available.
6      The country is recognized from a script.
7      The country is recognized from the address format.
8      The country is recognized from major town data.
9      The country is recognized from province data.
C      The country is recognized from territory data.
D      The country is recognized from the country name, but the name contains errors.
E      The country is recognized from the country name without errors.
F      The country is recognized from the ForceCountryISO3 setting.

5.27.4 ElementRelevance
In addition to the element status values described previously, information is available on which of the address elements of the address processed are actually relevant from the local postal operator’s point of view. The possible values for each address element are “1” for relevant and “0” otherwise. For any given address, all address elements with a value of “1” must be present for an output address to be deemed valid by the local postal authority.

“ElementRelevance” may well vary from address to address for countries with different address types; for example, rural versus metropolitan addressing. Furthermore, AddressElements that have actually been validated against reference data (i.e. with ElementResultStatus 7 and higher) may override the default ElementRelevance value defined for that AddressElement.

Note that “ElementRelevance” is really only meaningful for a “ProcessStatus” value of Cx or Vx (and possibly I3 and I4 for Process Mode INTERACTIVE, see chapter 5.17 for details on “ProcessStatus”).

5.28 Extended Element Result Status Fields

5.28.1 Address Resolution Code (ARC)
The Address Resolution Code is a twenty character output string, similar to the existing Element Status fields, which is populated for invalid (Ix Process Status Code) records. The ARC explains why an address is rejected and directs you to possible resolutions. Informatica AddressDoctor generates the following Address Resolution Code values:

Value  Description
2      Missing element in address.
3      Numeric provided inside element is outside permissible range. For example, wrong numeric inside street name or house number; 100 Main St when house numbers range from 400-800.
4      Multiple inputs for the element.
5      Input element ambiguous / multiple matches.
6      Element contradicts other elements.
       For example, the postal code information and locality information do not match.
7      3 strike rule/too many corrections in combination of several elements.
8      General Postal Authority Rule.

Note that for all other scenarios this value will be zero.

5.28.2 ARC = 3 (Numeric provided inside an element is outside permissible range)
The house number in the following example is outside the range for the address. Two suggestions are generated for the address in interactive mode.

Input           Output           Output
Röntgenstr. 10  Röntgenstr. 1-9  Röntgenstr. 2-8
67133 Maxdorf   67133 Maxdorf    67133 Maxdorf
Germany         Germany          Germany

Process Status = I3: Data could not be corrected completely.

Element       EIS  ERS  ARC  Relevance  Explanation
House Number  3    7    3    1          House Number provided is outside the permissible range

5.28.3 ARC = 4 (Multiple inputs for the element)
Processing the following address in batch mode gives a process status of I4 and an ARC value of 4.

Input                    Output
Street=Rue des Ardennes  Street=Rue des Ardennes
House Number=21          House Number=21
Postal Code=75019        Postal Code=75019
Locality=75935 Paris     Locality=Paris

Process Status = I4: This address has multiple postal codes and cannot be resolved.

Element      EIS  ERS  ARC  Relevance  Explanation
Postal Code  6    0    4    0          Multiple postal codes for the address

This results in ARC = 4 for the postal code; i.e. multiple inputs for the element (two postal codes in input).

5.28.4 ARC = 5 (Input element ambiguous / multiple matches)
Processing the following address in interactive mode gives a process status of I1 and an ARC value of 5.

Input             Output
C.P. 102          C.P. 102
120557 Bucuresti  120557 Bucuresti
Romania           Romania

This address is ambiguous because the input postal code is incorrect for Bucuresti. There are multiple suggestions for it in interactive mode; therefore the address is copied to the output, and a value of 5 is assigned to the postal code.
Process Status Code = I1: Data could not be corrected and no suggestions are available in interactive mode.

Element      EIS  ERS  ARC  Relevance  Explanation
Postal Code  3    0    5    0          Input element ambiguous because the postal code does not exist for the locality, leading to multiple suggestions in interactive mode. The address is therefore copied to the output.

5.28.5 ARC = 6 (element contradicts other elements; for example, Postal Code/Locality mismatch)
Processing the following address in certified mode gives a process status of I2 and an ARC value of 6.

Input                  Output
301-703 Riverwood Ave  301-703 Riverwood Ave
Winnipeg MB T5A 0P8    Winnipeg MB T5A 0P8
Canada                 Canada

The postal code and locality values in the input contradict each other. Postal code T5A 0P8 is for Edmonton in Alberta and not for Winnipeg, Manitoba. SERP certification rules state that the postal code cannot be changed.

Process Status Code = I2: Data could not be corrected in certified mode.

Element      EIS  ERS  ARC  Relevance  Explanation
Postal Code  6    0    6    0          Postal Code contradicts Locality
Locality     6    0    6    0          Locality contradicts Postal code

5.28.6 ARC = 7 (Too many corrections)
Processing the following address in interactive mode gives a process status of I4 and an ARC value of 7.

Input               Output
Peterlenstrasse 14  Peter-Anders-Str. 14
1000 Berlin         12057 Berlin
Germany             Germany

Process Status = I4: Data could not be corrected completely.

The postal code 1000 is incorrect for Berlin and Peterlenstrasse does not exist in Berlin. Therefore, these elements must be corrected aggressively to get some results returned. It is unclear whether these elements are completely correct. The elements are, therefore, assigned an ARC value of 7.
Element        EIS  ERS  ARC  Relevance  Explanation
Postal Code    4    7    7    1          Too many corrections
Street         4    7    7    1          Too many corrections

5.28.7 ExtElementStatus (EERS)

The Extended Element Result Status (EERS) code is a twenty character output string similar to the Element Status fields for valid or corrected addresses. The EERS informs the user that additional information may be available in the reference database for the given address. The code can return the following values:

Value  Description
1      Data available for the element in the database, but not used for validation
2      Element unchecked, but changed because of wrong syntax/format
3      Numeric in element correct, but element changed because of wrong syntax/descriptor
4      Element correct or unchecked, but moved because of wrong format
5      Alternative available in database – for example, language, preferred locality name, alias name
6      Unvalidated parts inside element like additional information
7      Level change like moving HNO1 to HNO2 or swapping Locality2 with Locality1
8      Type change for fielded input only; for example, moving SubBuilding to Building Level 2
9      General Postal Authority Rule
A      Dominant match for dual address processing
B      Relevance is only a country-wide default and cannot be trusted
C      Fast Completion Overflow
D      Numeric for range expansion (interpolated)
E      Language not available for the country, database language returned
F      Output address is outdated

Note that for all other scenarios this value will be zero.

5.28.8 EERS = 2 (Element unchecked, but changed because of wrong syntax/format)

Processing the following address in batch mode gives a process status of V2 and an EERS value of 2.

Input                                Output
113/115 Rue Germaine Tailleferre     113 Rue Germaine Tailleferre
75019 Paris                          75019 Paris
France                               France

Process Status = V2: Address is correct but some elements could not be verified.
Element        EIS  ERS  EERS  Relevance  Explanation
House Number   2    2    2     1          The input contains the wrong syntax for the house number; i.e. two house numbers. The first part of the house number (113) is not found in the database and is therefore copied to the output, and the second part (115) is removed from the element.

5.28.9 EERS = 3 (Numeric in element correct, but element changed because of wrong syntax/descriptor)

Processing the following address in batch mode gives a process status of V3 and an EERS value of 3.

Input                        Output
18-20 Rue Edouard Jacques    18 Rue Edouard Jacques
75014 Paris                  75014 Paris

Process Status = V3: Verified – input data correct but some elements were standardized.

France does not permit ranges for house numbers. Therefore, the first part of the house number is matched against the reference database and the “-20” is removed. This leads to an assignment of 3 for EERS for the house number element of the address.

Element        EIS  ERS  EERS  Relevance  Explanation
House Number   5    E    3     1          Numeric in element correct, but element changed because of wrong syntax/descriptor

5.28.10 EERS = 4 (Element correct or unchecked, but moved because of wrong format)

Processing the following address in batch mode gives a process status of C4 and an EERS value of 4.

Input Output Organization: Sinopia Financial Services Sinopia Financial Services DAL1: 4 Place del la Pyramide Immeuble Ile de France DAL2: Immeuble Ile de France 4 Place De La Pyramide CSLLN: Paris La Defense CEDEX 92912 Puteaux Country: France 92912 Paris La Defense CEDEX France

Process Status = C4: Corrected

Building Immeuble Ile de France has been moved above the street in the output and gets an EERS status of 4.

Element        EIS  ERS  EERS  Relevance  Explanation
Building       6    F    4     0          Building moved one level up, i.e.
above the street, because the input format was incorrect

5.28.11 EERS = 5 (Alternative available in database – for example, language, preferred locality name, alias name)

In this example, the PreferredLanguage parameter is set to “Database”, and the address is processed in batch mode.

Input             Output
Koningstraat 4    Rue Royale 4
Brussels          1000 Bruxelles
Belgium           Bruxelles-Capitale
                  Belgium

Process Status = C4: Corrected

Element        EIS  ERS  EERS  Relevance  Explanation of EERS
Locality       5    E    5     1          Alternative language available in database
Province       0    8    5     0          Alternative language available in database
Street         4    7    5     1          Alternative language available in database

5.28.12 EERS = 6 (Unvalidated parts inside element like additional information)

Processing the following address in batch mode gives a process status of V2 and an EERS value of 6.

Input                             Output
Leona Vicario 7528 C 1            Leona Vicario 7528 C 1
Condominio Campestre Del Valle    Condominio Campestre Del Valle
52177 Metepec, Mex                52177 Metepec, Mex
Mexico                            Mexico

Process Status = C3: Corrected but some elements could not be checked.

LEONA VICARIO is found in the database as a valid street. “7528 C” is an unvalidated part of the street output. The house number is copied. It is not known whether the house number is relevant for delivery, and therefore gets an EERS of B.

Element        EIS  ERS  EERS  Relevance  Explanation
Street         6    F    6     1          7528 C is an unvalidated part of the Street
House Number   2    4    B     1          The element relevance is only a country-wide default and cannot be trusted

5.28.13 EERS = 7 (Level change like moving HNO1 to HNO2 or swapping Locality2 with Locality1)

Processing the following address in batch mode gives a process status of C3 and an EERS value of 7.
Input                             Output
RUA EDUARDO RIZK 1135             RUA EDUARDO RIZK 1135
GUARUJÁ                           BALNEÁRIO CIDADE ATLÂNTICA
BALNEÁRIO CIDADE ATLÂNTICA-SP     GUARUJÁ-SP
11441-140                         11441-140
BRAZIL                            BRAZIL

Process Status = C3: Corrected but some elements could not be checked.

Element        EIS  ERS  EERS  Relevance  Explanation
Locality1      5    E    7     1          Locality1 swapped with Locality2 – elements have changed levels
Locality2      6    F    7     0          Locality2 swapped with Locality1 – elements have changed levels

5.28.14 EERS = 8 (Type change for fielded input; for example, moving Sub-Building to Building Level 2)

The EERS value of 8 is only set for fielded input. Processing the following address in batch mode gives a process status of V2 and an EERS value of 8.

Input                                             Output
Organization = CENTRE GESTION AGREE ENTREPRISES   Organization = CENTRE GESTION AGREE ENTREPRISES
Building = IMPASSE DE PECHABOUT                   Building = BOUTOULLE
Sub-building = BOUTOULLE                          Street = IMPASSE DE PECHABOUT
House Number = 53                                 House Number = 53
Delivery Service Name = BP                        Delivery Service Name = BP
Delivery Service Number = 40098                   Delivery Service Number = 40098
Postal Code = 47003                               Postal Code = 47003
Locality = AGEN                                   Locality = AGEN CEDEX
Country = FRANCE                                  Country = FRANCE

Process Status = V2: Verified – input data correct but some elements could not be verified because of incomplete reference data.

Element        EIS  ERS  EERS  Relevance  Explanation
Street         6    F    8     1          The street data in the output was the building data in the input
Building       6    F    8     0          The building data in the output was the sub-building data in the input

5.28.15 EERS = A (Dominant match for dual address processing)

In this example the DualAddressPriority is set to Street. This yields an EERS status of A for the Street element in batch mode.

Input              Output
3 Poplar St        PO BOX 2
PO BOX 2           3 Poplar St
New Haven 06513    New Haven CT 06513-4325
USA                USA

Process Status = C4: Corrected.
Element        EIS  ERS  EERS  Relevance  Explanation
Street         6    F    A     1          Street is the dominant match

5.28.16 EERS = B (Relevance is only a country-wide default and cannot be trusted)

A value of B in the extended element result status output field indicates that the relevance value cannot be trusted and is only a country-wide default value.

User interpretation:

- Element empty, Relevance = 0, EERS = 0: No information about this element.
- Element empty, Relevance = 0, EERS = B: This use case will not occur.
- Element empty, Relevance = 1: This use case must not happen; a missing relevant element should lead to a rejection.
- Non-empty element, Relevance = 0, EERS = 0: Element is definitely not relevant.
- Non-empty element, Relevance = 1, EERS = 0: Element is definitely relevant.
- Non-empty element, Relevance = 0, EERS = B: Element relevance is a country-wide default.
- Non-empty element, Relevance = 1, EERS = B: Element relevance is a country-wide default.

5.28.17 EERS = C (Fast Completion Overflow)

In this example, MaxResultCount = 20 and the following elements are entered in Fast Completion mode.

Locality = New York
Country = USA

This results in an EERS value of C (Fast Completion Overflow) for the postal code element.

Process Status = Q1: Suggested address incomplete.

Element        EIS  ERS  EERS  Relevance  Explanation
Postal Code    0    8    C     1          More than 20 suggestions – overflow available

5.28.18 EERS = D (Numeric for range expansion (interpolated))

This value for the EERS output field is assigned if the RangesToExpand parameter is set to “ALL”. In the following example, the delivery service numeric range is 1-40. Only the interval limits of 1 and 40 are confirmed in the database. For all other 38 results, the EERS for Delivery Service = D in the Interactive and Fast Completion modes.
Input
Delivery Service = Postfach
Postal Code = 91279
Locality = Kirchenthumbach
Country = Germany
RangesToExpand = “ALL”
Process Mode = Fast Complete

Process Status = Q3: Suggestions available – complete address

Element            EIS  ERS  EERS  Relevance  Explanation
Delivery Service   0    8    D     1          The numbers 1 and 40 are confirmed in the database. For all other 38 results the delivery service numbers will be interpolated and the EERS status is set to “D”

5.28.19 EERS = E (Language not available for the country, default language returned)

In this example the PreferredLanguage parameter = English, and the address is validated in batch mode.

Input             Output
Koningstraat 4    Rue Royale 4
Brussels          1000 Brussels
Belgium           Brussels-Capitale
                  Belgium

Process Status = C4

Element        EIS  ERS  EERS  Relevance  Explanation
Street         4    7    E     1          Street not available in the preferred language (English), therefore defaults to Database (French)
Province       0    8    E     0          Province not available in the preferred language (English), therefore defaults to Database (French)

5.29 ResultPercentage Values

The “ResultPercentage” value indicates how similar a result is to the parsed input; values close to 100% imply high similarity. The values are mainly provided to allow filtering out overly extensive corrections in records with a Cx BATCH “ProcessStatus” value (see chapter 5.17) in master data management environments with very stringent data quality requirements. “ResultPercentage” may also be used to determine which INTERACTIVE results show the least deviation from the input. Informatica AddressDoctor discourages using “ResultPercentage” values for any use case other than the two described above.

5.30 Language ISO Code Output

In situations where a result address contains data from the database, its language may be output via the ResultData parameter LanguageISO3 as an ISO 639 3-letter code, for example “DEU” for German.
For transliterated output the original language will be reported, for example “JPN” in case of romanized Japanese output.

5.31 Address Types

Informatica AddressDoctor can populate the AddressType output field with a value that represents the type of mailbox that the address identifies.

For United States addresses, Informatica AddressDoctor returns the address type values that the United States Postal Service specifies. The United States Postal Service includes a Record Type value in the reference data that it provides for domestic addresses. Mail carriers from other countries do not specify an address type flag in the same manner, aside from New Zealand, which specifies a rural address, and Canada, which specifies a large volume receiver.

For most countries, Informatica AddressDoctor uses a range of criteria to interpret the address type from the validated address data. For example, Informatica AddressDoctor can recognize a mailbox at an organization when the mailbox serves a large volume receiver.

Note that Informatica AddressDoctor cannot guarantee the accuracy of the address types when the reference data does not contain address type information. For more information on the address types in different countries, see the sections below.

5.31.1 Country-Specific Address Type Indicators

When the reference data for a country does not contain a formal address type designator, Informatica AddressDoctor uses the data in the address to determine the address type. Informatica AddressDoctor uses different data elements to assign address types to addresses from different countries. When you read the address type values for a country that does not define address type indicators, consider the criteria that Informatica AddressDoctor uses to infer an address type from the address data.
Informatica AddressDoctor defines a range of criteria to infer the address types in the following countries:

- Australia
- Canada
- France
- New Zealand

Informatica AddressDoctor also defines a set of criteria that infer the address type when you process United States addresses in Fast Completion mode.

Address Type Indicators in Addresses from the United States

Informatica AddressDoctor returns the United States Postal Service address type for a United States address when you perform validation in Batch, Certified, or Interactive mode. The following table describes the address types that the United States Postal Service can specify for United States addresses:

Address Type  Description
F             The address identifies an organization.
G             The address is a general delivery address. In a general delivery address, the postal code and the recipient data identify the address.
H             The address identifies a high-rise building. The address contains sub-building elements such as apartment or suite.
P             The address identifies a Post Office Box or a delivery service.
R             The address is a rural route/highway contract address.
S             The address identifies a street.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

Address Type Indicators in Addresses from Australia

The following table lists the address types that Informatica AddressDoctor can return for Australia addresses:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization.
L             The address post code identifies the organization as a large volume receiver. The reference data adds or validates the organization name. Informatica AddressDoctor can determine that the address is a large volume receiver in one of the following ways:
              - The address post code identifies the organization as a large volume receiver.
              - The reference data does not contain street or building information.
P             The address identifies a Post Office Box or a delivery service.
S             The address identifies a street. S is the default address type. If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

L, F, P, B, S

Note: For Australia addresses, Informatica AddressDoctor can return information relevant to the address type on other output elements. Consult the Process Status, Element Input Status, and Element Result Status values.

Address Type Indicators in Addresses from Canada

The following table lists the address types that Informatica AddressDoctor can return for Canada addresses:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization. In Canada addresses, the type F addresses are a subset of the type L addresses. Therefore, the address type F also indicates a large volume receiver.
G             The address is a general delivery address. In a general delivery address, the postal code and the recipient data identify the address. Informatica AddressDoctor uses the delivery record in the reference data to identify the address type.
L             The address post code identifies the organization as a large volume receiver. The address might or might not contain an organization name.
P             The address identifies a Post Office Box or a delivery service.
R             The address identifies a rural route. Informatica AddressDoctor uses the delivery record in the reference data to identify the address type.
S             The address identifies a street. S is the default address type.
If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

F, L, P, B, R, S, G

Address Type Indicators in Addresses from France

The following table lists the address types that Informatica AddressDoctor can return for France addresses:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization. The address does not include a CEDEX post code.
G             The address is a general delivery address. The reference data does not contain a match for the street information, but the reference data contains a match for the CEDEX post code in the address.
L             The address post code identifies the organization as a large volume receiver. The address might or might not contain an organization name. The reference data uses the CEDEX post code to add or validate the organization name.
P             The address identifies a Post Office Box or a delivery service.
S             The address identifies a street. S is the default address type. If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

L, F, P, B, S, G

Address Type Indicators in Addresses from New Zealand

The following table lists the address types that Informatica AddressDoctor can return for New Zealand addresses:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization.
L             The address post code identifies the organization as a large volume receiver. The reference data adds or validates the organization name. Informatica AddressDoctor can determine that the address is a large volume receiver in one of the following ways:
              - The address post code identifies the organization as a large volume receiver.
              - The reference data does not contain street or building information.
P             The address identifies a Post Office Box or a delivery service.
R             The address identifies a rural route. Informatica AddressDoctor uses the delivery record in the reference data to identify the address type.
S             The address identifies a street. S is the default address type. If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

L, F, P, B, R, S

Address Type Indicators in United States Addresses in Fast Completion Mode

The following table lists the address types that Informatica AddressDoctor can return for United States addresses in Fast Completion mode:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization.
L             The address post code identifies the organization as a large volume receiver. The reference data adds or validates the organization name. Informatica AddressDoctor can determine that the address is a large volume receiver in one of the following ways:
              - The address post code identifies the organization as a large volume receiver.
              - The reference data does not contain street or building information.
P             The address identifies a Post Office Box or a delivery service.
S             The address identifies a street. S is the default address type. If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

L, F, P, B, S

Address Type Indicators for the Rest of the World

The following table lists the address types that Informatica AddressDoctor can return for all countries that the preceding sections do not cover:

Address Type  Description
B             The address identifies a building.
F             The address identifies an organization.
L             The address post code identifies the organization as a large volume receiver. The reference data adds or validates the organization name. Informatica AddressDoctor can determine that the address is a large volume receiver in one of the following ways:
              - The address post code identifies the organization as a large volume receiver.
              - The reference data does not contain street or building information.
P             The address identifies a Post Office Box or a delivery service.
S             The address identifies a street. S is the default address type. If Informatica AddressDoctor cannot determine the address type from the address data, it returns the default value.
U             Unidentified. The address is not valid, and Informatica AddressDoctor does not assign an address type.

If an address meets the criteria for more than one address type, Informatica AddressDoctor assigns the first applicable address type from the following list:

L, F, P, B, S

Note: Informatica AddressDoctor can return information relevant to the address type on other output elements. Consult the Process Status, Element Input Status, and Element Result Status values.
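The precedence rule stated in the tables above ("assigns the first applicable address type from the following list") can be sketched in C as follows. This is a minimal illustration under stated assumptions, not part of the Informatica AddressDoctor API: the function pick_address_type() and the PRIORITY_* constants are hypothetical names, and the set of applicable types is assumed to be supplied by the caller as a string of type letters. Only the priority orders themselves are taken from this chapter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Priority orders quoted from the address type tables in section 5.31.1.
   All identifiers here are illustrative; they do not exist in the product API. */
static const char PRIORITY_AUSTRALIA[]   = "LFPBS";
static const char PRIORITY_CANADA[]      = "FLPBRSG";
static const char PRIORITY_FRANCE[]      = "LFPBSG";
static const char PRIORITY_NEW_ZEALAND[] = "LFPBRS";
static const char PRIORITY_DEFAULT[]     = "LFPBS"; /* US Fast Completion and rest of the world */

/* Returns the first letter of `priority` that also occurs in `candidates`
   (the types whose criteria a valid address meets). If no candidate is
   listed, the documented default type 'S' is returned. Invalid addresses
   (type 'U') are assumed to be handled by the caller before this point. */
static char pick_address_type(const char *priority, const char *candidates)
{
    assert(priority != NULL && candidates != NULL);
    for (const char *p = priority; *p != '\0'; ++p) {
        if (strchr(candidates, *p) != NULL) {
            return *p;
        }
    }
    return 'S';
}
```

For example, pick_address_type(PRIORITY_CANADA, "LF") yields 'F', whereas the same candidates evaluated with PRIORITY_FRANCE yield 'L', which reflects the different ordering of F and L in the Canada and France lists.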
5.32 Return Codes

The use of Informatica AddressDoctor may result in success or error conditions signaled via return codes. All API functions return an AD_I32 (32 bit signed integer) return code value:

- A value of 0 (zero) indicates success.
- A negative value of -10000 or below indicates a very critical error, and further processing is usually impossible. It is strongly advised to shut down the whole process, as it may be in an unstable state.
- Negative values between -1 and -9999 indicate critical errors, and further processing may be impossible.
- A positive value of 1000 and above indicates non-critical errors, and further processing is possible.
- Return code values between 1 and 999 have been assigned to warnings, indicating possible issues with configuration settings, address input or output.

The calling logic must always check the return value. While it informs about fundamental errors, the actual validation results are returned via separate API functions (see chapter 6.11).

Following are the most common error return codes, including an explanation (see the API documentation in chapter 10.2 for a complete and up-to-date list):

5.32.1 Success

The operation was completed without error:

Code   Description
0      OK, no error

5.32.2 Warnings

The operation was completed, but possibly with an unexpected result:

Code   Description
1      The SetConfig.xml contained at least one corrupt unlock code
2      The SetConfig.xml contained at least one expired or not yet valid unlock code
3      The SetConfig.xml listed at least one database file which was not found
4      The SetConfig.xml listed at least one corrupt database file
5      The SetConfig.xml listed at least one database with an unsupported version
6      The SetConfig.xml listed at least one database which is not supported (i.e.
DEU CERTIFIED)
7      No valid unlock code for a database file
8      The SetConfig.xml listed at least one database at least two times
9      The MaxMemoryUsageMB setting in SetConfig.xml was too small to fulfil all preloading settings and/or the CacheSize setting
10     The environmental settings (for example, OperatingSystem) in at least one Unlock Code are incompatible with the current machine
11     The SetConfig.xml contained at least one unsupported type of unlock code
100    An input element or line which already had content was overwritten
101    The AddressComplete input has too many lines; extra lines will be ignored for further processing
102    At least one character sequence of a string is not valid (i.e. contains control codes or violates some constraint); these sequences are replaced by spaces
200    The output buffer is too small; the output was written, but truncated
201    At least one character of the output could not be encoded in the chosen encoding; these characters were replaced by an underscore ('_')
300    The engine usage period has expired or is not activated yet
301    The unlock code for a database file has expired or is not activated yet
400    Address lines and/or Address Complete were given on input; this part of the input was ignored
401    More than 10 lines were given via FormattedAddressLines or AddressComplete as input; the lines beyond 11 were ignored
500    The MaxResultCount in Parameters.xml was larger than the value in SetConfig.xml; for this reason it was reduced to the value in SetConfig.xml
900    No database at all was found, probably because the path was wrong
901    No database at all was opened, probably because the path was wrong and/or no valid unlock code was given
902    Error while attempting to open at least one of the extra CASS DBs

5.32.3 Errors

The requested operation was not executed:

Code   Description
1000   A pointer parameter was NULL
1001   A function parameter was 0
1002   A NULL pointer to an object was used (not relevant for C API)
1003   Two XMLs were given (as string and within a file); only one must be given
1004   The output buffer size is not valid (i.e. 0)
1005   Buffer misalignment; an AD_WCHAR* points to an odd address
1100   A parameter is out of range or illegal
1101   An XML string is invalid
1200   The character sequence of a string is not valid (i.e. contains control codes or violates some constraint)
1201   The encoding parameter did not match the character size of the API call, i.e. UCS2 (16 bit) vs. char (8 bit)
1300   No SetConfig.xml was given as parameter for AD_Initialize
1301   The engine has already been initialized
1302   AD_DeInitialize() failed because not all AddressObjects have been released
1400   No AddressObject is available (all AddressObject handles have already been obtained via AD_GetAddressObject())
1401   The passed AddressObject handle is not valid
1500   A database file has not been found
1501   A database file is invalid/corrupt
1502   No valid unlock code for a database file
1503   A database file has a non-supported version
1600   A feature has not been unlocked
1700   The country could not be identified or is fundamentally unsupported
1701   The country is not supported for this processing mode and type of input
1800   Results are available; for this reason no AO modification is allowed
1801   XML and direct API calls were used intermixed when setting the input data of an AddressObject
1802   AD_Process() has not been called successfully; no result is available
1803   The attempted operation was invalid, i.e.
trying to set incompatible address elements
1900   The result index parameter is out of range (must be >= 1)
1901   The output buffer is too small to hold the result; no output was written

5.32.4 Critical Errors

No further calls, except possibly AD_Initialize() or AD_DeInitialize(), should be made to the engine:

Code    Description
-1300   The engine has not yet been initialized, need to call AD_Initialize()
-1600   No valid unlock code was given
-1601   The engine usage period has expired or is not activated yet
-1602   A clock inconsistency has been detected
-9900   A memory allocation request failed
-9901   A file operation failed

5.32.5 Very Critical Errors

Very Critical Errors should only occur under highly adverse circumstances. No further calls, except possibly AD_DeInitialize(), should be made to the engine. Please report any occurrence of these errors:

Code     Description
-10000   Some unknown exception has been thrown; this event should never occur
-10001   Some internal assertion has failed; this event should never occur
-10002   Some internal error has been encountered; this event should never occur

5.33 OptimizationLevel

Informatica AddressDoctor processing allows setting the “OptimizationLevel” attribute in Parameters.xml (see the DTD in chapter 10.1) upon AD_Initialize() for controlling the trade-off between processing speed and quality:

- NARROW: The parser will honor input assignment strictly, with the exception of separation of House Number from Street information.
- STANDARD: The parser will separate address elements more actively, for example:
  o Province will be separated from Locality information
  o PostalCode will be separated from Locality information
  o House Number will be separated from Street information
  o SubBuilding will be separated from Street information
  o DeliveryService will be separated from Street information
  o SubBuilding will be separated from Building information
  o Locality will be separated from PostalCode information
- WIDE: Parser separation will happen similarly to STANDARD, but additionally up to 10 parsing candidates will be passed to validation for processing. Validation will widen its search tree and take additional reference data entries into account for matching.

Note that adjusting “OptimizationLevel” might have no effect for countries that lack the postal reference data information required for the kind of separation described above.

Increasing separation granularity from NARROW to STANDARD already consumes some processing power. The major impact on processing speed, however, comes from validation processing a larger search tree with “OptimizationLevel” WIDE, which increases the number of data accesses and comparisons in an attempt to make the most of the input data given. A recommended batch usage pattern for Informatica AddressDoctor (assuming rather low levels of address quality) is therefore:

- Run a quick sweep through your data using the COUNTRYRECOGNITION process mode (see chapter 5.11.8) to separate out those records lacking country information, which might have to be amended manually before further processing.
- Do a fast check of overall record quality using “OptimizationLevel” NARROW to identify the valid or correctable records and separate out all records that have not resulted in a V or C “ProcessStatus” value (see chapter 5.17).
- Feed those problematic records back into Informatica AddressDoctor, processing them with “OptimizationLevel” WIDE to see what might be salvaged, indicated by a V, C or I4 “ProcessStatus” value.

5.34 Preloading

Performance is often critical when deploying Informatica AddressDoctor with large databases. Typically, the I/O subsystem is the slowest component in a system. As memory prices have fallen sharply, users can now afford machines with a lot of installed memory. To utilize the available memory for performance optimization, Informatica AddressDoctor offers the “PreloadingType” attribute for each DataBase element. It allows loading Informatica AddressDoctor reference databases (.md files) into the main memory of the computer. The following preloading types are available:

- No preloading (PreloadingType="NONE", the default)
- Partial preloading (PreloadingType="PARTIAL")
- Full preloading (PreloadingType="FULL")

Partial preloading will load the metadata and indexing structures into memory. The reference data itself will remain on the hard drive. Partial preloading offers some performance enhancements and is an alternative when not enough memory is available to fully load the desired databases.

Full preloading will move the entire reference database into memory. This may need a significant amount of memory for countries with large databases such as the USA or the United Kingdom, but it will increase the processing speed significantly. However, there are conditions where full preloading can have a negative impact on speed. See chapter 6.25 for details on this topic.

Note that Informatica AddressDoctor itself requires memory (see chapter 2.3) in addition to the memory used for preloading.

The “PreloadingType” attribute can be set per database as a configuration parameter of the AD_Initialize() call of Informatica AddressDoctor. If no preloading type is explicitly set, the default preloading (“NONE”) will be used.
With version 5.1.4, Memory Mapped Files have been introduced as the new default preloading mechanism (PreloadingMethod="MAP" in SetConfig.xml, see Appendix 10.1 for the DTD). Even though Informatica AddressDoctor continues to support the preloading method LOAD (PreloadingMethod="LOAD"), its use is discouraged for new deployments.

When using the default "MAP" method, the engine uses the file mapping mechanism of the operating system. To actually force the file contents into memory, the data is touched (read) once upon AD_Initialize(). The "LOAD" mechanism on the other hand uses a memory allocation call and then reads the .md file data into the allocated memory block (see chapter 5.37 also).

If enough physical memory is actually present, the behavior of both methods, including speed, is practically identical. One difference is that the OS will typically write-protect mapped data, thereby possibly masking certain bugs. The methods differ mainly in low memory conditions, where the OS either starts discarding the mapped data ("MAP") or swaps the loaded data out to disk ("LOAD").

Memory Mapping has two advantages over the LOAD preloading method:

- In multi-process conditions (multiple processes running Informatica AddressDoctor using a common set of .md files) the operating system will load the data into main memory only once, thus sharing preloaded reference databases between separate processes.
- The operating system will never write reference data contents to the paging file in low memory conditions (but they might get dropped from the file system cache; if the data is needed later on, it is simply read from disk again).

However, due to larger alignment requirements of the OS, "MAP" will use up more virtual memory space (2-3% more for all files). As "MAP" is the default, "PreloadingMethod" may be omitted if enabling "MAP" is intended.
Since large amounts of memory may be allocated during preloading, with significant amounts of data moved into memory, it might take some time to load the databases. Databases will be preloaded in the order they are passed via SetConfig.xml (see the respective DTD in Appendix 10.1) on AD_Initialize().

The following information is available through AD_GetConfigSettingsXML() to check which databases have been successfully preloaded after the AD_Initialize() call:

- CountryISO3
- Type (BATCH_INTERACTIVE | FASTCOMPLETION | CERTIFIED | GEOCODING | GEOCODING_ARRIVAL_POINT | GEOCODING_PARCEL_CENTROID | CAMEO | ADDRESS_CODE_LOOKUP)
- Path
- Status
- Size
- Version
- StartDate
- ExpirationDate
- UnlockStartDate
- UnlockExpirationDate
- ReleaseDate
- DataDate
- Encoding
- PreloadingType (FULL | PARTIAL | NONE)
- PreloadingSize

If a database could not be found or preloaded for some reason, no Database section is listed for the corresponding country ISO code. For all preloaded databases there will be a Database section which contains a “PreloadingType” attribute specifying the actual preloading type.

Resetting the preloading parameters after the AD_Initialize() function has been called is only possible by issuing AD_DeInitialize() first (preceded by AD_ReleaseAllAddressObjects() for releasing all AddressObjects, see chapter 6.1).

5.35 Caching

Caching reserves a certain portion of “MaxMemoryUsageMB” (see the SetConfig.xml DTD in Appendix 10.1) for speeding up file system lookups in reference data that has not been preloaded. The amount of memory reserved in this way is controlled by the “CacheSize” attribute in SetConfig.xml (passed upon AD_Initialize()); valid settings are NONE, SMALL and LARGE. Using the standard setting of “LARGE” is always recommended, unless all reference data needed is preloaded (so that “NONE” may be used) or the memory footprint needs to be reduced via the “SMALL” or “NONE” setting.
However, “NONE” should be avoided unless memory is extremely scarce. The size of the cache may be determined through AD_GetConfigSettingsXML(). The actual size of the cache may be less than requested if not enough memory is available (i.e. “SMALL”, although “LARGE” was requested).

5.36 Multithreading

The Informatica AddressDoctor API is multi-threading-safe: any number of threads may call any of the API functions at any time without having to fear a crash due to data corruption. However, calling multiple API functions from different threads at the same time using the same AddressObject must be strictly avoided; such a call sequence is typically a programming error.

While Informatica AddressDoctor enables benefitting from multi-core processor architectures, the actual thread handling is strictly in the domain of the calling application: no threads are created or destroyed by the engine, and there are no API functions to process more than one address.

The number of threads which the engine allows to process addresses in parallel (by calling AD_Process() from a separate thread per address) is configurable (the default is 1). If more threads than configured using “MaxThreadCount” (see SetConfig.dtd) call AD_Process() at the same time, the additional threads are blocked until other threads currently calling AD_Process() return. Note that such blocking only influences the timing and sequence of the AD_Process() calls, but not the data processing and its outcome.
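The blocking behavior described above can be pictured as a counting gate in front of the processing call. The following C sketch is only an analogy built on POSIX threads: the names call_gate, gate_enter(), gate_leave() and run_callers() are hypothetical and not part of the AddressDoctor API, the busy loop merely stands in for an AD_Process() call, and nothing here is meant to describe how the engine is implemented internally.

```c
#include <pthread.h>

/* Hypothetical gate that admits at most 'limit' callers at a time.
   'limit' plays the role of "MaxThreadCount" in this analogy. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  slot_free;
    int limit;   /* maximum number of concurrent callers */
    int active;  /* callers currently inside the guarded section */
    int peak;    /* highest value 'active' has reached so far */
} call_gate;

static void gate_init(call_gate *g, int limit)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->slot_free, NULL);
    g->limit = limit;
    g->active = 0;
    g->peak = 0;
}

/* Blocks the calling thread until fewer than 'limit' callers are active. */
static void gate_enter(call_gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->active >= g->limit)
        pthread_cond_wait(&g->slot_free, &g->lock);
    g->active++;
    if (g->active > g->peak)
        g->peak = g->active;
    pthread_mutex_unlock(&g->lock);
}

/* Frees one slot and wakes up one waiting caller, if any. */
static void gate_leave(call_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->active--;
    pthread_cond_signal(&g->slot_free);
    pthread_mutex_unlock(&g->lock);
}

static void *worker(void *arg)
{
    call_gate *g = (call_gate *)arg;
    gate_enter(g);
    /* Stand-in for the real work; a real caller would process
       one address here. */
    for (volatile long i = 0; i < 200000; ++i) { }
    gate_leave(g);
    return NULL;
}

/* Starts up to 64 calling threads against a gate with the given limit
   and returns the highest number of callers that were active at once. */
int run_callers(int callers, int limit)
{
    pthread_t threads[64];
    call_gate g;
    int started = 0;

    if (callers > 64)
        callers = 64;
    gate_init(&g, limit);
    for (int i = 0; i < callers; ++i) {
        if (pthread_create(&threads[started], NULL, worker, &g) == 0)
            started++;
    }
    for (int i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);
    pthread_cond_destroy(&g.slot_free);
    pthread_mutex_destroy(&g.lock);
    return g.peak;
}
```

Calling run_callers(16, 4) starts 16 callers, but the returned peak never exceeds 4. As in the description above, the gate only changes when each caller gets to run, not what the caller computes.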
The figure below illustrates parallel address processing in a multi-threaded environment with n calling threads, but “MaxThreadCount” set to 4:

[Figure: calling threads for Address 1 to Address n arriving at AddressDoctor 5; some are processing, the others are waiting.]

Currently there is no similar limitation on calls to any other API functions which have an AddressObject handle as parameter; however, a limit like the one for AD_Process() may be imposed in the future. Such a limit would be fully transparent to the calling thread(s).

“MaxThreadCount” should normally not be set to a value larger than the number of available cores/CPUs, possibly minus one (to allow for operating system overhead), as this is unlikely to increase performance. For the moment, a practical maximum value for “MaxThreadCount” of 1024 is enforced in SetConfig.xml (see the DTD in chapter 10.1 for reference). If the maximum number of AddressObjects as set by “MaxAddressObjectCount” is smaller than “MaxThreadCount”, “MaxThreadCount” is internally reduced to the number specified by “MaxAddressObjectCount”, as no more parallel calls to AD_Process() could be made anyway. The actual value of “MaxThreadCount” can be determined by calling AD_GetConfigSettingsXML().

It is recommended to set “MaxAddressObjectCount” to the number of threads set with “MaxThreadCount”. However, depending on the implementation, 2 AddressObjects per thread are necessary if a double-buffering mechanism is employed.

The largest performance gains (the best scalability) will be achieved in a multi-core environment with full preloading for all accessed databases, as otherwise the multiple threads will be blocked frequently by calls to the file system. In fact, this effect may become so dominant that the scalability in most relevant cases will be significantly reduced if the accessed databases are not preloaded.

Note: The term scalability refers to the speedup factor which is achieved when trying to utilize additional cores/CPUs.
Obviously the best possible speedup factor for N cores/CPUs as opposed to using only one core would be N, that is, N times more addresses could be processed per hour when utilizing N cores/CPUs instead of one. In reality, such perfect scaling is almost never achieved, either because only parts of the called functions can operate in parallel or because there is contention for some system resource(s) such as the front side bus, the file system or memory allocation functions.

The internal design of Informatica AddressDoctor allows good or even very good scaling if the computer system itself is designed appropriately (i.e. large local caches for each core, fast memory buses) and blocking due to file system contention is avoided.

5.37 Memory Management

Informatica AddressDoctor handles different types of objects, such as address objects, preloaded reference address databases, and caches, in its memory. When planning memory for Informatica AddressDoctor, you must consider these different objects, each of which has specific memory requirements.

You can divide the memory requirements of Informatica AddressDoctor into the following blocks:

- General memory block. Used for general management functions. Typically, the general memory block size is 7 MB.
- Thread memory block. Used for address processing and validation routines. As many thread memory blocks are created as the number of simultaneous threads your Informatica AddressDoctor is configured to handle. The size of a thread memory block is about 38 MB for 32-bit systems and 48 MB for 64-bit systems.
- Address object memory block. Used to store the address objects defined. As many address object memory blocks are created as the number of address objects your Informatica AddressDoctor is configured to handle at any given time.
The size of an address object memory block is about 3.7 MB + (0.24 MB x the value set for MaxResultCount) in the case of 32-bit systems. For 64-bit systems, the size of an address object memory block is about 4.8 MB + (0.24 MB x the value set for MaxResultCount).
- Memory block reserved for caching. Informatica AddressDoctor reserves one cache memory block for each of the validation or processing threads.
- Memory blocks for preloading reference address databases. The memory required depends on the number and size of the databases that you want to preload.
- Unallocated memory block.

The following figure gives a schematic overview of the memory layout used for those different object types:

[Figure: memory layout within MaxMemoryUsageMB, consisting of the General Memory Block, Thread 1 to Thread z Memory Blocks, AddressObject 1 to AddressObject y Memory Blocks, the Memory Block reserved for Caching, Preloaded Country ISO1 to ISOx, and Unallocated Memory Space.]

You can configure the MaxMemoryUsageMB parameter to specify the maximum available memory for Informatica AddressDoctor. Memory allocation for the different blocks is controlled by the values you set for the following parameters:

- MaxThreadCount. The maximum number of threads that Informatica AddressDoctor can process simultaneously. The value set for this parameter controls the number of thread memory blocks and thus the total memory allocation for the thread blocks.
- MaxAddressObjectCount. The maximum number of address objects that Informatica AddressDoctor can store. You can set a maximum of double the number configured for MaxThreadCount. The value you set for MaxAddressObjectCount controls the number of address object memory blocks and thus the total memory allocation for the address objects.
- CacheSize. The memory reserved for caching purposes.
If the CacheSize parameter is set to None, no memory is allocated for caching. When CacheSize is set to Small, Informatica AddressDoctor allocates a 0.4 MB cache memory block for each of the threads. When CacheSize is set to Large, Informatica AddressDoctor allocates a 0.75 MB cache memory block for each of the threads. For example, if MaxThreadCount is set to 4 and CacheSize to Small, Informatica AddressDoctor allocates a total of 1.6 MB for cache memory blocks.

5.37.1 Calculating Memory Requirements

If the Informatica AddressDoctor configuration on a 32-bit system includes MaxThreadCount=4, MaxAddressObjectCount=8, CacheSize=SMALL, and MaxResultCount=20, you can calculate the dynamic memory requirement as follows:

    7 + 8 x (3.7 + 20 x 0.24) + 4 x (38 + 0.4) = 228.6 MB

where 7 is the general memory block size; 8, the number of address objects; 3.7, the size of an address object memory block; 20, the value set for MaxResultCount; 0.24, the size of a result object; 4, the number of threads; 38, the size of the thread block; and 0.4, the cache memory block size when CacheSize is set to small. To calculate the total memory requirement, add the total size of the reference address databases that you want to preload.

If the Informatica AddressDoctor configuration on a 64-bit system includes MaxThreadCount=6, MaxAddressObjectCount=6, CacheSize=LARGE, and MaxResultCount=100, you can calculate the dynamic memory requirement as follows:

    7 + 6 x (4.8 + 100 x 0.24) + 6 x (48 + 0.75) = 472.3 MB

where 7 is the general memory block size; 6, the number of address objects; 4.8, the size of an address object memory block; 100, the value set for MaxResultCount; 0.24, the size of a result object; 6, the number of threads; 48, the size of the thread block; and 0.75, the cache memory block size when CacheSize is set to large. To calculate the total memory requirement, add the total size of the reference address databases that you want to preload.
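The two calculations above follow the same pattern, so they can be wrapped in a small helper. The C sketch below is only an illustration: the type and function names (ad_cache_size, ad_memory_config, ad_estimate_dynamic_memory_mb) are hypothetical and not part of the AddressDoctor API, and the block sizes are the approximate figures quoted in this chapter rather than values queried from the engine.

```c
/* Hypothetical helper types; not part of the AddressDoctor API. */
typedef enum { AD_CACHE_NONE, AD_CACHE_SMALL, AD_CACHE_LARGE } ad_cache_size;

typedef struct {
    int is_64bit;                  /* 0 = 32-bit system, 1 = 64-bit system */
    int max_thread_count;          /* value of MaxThreadCount */
    int max_address_object_count;  /* value of MaxAddressObjectCount */
    int max_result_count;          /* value of MaxResultCount */
    ad_cache_size cache_size;      /* value of CacheSize */
} ad_memory_config;

/* Estimates the dynamic memory requirement in MB using the approximate
   block sizes from chapter 5.37. Preloaded databases are not included
   and must be added separately. */
double ad_estimate_dynamic_memory_mb(const ad_memory_config *cfg)
{
    const double general_mb        = 7.0;
    const double thread_mb         = cfg->is_64bit ? 48.0 : 38.0;
    const double address_object_mb = cfg->is_64bit ? 4.8 : 3.7;
    const double result_mb         = 0.24;
    double cache_mb = 0.0;  /* per-thread cache block; 0 for CacheSize NONE */

    if (cfg->cache_size == AD_CACHE_SMALL)
        cache_mb = 0.4;
    else if (cfg->cache_size == AD_CACHE_LARGE)
        cache_mb = 0.75;

    return general_mb
         + cfg->max_address_object_count
             * (address_object_mb + cfg->max_result_count * result_mb)
         + cfg->max_thread_count * (thread_mb + cache_mb);
}
```

With the 32-bit configuration from the first example (4 threads, 8 address objects, MaxResultCount 20, CacheSize SMALL) the helper yields about 228.6 MB, and with the 64-bit configuration from the second example about 472.3 MB, matching the figures above.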
6. How do I…

6.1 …initialize Informatica AddressDoctor?

AD_Initialize() or AD_InitializeW() must be called to actually initialize the engine: It evaluates the settings and configures the engine accordingly (see chapter 5.6 for an overview). Only after this function has returned successfully may AD_GetAddressObject() or any other functions be called. If the engine was not initialized properly, all Informatica AddressDoctor API functions will produce a return code of -1300 (see chapter 5.32 for reference).

For example:

    AD_Initialize(
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n"
        "<SetConfig>\n"
        "<General />\n"
        "<UnlockCode>(Enter Code here)</UnlockCode>\n"
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n"
        "</SetConfig>\n",
        NULL, NULL, NULL
    );

Or in Java:

    // Initialize the Engine using the 'Direct' API
    AddressDoctor.initialize(
        "<?xml version='1.0' encoding='UTF-16LE'?>\n" +
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n" +
        "<SetConfig>\n" +
        "<General WriteXMLEncoding='UTF-16LE' />\n" +
        "<UnlockCode>(Enter Code here)</UnlockCode>\n" +
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n" +
        "</SetConfig>",
        null, null, null
    );

Alternatively, the SetConfig XML string can be stored in an external file.
In this case, the initialize call looks like this:

    AD_Initialize(
        NULL,
        "SetConfig.xml",
        NULL, NULL
    );

Or in Java:

    // Initialize the Engine using the 'XML' API
    AddressDoctor.initialize(
        null,
        "SetConfig.xml",
        null, null
    );

The following return codes (see section 5.32 for an explanation of return codes) are typical warnings and errors returned by AD_Initialize():

    AD_SC_WRN_INIT_UNLOCKCODE_CORRUPT (1)
    AD_SC_WRN_INIT_UNLOCKCODE_EXPIRED (2)
    AD_SC_WRN_INIT_DB_NOT_FOUND (3)
    AD_SC_WRN_INIT_DB_CORRUPT (4)
    AD_SC_WRN_INIT_DB_UNSUPPORTED_VERSION (5)
    AD_SC_WRN_INIT_DB_NOT_SUPPORTED (6)
    AD_SC_WRN_INIT_DB_NOT_UNLOCKED (7)
    AD_SC_WRN_INIT_MULTIPLE_DB_ENTRIES (8)
    AD_SC_WRN_MAXMEMORYUSAGE_TOO_SMALL (9)
    AD_SC_WRN_INIT_UNLOCKCODE_ENVIRONMENT_MISMATCH (10)
    AD_SC_WRN_MAXRESULTCOUNT_REDUCED (500)
    AD_SC_ERR_INIT_NO_DB_FOUND (900)
    AD_SC_ERR_INIT_NO_DB_OPENED (901)
    AD_SC_ERR_EXTRA_CASS_DBS_ERROR (902)

If one of these codes is returned, the engine is initialized, but a potential problem occurred which needs to be investigated. For that purpose, retrieving GetConfig.xml via AD_GetConfigSettingsXML() is strongly advised, as its contents provide additional information about problems with unlock codes and/or database files.

AD_DeInitialize() must be called last to de-initialize the engine; the engine is then ready to be initialized again. All AddressObjects must have been released by calling AD_ReleaseAddressObject() or AD_ReleaseAllAddressObjects() before calling AD_DeInitialize(); see chapter 4.1 for a full example (including Java).

6.2 …determine Informatica AddressDoctor version?

AD_GetVersion() can be called at any time, even before calling AD_Initialize() / AD_InitializeW(), to retrieve the zero-terminated engine version string in the format x.x.x.x, e.g. "5.0.0.251".

6.3 …specify processing or input parameters and a result format?

An AddressObject has a result format configuration; for possible attributes see Parameters.dtd in chapter 10.1 (the most common parameters are described starting with chapter 5.12). These parameter attributes can be set for each AddressObject individually by calling AD_SetParametersXML(). For example:

    AD_SetParametersXML(hAOHandle,
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>\n"
        "<Parameters>\n"
        "<Process Mode='BATCH' />\n"
        "<AddressElementStandardize>\n"
        "<Country Casing='UPPER' />\n"
        "</AddressElementStandardize>\n"
        "</Parameters>\n",
        NULL, NULL
    );

Or in Java:

    // This code assumes you've already acquired m_oAO as the active AddressObject
    m_oAO.setParametersXML(
        "<?xml version='1.0' encoding='UTF-16LE' ?>\n" +
        "<Parameters>\n" +
        "<Process Mode='BATCH'/>\n" +
        // Java uses UTF-16LE as default encoding for its String method
        "<Input Encoding='UTF-16LE'/>" +
        "<Result Encoding='UTF-16LE'/>" +
        "<AddressElementStandardize> \n" +
        "<Country Casing='UPPER' />\n" +
        "</AddressElementStandardize> \n" +
        "</Parameters>",
        null
    );

As shown for SetConfig.xml in chapter 6.1, alternatively a file name may be provided:

    AD_SetParametersXML(
        hAOHandle,
        NULL, NULL,
        "Parameters.xml"
    );

Or in Java:

    // This code assumes you've already acquired m_oAO as the active AddressObject
    m_oAO.setParametersXML(
        null,
        "Parameters.xml"
    );

Instead of setting attributes for each AddressObject individually, Parameters.xml may already be passed on AD_Initialize() (refer to the API documentation in chapter 10.2 for details), thus applying global defaults to all AddressObjects that do not have individual parameters set via the method described above.
For example:

    AD_Initialize(
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n"
        "<SetConfig>\n"
        "<General />\n"
        "<UnlockCode>(Enter Code here)</UnlockCode>\n"
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n"
        "</SetConfig>\n",
        NULL,
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>\n"
        "<Parameters>\n"
        "<Process Mode='BATCH' />\n"
        "<AddressElementStandardize>\n"
        "<Country Casing='UPPER' />\n"
        "</AddressElementStandardize>\n"
        "</Parameters>\n",
        NULL
    );

Or in Java:

    AddressDoctor.initialize(
        "<?xml version='1.0' encoding='UTF-16' ?>"+
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>"+
        "<SetConfig><General WriteXMLEncoding='UTF-16' />"+
        " <UnlockCode>(Enter Code here)</UnlockCode>"+
        " <DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE'"+
        " Path='/ADDB' PreloadingType='NONE' />"+
        "</SetConfig>",
        null,
        "<?xml version='1.0' encoding='UTF-16' ?>"+
        "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>"+
        "<Parameters WriteXMLEncoding='UTF-16'>"+
        " <Input Encoding='UTF-16' />"+
        " <Result Encoding='UTF-16' />"+
        "</Parameters>",
        null);

Again, the Parameters XML string can be stored in an external file (as is the case for SetConfig, see above).
Then the AD_Initialize() call would look like the following:

    AD_Initialize(
        NULL,
        "SetConfig.xml",
        NULL,
        "Parameters.xml"
    );

Or in Java:

    AddressDoctor.initialize(
        null,
        "SetConfig.xml",
        null,
        "Parameters.xml"
    );

Note that adjusting parameters might have no effect for countries that lack the postal reference data information required for them to make a difference; examples would be “OptimizationLevel” (chapter 5.33), “PreferredLanguage” (chapter 5.12.2) or “MatchingScope” (chapter 5.12.5). For a reference on country coverage see: http://www.addressdoctor.com/en/countries-data/country-list.html

6.4 …handle unlock codes?

To validate addresses from a country, an unlock code is required. Each code unlocks a number of countries for a specified time. The unlock code is passed to the initialize function (see chapter 6.1). The function will return an error if the code is no longer valid or not correct at all. For more details on error return codes see chapter 5.32.

Note that separate unlock codes are required for validation, Address Code Lookup, supplementary, geocoding, and CAMEO databases.

The use of multiple unlock codes is supported. The unlock codes have to be passed to the initialize function one after another, as shown in the example below. If there is more than one unlock code for a country, the one with the longest valid date is used.
Outdated unlock codes are ignored as long as there is one code that is still valid, for example:

- Code A unlocks DEU and USA validation until 31.12.2009
- Code B unlocks CHE and USA validation until 31.12.2010
- Code C unlocks CHE and USA geocoding until 31.12.2010

In this case DEU validation will be unlocked until 31.12.2009, while CHE and USA validation and geocoding continue to be unlocked until 31.12.2010.

Note that unlock codes also carry a start date and will be invalid before that date. Information on unlock codes may be queried using AD_GetConfigSettingsXML(), see chapter 6.6 for details.

The following very simple code example shows how to use multiple unlock codes:

    AD_Initialize(
        "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
        "<SetConfig>\n"
        "<General />\n"
        "<UnlockCode>(Enter Code A here)</UnlockCode>\n"
        "<UnlockCode>(Enter Code B here)</UnlockCode>\n"
        "<UnlockCode>(Enter Code C here)</UnlockCode>\n"
        "<DataBase CountryISO3='USA' Type='GEOCODING' Path='/ADDB' PreloadingType='NONE' />\n"
        "<DataBase CountryISO3='CHE' Type='GEOCODING' Path='/ADDB' PreloadingType='NONE' />\n"
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE' />\n"
        "</SetConfig>\n",
        NULL, NULL, NULL
    );

Or in Java:

    AddressDoctor.initialize(
        "<?xml version='1.0' encoding='UTF-16LE'?>\n" +
        "<!DOCTYPE SetConfig SYSTEM 'SetConfig.dtd'>\n" +
        "<SetConfig>\n" +
        "<General WriteXMLEncoding='UTF-16LE'/>\n" +
        // Engine & DB Unlock Code
        "<UnlockCode>(Enter Code A here)</UnlockCode>\n" +
        "<UnlockCode>(Enter Code B here)</UnlockCode>\n" +
        "<UnlockCode>(Enter Code C here)</UnlockCode>\n" +
        "<DataBase CountryISO3='USA' Type='GEOCODING' Path='/ADDB' PreloadingType='NONE'/>\n" +
        "<DataBase CountryISO3='CHE' Type='GEOCODING' Path='/ADDB' PreloadingType='NONE'/>\n" +
        "<DataBase CountryISO3='ALL' Type='BATCH_INTERACTIVE' Path='/ADDB' PreloadingType='NONE'/>\n" +
        "</SetConfig>",
        null, null, null);

6.5 …configure reference databases?

While for convenience reasons the virtual ISO code “ALL” is provided for defining default settings, you may adjust paths and preloading settings (see chapter 5.34) for each country reference database type separately. The following lines would be an example of a non-trivial SetConfig.xml (see chapter 6.1 also and the DTD in appendix 10.1):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE SetConfig SYSTEM "C:/AddressDoctor/DTD/SetConfig.dtd">
    <SetConfig>
      <General WriteXMLEncoding="UTF-16" MaxMemoryUsageMB="2048" MaxAddressObjectCount="10" MaxThreadCount="2"></General>
      <UnlockCode>Address Validation Unlock Code</UnlockCode>
      <UnlockCode>Geocoding Unlock Code</UnlockCode>
      <DataBase CountryISO3="USA" Type="CERTIFIED" Path="C:/AddressDoctor/DB/CASS" PreloadingType="FULL"></DataBase>
      <DataBase CountryISO3="CAN" Type="CERTIFIED" Path="C:/AddressDoctor/DB/SERP" PreloadingType="FULL"></DataBase>
      <DataBase CountryISO3="USA" Type="SUPPLEMENTARY" Path="C:/AddressDoctor/DB/Enrichment" PreloadingType="PARTIAL"></DataBase>
      <DataBase CountryISO3="GBR" Type="SUPPLEMENTARY" Path="C:/AddressDoctor/DB/Enrichment" PreloadingType="PARTIAL"></DataBase>
      <DataBase CountryISO3="ALL" Type="GEOCODING" Path="C:/AddressDoctor/DB/Geocoding" PreloadingType="NONE"></DataBase>
      <DataBase CountryISO3="ALL" Type="CAMEO" Path="C:/AddressDoctor/DB/CAMEO" PreloadingType="NONE"></DataBase>
      <DataBase CountryISO3="ALL" Type="BATCH_INTERACTIVE" Path="C:/AddressDoctor/DB" PreloadingType="NONE"></DataBase>
      <DataBase CountryISO3="ALL" Type="FASTCOMPLETION" Path="C:/AddressDoctor/DB" PreloadingType="NONE"></DataBase>
    </SetConfig>

Note that any country specific settings must precede the “DataBase” elements with CountryISO3="ALL" to be actually applied (the effective database settings and their unlock status may be verified through GetConfig.xml, as described in chapter 6.6; specifically check for
“DataBase” element entries with Status attributes other than "ACTIVE"). In case of conflicting “DataBase” elements, the first occurrence in SetConfig.xml will always have precedence.

There are a few notable deviations from the behaviour that is standard for most countries:

- For CountryISO3="USA" and Type="CERTIFIED" the GetConfig.xml output will only list a subset of the available reference database files as “DataBase” elements with Type="CERTIFIED". Also, USA5BI.md must always be available, as this database is the basis for CASS processing (see chapter 6.24). Furthermore, US CERTIFIED mode will not work without preloading, and full preloading will always be enforced on some of the CASS databases described in chapter 3.3.2, irrespective of the settings.
- For CountryISO3="CAN" and Type="CERTIFIED" the GetConfig.xml output will now list CAN5C1.MD as the database, as a separate database is now necessary for certified processing (see chapter 6.24.2).
- For CountryISO3="JPN" and Type="FASTCOMPLETION" the GetConfig.xml output will only list one “DataBase” element, with Type="BATCH_INTERACTIVE", due to a slightly different internal database layout.

6.6 …determine the current engine settings?

On the global Informatica AddressDoctor level, calling AD_GetConfigSettingsXML() will return a GetConfig.xml with the engine configuration that was set upon calling AD_Initialize(), see chapter 6.1. Accordingly, calling AD_GetParametersSettingsXML() will return a Parameters.xml with the engine default set of parameters, which again were set upon calling AD_Initialize().

In contrast, the parameters effectively applied when processing each AddressObject (which are identical to these global settings unless explicitly set using AD_SetParametersXML(), see the preceding chapter 6.3) can be queried via AD_GetParametersXML(). See the API reference in chapter 10.2 for details.

6.7 ...assign an address to the AddressObject?
In order to achieve the best possible processing, it is important to understand the structure of your input data. One can then decide on the best way to input the data into the Informatica AddressDoctor AddressObject (see chapter 5.7). In general, address data will exist as one of the following:

- Fielded data. In some databases, particularly ones driven by direct input, all the data may be fielded (for example, Street, City, State, ZIP/PostCode are all stored in individual fields).
- Partially fielded data. In many databases, address data has been partially broken out, for example into a separate state or postal code field, while some of the address is left in generic “address lines”.
- Unfielded data. This could be data derived from scanning address labels, for example.

6.7.1 General Overview

If an old input address is still present, it must be cleared by a call of AD_ClearData(). The input address data can be set either with the direct API functions or via the AD_SetInputDataXML() function.
Example for fielded input (direct API):

    AD_SetInputAddressElement( hAOHandle, "Country", 1, NULL, "Canada" );
    AD_SetInputAddressElement( hAOHandle, "PostalCode", 1, NULL, "G1R 3X2" );
    AD_SetInputAddressElement( hAOHandle, "Locality", 1, NULL, "Toronto" );
    AD_SetInputAddressElement( hAOHandle, "DeliveryService", 1, NULL, "PO Box 1827" );

Or in Java:

    m_oAO.setInputAddressElement("Country", 1, null, "Canada");
    m_oAO.setInputAddressElement("PostalCode", 1, null, "G1R 3X2");
    m_oAO.setInputAddressElement("Locality", 1, null, "Toronto");
    m_oAO.setInputAddressElement("DeliveryService", 1, null, "PO Box 1827");

Example for fielded input (XML API):

    AD_SetInputDataXML( hAOHandle,
        "<?xml version='1.0' encoding='ISO-8859-1'?>\n"
        "<!DOCTYPE InputData SYSTEM 'InputData.dtd'>\n"
        "<InputData>\n"
        "<AddressElements>\n"
        "<Country Item='1' Type='NAME'>SGP</Country>\n"
        "<Locality Item='1' Type='COMPLETE'>Singapore</Locality>\n"
        "<PostalCode Item='1' Type='FORMATTED'>048624</PostalCode>\n"
        "<Street Item='1' Type='COMPLETE'>Raffles Place</Street>\n"
        "<Number Item='1' Type='COMPLETE'>80</Number>\n"
        "<Building Item='1' Type='COMPLETE'>#50-01 UOB Plaza 1</Building>\n"
        "<Organization Item='1' Type='NAME'>AddressDoctor GmbH</Organization>\n"
        "</AddressElements>\n"
        "</InputData>\n"
    );

For comparison in Java:

    m_oAO.setInputDataXML(
        "<?xml version='1.0' encoding='UTF-16'?>"+
        "<!DOCTYPE InputData SYSTEM 'InputData.dtd'>"+
        "<InputData>"+
        "<AddressElements>"+
        " <Key>4711</Key>"+
        " <Country Item='1' Type='NAME'>SGP</Country>"+
        " <Locality Item='1' Type='COMPLETE'>Singapore</Locality>"+
        " <PostalCode Item='1' Type='FORMATTED'>048624</PostalCode>"+
        " <Street Item='1' Type='COMPLETE'>Raffles Place</Street>"+
        " <Number Item='1' Type='COMPLETE'>80</Number>"+
        " <Building Item='1' Type='COMPLETE'>#50-01 UOB Plaza 1</Building>"+
        " <Organization Item='1' Type='NAME'>AddressDoctor GmbH</Organization>"+
        "</AddressElements>"+
        "</InputData>");

6.7.2 Fielded address input

Fully fielded addresses will typically provide the most reliable results when cleansing an address. Even in databases that have the address components in separate columns, it is not uncommon to have the house number and the street name in the same field. The structure may look like this:

    COUNTRY        FIRSTNAME  NAME    STREET           TOWN      STATE  ZIPCODE
    United States  Mark       Myers   7563 Bangor Ave  Hesperia  CA
    United States  Istvan     Edgars  87 MILL LN       New York  NY     10123

An address is still considered to be fielded when house number and street name reside in the same field. In this case the field containing the house number and street name may be assigned to the Street attribute together.

To support environments where databases contain address data broken into discrete fields, Informatica AddressDoctor allows direct input of each address component (including addressing information such as contact, organization, and so on) via the “AddressElements” element of InputData.xml (see the DTD in chapter 10.1). Possible address elements are: Key, Country, Locality, PostalCode, Province, Street, Number, Building, SubBuilding, DeliveryService, Organization and Contact.

Though the data is input to specific address elements, Informatica AddressDoctor can still perform parsing in case the data has been stored in the incorrect fields (depending on the “OptimizationLevel” chosen, see chapter 5.33). Incorrect fielding of data is particularly common for international addresses. If there is a high level of incorrect fielding, it may be desirable to explore other input strategies (pre-processing to correct the fielding, concatenation and partially structured input, and so on).
Note that InputData.xml not only allows assigning an item attribute to each input address element, but also supports flagging each of these items with corresponding type information, down to address sub-element level (see chapter 5.10 for reference).

For example, available types for the “Contact” address element or its sub-elements are: COMPLETE (element), FIRST_NAME (sub-element), MIDDLE_NAME (sub-element), LAST_NAME (sub-element), NAME (sub-element), TITLE (sub-element), FUNCTION (sub-element), SALUTATION (sub-element) and GENDER (sub-element).

Or for “Organization”: COMPLETE (element), NAME (sub-element), DESCRIPTOR (sub-element) and DEPARTMENT (sub-element).

Consequently, you can either assign “AddressDoctor GmbH Support” as one “Organization” address element item of type “COMPLETE”, or split it into sub-element items: “AddressDoctor” with type “NAME”, “GmbH” with type “DESCRIPTOR” and “Support” with type “DEPARTMENT”.

Providing such type attributes on input for each address sub-element (that is, each item) is crucial for correct output formatting of “Contact” and “Organization” addressing information, which is not covered by postal reference data. See the DTD in chapter 10.1 for a complete and up-to-date list of all the “AddressElements” item types supported by Informatica AddressDoctor, noting the limitations described in chapter 5.10.

6.7.3 Partially fielded address input

Databases often contain contact information separate from address data, while the address itself is broken into “address lines”.
For example:

COUNTRY  CUSTOMER     ADDRESS_LINE_1   ADDRESS_LINE_2  CITY      STATE  ZIP
USA      John Smith   7563 Bangor Ave  Suite 107       Hesperia  CA     92345
USA      Vlad Marcos  Acme Products    3198 MARINO ST  El Paso   TX     79925

In this case the address data is input using the fielded address elements where possible (for example, Contact, Province, Locality, Country, PostalCode), and then the “AddressLines” element of InputData.xml (see DTD in chapter 10.1) is used to input the remaining data. Typically, that will involve filling the “DeliveryAddressLines” (DAL) sub-element with input data, but if data is available in that specific format, using the “RecipientLine” and “CountrySpecificLocalityLine” (CSLLN) sub-elements is also possible.

As in the case of fully fielded address data, when data has been partially broken out, the best results are obtained by assigning that data to the appropriate address element.

6.7.4 Unfielded address input

Since unfielded data has no explicit structure (other than line feeds), this input is the most flexible. However, for the same reason, it will also produce the least reliable results.

To populate an AddressObject with unfielded data, developers will need to use the “AddressComplete” element of InputData.xml. The address is simply passed to “AddressComplete” as a set of strings separated by line feeds (see the following example). To return the best results, it is important to set the most appropriate of the following values for the “FormatType” attribute (see chapter 5.13 for details) of the Input element:

ALL
ADDRESS_ONLY
WITH_ORGANIZATION
WITH_CONTACT
WITH_ORGANIZATION_CONTACT
WITH_ORGANIZATION_DEPARTMENT

The use of “AddressComplete” must not be combined with other address input, except for “Country”. In addition, better results will be obtained if the addresses resemble at least some of the structure used in the respective country.
As an example,

John Smith
7563 Bangor Ave
Hesperia CA 92345
USA

yields significantly better results than:

John Smith 7563 Bangor Ave Hesperia CA 92345 USA

A typical database structure might look like this:

ADDRESS_1           ADDRESS_2          ADDRESS_3        ADDRESS_4          ADDRESS_5         ADDRESS_6
John Smith          7563 Bangor Ave    Suite 107        Hesperia CA 92345  USA
AddressDoctor GmbH  Steffen Niehues    Röntgenstr. 9    67133 Maxdorf      Deutschland
Vlad Marcos         c/o Acme Products  123 Main Street  #12                El Paso TX 79925  United States

In case of such data, the FormattedAddressLine (FAL) sub-element of AddressLines might be a more appropriate alternative, which allows for input of up to 19 unfielded address lines.

6.8 …validate an address?

The AddressObject must have been filled with an input address. Processing an address is achieved by calling AD_Process(). The process mode must have been set to BATCH, INTERACTIVE, FAST_COMPLETION, CERTIFIED, or ADDRESSCODELOOKUP; otherwise the default processing mode BATCH is used. Detailed results are retrieved by specific API functions, see chapter 6.11.

Example for C:

AD_Process( hAOHandle );

And for Java:

AddressDoctor.process(m_oAO);

6.9 …parse an address?

The AddressObject must have been filled with an input address. Processing an address is achieved by calling AD_Process(). The process mode must have been set to PARSE. Detailed results are retrieved by specific API functions, see chapter 6.11.
For example:

AD_SetParametersXML( hAOHandle,
    "<?xml version='1.0' encoding='iso-8859-1' ?>\n"
    "<!DOCTYPE Parameters SYSTEM 'Parameters.dtd'>\n"
    "<Parameters>\n"
    "<Process Mode='PARSE'/>\n"
    "</Parameters>\n", NULL );
AD_Process( hAOHandle );

Or in Java:

m_oAO.setParametersXML(
    "<?xml version='1.0' encoding='UTF-16LE' ?>\n" +
    "<Parameters>\n" +
    "<Process Mode='PARSE'/>\n" +
    // Java uses UTF-16LE as default encoding for its String method
    "<Input Encoding='UTF-16LE'/>" +
    "<Result Encoding='UTF-16LE'/>" +
    "</Parameters>", null);
AddressDoctor.process(m_oAO);

6.10 …check the process mode?

After AD_Process() has been called, the “ModeUsed” attribute of the “Result” element allows checking that the process mode used was actually the one intended (see chapter 5.11 for possible process mode fallbacks):

char sResultParameters[32];
AD_GetResultParameter(hAOHandle,"ModeUsed",sResultParameters,sizeof(sResultParameters));

Or in Java:

System.out.println(m_oAO.getResultParameter("ModeUsed"));

6.11 …retrieve a suggested correction?

AD_Process() must have been called beforehand to process the input address; only then are results available. The return code of AD_Process() already gives some indication of fatal errors (for example, country not identified; see section 5.32 on return codes).

When using the direct API, the first step before calling AD_GetResultAddressElement() is always to retrieve the number of results by calling AD_GetResultCount(), while the number of items or lines for a specific element can be retrieved by calling AD_GetResultAddressElementItemCount() or AD_GetResultAddressLineCount(), respectively.
Example (direct API, no error handling):

AD_U32 ulNumResults;
size_t stCurResult;

AD_GetResultCount( hAOHandle, &ulNumResults );
for( stCurResult = 1; stCurResult <= ulNumResults; stCurResult++ )
{
    char sStreet[ 256 ];
    AD_U32 ulNumItems;
    size_t stCurItem;

    /* Number of "Street" items in the current result */
    AD_GetResultAddressElementItemCount( hAOHandle, stCurResult, "Street", &ulNumItems );
    for( stCurItem = 1; stCurItem <= ulNumItems; stCurItem++ )
    {
        AD_GetResultAddressElement( hAOHandle, stCurResult, "Street", stCurItem, "COMPLETE", sStreet, sizeof( sStreet ) );
        printf( "Result %u: Street item %u: %s\n", (unsigned)stCurResult, (unsigned)stCurItem, sStreet );
    }
}

Or in Java:

int NumResults = m_oAO.getResultCount();
int CurResult;
for (CurResult = 1; CurResult <= NumResults; CurResult++)
{
    int NumItems = m_oAO.getResultAddressElementItemCount(CurResult, "Street");
    int CurItem;
    for (CurItem = 1; CurItem <= NumItems; CurItem++)
    {
        System.out.println(m_oAO.getResultAddressElement(CurResult, "Street", CurItem, "COMPLETE"));
    }
}

Example (C XML API, no error handling):

char sResultXML[ 16 * 1024 ];
AD_GetResultXML( hAOHandle, sResultXML, sizeof( sResultXML ) );

Example (Java XML API, no error handling):

String sResultXML = "";
sResultXML = m_oAO.getResultXML();

6.12 ...retrieve the result status and additional information?

For the direct API, AD_GetResultParameter() returns more detailed processing result information (see the code shown in chapter 6.10 for another example), for example, the process status value explained in chapter 5.17:

char sResultParameters[32];
AD_GetResultParameter(hAOHandle,"ProcessStatus",sResultParameters, sizeof(sResultParameters));

Or in Java:

System.out.println(m_oAO.getResultParameter("ProcessStatus"));

To get a detailed status for any specific address element result, AD_GetResultDataParameter() can be called. Likewise, for enrichments you may call AD_GetResultEnrichmentDataParameter().
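When the result is retrieved as XML instead, the same parameters appear as attributes of the “Result” root element. The following Java sketch reads “ProcessStatus”, “ModeUsed” and “CountOverflow” from such a string using the standard JAXP DOM parser. The class ResultSummary and its method names are hypothetical and not part of the product API. The check for a leading “V” or “C” reflects only the “Vx”/“Cx” remark in chapter 6.19; chapter 5.17 remains the reference for all status values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class ResultSummary {
    final String processStatus;
    final String modeUsed;
    final boolean countOverflow;

    private ResultSummary(String processStatus, String modeUsed, boolean countOverflow) {
        this.processStatus = processStatus;
        this.modeUsed = modeUsed;
        this.countOverflow = countOverflow;
    }

    // Parses a result XML string and copies selected attributes of the root element.
    static ResultSummary parse(String resultXml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(resultXml)))
                    .getDocumentElement();
            return new ResultSummary(
                    root.getAttribute("ProcessStatus"),
                    root.getAttribute("ModeUsed"),
                    "YES".equals(root.getAttribute("CountOverflow")));
        } catch (Exception e) {
            throw new IllegalArgumentException("Result XML could not be parsed", e);
        }
    }

    // True for status values starting with "V" or "C" (the "Vx"/"Cx" families).
    boolean isVerifiedOrCorrected() {
        return processStatus.startsWith("V") || processStatus.startsWith("C");
    }

    public static void main(String[] args) {
        ResultSummary summary = parse(
                "<Result ProcessStatus=\"C4\" ModeUsed=\"BATCH\" Count=\"1\" CountOverflow=\"NO\"/>");
        System.out.println(summary.processStatus + " via " + summary.modeUsed);
    }
}
```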
For a list of all parameters available to “Result”, “ResultData” and “ResultEnrichmentData”, see the attributes for those elements of Result.dtd (chapter 10.1). For instance, the “Result” element provides parameter attributes like “ProcessStatus” or “ModeUsed”, while the “ResultData” element provides parameter attributes like “ElementInputStatus” or “ElementResultStatus”.

The XML API, on the other hand, has only a single function, AD_GetResultXML(), which writes a complete result XML text to the passed buffer (see chapter 6.11). The level of detail contained in that XML construct may be influenced using the three attributes “AddressElements” (NONE, STANDARD, DETAILED), “AddressLines” (ON, OFF) and “AddressComplete” (ON, OFF) of the “Result” element in Parameters.xml (see DTD in chapter 10.1 and chapter 6.3).

Note that XML output of all possible address element Types (see chapter 5.10) is only available when the “AddressElements” attribute of the “Result” element is set to “DETAILED”. Many of the types described in the DTD (see chapter 10.1) are only available where supported by the available reference data and thus may vary greatly from country to country and even from address to address. They are primarily provided for analytical purposes for now, while for most practical applications the default result output Type “COMPLETE” is best suited, as returned for “AddressElements” set to “STANDARD”.

6.13 ...retrieve address enrichments?

Informatica AddressDoctor supports the following enrichments (for all Process Modes, except FAST_COMPLETION):

GeoCoding (set EnrichmentGeoCoding="ON")
Point Address Geocoding (set EnrichmentGeoCodingType to “NONE”, “ARRIVAL_POINT”, or “PARCEL_CENTROID”. Default is “ARRIVAL_POINT”)
CAMEO (set EnrichmentCAMEO="ON")
SupplementaryUS (presently providing COUNTY_FIPS_CODE, STATE_FIPS_CODE, MSA_ID, CBSA_ID, FINANCE_NUMBER, RECORD_TYPE, CSMA_ID, TIME_ZONE_CODE, TIME_ZONE_NAME, CENSUS_TRACT_NO, CENSUS_BLOCK_NO, CENSUS_BLOCK_GROUP, PMSA_ID, MCD_ID and PLACE_FIPS_CODE; set EnrichmentSupplementaryUS="ON")
SupplementaryGB (presently providing DELIVERY_POINT_SUFFIX, UDPRN, and ADDRESS_KEY; set EnrichmentSupplementaryGB="ON")
SupplementaryJP (set EnrichmentSupplementaryJP="ON")
SupplementaryRS (set EnrichmentSupplementaryRS="ON")
SupplementaryBR (set EnrichmentSupplementaryBR="ON")
SupplementaryDE (set EnrichmentSupplementaryDE="ON")
SupplementaryZA (set EnrichmentSupplementaryZA="ON")
SupplementaryCH (set EnrichmentSupplementaryCH="ON")
SupplementaryPL (introduced in Version 5.5.0, this enrichment supports Gmina code, Locality and Street TerytIDs for Poland. Set EnrichmentSupplementaryPL="ON")
SupplementaryFR (introduced in Version 5.5.0, this enrichment supports INSEE code for France. Set EnrichmentSupplementaryFR="ON")
SupplementaryAT (introduced in Version 5.5.0, this enrichment supports the PAC code for Austrian addresses. Set EnrichmentSupplementaryAT="ON")
SERP (set EnrichmentSERP="ON")
CASS (set EnrichmentCASS="ON")
SNA (set EnrichmentSNA="ON")
AMAS (set EnrichmentAMAS="ON")
SENDRIGHT (set EnrichmentSENDRIGHT="ON")

Enabled enrichments are processed as the last processing step when calling AD_Process(). To enable Geocoding, for example, the “Process” attribute “EnrichmentGeoCoding” within Parameters.xml (see Appendix 10.1) must be set to ON (default for all enrichments is OFF) and the attribute “EnrichmentGeoCodingType” must be set to NONE, ARRIVAL_POINT, or PARCEL_CENTROID. Respective switches are provided for all enrichments, see the Parameters.xml DTD in Appendix 10.1 for details.
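The switches described above could be assembled into a Parameters fragment along the following lines. This is a minimal Java sketch: the class EnrichmentParameters and its build() method are hypothetical helpers and not part of the product API, and the attribute names are formed by prefixing “Enrichment” to the names from the list above. The XML declaration and DOCTYPE are left out and would be prepended before the string is passed to setParametersXML().

```java
final class EnrichmentParameters {
    // Builds a <Parameters> fragment whose <Process> element carries the process
    // mode and one Enrichment<Name>='ON' attribute per requested enrichment.
    // geoCodingType may be null if no EnrichmentGeoCodingType is to be written.
    static String build(String mode, String geoCodingType, String... enrichments) {
        StringBuilder xml = new StringBuilder();
        xml.append("<Parameters>\n<Process Mode='").append(mode).append('\'');
        for (String name : enrichments) {
            xml.append(" Enrichment").append(name).append("='ON'");
        }
        if (geoCodingType != null) {
            xml.append(" EnrichmentGeoCodingType='").append(geoCodingType).append('\'');
        }
        xml.append("/>\n</Parameters>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("BATCH", "ARRIVAL_POINT", "GeoCoding", "CAMEO"));
    }
}
```

Enrichments that are not listed in the call are simply not written, which leaves them at their default of OFF.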
Enrichments might require an extra unlock code (as is the case for Geocoding and supplementary databases, see chapter 6.4) and will usually require extra database files (see chapter 6.5 for examples).

With the direct API, enrichment results can then be obtained by first calling AD_GetResultEnrichmentElementExists() to check for their existence and then AD_GetResultEnrichmentElement() to actually retrieve them. AD_GetResultEnrichmentDataParameter() is provided to access the enrichment specific result information, like “GeoCodingStatus” (for a list of the available parameter attributes see the elements of Result.dtd in chapter 10.1). When using the XML API, calling AD_GetResultXML() also provides all enabled enrichment results.

For example code see chapter 6.11, also see chapters 5.19 to 2 for GeoCoding, CAMEO, CASS, SERP, AMAS, SNA and the Supplementary status values and chapter 6.24 for details on the certified CASS, SERP, AMAS, SNA and SendRight enrichments.

6.14 ...analyze error conditions?

For C, AD_GetLastError() provides you with the last error return code (see section 5.32 for a return code overview) and AD_GetExtendedErrorMsg() allows access to extended information pertaining to the last error. Error messages often point to configuration issues that are best analyzed by referring to GetConfig.xml or Parameters.xml (see chapter 6.6 on how to obtain those).

For Java, use AddressDoctorException.getExtendedMessage() for the same purpose.
Make sure to wrap Informatica AddressDoctor and AddressObject calls in try/catch blocks for proper exception handling (for a more detailed example see the code in chapter 4.1):

try
{
    AddressDoctor.process(m_oAO);
    int iLastError = AddressDoctor.getLastError();
    System.out.println("Process returned " + iLastError);
}
catch (AddressDoctorException ex)
{
    System.out.println("Exception during process: " + ex.toString());
}

The ConsoleDemo test application in C and Java provided by Informatica AddressDoctor (see chapter 7.1) may prove helpful in analyzing error conditions. Collect the information listed in chapter 9.3 before contacting Informatica AddressDoctor Support.

6.15 ...assign and process addresses in non-Latin script?

This works in much the same way as the examples shown in chapters 6.7 and 6.8. Make sure to input your addresses using the appropriate bit width for the source character set you are using (see chapter 5.8; UTF-16 is typically the safe choice for non-Latin character sets).
Here is an example of a Japanese Kanji address:

<?xml version="1.0" encoding="UTF-16"?>
<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">JAPAN</Country>
  </AddressElements>
  <AddressLines>
    <FormattedAddressLine Line="1">〒 949-7277</FormattedAddressLine>
    <FormattedAddressLine Line="2">新潟県南魚沼市国際町 777 番地</FormattedAddressLine>
    <FormattedAddressLine Line="3">国際大学</FormattedAddressLine>
  </AddressLines>
</InputData>

The Rōmaji result, illustrating the Informatica AddressDoctor transliteration capabilities via PreferredScript set to “LATIN” in Result.xml (see DTD in chapter 10.1), would look like this:

<?xml version="1.0" encoding="UTF-16"?>
<Result ProcessStatus="C4" ModeUsed="BATCH" Count="1" CountOverflow="NO" CountryISO3="JPN" PreferredScript="LATIN" PreferredLanguage="DATABASE">
<ResultData ResultNumber="1" MailabilityScore="3" ResultPercentage="83.20" ElementResultStatus="F0F8F040400040000060" ElementInputStatus="60606020200020000060" ElementRelevance="10111000000000000010">
  <AddressElements>
    <Country Type="NAME_EN" Item="1">JAPAN</Country>
    <Locality Item="1">MINAMIUONUMA-SHI</Locality>
    <Locality Item="2">ANAJISHINDEN</Locality>
    <PostalCode Item="1">949-7277</PostalCode>
    <Province Item="1">NIIGATA-KEN</Province>
    <Street Item="1">KOKUSAI-CHŌ</Street>
    <Number Item="1">777 BANCHI</Number>
    <Building Item="1">KOKUSAIDAIGAKU</Building>
  </AddressElements>
  <AddressLines>
    <DeliveryAddressLine Line="1">777 BANCHI KOKUSAI-CHŌ KOKUSAIDAIGAKU</DeliveryAddressLine>
    <CountrySpecificLocalityLine Line="1">MINAMIUONUMA-SHI NIIGATA-KEN 9497277</CountrySpecificLocalityLine>
    <FormattedAddressLine Line="1">777 BANCHI KOKUSAI-CHŌ KOKUSAIDAIGAKU</FormattedAddressLine>
    <FormattedAddressLine Line="2">ANAJISHINDEN</FormattedAddressLine>
    <FormattedAddressLine Line="3">MINAMIUONUMA-SHI NIIGATA-KEN 9497277</FormattedAddressLine>
    <FormattedAddressLine Line="4">JAPAN</FormattedAddressLine>
  </AddressLines>
  <AddressComplete>777 BANCHI KOKUSAI-CHŌ KOKUSAIDAIGAKU ANAJISHINDEN MINAMIUONUMA-SHI NIIGATA-KEN 949-7277 JAPAN </AddressComplete>
</ResultData>
</Result>

Similarly, a Russian example in Cyrillic script:

<?xml version="1.0" encoding="UCS-2LE"?>
<InputData>
  <AddressElements>
    <Country Item="1" Type="NAME">RUS</Country>
  </AddressElements>
  <AddressLines>
    <FormattedAddressLine Line="1">Международный университет в Москве</FormattedAddressLine>
    <FormattedAddressLine Line="2">Ленинградский проспект 17</FormattedAddressLine>
    <FormattedAddressLine Line="3">125040 Москва</FormattedAddressLine>
  </AddressLines>
</InputData>

This results in the following (with PreferredScript set to “ASCII_SIMPLIFIED” this time, to suppress special characters like the “ž” in “Meždunarodnyj”, see chapter 5.12.1 for reference):

<?xml version="1.0" encoding="UCS-2LE"?>
<Result ProcessStatus="C4" ModeUsed="BATCH" Count="1" CountOverflow="NO" CountryISO3="RUS" PreferredScript="ASCII_SIMPLIFIED" PreferredLanguage="DATABASE">
<ResultData ResultNumber="1" MailabilityScore="4" ResultPercentage="82.50" ElementResultStatus="F0F080F0F000400000E0" ElementInputStatus="60600060600020000060" ElementRelevance="10101000000000000010">
  <AddressElements>
    <Country Type="NAME_EN" Item="1">RUSSIAN FEDERATION</Country>
    <Locality Item="1">Moskva</Locality>
    <PostalCode Item="1">125040</PostalCode>
    <Province Item="1">Moskva</Province>
    <Street Item="1">Leningradskij pr-kt</Street>
    <Number Item="1">17</Number>
    <Building Item="1">Mezdunarodnyj Universitet V Moskve</Building>
  </AddressElements>
  <AddressLines>
    <DeliveryAddressLine Line="1">Mezdunarodnyj Universitet V Moskve</DeliveryAddressLine>
    <DeliveryAddressLine Line="2">Leningradskij Pr-Kt 17</DeliveryAddressLine>
    <CountrySpecificLocalityLine Line="1">Moskva</CountrySpecificLocalityLine>
    <FormattedAddressLine Line="1">Mezhdunarodnyj Universitet V Moskve</FormattedAddressLine>
    <FormattedAddressLine Line="2">Leningradskij Pr-Kt 17</FormattedAddressLine>
    <FormattedAddressLine Line="3">Moskva</FormattedAddressLine>
    <FormattedAddressLine Line="4">125040</FormattedAddressLine>
    <FormattedAddressLine Line="5">Russian Federation</FormattedAddressLine>
  </AddressLines>
  <AddressComplete>Mezhdunarodnyj Universitet V Moskve Leningradskij Pr-Kt 17 Moskva 125040 Russian Federation </AddressComplete>
</ResultData>
</Result>

6.16 …use Informatica AddressDoctor with multiple processor cores?

Let us assume a machine with four processor cores, three of which are to be used for address processing. The main thread of the program integrating Informatica AddressDoctor 5 calls AD_Initialize() (see chapter 6.1) with MaxThreadCount=3 and MaxAddressObjectCount=3 (see chapter 5.36) and creates three worker threads for processing addresses. Each worker thread then acquires one AddressObject handle via

AD_GetAddressObject( &hAOHandle );

and subsequently keeps repeating the following sequence (see chapters 6.7.1, 6.8 and 6.11):

AD_SetInputDataXML( hAOHandle, <XML string> );
AD_Process( hAOHandle );
AD_GetResultXML( hAOHandle, sResultXML, sizeof( sResultXML ) );
AD_ClearData( hAOHandle );

When finally shutting down, the main thread destroys all worker threads and de-initializes Informatica AddressDoctor (see chapter 6.1):

AD_ReleaseAllAddressObjects();
AD_DeInitialize();

6.17 …produce valid Informatica AddressDoctor XML?

Any XML input to Informatica AddressDoctor should always be well-formed and validated against the DTDs provided for that purpose by Informatica AddressDoctor (see chapter 10.1). Note that the sequence of the XML elements does matter (but not that of their attributes), which can be checked through DTD validation as well. Refer to http://wikipedia.org/wiki/XML for an introduction to XML.
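Such a check can also be automated. The following Java sketch uses the standard JAXP parser in validating mode to test whether an XML string is well-formed and valid against the DTD named in its DOCTYPE. The class XmlCheck and the tiny inline DTD in main() are hypothetical stand-ins for illustration only; the inline DTD is not the real InputData.dtd. For a document that references an external DTD file, the parser would additionally need a system ID or entity resolver that points to the DTD's location.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class XmlCheck {
    // Returns true only if the XML is well-formed AND valid against its DTD.
    static boolean isValid(String xml) {
        final boolean[] valid = { true };
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations (e.g. wrong element order) arrive here.
                public void error(SAXParseException e) { valid[0] = false; }
                // Well-formedness violations abort parsing.
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return valid[0];
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in DTD for illustration; NOT the real InputData.dtd.
        String dtd = "<!DOCTYPE InputData ["
                + "<!ELEMENT InputData (AddressElements)>"
                + "<!ELEMENT AddressElements (Country, Locality)>"
                + "<!ELEMENT Country (#PCDATA)>"
                + "<!ELEMENT Locality (#PCDATA)>]>";
        String inOrder = dtd + "<InputData><AddressElements><Country>SGP</Country>"
                + "<Locality>Singapore</Locality></AddressElements></InputData>";
        String swapped = dtd + "<InputData><AddressElements><Locality>Singapore</Locality>"
                + "<Country>SGP</Country></AddressElements></InputData>";
        System.out.println("in order: " + isValid(inOrder));
        System.out.println("swapped:  " + isValid(swapped));
    }
}
```

The second document in main() differs from the first only in the order of its child elements, which is the kind of mistake that a well-formedness check alone would not catch.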
Apart from XML functionality being an integral part of most modern Integrated Development Environments (IDEs), there is a diverse choice of free validating XML editors, like WMHelp XMLPad or XML Copy Editor from SourceForge.net.

When dealing with XML files produced on different platforms, note that end-of-line (EOL) characters differ between Windows (CR+LF) and UNIX (LF), see http://wikipedia.org/wiki/Linebreak.

6.18 …use Informatica AddressDoctor XML for flexible Business Processes?

Standards like BPEL (the Business Process Execution Language) allow for more flexible business processes implemented using information technology. For instance, you might model and implement a business process including global address verification based on an InputData.xml template that contains placeholder variables mapping to certain input data columns provided by data sources. Some of the external influences the business side might have to react to (like new postal regulations) may thus be implemented without programming knowledge, simply by adjusting these placeholders in the XML template.

For example, let us assume you are dealing with addresses for a country that has recently introduced a postal code system.
So far, your InputData.xml template might have looked like this (the “$” character is used to delimit placeholder names here):

<?xml version='1.0' encoding='UTF-16'?>
<!DOCTYPE InputData SYSTEM 'InputData.dtd'>
<InputData>
  <AddressElements>
    <Key>$COLUMN1$</Key>
    <Country Item='1' Type='NAME'>$COLUMN7$</Country>
    <Locality Item='1' Type='COMPLETE'>$COLUMN6$</Locality>
    <Street Item='1' Type='COMPLETE'>$COLUMN5$</Street>
    <Building Item='1' Type='COMPLETE'>$COLUMN4$</Building>
    <Organization Item='1' Type='NAME'>$COLUMN2$</Organization>
    <Contact Item='1' Type='NAME'>$COLUMN3$</Contact>
  </AddressElements>
</InputData>

Due to that new postal regulation, postal codes have been added to the data source and now need to be verified as well. A new eighth column has thus been made available and can be mapped as follows:

<?xml version='1.0' encoding='UTF-16'?>
<!DOCTYPE InputData SYSTEM 'InputData.dtd'>
<InputData>
  <AddressElements>
    <Key>$COLUMN1$</Key>
    <Country Item='1' Type='NAME'>$COLUMN7$</Country>
    <Locality Item='1' Type='COMPLETE'>$COLUMN6$</Locality>
    <PostalCode Item='1' Type='UNFORMATTED'>$COLUMN8$</PostalCode>
    <Street Item='1' Type='COMPLETE'>$COLUMN5$</Street>
    <Building Item='1' Type='COMPLETE'>$COLUMN4$</Building>
    <Organization Item='1' Type='NAME'>$COLUMN2$</Organization>
    <Contact Item='1' Type='NAME'>$COLUMN3$</Contact>
  </AddressElements>
</InputData>

All that is needed to facilitate this kind of change is a simple editor as described in chapter 6.17.

6.19 …use Informatica AddressDoctor for Master Data Management?

Informatica AddressDoctor provides a batch validation mode that was designed for mass data address quality, for example, for use in Master Data Management (MDM) or Data Integration systems. This validation mode (see chapter 5.11.1 for details) allows address input into the AddressObject irrespective of data quality.
The input is then automatically corrected to the extent possible, returning the single most likely candidate as the processing result.

When designing an application for batch processing, call the AD_Process() function with the BATCH validation process mode (see chapter 5.11.1). Informatica AddressDoctor returns a single corrected result whenever possible (Process Status “Vx” or “Cx”, see chapter 5.17).

For tackling severe address quality challenges, a recommended batch usage pattern for Informatica AddressDoctor 5 based on the “OptimizationLevel” concept is described in chapter 5.33.

6.20 …use Informatica AddressDoctor in an eBusiness Environment?

Informatica AddressDoctor provides an interactive validation mode that was designed for address quality at the point of data entry, for example, for use in online registration forms, be it for a web shop, an auction platform or a customer feedback system. This validation mode (see chapter 5.11.2 for details) allows for address input into the AddressObject irrespective of data quality. The input is then automatically corrected to the extent possible, returning a choice of likely candidates. If processing can identify one definite candidate, the result returned will only be that candidate.

When designing an application for interactive entry, call the AD_Process() function with the INTERACTIVE validation process mode (see chapter 5.11.2), for example, using the web form content that has been posted to a web server.

Informatica AddressDoctor returns a number of possible results (candidates), which are then to be presented to the user entering data, who picks the most correct result. If the input data entered was already complete and correct, there would obviously be no need for such user interaction.
Note that the user should usually have the option to edit the returned result once more before final submission. For instance, in the case of new construction activity, an address might not yet be featured even in the most recent set of postal reference data.

6.21 …use the Quick Address Entry Feature?

Informatica AddressDoctor features a validation mode (Fast Completion) that can be used in call center environments where data entry personnel should be assisted in their data entry task. The same use case will usually apply to Customer Relationship Management (CRM) systems, Property and Reservation Management Systems (PMS) or Point of Sales (POS) systems. This validation mode (see chapter 5.11.3 for details) allows for incomplete address input into the AddressObject. This input is automatically completed to the extent possible.

When designing an application for quick address entry, it is possible to call the AD_Process() function with the FASTCOMPLETION validation process mode (see chapter 5.11.3) after each keystroke. Provided the reference databases are either accessible quickly or even stored locally, pick lists can be displayed in real time.

As an example, we are going to input the following data:

Country: USA
Locality: Wash
Street: Pennsyl

Informatica AddressDoctor returns 100 results (suggestions) and sets an overflow indication: if the “CountOverflow” attribute of Result.xml (see DTD in chapter 10.1) is set to YES, potentially more results would be available. It is then recommended to call the AD_Process() function again with additional input data.

6.22 …use Informatica AddressDoctor in a multi-tenant hosted environment?

A multi-tenant hosted solution requires initialization of separate Informatica AddressDoctor instances with a customer-specific unlock code for each, in order to meet the terms and conditions set by the different reference data providers.
Informatica AddressDoctor has examined making use of a RAM disk to share the reference database files across these several Informatica AddressDoctor instances (typically threads). Internal benchmarks using ramfs on Linux have shown that the address validation throughput of a RAM disk (with PreloadingType="NONE", see chapter 5.34) is only about 12% less compared to full preloading (PreloadingType="FULL"), in the case of 4 threads running on a 4 core machine. Blocking of the different threads thus seems reduced by the low latency of RAM compared to hard disk storage.

This is a good compromise between speed and hardware requirements, as about 8-10 GB of RAM should suffice to hold the world Batch/Interactive reference databases in a shared RAM disk. Note that with Informatica AddressDoctor 5.1.4 a new default method of preloading has been introduced (PreloadingMethod="MAP", see chapter 5.34), which allows sharing memory mapped reference database files across instances out of the box, without the performance hit due to the RAM disk driver.

Where several GB of RAM are not available for memory mapped files, each customer will then at least require separate storage with their own copy of the reference database files for performance reasons. These are then only partially preloaded (PreloadingType="PARTIAL") to their own Informatica AddressDoctor instance (0.5 to 1 GB RAM per customer should usually suffice here), which features a caching facility for this use case as well.
Keeping separate copies of the reference database files should ensure that customer instance I/O accesses to these files don't block each other - note that full preloading is a prerequisite for proper multicore scalability (see chapter 5.36), so such a partially preloaded setup will probably limit the usable processor cores per customer thread to no more than two (because of I/O blocking again, this time between the multiple threads used for one customer). We have found SATA Solid State Disks to improve performance vastly in such a setup, for reference see chapter 6.25. 6.23 …use Informatica AddressDoctor for Web Services? As demonstrated in chapter 4.1, Informatica AddressDoctor 5 introduced an XML API (see Appendix 10.2 for reference) that makes it even easier to integrate global address correction in Web Services environments - be it Software as a Service (SaaS) SOAP calls for Internet cloud computing or an Enterprise Service Bus (ESB) as part of a Service Oriented Architecture (SOA) in the Intranet. Simply feed address data from your web service in XML format directly into Informatica AddressDoctor via AD_SetInputDataXML() (see chapter 6.7.1 for details), which only requires prior XML transformation (using broadly adopted technologies like XSLT, see http://wikipedia.org/wiki/XSLT) on the basis of the DTD information made available by Informatica AddressDoctor (see chapters 6.17 and 10.1). Note that Informatica AddressDoctor also offers secure and synchronous Web Services for direct and ready-to-use integration via SOAP (see the product line overview in chapter 2). The following figure provides an overview of the Informatica AddressDoctor Data Quality Platform: Informatica AddressDoctor Documentation – Last Revision: 5-Nov-14 @ 12:26 140 Batch (max. 10 addresses per SOAP call) o Validation and automatic correction of addresses in batch with immediate results. Interactive o Online validation of addresses with interactive correction. 
  o Ideal for online shops and CRM systems.
• FastCompletion
  o Support for call centers.

The Informatica AddressDoctor Web Services have a proven track record in both high availability (> 99.9 %) and high volume throughput. Web Service pricing is very competitive and transaction based. Address enrichment options are also available; for details see http://www.addressdoctor.com/en/products/ecommerce.

6.24 ...validate an address in CERTIFIED mode?

For some countries, Informatica AddressDoctor offers a special validation process mode "CERTIFIED" which is used to validate an address according to the certification rules defined by the local postal authority. This validation type allows integrators to develop their own application for certification by the respective postal organization. Special database files may be necessary for CERTIFIED processing; for details see the following chapters.

It is very important to note that the following parameters must not be changed from their default settings to ensure proper CERTIFIED processing:

• PreferredLanguage (see chapter 5.12.2)
• MatchingAlternatives and MatchingScope (see chapter 5.12.5)
• GlobalMaxLength, GlobalCasing, and AddressElementStandardize (MaxLength or Casing, see chapter 5.14)
• OptimizationLevel (see chapter 5.33)

6.24.1 ...process an address following the rules for CASS certification?

For US addresses, the Informatica AddressDoctor engine offers a special validation process mode "CERTIFIED" which is used to validate an address according to the USPS CASS rules. This validation type allows integrators to develop their own CASS application for certification by USPS. Special database files are necessary for CASS processing; for details see chapter 3.3.2. For CountryISO3="USA" and Type="CERTIFIED", the GetConfig.xml output will only list a subset of the available reference database files as "DataBase" elements with Type="CERTIFIED".
USA5BI.md must also always be available, as this database is the basis for CASS processing as well (see chapter 6.24). Furthermore, US CERTIFIED mode will not work without preloading, and full preloading will always be enforced on some of the CASS databases described in chapter 3.3.2, irrespective of the settings.

During the validation process the input address is corrected according to CASS rules. In this process all CASS attributes are generated and the ZIP + 4 is added to the ZIP code. The output address is retrieved from the AddressObject as usual (see chapter 6.11).

A CASS processing status value (see chapter 5.21) can be retrieved from the AddressObject through the "CASSStatus" attribute returned with the "EnrichmentData" element of the Result.xml. To actually have CASS attributes available in the CASS element of Result.xml, the Process attribute "EnrichmentCASS" within Parameters.xml (see Appendix 10.1) must be set to ON (default is OFF). With "EnrichmentCASS" set to OFF, Informatica AddressDoctor still provides ZIP+4 codes (as PostalCode item type "ADD_ON"), as long as USA5BI.MD is available in the database folder. For convenience, ZIP+4 codes are also provided by the US BATCH process mode, although some result variations may occur; the definitive ZIP+4 reference is available in US CERTIFIED process mode only.

You may check for correct initialization of all required CASS databases (see chapter 3.3.2 for the full list) by querying GetConfig.xml (see the respective DTD in Appendix 10.1) for the EnrichmentSupportInfo element:

<EnrichmentSupportInfo CountryISO3="USA" Type="CERTIFIED">FULL</EnrichmentSupportInfo>

Note that the CASS attribute output provided by Informatica AddressDoctor is only valid for use during a special validation period, which varies depending on the product.
The valid time ranges are as follows (as defined on the USPS PS 3553 form, which will need to be created by the calling application to qualify for USPS mailing discounts; for an example, see http://ribbs.usps.gov/cassmass/documents/tech%5Fguides/PS_FORM_3553):

ZIP + 4/DPV Coded
  From Date: 30 days before (the 15th of each month or bimonthly) or no later than 105 days after the file date.
  To Date: 180 days after the ZIP + 4 valid "From" date.

Total Delivery Point Barcoded
  From Date: 30 days before (the 15th of each month or bimonthly) or no later than 105 days after the ZIP + 4 product file date.
  To Date: 180 days after the DPBC valid "From" date.

Total Carrier Route Coded
  From Date: 30 days before or up to 105 days after the ZIP + 4, Five-Digit ZIP, or the Carrier Route product date (the 15th of each month or bimonthly) or up to 105 days after the file date.
  To Date: 90 days after the Carrier Route valid "From" date.

Five-Digit Coded
  From Date: 30 days before (the 15th of each month or bimonthly) or no later than 105 days after the ZIP + 4, Five-Digit ZIP, or the Carrier Route product date.
  To Date: 365 days after the Five-Digit valid "From" date.
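The date arithmetic in the table above can be sketched as follows. This is an illustrative Python helper, not part of Informatica AddressDoctor: the names `TO_DATE_OFFSET_DAYS`, `from_date_window` and `to_date` are made up for this example, and the sketch assumes that the 30/105-day window is counted from the product file date. Check the PS 3553 form itself before relying on these dates.

```python
from datetime import date, timedelta

# Days between the valid "From" date and the "To" date, per category,
# taken from the table above.
TO_DATE_OFFSET_DAYS = {
    "ZIP + 4/DPV Coded": 180,
    "Total Delivery Point Barcoded": 180,
    "Total Carrier Route Coded": 90,
    "Five-Digit Coded": 365,
}


def from_date_window(product_file_date):
    """Return (earliest, latest) permissible "From" date.

    Assumption: the window runs from 30 days before to 105 days after
    the product file date.
    """
    return (product_file_date - timedelta(days=30),
            product_file_date + timedelta(days=105))


def to_date(category, from_date):
    """Return the "To" date for a category, given its valid "From" date."""
    return from_date + timedelta(days=TO_DATE_OFFSET_DAYS[category])


if __name__ == "__main__":
    file_date = date(2014, 11, 15)
    print(from_date_window(file_date))
    print(to_date("Total Carrier Route Coded", file_date))
```

Keeping the offsets in one dictionary makes it easy to compare them against the form whenever USPS revises it.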
Note that any application based on Informatica AddressDoctor 5 must meet the CASS Terms and Conditions to qualify for USPS mailing discounts: http://ribbs.usps.gov/cassmass/documents/tech_guides/FORMS/CASSDEVS.pdf

The following list shows which CASS attributes are available (see Result.dtd in Appendix 10.1 also). For an explanation of the different attributes, refer to the CASS documentation at http://ribbs.usps.gov/cassmass/documents/tech_guides/TECHNICAL_GUIDES/CASSTECH_N.PDF:

Carrier Route Answer                            CARRIER_ROUTE
Record Type Code                                RECORDTYPE
Delivery Point Answer                           DELIVERY_POINT
Delivery Point Check Digit Answer               DELIVERY_POINT_CHECK_DIGIT
High-rise Default                               HIGHRISE_DEFAULT
High-rise Exact                                 HIGHRISE_EXACT
Rural route Default                             RURALROUTE_DEFAULT
Rural route Exact                               RURALROUTE_EXACT
DSF² LACS Indicator                             LACS
DPV Confirmation Indicator*                     DPV_CONFIRMATION
DPV CMRA Indicator*                             DPV_CMRA
DPV False Positive Indicator*                   DPV_FALSE_POSITIVE
DPV Footnote 1*                                 DPV_FOOTNOTE_1
DPV Footnote 2*                                 DPV_FOOTNOTE_2
DPV Footnote 3*                                 DPV_FOOTNOTE_3
Concatenation of DPV Footnotes*                 DPV_FOOTNOTE_COMPLETE
Result of the call to the DPV NOSTATS Table     DSF2_NOSTATS_INDICATOR
Result of the call to the DPV VACANT Table      DSF2_VACANT_INDICATOR
LACSLink Return Code*                           LACSLINK_RETURNCODE
SUITELink Return Code*                          SUITELINK_RETURNCODE
ZIPMove Return Code*                            ZIPMOVE_RETURNCODE
Early Warning System (EWS) Return Code          EWS_RETURNCODE
Congressional District                          CONGRESSIONAL_DISTRICT
Barcode                                         BARCODE
Residential Delivery Indicator**                RDI_INDICATOR
eLOT Ascending/Descending                       ELOT_FLAG
eLOT Sequence Number                            ELOT_SEQUENCE

Note: Attributes marked with * will only be populated for US customers as per USPS licensing restrictions. See chapter 7 in http://ribbs.usps.gov/dpv/documents/tech_guides/DPV_LPR.PDF for details on how to programmatically act on DPV_FALSE_POSITIVE being set.
Attributes marked with ** require customers to acquire the data from USPS to enable the optional part of the processing. See chapter 3.3.2 on how to acquire the data and rename the necessary files.

6.24.2 ...process an address following the rules for SERP certification?

For Canadian addresses, Informatica AddressDoctor offers a special validation process mode "CERTIFIED" which is used to validate an address according to the Canada Post SERP rules. This validation type allows integrators to develop their own SERP application for certification by Canada Post.

As the new databases now contain PoCAD (Point of Call Address Data) data, an additional CAN5C1.MD is needed for certified mode. Those who want to use the new engine with older databases will have to make a copy of CAN5BI.MD and rename the copy to CAN5C1.MD. See chapter 6.5 also.

You may check for correct initialization of all required databases by querying GetConfig.xml (see the respective DTD in Appendix 10.1) for the EnrichmentSupportInfo element:

<EnrichmentSupportInfo CountryISO3="CAN" Type="CERTIFIED">FULL</EnrichmentSupportInfo>

A SERP processing status value (see chapter 5.22) can be retrieved from the AddressObject through the "SERPStatus" attribute returned with the "EnrichmentData" element of the Result.xml. To actually have SERP attributes available in the SERP sub-element of EnrichmentData in Result.xml, the Process attribute "EnrichmentSERP" within Parameters.xml (see Appendix 10.1) must be set to ON (default is OFF).

Note that SERP certification requirements are only met when the "PreferredScript" attribute is set to "ASCII_SIMPLIFIED" (see chapter 5.12.1).

If the validation type is CERTIFIED and SERP enrichment is set to ON, two enrichments are provided: CATEGORY and EXCLUDED_FLAG.

The category provides the following possible values:

Value  Description
V      Verified. The process status is of type Vx.
C      Corrected. The process status is of type Cx.
N      Incorrect. The process status is of type Ix.
VQ     Valid, but questionable. Rural addresses (those with a '0' as second digit in the PostalCode, for example, "K0A 1L0") are usually considered valid because they are determined by the PostalCode. Questionable means that either delivery information is missing in the input or that some part or all of the delivery input has not been verified by the database. See also the address accuracy handbook provided by Canada Post.
V1A    Valid, residential type record. Some records in the database containing buildings are marked as apartment type records, either residential or commercial. This information is provided in the enrichment.
V2A    Valid, commercial type record. This refers to commercial building records in the database.
C1A    Corrected, residential type record.
C2A    Corrected, commercial type record.

Since December 2010, PoCAD data has been added to the databases to provide more detailed suite information. (Note that the corresponding database CAN5C1.MD will be made available early January 2011.)

The EXCLUDED_FLAG informs about PoCAD addresses with wrong user input. The Informatica AddressDoctor output for this flag can either be empty or the text 'EXCLUDED':

EXCLUDED: Incorrect suite input for a PoCAD address, category N, process status Ix

Effective January 17, 2011, the statement of accuracy has to report addresses as Excluded. However, starting August 1, 2011 this flag will no longer be needed. Addresses will no longer show up as being excluded. See Result.dtd in Appendix 10.1 also.

Refer to Canada Post's website for more information: http://www.canadapost.ca/cpo/mc/business/productsservices/atoz/addressaccuracy.jsf

6.24.3 ...process an address following the rules for AMAS certification?

For Australian addresses, Informatica AddressDoctor offers a special validation process mode "CERTIFIED" which is used to validate an address according to the Australia Post AMAS rules.
This validation type allows integrators to develop their own AMAS application for certification by Australia Post. Special databases are necessary for AMAS processing; for details see chapter 3.3.24. These new databases contain Postal Address File (PAF) data including Australia Post's Delivery Point Identifiers (DPIDs). The additional AMAS information can be found in the section EnrichmentData of the Result.xml.

You may check for correct initialization of all required AMAS databases (see chapter 3.3.24 for the full list) by querying GetConfig.xml (see the respective DTD in Appendix 10.1) for the EnrichmentSupportInfo element:

<EnrichmentSupportInfo CountryISO3="AUS" Type="CERTIFIED">FULL</EnrichmentSupportInfo>

The following status codes or parameters are available (see AMAS documentation for more details):

Parameter                  Description
ERRORCODE                  Internal error code
RECORD TYPE                Type of address (for example, S for Street)
DELIVERY_POINT_ID          Delivery point identifier (DPID), 8 digits
LOT_NBR                    Lot number (for example, 100)
POSTAL_DELIVERY_NBR        Postal delivery number (for example, "00123" of "123A")
POSTAL_DELIVERY_NBR_PFX    Postal delivery number prefix (for example, "A" of "A123")
POSTAL_DELIVERY_NBR_SFX    Postal delivery number suffix (for example, "A" of "123A")
HOUSE_NBR_1                House (street) number 1 (for example, "00123" of "123A456B")
HOUSE_NBR_SFX_1            House (street) number 1 suffix (for example, "A" of "123A456B")
HOUSE_NBR_2                House (street) number 2 (for example, "00456" of "123A456B")
HOUSE_NBR_SFX_2            House (street) number 2 suffix (for example, "B" of "123A456B")

Other AMAS relevant fields are regular address elements and can be found in the standard ResultData section.
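The GetConfig.xml check described above for AMAS (and, in the same form, for the other CERTIFIED countries in this chapter) could be automated along the following lines. This is a minimal Python sketch, not AddressDoctor API code: only the EnrichmentSupportInfo element with its CountryISO3 and Type attributes and the value FULL come from this documentation. The `<GetConfig>` wrapper in the sample and the function name `certified_support` are assumptions made for illustration; consult the GetConfig DTD in Appendix 10.1 for the real document structure.

```python
import xml.etree.ElementTree as ET


def certified_support(get_config_xml, country_iso3):
    """Return the CERTIFIED support level reported for a country, or None.

    The search walks the whole tree, so it does not depend on where the
    EnrichmentSupportInfo element is nested in the real GetConfig output.
    """
    root = ET.fromstring(get_config_xml)
    for info in root.iter("EnrichmentSupportInfo"):
        if (info.get("CountryISO3") == country_iso3
                and info.get("Type") == "CERTIFIED"):
            return (info.text or "").strip()
    return None


# Hypothetical wrapper element; only the inner element is taken from the text.
SAMPLE = """<GetConfig>
  <EnrichmentSupportInfo CountryISO3="AUS" Type="CERTIFIED">FULL</EnrichmentSupportInfo>
</GetConfig>"""

if __name__ == "__main__":
    level = certified_support(SAMPLE, "AUS")
    print("AMAS databases ready" if level == "FULL" else "AMAS databases not ready")
```

Treating anything other than FULL as "not ready" is a conservative choice for this sketch; this chapter does not list the other values the element may take.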
Beginning with version 5.2.8, the Locality sub-field "PREFERRED_NAME" has to be used to comply with AMAS rules, because the COMPLETE or NAME fields may contain vanity names if these were entered instead of the official names requested by the postal administration of Australia.

6.24.4 ...process an address following the rules for SNA certification?

For French addresses, Informatica AddressDoctor offers a special validation process mode "CERTIFIED" which is used to validate an address according to the La Poste SNA rules. This validation type allows integrators to develop their own SNA application for certification by La Poste. No special database files apart from FRA5BI.md are necessary for CERTIFIED processing. See chapter 6.5 also.

You may check for correct initialization of all required databases by querying GetConfig.xml (see the respective DTD in Appendix 10.1) for the EnrichmentSupportInfo element:

<EnrichmentSupportInfo CountryISO3="FRA" Type="CERTIFIED">FULL</EnrichmentSupportInfo>

For SNA certified processing (see http://www.laposte.fr/sna), addresses must be entered in a six-line FormattedAddressLine format, including empty lines wherever a part of the address is missing:

Line 1: ORGANIZATION IDENTIFICATION or IDENTITY OF THE ADDRESSEE
Line 2: INDIVIDUAL IDENTIFICATION (i.e. Company Contact) or DELIVERY POINT ACCESS INFORMATION (i.e. SubBuilding)
Line 3: DELIVERY POINT LOCATION (i.e. Building)
Line 4: STREET NUMBER or PLOT and THOROUGHFARE
Line 5: DELIVERY SERVICE or THOROUGHFARE COMPLEMENTARY IDENTIFICATION
Line 6: POSTCODE and LOCALITY or CEDEX POSTCODE and DISTRIBUTION AREA INDICATOR

An SNA processing status value (see chapter 5.23) can be retrieved from the AddressObject through the "SNAStatus" attribute returned with the "EnrichmentData" element of the Result.xml.
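The six-line input layout described above, with empty lines kept for missing parts, can be prepared with a small helper before the lines are handed to the FormattedAddressLine input. The following Python sketch is illustrative only: `sna_address_lines` is a hypothetical helper name, the sample address is made up, and how the six strings are passed to Informatica AddressDoctor is not shown here (see chapter 6.7.4).

```python
SNA_LINE_COUNT = 6  # SNA input always consists of exactly six lines


def sna_address_lines(parts):
    """Build the six SNA input lines from a {line_number: text} mapping.

    Line numbers run from 1 to 6. Lines that are not supplied are kept
    as empty strings so that every part stays on its designated line.
    """
    unknown = sorted(set(parts) - set(range(1, SNA_LINE_COUNT + 1)))
    if unknown:
        raise ValueError(
            f"SNA input has lines 1..{SNA_LINE_COUNT} only, got {unknown}")
    return [parts.get(n, "").strip() for n in range(1, SNA_LINE_COUNT + 1)]


if __name__ == "__main__":
    # Made-up address: organization on line 1, street on line 4,
    # postcode and locality on line 6; lines 2, 3 and 5 stay empty.
    for number, line in enumerate(
            sna_address_lines({1: "SOCIETE EXEMPLE",
                               4: "12 RUE EXEMPLE",
                               6: "75001 PARIS"}), start=1):
        print(number, repr(line))
```

Rejecting line numbers outside 1 to 6 catches off-by-one mistakes early, which would otherwise shift address parts onto the wrong SNA line.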
To actually have SNA attributes available in the SNA sub-element of EnrichmentData in Result.xml, the Process attribute "EnrichmentSNA" within Parameters.xml (see Appendix 10.1) must be set to ON (default is OFF).

Note that SNA certification requirements are only met when the "PreferredScript" attribute is set to "ASCII_SIMPLIFIED" (see chapter 5.12.1) and "GlobalMaxLength" is set to 38 (with the "MaxLength" for each AddressElement set to 0, see chapter 5.14).

The only SNA attribute available is "CATEGORY" with the possible values ORI, RES, AVE and NOK, as per La Poste definition (see Result.dtd in Appendix 10.1 also).

Note that the SNA certification of the CERTIFIED mode for FRA is still pending.

6.24.5 …process an address following the rules for SendRight certification?

Informatica AddressDoctor has passed the stringent rules set by New Zealand Post to obtain the SendRight Certification. For more information on the Certification Programme, contact New Zealand Post directly at www.nzpost.co.nz.

One of the requirements of New Zealand Post was that no address cleansing occurs during the certification process. It is therefore recommended that customers use the Batch mode before running their addresses through the Certification mode if address corrections or standardizations are required.

Here is a quote from the SendRight Certification Handbook, Section 2.4, Software that does more than PAF validation (page 5):

“SendRight™ certification only assesses PAF validation and SOA-issuing functionality. Any other functionality such as address cleansing must not impinge on the functionality to be assessed as it may invalidate the testing process and results.
The purpose of the software testing is to determine whether the software can take an input address, match it to the PAF data elements and accurately calculate the desired result.”

For more details, refer to the SendRight Certification Handbook from New Zealand Post: http://www.nzpost.co.nz/sites/default/files/uploads/shared/sendrightcertification.pdf

6.25 ...optimize performance?

The speed of your application will depend on the functionality of Informatica AddressDoctor that is used. Parsing and Country Recognition, as implemented by Informatica AddressDoctor (AD_Process() for C) with the Process Modes PARSE or COUNTRYRECOGNITION, do not access any databases, whereas all other modes (BATCH, INTERACTIVE, FAST_COMPLETION, ADDRESSCODELOOKUP and so on) do.

When validating an address, a number of read operations access the corresponding country database. These accesses are random in nature. To reduce the number of read operations accessing the hard disk, preloading part or all of a database is strongly recommended. See chapter 5.34 for details on this topic.

However, if the free physical memory (or the part made available to the process running Informatica AddressDoctor) is too small to fully preload every country database needed, Informatica AddressDoctor must access the hard disk. To reduce the number of file system calls needed, Informatica AddressDoctor manages its own cache (see chapter 5.35). The operating system will also cache file accesses, thereby significantly speeding up subsequent calls. Naturally, the OS must have sufficient free memory available for this purpose.

Preloading, on the other hand, reduces the amount of memory the operating system can use efficiently for file caching, and the operating system may even temporarily swap out the preloaded data blocks to the hard disk.
For this reason it is recommended to limit the amount of memory used for preloading if it can be foreseen that additional hard disk accesses will be necessary (as would be the case, for instance, when the total memory available is not sufficient to allow full preloading of all country reference databases needed).

As minimizing accesses to the hard disk is key to validation performance, installing more memory (see chapter 5.37 on memory allocation) will speed up processing significantly. See chapter 6.22 for reference if you are running a multi-tenant installation of Informatica AddressDoctor 5.

Note that some operating systems can use more memory for file caching than the supported memory size per process. For example, 32 Bit Windows 2003 Server Enterprise Edition can address up to 32 GB of RAM, but the limit per process is still 2 GB (or rather 3 GB when using the /3GB boot.ini switch, for reference see http://support.microsoft.com/kb/291988) and the limit for standard CPU memory access (unless using AWE/PAE, see http://support.microsoft.com/kb/283037 for reference) is 4 GB.

Whether enough memory is installed can be checked with a tool that monitors resource utilization (such as the Performance Monitor perfmon.exe on Windows). If sufficient memory is installed, one processor core should be utilized at almost 100% when processing a large batch of records in single-thread mode. While Informatica AddressDoctor provides the means to utilize multi-threading internally (see chapter 5.36), having more than one core or processor in the system already speeds up processing in itself, because other threads in the system can run independently.

Here is a summary of tips to optimize the performance of validation:

• Install as much memory as possible to allow country databases to be fully preloaded into memory. At least as much memory as the size of the most often used country databases combined plus 256 MB should be available. If all countries available from Informatica AddressDoctor are to be used simultaneously, add more memory to cover the entire size of all databases.
• Preload at least the databases of frequently used countries with the proper parameters set in the SetConfig.xml passed to the AD_Initialize() function.
• When full preloading is not an option, store the database files on a fast hard disk or, even better, a SATA Solid State Disk (ideally exceeding 200 MB/sec read transfer rate; for development purposes, high-speed USB or FireWire flash modules exceeding 30 MB/sec read transfer rate might suffice). The access latency (average seek time) in particular should be minimized: internal Informatica AddressDoctor benchmarks for PreloadingType="NONE" with an Intel X25M G2 SATA SSD have shown a typical performance increase by a factor of 20.
• Keep the Informatica AddressDoctor reference databases on a separate hard drive. Read and write address data from other drives.
• Make absolutely sure to keep the database files defragmented. Internal tests have shown that performance may easily decrease by as much as 35% when the files are heavily fragmented.
• Informatica AddressDoctor is very data-intensive, with a significant amount of non-localized memory accesses during processing. As such, it greatly benefits from direct multi-channel memory access (for example, via Quick Path Interconnect or HyperTransport) with high bandwidth and low latency, combined with large processor caches, such as found in top-of-the-line server processors.
• Use high performance multi-core processors, like Intel Xeon X55xx/65xx/75xx and higher, AMD Opteron 24xx/84xx and higher or IBM POWER7 and higher. Provided there is enough memory available for full preloading, the processor clock frequency will directly determine the speed of address processing. See http://www.spec.org/cpu2006/results/rint2006.html for a comparison of integer processing throughput between different processor architectures.
• When running batch processes without a sufficient amount of memory installed, try to process records ordered by country, with intermittent re-initialization of Informatica AddressDoctor using the appropriate preloading settings (see chapter 5.34). The engine will also benefit from internal and OS caches for addresses sorted by country as compared to addresses in random order, as they would occur, for instance, in a Web Service environment.

Examples for typical performance-oriented settings:

                                              System 1  System 2  System 3  System 4  System 5
Given Resources
  Cores for Informatica AddressDoctor         1         2         4         6         12
  RAM for Informatica AddressDoctor (MB)      512       1024      2048      6000      16000
Chosen Settings
  MaxMemoryUsageMB                            450       950       1950      5950      15950
  CacheSize                                   SMALL     LARGE     LARGE     LARGE     LARGE
  MaxThreadCount                              1         2         4         6         12
  MaxAddressObjectCount*                      1         2         4         6         12
  PreloadingMethod                            MAP       MAP       MAP       MAP       MAP
  PreloadingType for very important countries PARTIAL   FULL      FULL      FULL      FULL
  PreloadingType for important countries      NONE      PARTIAL   PARTIAL   FULL      FULL
  PreloadingType for remaining countries      NONE      NONE      NONE      PARTIAL   FULL

* This setting depends on the implementation of the calling code. In some scenarios with double buffering, two times the given value may be used.

In reality the PreloadingType depends on the size of the databases, so for System 2 a FULL preload for a couple of countries may not be possible in case of large databases. The above examples System 1 and 2 are typical for 32-bit Java usage scenarios.
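The example settings above can be kept in a small lookup structure so that a deployment script can start from the closest matching column. The following Python sketch is illustrative only: `PROFILES`, `pick_profile` and `min_memory_mb` are hypothetical names, the RAM values are assumed to be in MB, and the chosen values still have to be written into your own SetConfig.xml by hand or by your own tooling. The second helper encodes the rule of thumb from the tips above (combined size of the most often used databases plus 256 MB).

```python
# One tuple per example system, in the column order of the table above.
_COLUMNS = ("cores", "ram_mb", "MaxMemoryUsageMB", "CacheSize",
            "MaxThreadCount", "MaxAddressObjectCount", "PreloadingMethod",
            "preload_very_important", "preload_important", "preload_remaining")

_ROWS = [
    (1, 512, 450, "SMALL", 1, 1, "MAP", "PARTIAL", "NONE", "NONE"),
    (2, 1024, 950, "LARGE", 2, 2, "MAP", "FULL", "PARTIAL", "NONE"),
    (4, 2048, 1950, "LARGE", 4, 4, "MAP", "FULL", "PARTIAL", "NONE"),
    (6, 6000, 5950, "LARGE", 6, 6, "MAP", "FULL", "FULL", "PARTIAL"),
    (12, 16000, 15950, "LARGE", 12, 12, "MAP", "FULL", "FULL", "FULL"),
]

# Ordered from the smallest to the largest example system.
PROFILES = [dict(zip(_COLUMNS, row)) for row in _ROWS]


def pick_profile(cores, ram_mb):
    """Return the largest example profile that fits the given resources.

    Returns None if even the smallest example system does not fit.
    """
    fitting = [p for p in PROFILES
               if p["cores"] <= cores and p["ram_mb"] <= ram_mb]
    return fitting[-1] if fitting else None


def min_memory_mb(frequent_db_sizes_mb):
    """Rule of thumb: most often used databases combined plus 256 MB."""
    return sum(frequent_db_sizes_mb) + 256


if __name__ == "__main__":
    print(pick_profile(cores=8, ram_mb=4096))
    print(min_memory_mb([300, 700]))
```

As noted for the table, the preloading types are only a starting point, since what can actually be preloaded depends on the size of the databases in use.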
When benchmarking Informatica AddressDoctor, consider the following: The operating system will have to read data from the hard disk if any of the databases used are not fully preloaded. These file system accesses are cached, at least until the OS file cache is full. As a result, physical hard disk accesses are always necessary for the first addresses of a specific country, while later on some or even all of these accesses will hit the file cache. For this reason, the processing speed for the first addresses (meaning the first few thousand) of a specific country is usually much lower than for later ones. It is therefore recommended to use at least 50,000 addresses per country to produce realistic benchmark results. If, on the other hand, all accessed databases are fully preloaded, speed is not expected to vary with the number of addresses already processed.

7. Demonstration Applications

The Informatica AddressDoctor package is accompanied by demonstration applications that can be used to quickly test the functionality of the library. See the ZIP archive structure described in chapter 3.2 for reference.

7.1 ConsoleDemo Application

The ConsoleDemo application, provided as source code under src and also contained as an executable in the bin directory, gives an overview of the basic address validation process. Before running the application, copy the example XML files from etc to your working directory, so that they can be found by the executable and edited for experimentation purposes. Specifically, a sample XML file InputData.xml containing an address for XML processing via ConsoleDemo -xml or ConsoleDemoJava -xml, respectively, is provided in etc.
Ensure that the minimal SetConfig configuration XML provided contains a valid Unlock Code that you received when purchasing Informatica AddressDoctor and the correct destination path your reference database files have been unpacked to (see chapters 6.4 and 6.5 for details).

Alternatively, make sure to copy (or link) at least the Swiss reference database (CHE5BI.MD, see chapter 3.3) to the working directory before running the ConsoleDemo executable. The ConsoleDemo application will attempt to validate a sample address from Switzerland that requires this database (otherwise, that example address will only be parsed).

Remember that the contents of the lib directory may have to be added to your shared library path (set PATH=%PATH%;.\lib on Windows or export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./lib on Unix) for an executable using the C-API to work.

For the Java-API simply call (for UNIX, see chapter 3.2.2 also):

java -Xss2048k -cp bin:lib/AddressDoctor5.jar -Djava.library.path=lib ConsoleDemoJava

And for Windows:

java -Xss2048k -cp bin;lib/AddressDoctor5.jar -Djava.library.path=lib ConsoleDemoJava

7.2 AddressCheck (Windows only)

Starting with Informatica AddressDoctor 5.1, the AddressCheck demonstration application featuring a Windows GUI is made available in binary form in the bin sub-directory (AD5_WIN_32 ZIP archive only). Note that AddressCheck requires installation of the Microsoft .NET Framework 2.0 or higher to run and is provided as is for experimentation purposes, without warranty or support of any kind.

Note that you will have to copy the AddressDoctor5.dll from lib and the SetConfig.xml example from etc/C to the bin directory for AddressCheck to be able to locate them (a corresponding AddressCheck.cfg that allows for configuring these paths is written upon the first successful initialization).
As for the ConsoleDemo, make sure that the minimal SetConfig configuration XML provided contains a valid Unlock Code that you received when purchasing Informatica AddressDoctor and the correct destination path your reference database files have been unpacked to (see the preceding chapter 7.1 for an example and chapters 6.4 and 6.5 for details). AddressCheck requires a "MaxAddressObjectCount" of at least 6.

For 64 Bit Windows systems, the following command might be needed for running AddressCheck:

corflags addresscheck.exe /32bit+

AddressCheck allows for interactive entry of fielded, partially fielded or unfielded address data (see chapter 6.7) and processing in different ProcessModes (see chapter 5.11), using the processing parameter settings (see chapters 5.12, 5.13 and 5.14) chosen via menus.

It may also be used for producing valid Informatica AddressDoctor InputData (hit the "Get XML" button on the "XML Input" tab after parsing or validating the address entered on one of the other tabs "Fielded Input", "Partially Fielded Input" or "Unfielded Input"), GetConfig, Parameters and Result XML files (see chapter 6.17 also) for submission to Informatica AddressDoctor Support (see chapter 9.3). The "Status Help" button is very useful for analyzing the Element Input and Result Status values (see chapter 5.27) after processing.

8. Sample Address Data for Testing

The following addresses are provided for you to test your implementation of Informatica AddressDoctor. For each address the status code values are provided and explained. Unless otherwise mentioned, the addresses have been processed using the validation process mode Suggestions (INTERACTIVE), which is explained in chapter 5.11.2. The data was input using the FormattedAddressLine element, see chapter 6.7.4. Further example address input and output may be found in chapters 4.1 and 6.15.
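The status values quoted for the sample addresses in this chapter can be collected into a small regression check for your own integration. The following Python sketch is illustrative only: the `validate` argument is a hypothetical, caller-supplied function that runs the address lines through your Informatica AddressDoctor integration (with the parameter settings described for each sample) and returns a dictionary with the three status keys used here. The split of each sample address into input lines is an assumption, and none of the names below are part of the AddressDoctor API.

```python
STATUS_KEYS = ("process_status", "element_input_status", "element_result_status")

# Expected values as quoted in chapters 8.1 and 8.2.
SAMPLES = [
    {"chapter": "8.1.1",
     "lines": ["VASAGATAN 22", "111 20 STOCKHOLM", "SVERIGE"],
     "process_status": "V4",
     "element_input_status": "60600060600000000060",
     "element_result_status": "F0F080F0F000000000E0"},
    {"chapter": "8.1.2",
     "lines": ["Prinzregentenstr. 93", "81677 Munich", "Germany"],
     "process_status": "V3",
     "element_input_status": "60500060600000000060",
     "element_result_status": "F0D080F0F000000000E0"},
    {"chapter": "8.2.1",
     "lines": ["2827 yonge street", "toronto on", "Canada"],
     "process_status": "C4",
     "element_input_status": "00606060600000000060",
     "element_result_status": "80F0F0F0F000000000E0"},
    {"chapter": "8.2.2",
     "lines": ["100 GOULD ST", "Neu York NY 10038", "United States"],
     "process_status": "C4",
     "element_input_status": "60406040600000000060",
     "element_result_status": "F870F870F000000000E0"},
]


def check_sample(sample, validate):
    """Return {key: (expected, actual)} for every status value that differs.

    An empty dictionary means the sample produced the documented values.
    """
    actual = validate(sample["lines"])
    return {key: (sample[key], actual.get(key))
            for key in STATUS_KEYS
            if actual.get(key) != sample[key]}
```

A call such as `check_sample(SAMPLES[0], my_validate)` returns an empty dictionary when the result matches the documentation. Differences may also stem from newer reference data rather than from an integration error, so treat mismatches as a prompt for inspection.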
8.1 Addresses with Status Code Vx

Addresses whose processing results in a status code of Vx were correct on input. Depending on other parameters, some minor standardizations may take place.

8.1.1 Correct Address

The following input address is entirely correct: the postal code is properly spaced and the address is also in the proper capitalization for a Swedish address. Because of this, no standardization will have to take place.

VASAGATAN 22
111 20 STOCKHOLM
SVERIGE

The ElementInputStatus (see chapter 5.27.2) would be: 60600060600000000060

With the PreferredScript parameter set to Latin Script (LATIN) and PreferredLanguage set to English (ENGLISH), see chapter 5.12.2, the result would be (process status value V4, see chapter 5.17):

Street: VASAGATAN
HouseNumber: 22
POBox:
Locality: STOCKHOLM
PostalCode: SE-111 20
Province: STOCKHOLMS LÄN
Country: SWEDEN

The ElementResultStatus (see chapter 5.27.3) would be: F0F080F0F000000000E0

8.1.2 Address with Exonym replaced

The following input address is written with the English exonym for München (Munich). Because this is a correct name for the city, the overall status value would now be V3.

Prinzregentenstr. 93
81677 Munich
Germany

The ElementInputStatus would be: 60500060600000000060

With the PreferredLanguage parameter set to English (ENGLISH) the result would be:

Street: Prinzregentenstr.
HouseNumber: 93
POBox:
Locality: Munich
PostalCode: 81677
Province: Bavaria
Country: GERMANY

The ElementResultStatus would be: F0D080F0F000000000E0

With the PreferredLanguage parameter set to the reference data standard (DATABASE) and CountryType set to "NAME_DE" the result would then be:

Street: Prinzregentenstr.
HouseNumber: 93
POBox:
Locality: München
PostalCode: 81677
Province: Bayern
Country: DEUTSCHLAND

The ElementResultStatus would still be: F0D080F0F000000000E0

8.2 Addresses with Status Code Cx

Addresses that Informatica AddressDoctor can automatically correct will result in a status code of Cx. This indicates that some address components were either missing or incorrect. The returned address can be used instead of the original input address.

8.2.1 Address with missing Postal Code

The following input address is basically correct, but it is missing the postal code. Informatica AddressDoctor automatically appends the correct postal code and returns a status value of C4 for the address:

2827 yonge street
toronto on
Canada

The ElementInputStatus would be: 00606060600000000060

With PreferredLanguage set to DATABASE the result would be:

Street: YONGE STREET
HouseNumber: 2827
POBox:
Locality: TORONTO
PostalCode: M4N 2J4
Province: ON
Country: CANADA

The ElementResultStatus would be: 80F0F0F0F000000000E0

8.2.2 Address with Misspellings in Street and City Name

The following input address is basically correct, but has misspellings in the street and city name. Informatica AddressDoctor automatically corrects these misspellings and returns a status value of C4 for the address.

100 GOULD ST
Neu York NY 10038
United States

The ElementInputStatus would be: 60406040600000000060

With PreferredLanguage set to DATABASE the result would be:

Street: GOLD ST
HouseNumber: 100
POBox:
Locality: NEW YORK
PostalCode: 10038-1605
Province: NY
Province Item 2 (County): NEW YORK
Country: UNITED STATES

The ElementResultStatus would be: F870F870F000000000E0

9. Miscellaneous Topics

This chapter lists various topics that have not been discussed before.
9.1 Background on the (Postal) Reference Database

To validate postal addresses, so-called postal reference data is required. This reference data is typically a collection of locality (city) names, streets, provinces, building numbers, postal codes (ZIP codes), and Post Office Box numbers.

Informatica AddressDoctor obtains this data from various sources around the world and updates it regularly. The specific update schedule for a country can be found on the Informatica AddressDoctor Web Site. Also available online is an interactive map that illustrates the latest updates and data coverage:

[Figure: world map showing, in blue, the countries that Informatica AddressDoctor supports for address validation.] An interactive version of this map is available on the Informatica AddressDoctor website.

The reference data is typically provided by postal organizations around the globe. The Informatica AddressDoctor development team checks each dataset and then transfers the data into a central data store. This master database is then used to create the postal reference data files in the Informatica AddressDoctor-proprietary, platform-independent (database) file format.

[Figure: reference data from individual countries (for example Argentina, Germany, Japan, and the United States) is collected in the central AddressDoctor data store and converted to the AddressDoctor file format.]

Each country has its own reference database. The databases follow a specific naming scheme that makes it easy to tell them apart:

XXX5BI.MD

XXX represents the ISO3 code of the country.
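The naming scheme above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Informatica AddressDoctor API: the helper names reference_db_filename and iso3_from_filename are hypothetical, and the sketch assumes that file names are written in upper case exactly as shown in the scheme.

```python
import re
from typing import Optional

# Assumption: file names follow "XXX5BI.MD" literally, with XXX being the
# upper-case three-letter ISO country code. Both helpers are hypothetical.
_ISO3 = re.compile(r"[A-Z]{3}")
_DB_NAME = re.compile(r"([A-Z]{3})5BI\.MD")


def reference_db_filename(iso3: str) -> str:
    """Build the reference database file name for a three-letter country code."""
    code = iso3.strip().upper()
    if not _ISO3.fullmatch(code):
        raise ValueError(f"not a three-letter country code: {iso3!r}")
    return f"{code}5BI.MD"


def iso3_from_filename(name: str) -> Optional[str]:
    """Return the country code encoded in a database file name, or None."""
    match = _DB_NAME.fullmatch(name)
    return match.group(1) if match else None
```

A helper like this can be handy for checking which country databases are present in an installation directory before initializing the engine.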
A list of these ISO3 codes can be found at the Informatica AddressDoctor website: http://www.addressdoctor.com/en/countries_data/isocodes.asp

The databases are self-contained and platform independent; that is, they can be used on Windows, Solaris, Unix, or Linux without changes. An external database system or run-time files are not required.

9.1.1 Database Format

The Informatica AddressDoctor postal reference database is a read-only file that stores the postal reference data and all required indexes for fast data access. The data conversion process to create this database format is very resource intensive and is performed on a cluster of high-speed computers. While the creation of the database is resource intensive, access to the data is very fast. The databases contain information for fuzzy (fault-tolerant) searching as well as for all process modes supported by Informatica AddressDoctor.

9.1.2 Database Size

The postal reference databases for all countries combined (without enrichments) require approximately 15 to 20 GB of storage space.

9.1.3 Database Updates

Updates to the postal databases are available regularly. Their frequency depends on the updates made available to Informatica AddressDoctor by the data providers. For some countries monthly updates are available, while others only have an irregular update frequency.

When the Informatica AddressDoctor team receives new data, it is first checked for accuracy and consistency. Then, the data is transferred into the central data store, where enrichment operations take place. Exonyms (alternate names) for places and streets are added, and indexes for fast access are stored in the database.

To replace a database with an updated version, simply copy the new database file over the existing file. While doing this, however, no application may be accessing the databases.
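The replacement step can be scripted. The following Python sketch is one possible approach under stated assumptions: instead of copying directly over the existing file, it copies the update to a temporary file in the same directory and then renames it over the old database, so that an interrupted copy does not leave a half-written database behind. The function name install_database_update is hypothetical, and the sketch does not check whether an application still has the databases open; that remains the caller's responsibility.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_database_update(new_file, db_dir):
    """Copy new_file into db_dir, replacing a database of the same name.

    Assumption: all applications using the databases have been stopped
    before this function is called.
    """
    new_file = Path(new_file)
    db_dir = Path(db_dir)
    target = db_dir / new_file.name

    # Stage the copy next to the target so the final rename stays on one volume.
    fd, tmp_name = tempfile.mkstemp(dir=db_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(new_file, tmp_name)
        os.replace(tmp_name, target)  # swap in the fully copied file
    finally:
        # Only present if the copy or rename failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
    return target
```

Staging the copy is a design choice of this sketch, not a requirement stated in this documentation; a plain copy over the existing file works as described above as long as nothing accesses the database during the copy.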
9.2 Postal Certifications

Some postal operators have instituted a certification process for software vendors. The certification ensures that the software conforms to the rules and regulations of a specific postal organization.

Depending on the intended use of the product, a certification might not produce the best results for poor input data. Certifications tend to be very strict, and their major goal is to keep improperly addressed mail from entering the postal system. The primary goal is not to improve all addresses that can possibly be corrected.

Informatica AddressDoctor Version 5 has been certified by USPS for CASS Cycle M and is regularly submitted to USPS for re-certification.

Informatica AddressDoctor Version 5 was certified by Canada Post for SERP in 2010 and is regularly submitted to Canada Post for re-certification.

In 2011, the engine was certified for the AMAS Cycle 2011 and is regularly submitted to Australia Post for re-certification.

To process addresses according to the specific rules defined by postal organizations, a special process mode is available (process mode CERTIFIED, see chapter 5.11.5).

In 2012, the engine was certified for the France (SNA) and New Zealand (SendRight) certifications.

In 2013, the engine was re-certified for the SendRight Cycle 2014.

In 2013, the engine was re-certified for the SERP Cycle 2014.

The current engine meets the following certification cycles:

- AMAS Cycle 2013
- CASS Cycle N
- SERP Cycle 2014
- SendRight Cycle 2014
- SNA (HEXAPOSTE, HEXAVIA)

Note: The Statement of Accuracy (SOA) ID has changed from “ADR13_xxxxxxxx” to “ADR14_xxxxxxxx”, where xxxxxxxx is the unique identifier for the SOA.
9.3 Support Information

You may contact Informatica AddressDoctor Support at: [email protected]

When doing so, make sure to provide the following four XML files (see chapters 6.6 and 6.12 for more details and 10.1 for the corresponding DTDs) in a ZIP archive, after having run them through the ConsoleDemo application (see chapter 7.1) provided by Informatica AddressDoctor to check for reproducibility of your issue:

- SetConfig.xml - may be retrieved using AD_GetConfigSettingsXML() (in Java: getConfigXML())
- Parameters.xml - may be retrieved using AD_GetParametersXML() (in Java: getParametersXML())
- InputData.xml - may be retrieved using AD_GetInputDataXML() (in Java: getInputDataXML())
- Result.xml - may be retrieved using AD_GetResultXML() (in Java: getResultXML())

These XML files provide Informatica AddressDoctor support with a basic set of information, such as the software library and reference database versions as well as the parameter settings used to process the problematic input address.

Additionally, the following information will be needed to assist with your problem:

- Platform version and patch level that Informatica AddressDoctor runs on (for supported platforms see chapter 2.2), including bitness (32 or 64 bit).
- In case of Java: JDK version and the parameters used to initialize the JVM (for example, -Xmx for maximum heap, see chapter 3.2.2 for examples). Additionally, the Java stack trace (in case of a crash).
- A detailed description of the steps required to trigger the problem.
- In case of a crash that could not be reproduced using the Informatica AddressDoctor ConsoleDemo, a compact binary test application that actually triggers the crash.
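Packaging the four XML files can be automated. The following Python sketch assumes that the files have already been written to a single directory under the names listed above; the function name build_support_archive is hypothetical and not part of the product. It only uses the standard zipfile module.

```python
import zipfile
from pathlib import Path

# File names as listed in the support instructions above.
SUPPORT_FILES = ("SetConfig.xml", "Parameters.xml", "InputData.xml", "Result.xml")


def build_support_archive(source_dir, archive_path):
    """Bundle the four support XML files from source_dir into one ZIP archive."""
    source_dir = Path(source_dir)
    missing = [name for name in SUPPORT_FILES if not (source_dir / name).is_file()]
    if missing:
        # Refuse to build an incomplete archive.
        raise FileNotFoundError("missing support files: " + ", ".join(missing))
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in SUPPORT_FILES:
            archive.write(source_dir / name, arcname=name)
    return Path(archive_path)
```

Failing early on a missing file avoids sending an archive that support cannot use to reproduce the issue.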
A constantly updated list of frequently asked questions (FAQ) can be found on the Informatica AddressDoctor Web Site at: http://www.addressdoctor.com/en/support/FAQ

9.4 Recommended Database Layout for International Addresses

Postal addresses come in numerous varieties around the world. The formats vary in the placement of postal codes, the placement of building numbers, the usage of provinces, and the length of address elements.

Informatica AddressDoctor recommends using just one database layout to store addresses from all countries of the world. The fields of the proposed format are then mapped to the various elements that appear in different countries.

As an example, the United States has states, while Canada has provinces. Japan is divided into prefectures and Switzerland into cantons. Instead of having separate fields for each, Informatica AddressDoctor maps all of these subdivisions to the “province” field. This mapping is done for all address elements that can be represented in an AddressObject (see chapter 5.7). Thus, we recommend storing addresses in a format amenable to AddressObject mapping.

As business and consumer addresses vary in the information they require, some fields are not required for consumer addresses. Note that all fields are of type character to allow for any combination of numeric and alpha content.

Field name          Field length (min.)   Content
Organization        50                    Company or Organization name including a company type descriptor such as Inc., AG, or GmbH
Department          50                    Department or Mail Stop information
Function            60                    Function of the contact
Gender              1                     Gender of the contact
FirstName           40                    First name of the contact
MiddleName          40                    Middle name of the contact
LastName            50                    Last name of the contact
Building            50                    Building name. Frequently used in the United Kingdom
Subbuilding_1       50                    Information that further subdivides a Building, for example, the floor.
Subbuilding_2       50                    Information that further subdivides a Building, for example, the suite or apartment number.
Street_1            50                    Name of the street or thoroughfare
Street_2            50                    Dependent street or thoroughfare
Number_1            15                    Number of a Building/House in a street. Placement varies by country.
Number_2            15                    Number of a Building/House in a dependent street. Placement varies by country.
DeliveryService_1   50                    Code of the respective post office in charge of delivery.
DeliveryService_2   50                    Post Box descriptor (POBox, Postfach, Case Postale, and so on) and number.
Locality_1          50                    Primary place name. Typically a “province” is subdivided into localities. Some countries may contain yet another hierarchy level for subdividing provinces. Examples are counties in the US and Kreise in Germany.
Locality_2          50                    Dependent place name that further subdivides a Locality. Examples are colonias in Mexico and urbanizaciones in Spain.
Locality_3          50                    Dependent place name that further subdivides a Locality. An example would be Mahalle in Turkey.
SortingCode         10                    Speeds up delivery in certain countries for large localities, for example Prague or Dublin.
PostalCode          10                    Postal code or ZIP code.
Province_1          50                    State, province, canton, prefecture, or other subdivision of a country.
Province_2          50                    Dependent province information that further subdivides a province. An example would be a US county.
CountryName         50                    Optionally needed if required for display. It is recommended to just store the ISO code so that the country name can be displayed in any language.
CountryISO          3                     ISO alpha3 code according to ISO 3166. Can be used to generate the name of a country in any language.

Storing data in the format suggested above is of major benefit when using Informatica AddressDoctor functionality for automatically generating addresses for printing and display.
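As an illustration, the recommended layout can be turned into a table definition. The Python sketch below generates a CREATE TABLE statement from the field names and minimum lengths in the table above and runs it against an in-memory SQLite database. The table name address, the VARCHAR column type, and the function name create_table_sql are assumptions made for this example; the documentation only prescribes character fields with the given minimum lengths.

```python
import sqlite3

# (field name, minimum field length) as given in the recommended layout.
ADDRESS_FIELDS = [
    ("Organization", 50), ("Department", 50), ("Function", 60), ("Gender", 1),
    ("FirstName", 40), ("MiddleName", 40), ("LastName", 50), ("Building", 50),
    ("Subbuilding_1", 50), ("Subbuilding_2", 50), ("Street_1", 50),
    ("Street_2", 50), ("Number_1", 15), ("Number_2", 15),
    ("DeliveryService_1", 50), ("DeliveryService_2", 50), ("Locality_1", 50),
    ("Locality_2", 50), ("Locality_3", 50), ("SortingCode", 10),
    ("PostalCode", 10), ("Province_1", 50), ("Province_2", 50),
    ("CountryName", 50), ("CountryISO", 3),
]


def create_table_sql(table="address"):
    """Return a CREATE TABLE statement with one character column per field."""
    columns = ",\n  ".join(
        f"{name} VARCHAR({length})" for name, length in ADDRESS_FIELDS
    )
    return f"CREATE TABLE {table} (\n  {columns}\n)"


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    connection.execute(create_table_sql())
    print(create_table_sql())
```

Because the lengths are minimums, a production schema may well use wider columns; SQLite in particular does not enforce the declared length at all.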
10. Appendix

10.1 API Document Type Definitions

The following accompanying DTD files are provided in the documentation package (see chapter 3.1.2):

- SetConfig.dtd - Configuration settings passed with AD_Initialize()
- Parameters.dtd - Parameters passed with AD_SetParametersXML() (or AD_Initialize() with SetConfig)
- InputData.dtd - Structure of data input as XML, using AD_SetInputDataXML() and AD_GetInputDataXML()
- Result.dtd - Structure of the XML result from AD_GetResultXML()
- GetConfig.dtd - Structure of the XML result from AD_GetConfigSettingsXML()

IMPORTANT Notice: As Informatica AddressDoctor sees the DTDs referred to here as part of the API definition, where disruptive changes must be minimized, these files may at any given time contain elements without apparent functionality. Refer to chapters 5 and 6 of this document to understand what functionality is actually available in Informatica AddressDoctor 5 and how it should be used. Simply relying on the DTDs with their comments will not suffice as a basis for a successful integration of Informatica AddressDoctor.

10.2 API Reference

For details on the available function calls and parameters, see the accompanying Application Programming Interface Reference provided in HTML format for C and Java as part of the documentation package (see chapter 3.1.2). Again, the API Reference should not be used without prior consultation of this documentation, as explained in the notice in chapter 10.1.

10.3 Schematic Representation of Informatica AddressDoctor Processing Flow

[Figure: schematic processing flow from Address Input to Address Output. Stages shown: Plausibility Check, Tokenization Step 1, Country Detection, Parsing (Tokenization Step 2 and country-specific parsers: Parser 1 USA, Parser 2 JPN, Parser 3 CHN, Parser n XXX), Validation (country-specific: Validate 1 USA, Validate 2 JPN, Validate 3 CHN, Validate n XXX), Normalization, Transliteration, Formatting, Standardization (Truncation/Casing), Enrichment (Geo Coding, CASS, ...). Further labels in the diagram: Enrichment OFF, PARSE, COUNTRYRECOGNITION.]

10.4 AddressElement Output Examples

To view the international address formats, see the International Address Formats page under the Countries and Data section on the Informatica AddressDoctor website at http://www.addressdoctor.com/en/countries-data/address-formats.html.

10.5 Province Output

When retrieving the Province of a validated or corrected address from the AddressObject, you will receive either the Province name or the Abbreviation, according to the postal rules of the respective country. The following table shows what is returned for a specific country:

ISO Code   Province output form
ABW        Province name
AFG        Province name
AGO        Province name
AIA        Province name
ALB        Province name
AND        Province name
ARE        Province name
ARG        Province name
ARM        Province name
ATA        Province name
ATG        Province name
AUS        Abbreviation
AUT        Province name
AZE        Province name
BDI        Province name
BEL        Province name
BEN        Province name
BES        Province name
BFA        Province name
BGD        Province name
BGR        Province name
BHR        Province name
BHS        Province name
BIH        Province name
BLR        Province name
BLZ        Province name
BMU        Province name
BOL        Province name
BRA        Abbreviation
BRB        Province name
BRN        Province name
BTN        Province name
BWA        Province name
CAF        Province name
CAN        Abbreviation
CHE        Abbreviation
CHL        Province name
CHN        Province name
CIV        Province name
CMR        Province name
COD        Province name
COG        Province name
COK        Province name
COL        Province name
COM        Province name
CPV        Province name
CRI        Province name
CUB        Province name
CUW        Province name
CYM        Province name
CYP        Province name
CZE        Province name
DEU        Province name
DJI        Province name
DMA        Province name
DNK        Province name
DOM        Province name
DZA        Province name
ECU        Province name
EGY        Province name
ERI        Province name
ESH        Province name
ESP        Province name
EST        Province name
ETH        Province name
FIN        Province name
FJI        Province name
FLK        Province name
FRA        Province name
FRO        Province name
GAB        Province name
GBR        Province name
GEO        Province name
GHA        Province name
GIB        Province name
GIN        Province name
GMB        Province name
GNB        Province name
GNQ        Province name
GRC        Province name
GRD        Province name
GRL        Province name
GTM        Province name
GUY        Province name
HKG        Province name
HND        Province name
HRV        Province name
HTI        Province name
HUN        Province name
IDN        Province name
IND        Province name
IOT        Province name
IRL        Province name
IRN        Province name
IRQ        Province name
ISL        Province name
ISR        Province name
ITA        Abbreviation
JAM        Province name
JOR        Province name
JPN        Province name
KAZ        Province name
KEN        Province name
KGZ        Province name
KHM        Province name
KIR        Province name
KNA        Province name
KOR        Province name
KWT        Province name
LAO        Province name
LBN        Province name
LBR        Province name
LBY        Province name
LCA        Province name
LIE        Province name
LKA        Province name
LSO        Province name
LTU        Province name
LUX        Province name
LVA        Province name
MAR        Province name
MCO        Province name
MDA        Province name
MDG        Province name
MDV        Province name
MEX        Abbreviation
MKD        Province name
MLI        Province name
MLT        Province name
MMR        Province name
MNE        Province name
MNG        Province name
MOZ        Province name
MRT        Province name
MSR        Province name
MUS        Province name
MWI        Province name
MYS        Province name
NAM        Province name
NER        Province name
NFK        Province name
NGA        Province name
NIC        Province name
NIU        Province name
NLD        Province name
NOR        Province name
NPL        Province name
NRU        Province name
NZL        Province name
OMN        Province name
PAK        Province name
PAN        Province name
PCN        Province name
PER        Province name
PHL        Province name
PNG        Province name
POL        Province name
PRK        Province name
PRT        Province name
PRY        Province name
QAT        Province name
ROU        Province name
RUS        Province name
RWA        Province name
SAU        Province name
SDN        Province name
SEN        Province name
SGP        Province name
SGS        Province name
SHN        Province name
SLB        Province name
SLE        Province name
SLV        Province name
SMR        Province name
SOM        Province name
SRB        Province name
SSD        Province name
STP        Province name
SUR        Province name
SVK        Province name
SVN        Province name
SWE        Province name
SWZ        Province name
SXM        Province name
SYC        Province name
SYR        Province name
TCA        Province name
TCD        Province name
TGO        Province name
THA        Province name
TJK        Province name
TKL        Province name
TKM        Province name
TON        Province name
TTO        Province name
TUN        Province name
TUR        Province name
TUV        Province name
TZA        Province name
UGA        Province name
UKR        Province name
URY        Province name
USA        Abbreviation
UZB        Province name
VAT        Province name
VCT        Province name
VEN        Province name
VGB        Province name
VNM        Province name
VUT        Province name
WSM        Province name
YEM        Province name
ZAF        Province name
ZMB        Province name
ZWE        Province name

10.6 Reference Data Copyright Notices

Australia: Copyright 2009. Based on data provided under license from PSMA Australia Limited (www.psma.com.au).

Canada: In case Licensed User licensed the Canadian reference database, it contains Postal Code OM data copied under license from Canada Post Corporation. The Canada Post Corporation file from which this data was copied is from the most current data available from Canada Post Corporation at the time Informatica AddressDoctor made the data available to Licensed User respective Integrator.

Great Britain: You are receiving or have received information which is derived from databases (or parts or extracts thereof) of which Royal Mail is the owner or creator, or otherwise authorised to use (the "Data"). Royal Mail owns, or is licensed, all Intellectual Property Rights which subsist in and/or relate to that Data from time to time.
You must not at any time copy, reproduce, publish, sell, let, lend, extract, reutilise or otherwise part with possession or control of or relay or disseminate any part of this information or use it for any purpose other than your own private or internal use.

New Zealand: The address data within the PAF is sourced from New Zealand Post, Land Information New Zealand and the Crown. New Zealand Post and Crown copyright reserved.

United States of America: © United States Postal Service® 2009. Prices are not established, controlled or approved by the United States Postal Service®. The following trademarks and registrations are owned by the USPS®: CASS Certified™, CASS™, DPV™, United States Postal Service®, USPS®, ZIP + 4®, ZIP Code™, ZIP™

Geocodes: The data (“Data”) is provided for your personal, internal use only and not for resale. It is protected by copyright, and is subject to the terms and conditions which are agreed to by you, on the one hand, and Informatica AddressDoctor (“AddressDoctor”) and its licensors (including their licensors and suppliers) on the other hand.

© 2009 NAVTEQ. All rights reserved.

The Data for areas of Canada includes information taken with permission from Canadian authorities, including: © Her Majesty the Queen in Right of Canada, © Queen's Printer for Ontario, © Canada Post Corporation, GeoBase®, © Department of Natural Resources Canada. All rights reserved.

NAVTEQ holds a non-exclusive license from the United States Postal Service® to publish and sell ZIP+4® information. © United States Postal Service® 2009. Prices are not established, controlled or approved by the United States Postal Service®. The following trademarks and registrations are owned by the USPS: United States Postal Service, USPS, and ZIP+4.

Data for Europe and World Markets:

Territory: Notice
Australia: “Copyright. Based on data provided under license from PSMA Australia Limited (www.psma.com.au).”
Austria: “© Bundesamt für Eich- und Vermessungswesen”
Croatia, Cyprus, Estonia, Latvia, Lithuania, Moldova, Poland, Slovenia & Ukraine: “© EuroGeographics”
France: “Source: © IGN 2009 – BD TOPO ®”
Germany: “Die Grundlagendaten wurden mit Genehmigung der zuständigen Behörden entnommen”
Great Britain: “Based upon Crown Copyright material.”
Greece: “Copyright Geomatics Ltd.”
Hungary: “Copyright © 2003; Top-Map Ltd.”
Italy: “La Banca Dati Italiana è stata prodotta usando quale riferimento anche cartografia numerica ed al tratto prodotta e fornita dalla Regione Toscana.”
Jordan: “© Royal Jordanian Geographic Centre”
Norway: “Copyright © 2000; Norwegian Mapping Authority”
Portugal: “Source: IgeoE – Portugal”
Spain: “Información geográfica propiedad del CNIG”
Sweden: “Based upon electronic data: © National Land Survey Sweden.”
Switzerland: “Topografische Grundlage: © Bundesamt für Landestopographie.”

11. Glossary

11.1.1 ISO Country Codes

The international standard for country codes, ISO 3166, is one of the most widely used standards maintained by ISO TC 46. It provides standard numeric, 2-letter, and 3-letter alphabetic codes for 240 countries or areas of special sovereignty. First released in 1974, ISO 3166 has grown to encompass three parts, including two new sections on codes for subdivisions (states, regions, major cities, and so on) and a listing of retired codes. For a list, see http://www.addressdoctor.com/en/countries_data/isocodes.asp

11.1.2 Normalization

Normalization refers to the consolidation of address element descriptors, for example, treating “Street”, “ST” and “St.” all as equivalent to “St.” On the one hand, it is applied in an internal step before validation to aid in matching. On the other hand, normalization is used to produce address element output that meets the postal regulations of each country.
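The idea of consolidating descriptors can be illustrated with a small Python sketch. It is a toy model only: the lookup table, the function names, and the rule of normalizing just the last token of a street string are assumptions made for this example, whereas the normalization performed inside Informatica AddressDoctor is country specific and driven by reference data.

```python
# Hypothetical descriptor table: every known spelling maps to one output form.
STREET_DESCRIPTORS = {
    "street": "St.",
    "st": "St.",
    "st.": "St.",
}


def normalize_descriptor(token: str) -> str:
    """Map a descriptor spelling to its consolidated form; pass others through."""
    return STREET_DESCRIPTORS.get(token.strip().lower(), token)


def normalize_street(street: str) -> str:
    """Normalize a trailing descriptor, as in English-style street names."""
    parts = street.split()
    if parts:
        parts[-1] = normalize_descriptor(parts[-1])
    return " ".join(parts)
```

With this table, "Gold Street", "Gold ST", and "Gold St." all compare equal after normalization, which is the matching aid the definition above describes.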
11.1.3 Parsing

Parsing is the capability to split an unstructured address string into meaningful entities. That means an unstructured address such as

AddressDoctor GmbH Röntgenstr. 9 D-67133 Maxdorf

would be split into

Company: AddressDoctor GmbH
Street: Röntgenstr.
Number: 9
Postal Code: 67133
Locality: Maxdorf
Country: Germany

Parsing can also be used to rearrange incorrectly fielded data.

11.1.4 Romanization

Romanization is a method of using letters of the Roman alphabet (ABCD...) to recreate the sounds of a language whose writing system may or may not use the Roman alphabet. A Chinese Hanzi romanization system would thus be a method of using the Roman alphabet to pronounce Chinese Hanzi characters.

11.1.5 Standardization

Postal regulations and target database standards require address element length and casing to be adjusted or standardized.

11.1.6 Tokenization

Address input needs to be separated into tokens so that these can be mapped to address elements/items.

11.1.7 Transformation

Transformation is the process of changing one character into other characters of the same character set. A string like 'Änderung' would be transformed using the Transform method to, for example, an HTML-encoded version of this string: '&Auml;nderung'.

11.1.8 Transcription/Transliteration

Transcription/transliteration is the process of changing one character of one character set into other characters of another character set, such as converting from Greek to Latin, or Japanese Katakana to Latin. This conversion makes use of either the sound of the character or the spelling.

11.1.9 Unified Ideographs

Unified ideographs are characters of the CJK writing system. They consist of Chinese Hanzi, Japanese Kanji, and Korean Hanja.

11.1.10 Validation

Validation is the process of checking individual address elements against postal reference data. The validation process will, for instance, verify whether a postal code or a locality exists. The validation process will also check whether a street name is spelled properly and whether the given postal code and locality combination is correct for the building number provided for this street.