Title: Chinese address standardisation via hybrid approach combining statistical and rule-based methods
Authors: Xi Chen; Cheng Fang; Jasmine Chang; Yanjiang Yang; Yuan Hong; Haibing Lu
Addresses: GEIRI North America, 250 W Tasman Dr., San Jose, CA 95134, USA; Zhejiang University of Finance and Economics, Qiushi Rd, Xihu Qu, Hangzhou Shi, Zhejiang Sheng 310018, China; Rutgers University, 195 University Ave, Newark, NJ 07102, USA; Singapore Research Center, Huawei, Singapore; Illinois Institute of Technology, 10 W 35th St, Chicago, IL 60616, USA; Santa Clara University, 500 El Camino Real, Santa Clara, CA 95053, USA
Abstract: This paper is derived from a research project on cleansing customer address data for the State Grid Corporation of China (SGCC), which is the largest electric utility company in the world and was ranked 2nd in the 2016 Fortune Global 500. Address standardisation involves developing a standard address format to support data integration, de-duplication, and automatic address correction/completion, and is widely considered a very challenging data cleansing task. It is critical for routine business tasks, customer relationship management, and business intelligence in customer-oriented corporations, among other uses. Address standardisation is particularly difficult for the Chinese language. The underlying reasons include: 1) the address standard currently in place in China is only realised at the city/town level; 2) for a number of reasons, many hand-written addresses are incomplete or contain errors; 3) Chinese is difficult to process by machine because of the characteristics of the language. To tackle these challenges, we propose a hybrid approach combining statistical and rule-based methods, which are the two mainstream address standardisation approaches. Our hybrid approach utilises the merits of both methods and can complete the address standardisation task with little human effort and computational time, while achieving high accuracy.
Keywords: natural language processing; Chinese address; machine learning; rule-based method.
International Journal of Internet and Enterprise Management, 2019 Vol.9 No.2, pp.179 - 193
Received: 27 Feb 2019
Accepted: 01 Jul 2019
Published online: 22 Oct 2019