Abstract
Entity resolution (ER) aims to identify records from different sources that refer to the same real-world entity. Most traditional ER studies work on string-based data, so they rely mainly on string comparison techniques, and little research addresses numeric data. Traditional ER approaches are widely used in domains such as publications, gene sequencing and restaurant listings, but they have not been applied to earthquake disaster data. In this paper, earthquake disaster event information collected from different websites is represented as numeric data. To address ER on numeric data, we conduct experiments with three methods. First, we treat numbers as strings and apply string-based approaches. Second, we use the Euclidean distance to measure the difference between two records. Third, we combine these two strategies into a comprehensive approach for measuring the distance between two records. We evaluate the methods experimentally on real datasets of earthquake disaster event information. The experimental results show that the comprehensive approach can achieve high performance.
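The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the string measure, the record fields, or how the two distances are combined. The sketch assumes Levenshtein edit distance as the string measure, hypothetical records of the form (latitude, longitude, magnitude), and a weighted sum with an assumed weight `alpha` and an assumed normalization constant `scale`.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_distance(r1, r2) -> float:
    """Strategy 1: treat each number as a string.

    Returns the mean edit distance over the fields, each normalized by the
    longer string's length so that every field contributes a value in [0, 1].
    """
    total = 0.0
    for x, y in zip(r1, r2):
        sx, sy = str(x), str(y)
        total += levenshtein(sx, sy) / max(len(sx), len(sy), 1)
    return total / len(r1)


def euclidean_distance(r1, r2) -> float:
    """Strategy 2: Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))


def combined_distance(r1, r2, alpha: float = 0.5, scale: float = 1.0) -> float:
    """Strategy 3 (assumed form): weighted sum of the two distances.

    The Euclidean distance d is squashed into [0, 1) as d / (d + scale) so
    that it is comparable with the normalized string distance. Both `alpha`
    and `scale` are illustrative assumptions, not values from the paper.
    """
    d = euclidean_distance(r1, r2)
    return alpha * string_distance(r1, r2) + (1 - alpha) * d / (d + scale)


# Hypothetical records: (latitude, longitude, magnitude).
site_a = (35.70, 140.50, 6.1)
site_b = (35.71, 140.52, 6.0)   # plausibly the same event, reported elsewhere
other = (10.20, -60.30, 4.5)    # clearly a different event

print(combined_distance(site_a, site_b))
print(combined_distance(site_a, other))
```

In this sketch a smaller combined distance suggests that two records describe the same event; a matching threshold would have to be tuned on labeled data, which the abstract does not describe.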
Acknowledgements
The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the article. This research was supported by the National Key Research and Development Program of China (2016YFB0501504).
Cite this article
Tian, C., Li, G. Data integration for earthquake disaster using real-world data. Acta Geophys. 68, 19–28 (2020). https://doi.org/10.1007/s11600-019-00381-4
DOI: https://doi.org/10.1007/s11600-019-00381-4