Data integration for earthquake disaster using real-world data

  • Research Article - Solid Earth Sciences
Acta Geophysica

Abstract

The purpose of entity resolution (ER) is to identify records from different sources that refer to the same real-world entity. Most traditional ER studies work with string-based data, so the ER problem relies mostly on string comparison techniques, and there is little research on numeric data. Traditional ER approaches are widely used in many domains, such as papers, gene sequencing and restaurants, but they have not been applied to earthquake disasters. In this paper, earthquake disaster event information collected from different websites is represented as numeric data. To solve the problem of ER in numeric data, we conduct experiments with three methods. First, we treat numbers as strings and use string-based approaches. Second, we use the Euclidean distance to measure the difference between two records. Third, we combine these two strategies into a comprehensive approach for measuring the distance between two records. We experimentally evaluate our methods on real datasets that represent earthquake disaster event information. The experimental results show that the comprehensive approach can achieve high performance.
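The three strategies named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the exact string measure, the field normalization, or how the two distances are combined, so everything below is an assumption made for illustration: the record layout (latitude, longitude, depth in km, magnitude), the use of normalized Levenshtein distance, the per-field `scales`, the weight `alpha`, and the sample values are all hypothetical and are not taken from the paper.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_distance(r1, r2) -> float:
    """Strategy 1 (assumed form): treat each number as a string and
    average the length-normalized edit distance over all fields."""
    total = 0.0
    for x, y in zip(r1, r2):
        sx, sy = str(x), str(y)
        total += levenshtein(sx, sy) / max(len(sx), len(sy))
    return total / len(r1)


def euclidean_distance(r1, r2, scales) -> float:
    """Strategy 2 (assumed form): Euclidean distance after dividing each
    field difference by a hypothetical per-field scale."""
    return math.sqrt(sum(((x - y) / s) ** 2
                         for x, y, s in zip(r1, r2, scales)))


def combined_distance(r1, r2, scales, alpha=0.5) -> float:
    """Strategy 3 (assumed form): weighted sum of the two distances.
    The weight alpha is an illustrative choice, not the paper's."""
    return (alpha * string_distance(r1, r2)
            + (1 - alpha) * euclidean_distance(r1, r2, scales))


# Hypothetical records: (latitude, longitude, depth_km, magnitude).
scales = (180.0, 360.0, 700.0, 10.0)  # assumed field ranges
a = (35.70, 140.20, 10.0, 6.1)
b = (35.71, 140.19, 10.0, 6.0)   # plausibly the same event as a
c = (-20.5, 70.3, 35.0, 5.2)     # a clearly different event

print(combined_distance(a, b, scales))
print(combined_distance(a, c, scales))
```

In this sketch, two reports of the same event score a smaller combined distance than reports of unrelated events; a real system would still need a decision threshold and field scales chosen from the data.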



Acknowledgements

The authors thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the article. This research was supported by the National Key Research and Development Program of China (2016YFB0501504).

Author information

Corresponding author

Correspondence to Guoqing Li.

About this article

Cite this article

Tian, C., Li, G. Data integration for earthquake disaster using real-world data. Acta Geophys. 68, 19–28 (2020). https://doi.org/10.1007/s11600-019-00381-4

  • DOI: https://doi.org/10.1007/s11600-019-00381-4