Data Analysis: Data Cleansing

Data cleansing is an issue of critical importance in data mining: it preserves the quality of the data on which mining is performed. An important subtask of data cleansing is duplicate detection. Duplication occurs when a real-world object has multiple representations in a data source. A lot of work has been done on duplicate detection in structured data such as relational databases, but only recently has the focus shifted to duplicate detection in hierarchical and semi-structured data such as XML. In this paper we provide an overview of different approaches to duplicate detection in XML.
Keywords: Data cleansing, duplicate detection, XML, data mining, hierarchical data
1. Introduction
Data mining has become an important process in many business applications and decisions, so preserving the quality of the data on which mining is performed is essential. Data cleansing is therefore a task of critical importance for effective as well as efficient data mining. Duplicate detection is a subtask of data cleansing. Duplicates are multiple representations of the same real-world object in data sources; they may arise for various reasons, such as typographical errors or inconsistent representations of the same real-world object. The problem of duplicate detection has been studied comprehensively for structured data such as relational databases, and a great deal of work has been done there. But this work cannot be applied directly to hierarchical and semi-structured data such as XML.
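To make the notion of duplicates concrete, the following is a minimal sketch of pairwise duplicate detection over flat XML records, flagging two records as duplicates when their averaged per-field string similarity exceeds a threshold. The element names (`person`, `name`, `city`), the sample data, and the 0.8 threshold are illustrative assumptions, not taken from the approaches surveyed in this paper.

```python
# Illustrative sketch: detect near-duplicate XML records by averaging
# per-field string similarity. Element names, data, and the threshold
# are assumed for the example.
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

XML = """<persons>
  <person><name>John Smith</name><city>New York</city></person>
  <person><name>Jon Smith</name><city>New York</city></person>
  <person><name>Alice Jones</name><city>Boston</city></person>
</persons>"""

def record(elem):
    # Flatten one <person> element into a {tag: normalized text} dict.
    return {child.tag: (child.text or "").strip().lower() for child in elem}

def similarity(a, b):
    # Average string similarity over the fields both records share.
    fields = set(a) & set(b)
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)

def find_duplicates(root, threshold=0.8):
    # Naive O(n^2) pairwise comparison; real systems prune this space.
    recs = [record(p) for p in root]
    return [(i, j)
            for i in range(len(recs))
            for j in range(i + 1, len(recs))
            if similarity(recs[i], recs[j]) >= threshold]

root = ET.fromstring(XML)
print(find_duplicates(root))  # → [(0, 1)]: "John Smith" and "Jon Smith" are near-identical
```

Note that this flat comparison ignores nesting, which is exactly why techniques designed for relational tuples do not transfer directly to XML, where the same object may differ in element structure as well as in field values.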