1. Abstract

As the need for data analysis grows and data becomes easier to procure, the datasets used for analytics keep getting larger. The primary problem is that big data cannot simply be uploaded and aggregated through full table scans, because doing so takes a prohibitive amount of time. Big data therefore needs to be pre-processed before it is uploaded to the analysis box. The aim of this project is to study the data pre-processing techniques used in practice in different domains to deal with big data, gain insight into their merits and demerits, and gather information on the popularity of each. The appendix also includes an implementation of the Metropolis-Hastings algorithm, a highly popular and efficient technique.

2. Introduction

Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the objective of finding useful information, informing conclusions, and supporting decision-making. It has multiple facets and approaches, encompassing diverse techniques under a variety of names, in fields such as business, science, and social science.
Data-gathering methods are sometimes loosely controlled, leading to out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can give…
