# Data Mining Fundamentals

2140 Words Jan 10th, 2012 9 Pages
Data Mining

DM Defined Is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Process of analyzing data from different perspectives and summarizing it into useful information

A class of database applications that look for hidden patterns in a group of data that can be used to predict future behavior.
DM Defined The relationships and summaries derived are referred to as models or patterns. Examples include linear equations, rules, clusters, graphs, tree structures and recurrent patterns in time series.

Utilizes observational data as opposed to experimental data. Data that have already
…show more content…
• Analyze the data by application software.

• Present the data in a useful format, such as a graph or table.
Seeking relationships
The process should aim at accurate, convenient and useful summaries.

The steps are as follows:

• Determine the nature and structure of the representation to be used

• Decide how to quantify and compare how well different representations fit the data (choosing the score function)

• Choose an algorithmic process to optimize the score function

• Decide what principles of data management are required to implement the algorithm efficiently.
Seeking relationships: example In simple regression, one can build a predictive model to relate the predictor variable, X to a response variable Y through a relationship Y = aX + b. e.g. we might build a model which would allow us to predict a persona’s annual credit card spending given their annual income. Not perfect but good for a rough characterization.

Step wise:
• The representation is a model in which the response variable, spending, is linearly related to the predictor variable, income.
• The score function: the sum of squared discrepancies between predicted spending and observed spending in group of people described by the data
• The optimization algorithm is quite simple: a and b are expressed as explicit functions of the observed values of spending and income
• Unless the data set is