Sample bartleby Q&A Solution
Question

What is the difference between categorical and numerical data types and how does it affect modeling in business analytics?

Expert Answer

Categorical data are data that are usually non-numerical in nature, such as names, addresses and marital status. Some seemingly numerical data also belong to the categorical type, such as phone numbers, zip codes and shoe sizes. Strictly speaking, there are two kinds of categorical data: nominal and ordinal. Nominal data are labels, such as name, address, phone number or employee number. Little numerical analysis is possible on nominal data beyond frequency counting (including the mode) and certain non-parametric statistics. For example, it makes no sense to compute the average employee ID or the average phone number. Categorical data can also be ordered or ranked, such as shoe sizes and response scales in survey questionnaires. These are ordinal data. The distance between consecutive ranks is not necessarily the same. Frequency counting, median computations and certain rank-based non-parametric statistics are possible with ordinal data.
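As a minimal sketch of the operations mentioned above, the following Python snippet counts frequencies for a nominal variable and finds the median of an ordinal one. The marital-status values, the four-point survey scale and all variable names are made up for illustration.

```python
from collections import Counter
from statistics import median_low

# Hypothetical nominal data: labels with no inherent order.
marital_status = ["single", "married", "married", "divorced", "single", "married"]
status_counts = Counter(marital_status)                   # frequency count per label
most_common_status = status_counts.most_common(1)[0][0]  # the mode

# Hypothetical ordinal data: responses on a ranked scale, lowest to highest.
scale = ["poor", "fair", "good", "excellent"]
responses = ["good", "fair", "excellent", "good", "poor", "good", "fair"]

# Map each response to its rank so the median can be located,
# then map the median rank back to its label.
ranks = [scale.index(r) for r in responses]
median_response = scale[median_low(ranks)]

print(status_counts)
print(most_common_status, median_response)
```

The ranks are used only to find the middle response. Averaging them would assume equal spacing between ranks, which ordinal data do not guarantee.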

Numerical data are data that are strictly numerical in nature, and a range of sophisticated parametric statistical techniques can be applied to them. Strictly speaking, there are two kinds of numerical data: interval and ratio. In interval data the distance between consecutive values is the same. Examples are the temperature scales (Celsius, Fahrenheit). The interval scale does not have a fixed zero: for example, the zero of the Celsius scale is not the same temperature as the zero of the Fahrenheit scale, so fixing zero is a matter of convention. Parametric statistical methods can be applied to this scale. In the simplest terms, the mean, median and mode can all be calculated for this data type. The ratio scale has the same properties as the interval scale and, in addition, a fixed zero, which makes ratio-based statistical computations meaningful. Examples are length, mass and weight. All parametric statistical techniques can be applied.
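The practical consequence of a fixed zero can be checked with a small sketch. The values are arbitrary and the helper names are illustrative: the ratio of two temperatures changes when the unit changes, while the ratio of two lengths does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32

def metres_to_feet(m):
    """Convert a length in metres to feet (1 m is about 3.28084 ft)."""
    return m * 3.28084

# Interval scale: the zero point is a convention, so ratios depend on the unit.
c_low, c_high = 10.0, 20.0
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)
ratio_celsius = c_high / c_low      # 2.0
ratio_fahrenheit = f_high / f_low   # 68 / 50, not 2.0

# Ratio scale: zero means "none of the quantity", so ratios survive a unit change.
m_short, m_long = 10.0, 20.0
ratio_metres = m_long / m_short
ratio_feet = metres_to_feet(m_long) / metres_to_feet(m_short)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_metres, ratio_feet)
```

This is why "20 °C is twice as hot as 10 °C" is not a meaningful statement, whereas "20 m is twice as long as 10 m" is.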

Choosing the proper data type is essential because certain statistical techniques require data of the correct type. For example, treating a seemingly numerical field such as a phone number as numerical data and applying regression techniques during model building would lead to erroneous models. Categorical data need to be converted into valid numerical form, using techniques such as one-hot encoding, before regression techniques are applied.
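One-hot encoding can be sketched in plain Python as follows. The `one_hot_encode` helper and the marital-status column are hypothetical, written only for illustration and not taken from any particular library: each distinct category becomes its own 0/1 column.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical categorical column.
marital_status = ["single", "married", "divorced", "married"]
categories, encoded = one_hot_encode(marital_status)

print(categories)
for label, row in zip(marital_status, encoded):
    print(label, row)
```

Each row contains exactly one 1, in the column of its category. In a regression with an intercept, one of the columns is usually dropped so that the remaining columns are not perfectly collinear.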

Similarly, skewed numerical data such as income (where the density is highest in the lower income range and falls off towards the higher incomes) are often converted into categorical data by a technique known as binning, for better insight into the data. Bins are simply ranges of values. The numerical income data can, for instance, be converted into three bins or categories: lower income, middle income and higher income. The exact income figure is then replaced by a class or category label. A frequency count can be taken for each category, revealing how the observations are distributed. This is usually done in the exploratory data analysis stage, before model building.
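The binning step described above might look like the following sketch. The cutoff values (30,000 and 90,000), the sample incomes and the `income_bin` helper are all assumptions chosen for illustration. In practice the cutoffs would come from domain knowledge or from the data.

```python
from collections import Counter

def income_bin(income, low_cutoff=30_000, high_cutoff=90_000):
    """Replace an exact income figure with one of three category labels."""
    if income < low_cutoff:
        return "lower income"
    if income < high_cutoff:
        return "middle income"
    return "higher income"

# Hypothetical right-skewed sample: most values are low, a few are very high.
incomes = [18_000, 22_500, 27_000, 31_000, 45_000, 52_000, 95_000, 240_000]

labels = [income_bin(x) for x in incomes]   # exact figures become class labels
bin_counts = Counter(labels)                # frequency count per category

print(bin_counts)
```

The frequency table makes the shape of the distribution visible at a glance, at the cost of discarding the exact figures within each bin.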