on 02/21/2014 - 01:17 pm

What is data?

Let us begin our discussion by defining what is data. Data are values of qualitative or quantitative variables, belonging to a set of items. Simply put, it's an attribute or property or characteristics of an object. Point to note here is, data can be both qualitative (brown eye color) and quantitative (20cm long).

A common way of representing or displaying a set of correlated data is through table type structures comprised of rows and columns. In such structures, the columns of the table generally signify attributes or characteristics or features and the rows (tuple) signify a set of co-related features belonging to one single item.

While speaking about data, it is important to understand the difference of data with other similar terms like information or knowledge. While a set of data can be used together to directly derive an information, knowledge or wisdom is often derived in an indirect manner.

Another way to look at it is by considering the level of abstraction in them. Data is objective and thus have the lowest level of abstraction whereas information and knowledge are increasingly subjective and involves higher levels of abstraction.

In terms of scientific definition, one may conclude that data have higher level of entropy than information or knowledge.

Types of Data

One of the fundamental aspects you must learn before attempting to do any kind of data modeling is the fact that how we model the data depends completely on the nature or type of data. Data can be both qualitative and quantitative. It's important to understand the distinctions between them.

Qualitative Data

Qualitative data are also called categorical data as they represent distinct categories rather than numbers. In case of dimensional modeling, they are often termed as "dimension". Mathematical operations such as addition or subtraction do not make any sense on that data.

Example of qualitative data are, color, zip code, phone number etc.

Qualitative data can be further classified into below classes:

NOMINAL :

Nominal data differentiates between items or subjects based only on their names or (meta-)categories and other qualitative classifications they belong to. Examples include gender, nationality, ethnicity, language, genre, style, biological species, and form.

ORDINAL :

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.

The categories for an ordinal set of data have a natural order, for example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, for example, so such data are ordinal.

Quantitative data

Quantitative data are also called numeric data as they represent numbers. In case of dimensional data modeling approach, these data are termed as "Measure".

Example of quantitative data is, height of a person, amount of goods sold, revenue etc.

Quantitative attributes can be further classified as below.

INTERVAL :

Interval classification is used where there is no true zero point in the data and division operation does not make sense. Bank balance, temperature in Celsius scale, GRE score etc. are the examples of interval class data. Dividing one GRE score with another GRE score will not make any sense. In dimensional modeling this is synonymous to semi-additive facts.

RATIO :

Ratio class is applied on the data that has a true "zero" and where division does make sense. Consider revenue, length of time etc. These measures are generally additive.

Below table illustrates different actions that are possible to implement on various data types

ACTIONS --> Distinct Order Addition Multiplication

Nominal Y

Ordinal Y Y

Interval Y Y Y

Ratio Y Y Y Y

It is essential to understand the above differences in the nature of data and suggest appropriate model to store them. Many of our analytical (e.g. MS Excel) and data mining tools (e.g. R) do not automatically understand the nature of the data, so we need to explicitly model the data for those tools. For example, "R" provides 2 test function "is.numeric()" and "is.factor()" to determine if the data is numeric or categorical (dimensional) respectively, and if the default attribution is wrong we can use functions like "as.factor()" or "as.numeric()" to re-attribute the nature of the data.

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.

Goals and Benefits of Dimensional Modeling

Faster Data Retrieval

Better Understandability

Extensibility

Now that we know the reasons behind creating a dimensional modeling, let's find out what exactly is done in this type of models.

In dimensional model, everything is divided in 2 distinct categories - dimension or measures. Anything we try to model, must fit in one of these two categories. So let's say, I want to store information of how many burgers and fries are getting sold per day from a single McDonalds outlet, we will have to first classify this data in dimension and measures. And then we will have 2 different categories of tables (i.e. dimension table and measure table a.k.a fact table to store them).