Friday, November 19, 2010

SQL 2005 - What is "Data Normalization"?

Normalization


"Normalization" is a well-known standard that helps efficient storage and retrieval of data from the relational database systems. There is three important normalization are available. They are:

  1. First Normal Form
  2. Second Normal Form
  3. Third Normal Form

This article talks about these three normal forms with examples. Let us start with "Database Table Normalization".

First Normal Form


The "First normal form" or First "database normalization rules"  says, “The information stored in the form of database table should be well split into columns. When you split make sure same information is not split into multiple columns”.

Consider the Table Below which stores the C# Corner members for Example:

Normal Table
Fig 1.  Normal Table


Let us say there are 4000 such members in the table. If you want to retrieve the data in the descending order based on the last name then you need to perform string manipulation on the stored data. Also, it is not always guaranteed that the information you split is fully qualified as the last name as you may perform string manipulation based on the space. The first statement in the Normalization (First Normal form) specifies this. "Data should be well split into column". Why do I put “well split” into underline? Because we all know that data should be split into the column. Definitely, no one will design the above table like below:


Wrong way of storing data
Fig 2. Storing everything in single column (Worst Way)

Below is the Well split table:


Well Split Columns
Fig 3. Well Split Columns

Well, but the First normal form says “When splitting, do not split same information into multiple columns”. It may confuse that I split the same information, that is the column 'Name' into multiple columns. OK. Do not look at the initial column before the split that is; the member name. Look at the final result and apply the rule:
“Do not split same information into multiple columns”
First Name and Last Name are different, so it is not the same information. Each column is well categorised. So why the First Norm says “Do not split same information into multiple columns”?

To explain that look at the below table (Order):


Sameinfo in multiple columns
Fig 4. Sample information spawning into multiple columns

The table stores the order in the Orderid column and the customer who placed the order is placed in custid column. But, if you have a look at the columns for products (Product1, Product2, Product3) in the table, it is stored as a separate column. The same information is repeated in multiple columns like product1, product2 etc., to show the products in each order. The number of products that belongs to the Order is not constant as it varies. So Data Manipulation should accommodate the 'change in data' like inserting a row or Modifying a row in such a way that schema should not change.

The correct design is shown below:


Correct design
Fig 5. Correct design

Second Normal Form


The Second Normal form says, “Entire columns in the table should functionally depend on a single key that uniquely identifies the row also should adhere to the first normal form”. Note that, here the key may be a combination of more than one column.


Second Normal form
Fig 6. Second Norm

In the above table say Orders, Customerid and OrderPrice are depending on the single key column Orderid. The CustomerName field is not related to OrderId and should go on a Customer table. If we remove the Customer Name fields from this field the table is in Second normal form.

Third Normal Form


The third normal form says, “Any columns in the table should not depend on column which is not a key column that uniquely identifies the row. Also, first it should be in second normal form”.

It means if a table has some columns that depend on some other column that is not a primary key column then the table is not in third normal form. Let us consider the below example:


Third normal form
Fig 7. Third Normal Form

In the above example, the TotalPrice column is dependent on UnitPrice and NoOfItems. Hence, this column should be removed from the database. The program that processes this table should calculate the TotalPrice or a view could be created to have the calculated column such as TotalPrice. Whatever can be a solution, for the above table should be in the third normal form the TotalPrice column should be removed from the table.

No comments:

Post a Comment

Leave your comment(s) here.

Like this site? Tell it to your Firend :)