Posted 2017-11-14 | Updated 2023-08-22 | Big Data / Data Mining

Data Mining : Intuitive Partitioning of Data or 3-4-5 Rule

Introduction

Intuitive partitioning, also called natural partitioning, is a technique used in data discretization. Data discretization is the process of converting the continuous values of an attribute into categories, partitions or intervals. This reduces data size by reducing the number of possible values: instead of storing every observation, we store the partition range in which each observation falls. One of the easiest ways to partition numeric values is intuitive (natural) partitioning.

Intuitive partitioning for data discretization

Rule #1: If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition it into 3 intervals: 3 equal-width intervals for 3, 6 and 9, and 3 intervals grouped 2-3-2 for 7.
Rule #2: If it covers 2, 4 or 8 distinct values at the most significant digit, partition it into 4 equal-width intervals.
Rule #3: If it covers 1, 5 or 10 distinct values at the most significant digit, partition it into 5 equal-width intervals.
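The three rules can be sketched as a small lookup. This is only an illustration: the function name `interval_widths` and the idea of returning relative widths are assumptions made here, not part of the rule itself.

```python
def interval_widths(distinct_msd_values: int) -> list[int]:
    """Return the relative widths of the sub-intervals for a range that
    covers the given number of distinct values at the most significant digit."""
    if distinct_msd_values == 7:
        # Special case of the first rule: three intervals grouped 2-3-2.
        return [2, 3, 2]
    if distinct_msd_values in (3, 6, 9):
        return [1, 1, 1]           # 3 equal-width intervals
    if distinct_msd_values in (2, 4, 8):
        return [1, 1, 1, 1]        # 4 equal-width intervals
    if distinct_msd_values in (1, 5, 10):
        return [1, 1, 1, 1, 1]     # 5 equal-width intervals
    raise ValueError("expected a value between 1 and 10")
```

For example, `interval_widths(7)` gives `[2, 3, 2]`, so a range of 7 units would be split into pieces of 2, 3 and 2 units.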

Let’s understand this with an example:

Part I : The Data

Assume that we have records showing the profit made on each sale throughout a financial year. The profit data ranges from -3,51,976 to +4,70,00,896 (that is, -351976 to 47000896 without digit grouping). A negative profit value is a loss ;)

Part II : Dealing with noisy data

To avoid noise, extremely high and extremely low values are not considered. So we first smooth out our data by discarding the bottom 5% and the top 5% of values.
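A minimal sketch of this trimming step, assuming the values fit in a plain Python list. The helper name `trim_extremes` and the sample data are made up for illustration.

```python
def trim_extremes(values, fraction=0.05):
    """Drop the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)   # number of values to drop at each end
    return ordered[k:len(ordered) - k]


# Hypothetical sample: the integers 1 to 100.
trimmed = trim_extremes(list(range(1, 101)))
low, high = trimmed[0], trimmed[-1]    # the new LOW and HIGH after trimming
```

With 100 values, 5 are dropped from each end, so 90 values remain and LOW and HIGH are taken from what is left.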

Part III : Finding MSD and interval range

Suppose that after discarding those values, the new bounds are LOW = -159876 and HIGH = 1838761. Here, the most significant digit (MSD) is at the million position, because HIGH is in the millions.
The next step is to round LOW down and HIGH up at the MSD. So LOW = -1000000 and HIGH = 2000000: -1000000 is the nearest million below -159876, and 2000000 is the nearest million above 1838761.
Next we find the range of this interval. Range = HIGH - LOW, that is 2000000 - (-1000000) = 3000000. We consider only the MSD here, so we divide by 1000000 and the range of this interval is 3.
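The rounding in this part can be sketched as follows. The function names are hypothetical, and the sketch assumes integer inputs with the MSD taken from whichever bound has the larger magnitude.

```python
import math


def msd_unit(low: int, high: int) -> int:
    """Place value of the most significant digit of the larger magnitude."""
    largest = max(abs(low), abs(high))
    if largest == 0:
        raise ValueError("both bounds are zero")
    # Count digits instead of using log10 to stay in integer arithmetic.
    return 10 ** (len(str(largest)) - 1)


def round_at_msd(low: int, high: int):
    """Round LOW down and HIGH up to a multiple of the MSD place value."""
    unit = msd_unit(low, high)
    low_rounded = math.floor(low / unit) * unit
    high_rounded = math.ceil(high / unit) * unit
    return low_rounded, high_rounded, unit


low, high, unit = round_at_msd(-159876, 1838761)
msd_range = (high - low) // unit   # range counted in units of the MSD
```

For the values in this example, `low` is -1000000, `high` is 2000000, `unit` is 1000000 and `msd_range` is 3.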

Part IV : Applying rules

Now that we know the range is 3, we can apply rule #1.
Rule #1 states that we can divide this interval into three equal-width intervals:
- Interval 1 : -1000000 to 0
- Interval 2 : 0 to 1000000
- Interval 3 : 1000000 to 2000000
You may be wondering how 0 can be part of more than one interval. You’re right! We should represent the intervals as follows:
- Interval 1 : (-1000000 … 0]
- Interval 2 : (0 … 1000000]
- Interval 3 : (1000000 … 2000000]
Here (a … b] denotes a range that excludes a but includes b; ( , ] is the notation for a half-open interval.
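As a rough sketch, the interval boundaries can be generated from the rounded LOW and HIGH and the number of parts. The helper names are made up for illustration, and the sketch assumes integer bounds whose difference divides evenly by the number of parts.

```python
def equal_width_edges(low: int, high: int, parts: int) -> list[int]:
    """Boundaries that split [low, high] into `parts` equal-width pieces."""
    width = (high - low) // parts
    return [low + i * width for i in range(parts + 1)]


def half_open_intervals(edges):
    """Pair consecutive edges; each pair (a, b) stands for the interval (a … b]."""
    return list(zip(edges[:-1], edges[1:]))


edges = equal_width_edges(-1000000, 2000000, 3)
intervals = half_open_intervals(edges)
```

Here `edges` is `[-1000000, 0, 1000000, 2000000]`, and each consecutive pair describes one of the three intervals listed above.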

Conclusion

Now that we have partitions, we can replace each profit data point with the partition in which it falls. This saves storage space and reduces complexity.
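One possible way to do this replacement is a binary search over the interval boundaries. This is a sketch under assumptions: the sample profit values are made up, the name `partition_label` is hypothetical, and values outside the overall range are mapped to `None`.

```python
import bisect

EDGES = [-1000000, 0, 1000000, 2000000]
LABELS = ["(-1000000 … 0]", "(0 … 1000000]", "(1000000 … 2000000]"]


def partition_label(value):
    """Return the label of the half-open interval (a … b] containing value."""
    # bisect_left places a value equal to an edge in the interval that ends
    # at that edge, which matches the "excludes a, includes b" convention.
    idx = bisect.bisect_left(EDGES, value) - 1
    if 0 <= idx < len(LABELS):
        return LABELS[idx]
    return None  # value lies outside (-1000000 … 2000000]


profits = [-159876, 0, 250000, 1000000, 1838761]   # made-up sample values
labels = [partition_label(p) for p in profits]
```

Each profit is now represented by one of three labels, and the boundary values 0 and 1000000 fall into the interval that ends at them.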
