Data Mining : Intuitive Partitioning of Data or 3-4-5 Rule

Introduction

Intuitive partitioning or natural partitioning is used in data discretization. Data discretization is the process of converting continuous values of an attribute into categorical data, partitions or intervals. This helps reduce data size by reducing the number of possible values: instead of storing every observation, we store the partition (range) in which each observation falls. One of the easiest ways to partition numeric values is intuitive (natural) partitioning.

Intuitive partitioning for data discretization

  1. If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, then partition it into 3 intervals. These can be 3 equal-width intervals for 3, 6 and 9; for 7, use 3 intervals in a 2-3-2 grouping.
  2. If it covers 2, 4 or 8 distinct values at the most significant digit, then partition it into 4 equal-width sub-intervals.
  3. If it covers 1, 5 or 10 distinct values at the most significant digit, then partition the range into 5 equal-width intervals. (A small code sketch of this rule follows below.)

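Before walking through an example, here is a minimal sketch (an assumed helper of my own, not from the original text) of how the count of distinct values at the most significant digit maps to the number of partitions:

def num_partitions(distinct_msd_values):
    """Return the number of intervals suggested by the 3-4-5 rule."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3   # rule 1 (7 is split as 2-3-2)
    if distinct_msd_values in (2, 4, 8):
        return 4   # rule 2
    if distinct_msd_values in (1, 5, 10):
        return 5   # rule 3
    raise ValueError("The 3-4-5 rule does not cover this count")
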
Let’s understand with an example:

Part I : The Data

Assume that we have records showing the profit made on each sale throughout a financial year. The profit data ranges from -3,51,976 to +4,70,00,896. A negative profit value is a loss ;)

Part II : Dealing with noisy data

To avoid noise, extremely high or extremely low values are not considered. So first we smooth the data by discarding the bottom 5% and top 5% of values.

Part III : Finding MSD and interval range

  • Suppose that after discarding the above data, the new values are LOW = -159876 and HIGH = 1838761. Here, the most significant digit (MSD) is at the million position.
  • The next step is to round LOW down and HIGH up at the MSD. So LOW = -1000000 and HIGH = 2000000: -1000000 is the nearest million below -159876 and 2000000 is the nearest million above 1838761.
  • Next we find the range of this interval: range = HIGH - LOW = 2000000 - (-1000000) = 3000000. Considering only the MSD, the range of this interval is 3. (A short sketch of this computation follows.)
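As a rough sketch (assuming the MSD position is derived from the endpoint with the larger magnitude), the rounding and range computation could look like this:

import math

low, high = -159876, 1838761

# MSD position: 10^(digits in the larger magnitude - 1) = 1,000,000
msd = 10 ** (len(str(max(abs(low), abs(high)))) - 1)

low_rounded = int(math.floor(low / float(msd)) * msd)    # -1000000
high_rounded = int(math.ceil(high / float(msd)) * msd)   #  2000000

range_at_msd = (high_rounded - low_rounded) // msd       #  3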

Part IV : Applying rules

  • Now that we know range = 3, we can apply rule #1.
  • Rule #1 states that we can divide this interval into three equal size intervals:
    • Interval 1 : -1000000 to 0
    • Interval 2 : 0 to 1000000
    • Interval 3 : 1000000 to 2000000
  • You may be wondering how 0 can be part of multiple intervals. You’re right, it can’t; we should represent the intervals as follows:
    • Interval 1 : (-1000000 … 0]
    • Interval 2 : (0 … 1000000]
    • Interval 3 : (1000000 … 2000000]
    • Here (a … b] denotes the range that excludes a but includes b; ( , ] is the notation for a half-open interval.

Conclusion

Now that we have the partitions, we can replace each profit data point with the partition in which it falls. This saves storage space and reduces complexity.
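
A minimal sketch of that replacement step (assuming the three half-open intervals from Part IV and representing each bin by its (a, b] bounds) could look like this:

# Half-open intervals (a, b] from Part IV
intervals = [(-1000000, 0), (0, 1000000), (1000000, 2000000)]

def discretize(value):
    # Return the (a, b] interval that contains the value
    for a, b in intervals:
        if a < value <= b:
            return (a, b)
    return None   # value was trimmed as an outlier in Part II

profits = [-159876, 25000, 1838761]   # a few illustrative profit values
binned = [discretize(p) for p in profits]
# binned == [(-1000000, 0), (0, 1000000), (1000000, 2000000)]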

References

  1. Han, J. and Kamber, M. (2006), Data Mining: Concepts and Techniques, Second Edition, pp. 91-94.
  2. The Range (Statistics), MathIsFun.com

k-means Clustering Algorithm with Python : Learn Data Science

The _k_-means clustering algorithm is used to group samples (items) into k clusters, where k is specified by the user. The method assigns each sample to the cluster whose centroid (the mean of that cluster) is nearest, hence the name k-means clustering. Euclidean distance is used as the distance measure. See the references for more information on the algorithm. This article describes the k-means clustering algorithm with Python.

About this implementation :

  • The first k samples are assigned as cluster centroids
  • Cluster IDs start with 1
  • Only a single assignment pass is performed; centroids are not recomputed iteratively
  • Final assignments are printed in the file named assignment-results.txt

Implementation :

import math

nsample = int(input("Number of Samples: "))
nvar = int(input("Number of Variables: "))
k = int(input("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

# Input samples
sampleCount = 0
for sample in sampleList:
    print("\n\nCollecting Data for Sample #{}:".format(sampleCount + 1))
    print("----------------------------------------")
    sampleCount += 1
    i = 0
    while i < nvar:
        sample[i] = int(input("Data for var-{} : ".format(i + 1)))
        i += 1

# First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
    j = 0
    while j < nvar:
        centroidList[i][j] = sampleList[i][j]
        j += 1
    i += 1

# distanceList maintains the Euclidean distance of a given sample
# to each of the k cluster centroids
distanceList = [0.0 for x in range(k)]

# Open file for writing assignments
fileObject = open("assignment-results.txt", "w")

for sample in sampleList:
    n = 0
    for centroid in centroidList:
        var = 0
        total = 0
        while var < nvar:
            temp = (sample[var] - centroid[var]) ** 2
            total += temp
            var += 1
        distanceList[n] = math.sqrt(total)
        n += 1

    # Write the sample and its nearest cluster ID (1-based) to the file
    fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList)) + 1))

# Close the file
fileObject.close()
print("\n\nFinal assignments successfully written to file!\n")

References :

  1. K Means Clustering Algorithm: Explained

Commonly Used HDFS Commands : Learn Data Science

Hadoop Distributed File System or HDFS is the underlying storage for all Hadoop applications. HDFS can be manipulated using APIs such as Java API or REST API but using HDFS shell is the most commonly used option. Below is a list of ten commonly used HDFS commands.

1. Invoking the file system: The HDFS shell supports various file systems, not just HDFS. This means you can invoke file systems including Local FS, HFTP FS, S3 FS, and others. To invoke the generic file system (any of the file systems listed above):

hadoop fs

2. Listing Directory Contents

hdfs dfs -ls /user/hadoop/file1

3. Creating Directories

hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

4. Copying Local Files to HDFS

hdfs dfs -put localfile /user/hadoop/hadoopfile

5. Copying From HDFS to Local FS

hdfs dfs -get /user/hadoop/file1 localfile

6. Renaming or Moving Files

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

7. Copying Files Within HDFS

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2

Tip: There is a distcp command for inter-HDFS (cluster-to-cluster) transfers.
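
For example (host names and ports here are placeholders):

hadoop distcp hdfs://namenode1:8020/user/hadoop/dir1 hdfs://namenode2:8020/user/hadoop/dir1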

8. Deleting Files From HDFS

hdfs dfs -rm /user/hadoop/file1

Tip: The -skipTrash option can also be used to bypass the trash. Tip: -rm -r (or the older -rmr) can be used for recursive deletion.

9. Display file contents

hdfs dfs -cat /user/hadoop/file1

Tip: We can pipe cat output to the native head command, as the HDFS shell has no head command.
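
For example, to view only the first ten lines of a file:

hdfs dfs -cat /user/hadoop/file1 | head -n 10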

10. Empty Trash

hdfs dfs -expunge

Please feel free to ask questions in the comments!

Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

Hortonworks sandbox for Hadoop Data Platform (HDP) is a quick and easy personal desktop environment to get started on learning, developing, testing and trying out new features. It saves the user from the installation and configuration of Hadoop and other tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM for the sandbox VM; realistically, only a host with 10 GB+ RAM can spare 8 GB for a VM. If you do not meet this requirement, you can try cloud services such as Azure, AWS or Google Cloud. This article uses examples based on HDP 2.3.2 running on Oracle VirtualBox hosted on Ubuntu 16.04.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.

Steps:

  1. Download example code and data from here

  2. Start sandbox image from VirtualBox

  3. From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)

  4. From the dashboard GUI, create a directory named input

  5. Upload sample.txt to input using Ambari > Files View > Upload

  6. Again, from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)

  7. From the host machine’s shell, upload mapper.py and reducer.py using the following secure copy (scp) commands:

    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

  8. Run the job using:

    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create output directory in advance. Hadoop will create it.

  9. Test output:

    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

MapReduce Streaming Example : Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python for writing the mapper and reducer logic. Data is stored in the sample.txt file. The mapper, reducer and data can be downloaded in a bundle from the link provided.

Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you’re in /usr/local/hadoop, if not use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/hduser/wordcount

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output on the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please see the comments for help.

References:

  1. Hadoop Streaming
  2. Hadoop Installation

Map Reduce Word Count With Python : Learn Data Science

We spent multiple lectures talking about Hadoop architecture at the university. Yes, I even demonstrated the cool playing cards example! In fact, we have an 18-page PDF from our data science lab on the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight to hands-on coding to help anyone get up and running with MapReduce. No Hadoop installation is required.

Problem : Counting word frequencies (word count) in a file. Data : Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all files there, including this one.

my home is kolkata
but my real home is kutch

Mapper : Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, chunks it into words and prints the output. Any UNIX/Linux user would know about the beauty of pipes; we’ll later use pipes to feed data from sample.txt to stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into words
    words = line.split()

    # Output (word, 1) tuples on stdout, separated by a tab
    for word in words:
        print '%s\t%s' % (word, "1")

Reducer : Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py
    word, count = line.split('\t', 1)

    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        continue

    # Add the count to the running total for this word
    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:
        wordcount[word] = count

# Write the tuples to stdout
# Currently tuples are unsorted
for word in wordcount.keys():
    print '%s\t%s' % (word, wordcount[word])

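The reducer above prints words in arbitrary dictionary order. A small optional tweak (not part of the original code) sorts the keys in the final loop so the output is deterministic:

# Optional: emit words in alphabetical order
for word in sorted(wordcount.keys()):
    print '%s\t%s' % (word, wordcount[word])
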
Execution : cd to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

And now we will feed the output of the cat command to the mapper and the mapper’s output to the reducer using the pipe (|). That is, the output of cat goes to the mapper and the mapper’s output goes to the reducer. (Recall that the cat command is used to display the contents of any file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts real x 1, kutch x 1, is x 2, but x 1, kolkata x 1, home x 2 and my x 2! You can put your questions in the comments section below!