Data Mining : Intuitive Partitioning of Data or 3-4-5 Rule

Introduction

Intuitive partitioning, or natural partitioning, is used in data discretization. Data discretization is the process of converting continuous values of an attribute into categorical partitions or intervals. This helps reduce data size: instead of storing every observation, we store the partition range into which each observation falls. One of the easiest ways to partition numeric values is intuitive (natural) partitioning.

Intuitive partitioning for data discretization

  1. If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals: 3 equal-width intervals for 3, 6 and 9, and 3 intervals grouped 2-3-2 for 7.
  2. If it covers 2, 4 or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals.
  3. If it covers 1, 5 or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals.

Let’s understand with an example:

Part I : The Data

Assume that we have records showing the profit made on each sale throughout a financial year. The profit data range is -3,51,976 to +4,70,00,896. A negative profit value is a loss ;)

Part II : Dealing with noisy data

To avoid noise, extremely high and extremely low values are not considered. So we first smooth the data by discarding the bottom 5% and top 5% of values.

Part III : Finding MSD and interval range

  • Suppose that after discarding these values the new extremes are LOW = -159876 and HIGH = 1838761. Here the most significant digit (MSD) is at the millions position.
  • The next step is to round LOW down and HIGH up at the MSD: LOW = -1000000 and HIGH = 2000000, since -1000000 is the nearest million below -159876 and 2000000 is the nearest million above 1838761.
  • Next we find the range of this interval: HIGH - LOW = 2000000 - (-1000000) = 3000000. Considering only the MSD, the range of this interval is 3.

Part IV : Applying rules

  • Now that we know the range is 3, we can apply rule #1.
  • Rule #1 says we divide this interval into three equal-width intervals:
    • Interval 1 : -1000000 to 0
    • Interval 2 : 0 to 1000000
    • Interval 3 : 1000000 to 2000000
  • You may be wondering how 0 can be part of multiple intervals. You’re right! We should represent the intervals as follows:
    • Interval 1 : (-1000000 … 0]
    • Interval 2 : (0 … 1000000]
    • Interval 3 : (1000000 … 2000000]
    • Here (a … b] denotes a range that excludes a but includes b; ( , ] is the notation for a half-open interval.
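Parts III and IV can be sketched in a few lines of Python. This is a minimal illustration of the rule, assuming LOW and HIGH have already been obtained by the 5% trimming described above; the helper names are mine, not from any library.

import math

def msd_unit(value):
    # Magnitude of the most significant digit, e.g. 1838761 -> 1000000
    if value == 0:
        return 1
    return 10 ** int(math.floor(math.log10(abs(value))))

def three_four_five(low, high):
    # Round LOW down and HIGH up at the most significant digit
    unit = max(msd_unit(low), msd_unit(high))
    low = math.floor(low / unit) * unit
    high = math.ceil(high / unit) * unit
    msd_range = round((high - low) / unit)        # e.g. 3 for -1000000 .. 2000000
    if msd_range == 7:                            # rule #1, grouped 2-3-2 (not equal width)
        w = (high - low) / 7
        return [(low, low + 2 * w), (low + 2 * w, low + 5 * w), (low + 5 * w, high)]
    if msd_range in (3, 6, 9):
        parts = 3                                 # rule #1
    elif msd_range in (2, 4, 8):
        parts = 4                                 # rule #2
    elif msd_range in (1, 5, 10):
        parts = 5                                 # rule #3
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / parts
    return [(low + i * width, low + (i + 1) * width) for i in range(parts)]

print(three_four_five(-159876, 1838761))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]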

Conclusion

Now that we have the partitions, we can replace each profit data point with the partition into which it falls. This saves storage space and reduces complexity.

References

  1. Han, J., Kamber, M. (2006), Data Mining: Concepts and Techniques, Second Edition, pp. 91-94.
  2. The Range (Statistics), MathIsFun.com

Why Logarithms are Beautiful?

The binary logarithm, log2 n, is the power to which the number 2 must be raised to obtain the value n. The binary logarithm (and logarithms in general) has numerous applications in computer science. Take the analysis of algorithms, for example. Every algorithm has a running time, also called its time complexity, which is measured against the size of the input. When an algorithm’s running time grows in exponential proportion to the input size, it is exponential time, written O(c^n) in Big-O notation. But when dealing with algorithms whose remaining work shrinks by a constant factor at each iteration, representation becomes difficult without logarithms. This is where logarithms help, because the logarithm is the inverse of exponentiation: logb n is the power to which b must be raised to obtain n.

An Example

Take binary search as an example: with a sorted list of 32 elements, the remaining search space at each iteration shrinks as follows:

Iteration   Remaining elements
1           32
2           16
3           8
4           4
5           2

At each iteration the search space shrinks by a factor of 2, so we say the time complexity of binary search is O(log2 n). If a function’s remaining work shrank by a factor of 5 at each iteration, it would be O(log5 n); since logarithms of different bases differ only by a constant factor, this is generally written O(lg n). Without logarithms it would be really awkward to represent such slowly growing functions.
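To make this concrete, here is a minimal binary search sketch in Python: each pass through the loop halves the portion of the list still in play, which is exactly the behaviour O(log2 n) captures.

def binary_search(sorted_list, target):
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        mid = (low + high) // 2          # middle of the remaining range
        if sorted_list[mid] == target:
            return mid                   # found: return its index
        elif sorted_list[mid] < target:
            low = mid + 1                # discard the lower half
        else:
            high = mid - 1               # discard the upper half
    return -1                            # not present

# 32 sorted elements -> at most about log2(32) = 5 halvings
print(binary_search(list(range(0, 64, 2)), 42))   # prints 21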

References

  1. Binary logarithms, Wikipedia
  2. Logarithmic growth, Wikipedia
  3. BigOCheatSheet.com
  4. Working with exponentials and logarithms, MathIsFun.com

k-means Clustering Algorithm with Python : Learn Data Science

The k-means clustering algorithm groups samples (items) into k clusters, where k is specified by the user. Each sample is assigned to the cluster whose centroid is nearest, and each centroid is the mean of the samples in its cluster, hence the name k-means. Euclidean distance is used as the distance measure. See the references for more information on the algorithm. This article describes the k-means clustering algorithm with Python.

About this implementation :

  • The first k samples are assigned as cluster centroids
  • Cluster IDs start with 1
  • Final assignments are printed to the file named assignment-results.txt

Implementation :

import math

nsample = int(input("Number of Samples: "))
nvar = int(input("Number of Variables: "))
k = int(input("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

# Input samples
sampleCount = 0
for sample in sampleList:
    print("\n\nCollecting Data for Sample #{}:".format(sampleCount + 1))
    print("----------------------------------------")
    i = 0
    while i < nvar:
        sample[i] = int(input("Data for var-{} : ".format(i + 1)))
        i += 1
    sampleCount += 1

# First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
    j = 0
    while j < nvar:
        centroidList[i][j] = sampleList[i][j]
        j += 1
    i += 1

# distanceList holds the Euclidean distance of a given sample
# to each of the k cluster centroids
distanceList = [0.0 for x in range(k)]

# Open file for writing assignments
fileObject = open("assignment-results.txt", "w")

for sample in sampleList:
    n = 0
    for centroid in centroidList:
        var = 0
        total = 0
        while var < nvar:
            temp = (sample[var] - centroid[var]) ** 2
            total += temp
            var += 1
        distanceList[n] = math.sqrt(total)
        n += 1

    # Write the sample and the ID of its nearest cluster to the file
    fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList)) + 1))

# Close the file
fileObject.close()
print("\n\n Final assignments successfully written to file! \n")

References :

  1. K Means Clustering Algorithm: Explained

Commonly Used HDFS Commands : Learn Data Science

Hadoop Distributed File System (HDFS) is the underlying storage for all Hadoop applications. HDFS can be manipulated through APIs such as the Java API or the REST API, but the HDFS shell is the most commonly used option. Below is a list of ten commonly used HDFS commands.

1. Invoking the file system: The HDFS shell supports various file systems, not just HDFS. This means you can invoke file systems including the local FS, HFTP FS, S3 FS, and others. Invoking the generic file system (any file system listed above):

hadoop fs

2. Listing Directory Contents

hdfs dfs -ls /user/hadoop/file1

3. Creating Directories

hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

4. Copying Local Files to HDFS

hdfs dfs -put localfile /user/hadoop/hadoopfile

5. Copying From HDFS to Local FS

hdfs dfs -get /user/hadoop/file1 localfile

6. Renaming or Moving Files

hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2

7. Copying Files Within HDFS

hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2

Tip: There is a distcp command for inter-HDFS (cluster-to-cluster) transfers.

8. Deleting Files From HDFS

hdfs dfs -rm /user/hadoop/file1

Tip: The -skipTrash option can also be used. Tip: -rm -r (or the older -rmr) can be used for recursive deletion.

9. Display file contents

hdfs dfs -cat /user/hadoop/file1

Tip: We can pipe the cat output to the native head command, as the HDFS shell has no head command.

10. Empty Trash

hdfs dfs -expunge
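These shell commands cover day-to-day use. For completeness, the REST API mentioned in the introduction (WebHDFS) can also be called from Python. Below is a rough sketch, assuming WebHDFS is enabled, using a hypothetical NameNode host and the Hadoop 2.x default web port 50070; adjust host, port and path for your cluster.

import requests

# List /user/hadoop via WebHDFS (the REST equivalent of "hdfs dfs -ls")
url = "http://namenode.example.com:50070/webhdfs/v1/user/hadoop"
response = requests.get(url, params={"op": "LISTSTATUS"})
response.raise_for_status()

for entry in response.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])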

Please feel free to ask questions in the comments!

Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

Hortonworks Sandbox for Hadoop Data Platform (HDP) is a quick and easy personal desktop environment for learning, developing, testing and trying out new features. It saves the user from the installation and configuration of Hadoop and other tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM; realistically you need 10 GB+ RAM on the host to give the VM 8 GB. If you do not meet this requirement, you can try it on a cloud service such as Azure, AWS or Google Cloud. This article uses examples based on HDP 2.3.2 running on Oracle VirtualBox hosted on Ubuntu 16.06.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.

Steps:

  1. Download example code and data from here

  2. Start sandbox image from VirtualBox

  3. From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)

  4. From the dashboard GUI, create a directory named input

  5. Upload sample.txt to input using Ambari > Files View > Upload

  6. Again from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)

  7. From the shell, upload mapper.py and reducer.py using the following secure copy (scp) commands (a minimal sketch of these two scripts appears after the steps):

    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

  8. Run the job using:

    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create output directory in advance. Hadoop will create it.

  9. Test output:

    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2
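For reference, a minimal word-count mapper.py and reducer.py for Hadoop Streaming look roughly like the sketch below. This is a generic illustration and may differ from the downloadable example code; both scripts need to be executable (chmod +x) so the streaming job can invoke them.

#!/usr/bin/env python
# mapper.py -- read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{}\t{}".format(word, 1))

#!/usr/bin/env python
# reducer.py -- sum counts per word; Hadoop Streaming sorts mapper output by key first
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{}\t{}".format(current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("{}\t{}".format(current_word, current_count))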

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

MapReduce Streaming Example : Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python for writing the mapper and reducer logic. The data is stored in the sample.txt file. The mapper, reducer and data can be downloaded in a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you’re in /usr/local/hadoop; if not, use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp/wordcount/:

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/hduser/wordcount

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output on the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please see it for help.

References

  1. Hadoop Streaming
  2. Hadoop Installation

Extracting Text from PDF Using Apache Tika - Learn NLP

Most NLP applications need to look beyond text and HTML documents, as information may be contained in PDF, ePub or other formats. The Apache Tika toolkit extracts metadata and text from such document formats, and the REST-based tika Python library makes it usable from Python. In this example we’ll extract text from a PDF using the Apache Tika toolkit.

Tika Installation
pip install tika

Extracting Text

from tika import parser

# Replace document.pdf with your filename
text = parser.from_file('document.pdf')

print(text['content'])

Tika makes it very convenient to extract text not just from PDFs but from more than ten formats. Here is a list of all supported document formats.
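The same parser call also returns the document’s metadata alongside the text, which is often useful. A short sketch, reusing the hypothetical document.pdf from above:

from tika import parser

parsed = parser.from_file('document.pdf')
# Metadata (content type, author, creation date, etc.) comes back alongside the text
print(parsed['metadata'])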

References:

  1. Apache Tika Home Page
  2. PyPi Tika 1.15 Package

Pangat : Finding CS Parallels in Non-Computing Environments

A letter from one Fred Murphy published in Communications of the ACM in June 2015 excited me as a teacher: it explained how computer science can be learned even in a cafeteria. I have cited this article numerous times to my students, explaining to them that resources should never keep you from learning. The word Pangat originates from the Sanskrit term Pankti, meaning line. In India, Pangat is a preferred way to serve food where people sit on the floor forming a line, with their plates on the floor. Food is served by volunteers who carry different food items in containers. There is a Wikipedia article on it (although it reflects only the Sikh perspective; Pangat is common across all communities and regions in India). Pangat volunteers have mastered some basic principles which can easily be correlated with commonly used principles in computer science. Here are a few of them.

  1. Priority Scheduling : When multiple volunteers serve food at different speeds, the slower one gives way to the faster. For example, a volunteer serving rice will naturally be slower than one serving chutney, so the faster one gets priority.
  2. De-fragmentation : Items have to be rearranged from time to time, as volunteers sometimes misplace an item on the plate or items get spread around the plate.
  3. Flags : A person signals that no more food should be served onto the plate when she feels full.
  4. Garbage Collection : Pangats generally require, by convention, that you eat what you take and clean your own plate.

Computer science theories, metaphors and parallels can be found in non-computing environments if we observe closely. CS Unplugged is a collection of free learning activities that teach computer science through engaging games and puzzles using cards, string, crayons and lots of running around.

References :

  1. Reaching a Broader Population of Students through ‘Unplugged’ Activities, Thomas J. Cortina, Communications of the ACM
  2. CSUnplugged.org

Improving fastText Classifier

This post is a continuation of the previous post, Text Classification With Python Using fastText. It describes how to improve the fastText classifier using various techniques.

More on Precision and Recall

Precision: the number of correct labels among the total labels predicted by the classifier. Recall: the number of real labels successfully predicted by the classifier out of all real labels. Example:

Why not put knives in the dishwasher?

This question has three labels on StackExchange: equipment, cleaning and knives. Let us obtain the top five labels predicted by our model (k = number of top labels to predict):

text = ['Why not put knives in the dishwasher?']
labels = classifier.predict(text, k=5)
print labels

This gives us food-safety, baking, equipment, substitutions and bread. One out of five labels predicted by the model is correct, giving a precision of 0.20. Out of the three real labels, only one is predicted by the model, giving a recall of 0.33.

Improving the Model

We ran the model with default parameters and the training data as it is. Now let’s tweak things a little. We will employ the following techniques to improve the model:

  • Preprocessing the data
  • Changing the number of epochs (using the option epoch, standard range 5 - 50)
  • Changing the learning rate (using the option lr, standard range 0.1 - 1.0)
  • Using word n-grams (using the option wordNgrams, standard range 1 - 5)

We will apply these techniques and see the improvement in precision and recall at each stage.

Preprocessing the Data

Preprocessing includes removing special characters and converting the entire text to lower case.

cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

classifier = fasttext.supervised('cooking.train', 'model_cooking')
result = classifier.test('cooking.valid')
print result.precision
#0.161
print result.recall
#0.0696266397578

So after preprocessing, precision and recall have improved.

More Epochs and an Increased Learning Rate

The number of epochs can be set using the epoch parameter. The default value is 5; we are going to set it to 25. More epochs mean longer training time, but the improvement is worth it.

classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25)
result = classifier.test('cooking.valid')
print result.precision
#0.493
print result.recall
#0.213204555283

Now let’s change the learning rate with the lr parameter:

classifier = fasttext.supervised('cooking.train', 'model_cooking', lr=1.0)
result = classifier.test('cooking.valid')
print result.precision
#0.546
print result.recall
#0.236125126135

Results with both epoch and lr together:

classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0)
result = classifier.test('cooking.valid')
print result.precision
#0.565
print result.recall
#0.244630243621

Using Word n-grams

Word n-grams capture the ordering of tokens in the text. See examples of word n-grams on Wikipedia.
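To make the idea concrete, here is a tiny, library-free illustration of word bigrams (n = 2) built from the example question used earlier:

tokens = "why not put knives in the dishwasher".split()
bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
print(bigrams)
# ['why not', 'not put', 'put knives', 'knives in', 'in the', 'the dishwasher']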

classifier = fasttext.supervised('cooking.train', 'model_cooking', epoch=25, lr=1.0, word_ngrams=2)
result = classifier.test('cooking.valid')
print result.precision
#???
print result.recall
#???

I am unable to show results for word n-grams because Python keeps crashing on my system. I will update the post as soon as possible.

References:

  1. Introduction to fastText

  2. Text Classification With Python Using fastText

  3. PyPI, fasttext 0.7.0, Python.org

  4. fastText, Text classification with fastText

  5. Cooking StackExchange, cooking.stackexchange.com

Tutorial: Text Classification With Python Using fastText

Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first. Note: shell commands should not be confused with Python code.

Get the Training Data Set: We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data.

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

head cooking.stackexchange.txt

As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels starting with __label__, and this file is already in that shape. The next task is to train the classifier.
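For illustration only (this is not an actual line from the download), a line in this format has the following shape, with one or more labels followed by the question text:

__label__baking __label__equipment Which baking dish is best to bake a banana bread?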

Training the Classifier: Let’s check the size of training data set:

wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train

tail -n 3000 cooking.stackexchange.txt > cooking.valid

Now let’s train using cooking.train (after importing the fasttext package):

import fasttext

classifier = fasttext.supervised('cooking.train', 'model_cooking')

Our First Prediction:

label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print label

label = classifier.predict('Why not put knives in the dishwasher? ')
print label

It may come up with tags like baking and food-safety respectively! The second tag is not relevant, which shows that our classifier is of poor quality. Let’s test its quality next.

Testing Precision and Recall: Precision and recall are used to measure the quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against the cooking.valid data:

result = classifier.test('cooking.valid')
print result.precision
print result.recall
print result.nexamples

There are a number of ways we can improve our classifier; see the next post: Improving fastText Classifier

References:

  1. PyPI, fasttext 0.7.0, Python.org
  2. fastText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!