Configure Anaconda on Emacs

Perhaps my quest for an ultimate IDE ends with Emacs. My goal was to use Emacs as a full-fledged Python IDE. This post describes how to set up Anaconda on Emacs. My setup:

OS: Trisquel 8.0
Emacs: GNU Emacs 25.3.2

Quick Key Guide (see the full guide):

C-x = Ctrl + x
M-x = Alt + x
RET = ENTER

1. Downloading and installing Anaconda

1.1 Download: Download Anaconda from here. You should download the Python 3.x version, as Python 2 will run out of support in 2020. You don’t need Python 3.x already installed on your machine; the install script provides it.

1.2 Install:

cd ~/Downloads
bash Anaconda3-2018.12-Linux-x86.sh

2. Adding Anaconda to Emacs

2.1 Adding MELPA to Emacs

The Emacs package anaconda-mode provides the IDE features we need. It is available on the MELPA repository, which Emacs 25 requires you to add explicitly. Important: follow this post on how to add MELPA to Emacs.

2.2 Installing the anaconda-mode package in Emacs

M-x package-install RET
anaconda-mode RET

2.3 Configure anaconda-mode in Emacs

echo "(add-hook 'python-mode-hook 'anaconda-mode)" >> ~/.emacs.d/init.el
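
Optionally, anaconda-mode also ships anaconda-eldoc-mode, which shows function signatures in the echo area. If you want it, append one more hook in the same way (this is an optional extra, not required for the basic setup):

echo "(add-hook 'python-mode-hook 'anaconda-eldoc-mode)" >> ~/.emacs.d/init.el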

3. Running your first script on Anaconda from Emacs

3.1 Create new .py file

C-x C-f
HelloWorld.py RET

3.2 Add the code

print ("Hello World from Emacs")

3.3 Running it

C-c C-p
C-c C-c

Output

Python 3.7.1 (default, Dec 14 2018, 19:46:24)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

python.el: native completion setup loaded
Hello World from Emacs

I was encouraged to take up Emacs by Codingquark; errors and omissions should be reported in the comments. Cheers!

Creating Your First Blockchain with Python

In this tutorial we’ll learn how to create a very basic Blockchain with Python, in just 30 lines of code! The aim is to introduce you to Blockchain programming without getting into inessential details. You should already know the fundamentals of Blockchain; if not, you may want to read this article first. You should also know basic Python programming and object-oriented concepts.

Prerequisites

  1. Python 3 Installed on Windows, Linux or Mac.
  2. Knowledge of running Python programs on terminal or IDLE.

The Fundamentals

A Blockchain is merely a series of blocks. These blocks are logically connected to each other, as each one stores the previous block’s hash. We will create a class that represents a block and store its predecessor’s hash as one of the variables within it. Other variables in the Block class include:

  1. Data: The actual data to be stored in the block
  2. Hash: The proof-of-work hash of the block (computed over the data plus the nonce)
  3. Timestamp: Denotes when the block was added to the blockchain
  4. Nonce: A random number that is used to find the target hash

Python Libraries

First we import these three Python libraries. You don’t need to download them, as they are part of the standard distribution:

  1. hashlib: To generate SHA-256 of block’s data.
  2. datetime: To get current timestamp.
  3. random: To generate nonces.

#for computing sha256
import hashlib

#for generating timestamps
import datetime

#for generating random numbers
import random

The Block Class

The Block class uses a constructor to initialize its variables when it is instantiated. Let’s define proof-of-work as finding a hash value that starts with two zeroes ("00"). We define a method getPoWHash() which returns a hash value satisfying this proof-of-work condition: it keeps appending different random numbers (nonces) to the data until it finds such a hash. When we find the target hash, we also store the nonce used to find it in the block so that others can verify it.

class Block:
    def getPoWHash (self, data):
        nonce = str(random.randint(0, 300000))
        noncedData = data + nonce
        hash = hashlib.sha256 (noncedData.encode('utf-8')).hexdigest()
        #hashlib requires data to be encoded before hashing
        if hash.startswith ("00"):
            self.nonce = nonce
            return hash
        else:
            return self.getPoWHash (data)

    def __init__ (self, data, previousHash):
        self.data = data
        self.nonce = None
        self.previousHash = previousHash
        self.timeStamp = datetime.datetime.now()
        self.hash = self.getPoWHash (data)

Adding Genesis Block and Other Blocks

Next, we add some blocks to our Blockchain. The first block is special: there is no previous block whose hash we could store, so we simply use 0 here for the sake of simplicity. We instantiate the Block class each time we want to add a block.

genesisBlock = Block("I am genesis block!", 0)

secondBlock = Block("I am second block", genesisBlock.hash)

thirdBlock = Block("I am third block", secondBlock.hash)

Viewing the Blockchain

To view our Blockchain we can simply print the hashes of each of the blocks.

print ("Genesis Block Hash: ", genesisBlock.hash)

print ("Second Block Hash: ", secondBlock.hash)

print ("Third Block Hash: ", thirdBlock.hash)

Output: As you can see, all hashes start with two zeroes, as that is the condition we’ve set as proof-of-work. In Bitcoin, the requirement is roughly 17 leading zeroes, and it adjusts with network difficulty.

Genesis Block Hash: 00ddef133aa388b89d431ffd80874d94c808b363ce1467eb6e38a7c38c7712d1

Second Block Hash: 0029e17e48976fdd651e75ce283770634c33ddaca2ab9d05b2df30a0ab779bfa

Third Block Hash: 0094e2b65446530ff4a046be4a0843799e48e4d8d993fd01669a0328bcf43e79
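
Because each block stores the nonce that produced its hash, anyone can verify the proof-of-work by recomputing it. A quick check, reusing the blocks created above:

#Recompute SHA-256 over data + nonce and check the proof-of-work prefix
recomputed = hashlib.sha256((secondBlock.data + secondBlock.nonce).encode('utf-8')).hexdigest()
print (recomputed == secondBlock.hash and recomputed.startswith("00"))   #True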

Congrats! You’ve Made It.

We are all done! Here is the full code for this tutorial:

import hashlib
import datetime
import random

class Block:
    def getPoWHash (self, data):
        nonce = str(random.randint(0, 300000))
        noncedData = data + nonce
        hash = hashlib.sha256 (noncedData.encode('utf-8')).hexdigest()
        #hashlib requires data to be encoded before hashing
        if hash.startswith ("00"):
            self.nonce = nonce
            return hash
        else:
            return self.getPoWHash (data)

    def __init__ (self, data, previousHash):
        self.data = data
        self.nonce = None
        self.previousHash = previousHash
        self.timeStamp = datetime.datetime.now()
        self.hash = self.getPoWHash (data)

genesisBlock = Block("I am genesis block!", 0)
secondBlock = Block("I am second block", genesisBlock.hash)
thirdBlock = Block("I am third block", secondBlock.hash)

print ("Genesis Block Hash: ", genesisBlock.hash)
print ("Second Block Hash: ", secondBlock.hash)
print ("Third Block Hash: ", thirdBlock.hash)

What’s Ahead?

Our Blockchain is stored on a single machine, but a Blockchain serves its purpose only when it is distributed. Nevertheless, it is a great start, isn’t it?

Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns binary class labels, that is, "0" or "1". In this example, we consider a data set that consists of only one variable, "study hours", and a class label indicating whether the student passed (1) or failed (0).

from pyspark import SparkContext
import numpy as np
from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext ()

def createLabeledPoints(label, points):
    return LabeledPoint(label, points)

studyHours = [
[ 0, [0.5]],
[ 0, [0.75]],
[ 0, [1.0]],
[ 0, [1.25]],
[ 0, [1.5]],
[ 0, [1.75]],
[ 1, [1.75]],
[ 0, [2.0]],
[ 1, [2.25]],
[ 0, [2.5]],
[ 1, [2.75]],
[ 0, [3.0]],
[ 1, [3.25]],
[ 0, [3.5]],
[ 1, [4.0]],
[ 1, [4.25]],
[ 1, [4.5]],
[ 1, [4.75]],
[ 1, [5.0]],
[ 1, [5.5]]
]

data = []

for x, y in studyHours:
    data.append(createLabeledPoints(x, y))

model = LogisticRegressionWithLBFGS.train( sc.parallelize(data) )

print (model)

print (model.predict([1]))

Output:

spark-submit regression-mllib.py
(weights=[0.215546777333], intercept=0.0)
1
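
To see where that final 1 comes from: the trained model applies the logistic (sigmoid) function to weights·x + intercept and predicts class 1 when the resulting probability exceeds the default 0.5 threshold. A rough check using the weights printed above:

import math

#weights and intercept as printed in the output above
w, b = 0.215546777333, 0.0

#probability that a student who studied 1 hour passes
p = 1.0 / (1.0 + math.exp(-(w * 1 + b)))
print (p)   #roughly 0.55, above 0.5, so model.predict([1]) returns 1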

References:

  1. Logistic Regression - Wikipedia.org
  2. See other posts in Learn Data Science

k-Means Clustering Spark Tutorial : Learn Data Science

k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes.

Arguments to KMeans.train:

  1. k is the number of desired clusters
  2. maxIterations is the maximum number of iterations to run
  3. runs is the number of times to run the k-means algorithm
  4. initializationMode can be either ‘random’ or ‘k-means||’
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()
sc.setLogLevel ("ERROR")

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

Output
0
1
1
0
0
0
0
0
0
0
0
0
(10 items go to cluster 0, whereas 2 items go to cluster 1)

Above is a very naive example in which we use the training dataset as input data too. In the real world we would train a model, save it, and later use it to predict clusters for new input data. So here is how you can save a trained model and later load it for prediction.

Training and Storing the Model

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

model.save(sc, "savedModelDir")

This will create a directory, savedModelDir, with two subdirectories, data and metadata, where the model is stored.

Using the Already Trained Model for Predicting Clusters

Now let’s use the trained model by loading it. We need to import KMeansModel in order to load the model from a file.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

sc = SparkContext()

#Generate Kmeans
model = KMeansModel.load(sc, "savedModelDir")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))

References:

  1. Clustering and Feature Extraction in MLlib, UCLA
  2. k-Means Clustering Algorithm Explained, DnI Institute
  3. k-Means Clustering with Python, iDevji

Functional Programming in Python with Lambda Map Reduce and Filter

Functional programming in Python is possible with the use of the lambda, map, reduce and filter functions. This article briefly describes the use of each of these functions.

Lambda: Lambda specifies an anonymous function. It is used to declare a function with no name, when you want to use the function only once. But why would you declare a function if you don’t want to reuse the code? Read on and you’ll see. Syntax: lambda arg1, arg2 : expression

lambda x : x*x

This lambda expression takes just one argument x and returns the square of x.
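
A lambda can also be bound to a name and called like an ordinary function, although its real value is being passed inline to functions such as map():

square = lambda x : x*x
print (square(5))   #output: 25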

Map: map() takes two arguments: the first is a function and the second is a sequence. It applies the function to all elements in the sequence and returns a new sequence of the results. Syntax: map (func, sequence)

nums = [1, 2, 3]
print (list(map(lambda x : x*x, nums)))

#output: [1, 4, 9]

This code also demonstrates the use of lambda: instead of writing a separate square function we substituted a lambda expression. map() applies it to all elements in the list and returns a new sequence in which each element is the square of the original (wrapped in list() above because map() returns an iterator in Python 3).

Reduce: reduce() repeatedly applies a function to the elements of a sequence, accumulating a single value. In the following example we sum all elements of the original list. In Python 3, reduce() must be imported from functools. Syntax: reduce (func, sequence)

from functools import reduce

print (reduce(lambda x, y : x+y, nums))
#output: 6

Filter: filter() keeps all values in a sequence for which the given function returns True. Syntax: filter (booleanFunc, sequence)

print (list(filter(lambda x : x%2, nums)))
#output: [1, 3]

The above example returns all odd integers in the list. Remember that 2%2 = 0, and 0 is treated as the boolean value False.
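
Putting the three together, here is a small combined example: the sum of the squares of the odd numbers in a list.

from functools import reduce

nums = [1, 2, 3, 4, 5]
odds = filter(lambda x : x%2, nums)          #keep the odd numbers: 1, 3, 5
squares = map(lambda x : x*x, odds)          #square them: 1, 9, 25
print (reduce(lambda x, y : x+y, squares))   #output: 35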

k-means Clustering Algorithm with Python : Learn Data Science

The k-means clustering algorithm is used to group samples (items) into k clusters; k is specified by the user. The method works by assigning each sample to the cluster whose centroid (mean) is nearest, hence the name k-means clustering. Euclidean distance is used as the distance measure. See the references for more information on the algorithm. This article describes the k-means clustering algorithm with Python.

About this implementation:

  • First k samples are assigned as cluster centroids
  • Cluster IDs start with 1
  • Final assignments are written to a file named assignment-results.txt

Implementation:

import math

nsample = int (input ("Number of Samples: "))
nvar = int (input ("Number of Variables: "))
k = int (input ("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

#Input samples
sampleCount = 0
for sample in sampleList:
    print ("\n\nCollecting Data for Sample #{}:".format(sampleCount+1))
    print ("----------------------------------------")
    sampleCount += 1
    i = 0
    while i < nvar:
        sample [i] = int (input ("Data for var-{} : ".format(i+1)))
        i += 1

#First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
    j = 0
    while j < nvar:
        centroidList[i][j] = sampleList[i][j]
        j += 1
    i += 1

# distanceList maintains the Euclidean distance of a given sample
# to each of the k cluster centroids
distanceList = [0.0 for x in range (k)]

#Open file for writing assignments
fileObject = open ("assignment-results.txt","w")

for sample in sampleList:
    n = 0
    for centroid in centroidList:
        var = 0
        total = 0
        while var < nvar:
            temp = (sample[var] - centroid[var]) ** 2
            total += temp
            var += 1
        distanceList[n] = math.sqrt (total)
        n += 1

    #Write assignment (nearest centroid, 1-based) to file
    fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList))+1))

#Close the file
fileObject.close()
print ("\n\n Final assignments successfully written to file! \n")

References:

  1. K Means Clustering Algorithm: Explained

Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

Hortonworks Sandbox for Hadoop Data Platform (HDP) is a quick and easy personal desktop environment to get started on learning, developing, testing and trying out new features. It saves the user from the installation and configuration of Hadoop and other tools. This article explains how to run the Python MapReduce word count example using Hadoop Streaming.

Requirements: The minimum system requirement is 8 GB+ RAM; only with 10 GB+ RAM can you comfortably dedicate 8 GB to the VM. If you do not fulfill this requirement, you can try it on cloud services such as Azure, AWS or Google Cloud. This article uses examples based on HDP 2.3.2 running on an Oracle VirtualBox-hosted Ubuntu 16.04.

Download and Installation: Follow this guide from Hortonworks to install the sandbox on Oracle VirtualBox.

Steps:

  1. Download example code and data from here

  2. Start sandbox image from VirtualBox

  3. From Ubuntu’s web browser, log in to the dashboard at 127.0.0.1:8888 (username/password: raj_ops/raj_ops)

  4. From dashboard GUI, create directory input

  5. Upload sample.txt to input using Ambari > Files View > Upload

  6. Again, from the web browser, log in to the HDP shell at 127.0.0.1:4200 (username/password: root/password)

  7. From the shell, upload mapper.py and reducer.py using the following secure copy (scp) commands:

    scp -P 2222 /home/username/Downloads/mapper.py root@sandbox.hortonworks.com:/
    scp -P 2222 /home/username/Downloads/reducer.py root@sandbox.hortonworks.com:/

  8. Run the job using:

    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input /input -output /output -mapper /mapper.py -reducer /reducer.py

    Note: Do not create output directory in advance. Hadoop will create it.

  9. Test output:

    hadoop fs -cat /output/part-00000
    real 1
    my 2
    is 2
    but 1
    kolkata 1
    home 2
    kutch 2
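
For reference, with Hadoop Streaming the mapper and reducer are plain Python scripts that read from standard input and write tab-separated key/value pairs to standard output; Hadoop sorts the mapper output by key before it reaches the reducer. A minimal word-count pair, sketched here as an illustration (the actual mapper.py and reducer.py from the download may differ), looks like this:

#!/usr/bin/env python
#mapper.py : emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print ("{}\t1".format(word))

#!/usr/bin/env python
#reducer.py : sum the counts for each word (input arrives grouped by word)
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print ("{}\t{}".format(current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print ("{}\t{}".format(current_word, current_count))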

References:

  1. Python MapReduce : Running Your First Hadoop Streaming Job
  2. Map Reduce Word Count With Python : The Simplest Tutorial

MapReduce Streaming Example : Running Your First Job on Hadoop

This MapReduce streaming example will help you run a word count program using Hadoop Streaming. We use Python for writing the mapper and reducer logic. The data is stored in the file sample.txt. The mapper, reducer and data can be downloaded in a bundle from the link provided. Prerequisites:

  • Hadoop 2.6.5
  • Python 2.7
  • Log in with your Hadoop user
  • Working directory should be set to /usr/local/hadoop.

Steps:

  1. Make sure you’re in /usr/local/hadoop; if not, use:

    cd /usr/local/hadoop

  2. Start HDFS:

    start-dfs.sh

  3. Start YARN:

    start-yarn.sh

  4. Check if everything is up (6 services should be running):

    jps

  5. Download data and code files to be used in this tutorial from here.

  6. Unzip contents of streaming.zip:

    unzip streaming.zip

  7. Move mapper and reducer Python files to /home/$USER/:

    mv streaming/mapper.py /home/$USER/mapper.py
    mv streaming/reducer.py /home/$USER/reducer.py

  8. Create working directories for storing data and downloading output when the Hadoop job finishes:

    mkdir /tmp/wordcount/
    mkdir /tmp/wordcount-output/

  9. Move sample.txt to /tmp/wordcount/:

    mv streaming/sample.txt /tmp/wordcount/

  10. Upload data to HDFS:

    hdfs dfs -copyFromLocal /tmp/wordcount /user/$USER/wordcount

  11. Submit the job:

    yarn jar share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar \
    -file /home/$USER/mapper.py -mapper /home/$USER/mapper.py \
    -file /home/$USER/reducer.py -reducer /home/$USER/reducer.py \
    -input /user/$USER/wordcount/* \
    -output /user/$USER/wordcount-output

  12. When the job finishes, download the output data:

    hdfs dfs -getmerge /user/$USER/wordcount-output /tmp/wordcount-output/output.txt

  13. See the word count output on the terminal:

    cat /tmp/wordcount-output/output.txt

Note: Many common errors are documented in the comments section; please see the comments for help.

References:

  1. Hadoop Streaming
  2. Hadoop Installation

Extracting Text from PDF Using Apache Tika - Learn NLP

Most NLP applications need to look beyond text and HTML documents, as information may be contained in PDF, ePub or other formats. The Apache Tika toolkit extracts metadata and text from such document formats, and it comes with a REST-based Python library. In this example we’ll see how to extract text from a PDF using the Apache Tika toolkit.

Tika Installation
pip install tika

Extracting Text

from tika import parser

#Replace document.pdf with filename
text = parser.from_file('document.pdf')

print (text['content'])

Tika makes it very convenient to extract text not just from PDFs but from more than ten formats. Here is a list of all supported document formats.
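
Since Tika extracts metadata as well as text, the dictionary returned by parser.from_file() also carries it. A small sketch (the exact metadata keys depend on the document):

from tika import parser

parsed = parser.from_file('document.pdf')

print (parsed['metadata'])        #document metadata such as content type, author, dates
print (parsed['content'][:200])   #first 200 characters of the extracted text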

References:

  1. Apache Tika Home Page
  2. PyPi Tika 1.15 Package

Tutorial: Text Classification With Python Using fastText

Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first. Note: shell commands should not be confused with Python code.

Get the Training Data Set: We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data.

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

head cooking.stackexchange.txt

As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels starting with __label__, and this file is already in that shape. The next task is to train the classifier.

Training the Classifier: Let’s check the size of the training data set:

wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train

tail -n 3000 cooking.stackexchange.txt > cooking.valid

Now let’s train using cooking.train:

import fasttext

classifier = fasttext.supervised('cooking.train', 'model_cooking')

Our First Prediction:

label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print (label)

label = classifier.predict('Why not put knives in the dishwasher? ')
print (label)

It may come up with tags like baking and food-safety respectively! The second tag is not relevant, which points out that our classifier is of poor quality. Let’s test its quality next.

Testing Precision and Recall: Precision and recall are used to measure the quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against the cooking.valid data:

result = classifier.test('cooking.valid')

print (result.precision)
print (result.recall)
print (result.nexamples)

There are a number of ways we can improve our classifier; see the next post: Improving fastText Classifier

References:

  1. PyPI, fasttext 0.7.0, Python.org
  2. fastText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!