Merge Sort : Why Sorting Lists in Halves is Interesting?

Searching and sorting are two basic problems which occur in various more complex computing problems. Time complexity of sorting algorithms cannot get better than O (nlogn). Many popular algorithms have polynomial time complexity O (n2). Undoubtedly sorted lists are interesting because we can solve many problems using it: from binary search to Kruskal’s algorithm. This post tries to point out why sorting divisions of lists (say, two halves) is interesting using examples of merge sort and selection algorithms. If we have a list of unsorted numbers:
A= [9, 5, 8, 3, 6, 7, 4, 0, 2, 5]

sorting it using any O(n2) sorting algorithm requires 102 = 100 units of time. If we divide the list into two lists of 5 elements each

A1 = [9, 5, 8, 3, 6]

and
A2 = [7, 4, 0, 2, 5]

each list will require 52 = 25 units of time which is only half of original 100. But the problem is that if we again concatenate the sorted lists the resultant list still remains unsorted :

A' = [3, 5, 6, 8, 9, 0, 2, 4, 5, 7]

Even if we could not sort the list as whole reducing time complexity is certainly interesting! This property can be used in recursively sorting the array more efficiently. Merge-sort is one such algorithm with O (n log n) time complexity. Merge sort does not concatenate divided list but as the name suggests it merges them. Merging two sorted lists is easy. Another example is selection algorithms where the problem is to find kth smallest/largest elements in a list. Selection also requires sorting. Median of medians is a selection algorithm which uses this property to look for kth smallest/largest element by reducing search space into half at each iteration. Both these algorithms are examples of divide and conquer strategy.

Zeno's Paradoxes

Zeno was a Greek philosopher who lived circa 490 to 430 BC. Zeno’s paradoxes paradoxes have puzzled us for more than 2500 years now; three of them are presented here for you to ponder upon.

1. Achilles and the tortoise: In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.

2. Dichotomy paradox: That which is in locomotion must arrive at the half-way stage before it arrives at the goal.  

3. Arrow paradox: If everything when it occupies an equal space is at rest, and if that which is in locomotion is always occupying such a space at any moment, the flying arrow is therefore motionless.

Let’s think!

How Credit Card Numbers Work?

How credit card numbers work? It is interesting to see how quickly payment gateway web page can check whether given card number is valid. There is a pattern that every card number adheres to. Such patterns can be matched using JavaScript. A credit card number is made-up of four different components: MII (digit 1), IIN (digits 1-6), IAI (digits 7-15) and a check digit (last digit). Let’s see how credit card numbers work with the help of an example.

MII: In the above example, 6 is MII or Major Industry Identifier. This one digit identifier tells us which industry the card belongs to. Following table lists MIIs and industries they represent:

MII Industry
0 ISO/TC 68 and other industry assignments
1 Airlines
2 Airlines, financial and other future industry assignments
3 Travel and entertainment
4 Banking and financial
5 Banking and financial
6 Merchandising and banking/financial
7 Petroleum and other future industry assignments
8 Healthcare, telecommunications and other future industry assignments
9 For assignment by national standards bodies

IIN/BIN: Digits 1 to 6 make up IIN or Issuer Identification Number. This six digit number includes the first digit of MII. IIN tells you which bank has issued the card. It is sometimes also referred to as BIN (Bank Identifier Number). The website Binlists is an organized collection of BINs for credit cards as well as debit cards.

IAI: 7 to 15 digits represent Individual Account Identifier. This number is used to identify the customer account associated with the card.

Check Digit: The last digit is a check digit calculated by Luhn algorithm. In our example card number, check digit is 5.

6069 9832 3412 3455

Luhn test is at the heart of the question how credit card numbers work! Luhn test is the frontline defence against fraud. If you are a programmer, you may want to take a look at Rosettacode.org where this algorithm is implemented in 98 different programming languages.

References

  1. Patterns in Card Numbers
  2. Binlists.com
  3. Find Digital Root of a Number
  4. Rosettacode.org

Apriori Algorithm for Generating Frequent Itemsets

Apriori Algorithm is used in finding frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangement if they know which items are purchased together frequently. The challenge is that given a dataset D having T transactions each with n number of attributes, how to find itemsets that appear frequently in D? This can be trivially solved by generating all possible itemsets (and checking each of the candidate itemset against support threshold.) which is computationally expensive. Apriori algorithm effectively eliminates majority of itemsets without counting their support value. Non-frequent itemsets are known prior to calculation of support count, thus the name Apriori alogrithm. It works on the following principle which is also known as apriori property:  

If an itemset is frequent, then all of its subsets must also be frequent.

  Suppose{c, d, e}is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent. The algorithm works on bottom-up approach, that is generate 1-itemsets, eliminate those below support threshold then generate 2-itemsets, eliminate … and so on. This way only frequent _k_-itemsets are used to generate _k+1_-itemsets significantly reducing number of candidates. Let’s see an example: Our dataset contains four transactions with following particulars:

TID

Items

100

1, 3, 4

200

2, 3, 5

300

1, 2, 3, 5

400

2, 5

Support value is calculated as support (a -> b) = Number of transactions a and b appear in / total transactions. But for the sake of simplicity we use support value as number of times each transaction appears. We also assume support threshold = 2. Step 1 : Generate 1-itemsets, calculate support for each and mark itemsets below support threshold

Itemset

Support

{1}

2

{2}

3

{3}

3

{4}

1

{5}

3

Now, mark itemsets where support is below threshold

Itemset

Support

{1}

2

{2}

3

{3}

3

{4}

1

{5}

3

Step 1 : Generate 2-itemsets, calculate support for each and mark itemsets below support threshold. Remember for generating 2-itemsets we do not consider eliminated 1-itemsets.

Itemset

Support

{1, 2}

1

{1, 3}

2

{1, 5}

1

{2, 3}

2

{2, 5}

3

{3, 5}

2

Marking 2-itemsets below support threshold

Itemset

Support

{1, 2}

1

{1, 3}

2

{1, 5}

1

{2, 3}

2

{2, 5}

3

{3, 5}

2

Step 3 : Generate 3-itemsets, calculate support for each and mark itemsets below support threshold

Itemset

Support

{2, 3, 5}

2

  Stop: We stop here because 4-itemsets cannot be generated as there are only three items left! Following are the frequent itemsets that we have generated and which are above support threshold: {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5} and {2, 3, 5}. Itemsets generated but not qualified are : {4}, {1, 2} and {1, 5}. If we do not use Apriori and instead rely on brute-force approach then number of candidate itemsets generated will be much larger! Feel free to ask your questions in comments section below!

Functional Programming in Python with Lambda Map Reduce and Filter

Functional programming in Python is possible with the use of lambda map reduce and filter functions. This article briefly describe use of each these functions.

Lambda : Lambda specifies an anonymous function. It is used to declare a function with no name; When you want to use function only once. But why would you declare a function if you don’t want to reuse the code? Read on you’ll see. Syntax: lambda arg1, arg2 : expression

lambda x : x*x

This lambda expression with just one argument x which returns square of x.

Map : It takes two arguments, the first argument is name of a function and second argument is a sequence. map() applies function f to all elements in the sequence and returns a new sequence. Syntax: map (func, sequence)

1
2
3
4
list = [1, 2, 3]
map (lambda x : x*x, list)

#output: [1, 4, 9]

This code also demonstrates use of lambda. Instead writing a square function we substituted it with a lambda expression. map() applies it to all elements in the list and returns a new list with each element square of original element.

Reduce: reduce() continuously applies a function to a sequence and returns one value. In the following example we sum all elements in the original list. Syntax: reduce (func, sequence)

(lambda x,y : x+y, list)
1
#output: 6

Filter: It filters all values in a sequence for which given function returns True. Syntax: filter (booleanFunc, sequence)

1
2
filter (lambda x : x%2, list)
#output: [1, 3]

Above example returns all odd integers in the list. Remember 2%2=0 is treated as boolean value False.

Why Logarithms are Beautiful?

Binary logarithms or log2 n is the power to which the number 2 must be raised to obtain value n. Binary logarithm (and others) has numerous applications in computer science. Let’s take analysis of algorithms for example. All algorithms have a running time, also called time complexity of algorithms. An algorithm’s running time is measured against size of the input. When an algorithm’s running time increases in exponential proportion to the input size it is exponential time. This can be represented as O (cn) using Big-O notation. But when dealing with algorithms where time complexity decreases with each iteration by a factor of any given number, representation becomes difficult. This is where logarithms help because logx n is inverse of nx.

An Example

Take example of binary search, if we had a sorted list of 32 elements running time can be computed as:

Iteration Running Time
1 32
2 16
3 8
4 4
5 2

At each iteration running time decreases by a factor of 2. Thus we say that time complexity of binary search algorithm is O (log2 n). If there is a function whose running time decreases by a factor of 5 it can be represented as O (log5 n). Generally it is written as O (lg n). Without logarithms it would have been really ugly to represent slowly growing functions.

References

  1. Binary logarithms, Wikipedia
  2. Logarithmic growth, Wikipedia
  3. BigOCheatSheet.com
  4. Working with exponentials and logarithms, MathIsFun.com

k-means Clustering Algorithm with Python : Learn Data Science

_k_-means clustering algorithm is used to group samples (items) in k clusters; k is specified by the user. The method works by calculating mean distance between cluster centroids and samples, hence the name k-means clustering. Euclidean distance is used as distance measure. See references for more information on the algorithm. This is a article describes k-means Clustering Algorithm with Python.

About this implementation :

  • First k samples are assigned as cluster centroids
  • Cluster IDs start with 1
  • Final assignments are printed in the file named assignment-results.txt
  • Final assignments are printed

Implementation :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import math

nsample = int (input ("Number of Samples: "))
nvar = int (input ("Number of Variables: "))
k = int (input ("Number of Clusters: "))

sampleList = [[0 for x in range(nvar)] for y in range(nsample)]

#Input samples
sampleCount = 0
for sample in sampleList:
print ("\n\nCollecting Data for Sample #{}:".format(sampleCount+1))
print ("----------------------------------------")
i = 0
while i < nvar:
sample [i] = int (input ("Data for var-{} : ".format(i+1)))
i += 1

#First k samples are chosen as cluster centroids
centroidList = [[0 for x in range(nvar)] for y in range(k)]
i = 0
while i < k:
j = 0
while j < nvar:
centroidList[i][j] = sampleList[i][j]
j += 1
i += 1

# distanceList maintains Euclidean distance of given sample
# for all clusters k
distanceList = [0.0 for x in range (k)]

#Open file for writing assignments
fileObject = open ("assignment-results.txt","w")

for sample in sampleList:
n = 0
for centroid in centroidList:
var = 0
total = 0
while var < nvar:
temp = (sample[var] - centroid[var]) ** 2 var += 1
total += temp
distanceList[n] = math.sqrt (total)
n += 1

#Write assignments to file
fileObject.write("{} \t {}\n".format(sample, distanceList.index(min(distanceList))+1))

#Close the file
fileObject.close()
print ("\n\n Final assignments successfully written to file! \n")

References :

  1. K Means Clustering Algorithm: Explained

Pangat : Finding CS Parallels in Non-Computing Environments

A letter from one Fred Murphy published in Communications of the ACM in June 2015 excited me as a teacher as it explained how computer science can be learned even in a cafeteria. I have cited this article numerous times to my students explaining them that resources should never keep you from learning. Word Pangat originates from Sanskrit term Pankti meaning line. In India, Pangat is preferred way to serve food where people sit on the floor forming a line with their plates on floor. Food is served by volunteers who carry different food items in containers. There is a Wikipedia article on it (although it reflects only Sikh perspective, Pangat is common across all communities and regions in India). Pangat volunteers would have mastered some basic principles which can be easily correlated with commonly used principles in Computer Science. Here are a few of them.

  1. Priority Scheduling : When multiple volunteers serve food at different speeds, the slower one gives way to the faster. For example, a volunteer serving rice will naturally be slower than one serving chutney so it gets priority.
  2. De-fragmentation : Items have to be rearranged time to time as volunteers sometimes misplace an item on the plate or items are spread in the plate.
  3. Flags : A person denies more food being served in the plat when she feels that she is full.
  4. Garbage Collection : Pangats generally by virtue require that you eat what you take and clean your own plate.

We find computer science theories, metaphors and parallels can be found in non-computing environments if observe closely. CS Unplugged is a collection of free learning activities that teach Computer Science through engaging games and puzzles that use cards, string, crayons and lots of running around. References :

  1. Reaching a Broader Population of Students through ‘Unplugged’ Activities, Thomas J. Cortina, Communications of the ACM
  2. CSUnplugged.org

Hello World in 10 Different Programming Languages

Writing Hello World is a custom I dare not to break as I start off with this brand new blog. Here are Hello World programs in ten different programming languages.

1. The C Not only C can be termed as mother of most high level languages but it remains relevant till date. It is undoubtedly the most popular language of all times.

1
2
3
#include void main() {
printf("Hello World!");
}

2. C ++ Bjarne Stroustrup’s language gave programmers a new world view with Object Oriented Programming.

1
2
3
#include void main() {
cout << "Hello World!";
}

3. UNIX Shell Shell is like a Ninja’s sword! You need to see this.

1
2
#!/bin/sh
echo "Hello World!"

4. Java Java is perhaps the only ubiquitous multi platform language we have seen.

1
2
3
4
5
public class HelloWorld {
public static void main(String args \[\]) {
System.out.print("Hello World!");
}
}

5. Hello World in C# C# is a great programming language based on C from Microsoft.

1
2
3
4
5
public class HelloWorld {
static void Main() {
System.Console.Write("Hello World!");
}
}

6. JavaScript JS shaped platforms which employ half of the Indian programmers.

1
alert ("Hello World");

7. PHP PHP makes great front-ends for web applications.

1
echo "Hello World!";

8. Python My favourite! Python is simple, readable and powerful.

1
print ('Hello World!')

9. PL/SQL When you cannot think in SQL, you use PL/SQL!

1
2
3
begin
dbms\_output.put\_line ('Hello World!');
end;

10. Arduino C Arduino propels IoT development with affordable open source prototyping boards for everyone. Arduino-C is indeed C!

1
2
3
4
5
6
7
8
9
10
11
12
13
// setup () function runs only once
void setup() {
// Use on-board LED as output
pinMode(LED\_BUILTIN, OUTPUT);
}

// loop function runs forever
void loop() {
digitalWrite(LED\_BUILTIN, HIGH); // turn the LED on
delay(1000); // wait a sec (or 1000 ms)
digitalWrite(LED\_BUILTIN, LOW); // turn the LED off
delay(1000); // wait a sec (or 1000 ms)
}

So this is how journey at iDevji starts, I hope it will create value, may be little, for everyone specially my students!