Zeno's Paradoxes

Zeno was a Greek philosopher who lived circa 490 to 430 BC. Zeno's paradoxes have puzzled thinkers for more than 2,500 years now; three of them are presented here for you to ponder.

1. Achilles and the tortoise: In a race, the quickest runner can never overtake the slowest, since the pursuer must first reach the point whence the pursued started, so that the slower must always hold a lead.

2. Dichotomy paradox: That which is in locomotion must arrive at the half-way stage before it arrives at the goal.  

3. Arrow paradox: If everything when it occupies an equal space is at rest, and if that which is in locomotion is always occupying such a space at any moment, the flying arrow is therefore motionless.

Let’s think!

How Credit Card Numbers Work

It is interesting to see how quickly a payment gateway web page can check whether a given card number is valid. Every card number adheres to a pattern, and such patterns can be matched, for example with JavaScript on the checkout page. A credit card number is made up of four components: the MII (digit 1), the IIN (digits 1-6), the IAI (digits 7-15), and a check digit (the last digit). Let's see how credit card numbers work with the help of an example:

6069 9832 3412 3455

MII: In the example above, 6 is the MII, or Major Industry Identifier. This one-digit identifier tells us which industry the card belongs to. The following table lists MIIs and the industries they represent:

| MII | Industry |
|-----|----------|
| 0 | ISO/TC 68 and other industry assignments |
| 1 | Airlines |
| 2 | Airlines, financial and other future industry assignments |
| 3 | Travel and entertainment |
| 4 | Banking and financial |
| 5 | Banking and financial |
| 6 | Merchandising and banking/financial |
| 7 | Petroleum and other future industry assignments |
| 8 | Healthcare, telecommunications and other future industry assignments |
| 9 | For assignment by national standards bodies |

IIN/BIN: Digits 1 to 6 make up the IIN, or Issuer Identification Number. This six-digit number includes the MII as its first digit. The IIN tells you which bank issued the card. It is sometimes also referred to as the BIN (Bank Identification Number). The website Binlists is an organized collection of BINs for credit cards as well as debit cards.

IAI: Digits 7 to 15 make up the Individual Account Identifier. This number identifies the customer account associated with the card.

Check Digit: The last digit is a check digit calculated using the Luhn algorithm. In our example card number, the check digit is 5.

6069 9832 3412 3455
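To make the components concrete, here is a small sketch (not from the original post) that slices the example number into its parts:

card = "6069 9832 3412 3455".replace(" ", "")

mii   = card[0]     # '6': Major Industry Identifier
iin   = card[:6]    # '606998': Issuer Identification Number (includes the MII)
iai   = card[6:15]  # '323412345': Individual Account Identifier
check = card[15]    # '5': check digit
print(mii, iin, iai, check)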

The Luhn test is at the heart of the question of how credit card numbers work; it is the frontline defence against mistyped and fraudulent numbers. If you are a programmer, you may want to take a look at Rosettacode.org, where this algorithm is implemented in 98 different programming languages.
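Here is a minimal sketch of the Luhn check in Python; it accepts any string of digits, and the function name and interface are just illustrative:

def luhn_valid(number):
    digits = [int(d) for d in number]
    # double every second digit from the right, subtracting 9
    # whenever the doubled value exceeds 9
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    # the number is valid when the total is a multiple of 10
    return sum(digits) % 10 == 0

print(luhn_valid("6069983234123455"))  # True: our example number passes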

References

  1. Patterns in Card Numbers
  2. Binlists.com
  3. Find Digital Root of a Number
  4. Rosettacode.org

Apache Log Visualization with Matplotlib : Learn Data Science

This post discusses Apache log visualization with the Matplotlib library. First, download the data file used in this example from here.

We will require numpy and matplotlib:

In [1]:

import numpy as np
import matplotlib.pyplot as plt

numpy.loadtxt() can directly load a text file into an array. requests-idevji.txt contains only the hour at which each request was made; this is achieved by pre-processing the Apache log.
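The pre-processing step itself is not shown in the post; a minimal sketch could look like this, assuming the standard Apache timestamp format (e.g. [10/Oct/2017:13:55:36 +0530]) and a hypothetical access.log filename:

# extract the hour from each request's timestamp
with open("access.log") as logfile, open("requests-idevji.txt", "w") as out:
    for line in logfile:
        timestamp = line.split("[", 1)[1].split("]", 1)[0]
        out.write(timestamp.split(":")[1] + "\n")  # e.g. "13"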

In [2]:

data = np.loadtxt('requests-idevji.txt')

We need 24 bins because we have 24 hours’ data. For other attributes of hist() see references.

In [3]:

plt.hist(data, bins=24)
plt.title("Requests @ iDevji")
plt.xlabel("Hours")
plt.ylabel("# Requests")

Out[3]:

Text(0,0.5,'# Requests')

In [4]:

plt.show()

[Figure: histogram of requests per hour, titled "Requests @ iDevji"]

References:

  1. numpy.loadtxt(), SciPy.org
  2. PyPlot API, Matplotlib.org

NLTK Example : Detecting Geographic Setting of Sherlock Holmes Stories

As a young adult, nothing thrilled me more than Jeremy Brett's performance as Sherlock Holmes. "You know my methods, apply them!" he would say. So let's try to play Sherlock ourselves. We use the Natural Language Toolkit, or NLTK, to guess the setting of a Sherlock story in terms of its geographic location. In this NLTK example, our approach is very naive: identify the most frequently mentioned place in the story. We use Named Entity Recognition (NER) to identify geopolitical entities (GPEs) and filter out the most frequent of them. This approach is very naive because there is no pre-processing of the text, and GPEs may include concepts other than geographic locations, such as nationalities. But we want to keep this really simple and fun. So here we go.

Code:

#NLTK example
#This code reads one text file at a time

from collections import Counter
from nltk import word_tokenize, pos_tag, ne_chunk

# read a text file and replace newlines with spaces
with open('filepath/file.txt') as text:
    data = text.read().replace('\n', ' ')

chunked = ne_chunk(pos_tag(word_tokenize(data)))

# extract GPEs
extracted = []
for chunk in chunked:
    if hasattr(chunk, 'label'):
        if chunk.label() == 'GPE':
            extracted.append(' '.join(c[0] for c in chunk))

# extract the most frequent GPE
count = Counter(extracted)
print(count.most_common(1))

Results:

| Sr. | Story | Extracted Location | Actual Setting | Result |
|-----|-------|--------------------|----------------|--------|
| 1 | The Adventure of the Dancing Men | [('Norfolk', 14)] | Norfolk | Success |
| 2 | The Adventure of the Solitary Cyclist | [('Farnham', 6)] | Farnham | Success |
| 3 | A Scandal in Bohemia | [('Bohemia', 6)] | Bohemia | Success |
| 4 | The Red-Headed League | [('London', 7)] | London | Success |
| 5 | The Final Problem | [('London', 8)] | London | Success |
| 6 | The Greek Interpreter | [('Greek', 15)] | Greece | Fail |

We got 5 out of 6 predictions correct! These are not discouraging results, and we might use this code somewhere in a more serious application.

References:

  1. Sherlock Holmes Stories in Plain Text
  2. NLTK Documentation

Building a Movie Recommendation Service with Apache Spark

In this tutorial I'll show you how to build a movie recommendation service with Apache Spark. Two users are alike if they rate products similarly. For example, if Alice rated a book 3/5 and Bob rated the same book 3.3/5, they are very much alike. Now if Bob buys another book and rates it 4/5, we should suggest that book to Alice; that is what a recommender system does. See the references if you want to know more about how recommender systems work. We are going to use the Alternating Least Squares (ALS) method from MLlib and the MovieLens 100K dataset, which is only 5 MB in size. Download the dataset from https://grouplens.org/datasets/movielens/.

Code:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark import SparkContext

sc = SparkContext ()

#Replace filepath with appropriate data
movielens = sc.textFile("filepath/u.data")

movielens.first() #u'196\t242\t3\t881250949'
movielens.count() #100000

#Clean up the data by splitting it;
#the movielens readme says the data is tab-separated:
#user product rating timestamp
clean_data = movielens.map(lambda x: x.split('\t'))

#Map the cleaned data to Rating objects;
#a Rating object is made up of (user, item, rating)
ratings = clean_data.map(lambda x: Rating(int(x[0]),
                                          int(x[1]), float(x[2])))

#Setting up the parameters for ALS
rank = 5 # Latent Factors to be made
numIterations = 10 # Times to repeat process

#Need a training and test set, test set is not used in this example.
train, test = ratings.randomSplit([0.7,0.3],7856)

#Create the model on the training data
model = ALS.train(train, rank, numIterations)

#For product X, find N users to sell to
model.recommendUsers(242, 100)

#For user Y, find N products to promote
model.recommendProducts(196, 10)

#Predict Single Product for Single User
model.predict(196, 242)
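The test split created above goes unused. As a possible follow-up (not part of the original example), you could score the model on it with mean squared error, along the lines of the standard MLlib recipe:

#Evaluate the model on the held-out test set
testdata = test.map(lambda r: (r[0], r[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
print("Mean Squared Error = " + str(MSE))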

References:

  1. Building a Recommender System in Spark with ALS, LearnByMarketing.com
  2. MovieLens
  3. Video : Collaborative Filtering, Stanford University
  4. Matrix Factorisation and Dimensionality Reduction, Thierry Silbermann
  5. Building a Recommendation Engine with Spark, Nick Pentreath, Packt

GraphFrames PySpark Example : Learn Data Science

In this post, a GraphFrames PySpark example is discussed using the shortest path problem. GraphFrames is a Spark package that provides DataFrame-based graphs in Spark. Spark version 1.6.2 is used for all examples. Include the package when launching the PySpark shell:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Code:

from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext ()
sqlContext = SQLContext(sc)

# create vertex DataFrame for users with id and name attributes
v = sqlContext.createDataFrame([
    ("a", "Alice"),
    ("b", "Bob"),
    ("c", "Charlie"),
], ["id", "name"])

# create edge DataFrame with "src" and "dst" attributes
e = sqlContext.createDataFrame([
    ("a", "b", "friends"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# create a GraphFrame with v, e
from graphframes import *
g = GraphFrame(v, e)

# example : getting in-degrees of each vertex
g.inDegrees.show()

Output:

id inDegree
b 2
c 1

Example: getting "follow" relationships in the graph:

g.edges.filter("relationship = 'follow'").count()

Output:

2

Getting the shortest paths to "a" from each vertex:

results = g.shortestPaths(landmarks=["a"])
results.select("id", "distances").show()
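Note that the edges above are directed and none of them leads back to "a": expect only vertex "a" itself to report a finite distance (0), while "b" and "c" come back with empty distance maps.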

Feel free to ask your questions in the comments section!

Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns a binary class label, "0" or "1". In this example, we consider a data set consisting of only one variable, "study hours", and the class label is whether the student passed (1) or did not pass (0).

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc = SparkContext()

def createLabeledPoints(label, points):
    return LabeledPoint(label, points)

studyHours = [
    [0, [0.5]],
    [0, [0.75]],
    [0, [1.0]],
    [0, [1.25]],
    [0, [1.5]],
    [0, [1.75]],
    [1, [1.75]],
    [0, [2.0]],
    [1, [2.25]],
    [0, [2.5]],
    [1, [2.75]],
    [0, [3.0]],
    [1, [3.25]],
    [0, [3.5]],
    [1, [4.0]],
    [1, [4.25]],
    [1, [4.5]],
    [1, [4.75]],
    [1, [5.0]],
    [1, [5.5]]
]

data = []
for x, y in studyHours:
    data.append(createLabeledPoints(x, y))

model = LogisticRegressionWithLBFGS.train(sc.parallelize(data))

print(model)
print(model.predict([1]))

Output:

spark-submit regression-mllib.py
(weights=[0.215546777333], intercept=0.0)
1
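One aside, not covered in the original post: the returned LogisticRegressionModel can emit raw probabilities instead of hard 0/1 labels if you clear its decision threshold:

model.clearThreshold()     # predict() now returns probabilities
print(model.predict([1]))  # a value in (0, 1) rather than a 0/1 label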


k-Means Clustering Spark Tutorial : Learn Data Science

k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes.

Arguments to KMeans.train:

  1. k is the number of desired clusters.
  2. maxIterations is the maximum number of iterations to run.
  3. runs is the number of times to run the k-means algorithm.
  4. initializationMode can be either 'random' or 'k-means||'.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()
sc.setLogLevel("ERROR")

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

#Print out the cluster of each data point
print(model.predict(array([185, 71])))
print(model.predict(array([170, 56])))
print(model.predict(array([168, 60])))
print(model.predict(array([179, 68])))
print(model.predict(array([182, 72])))
print(model.predict(array([188, 77])))
print(model.predict(array([180, 71])))
print(model.predict(array([180, 70])))
print(model.predict(array([183, 84])))
print(model.predict(array([180, 88])))
print(model.predict(array([180, 67])))
print(model.predict(array([177, 76])))

Output:
0
1
1
0
0
0
0
0
0
0
0
0
(10 items go to cluster 0, whereas 2 items go to cluster 1)

Above is a very naive example in which we use the training dataset as input data too. In the real world we would train a model, save it, and later use it to predict clusters for new input data. So here is how you can save a trained model and later load it for prediction.

Training and Storing the Model

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array

sc = SparkContext()

#12 records with height, weight data
data = array([185,72, 170,56, 168,60, 179,68, 182,72, 188,77, 180,71, 180,70, 183,84, 180,88, 180,67, 177,76]).reshape(12,2)

#Generate Kmeans
model = KMeans.train(sc.parallelize(data), 2, runs=50, initializationMode="random")

model.save(sc, "savedModelDir")

This will create a directory, savedModelDir, with two subdirectories, data and metadata, where the model is stored.

Using the Already Trained Model for Predicting Clusters

Now, let's use the trained model by loading it. We need to import KMeansModel in order to load the model from file.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array

sc = SparkContext()

#Generate Kmeans
model = KMeansModel.load(sc, "savedModelDir")

#Print out the cluster of each data point
print (model.predict(array([185, 71])))
print (model.predict(array([170, 56])))
print (model.predict(array([168, 60])))
print (model.predict(array([179, 68])))
print (model.predict(array([182, 72])))
print (model.predict(array([188, 77])))
print (model.predict(array([180, 71])))
print (model.predict(array([180, 70])))
print (model.predict(array([183, 84])))
print (model.predict(array([180, 88])))
print (model.predict(array([180, 67])))
print (model.predict(array([177, 76])))


Apriori Algorithm for Generating Frequent Itemsets

The Apriori algorithm is used to find frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can make better shelf arrangements if it knows which items are frequently purchased together. The challenge is this: given a dataset D of T transactions over n items, how do we find the itemsets that appear frequently in D? The problem can be solved trivially by generating all possible itemsets and checking each candidate itemset against the support threshold, but this is computationally expensive. The Apriori algorithm effectively eliminates the majority of itemsets without counting their support value; non-frequent itemsets are known prior to the calculation of the support count, hence the name Apriori. It works on the following principle, which is also known as the apriori property:

If an itemset is frequent, then all of its subsets must also be frequent.

Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent. Conversely, if any subset of an itemset is infrequent, the itemset itself cannot be frequent, and this is what lets us prune candidates. The algorithm works bottom-up: generate 1-itemsets and eliminate those below the support threshold, then generate 2-itemsets, eliminate, and so on. This way only frequent k-itemsets are used to generate (k+1)-itemsets, significantly reducing the number of candidates. Let's see an example. Our dataset contains four transactions with the following particulars:

| TID | Items |
|-----|-------|
| 100 | 1, 3, 4 |
| 200 | 2, 3, 5 |
| 300 | 1, 2, 3, 5 |
| 400 | 2, 5 |

Support is calculated as support(a → b) = (number of transactions in which a and b appear) / (total number of transactions). For the sake of simplicity, however, we use as the support value the raw number of transactions in which an itemset appears, and we assume a support threshold of 2.

Step 1: Generate 1-itemsets, calculate the support for each, and mark the itemsets below the support threshold.

| Itemset | Support |
|---------|---------|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| {4} | 1 |
| {5} | 3 |

Now, mark the itemsets where support is below the threshold:

| Itemset | Support |
|---------|---------|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| ~~{4}~~ | ~~1~~ |
| {5} | 3 |

Step 2: Generate 2-itemsets, calculate the support for each, and mark the itemsets below the support threshold. Remember that for generating 2-itemsets we do not consider the eliminated 1-itemsets.

| Itemset | Support |
|---------|---------|
| {1, 2} | 1 |
| {1, 3} | 2 |
| {1, 5} | 1 |
| {2, 3} | 2 |
| {2, 5} | 3 |
| {3, 5} | 2 |

Marking the 2-itemsets below the support threshold:

| Itemset | Support |
|---------|---------|
| ~~{1, 2}~~ | ~~1~~ |
| {1, 3} | 2 |
| ~~{1, 5}~~ | ~~1~~ |
| {2, 3} | 2 |
| {2, 5} | 3 |
| {3, 5} | 2 |

Step 3: Generate 3-itemsets, calculate the support for each, and mark the itemsets below the support threshold.

| Itemset | Support |
|---------|---------|
| {2, 3, 5} | 2 |

Stop: We stop here because 4-itemsets cannot be generated; only three items survive the previous step. The frequent itemsets we have generated, all at or above the support threshold, are {1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5}, and {2, 3, 5}. The itemsets generated but not qualified are {4}, {1, 2}, and {1, 5}. If we did not use Apriori and instead relied on the brute-force approach, the number of candidate itemsets generated would be much larger. A short Python sketch of the algorithm is given below.
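Here is a minimal sketch of Apriori in Python over the example transactions above, with raw counts as support and a threshold of 2 as in the worked example; the function name and structure are just one way to write it:

from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support = 2  # raw count, as in the worked example

def apriori(transactions, min_support):
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while candidates:
        # count support and keep candidates at or above the threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # join frequent k-itemsets into (k+1)-candidates; the apriori
        # property prunes any candidate having an infrequent subset
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1
                      and all(frozenset(s) in level
                              for s in combinations(a | b, k))}
        k += 1
    return frequent

for itemset, support in sorted(apriori(transactions, min_support).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), support)

Running this prints exactly the frequent itemsets listed above, from {1} through {2, 3, 5}. Feel free to ask your questions in the comments section below!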

Functional Programming in Python with Lambda Map Reduce and Filter

Functional programming in Python is possible with the lambda, map, reduce, and filter functions. This article briefly describes the use of each of these functions.

Lambda: lambda specifies an anonymous function. It is used to declare a function with no name, when you want to use the function only once. But why would you declare a function if you don't want to reuse the code? Read on and you'll see. Syntax: lambda arg1, arg2 : expression

lambda x : x*x

This is a lambda expression with just one argument, x, which returns the square of x.

Map: map() takes two arguments: the first is a function and the second is a sequence. map() applies the function to all elements in the sequence and returns a new sequence (in Python 3, an iterator). Syntax: map (func, sequence)

nums = [1, 2, 3]
# map() returns an iterator in Python 3, so wrap it in list() to see the values
list(map(lambda x : x*x, nums))

#output: [1, 4, 9]

This code also demonstrates the use of lambda: instead of writing a separate square function we substituted a lambda expression. map() applies it to all elements in the list and returns a new list in which each element is the square of the original element.

Reduce: reduce() repeatedly applies a function to the elements of a sequence, reducing them to a single value. In Python 3, reduce() lives in the functools module. In the following example we sum all elements in the original list. Syntax: reduce (func, sequence)

from functools import reduce  # needed in Python 3
reduce(lambda x, y : x + y, nums)

#output: 6

Filter: filter() keeps all values in a sequence for which the given function returns True. Syntax: filter (booleanFunc, sequence)

list(filter(lambda x : x % 2, nums))

#output: [1, 3]

The example above returns all odd integers in the list. Remember that 2 % 2 = 0, and 0 is treated as the boolean value False.