
Text classification is an important task with many applications, including sentiment analysis and spam filtering. This article describes supervised text classification using the fastText Python package. You may want to read Introduction to fastText first. Note: commands such as wget, head, wc and tail below are shell commands and should not be confused with the Python code.

Get the Training Data Set: We start by getting the data used to train the classifier. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it. So, first we download the data.

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

head cooking.stackexchange.txt

As the head command shows, each line of the text file contains a list of labels followed by the corresponding document. fastText recognizes labels by the prefix __label__, and this file is already in that shape. The next task is to train the classifier.
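To make the line format concrete, here is a minimal sketch that splits one line into its labels and its text. The helper name parse_line and the example line are made up for illustration; only the __label__ prefix comes from the description above.

```python
LABEL_PREFIX = "__label__"  # prefix that marks a label token


def parse_line(line):
    """Split one training line into (labels, text).

    Hypothetical helper: leading tokens that start with the prefix are
    treated as labels, everything after them as the document text.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX) and not words:
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Made-up example line in the same shape as the data file
example = "__label__baking __label__bread Which dish is best for banana bread ?"
labels, text = parse_line(example)
print(labels)
print(text)
```

A helper like this can be handy for a quick sanity check of the file before training.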

Training the Classifier: Let’s check the size of training data set:

wc cooking.stackexchange.txt

15404 169582 1401900 cooking.stackexchange.txt

It contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

head -n 12404 cooking.stackexchange.txt > cooking.train

tail -n 3000 cooking.stackexchange.txt > cooking.valid
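The same head/tail split can be sketched in Python, which may help if you want to change the sizes later. The function name split_lines is an assumption of this sketch, and the toy list stands in for the lines of the data file.

```python
def split_lines(lines, n_train, n_valid):
    """Return (first n_train lines, last n_valid lines), like head and tail."""
    if n_train + n_valid > len(lines):
        raise ValueError("not enough lines for a non-overlapping split")
    return lines[:n_train], lines[len(lines) - n_valid:]


# Toy stand-in for the 15404 lines of the data file
toy_lines = ["example %d" % i for i in range(10)]
train, valid = split_lines(toy_lines, 7, 3)
print(len(train), len(valid))
```

If the file happens to be ordered in some way, shuffling the lines before splitting may give a fairer validation set.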

Now let’s train the classifier on cooking.train:

import fasttext

classifier = fasttext.supervised('cooking.train', 'model_cooking')

Our First Prediction:

label = classifier.predict('Which baking dish is best to bake a banana bread ?')
print(label)

label = classifier.predict('Why not put knives in the dishwasher? ')
print(label)

It may come up with tags like baking and food-safety respectively! The second tag is not relevant, which suggests that our classifier is poor in quality. Let’s test its quality next.

Testing Precision and Recall: Precision and recall are used to measure the quality of models in pattern recognition and information retrieval. See this Wikipedia article. Let’s test the model against the cooking.valid data:

result = classifier.test('cooking.valid')
print(result.precision)
print(result.recall)
print(result.nexamples)
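To see what these two numbers mean, here is a small hand-rolled sketch. It assumes one predicted label per example, with precision counted over predicted labels and recall over true labels. The toy predictions and tags are made up, and the function is not part of fastText.

```python
def precision_recall_at_1(predicted, gold):
    """predicted: one label per example; gold: one set of true labels per example."""
    correct = sum(1 for p, g in zip(predicted, gold) if p in g)
    n_predicted = len(predicted)           # one prediction per example
    n_gold = sum(len(g) for g in gold)     # all true labels across examples
    return correct / float(n_predicted), correct / float(n_gold)


# Made-up predictions and true tags for three questions
predicted = ["baking", "food-safety", "bread"]
gold = [{"baking", "bread"}, {"knives", "dishwasher"}, {"bread"}]

p, r = precision_recall_at_1(predicted, gold)
print("P@1: %.3f  R@1: %.3f" % (p, r))
```

Two of the three predictions are correct, so precision is 2/3, while recall is only 2/5 because the three questions carry five true tags in total.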

There are a number of ways we can improve our classifier. See the next post: Improving fastText Classifier

References:

  1. PyPI, fasttext 0.7.0, Python.org
  2. fastText, Text classification with fastText
  3. Cooking StackExchange, cooking.stackexchange.com

If you think this post was helpful, kindly share with others or say thank you in the comments below, it helps!

We spent multiple lectures talking about Hadoop architecture at the university. Yes, I even demonstrated the cool playing cards example! In fact, we have an 18-page PDF from our data science lab on the installation. Still, I saw students shy away, perhaps because of the complex installation process involved. This tutorial jumps straight to hands-on coding to help anyone get up and running with MapReduce. No Hadoop installation is required.

Problem : Counting word frequencies (word count) in a file. Data : Create a sample.txt file with the following lines. Preferably, create a directory for this tutorial and put all the files there, including this one.

my home is kolkata
but my real home is kutch

Mapper : Create a file mapper.py and paste the code below into it. The mapper receives data from stdin, splits it into words and prints one tab-separated (word, 1) pair per word. Any UNIX/Linux user would know about the beauty of pipes. We’ll later use pipes to send data from sample.txt to stdin.

#!/usr/bin/env python
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into words
    words = line.split()

    # Output tab-separated (word, 1) tuples on stdout
    for word in words:
        print('%s\t%s' % (word, 1))

Reducer : Create a file reducer.py and paste the code below into it. The reducer reads the tuples generated by the mapper and aggregates them.

#!/usr/bin/env python
import sys

# Create a dictionary to map words to counts
wordcount = {}

# Get input from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Parse the input from mapper.py and convert count (a string) to int;
    # skip lines that are malformed
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    # Add the count to the running total for this word
    wordcount[word] = wordcount.get(word, 0) + count

# Write the tuples to stdout
# Currently tuples are unsorted
for word in wordcount:
    print('%s\t%s' % (word, wordcount[word]))

Execution : Change (cd) to the directory where all the files are kept and make both Python files executable:

chmod +x mapper.py
chmod +x reducer.py

And now we will feed the output of the cat command to the mapper, and the mapper’s output to the reducer, using pipes (|). (Recall that the cat command is used to display the contents of a file.)

cat sample.txt | ./mapper.py | ./reducer.py

Output :

real	1
kutch	1
is	2
but	1
kolkata	1
home	2
my	2

Yay, so we get the word counts: real x 1, kutch x 1, is x 2, but x 1, kolkata x 1, home x 2 and my x 2! You can put your questions in the comments section below!
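If you would like to check the logic without the shell at all, the whole pipeline can be sketched in a single Python file. The function names map_words and reduce_counts are made up for this sketch, and the two sample lines are the ones from sample.txt.

```python
from collections import defaultdict


def map_words(lines):
    """Mapper step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(pairs):
    """Reducer step: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


sample = ["my home is kolkata", "but my real home is kutch"]
counts = reduce_counts(map_words(sample))

# Sorted here only to make the output easy to read
for word in sorted(counts):
    print("%s\t%d" % (word, counts[word]))
```

The generator plays the role of the pipe: the reducer consumes pairs as the mapper produces them.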

The idea of vernacular chatbots struck me yesterday when I decided to try niki.ai for payment of my electricity bill. Niki is India’s first chatbot for shopping. Shopping with a chatbot is a totally different experience, where the user talks to a bot about what she wants. Following is the complete chat I had with Niki. On the left-hand side are Niki’s messages and mine are on the right in bold face.

Hi i am Niki! :) I am here to make your life simple & convenient. You can chat with me like you chat with your friend.Can i show you how the booking experience on chat can be 300% faster than other apps ? (in 10 seconds!)

Pay my electricity bill

Hello ! Let me assist you in paying your electricity bill. You can type cancel at any point to exit this conversation Please give me: your electricity provider. For e.g BESCOM

PGVCL

Understood. Picking your electricity provider as Paschim Gujarat Vij Company. I request you to give me: your Customer Account Number. It is 1-11 digits long (Alphanumeric)

*********** (my customer number)

Sure. Picking your Customer Account Number as: *********** I request you to provide: the bill amount.

**** (amount)

Got it. Picking the amount as **** The summary of your order is: Operator: Paschim Gujarat Vij Company Customer Account Number: *********** Amount: **** rs Should I proceed to payment?

Yes

Fantastic. Please pay from the below options We have received the payment, going to process the order! You will get a confirmation in the next 5 mins. Thank you, your Electricity bill payment was successful You will receive a confirmation message shortly. It will get reflected within 3-4 business days. Your order id is ****************. For any problems or queries with your order, please email us at support@niki.ai

how are you doing otherwise?

As you can see, Niki is very good at enhancing your shopping experience, but when I asked her about her well-being she did not bother to reply. With vernacular chatbots, technology can reach those who do not know English. Going a step further, if bots can actually “speak” on the phone, having a real conversation with the user in the vernacular, people who cannot read and write can also shop online. There has been enough progress in speech recognition for English, but local research efforts in vernacular languages are required to achieve this. There is surely a huge market for this, as more than 92% of the population has a mobile phone while only 74% of the population is literate*. That is a clear gap of 18 percentage points, or more than 238 million people*!

* Literacy rate as per the 2011 census.

References:

  1. List of Countries By Mobile Phone In Use, Wikipedia
  2. Status of Literacy, Census of India 2011, census.gov.in

https://twitter.com/alexstamos/status/916697104197627904

This recent tweet by Facebook’s security chief Alex Stamos got me thinking about how much literature Silicon Valley executives read, and about the importance of humanities education for tech students. Stamos argues that his company should not become a “Ministry of Truth” (the propaganda machine in Orwell’s 1984) by running algorithms that decide what is truth and what is propaganda or fake news. An analogy from the Orwellian classic helped Stamos explain a complex, evil scenario in simple terms.

But is the role of cultural literacy limited to simplifying otherwise complex scenarios? No. If a culture is defined by the values of its practitioners, I think it cannot be separated from anything they create. We create Artificial Intelligence. An algorithm cannot be neutral when it is designed by biased humans. An undiscriminating AI model trained by a discriminating expert is improbable. We need to find ways to pass our values on to what we set out to create.

Fortunately, tech leaders do understand this. Emma Williams is a general manager at Microsoft whose field is Anglo-Saxon literature. Her role is to ensure that Cortana’s personality is calm and sober. Cortana is Microsoft’s premier chat AI. Identifying the role of human values in technology has never been more important than now. In Cortana’s case, for example, who would want to chat with a short-tempered AI?

[Image: Satya Nadella Quotes Ghalib at a Presentation in New Delhi]

“Hazaaron khwaishein aisi, ke har khwaish pe dum nikle. Bohat nikle mere armaan, fir bhi kam nikle.” People were surprised when Microsoft CEO Satya Nadella quoted the great Rekhta poet Ghalib during a presentation in New Delhi last year. ‘Yet another geek into poetry’, one may wonder. But it perfectly suits the CEO of a tech giant whose products aspire to be as good as humans in some ways. If poetry can ease the complexity of affairs by creatively deploying words, why shouldn’t robots use it?

Also, ill-informed reading of history undermines innovations and experiments in social and political thought, which I think is a great cause of concern for tech students. For example, civil disobedience as a revolutionary approach to fighting oppression may not occupy any place in the mind of an engineer. It is only hypocrisy to talk about the singularity and at the same time deny human culture and values any place in it. If the future is what we build today, we should build it good and not evil.

In India, universities are finally moving towards offering a more diverse learning experience, where an engineering student can study Shakespeare. Though progress is very slow, the outcomes of CBCS, the choice based credit system, should be positive.

References:

  1. Facebook Stumbles With Early Effort to Stamp Out Fake News, Sarah Frier, Bloomberg
  2. The Bing Search Experience: A Practice In Italian Renaissance Art & User Perspective, Amy Gesenhues, Search Engine Land
  3. Yes, Microsoft CEO Satya Nadella can quote Mirza Ghalib, The Indian Express
  4. Choice based credit system: the path ahead, M.S. Bhat, The Hindu

Flea markets are cool! Sunday’s visit to a flea market in Ahmedabad awed me with its unexpected variety of products. What excited me more was the experience of moving through the streets, checking out stalls one by one. Holding an unknown product in my hands, learning how it works by talking to the vendor, discussing it with my shopping partner, weighing it, negotiating prices, asking other shoppers how much a product cost them: these interactions built my shopping experience there. The experience was significantly different from today’s online stores (and even brick-and-mortar shops or supermarkets). No one knew me there, I did not sign up, and no one asked for my credit card number; it was a complete one-off relationship, and that made my experience pleasant. Privacy is a fundamental right and a subject of heated discussion across the world.

Missing Exploratory Shopping Experience: E-commerce fails to deliver an exploratory shopping experience, where a customer does not know what she is looking for. Online stores are too organized, with products arranged in categories and sub-categories. Deals occasionally provide an opportunity to explore, but that is limited to discounted items. Such an experience can be designed into e-commerce websites technically, though maybe not economically. Perhaps excessive regulatory control and the need to build trust with customers drove these businesses to focus on written attributes of products, which are verifiable in the form of a contract: for example, describing a product as “brand-new and unused” rather than showing it to be so. Perhaps technology is a bigger driver: the web pages over which this content is delivered have always been friendlier to text than to anything else.

Anonymous Virtual Flea Markets: A marketplace where anonymous sellers can come and sell almost anything is not a dream any more. There is a galaxy of marketplaces on the dark web. Cryptocurrencies are gaining ground. The central question is: “Why can’t we use Mixed Reality (MR) technologies to build such markets?” I think we can. A number of mixed reality headsets are already on the market, from Oculus Rift to Microsoft HoloLens. There are shopping cart applications such as this that are designed for virtual reality. What we need is an independent, community-backed software effort for anonymous, mixed reality flea markets.

This article is for those who are getting started with fastText. fastText is a text representation and classification library developed by the Facebook AI Research (FAIR) lab. Classification of text documents is an important natural language processing (NLP) task. fastText is written in C++ but can be accessed through a Python interface, and it is very fast. See the references for the two defining papers. In this article we’ll discuss the Python installation of fastText. pip and Cython are prerequisites; install them if they are not already installed:

Installing pip (on Debian/Ubuntu the package is named python-pip, or python3-pip for Python 3)
sudo apt-get install python-pip

Installing Cython
pip install cython

Finally install fastText which may also download other missing packages like numpy for us.

Installing fastText
pip install fasttext
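A quick way to confirm that the packages landed in the Python environment you are actually using is to ask Python whether it can find them. This is a minimal sketch for Python 3; the helper name is_installed is made up, and the module names checked (Cython, numpy, fasttext) are assumed to match the packages installed above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("Cython", "numpy", "fasttext"):
    print("%s: %s" % (name, "ok" if is_installed(name) else "missing"))
```

If something shows up as missing, pip may have installed it for a different Python interpreter than the one running the check.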

And we are done! See the next post, Text Classification With Python Using fastText.

References:

  1. fastText - Facebook Research
  2. Representation: Enriching Word Vectors with Subword Information, Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov, 2016
  3. Classification: Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2016

Can a computer read stories the way humans do? Of course computers can read from files much faster and more accurately, but the second part of the question is more important. When we read a story we understand it, and we read the feelings of the protagonist; the challenge is to make computers do the same. If computers could read this way, they could write stories the same way. Imagine a robot telling a bedtime story to a sleepless kid! University of Vermont researchers show how this can be achieved. I will try to simplify here what they did. They measured the rise and fall in the emotional trajectory of the story. How can this be done? The protagonist’s actions and words can be input to a sentiment analysis function, which identifies the sentiment in a given line. Sentiments are assigned a happiness weight on a scale of 100, which helps create a graph showing the rise and fall in the emotional state of the protagonist. This graph is self-explanatory; well, unless you are not into Harry Potter.
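To get a feel for the idea, here is a toy sketch that slides a window over a text and averages the happiness scores of the words it recognizes. Everything in it is an assumption made for illustration: the tiny word list, the 0-100 scores, the window size and the one-line "story" are made up, and the researchers' actual method may differ.

```python
# Hypothetical happiness scores on a 0-100 scale (made up for illustration)
HAPPINESS = {
    "happy": 90, "love": 95, "win": 85, "friend": 80,
    "sad": 15, "lost": 20, "fear": 10, "dark": 25,
}
NEUTRAL = 50  # score used when a window contains no known word


def window_score(words):
    """Average happiness of the known words in one window."""
    scores = [HAPPINESS[w] for w in words if w in HAPPINESS]
    return sum(scores) / float(len(scores)) if scores else NEUTRAL


def emotional_arc(text, window=4, step=2):
    """Return one happiness score per sliding window over the text."""
    words = text.lower().split()
    last_start = max(len(words) - window, 0)
    return [window_score(words[i:i + window])
            for i in range(0, last_start + 1, step)]


story = "happy friend love win then dark fear lost sad then win love happy friend"
arc = emotional_arc(story)
print(arc)
```

Plotting the list against the window index gives the kind of rise-and-fall curve described above; for this toy story it dips in the middle and recovers at the end.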

The study identified six arcs across all stories. It was in fact inspired by Kurt Vonnegut, and several people from literature have also talked about a few basic genres common to all stories. There can be huge computing applications here, but I will discuss them in a separate post.

References:

  1. Kurt Vonnegut
  2. The Six Main Arcs in Storytelling, as Identified by an A.I., Adrienne LaFrance, The Atlantic
  3. The emotional arcs of stories are dominated by six basic shapes, Andrew Reagan et al.

Satya Nadella talks about culture, empathy, philosophy and trust, apart from technology, in his book Hit Refresh: The Quest to Rediscover Microsoft’s Soul and Imagine a Better Future for Everyone. Here are the six most powerful thoughts from the book that inspired me and will surely help you stay grounded. Spanning nine chapters, with a foreword by none other than Bill Gates, the book is worth reading for everyone in, and even outside, the technology industry.

Advising:

Any advice that advocates passivity in the face of bias is wrong.

Success:

Success can cause people to unlearn the habits that made them successful in the first place.

Leadership:

Leadership means making choices and rallying the team around these choices.

Valuing Differences:

We need not just to value differences but to actively seek them out, invite them in.

Introspection:

When I learn about a shortcoming it is a thrilling moment. The person who points out has given me the gift of insight.

Agreement:

For anything great to happen - great software, innovative hardware, or even a sustainable institution - there needs to be a one great mind or a set of agreeing minds.

Share quotes that inspired you from your favourite books in the comments section.

More than 4.8 lakh vehicles met with accidents in India last year [1], most of them resulting in insurance claims. Agricultural and other miscellaneous insurance products also see a large number of claims. Going through the claims settlement process is generally a clumsy experience for consumers. A particular problem is the reach and timeliness of claim inspectors, or surveyors as they are known in India. Verifying the authenticity of claims is also a concern. Surveyors are few. Additionally, surveyors travelling from place to place can be expensive for companies, adding to their operational costs. Weather and other conditions may also restrict human reach. Drones can reach the spot faster as they take an aerial route. Thus the time between accident and survey can be reduced, which can help prevent further collisions involving vehicles left on the road.

A drone can be equipped with AI (Artificial Intelligence), or it can be controlled by an operator from a distant or nearby location. Field stations can be built near accident-prone areas on highways to further accelerate the process. Above all, a drone can reach places that are difficult for a human to reach. Natural calamities causing large-scale damage to property also present a specific case for drone usage, as a large number of claims arise at once.

ICICI Lombard: A Case Study

The company is already using drones for agricultural insurance claims. An article [2] on the company’s website projects drones to be the insurance industry’s next game changers. Drones have helped the company reduce operational costs and keep a tab on false claims, among other benefits.

Conclusion

There is great potential for drones across the verticals of the insurance industry. Advancements in robotics and AI can bring exciting scenarios for insurance companies and consumers.

References

  1. Road Accidents in India
  2. ICICI Lombard Claims Drones to be Game Changers

Writing Hello World is a custom I dare not break as I start off with this brand new blog. Here are Hello World programs in ten different programming languages.

1. C: Not only can C be termed the mother of most high-level languages, it also remains relevant to date. It is arguably the most popular language of all time.

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

2. C++: Bjarne Stroustrup’s language gave programmers a new world view with Object Oriented Programming.

#include <iostream>

int main() {
    std::cout << "Hello World!";
    return 0;
}

3. UNIX Shell: The shell is like a ninja’s sword! You need to see this.

#!/bin/sh
echo "Hello World!"

4. Java: Java is perhaps the only ubiquitous multi-platform language we have seen.

public class HelloWorld {
    public static void main(String[] args) {
        System.out.print("Hello World!");
    }
}

5. C#: C# is a great C-family programming language from Microsoft.

public class HelloWorld {
    static void Main() {
        System.Console.Write("Hello World!");
    }
}

6. JavaScript: JS shaped platforms which employ half of the Indian programmers.

alert("Hello World!");

7. PHP: PHP makes great front-ends for web applications.

<?php
echo "Hello World!";

8. Python: My favourite! Python is simple, readable and powerful.

print('Hello World!')

9. PL/SQL: When you cannot think in SQL, you use PL/SQL!

begin
    dbms_output.put_line('Hello World!');
end;

10. Arduino C: Arduino propels IoT development with affordable open source prototyping boards for everyone. Arduino-C is indeed C!

// setup() function runs only once
void setup() {
    // Use on-board LED as output
    pinMode(LED_BUILTIN, OUTPUT);
}

// loop() function runs forever
void loop() {
    digitalWrite(LED_BUILTIN, HIGH); // turn the LED on
    delay(1000); // wait a sec (or 1000 ms)
    digitalWrite(LED_BUILTIN, LOW); // turn the LED off
    delay(1000); // wait a sec (or 1000 ms)
}

So this is how the journey at iDevji starts. I hope it will create value, maybe a little, for everyone, especially my students!