MapReduce is a great approach to problem solving. It is also very popular, yet MapReduce examples other than word count are scarce on the web. This article describes MapReduce problem solving that goes beyond word count.

# learn-datascience

## Apache Log Visualization with Matplotlib : Learn Data Science

This post discusses Apache log visualization with the Matplotlib library. First, download the data file used in this example. We will require numpy and matplotlib: `import numpy as np` and `import matplotlib.pyplot as plt`. `numpy.loadtxt()` can load a text file directly into an array. requests-idevji.txt contains only the hour at which each request was made; this is achieved […]

## Building a Movie Recommendation Service with Apache Spark

In this tutorial I’ll show you how to build a movie recommendation service with Apache Spark. Two users are alike if they rated a product similarly. For example, if Alice rated a book 3/5 and Bob rated the same book 3.3/5, they are very much alike. Now if Bob buys another book and rates it 4/5 […]
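The idea that two users are alike when their ratings are close can be sketched in plain Python. This is only an illustration, not the Spark-based method the tutorial builds: the function `rating_similarity`, the distance-based score, and the user `carol` are assumptions introduced here.

```python
from math import sqrt

def rating_similarity(ratings_a, ratings_b):
    """Score two users by how closely they rated the items they share.

    Returns 0.0 when they share no items, otherwise a value in (0, 1]
    where 1.0 means identical ratings. The formula is an assumption
    chosen for illustration.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    # Euclidean distance over the shared items only
    distance = sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)

# Ratings from the example above, plus a hypothetical third user
alice = {"book_1": 3.0}
bob = {"book_1": 3.3, "book_2": 4.0}
carol = {"book_1": 1.0}

print(rating_similarity(alice, bob))    # close ratings, higher score
print(rating_similarity(alice, carol))  # distant ratings, lower score
```

Because Alice and Bob score as similar, Bob's 4/5 rating for the second book is a reasonable hint that Alice may like it too.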

## GraphFrames PySpark Example : Learn Data Science

In this post, a GraphFrames PySpark example is discussed using the shortest path problem. GraphFrames is a Spark package that allows DataFrame-based graphs in Spark. Spark version 1.6.2 is considered for all examples. Including the package with the PySpark shell: `pyspark --packages graphframes:graphframes:0.1.0-spark1.6` Code: `from pyspark import SparkContext`, `from pyspark.sql import SQLContext`, `sc = SparkContext` […]

## Logistic Regression with Spark : Learn Data Science

Logistic regression with Spark is achieved using MLlib. Logistic regression returns binary class labels, that is, “0” or “1”. In this example, we consider a data set that consists of only one variable, “study hours”, and the class label is whether the student passed (1) or did not pass (0). `from pyspark import SparkContext` […]
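To make the study-hours setup concrete without a Spark installation, here is a minimal sketch of logistic regression trained by batch gradient descent in plain Python. It is a stand-in for illustration and not the MLlib code from the post; the data values, learning rate, and function names are all made up here.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train(hours, passed, learning_rate=0.1, epochs=5000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(hours)
    for _ in range(epochs):
        # Prediction error for every sample under the current model
        errors = [sigmoid(w * x + b) - y for x, y in zip(hours, passed)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, hours)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def predict(w, b, x):
    """Return class label 1 (passed) or 0 (not passed)."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical data: study hours and whether the student passed
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train(hours, passed)
print(predict(w, b, 1.0), predict(w, b, 8.0))
```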

## k-Means Clustering Spark Tutorial : Learn Data Science

k-Means clustering with Spark is easy to understand. MLlib comes bundled with a k-Means implementation (KMeans), which can be imported from the pyspark.mllib.clustering package. Here is a very simple example of clustering data with height and weight attributes. Arguments to KMeans.train: k is the number of desired clusters; maxIterations is the maximum number of iterations to run. […]

## Apriori Algorithm for Generating Frequent Itemsets

The Apriori algorithm is used to find frequent itemsets. Identifying associations between items in a dataset of transactions can be useful in various data mining tasks. For example, a supermarket can arrange its shelves better if it knows which items are frequently purchased together. The challenge is that given a dataset D having T transactions each […]
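A compact sketch of level-wise frequent itemset generation in the Apriori style follows. It assumes support is measured as an absolute transaction count; the function name `frequent_itemsets` and the supermarket baskets are made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every individual item
    candidates = {frozenset([item]) for t in transactions for item in t}
    result = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return result

# Hypothetical supermarket baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
for itemset, count in frequent_itemsets(baskets, min_support=2).items():
    print(sorted(itemset), count)
```

The prune step is what keeps the search manageable: an itemset cannot be frequent if any of its subsets is infrequent, so such candidates are never counted.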

## Data Mining : Intuitive Partitioning of Data or 3-4-5 Rule

Intuitive partitioning, or natural partitioning, is used in data discretization. Data discretization is the process of converting continuous values of an attribute into categorical data, partitions, or intervals. Discretization helps reduce data size by reducing the number of possible values. Instead of storing every observation, we can store only the partition range in which each observation […]
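As a small illustration of discretization in general, the sketch below replaces each value with the interval it falls into. Note that it uses plain equal-width binning, which is simpler than the 3-4-5 rule and is not a substitute for it; the function name and the sample values are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Map each value to the (start, end) interval that contains it."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single degenerate interval suffices
        return [(lo, hi)] * len(values)
    labels = []
    for v in values:
        # The maximum value is placed in the last bin instead of a new one
        idx = min(int((v - lo) / width), n_bins - 1)
        start = lo + idx * width
        labels.append((start, start + width))
    return labels

# Hypothetical observations
ages = [0, 10, 20, 30, 40]
print(equal_width_bins(ages, 4))
```

Five distinct observations collapse into four interval labels here; on real data with many distinct values, the reduction is much larger.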

## k-means Clustering Algorithm with Python : Learn Data Science

The k-means clustering algorithm is used to group samples (items) into k clusters; k is specified by the user. The method assigns each sample to its nearest cluster centroid and recomputes each centroid as the mean of the samples assigned to it, hence the name k-means clustering. Euclidean distance is used as the distance measure. See references for more information on the algorithm. This article describes k-means […]
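A minimal pure-Python sketch of k-means with Euclidean distance is given below. Choosing the first k samples as initial centroids is an assumption made to keep the example deterministic, and the sample points are made up; practical implementations usually use random or smarter initialization.

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: the first k samples serve as the initial centroids
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each sample goes to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged, no sample changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical two-dimensional samples
samples = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, assignments = k_means(samples, k=2)
print(centroids)
print(assignments)
```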

## Commonly Used HDFS Commands : Learn Data Science

Hadoop Distributed File System, or HDFS, is the underlying storage for all Hadoop applications. HDFS can be manipulated using APIs such as the Java API or the REST API, but the HDFS shell is the most commonly used option. Below is a list of ten commonly used HDFS commands. 1. Invoking the file system: HDFS Shell supports […]

## Python MapReduce with Hadoop Streaming in Hortonworks Sandbox

The Hortonworks Sandbox for the Hortonworks Data Platform (HDP) is a quick and easy personal desktop environment for learning, developing, testing, and trying out new features. It saves the user from installing and configuring Hadoop and other tools. This article explains how to run a Python MapReduce word count example using Hadoop Streaming.

## MapReduce Streaming Example : Running Your First Job on Hadoop

A step-by-step tutorial on running Python MapReduce code using Hadoop Streaming.

## Tutorial: Text Classification With Python Using fastText

We start by training the classifier with training data. It contains questions from cooking.stackexchange.com and their associated tags on the site. Let’s build a classifier that automatically recognizes the topic of a question and assigns a label to it.

## Map Reduce Word Count With Python : Learn Data Science

This tutorial jumps straight into hands-on coding to help anyone get up and running with MapReduce. No Hadoop installation is required.
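For a sense of what word count looks like without Hadoop, here is a minimal sketch that simulates the map, shuffle/sort, and reduce phases in one Python process. The function names and sample lines are assumptions and are not taken from the tutorial.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: bring identical keys next to each other
    mapped.sort(key=itemgetter(0))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )

# Hypothetical input lines
lines = ["the quick fox", "the lazy dog the end"]
print(word_count(lines))
```

On a real cluster the framework performs the sort and grouping between the two phases; only the mapper and reducer logic is written by the user.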