This is my first step towards learning data. I will be learning and running experiments on data along the way. The intention is to learn:
- Hadoop with Java
- Hadoop with Ruby
- Big Data Analysis
- Big Data visualization
The first step was to get a dataset. I got my first dataset from SNAP (the Stanford Network Analysis Project), and I ran my first node count using the Hadoop streaming API.
You may find the source code for the mapper and reducer at my GitHub project.
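To give a flavour of what such a pair of scripts looks like, here is a minimal sketch (not the actual code from my repository) of a streaming mapper and reducer in Ruby. It assumes the SNAP file is a whitespace-separated edge list, one "from_node to_node" pair per line, with comment lines starting with '#':

```ruby
# Hypothetical sketch of a Hadoop-streaming mapper/reducer pair in Ruby
# that counts how many edges each node appears in. The edge-list format
# is an assumption about the SNAP dataset.

# Mapper: emit "node<TAB>1" for each endpoint of an edge.
def map_line(line)
  return [] if line.start_with?('#')   # skip SNAP comment lines
  from, to = line.split
  return [] if from.nil? || to.nil?
  [[from, 1], [to, 1]]
end

# Reducer: sum the counts per node and return a node => count hash.
def reduce_lines(lines)
  counts = Hash.new(0)
  lines.each do |line|
    node, count = line.chomp.split("\t")
    counts[node] += count.to_i
  end
  counts
end

if __FILE__ == $0
  # Run as either the mapper or the reducer depending on ARGV[0];
  # Hadoop streaming feeds records on STDIN and reads results from STDOUT.
  case ARGV.shift
  when 'map'
    STDIN.each_line { |l| map_line(l.chomp).each { |k, v| puts "#{k}\t#{v}" } }
  when 'reduce'
    # Emit in the (count, node) pattern described below.
    reduce_lines(STDIN.each_line).each { |node, count| puts "#{count}\t#{node}" }
  end
end
```

The real reducer can also exploit the fact that Hadoop streaming delivers mapper output sorted by key, so equal keys arrive consecutively and no hash is strictly needed.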
The command that I used to run the mapper and reducer was:
$HADOOP_STREAMING -mapper 'ruby <mapper_file>' -reducer 'ruby <reducer_file>' -file <mapper_file> -file <reducer_file> -input '<path_where_snap_data_is_extracted>' -output <output_dir_must_not_exist>
Once the job finishes, it creates a file at <output_dir>/part-00000 if you have just one reducer (in my case it was one). Otherwise you will have as many part-xxxxx files as you have reducers.
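If you do end up with several part-xxxxx files, a tiny helper (not part of the original workflow, just a sketch) can stitch them back into a single plain-text file for analysis:

```ruby
# Hypothetical helper: concatenate the part-00000, part-00001, ...
# files a multi-reducer job leaves in <output_dir> into one file.
# Assumes plain-text output; sorting the glob keeps the parts in order.
def merge_parts(output_dir, merged_file)
  parts = Dir.glob(File.join(output_dir, 'part-*')).sort
  File.open(merged_file, 'w') do |out|
    parts.each { |p| out.write(File.read(p)) }
  end
  parts.length
end
```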
This produced 739455 data points of the pattern (count, node).
I finally used R and ggplot2 to plot them, but I still can't see all the points on one chart, so I will be working on getting most of them, if not all, onto the graph.
I formatted the output file so that I could create a chart from it: as the first line of my part-00000, I added the following header
count <put a tab> node
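I did this by hand, but the same step can be sketched as a one-off Ruby snippet (file names here are placeholders, not the real paths):

```ruby
# Hypothetical one-off helper: prepend the "count<TAB>node" header line
# that read.table(header=T) expects to the reducer output file.
def add_header(path, header = "count\tnode")
  body = File.read(path)
  File.write(path, header + "\n" + body)
end
```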
then in R
library(ggplot2)
p <- read.table("<path_to_part-00000_file>", header=T, sep="\t")
d <- ggplot(p, aes(count, node))
d + geom_point(color='red')
and I got a nice little chart showing that
- most of the nodes are connected to roughly the same number of other nodes
- there is a rare node with the highest connectivity, linked to 435 other nodes
It would be interesting to learn what else can be analyzed with such data, so if you have ideas, let me know and I will try them out.