Mahout - Naive Bayes - Hadoop

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
each of which contains part-0000 files. I would like to read the contents of those files for a better understanding, but the cat command doesn't seem to work; it prints garbage.
Any help is appreciated.
Thanks

The 'part-00000' files are created by Hadoop and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can use the utility class SequenceFileDumper in Mahout, which will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.

You can read part-0000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar containing that class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)

In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file

Related

Working with zips in pyspark

I have n zips in a directory, and I want to extract each one of them, then extract some data out of a file or two lying inside the zips and add it to a graph DB. I have made a sequential Python script for this whole thing, but I am stuck converting it for Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at a more initial step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# extract_and_populate_graphdb() returns 1 after doing all the work.
# This was done so that an action (the reduce) can be applied to start the Spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphdb).reduce(lambda a, b: a + b)
What I can't figure out, in order to test this, is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but on running take(1) on it, it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
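As a rough illustration of the effect (the paths below are placeholders, not from the question): once the shell is launched with those flags, ordinary reads and writes pick up the compression settings.
# Gzipped input is decompressed transparently by the underlying Hadoop input format
lines = sc.textFile("hdfs:///data/input/*.gz")
# With the output-compression settings above, the part files are written with the Gzip codec
lines.saveAsTextFile("hdfs:///data/output-gz")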
Unfortunately, there is no innate support of Zip files, so you'll need to do a bit more legwork to get there.
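One common workaround, sketched below under the assumption that your Spark version provides sc.binaryFiles (the HDFS path and the names used here are placeholders), is to read each zip as a single binary blob and unpack it on the executors with Python's standard zipfile module:
import io
import zipfile

def extract_members(pair):
    # pair is (path of the zip on HDFS, raw bytes of the whole zip file)
    path, content = pair
    records = []
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            # keep (zip path, member name, member bytes) for downstream parsing
            records.append((path, name, zf.read(name)))
    return records

# RDD of (filename, bytes), one element per zip in the directory
zips = sc.binaryFiles("hdfs:///path/to/zips/*.zip")
# RDD of (zip path, member name, member bytes)
members = zips.flatMap(extract_members)
print(members.map(lambda t: (t[0], t[1])).take(3))
From there, each member's bytes can be parsed and handed to whatever loads your graph DB, much like the extract_and_populate_graphdb step in the question. Note that each zip is read into memory as a whole, so this suits many small or medium zips better than a few huge ones.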

Hadoop: How to generate custom reduce output file name?

I am using MultipleOutputs.
I would like to remove the suffix "-00001" from the reducer's output filenames, which look like "xxxx-[r/m]-00001".
Any ideas?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on HDFS.
I think you can do it in the job driver: when your job completes, rename the files there. You could also do it with terminal commands, for example something like hadoop fs -mv.

Hadoop preinstalled example Jars

I just successfully set up Hadoop on my local machine. I am following one of the examples in a popular book I just bought. I am trying to get a list of all the Hadoop examples that come with the installation. I typed the following command to do so:
bin/hadoop jar hadoop-*-examples.jar
Once I enter this, I am supposed to get a list of Hadoop examples, right? However, all I see is this error message:
Not a valid JAR: /home/user/hadoop/hadoop-*-examples.jar
How do I solve this problem? Is it just a simple permission issue?
This is most probably a configuration issue or the use of an invalid file path.
Most probably the name hadoop-*-examples.jar is not correct, because in my version of Hadoop (1.0.0) the file name is hadoop-examples-1.0.0.jar.
So I ran the following command to list all the examples, and it works like a charm:
bin/hadoop jar hadoop-examples-*.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
Also, if I use the same file name pattern as you, I get an error:
bin/hadoop jar hadoop-*examples.jar
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-*examples.jar
HTH
You must specify the name of the program (class) in the jar file which you want to run:
hadoop jar pathtojarfile classname arg1 arg2 ..
Example:
hadoop jar example.jar wordcount inputPath outputPath
@Anup: the full or relative path to the jar file is required.
In your case it might be /home/user/hadoop/share/hadoop-*-examples.jar
The complete command from Hadoop's directory might be
/home/user/hadoop/bin/hadoop jar /home/user/hadoop/share/hadoop-*-examples.jar
(I used absolute full paths there, but you can use relative paths).
You will find the jar in $HADOOP_HOME/share/hadoop/mapreduce/hadoop-*-examples*.jar

Hadoop read files with following name patterns

This may sound very basic but I have a folder in HDFS with 3 kinds of files.
eg:
access-02171990
s3.Log
catalina.out
I want my map/reduce job to read only the files which begin with access-. How do I do that programmatically, or by specifying the input directory path?
Please help.
You can set the input path as a glob:
FileInputFormat.addInputPath(jobConf, new Path("/your/path/access*"))

How to successfully run kmeans clustering using Mahout (esp. get human-readable output)

I tried to follow many online tutorials to run the k-means example included in Mahout,
but have not yet succeeded in getting meaningful output. The main problem I am facing is
the conversion from text file to SequenceFile and back.
When I followed the steps of the Mahout wiki for "Clustering of synthetic control data"
(https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html), I could run the clustering process (using $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job), and that produced some readable console output. But I wish to get output files from the clustering process (as the output size is large).
The output files generated by Mahout clustering are all sequence files, and I can't convert them to readable files.
When I tried to do "clusterdump" ($MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10...) I got errors.
First it complains that the "seqFileDir" option is unexpected, and I guess either there is no "seqFileDir" option for clusterdump or I am missing something.
Trying to use Mahout the way "Mahout in Action" does seems tricky. I am not sure which classes are required ("import ??") to compile that code.
Can you please suggest the steps to successfully RUN k-means on Mahout? Especially, how do I get readable output from the sequence files?
Regarding the second question: you can obtain the source code for the book from its repository. The code in the master branch is for Mahout 0.5, while the code in the mahout-0.6 and mahout-0.7 branches is for the corresponding Mahout versions.
The source code is also posted on the book's site, so you can download it there (but that version is only for Mahout 0.5).
P.S. If you're reading the book right now, then I recommend using Mahout 0.5 or 0.6, as all the code was checked against version 0.5, while for other versions it will differ. This is especially true for the clustering code in Mahout 0.7.
As for seqFileDir in clusterdump, you need to use --input not --seqFileDir.
I'm using Mahout 0.7. The call to clusterdump that I use to (for example) get a simple dump is:
mahout clusterdump --input output/clusters-9-final --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
Be sure that the path to the directory output/clusters-9-final above is correct for your system. Depending on the clustering algorithm, this directory may be different. Look in the output directory and make sure you use the directory with the word "final" in it.
To dump the data as CSV or GRAPH_ML, you would add the -of CSV argument to the above call. For example:
mahout clusterdump --input output/clusters-9-final -of CSV --pointsDir output/clusteredPoints --output <absolute path of dir where you want to output>/clusteranalyze.txt
Hope that helps.
