I want to run k-means clustering on Hadoop in pseudo-distributed mode. I have 5 million vectors in a .mat file, with 38 numeric features per vector, like this:
0 0 1 0 0 0 0 0 0 0 0 0 ...
I've run the examples that I've found, like Reuters (https://mahout.apache.org/users/clustering/k-means-clustering.html) or synthetic data. I know I have to convert these vectors to a SequenceFile, but I don't know if I have to do something more before that.
I'm using Mahout 0.7 and Hadoop 1.2.1.
Yes, you need a small preprocessing step.
Since the generated MAT file is a binary file, the first step is to convert it into a text file (.txt), with each line being one vector of 38 feature values.
Then, using SeqDirectory (or writing your own SequenceFile writer to get it done) would be the next step, and all the other steps follow as in the Reuters example.
An example of writing your own SequenceFile writer can be found in "How to convert .txt file to Hadoop's sequence file format".
I did the same for Mahout LDA, where I wrote my own SequenceFile writer and gave its output as input to the next step in the LDA process, namely seq2sparse.
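For concreteness, here is a minimal sketch (in the spirit of the linked answer, not your exact code) that turns a text file with one 38-value vector per line into a SequenceFile of Text keys and Mahout VectorWritable values, which is what the kmeans driver consumes; the file names and key scheme are placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class TextToVectorSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Output location and input file name are assumptions for illustration.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("vectors/part-00000"), Text.class, VectorWritable.class);
        BufferedReader reader = new BufferedReader(new FileReader("vectors.txt"));
        try {
            String line;
            long id = 0;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                double[] values = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    values[i] = Double.parseDouble(tokens[i]);
                }
                // The key is just a synthetic id; kmeans only looks at the vector values.
                writer.append(new Text("v" + id++), new VectorWritable(new DenseVector(values)));
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}
```

After that, you can point the kmeans job at the output directory and follow the remaining steps of the Reuters walkthrough.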
Never use pseudo-distributed mode
Mahout only pays off when your data is far too large to be analyzed on a single computer, i.e. when you really need at least a dozen machines to hold and process it.
The reason is the architecture. Mahout is built on top of map-reduce and relies on writing plenty of interim data to disk so it can recover from crashes. In pseudo-distributed mode it cannot recover well from such crashes anyway.
Pseudo-distributed mode is okay if you want to learn how to install and configure Mahout without having access to a real cluster. It is not a reasonable way to analyze real data.
Instead, use the functionality built into Matlab, or use a clustering tool designed for single nodes, such as ELKI. Such tools will usually outperform Mahout by an order of magnitude simply by not writing everything to disk several times. In my experiments, they outperformed a 10-core Mahout cluster by a factor of 10 on a single core, because I/O cost completely dominates the runtime.
Benchmark yourself
If you don't trust me on this, benchmark it yourself. Load the Reuters data into Matlab and cluster it there. I'm pretty sure Matlab will make Mahout look like an old fad.
Related
I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel". These generally involve running a single script on a single computer for several days; a classic example would be processing several thousand PDF files to extract some key text and place it into a CSV file for later insertion into a database.
We are now doing enough of these types of tasks that I started to investigate developing a simple job-queue system using RabbitMQ with a few spare servers (with an eye to using Amazon SQS/S3/EC2 for projects that need larger scaling).
In searching for examples of others doing this I keep coming across the classic Hadoop New York Times example:
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)
Which sounds perfect. So I researched Hadoop and Map/Reduce.
But what I can't work out is how they did it? Or why they did it?
Surely converting TIFFs into PDFs is not a Map/Reduce problem? Wouldn't a simple job queue have been better?
The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop Tutorial, seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.
What I don't understand is how these "embarrassingly parallel" tasks are put into the Map/Reduce pattern.
TL;DR
This is very much a conceptual question; basically I want to know how I would fit a task of "processing several thousand PDF files to extract some key text and place it into a CSV file" into a Map/Reduce pattern.
If you know of any examples, that would be perfect - I'm not asking you to write it for me.
(Note: We have code to process the PDFs; I'm not asking for that - it's just an example, and it could be any task. I'm asking about putting processes like that into the Hadoop Map/Reduce pattern when there are no clear "Map" or "Reduce" elements to a task.)
Cheers!
Your thinking is right.
The examples you mentioned used only part of the solution that Hadoop offers. They definitely used Hadoop's parallel computing ability plus the distributed file system. You do not always need a reduce step: you may not have any data interdependency between the parallel processes, in which case you can eliminate the reduce step.
I think your problem also fits into the Hadoop solution domain.
You have huge data: a huge number of PDF files.
And a long-running job.
You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes in your cluster. If you do not see a need to aggregate the data sets produced by the individual mappers, you do not need a reduce step; otherwise you need to design a reduce step as well.
The point here is that if you do not need a reduce step, you are just leveraging Hadoop's parallel computing ability, and you are equipped to run your jobs on relatively inexpensive hardware (see the sketch below).
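To make that concrete, here is a minimal map-only sketch of the PDF-to-CSV task under some assumptions: the input is a text file listing one PDF path per line, and extractCsvLine() is a placeholder for your existing extraction code. With zero reducers, each mapper's output is written straight to HDFS:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfExtractJob {

    public static class ExtractMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String pdfPath = value.toString().trim();
            // Plug in your existing extraction code here; this stub just echoes the path.
            String csvLine = extractCsvLine(pdfPath);
            context.write(new Text(csvLine), NullWritable.get());
        }

        private String extractCsvLine(String pdfPath) {
            return pdfPath + ",TODO_extracted_text";
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "pdf-extract");
        job.setJarByClass(PdfExtractJob.class);
        job.setMapperClass(ExtractMapper.class);
        job.setNumReduceTasks(0);   // map-only: skip the reduce step entirely
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```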
I need to add one more thing: error handling and retry. In a distributed environment node failure is pretty common. I regularly run an EMR cluster consisting of several hundred nodes at a time for 3-8 days, and finding that 3 or 4 of them have failed during that period is very likely.
The Hadoop JobTracker will nicely re-submit failed tasks (up to a certain number of times) on a different node.
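If you want to tune that behaviour, the retry limits are plain job configuration properties. A small hedged sketch using the classic MRv1 (Hadoop 1.x) property names, whose default is 4 attempts per task:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapred.map.max.attempts", 4);     // re-run a failed map task up to 4 times
        conf.setInt("mapred.reduce.max.attempts", 4);  // same limit for reduce tasks
        Job job = new Job(conf, "retry-config-example");
        // ... set mapper/reducer, input and output paths as usual ...
    }
}
```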
I am new to Map-Reduce and I am using Hadoop Pipes. I have an input file which contains a number of records, one per line. I have written a simple program to print those lines that have three words in common. In the map function I emit each word as the key and the record as the value, and I compare the records in the reduce function. I then compared Hadoop's performance with a simple C++ program in which I read the records from the file, split them into words, and load the data into a map, with the word as key and the record as value. After loading all the data, I compare it. I found that for the same task Hadoop Map-Reduce takes a long time compared with the plain C++ program: about 37 minutes on Hadoop versus only about 5 minutes for the simple C++ program. Please, can somebody help me figure out whether I am doing something wrong somewhere? Our application needs performance.
There are several points which should be made here:
Hadoop is not high-performance - it is scalable. A local program doing the same work on a small data set will always outperform Hadoop, so its usage only makes sense when you want to run on a cluster of machines and enjoy Hadoop's parallel processing.
Hadoop Pipes (like Streaming) is also not the best thing performance-wise, since data crosses a process boundary for every record. In many cases a native Hadoop program written in Java will have better performance.
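As an illustration of what the native Java version of your job might look like (a sketch under my reading of your description, not your actual code): the mapper emits each word as the key and the whole record as the value, and the reducer compares the records that share a word and emits pairs with at least three words in common:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommonWordsJob {

    public static class WordRecordMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text record, Context context)
                throws IOException, InterruptedException {
            // Emit (word, record) so that records sharing a word meet in the same reducer.
            for (String word : record.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), record);
                }
            }
        }
    }

    public static class RecordCompareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> buffered = new ArrayList<String>();
            for (Text r : records) {
                buffered.add(r.toString());
            }
            // Pairwise comparison of records that share this word; emit pairs with
            // at least three words in common (duplicates across keys are possible,
            // just as in the approach described in the question).
            for (int i = 0; i < buffered.size(); i++) {
                Set<String> a = new HashSet<String>(Arrays.asList(buffered.get(i).split("\\s+")));
                for (int j = i + 1; j < buffered.size(); j++) {
                    Set<String> common = new HashSet<String>(Arrays.asList(buffered.get(j).split("\\s+")));
                    common.retainAll(a);
                    if (common.size() >= 3) {
                        context.write(new Text(buffered.get(i)), new Text(buffered.get(j)));
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "common-words");
        job.setJarByClass(CommonWordsJob.class);
        job.setMapperClass(WordRecordMapper.class);
        job.setReducerClass(RecordCompareReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```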
I'm a complete beginner with Hadoop, but I had this interesting observation.
Using the example in the Hadoop documentation,
I ran the same example in Standalone Operation and Pseudo-Distributed Operation: the standalone run took less than 1 minute, but the pseudo-distributed run took more than 3 minutes. That is a big difference. I can understand that there is extra network and scheduling overhead in distributed mode, but the difference just seems too large. This may not be a fair comparison, because the example is very simple.
My question is, how much difference did you experience between the standalone and distributed mode for a real-world job?
These are reasonably different scenarios. In standalone mode, it never starts up a proper one-node Hadoop cluster: everything happens locally, inline, in a single JVM, and data potentially never even has to be written out to disk. Pseudo-distributed operation is the smallest "real" Hadoop installation, consisting of one local node. You have to read/write data to a local HDFS instance, spawn additional JVMs, etc. All of that adds a lot of overhead, and that overhead may indeed be a few minutes. This seems entirely sensible to me.
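If you want to double-check which mode your installation is actually running in, here is a tiny sketch that prints the two relevant Hadoop 1.x settings (in standalone mode fs.default.name defaults to file:/// and mapred.job.tracker to local, while pseudo-distributed mode points both at localhost daemons):

```java
import org.apache.hadoop.conf.Configuration;

public class PrintMode {
    public static void main(String[] args) {
        // new Configuration() loads core-site.xml / mapred-site.xml from the classpath,
        // so run this with the same configuration directory as your jobs.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name    = " + conf.get("fs.default.name", "file:///"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
    }
}
```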
The Hadoop framework is meant for processing big data.
So the size of the data matters a lot, because a smaller file gets processed much more quickly in a traditional file system than in Hadoop: the Hadoop MapReduce framework has internal work to do (splitting the data file into chunks, sending them to the data nodes, and accessing them again from the data nodes while processing). So for smaller files, the Hadoop framework is not suitable.
Coming to standalone and pseudo-distributed mode, one aspect you should consider is the size of the file, and the second is the actual difference between standalone and pseudo-distributed mode.
In standalone mode there is no concept of HDFS; data is not copied to the Hadoop distributed file system (so time is obviously saved). In pseudo-distributed mode, HDFS is involved, and the data to be processed needs to be copied into it.
For small data files it is better to use traditional file processing; as the files grow huge, the Hadoop framework gives better processing time!
Hope this helps!
I would like some clarification regarding using Hadoop for a large file of around 2 million lines. I have file data consisting of 2 million lines, and I want to split each line into a single file, copy them into the Hadoop File System, and calculate term frequency using Mahout. Mahout uses map-reduce computation in a distributed fashion. If I take each line of the 2-million-line file as a document for the term-frequency calculation, I will end up with one directory containing 2 million documents, each consisting of a single line. Will this create n maps for n files, here 2 million maps for the process? That takes a lot of time for computation. Is there an alternative way of representing the documents for faster computation?
2 million files is a lot for Hadoop. More than that, running 2 million tasks will have roughly 2M seconds of overhead (about 23 days of serial task-launch time), which means a few days of work for a small cluster.
I think the problem is of an algorithmic nature - how to map your computation onto the map-reduce paradigm in a way that gives you a modest number of mappers. Please drop a few lines about the task you need, and I may be able to suggest an algorithm.
Mahout has an implementation for calculating TF and IDF for text.
Check the Mahout library for it.
Splitting each line into its own file is not a good idea in the Hadoop map-reduce framework; see the sketch below.
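As a sketch of the alternative: instead of writing 2 million tiny files, write the 2-million-line file into a single SequenceFile of (Text docId, Text line) pairs, which is the input format seq2sparse expects; Hadoop will then split it into a reasonable number of map tasks rather than 2 million. File names and the key scheme below are placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("docs-seq/part-00000"), Text.class, Text.class);
        BufferedReader reader = new BufferedReader(new FileReader("input.txt"));
        try {
            String line;
            long docId = 0;
            while ((line = reader.readLine()) != null) {
                // Each line becomes one "document": key = synthetic id, value = the line.
                writer.append(new Text("/doc-" + docId++), new Text(line));
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}
```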
I am trying to profile which functions consume the most time in a TeraSort Hadoop job. For my test system, I am using a basic 1-node pseudo-distributed setup. This means that the NameNode, DataNode, TaskTracker, and JobTracker JVMs all run on the same machine.
I first generate ~9GB of data using TeraGen and then run TeraSort on it. While the JVMs execute, I sample their execution using VisualVM. I know this is not the most accurate profiler out there, but it's free and easy to use! I use the latest version of the Apache Hadoop distribution, and my experiments are run on an Intel Atom based system.
When I look at the Self time (CPU) for Hot Spots - Methods in VisualVM, I see the java.util.zip.CRC32.update() function taking up nearly 40% of the total time. When I look at this function in the call tree, it's invoked by the mapper's main() function, specifically when IdentityMapper.map() is reading input files from HDFS. The function that actually makes the call to CRC32.update() is org.apache.hadoop.fs.FSInputChecker.readChecksumChunk().
I have three questions regarding this:
Why is the CRC32 checksum being updated for blocks being read from HDFS? If I understand correctly, once a block is read, simply comparing the data read from disk with the block's CRC should be the only operation, not generating and updating the block's CRC value.
I looked up the source for the update function, and it's implemented in the java.util.zip.CRC32.java file. The specific function called is the overloaded update() method with three arguments. Since this function is implemented in Java, is it possible that multiple layers of abstraction (Hadoop, JVM, CPU instructions) are reducing the native efficiency of the CRC calculation?
Finally, is there something grossly wrong with my VisualVM instrumentation methodology, or interpretation of the sampling results?
Thanks,
To your first question, I think the answer is that the CRC files have replicas and can be corrupted. For example, assume we have a bunch of files/directories with a replication factor of 2; then the following scenarios can happen, and the CRC will need to be recalculated and updated:
Deleting the meta file on one replica
Truncating the meta file on one replica
Corrupting the meta file header on one replica
Corrupting any random offset and portion of the meta file
Swapping two meta files, i.e. the format of the meta files is valid but their CRCs do not match their corresponding data blocks
If you take a look at the JIRA issues for Hadoop Common, you can find many issues related with CRC corruption.
For the second question, could you tell me which version of Hadoop you are using? The efficiency of the CRC calculation has been complained about and improved again and again.
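One way to estimate how much of your runtime the checksum work accounts for (a rough experiment, not a recommendation for production) is to read the same file with checksum verification switched off via FileSystem.setVerifyChecksum() and compare throughput; the path below is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NoChecksumRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.setVerifyChecksum(false);        // skip CRC verification on read
        byte[] buf = new byte[64 * 1024];
        long bytes = 0;
        long start = System.currentTimeMillis();
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            int n;
            while ((n = in.read(buf)) > 0) {
                bytes += n;
            }
        } finally {
            in.close();
        }
        System.out.println(bytes + " bytes read in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
```

Comparing the timing with and without setVerifyChecksum(false) gives a rough upper bound on how much the CRC work is costing your reads.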