Merge HDFS files without going through the network - hadoop

I could do this:
hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But this streams the file data through the network via the client. Is there a way to tell HDFS to merge a few files on the cluster itself?

I have a problem similar to yours.
Here is a summary of the HDFS file-merging options from an article I found, but each of them has its own caveats. None of them meets my requirements, but I hope this helps you.
HDFS concat (actually FileSystem.concat()). A fairly recent API. It requires the last block of the original file to be full (see the sketch after this list).
MapReduce jobs: I will probably end up with a solution based on this technology, but job setup is slow.
copyMerge - as far as I can see, this would again copy the data. I have not checked the details yet.
File crush - again, this looks like MapReduce.
So the main conclusion is: if MapReduce setup speed suits you, there is no problem. If you have real-time requirements, things get complex.
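For reference, here is a minimal sketch of how the concat option (the first item above) could be invoked from Java. The paths are hypothetical and the full-last-block restriction mentioned above still applies; the point is only that concat() is a NameNode metadata operation, so no file data is streamed through the client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatParts {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical reducer output directory; assumes at least two part files.
    Path dir = new Path("/path/to/result/of/many/reducers");
    FileStatus[] parts = fs.globStatus(new Path(dir, "part-*"));

    // Use the first part file as the target and append the rest onto it.
    Path target = parts[0].getPath();
    Path[] sources = new Path[parts.length - 1];
    for (int i = 1; i < parts.length; i++) {
      sources[i - 1] = parts[i].getPath();
    }

    // Block-level merge performed by the NameNode; fails if the block-size,
    // replication, or full-last-block preconditions are not met.
    fs.concat(target, sources);
  }
}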
One of my 'crazy' ideas is to use the HBase coprocessor mechanism (endpoints) and file block locality information for this, since I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction

Related

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to get started:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the available commodity hardware in my Hadoop cluster, what can be done? Will the copy wait until space frees up, or will there be an error?
How can I find out well in advance, or predict, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? Suppose a node is a machine with storage (a 500 GB HDD), 1 GB of RAM and a dual-core processor. In this scenario, is it roughly 500 GB / 64, assuming each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find out which record/row of the input file ends up in which of the multiple splits Hadoop creates?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest ways to find answers to all of the above queries.

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a specific question involving HDFS and the concat API.
Here it is: when concat talks about files, does it mean only files created and managed by HDFS? Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to:
Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
Go directly to the datanodes and make local copies of the blocks using normal shell commands.
Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, as far as HDFS is concerned, I have two files: the original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop, and the concat code put all those new blocks into a new HDFS file without the data ever passing through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem, and I need to convince them it will not work.
The base philosophy of HDFS is:
write-once, read-many
Therefore, it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop version/branch that allows it (the original version doesn't).
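On a version where append is supported and enabled, a minimal sketch of the append API looks like this; the path and the appended bytes are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Appends to the end of an existing HDFS file; throws on versions
    // where append is not supported or not enabled.
    try (FSDataOutputStream out = fs.append(new Path("/data/existing.log"))) {
      out.write("one more line\n".getBytes("UTF-8"));
    }
  }
}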
An alternative could be to use a non-standard distributed file system such as the MapR file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Or go for HBase, which is built on top of Hadoop to support CRUD operations in the big-data/Hadoop world.
If you are not allowed to use a NoSQL database, then there is no way to update HDFS files; the only option is to rewrite them.
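Since rewriting is the only option on plain HDFS, here is a minimal sketch of the usual read-modify-rewrite pattern; the file names and the line transformation are hypothetical.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RewriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path original = new Path("/data/records.csv");
    Path updated = new Path("/data/records.csv.tmp");

    // Read the old file, write a modified copy, then swap it in.
    try (BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(original), "UTF-8"));
         BufferedWriter out = new BufferedWriter(
             new OutputStreamWriter(fs.create(updated, true), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        out.write(line.replace("old-value", "new-value")); // hypothetical edit
        out.newLine();
      }
    }
    fs.delete(original, false);
    fs.rename(updated, original);
  }
}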

How to achieve desired block size with Hadoop with data on local filesystem

I have a 2 TB sequence file that I am trying to process with Hadoop on a cluster set up to use a local (Lustre) file system for storage instead of HDFS. My problem is that no matter what I try, I always end up with about 66,000 map tasks when I run a map/reduce job with this data as input. This seems to correspond to a block size of 2 TB / 66,000 ≈ 32 MB. The actual computation in each map task executes very quickly, but the overhead of so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512 MB). I also found suggestions to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on Lustre. It seems that any parameter having to do with block size is ignored for the local file system.
I know that using Lustre instead of HDFS is a non-traditional use of Hadoop, but this is what I have to work with. I'm wondering whether others have experience with this, or have any ideas to try beyond what I have already mentioned.
I am using cdh3u5, if that is useful.
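For what it's worth, this is roughly how the settings described above would look if set in the job driver rather than on the command line (old mapred API, since this is cdh3). The mapred.min.split.size line is my own assumption, not something from the post: it raises the minimum input split size independently of the block size, which may matter if the local file system ignores block-size parameters.

import org.apache.hadoop.mapred.JobConf;

public class BlockSizeSettings {
  public static JobConf configure(JobConf conf) {
    // The 512 MB values from the post, applied programmatically.
    conf.setLong("dfs.block.size", 536870912L);
    conf.setLong("fs.local.block.size", 536870912L);

    // Assumption (not from the post): FileInputFormat also honours a minimum
    // split size, which caps the number of map tasks even when block-size
    // settings are ignored.
    conf.setLong("mapred.min.split.size", 536870912L);
    return conf;
  }
}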

How to control file assignment to different slaves in a Hadoop distributed system?

How can I control which slave a file is assigned to in a Hadoop distributed system?
Is it possible to write two or more files simultaneously from a MapReduce task in Hadoop?
I am new to Hadoop, so this would be really helpful to me.
If you know, please answer.
This is my answer for your #1:
You can't directly control where map tasks go in your cluster or where files get sent in your cluster; the JobTracker and the NameNode handle those, respectively. The JobTracker will try to schedule map tasks data-local to improve performance. (I had to guess at what you meant by your question; if I didn't get it right, please elaborate.)
This is my answer for your #2:
MultipleOutputs is what you are looking for when you want to write multiple files from a single reducer.
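Here is a minimal sketch of a reducer that uses MultipleOutputs (new mapreduce API). The named outputs "evens" and "odds" are hypothetical and must also be registered on the Job, e.g. with MultipleOutputs.addNamedOutput(job, "evens", TextOutputFormat.class, Text.class, IntWritable.class), and likewise for "odds".

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplittingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, IntWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    for (IntWritable value : values) {
      // Route each record to one of the two named outputs.
      String name = (value.get() % 2 == 0) ? "evens" : "odds";
      mos.write(name, key, value);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close(); // flush and close all named outputs
  }
}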

Using mahout and hadoop

I am a newbie trying to understand how Mahout and Hadoop can be used for collaborative filtering. I have a single-node Cassandra setup and I want to fetch the data from Cassandra.
Where can I find clear installation steps for Hadoop first, and then for Mahout, so that they work with Cassandra?
(I think this is the same question you just asked on user@mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest not using it, for simplicity's sake. It's a "necessary evil" once you need to scale past a certain point.
You can keep the data in Cassandra, but you will want to be able to read it into memory. If you can dump it as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
Then, your two needs are easily answered:
This is not even a recommendation problem. Just pick an implementation of UserSimilarity, use it to compare a user to all others, and pick the ones with the highest similarity. (Wrapping it with CachingUserSimilarity will help a lot.)
This is just a recommender problem. Use a GenericUserBasedRecommender with your UserSimilarity and DataModel and you're done.
It can of course get much more complex than this, but this is a fine starting point.
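As a concrete starting point, here is a minimal sketch of the pieces mentioned above (FileDataModel, CachingUserSimilarity, GenericUserBasedRecommender). The CSV file name, the choice of PearsonCorrelationSimilarity and the neighborhood size are arbitrary picks for the sketch, not prescriptions.

import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteSketch {
  public static void main(String[] args) throws Exception {
    // userID,itemID,preference lines dumped from Cassandra (hypothetical file).
    DataModel model = new FileDataModel(new File("preferences.csv"));

    UserSimilarity similarity =
        new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Need 1: the users most similar to user 42.
    long[] similarUsers = recommender.mostSimilarUserIDs(42L, 3);
    System.out.println(Arrays.toString(similarUsers));

    // Need 2: top-5 recommendations for user 42.
    List<RecommendedItem> recs = recommender.recommend(42L, 5);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " " + rec.getValue());
    }
  }
}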
If you later use Hadoop, then yes, you have to set up Hadoop according to its instructions; there is no separate Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes, which invoke the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd need to just understand Hadoop.
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.
The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to the Mahout file format, then copied it into the Hadoop file system.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop onto the Hadoop cluster's control box. No other Mahout pieces are on there.
Ran this (assuming Hadoop is configured, which I confirm with dfs -ls /user/danbri):
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!
You can follow this tutorial to learn; it is easy to understand and clearly covers the basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/
