processing result of hdfs command output - hadoop

This is probably a question about stream processing, but I am not able to find an elegant solution using awk.
I am running an m/r job scheduled to run once a day, but there can be multiple HDFS directories on which it needs to run. For example, if 3 input directories were uploaded to HDFS for the day, then 3 m/r jobs, one for each directory, need to run.
So I need a solution where I can extract the filenames from the output of:
hdfs dfs -ls /user/xxx/17-03-15*
Then iterate over the filenames, launching one m/r job for each file.
Thanks

Browsing more on the issue, I found that Hadoop provides configuration settings for this. Here are the details.
Also, I was just having a syntax issue, and this simple awk command did what I wanted:
files=`hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}'`
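For reference, the loop I ended up with looks roughly like this (the jar name, class, and output path are placeholders, not part of the original setup):
files=`hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}'`
for f in $files; do
  # my-job.jar and com.example.MyJob stand in for the actual m/r job
  hadoop jar my-job.jar com.example.MyJob "$f" "/user/hduser/output/$(basename "$f")"
done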

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and came across a problem. I ran a MapReduce job and the output was stored in multiple files, not as a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your mapreduce code to use one reducer, for example, by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being restricted to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
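As a hedged sketch of both routes (the jar name, class, and paths below are placeholders): the reducer count can be forced from the command line when the job parses generic options via ToolRunner, and getmerge gives you one local file:
# force a single reducer so the job writes a single part file
# (assumes the job uses ToolRunner/GenericOptionsParser; jar and class names are placeholders)
hadoop jar my-job.jar com.example.MyJob -D mapreduce.job.reduces=1 /input/dir /output/dir
# or pull the part files down into one local file
hadoop fs -getmerge /output/dir merged-output.txt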
There is no easy way to do this directly in HDFS, but the trick below works. It is not ideal, but it should do the job as long as the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
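For example, with hypothetical paths:
# merge a job's part files into a single HDFS file without touching the local file system
hadoop fs -cat /user/hduser/job_output/part-* | hadoop fs -put - /user/hduser/job_output_merged.txt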

Splitting a file on Hadoop

I have an 8.8 GB file on the Hadoop cluster from which I'm trying to extract certain lines for testing purposes.
Seeing that Apache Hadoop 2.6.0 has no split command, how can I do this without having to download the file?
If the file was on a linux server I would've used:
$ csplit filename %2015-07-17%
The previous command works as desired; is something close to that possible on Hadoop?
You could use a combination of Unix and HDFS commands.
hadoop fs -cat filename.dat | head -250 > /redirect/filename
Or, if the last kilobyte of the file suffices, you could use this:
hadoop fs -tail filename.dat > /redirect/filename
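Closer to the original csplit use case, you can also filter for the pattern and write the result straight back to HDFS without a local copy (the target path is a placeholder):
# keep only the lines containing the date pattern and write them back to HDFS
hadoop fs -cat filename.dat | grep '2015-07-17' | hadoop fs -put - /user/hduser/filename_subset.dat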

Get last 5 lines of a file in Hadoop (HDFS)

I have several files in my Hadoop cluster (on HDFS). I want to see the last 5 lines of every file. Is there a simple command to do so?
If you want to see the last 5 lines specifically (and not any more or any less) of a file in HDFS, you can use the following command, but it's not very efficient:
hadoop fs -cat /your/file/with/path | tail -5
Here's a more efficient command within Hadoop, but it returns the last kilobyte of the data rather than a user-specified number of lines:
hadoop fs -tail /your/file/with/path
Here's a reference to the hadoop tail command: http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#tail
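Since the question asks about every file, a small shell loop over an HDFS directory does the trick (the directory path is a placeholder):
# print the last 5 lines of every file in an HDFS directory
for f in `hadoop fs -ls /your/dir/with/path | awk '{print $8}'`; do
  echo "== $f =="
  hadoop fs -cat "$f" | tail -5
done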

Where are my files (dirs) stored when I use hadoop fs -mkdir?

I'm totally new to Hadoop and just finished installing it, which took me 2 days...
I'm now trying out the hadoop dfs commands, but I just can't understand them; although I've been browsing for days, I couldn't find the answer to what I want to know.
All the examples show what the result is supposed to be without explaining the actual structure behind it, so I would be happy if someone could help me understand HDFS.
I've created a directory on the HDFS.
bin/hadoop fs -mkdir input
OK, I shall check on it with the ls command.
bin/hadoop fs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2012-07-30 11:08 input
OK, no problem, everything seems perfect... BUT where is the HDFS data actually stored?
I thought it would be stored in my datanode directory (/home/hadoop/datastore), which was defined in core-site.xml under hadoop.tmp.dir, but it is not there.
Then I tried to view it through the web UI and found that "input" was created under "/user/hadoop/" (/user/hadoop/input).
My questions are:
(1) What is the datanode directory (hadoop.tmp.dir) used for, since it doesn't store everything I process through the dfs commands?
(2) Everything created with the dfs commands goes to /user/XXX/; how can I change that default?
(3) I can't see anything when I try to access it through a normal Linux command (ls /user/hadoop). Does /user/hadoop exist only logically?
I'm sorry if my questions are stupid; I'm a newbie struggling to understand Hadoop better.
Thank you in advance.
HDFS is not a POSIX file system, and you have to use the Hadoop API to read and view it. That's why you run hadoop fs -ls: you are using the Hadoop API to list files. Data in HDFS is split into blocks and stored across the datanodes, while the metadata about the file system is kept on the NameNode. The files you see under "/home/hadoop/datastore" are the blocks stored on an individual datanode.
I think you should explore the file system further in a tutorial, for example the Yahoo (YDN) tutorial on HDFS.
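A short sketch illustrating the two namespaces (the block path shown is only an example of a typical datanode layout, not taken from the question):
# relative HDFS paths resolve to your HDFS home directory, /user/<username>
hadoop fs -ls input        # equivalent to: hadoop fs -ls /user/hadoop/input
# /user/hadoop exists only inside HDFS, so a local listing finds nothing
ls /user/hadoop            # "No such file or directory" on the local disk
# to avoid the /user/<username> prefix, simply use absolute HDFS paths
hadoop fs -mkdir /myinput
# on a datanode, the raw blocks typically sit under the configured data directory, e.g.
# /home/hadoop/datastore/dfs/data/current/.../blk_1073741825   (example layout only)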

Hadoop DistCp using wildcards?

Is it possible to use DistCp to copy only files that match a certain pattern?
For example, from /foo I only want the *.log files.
I realize this is an old thread. But I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:
distcp does not support wildcards. The closest you can do is to:
Find the files you want to copy (sources), filter them using grep, format them for HDFS using awk, and write the result to an "input-files" list:
hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ |
  grep -e webapp.log.3. | awk '{print "hdfs://localhost:9000" $8}' > input-files.txt
Put the input-files list into hdfs
hadoop dfs -put input-files.txt .
Create the target dir
hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/
Run distcp using the input-files list and specifying the target hdfs dir:
hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/
DistCp is in fact just a regular map-reduce job: you can use the same globbing syntax as you would use for the input of a regular map-reduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with the hadoop fs -ls statement here: if globbing works with fs -ls, then it will work with DistCp (well, almost, but the differences are too subtle to mention).
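A minimal sketch of the globbing approach, reusing the placeholder host and paths from the answer above:
# sanity-check that the glob matches what you expect
hadoop fs -ls hdfs://localhost:9000/foo/*.log
# then pass the same glob straight to DistCp as the source
hadoop distcp hdfs://localhost:9000/foo/*.log hdfs://localhost:9000/path/to/target/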
