I am familiar with Hadoop using Java. Looking for sample Hadoop Map reduce program using AWK only.
For a text file containing...
A k1
B k1
C k2
D k3
Looking for an o/p
k1 2
k2 1
k3 1
I would advise using Hadoop Streaming to do this. I'm not an Awk expert by any means, but taking @sudo_O's answer and transforming it into the Hadoop world, here is what I would do:
Write an Awk script that will be used as your mapper. You need only a mapper for this, no need for reducers.
$ cat mapper.awk
#!/usr/bin/awk -f
{a[$2]++}END{for(k in a)print k,a[k]}
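A quick local sanity check before submitting the streaming job (assuming the four sample lines above are saved as input.txt, which is my naming, not the original poster's):
chmod +x mapper.awk
cat input.txt | ./mapper.awk
# prints, in arbitrary order since awk's for-in iteration is unordered:
# k1 2
# k2 1
# k3 1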
You can run your Hadoop streaming job doing the following:
${HADOOP_HOME}/bin/hadoop \
jar ${HADOOP_HOME}/contrib/streaming/*.jar \
-D mapreduce.job.reduces=0 \
-D mapred.reduce.tasks=0 \
-input /path/to/input.txt \
-output /path/to/output/dir \
-mapper mapper.awk \
-file /path/to/mapper.awk
You can view the results in HDFS by doing:
hadoop fs -cat /path/to/output/dir/*
This will do the trick:
$ awk '{a[$2]++}END{for(k in a)print k,a[k]}' file
k1 2
k2 1
k3 1
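Since awk's for (k in a) loop iterates in an unspecified order, you can pipe the output through sort if you want the keys in the order shown above (a small usage note, not part of the original answer):
$ awk '{a[$2]++}END{for(k in a)print k,a[k]}' file | sort -k1,1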
I have a file composed as follows:
&009:65
34KKll90JJKK87LLOO
%(..)?.I$£.....
&013:35
36KKll90TTYY87LLPP
%%(.9)?'
&025:66
55KKll88ZZYY87MMQQ
%&(.9)?%%??-_'
And I would like to get a file as:
&009:65 34KKll90JJKK87LLOO %(..)?.I$£.....
&013:35 36KKll90TTYY87LLPP %%(.9)?'.......
&025:66 55KKll88ZZYY87MMQQ %&(.9)?%%??-_'.......
I use Hortonworks, and I would like to know whether it's better to use Hive or Pig for this, and how I could achieve it with one or the other.
Hive, Pig, and the whole Hadoop ecosystem expect files with single-line records, so that you can split the file arbitrarily on any line break and process the splits separately with an arbitrary number of Mappers.
Your example has logical records spanning multiple lines. That is not splittable and cannot be processed easily in a distributed way. Game over.
Workaround: start a shell somewhere, download the ugly stuff locally, rebuild consistent records with good old sed or awk utilities, and upload the result. Then you can read it with Hive or Pig.
Sample sed command line (awk would be overkill IMHO)...
sed -n '/^&/ { N ; N ; s/\n/ /g ; p }' UglyStuff.dump > NiceStuff.txt
If you prefer one-liners:
hdfs dfs -cat /some/path/UglyStuff.dump | sed -n '/^&/ { N ; N ; s/\n/ /g ; p }' | hdfs dfs -put -f - /different/path/NiceStuff.txt
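If you do reach for awk instead, a rough equivalent (assuming every record is exactly the three lines shown, each group starting with &) would be:
# Read the two lines following each '&' header and print all three on one line.
awk '/^&/ { header = $0; getline line1; getline line2; print header, line1, line2 }' UglyStuff.dump > NiceStuff.txt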
I have a file (testfile) on HDFS, and I want to know how many lines it has.
In Linux, I can do:
wc -l <filename>
Can I do something similar with the "hadoop fs" command? I can print the file contents with:
hadoop fs -text /user/mklein/testfile
How do I know how many lines I have? I want to avoid copying the file to the local filesystem and then running the wc command.
Note: My file is compressed using snappy compression, which is why I have to use -text instead of -cat
Total number of files:
hadoop fs -ls /path/to/hdfs/* | wc -l
Total number of lines:
hadoop fs -cat /path/to/hdfs/* | wc -l
Total number of lines for a given file:
hadoop fs -cat /path/to/hdfs/filename | wc -l
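Since the file in the question is Snappy-compressed, the same idea works with -text, which decompresses codecs it recognizes before the bytes reach wc (a small sketch reusing the path from the question):
# -text handles recognized compression codecs and sequence files on the fly
hadoop fs -text /user/mklein/testfile | wc -l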
1. Number of lines of a mapper output file:
hadoop fs -cat /user/cloudera/output/part-m-00000 | wc -l
2. Number of lines of a text or any other file on HDFS:
hadoop fs -cat /user/cloudera/output/abc.txt | wc -l
3. Top (header) 5 lines of a text or any other file on HDFS:
hadoop fs -cat /user/cloudera/output/abc.txt | head -5
4. Bottom 10 lines of a text or any other file on HDFS:
hadoop fs -cat /user/cloudera/output/abc.txt | tail -10
You cannot do it with a hadoop fs command alone. Either you have to write MapReduce code with the logic explained in this post, or this Pig script would help:
A = LOAD 'file' USING PigStorage() AS (...);
B = GROUP A ALL;
cnt = FOREACH B GENERATE COUNT(A);
Make sure you have the correct extension on your Snappy file so that Pig can detect and read it.
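As a hedged end-to-end sketch (the path, the .snappy suffix, and the empty schema are placeholders, not from the original post), the same count can be run straight from the shell with pig -e:
# The .snappy extension lets Pig pick the right codec when loading the file.
pig -e "A = LOAD '/user/mklein/testfile.snappy' USING PigStorage(); B = GROUP A ALL; cnt = FOREACH B GENERATE COUNT(A); DUMP cnt;"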
How can I decompress and view a few lines of a compressed file in HDFS?
The command below displays the last few lines of the compressed data:
hadoop fs -tail /myfolder/part-r-00024.gz
Is there a way I can use the -text command and pipe the output to the tail command? I tried this, but it doesn't work:
hadoop fs -text /myfolder/part-r-00024.gz > hadoop fs -tail /myfolder/
The following will show you the specified number of lines without decompressing the whole file:
hadoop fs -cat /hdfs_location/part-00000.gz | zcat | head -n 20
The following will page the file, also without first decompressing the whole of it:
hadoop fs -cat /hdfs_location/part-00000.gz | zmore
Try the following; it should work as long as your file isn't too big (since the whole thing will be decompressed):
hadoop fs -text /myfolder/part-r-00024.gz | tail
I ended up writing a Pig script:
A = LOAD '/myfolder/part-r-00024.gz' USING PigStorage('\t');
B = LIMIT A 10;
DUMP B;
Use gunzip to view the compressed file contents:
hdfs dfs -cat /path/filename.gz | gunzip
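Following the same pattern, the last few lines the original question asked for can be pulled out too (the whole file is still decompressed in the stream, just as with -text | tail; the pipeline below is a usage sketch, not part of the original answer):
hdfs dfs -cat /myfolder/part-r-00024.gz | gunzip | tail -n 10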
I have 2 GB of data in HDFS.
Is it possible to get a random sample of that data, like we do on the Unix command line?
cat iris2.csv | head -n 50
Native head
hadoop fs -cat /your/file | head
is efficient here, as cat will close the stream as soon as head has finished reading all the lines.
To get the tail, there is a dedicated, efficient command in Hadoop:
hadoop fs -tail /your/file
Unfortunately, it returns the last kilobyte of the data, not a given number of lines.
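If a fixed number of trailing lines is really needed, a hedged workaround is to stream the file through the local tail (the whole file gets read, so this is only sensible for modestly sized files):
# Streams the entire file from HDFS; tail keeps only the last 50 lines.
hadoop fs -cat /your/file | tail -n 50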
You can use the head command in Hadoop too! The syntax would be:
hdfs dfs -cat <hdfs_filename> | head -n 3
This will print only three lines from the file.
The head and tail commands on Linux display the first 10 and last 10 lines, respectively. However, their output is not randomly sampled; the lines come out in the same order as in the file itself.
The Linux shuf command generates random permutations of input lines, and using it in conjunction with the Hadoop commands is helpful, like so:
$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>
Therefore, in this case if iris2.csv is a file on HDFS and you wanted 50 lines randomly sampled from the dataset:
$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50
Note: the Linux sort command (with its -R option) could also be used, but shuf is faster and gives a better random sample, since sort -R keys on a hash of each line and therefore groups identical lines together.
hdfs dfs -cat yourFile | shuf -n <number_of_line>
will do the trick for you. Though it's not available on macOS by default, you can install GNU coreutils to get it.
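For example, on macOS a common route (assuming Homebrew is available) is to install coreutils, which ships the GNU tools with a g prefix:
brew install coreutils
# GNU shuf is installed as gshuf on macOS
hdfs dfs -cat yourFile | gshuf -n 50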
My suggestion would be to load that data into a Hive table; then you can do something like this:
SELECT column1, column2 FROM (
SELECT iris2.column1, iris2.column2, rand() AS r
FROM iris2
ORDER BY r
) t
LIMIT 50;
EDIT:
This is a simpler version of that query:
SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;
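A hedged way to run that sample from the shell and keep the result locally (hive -e and the output file name are illustrative, not from the original answer):
# ORDER BY rand() forces a total sort through a single reducer, which is fine
# for a 2 GB table but gets expensive on much larger data.
hive -e "SELECT column1, column2 FROM iris2 ORDER BY rand() LIMIT 50;" > iris2_sample_50.tsv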
Write this command:
sudo -u hdfs hdfs dfs -cat "path of csv file" | head -n 50
50 is the number of lines (this can be customized by the user based on the requirements).
Working code:
hadoop fs -cat /tmp/a/b/20200630.xls | head -n 10
hadoop fs -cat /tmp/a/b/20200630.xls | tail -3
I was using tail and cat for an Avro file on an HDFS cluster, but the result was not printed in the correct encoding. I tried this and it worked well for me:
hdfs dfs -text hdfs://<path_of_directory>/part-m-00000.avro | head -n 1
Change 1 to a higher integer to print more sample records from the Avro file.
hadoop fs -cat /user/hive/warehouse/vamshi_customers/* |tail
I think the head part works fine as per the answer posted by @Viacheslav Rodionov, but for the tail part the command I posted above works well.
I have mapreduce input that looks like this:
key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...
I want to sort by the first column and then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?
My current attempt is this:
hadoop jar hadoop-streaming-1.2.1.jar -Dnum.key.fields.for.partition=1 -Dmapred.text.key.comparator.options='-k1,2rn' -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -mapper cat -reducer cat -file mr_base.py -file common.py -file mr_sort_combiner.py -input mr_combiner/2013_12_09__05_47_21/part-* -output mr_sort_combiner/2013_12_09__07_15_59/
But this sorts by the first and second parts of the key, and it sorts the second part as a string rather than numerically.
Any ideas on how I can sort two fields (one numeric and one textual)?
You can achieve numerical sorting on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options (similar to the Linux sort command).
For example, in bash:
sort -k1,1 -k2rn
So for your example it would be:
hadoop jar hadoop-streaming-1.2.1.jar \
-Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-mapper cat \
-reducer cat \
-file mr_base.py \
-file common.py \
-file mr_sort_combiner.py \
-input mr_combiner/2013_12_09__05_47_21/part-* \
-output mr_sort_combiner/2013_12_09__07_15_59/
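One caveat worth checking (my assumption about streaming defaults, not something stated in the answer above): by default, streaming treats only the text before the first tab as the key, so the comparator never sees the second field. Declaring a two-field key, and partitioning on the first field only so that all rows for a key still reach the same reducer, would look roughly like this (the -file shipping is omitted for brevity):
hadoop jar hadoop-streaming-1.2.1.jar \
  -D stream.num.map.output.key.fields=2 \
  -D num.key.fields.for.partition=1 \
  -D mapred.text.key.comparator.options='-k1,1 -k2rn' \
  -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
  -mapper cat \
  -reducer cat \
  -input mr_combiner/2013_12_09__05_47_21/part-* \
  -output mr_sort_combiner/2013_12_09__07_15_59/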