I have a file composed as follows:
&009:65
34KKll90JJKK87LLOO
%(..)?.I$£.....
&013:35
36KKll90TTYY87LLPP
%%(.9)?'
&025:66
55KKll88ZZYY87MMQQ
%&(.9)?%%??-_'
And I would like to get a file as:
&009:65 34KKll90JJKK87LLOO %(..)?.I$£.....
&013:35 36KKll90TTYY87LLPP %%(.9)?'.......
&025:66 55KKll88ZZYY87MMQQ %&(.9)?%%??-_'.......
I use Hortonworks and I would like to know whether it's better to use Hive or Pig, and how I could achieve this with one or the other.
Hive, Pig, and the whole Hadoop ecosystem expect files with single-line records, so that you can split the file arbitrarily on any line break and process the splits separately with an arbitrary number of Mappers.
Your example has logical records spanning multiple lines. Not splittable stuff. Cannot be processed easily in a distributed way. Game over.
Workaround: start a shell somewhere, download the ugly stuff locally, rebuild consistent records with good old sed or awk utilities, and upload the result. Then you can read it with Hive or Pig.
Sample sed command line (awk would be overkill IMHO)...
sed -n '/^&/ { N ; N ; N ; N ; s/\n\n/ /g ; p }' UglyStuff.dump > NiceStuff.txt
If you prefer one-liners:
hdfs dfs -cat /some/path/UglyStuff.dump | sed -n '/^&/ { N ; N ; N ; N ; s/\n\n/ /g ; p }' | hdfs dfs -put -f - /different/path/NiceStuff.txt
I would like to store the names of all my hbase tables in an array inside my bash script.
All sed hotfixes are acceptable.
All better solutions (like reading it with readarray from some ZooKeeper file I am not aware of) are acceptable.
I have two hbase tables called MY_TABLE_NAME_1 and MY_TABLE_NAME_2, so what I want would be:
tables=(
MY_TABLE_NAME_1
MY_TABLE_NAME_2
)
What I tried:
Based on HBase Shell in OS Scripts by Cloudera:
echo "list" | /path/to/hbase/bin/hbase shell -n > /home/me/hbase-tables
readarray -t tables < /home/me/hbase-tables
but the contents of my /home/me/hbase-tables are:
MY_TABLE_NAME_1
MY_TABLE_NAME_2
2 row(s) in 0.3310 seconds
MY_TABLE_NAME_1
MY_TABLE_NAME_2
You could use readarray/mapfile just fine. But to remove duplicates, skip empty lines, and drop the unnecessary strings, you need a filter such as awk.
Also, you don't need to create a temporary file and then parse it; instead, use a technique called process substitution, which makes the output of a command available as if it were in a temporary file:
mapfile -t output < <(echo "list" | /path/to/hbase/bin/hbase shell -n | awk '!unique[$0]++ && !/seconds/ && NF')
Now the array contains only the unique table names from the hbase output. That said, you should really look for a way to suppress the noise as part of the query output rather than post-process it this way.
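As a minimal usage sketch (assuming the array has been populated by the mapfile call above; the loop and echo are just for illustration):
# iterate over the populated array and report how many tables were found
for table in "${output[@]}"; do
  echo "Found table: $table"
done
echo "Total tables: ${#output[@]}"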
So basically I used MapReduce for a word count on a text file I've saved in Hadoop, and now I want to view the output.
Currently this is the only command I've seen online:
bin/hadoop fs -cat output/part-r-00000 | sort -k 2 -n -r | less
So far I'm just confused by this command. Does it just sort the output? Can I view the output without sorting it?
Does this command sort the word count, or does it display everything in alphabetical order otherwise? Is there any other way you would recommend to sort the saved text file, a novel?
Also, can I just view the output file of the word count without sorting it?
Can I view the output without sorting it?
Just -cat it
bin/hadoop fs -cat output/part-r-00000 | less
Or copy the output file to the Local FS from HDFS and use it
bin/hadoop fs -get output/part-r-00000 /tmp/output
Does this command sort the word count, or does it display everything in alphabetical order otherwise?
sort -k 2 -n -r: Sort the 2nd column (-k 2) numerically (-n) in reverse (-r) order.
Assuming the second column contains the count, this sorts the words from the most occurrences to the least. As for a different way of sorting, I feel this is the better one. If you want to sort the content alphabetically, just use sort. Refer to the sort manual.
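For example, a minimal sketch that sorts alphabetically by the first column (the word), using the same sample output path as above:
bin/hadoop fs -cat output/part-r-00000 | sort -k 1 | less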
I have 2 GB of data in my HDFS.
Is it possible to get that data randomly?
Like we do on the Unix command line:
cat iris2.csv | head -n 50
Native head
hadoop fs -cat /your/file | head
is efficient here, as cat will close the stream as soon as head finishes reading all the lines.
To get the tail, there is a special, efficient command in Hadoop:
hadoop fs -tail /your/file
Unfortunately it returns the last kilobyte of the data, not a given number of lines.
You can use the head command in Hadoop too! The syntax would be:
hdfs dfs -cat <hdfs_filename> | head -n 3
This will print only the first three lines of the file.
The head and tail commands on Linux display the first 10 and last 10 lines respectively. But the output of these two commands is not randomly sampled; it is in the same order as in the file itself.
The Linux shuf (shuffle) command helps us generate random permutations of input lines, and using it in conjunction with the Hadoop commands would be helpful, like so:
$ hadoop fs -cat <file_path_on_hdfs> | shuf -n <N>
Therefore, in this case, if iris2.csv is a file on HDFS and you want 50 lines randomly sampled from the dataset:
$ hadoop fs -cat /file_path_on_hdfs/iris2.csv | shuf -n 50
Note: The Linux sort command (with -R) could also be used, but shuf is faster and samples more uniformly (sort -R groups identical lines together).
hdfs dfs -cat yourFile | shuf -n <number_of_line>
will do the trick for you. Though shuf is not available on macOS by default, you can install GNU coreutils to get it.
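A minimal sketch for macOS, assuming Homebrew is available (Homebrew's coreutils installs the GNU tools with a g prefix, so shuf becomes gshuf):
# install GNU coreutils, then use the g-prefixed shuf
brew install coreutils
hdfs dfs -cat yourFile | gshuf -n 50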
My suggestion would be to load that data into a Hive table; then you can do something like this:
SELECT column1, column2 FROM (
SELECT iris2.column1, iris2.column2, rand() AS r
FROM iris2
ORDER BY r
) t
LIMIT 50;
EDIT:
This is a simpler version of that query:
SELECT iris2.column1, iris2.column2
FROM iris2
ORDER BY rand()
LIMIT 50;
Write this command:
sudo -u hdfs hdfs dfs -cat "path of csv file" | head -n 50
50 is the number of lines (this can be customized by the user based on the requirements).
Working code:
hadoop fs -cat /tmp/a/b/20200630.xls | head -n 10
hadoop fs -cat /tmp/a/b/20200630.xls | tail -3
I was using tail and cat for an Avro file on an HDFS cluster, but the result was not printed in the correct encoding. I tried this and it worked well for me:
hdfs dfs -text hdfs://<path_of_directory>/part-m-00000.avro | head -n 1
Change 1 to a higher integer to print more sample records from the Avro file.
hadoop fs -cat /user/hive/warehouse/vamshi_customers/* | tail
I think the head part works fine, as per the answer posted by @Viacheslav Rodionov, but for the tail part the command I posted above works well.
A fast method for inspecting files on HDFS is to use tail:
~$ hadoop fs -tail /path/to/file
This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command, head, does not appear to be part of the shell command collection. I find this very surprising.
My hypothesis is that, since HDFS is built for very fast streaming reads on very large files, there is some access-oriented issue that affects head. This makes me hesitant to try to access the head of a file. Does anyone have an answer?
I would say it's more to do with efficiency - a head can easily be replicated by piping the output of hadoop fs -cat through the Linux head command.
hadoop fs -cat /path/to/file | head
This is efficient, as head will close the underlying stream after the desired number of lines has been output.
Using tail in this manner would be considerably less efficient - as you'd have to stream over the entire file (all HDFS blocks) to find the final x lines.
hadoop fs -cat /path/to/file | tail
The hadoop fs -tail command, as you note, works on the last kilobyte - Hadoop can efficiently find the last block, skip to the position of the final kilobyte, and then stream the output. Piping via tail can't easily do this.
Starting with version 3.1.0 we now have it:
Usage: hadoop fs -head URI
Displays first kilobyte of the file to stdout.
See the hadoop fs -head entry in the Hadoop FileSystem Shell documentation.
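For example, a minimal sketch on a 3.1.0+ client (the path is just a placeholder):
# prints the first kilobyte of the file to stdout
hadoop fs -head /path/to/file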
hdfs dfs -cat /path | head
is a good way to solve the problem.
You can try the following command:
hadoop fs -cat /path | head -n <n>
where <n> is the number of lines to view.
In Hadoop v2:
hdfs dfs -cat /file/path | head
In Hadoop v1 and v3:
hadoop fs -cat /file/path | head
How can I rename all files in a hdfs directory to have a .lzo extension? .lzo.index files should not be renamed.
For example, this directory listing:
file0.lzo file0.lzo.index file0.lzo_copy_1
could be renamed to:
file0.lzo file0.lzo.index file0.lzo_copy_1.lzo
These files are lzo compressed, and I need them to have the .lzo extension to be recognized by Hadoop.
If you don't want to write Java code for this - I think using the command-line HDFS API is your best bet:
mv in Hadoop
hadoop fs -mv URI [URI …] <dest>
You can get the paths using a small one-liner:
% hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $8}'
/user/foo/bar/blacklist
/user/foo/bar/books-eng
...
The awk removes directories from the output. Now you can put these files into a variable:
% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d/ {print $8}')
and rename each file:
% for f in $files; do hadoop fs -mv $f $f.lzo; done
You can also use awk to filter the files by other criteria. The following should exclude files that match the regex nolzo; however, it's untested. This way you can write flexible filters.
% files=$(hadoop fs -ls /user/foo/bar | awk '!/^d|nolzo/ {print $8}' )
Test whether it works by replacing the hadoop command with echo:
$ for f in $files; do echo $f $f.lzo; done
Edit: Updated examples to use awk instead of sed for more reliable output.
The "right" way to do it is probably using the HDFS Java API .. However using the shell is probably faster and more flexible for most jobs.
When I had to rename many files I was searching for an efficient solution and stumbled over this question and thi-duong-nguyen's remark that renaming many files is very slow. I implemented a Java solution for batch rename operations which I can highly recommend, since it is orders of magnitude faster. The basic idea is to use org.apache.hadoop.fs.FileSystem's rename() method:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://master:8020"); // NameNode address
FileSystem dfs = FileSystem.get(conf);
dfs.rename(from, to);
where from and to are org.apache.hadoop.fs.Path objects. The easiest way is to create a list of files to be renamed (including their new name) and feed this list to the Java program.
I have published the complete implementation which reads such a mapping from STDIN. It renamed 100 files in less than four seconds (the same time was required to rename 7000 files!) while the hdfs dfs -mv based approach described before requires 4 minutes to rename 100 files.
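As a minimal sketch of how such a rename list could be built from the shell (the two-column "old new" format and the output file name are just assumptions for illustration; check the published implementation for its actual input format):
# keep regular files that do not already end in .lzo or .lzo.index,
# and print "old-path new-path" pairs
hadoop fs -ls /user/foo/bar | awk '!/^d/ && $8 !~ /\.lzo(\.index)?$/ {print $8, $8".lzo"}' > rename-mapping.txt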
We created a utility to do bulk renaming of files in HDFS: https://github.com/tenaris/hdfs-rename. The tool is limited, but if you want, you can contribute to improve it with recursion, awk regex syntax, and so on.