Fastest access of a file using Hadoop

I need the fastest access to a single file, several copies of which are stored across many systems using Hadoop. I also need to find the ping time for each file, in sorted order.
How should I approach learning Hadoop to accomplish this task?
Please help fast; I have very little time.

If you need faster access to a file, just increase that file's replication factor using the setrep command. This might not increase the file throughput proportionally, because of your current hardware limitations.
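For example, something like this should do it (the replication factor of 5 and the file path are placeholders, not values from the question):
# -w waits until the target replication factor is actually reached
hadoop fs -setrep -w 5 /path/to/your/file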
The ls command does not give the access time for directories and files; it shows only the modification time. Use the Offline Image Viewer to dump the contents of HDFS fsimage files to a human-readable format. Below is the command using the Indented option.
bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
A sample output from fsimage.txt; look for the ACCESS_TIME field.
INODE
  INODE_PATH = /user/praveensripati/input/sample.txt
  REPLICATION = 1
  MODIFICATION_TIME = 2011-10-03 12:53
  ACCESS_TIME = 2011-10-03 16:26
  BLOCK_SIZE = 67108864
  BLOCKS [NUM_BLOCKS = 1]
    BLOCK
      BLOCK_ID = -5226219854944388285
      NUM_BYTES = 529
      GENERATION_STAMP = 1005
  NS_QUOTA = -1
  DS_QUOTA = -1
  PERMISSIONS
    USER_NAME = praveensripati
    GROUP_NAME = supergroup
    PERMISSION_STRING = rw-r--r--
To get the ping (access) times in sorted order, you need to write a shell script or some other program to extract the INODE_PATH and ACCESS_TIME from each INODE section and then sort them by ACCESS_TIME. You can also use Pig, as shown here.
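A minimal shell sketch of that extraction, assuming the Indented layout shown above (one FIELD = value pair per line) and the fsimage.txt produced by the command above:
# Pair each INODE_PATH with its ACCESS_TIME, then sort by access time.
awk -F' = ' '/INODE_PATH/ {path=$2} /ACCESS_TIME/ {print $2 "\t" path}' fsimage.txt | sort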
How should I approach learning Hadoop to accomplish this task? Please help fast; I have very little time.
Learning Hadoop in a day or two is not possible. Here are some videos and articles to start with.

Related

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5 TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file named ConcatenatedFile. Here is the script for that.
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such a capability. All out-of-the-box approaches (like hdfs dfs -text * with pipes, or FileUtil's copy methods) transfer all the data through your client machine.
In my experience, we have always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
1. Write your own simple MapReduce/Spark job to combine text files with your format.
2. Find an already implemented solution for this kind of purpose.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
  --input-format=text \
  --output-format=text \
  --compress=none \
  /input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it with them.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting it back to HDFS means all this data is processed on the client node, which will degrade your network.

Pig script to compress and decompress the hdfs data in bzip2

How do I compress HDFS data to bzip2 using Pig such that, on decompression, it gives the same directory structure it had initially? I am new to Pig.
I tried to compress with bzip2, but it generated many files because many mappers were spawned, and hence reverting back to the plain text files (the initial form) in the same directory structure becomes difficult.
Just like in Unix: if we compress a folder into a bzip2 tarball, then decompressing the .tar.bz2 gives exactly the same data and folder structure it had initially.
e.g. Compression: tar -cjf compress_folder.tar.bz2 compress_folder/
Decompression: tar -xjvf compress_folder.tar.bz2
will give exactly the same directory structure.
Approach 1:
You can try running a single reducer to store only one file on HDFS, but the compromise here will be performance.
set default_parallel 1;
To compress the data, set these parameters in the Pig script, if you have not tried it this way:
set output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
Just use JsonStorage while storing the file:
STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
Eventually you will also want to read the data back; use TextLoader:
data = LOAD '/user/hduser/data/usercount/' USING TextLoader;
Approach 2:
filecrush: a file merge utility, available on GitHub

How to Perform shell script like operation in Hadoop

I am facing a problem performing operations like cut, tail, sort, etc., as I was able to do on files in a Unix shell environment.
I have a situation where I want the highest timestamp in my file (which is not sorted by timestamp), store it in, say, 'X', and then pass 'X' as an argument to my MapReduce driver class while executing the MR job.
In local mode it is easy to do this:
cut -d, -f <<fieldIndexNo>> <<FileName>> | sort -n | tail -1
This gives me the greatest timestamp.
Now, in distributed mode, how do I go about performing such operations? In other words, what tricks can we use to help solve such problems?
I do not wish to trigger a MapReduce job just to find the greatest timestamp and then pass it to another MapReduce job.
Kindly suggest.
Let me know in case more information is needed.
Thanks
I'm going to assume the files are stored in HDFS and not on the local file system on each node. In that case, you only have 2 options:
Read all files in your local shell and do the filtering as you did before. Mind you, this is very slow, very inefficient, and completely opposed to the idea of Hadoop. But you could do something like:
hadoop fs -cat <foldername>/* | cut -d, -f <<fieldIndexNo>> | sort -n | tail -1
Write a Pig job (or Spark job, or ...) that does it efficiently. It should be a simple script of at most 3 lines that sorts the file by timestamp and takes the top 1. Then you store this number on HDFS. This will be executed in parallel across the nodes and will be much quicker than the first solution.
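A minimal sketch of what that could look like, driven from the shell (the input path, comma delimiter, and column layout below are assumptions, not values from the question):
# Write a tiny Pig script that keeps only the row with the largest timestamp, then run it;
# the result ends up as a small single-row output on HDFS.
cat > max_ts.pig <<'EOF'
data    = LOAD '/path/to/input' USING PigStorage(',') AS (f1:chararray, ts:long);
ordered = ORDER data BY ts DESC;
top1    = LIMIT ordered 1;
STORE top1 INTO '/path/to/max_timestamp' USING PigStorage(',');
EOF
pig max_ts.pig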

Hive - Possible to get total size of file parts in a directory?

Background:
I have some gzip files in an HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz and so on.
Aim:
I want to build a hive script which produces a table with the columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.
To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. The sum will be the value in Column 2 and the date in Column 1.
Is this possible? Are there any in-built functions or UDFs that could help me with my use case?
Thanks in advance!
A MapReduce job for this doesn't seem efficient since you don't actually have to load any data. Plus, doing this seems kind of awkward in Hive.
Can you write a bash script or Python script, or something like that, to parse the output of hadoop fs -ls? I'd imagine something like this:
$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt
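If you would rather avoid a separate Python script, here is a rough awk sketch of that datecount step; it assumes the default hadoop fs -ls layout (size in column 5, path in column 8) and file names of the form yyyy-mm-dd-NNNNNN.gz as described above:
# Sum gzip file sizes per date prefix and write the totals back to HDFS.
hadoop fs -ls mydir/*gz | awk '
  NF >= 8 {
    n = split($8, p, "/")       # file name is the last path component
    d = substr(p[n], 1, 10)     # yyyy-mm-dd prefix of the file name
    sizes[d] += $5              # accumulate bytes per date
  }
  END { for (d in sizes) print d "\t" sizes[d] }
' | sort | hadoop fs -put - counts.txt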

Loading new files using Pig LOAD statement

I wanted to load data from HDFS into an HBase table using a Pig script.
I have an HDFS folder structure as below:
-rw-r--r-- 1 user supergroup 63 2014-05-15 20:28 dataparse/good/goodrec_051520142028
-rw-r--r-- 1 user supergroup 72 2014-05-15 20:30 dataparse/good/goodrec_051520142030
-rw-r--r-- 1 user supergroup 110 2014-05-15 20:32 dataparse/good/goodrec_051520142032
In the above, all filenames have a timestamp appended.
Below is my Pig script to load from HDFS into HBase:
G = LOAD '/user/user/dataparse/good/' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('t1:name t1:state t1:phone_no t1:gender');
The script works fine and the data from all 3 files is written to the HBase "test" table.
Suppose that after some time more files arrive in HDFS with the same structure. When I run the Pig script, it will LOAD all the files in the "good" directory along with the already-read files. So how can I load only the files which are new? Already-loaded files should not be loaded again into my HBase table.
How can I do this?
Thanks,
Sapthashree
I think you have a few options here.
Using globs
Using a shell script, pick up the "new" files and use the glob feature so that multiple files can be fed into the script. A related use case is here.
If the files have a date and timestamp in the filename, then you can use globs directly; look here for inspiration (a small shell sketch follows this list).
Using big guns
If using globs fails you, then you need to bring out the big guns: use a custom load function, put the logic to identify "new files" in it, and you should be good to go. Details here.
You need to have some scheduling mechanism whereby the Pig job runs from time to time. Then you can process only the files which were not processed earlier, by keeping track of the timestamps and file names, or any other field.
See here for more information: Execute Pig from within Java Application
