Reading the last 10 months' HDFS files - hadoop

I need to select the HDFS files from the last 10 months, based on the date in my HDFS path:
/path/ds=<year-month-day>
Is there a way to do that dynamically, using path wildcards, while setting this information in a .conf file?
The expected result would be something like (considering today's month):
/path/ds=202{1,2}-{03-01}*/hour=*/*/*
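One possible sketch (assuming GNU date is available and that a wrapper script assembles the glob list before writing it into the .conf file; note that naive month arithmetic can misbehave at month ends):
# Build a comma-separated list of month globs covering the last 10 months.
months=""
for i in $(seq 0 9); do
  m=$(date -d "-$i month" +%Y-%m)
  months="$months,/path/ds=$m-*/hour=*/*/*"
done
months=${months#,}    # strip the leading comma
echo "$months"        # e.g. write this value into the .conf file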

Related

Is there a way to iterate through hdfs dfs file system in unix to search for huge number of small files

I need to iterate through an HDFS file system, where the owner of the file system is my service account, and find directories/folders where the number of small files is more than a lakh (100,000).
The problem I am facing is with respect to the depths of these folders, for example:
Path 1 : hdfs dfs edx/home/krn/zxy/
Path 2 : hdfs dfs edx/home/nzy/
In Path 1 the small files are present inside zxy, which means the depth is 3, whereas in Path 2 the small files are present inside nzy, which is at depth 2.
I need help writing a bash/shell script where the depth is dynamic, so I can get all the locations/paths that contain more than 100k small files.
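A minimal sketch of one approach, based on a single recursive listing so the folder depth does not matter; the root path and threshold below are placeholders, and file sizes are not checked here (every file is counted as "small"):
ROOT=/user/myservice        # assumed root owned by the service account
THRESHOLD=100000

hdfs dfs -ls -R "$ROOT" \
  | awk -v limit="$THRESHOLD" '
      # file entries start with "-"; the path is the last field
      $1 ~ /^-/ {
        path = $NF
        sub(/\/[^\/]+$/, "", path)   # strip the filename, keep the parent dir
        count[path]++
      }
      END {
        for (dir in count)
          if (count[dir] > limit + 0) print count[dir], dir
      }'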

hdfs - get folder/file creation timestamp

I am trying to retrieve the creation timestamp for a specific folder stored in hdfs, but I did not find a command that can get this information.
Apparently, as the -help command states, the -stat command can only retrieve the modification date using the %y option:
bash$ hdfs dfs -help stat
-stat [format] <path> ... :
Print statistics about the file/directory at <path> in the specified format.
Format accepts filesize in blocks (%b), group name of owner(%g), filename (%n),
block size (%o), replication (%r), user name of owner(%u), modification date
(%y, %Y)
Is there some way to get the creation date?
HDFS stores only the modification time and access time of files, as per the HDFS INode code on GitHub.
The modified time for files is the time when the file was last closed (such as when originally written and closed, or when reopened for append and closed).
For most files placed on HDFS, the modification time does not change unless they undergo modifications as stated above. Hence, the modification time can be treated as an acceptable creation time most of the time (NOT ALWAYS).
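For reference, reading that modification time from the shell (the path is a placeholder):
hdfs dfs -stat "%y" /path/to/folder    # yyyy-MM-dd HH:mm:ss
hdfs dfs -stat "%Y" /path/to/folder    # milliseconds since the epoch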

Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.
Are there any tools or way I can achieve this?
Thank you
To delete data older than a certain time frame, you have a few options.
First, if the Hive table is partitioned by date, you could simply DROP the partitions within Hive and remove their underlying directories.
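A sketch of that first option, assuming a partition column named ds in yyyy-MM-dd format, GNU date, and a Hive version that accepts comparators in the partition spec (database and table names are placeholders):
# Drop every partition older than 13 weeks from a date-partitioned table.
CUTOFF=$(date -d "-13 weeks" +%Y-%m-%d)
hive -e "ALTER TABLE mydb.mytable DROP IF EXISTS PARTITION (ds < '${CUTOFF}');"
Dropping a partition on a managed table also removes its underlying directory; for an external table the directory has to be removed separately.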
A second option would be to run an INSERT into a new table, filtering out the old data using a datestamp (if available). This is likely not a good option since you have 100TB of data.
A third option would be to recursively list the data directories for your Hive tables: hadoop fs -lsr /path/to/hive/table. This will output a list of the files and their modification dates. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than what you want to keep, run hadoop fs -rm <file> on it.
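A sketch of that third option, assuming GNU date and the classic eight-column -lsr output; the delete is only echoed here so the candidate files can be reviewed first:
CUTOFF=$(date -d "-13 weeks" +%Y-%m-%d)

hadoop fs -lsr /path/to/hive/table \
  | awk '$1 ~ /^-/ {print $6, $NF}' \
  | while read -r filedate filepath; do
      # yyyy-MM-dd strings compare correctly as plain strings
      if [[ "$filedate" < "$CUTOFF" ]]; then
        echo hadoop fs -rm "$filepath"    # drop the echo to actually delete
      fi
    done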
A fourth option would be to grab a copy of the FSImage: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image. Next, turn it into a text file: hadoop oiv -i hdfs.image -o hdfs.txt. The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return.

Loading new files using Pig LOAD statement

I wanted to load data from HDFS into an HBase table using a Pig script.
I have an HDFS folder structure as below:
-rw-r--r-- 1 user supergroup 63 2014-05-15 20:28 dataparse/good/goodrec_051520142028
-rw-r--r-- 1 user supergroup 72 2014-05-15 20:30 dataparse/good/goodrec_051520142030
-rw-r--r-- 1 user supergroup 110 2014-05-15 20:32 dataparse/good/goodrec_051520142032
In the above, all filenames have the timestamp appended.
Below is my Pig script to load from HDFS into HBase:
G = LOAD '/user/user/dataparse/good/' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('t1:name t1:state t1:phone_no t1:gender');
The script is working fine and the data from all 3 files is written to the HBase "test" table.
Suppose after some time more files with the same structure arrive in HDFS; when I run the Pig script, it will LOAD all the files in the "good" directory, along with the files already read. So how can I load only the files which are new? Already loaded files should not be loaded again into my HBase table.
How can I do this?
Thanks,
Sapthashree
I think you have a few options here.
Using globs
Using a shell script, pick up the "new" files and use the glob feature so that multiple files can be fed into the script. A related use case is here.
If the files have a date and timestamp in the filename, then you can use globs directly; look here for inspiration.
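For example, with filenames stamped like goodrec_051520142028 (MMddyyyyHHmm), a wrapper could build a date-restricted glob and hand it to the script; load_good.pig is a hypothetical copy of the script above that loads '$input' instead of a hard-coded path:
# Feed only today's files to Pig by building a glob from today's date.
TODAY=$(date +%m%d%Y)
pig -param input="/user/user/dataparse/good/goodrec_${TODAY}*" load_good.pig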
Using big guns
If using globs is failing you, then you need to bring out the big guns: use a custom load function, put the logic to identify "new files" in it, and you should be good to go. Details here.
You need to have some scheduling mechanism where the Pig job runs from time to time. In this process you can process only the files which were not processed earlier, by keeping track of the timestamps and file names (or any other field), as sketched below.
See here for more information: Execute Pig from within Java Application
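A rough sketch of such a scheduled run, keeping a local list of already loaded files; the marker file location and load_good.pig (a parameterised copy of the script above) are assumptions:
PROCESSED=/var/lib/pigjobs/processed_files.txt
touch "$PROCESSED"

# Load each not-yet-seen file with Pig, then record it as processed.
hdfs dfs -ls /user/user/dataparse/good/ \
  | awk '$1 ~ /^-/ {print $NF}' \
  | while read -r f; do
      if ! grep -qxF "$f" "$PROCESSED"; then
        pig -param input="$f" load_good.pig && echo "$f" >> "$PROCESSED"
      fi
    done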

how to prevent hadoop corrupted .gz file

I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a webserver Java component, then rotated and closed by logback in .gz format. I've noticed that sometimes the .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous, since the aggregation for the whole day fails, and several slave nodes are marked as blacklisted in such a case.
What can I do in such a case?
Can the hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't. This error is normally associated with gzip files which haven't been closed out when originally written to local disk, or which are being copied to HDFS before they have finished being written.
You should be able to check by running an md5sum on the original file and on the one in HDFS; if they match, then the original file itself is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, check the timestamps on the two files; the one in HDFS should have been modified after the one on the local file system.
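A small sketch that combines both checks, testing the local gzip before the upload and comparing checksums afterwards; the paths are taken from the question, and copyFromLocal mirrors from the shell what copyFromLocalFile() does in the Java code:
SRC=/path/to/local/logfile.log_2013_02_20_07.close.gz
DST=/input/2013/02/20/logfile.log_2013_02_20_07.close.gz

# Refuse to upload a file that is truncated or still being written.
gzip -t "$SRC" || { echo "local gzip is incomplete or corrupt"; exit 1; }
hadoop fs -copyFromLocal "$SRC" "$DST"

# Compare checksums of the local file and the HDFS copy.
LOCAL_SUM=$(md5sum "$SRC" | awk '{print $1}')
HDFS_SUM=$(hadoop fs -cat "$DST" | md5sum | awk '{print $1}')
[ "$LOCAL_SUM" = "$HDFS_SUM" ] && echo "copy verified" || echo "checksum mismatch"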