hdfs - get folder/file creation timestamp - bash

I am trying to retrieve the creation timestamp for a specific folder stored in HDFS, but I have not found a command that can get this information.
Apparently, as the -help output points out, the -stat command can only retrieve the modification date, using the %y option:
bash$ hdfs dfs -help stat
-stat [format] <path> ... :
Print statistics about the file/directory at <path> in the specified format.
Format accepts filesize in blocks (%b), group name of owner(%g), filename (%n),
block size (%o), replication (%r), user name of owner(%u), modification date
(%y, %Y)
Is there some way to get the creation date?

HDFS stores only the modification time and access time of files, as can be seen in the HDFS INode source code on GitHub.
The modification time of a file is the time when the file was last closed (such as when it was originally written and closed, or when it was reopened for append and closed).
For most files placed on HDFS, the modification time does not change after the initial write unless the file undergoes one of the modifications described above. Hence, the modification time can usually be treated as an acceptable creation time, though NOT ALWAYS.
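For example, a minimal check that reads the modification time as a creation-time proxy (the path and output shown here are illustrative):
bash$ hdfs dfs -stat "%y" /user/alice/some_folder
2021-03-01 12:34:56
Keep in mind this is the last modification time, so it only approximates the creation time for files that have never been appended to or rewritten.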

Related

Reading the last 10 months of HDFS files

I need to select the HDFS files from the last 10 months, based on my HDFS path date:
/path/ds=<year-month-day>
Is there a way to do that using path wildcards, dynamically, while setting this information on a .conf file?
The expected result would be something like (considering today's month):
/path/ds=202{1,2}-{03-01}*/hour=*/*/*
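A minimal sketch under a few assumptions: GNU date is available, Hadoop's {a,b} glob alternation is used, and MONTHS_BACK could instead be read from the .conf file mentioned in the question:
MONTHS_BACK=10
months=$(for i in $(seq 0 $((MONTHS_BACK - 1))); do date -d "-${i} month" +%Y-%m; done | paste -sd, -)
hadoop fs -ls "/path/ds={${months}}-*/hour=*/*/*"
This builds a comma-separated list of months such as 2022-03,2022-02,... and lets the HDFS glob expand it, so the window shifts automatically with the current date.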

Input / Output error when using HDFS NFS Gateway

Getting "Input / output error" when trying work with files in mounted HDFS NFS Gateway. This is despite having set dfs.namenode.accesstime.precision=3600000 in Ambari. For example, doing something like...
$ hdfs dfs -cat /hdfs/path/to/some/tsv/file | sed -e "s/$NULL_WITH_TAB/$TAB/g" | hadoop fs -put -f - /hdfs/path/to/some/tsv/file
$ echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" /nfs/hdfs/path/to/some/tsv/file)"
when trying to remove nulls from a TSV and then inspect that TSV for nulls via the NFS location throws the error, but I am seeing it in many other places (again, dfs.namenode.accesstime.precision=3600000 is already set). Does anyone have any ideas why this may be happening, or debugging suggestions? Can anyone explain what exactly "access time" is in this context?
From discussion on the apache hadoop mailing list:
I think access time refers to the POSIX atime attribute for files, the “time of last access” as described here for instance (https://www.unixtutorial.org/atime-ctime-mtime-in-unix-filesystems). While HDFS keeps a correct modification time (mtime), which is important, easy and cheap, it only keeps a very low-resolution sense of last access time, which is less important, and expensive to monitor and record, as described here (https://issues.apache.org/jira/browse/HADOOP-1869) and here (https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time).
However, to have a conforming NFS API, you must present atime, and so the HDFS NFS implementation does. But first you have to turn it on. [...] many sites have been advised to turn it off entirely by setting it to zero, to improve overall HDFS performance. See for example here (https://community.hortonworks.com/articles/43861/scaling-the-hdfs-namenode-part-4-avoiding-performa.html, section “Don’t let Reads become Writes”). So if your site has turned off atime in HDFS, you will need to turn it back on to fully enable NFS. Alternatively, you can maintain optimum efficiency by mounting NFS with the “noatime” option, as described in the document you reference.
[...] check under /var/log, e.g. with find /var/log -name '*nfs3*' -print
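As a hedged sketch of the two remedies described above (the host and mount point are illustrative):
# re-enable atime tracking in hdfs-site.xml, then restart the NameNode:
#   dfs.namenode.accesstime.precision = 3600000   (1 hour, the default)
# or avoid client-side atime updates entirely by mounting with noatime:
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noatime <namenode-host>:/ /nfs/hdfs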

Hadoop: does the returned file size include the replication factor?

I have a file stored on HDFS and I need to get its size. I used the following line at the command prompt to get the file size:
hadoop fs -du -s train.csv | awk '{s+=$1} END {print s}'
I know that Hadoop stores duplicate copies of files as determined by the replication factor. So when I run the line above, is the returned size the file size times the replication factor, or just the file size?
From Hadoop documentation:
The du returns three columns with the following format:
size disk_space_consumed_with_all_replicas full_path_name
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
As you can see, the first column is the size of the file, while the second column is the space consumed by all replicas.
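For example, on a recent Hadoop version with a replication factor of 3 (the path and numbers here are hypothetical):
$ hadoop fs -du -s /user/alice/train.csv
1024  3072  /user/alice/train.csv
The first column (1024 bytes) is the logical file size you usually want; the second (3072 bytes) is the physical space consumed across all three replicas.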

Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.
Are there any tools or way I can achieve this?
Thank you
To delete data older than a certain time frame, you have a few options.
First, if the Hive table is partitioned by date, you could simply DROP the partitions within Hive and remove their underlying directories.
The second option would be to run an INSERT into a new table, filtering out the old data using a datestamp (if available). This is likely not a good option since you have 100TB of data.
A third option would be to recursively list the data directories for your Hive tables: hadoop fs -ls -R /path/to/hive/table. This will output a list of the files and their modification dates. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than you want to keep, run hadoop fs -rm <file> on it (a sketch of this approach follows the fourth option below).
A fourth option would be to grab a copy of the FSImage:
curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image
Next, turn it into a text file:
hadoop oiv -i hdfs.image -o hdfs.txt
The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return.
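As a rough sketch of the third option (the table path is an assumption, and GNU date is assumed for the cutoff arithmetic):
CUTOFF=$(date -d "-13 weeks" +%Y-%m-%d)
hadoop fs -ls -R /path/to/hive/table | awk -v cutoff="$CUTOFF" '$1 !~ /^d/ && $6 < cutoff {print $8}' | while read -r f; do
  hadoop fs -rm "$f"
done
This skips directories, compares the date column lexically (which works for YYYY-MM-DD), and removes any file last modified before the cutoff. Test with echo in place of hadoop fs -rm first.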

how to prevent hadoop corrupted .gz file

I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a webserver Java component and rotated and closed by logback in .gz format. I've noticed that sometimes a .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous: the aggregation for the whole day fails, and several slave nodes get blacklisted in such cases.
What can I do in such a case?
Can the hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't do - this error is normally associated with GZip files which haven't been closed out when originally written to local disk, or which are still being written when copied to HDFS.
You should be able to check by running an md5sum on the original file and on the copy in HDFS - if they match, then the original file itself is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, check the timestamps on the two files - the one in HDFS should have been modified after the one on the local file system.
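To prevent the problem at the source, one hedged option (file names are illustrative) is to verify the gzip stream is complete before uploading:
if gzip -t logfile.log_2013_02_20_07.close.gz; then
  hadoop fs -put logfile.log_2013_02_20_07.close.gz /input/2013/02/20/
else
  echo "gzip integrity check failed; skipping upload" >&2
fi
gzip -t exits non-zero on a truncated archive, so a file still being written by logback won't be shipped.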
