Merge multiple files recursively in HDFS - hadoop

My folder path structure in HDFS is something like this:
/data/topicname/year=2017/month=02/day=28/hour=00
/data/topicname/year=2017/month=02/day=28/hour=01
/data/topicname/year=2017/month=02/day=28/hour=02
/data/topicname/year=2017/month=02/day=28/hour=03
Inside these paths I have many small JSON files. I am writing a shell script that merges all the files inside each individual directory into a single file whose name depends on the path.
Example:
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=00 into one merged file full_2017_02_28_00.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=01 into one merged file full_2017_02_28_01.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=02 into one merged file full_2017_02_28_02.json and so on.
Naming the files in the pattern above is a secondary goal; for now I can hardcode the filenames. The real problem is that recursive concatenation across the directory structure is not happening.
So far, I have tried the following:
hadoop fs -cat /data/topicname/year=2017/* | hadoop fs -put - /merged/test1.json
Error:-
cat: `/data/topicname/year=2017/month=02/day=28/hour=00': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=01': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=02': Is a directory
Recursive cat does not happen in the attempt above.
hadoop fs -ls /data/topicname/year=2017/month=02 | find /data/topicname/year=2017/month=02/day=28 -name '*.json' -exec cat {} \; > output.json
Error:-
find: ‘/data/topicname/year=2017/month=02/day=28’: No such file or directory
In this attempt, find runs against the local filesystem instead of HDFS.
for i in `hadoop fs -ls -R /data/topicname/year=2017/ | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - /merged/output.json`; done
Error:-
A "cannot write output to stream" message is repeated multiple times, and the path /merged/output.json shows up several times in the errors.
How is this achievable? I do not want to use Spark.

Use -appendToFile:
for file in `hdfs dfs -ls -R /src_folder | awk '$2!="-" {print $8}'`; do hdfs dfs -cat $file | hdfs dfs -appendToFile - /target_folder/filename;done
The time taken will depend on the number and size of the files, as the process is sequential.
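For the hourly layout in the question, a minimal sketch of the same idea might look like this (untested; the source and target paths are taken from the question, and -appendToFile assumes append support is enabled on the cluster):
# Sketch: merge every file under one hour directory into a single HDFS file
# by appending each file in turn.
src=/data/topicname/year=2017/month=02/day=28/hour=00
target=/merged/full_2017_02_28_00.json
for file in $(hdfs dfs -ls -R "$src" | awk '$2!="-" {print $8}'); do
    hdfs dfs -cat "$file" | hdfs dfs -appendToFile - "$target"
done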

I was able to achieve my goal with the script below:
#!/bin/bash
# Loop over every month/day/hour combination and merge the files in each
# hour directory into a single JSON file under /merged/TEST1.
for k in 01 02 03 04 05 06 07 08 09 10 11 12
do
  for j in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  do
    for i in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    do
      # Concatenate everything in the hour directory into one merged file.
      hadoop fs -cat /data/topicname/year=2017/month=$k/day=$j/hour=$i/* | hadoop fs -put - /merged/TEST1/2017_${k}_${j}_${i}.json
      # Record the size of the merged file so that empty results can be removed.
      hadoop fs -du -s /merged/TEST1/2017_${k}_${j}_${i}.json > /home/test/sizetest.txt
      x=`awk '{ print $1 }' /home/test/sizetest.txt`
      echo $x
      if [ $x -eq 0 ]
      then
        # Drop merged files that came out empty (e.g. a non-existent day/hour).
        hadoop fs -rm /merged/TEST1/2017_${k}_${j}_${i}.json
      else
        echo "MERGE DONE!!! All files generated at hour $i of $j-$k-2017 merged into one"
        echo "DELETED 0 SIZED FILES!!!!"
      fi
    done
  done
done
rm -f /home/test/sizetest.txt
# Remove the source data once everything has been merged.
hadoop fs -rm -r /data/topicname
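A variant of the same merge, driven by the hour directories that actually exist instead of looping over every possible month/day/hour, is sketched below (untested; it assumes the year=/month=/day=/hour= layout from the question):
#!/bin/bash
# Sketch: list the existing hour directories and merge each one into
# /merged/TEST1/2017_MM_DD_HH.json.
hdfs dfs -ls -R /data/topicname/year=2017 | awk '$1 ~ /^d/ && $8 ~ /hour=/ {print $8}' |
while read -r dir; do
    # Turn .../month=02/day=28/hour=00 into 2017_02_28_00
    suffix=$(echo "$dir" | sed 's|.*month=\([0-9]*\)/day=\([0-9]*\)/hour=\([0-9]*\)|2017_\1_\2_\3|')
    hadoop fs -cat "$dir"/* | hadoop fs -put - /merged/TEST1/"$suffix".json
done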

Related

is it safe to remove the /tmp/hive/hive folder?

Is it safe to remove the /tmp/hive/hive folder (from HDFS), as the hdfs user?
hdfs dfs -rm -r /tmp/hive/hive
The reason is that under /tmp/hive/hive we have thousands of files and we can't delete them.
hdfs dfs -ls /tmp/hive/
Found 7 items
drwx------ - admin hdfs 0 2019-03-05 12:00 /tmp/hive/admin
drwx------ - drt hdfs 0 2019-06-16 14:02 /tmp/hive/drt
drwx------ - ambari-qa hdfs 0 2019-06-16 15:11 /tmp/hive/ambari-qa
drwx------ - anonymous hdfs 0 2019-06-16 08:57 /tmp/hive/anonymous
drwx------ - hdfs hdfs 0 2019-06-13 08:42 /tmp/hive/hdfs
drwx------ - hive hdfs 0 2019-06-13 10:58 /tmp/hive/hive
drwx------ - root hdfs 0 2018-07-17 23:37 /tmp/hive/root
What we have done until now, as shown below, is to remove the files that are older than 10 days, but because there are so many files, they are not being deleted at all:
hdfs dfs -ls /tmp/hive/hive | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'
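A more readable sketch of the same age-based cleanup (untested; it assumes GNU date and the same 10-day cutoff as the one-liner above):
#!/bin/bash
# Sketch: delete paths under /tmp/hive/hive whose modification time is older
# than 10 days, using the date and time columns from hdfs dfs -ls.
cutoff=$(date -d '10 days ago' +%s)
hdfs dfs -ls /tmp/hive/hive | awk 'NR > 1 {print $6, $7, $8}' |
while read -r d t path; do
    when=$(date -d "$d $t" +%s)
    if [ "$when" -lt "$cutoff" ]; then
        echo "Deleting: $path"
        hdfs dfs -rm -r "$path"
    fi
done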

Match closest filename based on timestamp

We have backups stored in S3 and need to retrieve a backup based on the nearest timestamp. The files are stored in S3 in YYYYMMDD_HHMM.tar.gz format:
20181009_0910.tar.gz
20181004_0719.tar.gz
20180925_0414.tar.gz
20180915_2210.tar.gz
Given a timestamp 20180922_1020, we need to fetch the file 20180925_0414.tar.gz using a shell script.
Thanks.
Using a Perl one-liner (of course, somewhat lengthy!).
I understand "nearest" to mean the shortest distance from the input in either direction. That is, if you have two files, Oct-1st.txt and Oct-30.txt, and the input is Oct-20, then the Oct-30 file will be the output.
$ ls -l *2018*gz
-rw-r--r-- 1 xxxx xxxx 0 Oct 17 00:04 20180915_2210.tar.gz
-rw-r--r-- 1 xxxx xxxx 0 Oct 17 00:04 20180925_0414.tar.gz
-rw-r--r-- 1 xxxx xxxx 0 Oct 17 00:04 20181004_0719.tar.gz
-rw-r--r-- 1 xxxx xxxx 0 Oct 17 00:04 20181009_0910.tar.gz
$ export input=20180922_1020
$ perl -ne 'BEGIN { @VAR=@ARGV; $in=$ENV{input}; $in=~s/_//g;foreach(@VAR) {$x=$_;s/.tar.gz//g;s/_//g;s/(\d+)/abs($1-$in)/e;$KV{$_}=$x};$res=(sort keys %KV)[0]; print "$KV{$res}"} ' 2018*gz
20180925_0414.tar.gz
$ export input=20180905_0101
$ perl -ne 'BEGIN { @VAR=@ARGV; $in=$ENV{input}; $in=~s/_//g;foreach(@VAR) {$x=$_;s/.tar.gz//g;s/_//g;s/(\d+)/abs($1-$in)/e;$KV{$_}=$x};$res=(sort keys %KV)[0]; print "$KV{$res}"} ' 2018*gz
20180915_2210.tar.gz
$
Hope, this helps!
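If Perl isn't to hand, here is a sketch of the same nearest-match selection in awk (untested; like the Perl above, it compares the concatenated YYYYMMDDHHMM digits numerically rather than computing real time differences):
# Sketch: pick the filename whose digit string is numerically closest to the input.
input=20180922_1020
target=${input//_/}          # 201809221020
ls 2018*.tar.gz | awk -v t="$target" '
{
    name = $0
    ts = name
    sub(/\.tar\.gz$/, "", ts)   # strip the extension
    gsub(/_/, "", ts)           # 20180925_0414 -> 201809250414
    d = ts - t; if (d < 0) d = -d
    if (best == "" || d < bestd) { best = name; bestd = d }
}
END { print best }'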

Csh - Fetching fields via awk inside xargs

I'm struggling to understand this behavior.
What the script should do: read a file (containing dates); print a list of the files in a multi-level directory tree and get their sizes; print the file size only (future step: sum the overall file sizes).
Starting script:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | head"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
[and so on]
But when I try to filter on the first field via awk, I still get the whole line:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print $1}'"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
I've already tried a divide-and-conquer approach, and the following command works just fine:
du -d 2 "/folder/" | grep '2000-03' | awk '{print $1}'
1000
I'm afraid that I'm missing something very trivial, but I haven't found anything so far.
Any idea? Thanks!
Input: directory containing folders named YYYY-MM-random_data and a file containing strings:
ls -l
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-03-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-04-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-05-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablb
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablc
[...]
cat dates
2000-03
2000-04
2000-05
[...]
Expected output: the sum of the disk space occupied by all the files contained in the folders whose names include the strings in the dates file:
2000-03: 1000
2000-04: 2123
2000-05: 1222112
[...]
======
But in particular, I'm interested in why awk is not able to fetch the column $1 that I asked it for.
Ok it seems I found the answer myself after a lot of research :D
I'll post it here, hoping that it will help somebody else out.
https://unix.stackexchange.com/questions/282503/right-syntax-for-awk-usage-in-combination-with-other-command-inside-xargs-sh-c
The trick was to escape the $ sign. Inside the double-quoted string, the calling shell expands $1 (to nothing) before sh -c and awk ever see it, so awk ends up running {print} and prints the whole line; escaping it as \$1 lets the literal $1 reach awk.
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print \$1}'"
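For comparison, a sketch that avoids nesting awk inside the double-quoted sh -c string entirely, and already produces the "date: sum" output asked for in the question (untested; /folder/ and the dates file are the names used above):
#!/bin/bash
# Sketch: for each date, sum the sizes reported by du for matching directories.
while IFS= read -r d; do
    total=$(du -d 2 /folder/ | grep "$d" | awk '{ sum += $1 } END { print sum + 0 }')
    echo "$d: $total"
done < dates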
Using GNU Parallel it looks like this:
parallel --tag "eval du -s folder/{}* | perl -ne '"'$s+=$_ ; END {print "$s\n"}'"'" :::: dates
--tag prepends the line with the date.
{} is replaced with the date.
eval du -s folder/{}* finds all the dirs starting with the date and gives the total du from those dirs.
perl -ne '$s+=$_ ; END {print "$s\n"}' sums up the output from du
Finally, there is a bit of quoting trickery to get it quoted correctly.

How to extract date from filename with extension using shell script

I am trying to extract the date from filenames, for the first two rows only, which have the .log extension.
For example, my_logFile.txt contains:
abc20140916_1.log
abhgg20140914_1.log
abf20140910_1.log
log.abc_abc20140909_1
The code I tried:
awk '{print substr($1,length($1)-3,4)}' my_logFile.txt
But I am getting the output as:
.log
.log
.log
I need the output as:
20140916
20140914
Revised query:
I have a txt file listing a number of log files. Each line in the txt file is like this:
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 abc20140405_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 aghtff20140404_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 log.pqrs20140403_1
I need to extract the date out of the file names from only the first two rows. Here the filename has a varying number of characters before the date.
The output should be:
20140405
20140404
Will this work for you?
$ head -2 file | grep -Po ' [a-z]+\K[0-9]+(?=.*\.log$)'
20140405
20140404
Explanation
head -2 file gets the first two lines of the file.
grep -Po ' [a-z]+\K[0-9]+(?=.*\.log$)' gets the set of digits in between a block of (space + a-z letters) and (.log + end of line).
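If grep -P (a GNU extension) isn't available, a sketch of the same extraction with sed (untested; the file name is the one from the question, and the pattern matches either the bare filenames or the ls -l style lines, as long as the line ends in .log):
# Capture the 8-digit date that sits between lowercase letters and "_<n>.log".
head -2 my_logFile.txt | sed -n 's/.*[a-z]\([0-9]\{8\}\)_[0-9]*\.log$/\1/p'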
try this,
cut -f9 -d " " <file> | grep -o -E "[0-9]{8}"
worked on my machine,
[root@giam20 ~]# cat sample.txt
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 abc20140405_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 aghtff20140404_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 log.pqrs20140403_1
[root@giam20 ~]# cut -f9 -d " " sample.txt | grep -o -E "[0-9]{8}"
20140405
20140404
20140403

pick up files based on dates in ksh script

I have this list of files. Now I have to pick the latest file based on some conditions:
3679 Jul 21 23:59 belk_rpo_error_**po9324892**_07212014.log
0 Jul 22 23:59 belk_rpo_error_**po9324892**_07222014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
0 Jul 20 05:50 belk_rpo_error_**po9999992**_07202014.log
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
742 Jul 21 07:30 belk_rpo_error_**po9999991**_07212014.log
0 Jul 23 2014 belk_rpo_error_**po9999991**_07232014.log
For a PARTICULAR Order_No (marked with ** **):
If the latest file is 0 kB, then we discard it (and the rest of the files with the same Order_No as well).
If the latest file is non-zero, then I take it (only the latest one).
Then append the contents to a txt file.
My expected output would be:
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
I am at my wits' end here. I can't seem to figure out how to compare dates in Unix. Any help is much appreciated.
You can try something like:
touch test.txt
for var in `find . ! -empty -exec ls -r {} \;`
do
  cat $var >> test.txt
done
untested
use stat to emit date (epoch time), size and filename.
use awk to filter out zero-length files and extract order number.
sort by order number and date
awk to pick up the last filename for each order number
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '
$2 > 0 {
split($3, a, /_/)
print a[4], $1, $3
}' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
NR > 1 && $1 != prev_order {print filename}
{filename = $3; prev_order = $1}
END {print filename}
'
The sort command might be wrong: In order to group by order number, you might need to sort first by file time then by order number.
If I understand your question, the resulting files need to be concatenated and appended to a file. If the above pipeline is working OK, then pipe into | xargs cat >> something.log
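For completeness, a sketch of that final append step (the script and output file names here are hypothetical):
# Assuming the pipeline above has been saved as latest_per_order.sh and prints
# one filename per line, append the contents of those files to a single file.
./latest_per_order.sh | xargs cat >> merged_orders.txt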
