is it safe to remove the /tmp/hive/hive folder? - hadoop

Is it safe to remove the /tmp/hive/hive folder from HDFS,
i.e. (run as the hdfs user):
hdfs dfs -rm -r /tmp/hive/hive
The reason we ask is that /tmp/hive/hive contains thousands of files and we can't delete them.
hdfs dfs -ls /tmp/hive/
Found 7 items
drwx------ - admin hdfs 0 2019-03-05 12:00 /tmp/hive/admin
drwx------ - drt hdfs 0 2019-06-16 14:02 /tmp/hive/drt
drwx------ - ambari-qa hdfs 0 2019-06-16 15:11 /tmp/hive/ambari-qa
drwx------ - anonymous hdfs 0 2019-06-16 08:57 /tmp/hive/anonymous
drwx------ - hdfs hdfs 0 2019-06-13 08:42 /tmp/hive/hdfs
drwx------ - hive hdfs 0 2019-06-13 10:58 /tmp/hive/hive
drwx------ - root hdfs 0 2018-07-17 23:37 /tmp/hive/root
What we have done so far, as shown below, is try to remove the files that are older than 10 days,
but because there are so many files, nothing actually gets deleted:
hdfs dfs -ls /tmp/hive/hive | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'
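For reference, a minimal hedged sketch of the same cleanup that computes the 10-day cutoff once instead of spawning a date process per file (assumes GNU date and that field 8 of hdfs dfs -ls output is the path):
#!/bin/bash
# Hedged sketch: delete entries under /tmp/hive/hive older than 10 days.
# CUTOFF is "YYYY-MM-DD HH:MM", which compares correctly as a string
# against the date (field 6) and time (field 7) columns of hdfs dfs -ls.
CUTOFF=$(date -d '10 days ago' '+%Y-%m-%d %H:%M')
hdfs dfs -ls /tmp/hive/hive \
  | awk -v cutoff="$CUTOFF" '$8 != "" && $6" "$7 < cutoff {print $8}' \
  | while read -r path; do
      echo "Deleting: $path"
      hdfs dfs -rm -r -skipTrash "$path"   # drop -skipTrash to keep the trash safety net
    done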

Related

Run a bash command and split the output by line not by space

I want to run the following command and capture the output in a variable, split into an array by line rather than by space:
files=$( hdfs dfs -ls -R $hdfsDir)
The output I get from echo $files is the following:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02 -rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148 -rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
When I do a for loop over $files, instead of getting the full line for each entry, I get individual columns. It prints like the following:
drwxr-xr-x
-
pepeuser
supergroup
What I need the for loop to print is this:
drwxr-xr-x - pepeuser supergroup 0 2016-05-27 15:03 /user/some/kpi/2015/01/02
-rw-r--r-- 3 pepeuser supergroup 55107934 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00000902148
-rw-r--r-- 3 pepeuser supergroup 49225279 2016-05-27 15:02 /user/some/kpi/2015/01/02/part-00001902148
If you have bash 4, you can use readarray:
readarray -t files < <(hdfs dfs -ls -R "$hdfsDir")
Otherwise, use read -a to read into an array. IFS=$'\n' sets the field separator to newlines and -d '' tells it to keep reading until it hits a NUL character: effectively, that means it'll read to EOF.
IFS=$'\n' read -d '' -r -a files < <(hdfs dfs -ls -R "$hdfsDir")
You can verify that the array is populated correctly with something like:
printf '[%s]\n' "${files[@]}"
And can loop over the array with:
for file in "${files[@]}"; do
echo "$file"
done
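If you then only need the path out of each listing line, a small hedged follow-up (the path is the last whitespace-separated field of hdfs dfs -ls -R output):
for file in "${files[@]}"; do
  # keep only the last field of the listing line, i.e. the HDFS path
  path=$(awk '{print $NF}' <<< "$file")
  echo "$path"
done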

Merge multiple files recursively in HDFS

My folder path structure in HDFS is something like this:
/data/topicname/year=2017/month=02/day=28/hour=00
/data/topicname/year=2017/month=02/day=28/hour=01
/data/topicname/year=2017/month=02/day=28/hour=02
/data/topicname/year=2017/month=02/day=28/hour=03
Inside these paths I have many small JSON files. I am writing a shell script that merges all the files inside each individual directory into a single file whose name depends on the path.
Example:
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=00 into one merged file full_2017_02_28_00.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=01 into one merged file full_2017_02_28_01.json
All JSONs inside /data/topicname/year=2017/month=02/day=28/hour=02 into one merged file full_2017_02_28_02.json and so on.
Keeping the file names in the above pattern is a secondary goal which I will try to achieve; for now I can hardcode the filenames.
The problem is that recursive concatenation across the directory structure is not happening.
So far, I have tried below:
hadoop fs -cat /data/topicname/year=2017/* | hadoop fs -put - /merged/test1.json
Error:-
cat: `/data/topicname/year=2017/month=02/day=28/hour=00': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=01': Is a directory
cat: `/data/topicname/year=2017/month=02/day=28/hour=02': Is a directory
Recursive cat does not happen in the attempt above.
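For reference, -cat only reads files, so the glob has to reach the file level; a hedged illustration (this still merges everything into one stream rather than one file per directory, and /merged/all_2017.json is just a made-up target name):
hadoop fs -cat '/data/topicname/year=2017/month=*/day=*/hour=*/*' | hadoop fs -put - /merged/all_2017.json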
hadoop fs -ls /data/topicname/year=2017/month=02 | find /data/topicname/year=2017/month=02/day=28 -name '*.json' -exec cat {} \; > output.json
Error:-
find: ‘/data/topicname/year=2017/month=02/day=28’: No such file or directory
In this attempt, find runs against the local filesystem instead of HDFS.
for i in `hadoop fs -ls -R /data/topicname/year=2017/ | cut -d' ' -f19` ;do `hadoop fs -cat $i/* |hadoop fs -put - /merged/output.json`; done
Error:-
The message "cannot write output to stream" is repeated multiple times,
and the name /merged/output.json appears a few times.
How is this achievable? I do not want to use Spark.
Use -appendToFile:
for file in `hdfs dfs -ls -R /src_folder | awk '$2!="-" {print $8}'`; do hdfs dfs -cat $file | hdfs dfs -appendToFile - /target_folder/filename;done
Time taken will be dependent on the number and size of files as the process is sequential.
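For example, applied to one of the hour directories from the question (a hedged sketch; the target name just follows the pattern the asker wanted):
hdfs dfs -cat /data/topicname/year=2017/month=02/day=28/hour=00/* | hdfs dfs -appendToFile - /merged/full_2017_02_28_00.json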
I was able to achieve my goal with the script below:
#!/bin/bash
# Loop over every month/day/hour partition of 2017, merge the files in
# each hour directory into one JSON under /merged/TEST1, and drop merged
# files that came out empty (hours with no input data).
for k in 01 02 03 04 05 06 07 08 09 10 11 12
do
  for j in 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  do
    for i in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    do
      hadoop fs -cat /data/topicname/year=2017/month=$k/day=$j/hour=$i/* | hadoop fs -put - /merged/TEST1/2017"_"$k"_"$j"_"$i.json
      hadoop fs -du -s /merged/TEST1/2017"_"$k"_"$j"_"$i.json > /home/test/sizetest.txt
      x=`awk '{ print $1 }' /home/test/sizetest.txt`
      echo $x
      if [ $x -eq 0 ]
      then
        # nothing was merged for this hour; remove the empty output file
        hadoop fs -rm /merged/TEST1/2017"_"$k"_"$j"_"$i.json
      else
        echo "MERGE DONE!!! All files generated at hour $i of $j-$k-2017 merged into one"
        echo "DELETED 0 SIZED FILES!!!!"
      fi
    done
  done
done
rm -f /home/test/sizetest.txt
# remove the source data once everything has been merged
hadoop fs -rm -r /data/topicname
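A hedged alternative for a single directory is getmerge, which concatenates via the local filesystem before pushing the result back to HDFS (the /tmp path and target directory are assumptions):
hadoop fs -getmerge /data/topicname/year=2017/month=02/day=28/hour=00 /tmp/full_2017_02_28_00.json
hadoop fs -put /tmp/full_2017_02_28_00.json /merged/TEST1/
rm -f /tmp/full_2017_02_28_00.json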

Hadoop fs -du -h sorting by size for M, G, T, P, E, Z, Y

I am running this command --
sudo -u hdfs hadoop fs -du -h /user | sort -nr
and the output is not sorted consistently across units (MB, GB, TB).
I found this command:
hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
but it did not seem to work.
Is there a way, or a command-line flag I can use, to make it sort so the output looks like this:
123T /xyz
124T /xyd
126T /vat
127G /ayf
123G /atd
hdfs dfs -du -h <PATH> | awk '{print $1$2,$3}' | sort -hr
Short explanation:
The hdfs command gets the input data.
The awk joins the first two fields (the number and its unit, e.g. 1.1G) and prints them together with the path; the comma in print inserts a space before the path.
The -h of sort compares human readable numbers like 2K or 4G, while the -r reverses the sort order.
hdfs dfs -du -h <PATH> | sed 's/ //' | sort -hr
sed will strip out the space between the number and the unit, after which sort will be able to understand it.
This is a rather old question, but I stumbled across it while trying to do the same thing. Because you were passing the -h (human-readable) flag, the sizes were converted to different units to make them easier for a human to read. By leaving that flag off we get the aggregate file lengths in bytes:
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr
Not as easy to read but means you can sort it correctly.
See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#du for more details.
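If you still want human-readable sizes in the sorted output, a hedged variant is to sort the raw byte counts first and convert only the size column afterwards with GNU numfmt (assumes numfmt from coreutils is available):
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr | awk '{print $1, $NF}' | numfmt --to=iec --field=1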
I would use a small script. It's primitive but reliable:
#!/bin/bash
PATH_TO_FOLDER="$1"
hdfs dfs -du -h $PATH_TO_FOLDER > /tmp/output
# print each size bucket separately (no unit, then K, M, G, T),
# sorting numerically within each bucket
cat /tmp/output | awk '$2 ~ /^[0-9]+$/ {print $1,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "K" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "M" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "G" ) print $1,$2,$NF}' | sort -k1,1n
cat /tmp/output | awk ' {if ($2 == "T" ) print $1,$2,$NF}' | sort -k1,1n
rm /tmp/output
Try this to sort the listing: hdfs dfs -ls -h /path | sort -r -n -k 5
-rw-r--r-- 3 admin admin 108.5 M 2016-05-05 17:23 /user/admin/2008.csv.bz2
-rw-r--r-- 3 admin admin 3.1 M 2016-05-17 16:19 /user/admin/warand_peace.txt
Found 11 items
drwxr-xr-x - admin admin 0 2016-05-16 17:34 /user/admin/oozie-oozi
drwxr-xr-x - admin admin 0 2016-05-16 16:35 /user/admin/Jars
drwxr-xr-x - admin admin 0 2016-05-12 05:30 /user/admin/.Trash
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_21
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_20
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_19
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_18
drwx------ - admin admin 0 2016-05-16 17:38 /user/admin/.staging

how to properly import csv data set using kite-dataset partitioned schema?

I'm working with the publicly available CSV dataset from MovieLens.
I have created a partitioned dataset for the ratings.csv:
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet
Here is my year-month.json:
[ {
"name" : "year",
"source" : "timestamp",
"type" : "year"
}, {
"name" : "month",
"source" : "timestamp",
"type" : "month"
} ]
Here is my csv import command:
kite-dataset csv-import ratings.csv ratings
After the import finished, I ran this command to see whether the year and month partitions were in fact created:
hadoop fs -ls /user/hive/warehouse/ratings/
What I've noticed is that only a single year partition was created, and inside it only a single month partition:
[cloudera#quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970
[cloudera#quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01
What is the proper way to conduct such a partitioned import, so that all year and month partitions are created?
Add three zeros to the end of the timestamp. The MovieLens timestamps are Unix epoch seconds, while the timestamp partitioner interprets the value as milliseconds, which is why everything lands in the year=1970 partition; appending three zeros (multiplying by 1000) fixes the scale.
Use the shell script below to do it:
#!/bin/bash
# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv
# output the first 10,000,000 rows to ratings_1.csv
# (head includes the header, and tail removes it);
# the timestamp is the 4th column of ratings.csv, so append "000" to that field
head -n 10000001 ratings.csv | tail -n +2 | awk -F, -v OFS=, '{ $4 = $4 "000"; print }' >> ratings_1.csv
# output the rest of the file to ratings_2.csv
# this starts at the line after the ratings_1 file stopped
tail -n +10000002 ratings.csv | awk -F, -v OFS=, '{ $4 = $4 "000"; print }' >> ratings_2.csv
I had this problem too, and it was resolved after adding the three zeros.
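A hedged single-pass alternative to the split-and-append approach, assuming the standard MovieLens ratings.csv layout (userId,movieId,rating,timestamp); ratings_ms.csv is just an illustrative output name:
# append "000" to the timestamp (4th column), keeping the header as-is
awk -F, -v OFS=, 'NR==1 {print; next} { $4 = $4 "000"; print }' ratings.csv > ratings_ms.csv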

specify time range using unix grep

Hi, I have a few files in HDFS and I need to extract the ones in a specific time range. How can I do that with a Unix command such as grep?
My HDFS listing looks like this:
-rw-rw-r-- 3 pscore hdpdevs 94461 2014-12-10 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-10_02-07-12-0
-rw-rw-r-- 3 pscore hdpdevs 974422 2014-12-11 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-11_02-07-10-0
-rw-rw-r-- 3 pscore hdpdevs 32854 2014-12-11 02:08 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-11_02-07-16-0
-rw-rw-r-- 3 pscore hdpdevs 1936753 2014-12-12 02:07 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-12_02-06-04-0
-rw-rw-r-- 3 pscore hdpdevs 79365 2014-12-12 02:07 /data/bus/pharma/shared/purch/availability_alert/proc/2014-12-12_02-06-11-0
I want to extract the files from 2014-12-11 09:00 to 2014-12-12 09:00.
I tried hadoop fs -ls /dabc | sed -n '/2014-12-11 09:00/ , /2014-12-12 09:00/p' but that doesn't work. Any help? I would like to use a grep-style command for this.
awk '$6FS$7 >= "2014-12-11 09:00" && $6FS$7 <= "2014-12-12 09:00"'
Can I do string comparison in awk?
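Yes, string comparison works here: $6 is the date and $7 the time, and because the format is zero-padded (YYYY-MM-DD HH:MM) the strings order correctly. A hedged usage sketch against the listing from the question, printing only the matching paths:
hadoop fs -ls /data/bus/pharma/shared/purch/availability_alert/proc | awk '$6" "$7 >= "2014-12-11 09:00" && $6" "$7 <= "2014-12-12 09:00" {print $NF}'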
