Hadoop fs -du -h sorting by size for M, G, T, P, E, Z, Y - bash

I am running this command:
sudo -u hdfs hadoop fs -du -h /user | sort -nr
and the output is not sorted across units (megabytes, gigabytes, terabytes).
I found this command:
hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
but it did not seem to work.
Is there a way, or a command line flag, I can use to make it sort so that the output looks like this:
123T /xyz
124T /xyd
126T /vat
127G /ayf
123G /atd
Please help.
Regards,
Mayur

hdfs dfs -du -h <PATH> | awk '{print $1$2,$3}' | sort -hr
Short explanation:
The hdfs command gets the input data.
The awk prints the size joined to its unit ($1$2) followed by the path ($3); the comma in print separates the two with a space in the output.
The -h of sort compares human readable numbers like 2K or 4G, while the -r reverses the sort order.
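For illustration, here is how sort -h orders human-readable sizes (the sizes and paths below are made up):
printf '1.5 G /a\n120 T /b\n800 M /c\n3.2 T /d\n' | awk '{print $1$2,$3}' | sort -hr
120T /b
3.2T /d
1.5G /a
800M /c
Note that -h requires GNU sort; it is not available in every sort implementation.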

hdfs dfs -du -h <PATH> | sed 's/ //' | sort -hr
sed strips out the first space on each line (the one between the number and the unit), after which sort -h can understand the value; the separator before the path is left alone.
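For example, with a made-up line of -du -h output:
echo '1.2 T /user/foo' | sed 's/ //'
1.2T /user/foo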

This is a rather old question, but I stumbled across it while trying to do the same thing. As you were providing the -h (human readable) flag, it was converting the sizes to different units to make them easier for a human to read. By leaving that flag off, we get the aggregate summary of file lengths (in bytes).
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr
Not as easy to read, but it means you can sort it correctly.
See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#du for more details.
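If you still want human-readable sizes in the end, one option (assuming GNU numfmt is available and that the byte count is in the first column) is to sort on the raw bytes and convert afterwards:
sudo -u hdfs hadoop fs -du -s '/*' | sort -nr | awk '{print $1, $NF}' | numfmt --field=1 --to=iec
The sort stays exact, while the sizes are printed as 12G, 3.4T, and so on.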

I would use a small script. It's primitive but reliable:
#!/bin/bash
PATH_TO_FOLDER="$1"
hdfs dfs -du -h "$PATH_TO_FOLDER" > /tmp/output
awk '$2 ~ /^[0-9]+$/ {print $1,$NF}' /tmp/output | sort -k1,1n
awk '{if ($2 == "K") print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '{if ($2 == "M") print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '{if ($2 == "G") print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '{if ($2 == "T") print $1,$2,$NF}' /tmp/output | sort -k1,1n
rm /tmp/output
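Saved as, say, sort_du.sh (the name is arbitrary), it would be used like this:
chmod +x sort_du.sh
./sort_du.sh /user
It prints the entries grouped by unit (plain bytes, then K, M, G, T), with each group sorted numerically.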

Try this to sort the listing by size:
hdfs dfs -ls -h /path | sort -r -n -k 5
-rw-r--r-- 3 admin admin 108.5 M 2016-05-05 17:23 /user/admin/2008.csv.bz2
-rw-r--r-- 3 admin admin 3.1 M 2016-05-17 16:19 /user/admin/warand_peace.txt
Found 11 items
drwxr-xr-x - admin admin 0 2016-05-16 17:34 /user/admin/oozie-oozi
drwxr-xr-x - admin admin 0 2016-05-16 16:35 /user/admin/Jars
drwxr-xr-x - admin admin 0 2016-05-12 05:30 /user/admin/.Trash
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_21
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_20
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_19
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_18
drwx------ - admin admin 0 2016-05-16 17:38 /user/admin/.staging
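Note that with -ls -h the size and its unit land in separate columns (fields 5 and 6), so sort -n -k 5 only compares the numbers and ignores the units. If you need an exact ordering across units, one sketch is to leave -h off so the sizes stay in plain bytes:
hdfs dfs -ls /path | sort -k5,5 -n -r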

Related

Is it safe to remove the /tmp/hive/hive folder?

Is it safe to remove the /tmp/hive/hive folder from HDFS, as the hdfs user, with
hdfs dfs -rm -r /tmp/hive/hive
The reason we ask is that under /tmp/hive/hive we have thousands of files and we can't delete them.
hdfs dfs -ls /tmp/hive/
Found 7 items
drwx------ - admin hdfs 0 2019-03-05 12:00 /tmp/hive/admin
drwx------ - drt hdfs 0 2019-06-16 14:02 /tmp/hive/drt
drwx------ - ambari-qa hdfs 0 2019-06-16 15:11 /tmp/hive/ambari-qa
drwx------ - anonymous hdfs 0 2019-06-16 08:57 /tmp/hive/anonymous
drwx------ - hdfs hdfs 0 2019-06-13 08:42 /tmp/hive/hdfs
drwx------ - hive hdfs 0 2019-06-13 10:58 /tmp/hive/hive
drwx------ - root hdfs 0 2018-07-17 23:37 /tmp/hive/root
What we have done so far, shown below, is to remove the files that are older than 10 days;
but because there are so many files, the files are not deleted at all.
hdfs dfs -ls /tmp/hive/hive | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) }}'
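For reference, the same age-based cleanup can be written as a plain shell loop, which is easier to read and tweak; this is only a sketch, and it assumes GNU date plus the usual hdfs dfs -ls layout (date, time and path in fields 6, 7 and 8):
NOW=$(date +%s)
CUTOFF=$((10 * 24 * 60 * 60))   # ten days, in seconds
hdfs dfs -ls /tmp/hive/hive | awk 'NR>1 {print $6, $7, $8}' | \
while read -r day time path; do
    AGE=$(( NOW - $(date -d "$day $time" +%s) ))
    if [ "$AGE" -gt "$CUTOFF" ]; then
        echo "Deleting: $path"
        hdfs dfs -rm -r "$path"
    fi
done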

How do I remove duplicated by position SNPs using PLink?

I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?
If we already have files in plink format then we should have .bim for binary plink files or .map for text plink files. In either case the positions are on the 3rd column and SNP names are on 2nd column.
We need to create a list of SNPs that are duplicated:
sort -k3n myFile.map | uniq -f2 -D | cut -f2 > dupeSNP.txt
Then run plink with --exclude flag:
plink --file myFile --exclude dupeSNP.txt --out myFileSubset
You can also do it directly in PLINK 1.9 using the --list-duplicate-vars flag
together with the <require-same-ref>, <ids-only>, or <suppress-first> modifiers, depending on what you want to do.
Check https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars for more details.
If you want to delete all occurrences of a variant with duplicates, you will have to use the --exclude flag on the output file of --list-duplicate-vars,
which should have a .dupvar extension.
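A minimal sketch of that two-step workflow (assuming a binary fileset called myFile; all names here are illustrative):
plink --bfile myFile --list-duplicate-vars ids-only --out dupes
plink --bfile myFile --exclude dupes.dupvar --make-bed --out myFileNoDups
Here ids-only makes the .dupvar report a plain list of variant IDs that --exclude can consume; add suppress-first instead if you want to keep one copy of each duplicated variant. Check the linked documentation for the exact modifier semantics.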
I should caution that the other two answers here yield different results. This is because the sort | uniq method only takes into account the SNP and bp location, whereas the PLINK method (--list-duplicate-vars) takes A1 and A2 into account as well.
Similar to sort | uniq on the .map file, we could use AWK on a .gen file, which looks like this:
22 rs1 12 A G 1 0 0 1 0 0
22 rs1 12 G A 0 1 0 0 0 1
22 rs2 16 C A 1 0 0 0 1 0
22 rs2 16 C G 0 0 1 1 0 0
22 rs3 17 T CTA 0 0 1 0 1 0
22 rs3 17 CTA T 1 0 0 0 0 1
# Get list of duplicate rsXYZ ID's
awk -F' ' '{print $2}' chr22.gen |\
sort |\
uniq -d > chr22_rsid_duplicates.txt
# Get list of duplicated bp positions
awk -F' ' '{print $3}' chr22.gen |\
sort |\
uniq -d > chr22_location_duplicates.txt
# Now match this list of bp positions to gen file to get the rsid for these locations
awk 'NR==FNR{a[$1]=$2;next}$3 in a{print $2}' \
chr22_location_duplicates.txt \
chr22.gen |\
sort |\
uniq \
> chr22_rsidBylocation_duplicates.txt
cat chr22_rsid_duplicates.txt \
chr22_rsidBylocation_duplicates.txt \
> tmp
# Get list of duplicates (by location and/or rsid)
cat tmp | sort | uniq > chr22_duplicates.txt
plink --gen chr22.gen \
--sample chr22.sample \
--exclude chr22_duplicates.txt \
--recode oxford \
--out chr22_noDups
This will classify rs2 as a duplicate; however, for the PLINK list-duplicate-vars method rs2 will not be flagged as a duplicate.
If one wants to obtain the same results using PLINK (a non-trivial task for BGEN file formats, since awk, sed etc. do not work on binary files!), you can use the --rm-dup command from PLINK 2.0. The list of all duplicate SNPs removed can be logged (to a file ending in .rmdup.list) using the list parameter, like so:
plink2 --bgen chr22.bgen \
--sample chr22.sample \
--rm-dup exclude-all list \
--export bgen-1.1 \
--out chr22_noDups   # exported as bgen version 1.1
Note: I'm saving the output as version 1.1 since plink1.9 still has commands not available in plink version 2.0. Therefore the only way to use bgen files with plink1.9 (at this time) is with the older 1.1 version.

Csh - Fetching fields via awk inside xargs

I'm struggling to understand this behavior:
Script behavior: read a file (containing dates); list the files in a multi-level directory tree and get their size; print the file size only (future step: sum the overall file sizes).
Starting script:
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | head"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
[and so on]
But when I try to filter via awk on the first field, I still get the whole line
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print $1}'"
2000-03:
1000 /folder/2000-03balbasldas
2000-04:
12300 /folder/2000-04asdwqdas
I've already approached it via divide-et-impera, and the following command works just fine:
du -d 2 "/folder/" | grep '2000-03' | awk '{print $1}'
1000
I'm afraid that I'm missing something very trivial, but I haven't found anything so far.
Any idea? Thanks!
Input: directory containing folders named YYYY-MM-random_data and a file containing strings:
ls -l
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-03-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-04-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-05-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablabla
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablb
drwxr-xr-x 2 user staff 68 Apr 24 11:21 2000-06-blablablc
[...]
cat dates
2000-03
2000-04
2000-05
[...]
Expected output: the sum of the disk space occupied by all the files contained in the folders whose names include the strings in the file dates
2000-03: 1000
2000-04: 2123
2000-05: 1222112
[...]
======
But in particular, I'm interested in why awk is not able to fetch column $1 as I asked it to.
Ok it seems I found the answer myself after a lot of research :D
I'll post it here, hoping that it will help somebody else out.
https://unix.stackexchange.com/questions/282503/right-syntax-for-awk-usage-in-combination-with-other-command-inside-xargs-sh-c
The trick was to escape the $ sign.
cat dates | xargs -I {} sh -c "echo '{}: '; du -d 2 "/folder/" | grep {} | awk '{print \$1}'"
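For the intended final step (summing the sizes per date), a plain loop avoids the quoting problem altogether; a sketch, assuming a POSIX shell and the same dates file and /folder/ layout:
while read -r d; do
    printf '%s: ' "$d"
    du -d 2 /folder/ | grep "$d" | awk '{s += $1} END {print s + 0}'
done < dates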
Using GNU Parallel it looks like this:
parallel --tag "eval du -s folder/{}* | perl -ne '"'$s+=$_ ; END {print "$s\n"}'"'" :::: dates
--tag prepends the line with the date.
{} is replaced with the date.
eval du -s folder/{}* finds all the dirs starting with the date and gives the total du from those dirs.
perl -ne '$s+=$_ ; END {print "$s\n"}' sums up the output from du
Finally, there is a bit of quoting trickery to get it quoted correctly.

How to use awk and sed to count number of elements in a column

There are some emails in my email account's inbox:
12:00 <harry#hotmail.com>
12:20 <harry#hotmail.com>
12:22 <jim#gmail.com>
12:30 <clare#bbc.org>
12:40 <harry#hotmail.com>
12:50 <jim#gmail.com>
12:55 <harry#hotmail.com>
I would like to use the command line (awk, sed, grep etc.) to count the number of emails I received from each person (and change all the minutes to :00). How can I do that?
I would prefer the result to look like this:
Number of email time From
3 12:00 <jim#gmail.com>
4 12:00 <harry#hotmail.com>
1 12:00 <clare#bbc.org>
I appreciate your help!
Here is how to do it with awk
awk '{a[$2]++} END {for (i in a) print a[i]"\t"i}' file
4 <harry#hotmail.com>
1 <clare#bbc.org>
2 <jim#gmail.com>
You may want to extract the address column and then use uniq after sort:
$ awk '{print $2}' file | sort | uniq -c
1 <clare#bbc.org>
4 <harry#hotmail.com>
2 <jim#gmail.com>
You can also get the header using printf:
$ printf "Number of email\temail\n%s\n" "$(sort file | uniq -c)"
Number of email email
1 <clare#bbc.org>
4 <harry#hotmail.com>
2 <jim#gmail.com>
We initially have to sort the file in order for uniq to work properly. From man uniq:
Filter adjacent matching lines from INPUT
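If you also want the time column normalised to :00, as in the requested output, one awk sketch (assuming every line is 'HH:MM <address>') is:
awk '{sub(/:[0-9]+$/, ":00", $1); key = $1 " " $2; c[key]++} END {for (k in c) print c[k], k}' file
4 12:00 <harry#hotmail.com>
2 12:00 <jim#gmail.com>
1 12:00 <clare#bbc.org>
(the lines may come out in a different order, since awk's for-in iteration is unordered).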

Problem with this code (show process) [ps Linux]

This code shows each user's process load (%CPU):
ps aux | awk 'NR!=1{print $1}' | sort | uniq | awk '{print "ps aux | grep "$1}' | awk '{printf $0; printf " | awk"; printf "{sum +="; print "$3}" }' | sed "s/{/ '{/g" | sed "s/3}/3} END {print \$1,sum}'/g" > 0.out
chmod 755 0.out
bash 0.out
On some OSes (e.g. Ubuntu) this code prints the same user repeated several times, example 1:
root 11.5
root 0
root 0
root 1.8
root 1.3
root 0
root 1.1
but on some OSes it prints each distinct (unique) user, example 2:
root 11.5
daemon 0
syslog 0 ....
How can I get output like example 2 only? I want each distinct user's %CPU.
You can replace all that with:
ps ahx -o "%U %C" | awk '
{cpu[$1] += $2}
END {for (user in cpu) {print user, cpu[user]}}
'
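If you also want the heaviest users listed first, you can append a numeric sort on the second column:
ps ahx -o "%U %C" | awk '{cpu[$1] += $2} END {for (user in cpu) print user, cpu[user]}' | sort -k2,2 -nr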
