How to sort in bash?

I have several lines in a file, like the ones below:
/adbc/eee/ddd/baa/
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/fff/ddd/c/
/adbc/ccc/ddd/bf/
/adbc/ccc/ddd/bc/
The sort must first extract the string before the last /, that is:
baa
avfff
b
c
bf
bc
and then sort by the first character, then by the length of the string, and then alphabetically.
The expected result is:
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/

You could use awk in a pre-processing step to add three columns, all based on the field of interest, feed the result to sort, then use cut to discard the extra fields:
awk -F'/' -v OFS="/" '{x=substr($(NF-1), 1, 1);
print(x, length($(NF-1)), $(NF-1), $0)}' file.txt |
sort -k1,1 -k2,2n -k3,3 -t'/' |
cut -f4- -d'/'
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/
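To see what sort receives, you can run the awk step on its own; each line gains three leading key fields (first character, length, and the last component itself), for example:
awk -F'/' -v OFS="/" '{x=substr($(NF-1), 1, 1);
print(x, length($(NF-1)), $(NF-1), $0)}' file.txt | head -n 2
b/3/baa//adbc/eee/ddd/baa/
a/5/avfff//adbc/fff/ddd/ccc/avfff/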

cat sortthisfile | while IFS= read -r line
do
  # strip the trailing slash, then everything up to the last remaining slash
  field=$( echo "$line" | sed -e 's:/$::' -e 's:.*/::' )
  firstchar=${field:0:1}
  fieldlen=${#field}
  echo "${firstchar},${fieldlen},${field},${line}"
done | sort -t, -k1,1 -k2,2n -k3,3 | cut -d, -f4-
Obviously, sortthisfile is the name of your file.
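For big files the shell loop gets slow; the same decorate-sort-undecorate idea can be packaged as a small function (a sketch; the name sort_by_last_component is made up, and it assumes the paths contain no commas):
sort_by_last_component() {
  # decorate each line with: first char, length, last path component
  awk -F'/' -v OFS=',' '{f=$(NF-1); print substr(f,1,1), length(f), f, $0}' "$1" |
  sort -t',' -k1,1 -k2,2n -k3,3 |
  cut -d',' -f4-   # undecorate: drop the three key fields
}
sort_by_last_component sortthisfile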

Related

How do I remove the header in the df command?

I'm trying to write a bash command that will sort all volumes by the amount of data they have used, and tried using:
df | awk '{print $1 | "sort -r -k3 -n"}'
Output:
map
devfs
Filesystem
/dev/disk1s5
/dev/disk1s2
/dev/disk1s1
But this also shows the header called Filesystem.
How do I remove that?
For your specific case, i.e. using awk, codeforester's answer (using awk's NR (Number of Records) variable) is the best.
More generally, to remove the first line of any output, you can use tail -n +N to start the output at line N:
df | tail -n +2 | other_command
This will remove the first line in df output.
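Applied to the question, that could look like this (a sketch; on typical df output the 3rd column is Used):
df | tail -n +2 | sort -rn -k3 | awk '{print $1}'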
Skip the first line, like this:
df | awk 'NR>1 {print $1 | "sort -r -k3 -n"}'
I normally use one of these options, if I have no reason to use awk:
df | sed 1d
The 1d argument tells sed to delete the first line and print everything else.
df | tail -n+2
The -n +2 option to tail says to start at line 2 and print everything until end of input.
I suspect sed is faster than awk or tail, but I can't prove it.
EDIT
If you want to use awk, this will print every line except the first:
df | awk '{if (FNR>1) print}'
FNR is the File Record Number, i.e. the line number within the current input file. If it is greater than 1, print the input line.
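For a single input, FNR and NR behave the same; the difference only matters with multiple files (file1 and file2 here are just placeholder names):
awk 'FNR == 1' file1 file2   # prints the first line of each file (FNR resets per file)
awk 'NR == 1' file1 file2    # prints only the very first line read (NR never resets)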
Count the lines in the output of df with wc, then subtract one to get a headerless df with tail:
LINES=$(df|wc -l)
LINES=$((${LINES}-1))
df | tail -n ${LINES}
OK, I see one-liners are wanted. Here is mine:
DF_HEADERLESS=$(LINES=$(df|wc -l); LINES=$((${LINES}-1));df | tail -n ${LINES})
And for formatted output, let printf loop over it:
printf "%s\t%s\t%s\t%s\t%s\t%s\n" ${DF_HEADERLESS} | awk '{print $1 | "sort -r -k3 -n"}'
This might help with GNU df and GNU sort:
df -P | awk 'NR>1{$1=$1; print}' | sort -r -k3 -n | awk '{print $1}'
With GNU df and GNU awk:
df -P | awk 'NR>1{array[$3]=$1} END{PROCINFO["sorted_in"]="#ind_num_desc"; for(i in array){print array[i]}}'
Documentation: 8.1.6 Using Predefined Array Scanning Orders with gawk
Removing something from a command's output can be done very simply using grep -v, so in your case:
df | grep -v "Filesystem" | ...
(You can put your awk where the ... is.)
When you're not sure about upper or lower case, you can add -i:
df | grep -i -v "FiLeSyStEm" | ...
(The mixed upper and lower case letters are meant as a joke to illustrate the point :-) )
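One caveat: an unanchored pattern would also drop a data line that happens to contain the word, so anchoring it is safer (a sketch, combined with the sorting from above):
df | grep -v '^Filesystem' | sort -rn -k3 | awk '{print $1}'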

How can I sort and list the word frequency in the second column?

My data input looks like this:
1RDD4_00022_02842 o220
1RDD4_00024_03137 o132
1RDD4_00035_05208 o216
1RDD4_00045_05573 o132
1RDD4_00046_02134 o132
1RDD4_00051_04040 o154
I want to sort the words in the right column in numerical order and list their frequency, so the output looks like this:
o132 3
o154 1
o216 1
o220 1
I've tried the following pipeline, but it only works on the input's left column, and I don't know how to modify it for the right column:
sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputfile | sort | uniq -c
To select the second column only, use
cat inputfile | cut -f2
or (a bit heavier)
cat inputfile | awk '{print $2}'
Note that cut -f2 assumes the columns are tab-separated; with space-separated columns use awk or cut -d' ' -f2. Then count the values:
cat inputfile | cut -f2 | sort | uniq -c
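That prints the count before the word; to get the requested word-first layout (sort | uniq -c already lists the words in sorted order), a sketch:
awk '{print $2}' inputfile | sort | uniq -c | awk '{print $2, $1}'
o132 3
o154 1
o216 1
o220 1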

Bash - Count number of occurences in textfile and display in descending order

I want to count the number of occurrences of each word in a text file and display them in descending order.
So far I have :
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This mostly gives satisfying output, except that it includes special characters like commas, full stops, exclamation marks and hyphens.
How can I modify the existing command to exclude the special characters mentioned above?
You can use tr -d with a string of the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or, use a POSIX character class knowing that boss-man for example would become bossman:
$ echo "abc, def. ghi! boss-man" | tr -d [:punct:]
abc def ghi bossman
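Plugged into the original pipeline, that could look like this (a sketch, keeping the question's sample.txt):
tr -d '[:punct:]' < sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr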
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone
How about first extracting words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr

How do I pipe commands inside a for loop in bash?

I am writing a bash script to iterate through the lines of a file that contain a given value.
The command I am using to list the possible values is:
cat file.csv | cut -d';' -f2 | sort | uniq | head
When I use it in a for loop like this, it stops working:
for i in $( cat file.csv | cut -d';' -f2 | sort | uniq | head )
do
# do something else with these lines
done
How can I use piped commands in a for loop?
You can use this awk command to get the sum of the 3rd column for each unique value of the 2nd column:
awk -F ';' '{sums[$2]+=$3} END{for (i in sums) print i ":", sums[i]}' file.csv
Input data:
asd;foo;0
asd;foo;2
asd;bar;1
asd;foo;4
Output:
foo: 6
bar: 1
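If you do want to loop over the piped output rather than aggregate it in awk, a while read loop fed by process substitution avoids the word-splitting pitfalls of for $( ... ) (a sketch):
while IFS= read -r value; do
  # do something with "$value"
  printf '%s\n' "$value"
done < <(cut -d';' -f2 file.csv | sort | uniq | head)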

Unix uniq command to CSV file

I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.
I have figured out the command to write the number of occurrences of each word, sorted from largest to smallest. That command is:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt
The problem is the way the new file (output.txt) is formatted. There are 3 leading spaces, followed by the number of occurrences, followed by a space, followed by the word, then on to the next line. Example:
9784 the
6368 and
4211 for
2929 to
What would I need to do in order to get the results in a more desired format, such as CSV? For example, I'd like it to be:
9784,the
6368,and
4211,for
2929,to
Even better would be:
the,9784
and,6368
for,4211
to,2929
Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?
Use awk as follows:
> cat input
9784 the
6368 and
4211 for
2929 to
> cat input | awk '{ print $2 "," $1}'
the,9784
and,6368
for,4211
to,2929
Your full pipeline will be:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
Use sed to strip the leading spaces and replace the separator with a comma:
cat extra_set.txt | sort -i | uniq -c | sort -nr | sed 's/^ *//; s/ /,/'
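If you prefer the word-first layout here as well, the same sed pass can swap the two fields (a sketch, assuming sed -E is available):
tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -nr |
sed -E 's/^ *([0-9]+) ([a-z]+)$/\2,\1/' > output.txt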
