awk length is counting +1 - bash

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk length count one more letter for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?

I'm gonna venture a guess. Isn't your awk expecting "U*X" style newlines (LF), but your dico.txt has Windows style (CR+LF). That easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8

Related

Inconsistency in output field separator

We have to find the difference(d) Between last 2 nos and display rows with the highest value of d in ascending order
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see there is space before and after $ sign but for the last column (avg) there is no space, please explain why its happening
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not mentiond | as the output field separator but still it appears, why is this happening and the difference is zero too
I am just 6 days old in unix,please answer even if its easy
your field separator is only the pipe symbol, so surrounding whitespace is part of the field definitions and that's what you see in the output. In combined uses pipe has the regex special meaning and need to be escaped. In your second case it means space or space is the field separator.
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
a slight rewrite will eliminate the number of fields dependency and fixes the format.

Sorting tab delimited numbers by column with pure bash script.

Im stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by value. The shell script must be pure bash script so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
#I perform the stats calculations
# for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 test.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in `seq 2 $((cols+1))`; do cut -f$i -d$'\t' input.txt; done | sort -n > output.txt
For rows, you can use the shell built-in printf with the format modifier %dfor integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ $total += $2 } END { print $total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If there are several different numbers appearing an equal amount of times, I want to print the highest number. For example, in a file like this:
10
10
10
15
15
20
20
20
20
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort < myfile | uniq -c | sort -r | head -1
without the count:
A=$(sort < myfile | uniq -c | sort -r | head -1)
set $A
echo $2
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.

bash uniq, how to show count number at back

Normally when I do cat number.txt | sort -n | uniq -c , I get numbers like this:
3 43
4 66
2 96
1 97
But what I need is the number shows of occurrences at the back, like this:
43 3
66 4
96 2
97 1
Please give advice on how to change this. Thanks.
Use awk to change the order of columns:
cat number.txt | sort -n | uniq -c | awk '{ print $2, $1 }'
Perl version:
perl -lne '$occ{0+$_}++; END {print "$_ $occ{$_}" for sort {$a <=> $b} keys %occ}' < numbers.txt
Through GNU sed,
cat number.txt | sort -n | uniq -c | sed -r 's/^([0-9]+) ([0-9]+)$/\2 \1/g'

Find most frequent line in file in bash

Suppose I have a file similar to as follows:
Abigail 85
Kaylee 25
Kaylee 25
kaylee
Brooklyn
Kaylee 25
kaylee 25
I would like to find the most repeated line, the output must be just the line.
I've tried
sort list | uniq -c
but I need clean output, just the most repeated line (in this example Kaylee 25).
Kaizen ~
$ sort zlist | uniq -c | sort -r | head -1| xargs | cut -d" " -f2-
Kaylee 25
does this help ?
IMHO, none of these answers will sort the results correctly. The reason is that sort, without the -n, option will sort like this "1 10 11 2 3 4", etc., instead of "1 2 3 4 10 11 12". So, add -n like so:
sort zlist | uniq -c | sort -n -r | head -1
You can then, of course, pipe that to either xargs or sed as described earlier.
awk -
awk '{a[$0]++; if(m<a[$0]){ m=a[$0];s[m]=$0}} END{print s[m]}' t.lis
$ uniq -c list | sort -r | head -1 | awk '{$1=""}1'
Kaylee 25
Is this what you're looking for?

Resources