Count line lengths in file using command line tools - bash

Problem
If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?
Example:
file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Running count_line_lengths file.txt would give:
Length Occurrences
1 1
2 2
4 3
5 1
6 2
7 2
Ideas?

This counts the line lengths using awk, then sorts the (numeric) line lengths using sort -n, and finally counts the unique line length values with uniq -c.
$ awk '{print length}' input.txt | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
In the output, the first column is the number of lines with the given length, and the second column is the line length.
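If you want the columns in the order the question asked for (length first, then count), one more awk stage swaps the fields; a minimal sketch of the same pipeline:
$ awk '{print length}' input.txt | sort -n | uniq -c | awk '{print $2, $1}'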

Pure awk
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt
4 3
5 1
6 2
7 2
1 1
2 2
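Note that for (i in a) iterates in an unspecified order; if you want the lengths in ascending order, you can simply pipe the result through sort -n:
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt | sort -n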

Using bash arrays:
#!/bin/bash
while IFS= read -r line; do
    ((histogram[${#line}]++))
done < file.txt
echo "Length Occurrence"
for length in "${!histogram[@]}"; do
    printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done
Example run:
$ ./t.sh
Length Occurrence
1 1
2 2
4 3
5 1
6 2
7 2

$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt
Output
6 2
1 1
4 3
7 2
2 2
5 1
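The hash keys come out in arbitrary order here as well; sorting them numerically should give ordered output (a small variation on the one-liner above):
$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (sort { $a <=> $b } keys %c);' file.txt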

Try this:
awk '{print length}' FILENAME
Or next if you want the longest length:
awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}'
You can combine the above command with find using the -exec option.
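For example, a sketch that prints the longest line length of every .txt file under the current directory (the *.txt pattern is just an illustration):
find . -name '*.txt' -exec awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}' {} \;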

You can accomplish this by using basic unix utilities only:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
How it works:
Here's the source file:
$ cat file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Replace each line of the source file with its length:
$ for line in $(cat file.txt); do printf $line | wc -c; done
4
2
1
6
4
4
7
5
2
7
6
Sort and count the number of length occurrences:
$ for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
Swap and format the numbers:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2

If you allow for the columns to be swapped and don't need the headers, something as easy as
while read line; do echo -n "$line" | wc -m; done < file | sort | uniq -c
(without any advanced tricks with sed or awk) will work. The output is:
1 1
2 2
3 4
1 5
2 6
2 7
One important thing to keep in mind: wc -c counts bytes, not characters, and will not give the correct length for strings containing multibyte characters; hence the use of wc -m.
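A quick illustration of the difference, assuming a UTF-8 locale where é is encoded as two bytes:
$ printf 'héllo' | wc -c
6
$ printf 'héllo' | wc -m
5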
References:
man uniq(1)
man sort(1)
man wc(1)

Related

One line bash command to find N oldest and newest files in a directory

I have the following command:
files=$(ls -lhrt dirname) && echo $files | head -5 && echo $files | tail -5
The idea is to return the oldest and newest 5 files in a directory dirname. This returns the requested data - however, the lines are jumbled together.
Is there a way to better format the output? (or perhaps a better way to write this functionality)?
Always quote variable expansions to prevent word splitting and globbing. When you leave $files unquoted, bash's word-splitting pass causes the newlines to be lost.
files=$(ls -lhrt dirname) && echo "$files" | head -5 && echo "$files" | tail -5
There's no real benefit from using the && operators. I'd just write:
files=$(ls -lhrt dirname)
echo "$files" | head -5
echo "$files" | tail -5
Or, better, swap the echo calls for <<< here-strings to avoid unnecessary subprocesses.
files=$(ls -lhrt dirname)
head -5 <<< "$files"
tail -5 <<< "$files"
head and tail together
(without having to store the whole output of the previous command in a variable)
Note: in what follows, the seq samples use the top 4 and last 4 lines, while the ls -lhrt dirname samples use the top 5 and last 5 lines.
First way: using head and tail consecutively
If you try:
seq 1 100000 | (head -n 4;tail -n 4;)
1
2
3
4
99997
99998
99999
100000
This seems to do the job, but
seq 1 1000 | (head -n 4;tail -n 4;)
1
2
3
4
This gives the wrong answer.
That is due to head buffering its input (it reads more than it prints, leaving little or nothing for tail), but bash lets you read the input unbuffered, line by line:
seq 1 12 | { for i in {1..4};do read foo;echo "$foo";done;tail -n 4 ;}
1
2
3
4
9
10
11
12
Finally, for your request, try this:
{ for i in {1..5};do read foo;echo "$foo";done;tail -n 5;} < <(ls -lhrt dirname)
which should match your need.
Second way: using both together, with the help of tee
Just look:
seq 1 12 | tee > >(tail -n4) >(head -n4)
1
2
3
4
9
10
11
12
But this could produce strangely interleaved output on the terminal; to prevent that, just pipe the whole thing to cat:
seq 1 12 | tee > >(tail -n4) >(head -n4) | cat
1
2
3
4
9
10
11
12
So
ls -lhrt dirname | tee > >(tail -n5) >(head -n5) | cat
should do the job.
Or even, if you want to play with bash and a big variable:
files=$(seq 1 12) out='' in=''
for i in {1..4};do
    in+=${files%%$'\n'*}$'\n'
    files=${files#*$'\n'}
    out=${files##*$'\n'}$'\n'${out}
    files=${files%$'\n'*}
done
echo "$in${out%$'\n'}"
1
2
3
4
9
10
11
12
Then, for your case:
files=$(ls -lhrt dirname) out='' in=''
for i in {1..5};do
    in+=${files%%$'\n'*}$'\n'
    files=${files#*$'\n'}
    out=${files##*$'\n'}$'\n'${out}
    files=${files%$'\n'*}
done
echo "$in${out%$'\n'}"
You could also use GNU sed:
seq 1 100000 | sed -e ':a;N;4p;5,${s/^[^\n]*\n//;};$!ba;'
1
2
3
4
99997
99998
99999
100000
Then
ls -lhrt dirname | sed -e ':a;N;5p;6,${s/^[^\n]*\n//;};$!ba;'
What about adding line breaks like so:
files=$(ls -lhrt dirname) && echo -e "${files}\n" | head -5 && echo -e "${files}\n" | tail -5
Explanation:
The -e flag enables echo to interpret escapes such as \n in this example.
\n itself is the escape sequence for "new line". So all it does is add a new line after the echoed variable.
${ } is parameter expansion (not brace expansion). Since the expansion is inside double quotes, the variable's value is kept together as a single string.
Even though the question asked for bash, here is the zsh line:
echo dirname/*(.om[1,5]) dirname/*(.om[-5,-1])
This returns a list of files with the 5 oldest and 5 newest files (based on modification time). Other solutions based on ls -lrth might return directories or links or pipes or anything else.
You can replace echo with anything, but since you asked for a way to find the files, the correct answer in zsh is the above glob on its own (no echo).
It works like this:
dirname/* : take all matching names
( : open the glob qualifier
. : return only plain files
om : sort them by modification time
[1,5] : return the first five ([-5,-1] returns the last five)
) : close the glob qualifier
More information on zsh globbing can be found here:
http://www.bash2zsh.com/zsh_refcard/refcard.pdf

Bash: extract column using empty lines as separators

I have a file like:
1
2
3
4
5

a
b
c
d
e
And I want to turn it into:
1 a
2 b
3 c
4 d
5 e
Is there a quick way to do it in bash?
pr is the tool to use for columnizing data:
pr -s" " -T -2 filename
With paste and process substitution:
$ paste -d " " <(sed -n '1,/^$/{/^$/d;p}' file) <(sed -n '/^$/,${//!p}' file)
1 a
2 b
3 c
4 d
5 e
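As another option (not one of the original answers, just a sketch assuming a single empty line separates the two blocks), awk can buffer the first block and pair it with the second:
awk 'NF==0{second=1; next} !second{first[++n]=$0; next} {print first[++m], $0}' file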
Simple bash script that does the job:
nums=()
is_line=0
cat "${1}" | while read -r line
do
    if [[ ${line} == '' ]]
    then
        is_line=1
    else
        if [[ ${is_line} == 0 ]]
        then
            nums=("${nums[@]}" "${line}")
        else
            echo "${nums[0]}" "${line}"
            nums=("${nums[@]:1}")
        fi
    fi
done
Run it like this: ./script filename
Example:
$ ./script filein
1 a
2 b
3 c
4 d
5 e
$ rs 2 5 <file | rs -T
1 a
2 b
3 c
4 d
5 e
If you want that extra separator space gone, add -g1 to the latter rs. Explained:
rs 2 5 : reshape the file into 2 rows and 5 columns
-T : transpose it
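So the gutter-free variant would look roughly like this (assuming your rs supports the -gN gutter option):
$ rs 2 5 <file | rs -T -g1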

Sum/Average numbers in a single line - UNIX

I'm working on a small script to take 3 numbers in a single line, sum and average them, and print the result at the end of the line. I know how to use the paste command, but everything I'm finding is telling me how to average a column. I need to average a line, not a column. Any advice? Thanks!
awk to the rescue!
$ echo 1 2 3 | awk -v RS=' ' '{sum+=$1; count++} END{print sum, sum/count}'
6 2
works for any number of input fields
$ echo 1 2 3 4 | awk -v RS=' ' '{sum+=$1; count++} END{print sum, sum/count}'
10 2.5
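If you want the sum and average appended to the end of each input line, as the question describes, a per-line loop over the fields is another option (a sketch; file is a placeholder name):
$ awk '{sum=0; for(i=1;i<=NF;i++) sum+=$i; print $0, sum, sum/NF}' file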
You can manipulate your line before giving it to bc. With bc you have additional possibilities such as setting the scale.
A simple mean from 1 2 3 would be
echo "1 2 3" | sed -e 's/\([0-9.]\) /\1+/g' -e 's/.*/(&)\/3/' | bc
You can wrap it in a function and see more possibilities:
function testit {
    echo "Input $@"
    echo "Integer mean"
    echo "$@" | sed -e 's/\([0-9.]\) /\1+/g' -e 's/.*/(&)\/'$#'/' | bc
    echo "floating decimal mean"
    echo "$@" | sed -e 's/\([0-9.]\) /\1+/g' -e 's/.*/(&)\/'$#'/' | bc -l
    echo "2 decimal output mean"
    echo "$@" | sed -e 's/\([0-9.]\) /\1+/g' -e 's/.*/scale=2; (&)\/'$#'/' | bc
    echo
}
testit 4 5 6
testit 4 5 8
testit 4.2 5.3 6.4
testit 1 2 3 4 5 6 7 8 9

Removing a specified number of lines from both head and tail of a stream

k=$1
m=$2
fileName=$3
head -n -$k "$fileName" | tail -n +$m
I have the above bash code.
When I execute it, it removes fewer lines than it should. For example, ./strip.sh 4 5 hi.txt > bye.txt should remove the first 4 lines and the last 5 lines, but it only removes the first 4 lines and the last 4 lines. Also, when I execute ./strip.sh 1 1 hi.txt > bye.txt, it only removes the last line, not the first line.
#!/bin/sh
tail -n +"$(( $1 + 1 ))" <"$3" | head -n -"$2"
Tested as follows:
set -- 4 5 /dev/stdin # assign $1, $2 and $3
printf '%s\n' {1..20} | tail -n +"$(( $1 + 1 ))" <"$3" | head -n -"$2"
...which correctly prints numbers between 5 and 15, trimming the first 4 from the front and 5 from the back. Similarly, with set -- 3 6 /dev/stdin, numbers between 4 and 14 inclusive are printed, which is likewise correct.
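The off-by-one in the original script comes from tail -n +$m, which starts printing at line m rather than skipping m lines. For example:
$ printf '%s\n' {1..10} | tail -n +5
5
6
7
8
9
10
removes only the first 4 lines, which is why the answer above uses $(( $1 + 1 )).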

Sorting decimals

Here is another question, about sorting a list with decimals:
$ list="1 2 5 2.1"
$ for j in "${list[@]}"; do echo "$j"; done | sort -n
1 2 5 2.1
I expected
1 2 2.1 5
If you intended that the variable list be an array, then you needed to say:
list=(1 2 5 2.1)
which would result in
1
2
2.1
5
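That is, with the array assignment the original loop works as intended; for example:
list=(1 2 5 2.1)
for j in "${list[@]}"; do echo "$j"; done | sort -n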
for j in $list; do echo $j; done | sort -n
or
printf '%s\n' $list|sort -n
You do not need "${list[@]}" but just $list, because list is just a string; otherwise all the numbers end up in the same field.
$ for j in $list; do echo $j; done | sort -n
1
2
2.1
5
With your previous code it was not sorting at all:
$ list="77 1 2 5 2.1 99"
$ for j in "${list[@]}"; do echo "$j"; done | sort -n
77 1 2 5 2.1 99
