Bash: extract column using empty lines as separators

I have a file like:
1
2
3
4
5

a
b
c
d
e
And I want to turn it into:
1 a
2 b
3 c
4 d
5 e
Is there a quick way to do it in bash?

pr is the tool to use for columnizing data:
pr -s" " -T -2 filename
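As a quick check, the sample input from the question can be piped straight in (this assumes GNU coreutils pr; the blank separator line lands at the bottom of the first column, so a trailing blank row may appear after "5 e"):

```shell
# Recreate the question's input on stdin and columnize it.
# -2: two columns (filled top-to-bottom), -T: no headers/pagination,
# -s" ": join columns with a single space instead of padding.
printf '1\n2\n3\n4\n5\n\na\nb\nc\nd\ne\n' | pr -s" " -T -2
```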

With paste and process substitution:
$ paste -d " " <(sed -n '1,/^$/{/^$/d;p}' file) <(sed -n '/^$/,${//!p}' file)
1 a
2 b
3 c
4 d
5 e
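The two process substitutions can be inspected on their own; the first sed prints the block before the blank line, the second the block after it (sample file recreated here with mktemp):

```shell
f=$(mktemp)
printf '1\n2\n3\n4\n5\n\na\nb\nc\nd\ne\n' > "$f"
# From line 1 to the first blank line; delete the blank, print the rest.
sed -n '1,/^$/{/^$/d;p}' "$f"
# From the blank line to EOF; the empty // reuses the last regex (^$),
# and ! prints only the lines that do NOT match it.
sed -n '/^$/,${//!p}' "$f"
rm -f "$f"
```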

Simple bash script that does the job:
nums=()
is_line=0
while read -r line
do
if [[ ${line} == '' ]]
then
is_line=1
else
if [[ ${is_line} == 0 ]]
then
nums=("${nums[@]}" "${line}")
else
echo "${nums[0]}" "${line}"
nums=("${nums[@]:1}")
fi
fi
done < "${1}"
Run it like this: ./script filename
Example:
$ ./script filein
1 a
2 b
3 c
4 d
5 e

$ rs 2 5 <file | rs -T
1 a
2 b
3 c
4 d
5 e
If you want that extra separator space gone, add -g1 to the second rs. Explained:
rs 2 5 prints the file in 2 rows and 5 columns
-T transposes it
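rs ships with BSD systems but is often missing on Linux; a portable sketch of the same pairing in awk's paragraph mode (with RS empty, records are separated by blank lines and each block's lines become fields):

```shell
# Record 1 holds the numbers, record 2 the letters; pair them up in END.
printf '1\n2\n3\n4\n5\n\na\nb\nc\nd\ne\n' |
awk -v RS= '{ for (i = 1; i <= NF; i++) col[NR, i] = $i; n = NF }
END { for (i = 1; i <= n; i++) print col[1, i], col[2, i] }'
```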

Related

select rows where values of two columns agree

if I have the following:
1 5 a
2 5 a
3 5 a
4 5 a
5 5 a
6 5 a
1 3 b
2 3 b
3 3 b
4 3 b
5 3 b
6 3 b
How do I only select rows where the two columns have the same value i.e.
5 5 a
3 3 b
in bash / awk / sed.
I know how to select rows with certain values using awk, but only when I specify the value.
Just say:
$ awk '$1==$2' file
5 5 a
3 3 b
As you see, when the condition $1 == $2 is accomplished, awk automatically prints the line.
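A pattern with no action block defaults to printing the whole record, so the one-liner is shorthand for this longer form (checked here with inline input):

```shell
printf '5 5 a\n1 3 b\n3 3 b\n' | awk '$1 == $2 { print $0 }'
```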
perl -ane 'print if $F[0] == $F[1]' file
For completeness:
bash
while read first second rest; do
[[ $first -eq $second ]] && echo "$first $second $rest"
done < file
or if content is not just integers:
while read first second rest; do
[[ $first == $second ]] && echo "$first $second $rest"
done < file
sed
sed -En '/^([^ ]+) \1 /p' file
This might work for you (GNU grep):
grep '^\(\S\+\)\s\+\1\s\+' file
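Note that this pattern requires something after the repeated field (the trailing \s\+), so it assumes at least three columns, as in the question; a quick check with inline input:

```shell
# \S and \s are GNU grep extensions; \1 backreferences the first field.
printf '5 5 a\n1 3 b\n3 3 b\n' | grep '^\(\S\+\)\s\+\1\s\+'
```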

why read command reads all words into first name

My script:
#!/bin/bash
IFS=','
read a b c d e f g <<< $(echo "1,2,3,4,5,6,7") # <- this could be any other command; I am just making up a dummy call
echo $a
echo $b
echo $c
I expected it to output
1
2
3
But instead it outputs:
1 2 3 4 5 6 7
blank line
blank line
What did I do wrong?
You should use it like this:
IFS=, read a b c d e f g <<< "1,2,3,4,5,6,7"
Use IFS in same line as read to avoid cluttering the current shell environment.
And avoid using command substitution just to capture the output of a single echo command.
If you want to use a command's output in read then better use process substitution in bash:
IFS=, read a b c d e f g < <(echo "1,2,3,4,5,6,7")
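A minimal sketch of why the prefix form is preferred: the IFS=, assignment is scoped to that single read, so the shell's own IFS is left alone:

```shell
oldIFS=$IFS
IFS=, read a b c <<< "x,y,z"
echo "$a $b $c"
# The assignment only lasted for the read; IFS is back to its old value:
[ "$IFS" = "$oldIFS" ] && echo "IFS unchanged"
```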
This works:
#!/bin/bash
IFS=','
read a b c d e f g <<< "$(echo "1,2,3,4,5,6,7")"
echo $a; echo $b; echo $c
Note the quoting: "$( ... )". Without it, the result of the command substitution is split on the commas (IFS is ,) and becomes
$(echo "1,2,3,4,5,6,7") ===> 1 2 3 4 5 6 7
Feeding 1 2 3 4 5 6 7 to read then produces no further splitting, as the IFS is , and no commas remain.
Of course, this also works (the IFS assignment applies only to the executed command, read):
#!/bin/bash
IFS=',' read a b c d e f g <<< "$(echo "1,2,3,4,5,6,7")"
echo $a; echo $b; echo $c
And it is even better like this:
#!/bin/bash
IFS=',' read a b c d e f g <<< "1,2,3,4,5,6,7"
echo $a; echo $b; echo $c
You do not need to "execute an echo" to get a variable, you already have it.
Technically, your code is correct. There is a bug in here-string handling in bash 4.3 and earlier that incorrectly applies word-splitting to the unquoted expansion of the command substitution. The following would work around the bug:
# Quote the expansion to prevent bash from splitting the expansion
# to 1 2 3 4 5 6 7
$ read a b c d e f g <<< "$(echo "1,2,3,4,5,6,7")"
as would
# A regular string is not split
$ read a b c d e f g <<< 1,2,3,4,5,6,7
In bash 4.4, this seems to be fixed:
$ echo $BASH_VERSION
4.4.0(1)-beta
$ IFS=,
$ read a b c d e f g <<< $(echo "1,2,3,4,5,6,7")
$ echo $a
1

Count line lengths in file using command line tools

Problem
If I have a long file with lots of lines of varying lengths, how can I count the occurrences of each line length?
Example:
file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Running count_line_lengths file.txt would give:
Length Occurrences
1 1
2 2
4 3
5 1
6 2
7 2
Ideas?
This
counts the line lengths using awk, then
sorts the (numeric) line lengths using sort -n, and finally
counts the unique line length values using uniq -c.
$ awk '{print length}' input.txt | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
In the output, the first column is the number of lines with the given length, and the second column is the line length.
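To get the exact layout the question asked for (length first, then count, plus a header), one more awk on the end swaps the columns, a sketch:

```shell
# Same pipeline as above, fed with the question's sample lines;
# the final awk prints a header and reverses uniq's "count length" order.
printf 'this\nis\na\nsample\nfile\nwith\nseveral\nlines\nof\nvarying\nlength\n' |
awk '{ print length }' | sort -n | uniq -c |
awk 'BEGIN { print "Length Occurrences" } { print $2, $1 }'
```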
Pure awk
awk '{++a[length()]} END{for (i in a) print i, a[i]}' file.txt
4 3
5 1
6 2
7 2
1 1
2 2
Using bash arrays:
#!/bin/bash
while read -r line; do
((histogram[${#line}]++))
done < file.txt
echo "Length Occurrence"
for length in "${!histogram[@]}"; do
printf "%-6s %s\n" "${length}" "${histogram[$length]}"
done
Example run:
$ ./t.sh
Length Occurrence
1 1
2 2
4 3
5 1
6 2
7 2
$ perl -lne '$c{length($_)}++ }{ print qq($_ $c{$_}) for (keys %c);' file.txt
Output
6 2
1 1
4 3
7 2
2 2
5 1
Try this:
awk '{print length}' FILENAME
Or this if you want the longest length:
awk '{ln=length} ln>max{max=ln} END {print FILENAME " " max}' FILENAME
You can combine the above command with find, using its -exec option.
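For example (the directory and glob here are made up; with \; the awk runs once per file found, so FILENAME and max are per-file):

```shell
# Hypothetical: print the longest line length of every .txt file
# under the current directory, one "path length" line per file.
find . -name '*.txt' -exec awk '{ ln = length } ln > max { max = ln } END { print FILENAME, max }' {} \;
```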
You can accomplish this by using basic unix utilities only:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
How it works?
Here's the source file:
$ cat file.txt
this
is
a
sample
file
with
several
lines
of
varying
length
Replace each line of the source file with its length:
$ for line in $(cat file.txt); do printf $line | wc -c; done
4
2
1
6
4
4
7
5
2
7
6
Sort and count the number of length occurrences:
$ for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c
1 1
2 2
3 4
1 5
2 6
2 7
Swap and format the numbers:
$ printf "%s %s\n" $(for line in $(cat file.txt); do printf $line | wc -c; done | sort -n | uniq -c | sed -E "s/([0-9]+)[^0-9]+([0-9]+)/\2 \1/")
1 1
2 2
4 3
5 1
6 2
7 2
If you allow for the columns to be swapped and don't need the headers, something as easy as
while read line; do echo -n "$line" | wc -m; done < file | sort | uniq -c
(without any advanced tricks with sed or awk) will work. The output is:
1 1
2 2
3 4
1 5
2 6
2 7
One important thing to keep in mind: wc -c counts the bytes, not the characters, and will not give the correct length for strings containing multibyte characters. Therefore the use of wc -m.
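A quick illustration of the difference (the byte count assumes the text is UTF-8 encoded, where é takes two bytes; wc -m is locale-dependent and falls back to counting bytes in the C/POSIX locale):

```shell
printf 'héllo' | wc -c   # bytes: 6 in UTF-8
printf 'héllo' | wc -m   # characters: 5 in a UTF-8 locale
```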
References:
man uniq(1)
man sort(1)
man wc(1)

Sorting decimals

Here is another question, about sorting a list with decimals:
$ list="1 2 5 2.1"
$ for j in "${list[@]}"; do echo "$j"; done | sort -n
1 2 5 2.1
I expected
1 2 2.1 5
If you intended that the variable list be an array, then you needed to say:
list=(1 2 5 2.1)
which would result in
1
2
2.1
5
for j in $list; do echo $j; done | sort -n
or
printf '%s\n' $list|sort -n
You do not need "${list[@]}", just $list, because list is a plain string, not an array. Quoted as "${list[@]}", it expands to a single word, so all the numbers end up in the same field.
$ for j in $list; do echo $j; done | sort -n
1
2
2.1
5
With your previous code it was not sorting at all:
$ list="77 1 2 5 2.1 99"
$ for j in "${list[@]}"; do echo "$j"; done | sort -n
77 1 2 5 2.1 99
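For comparison, if list really were an array, the quoted "${list[@]}" expansion would produce one word per element and the sort behaves as expected, a sketch:

```shell
# An actual array: each element becomes its own line for sort -n.
list=(77 1 2 5 2.1 99)
printf '%s\n' "${list[@]}" | sort -n
```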

How to combine columns that have the same headers within 1 file using Awk or Bash

I would like to know how to combine columns with duplicate headers in a file using bash/sed/awk.
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
to :
x y
s1 9 14
s2 13 16
s3 10 3
$ cat file
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
flds[$i] = flds[$i] " " i+1
}
printf "%-3s",""
for (hdr in flds) {
printf "%3s",hdr
}
print ""
next
}
{
printf "%-3s",$1
for (hdr in flds) {
n = split(flds[hdr],fldNrs)
sum = 0
for (i=1; i<=n; i++) {
sum += $(fldNrs[i])
}
printf "%3d",sum
}
print ""
}
$ awk -f tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
$ time awk -f ./tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.265s
user 0m0.030s
sys 0m0.108s
Adjust the printf lines in the obvious ways for different output formatting if you like.
Here's the bash equivalent in response to the comments elsethread. Do NOT use this, the awk solution is the right one, this is just to show how you should write it in bash IF you wanted to do that for some inexplicable reason:
$ cat tst.sh
declare -A flds
while IFS= read -r rec
do
lineNr=$(( lineNr + 1 ))
set -- $rec
if (( lineNr == 1 ))
then
fldNr=1
for fld
do
fldNr=$(( fldNr + 1 ))
flds[$fld]+=" $fldNr"
done
printf "%-3s" ""
for hdr in "${!flds[@]}"
do
printf "%3s" "$hdr"
done
printf "\n"
else
printf "%-3s" "$1"
for hdr in "${!flds[@]}"
do
fldNrs=( ${flds[$hdr]} )
sum=0
for fldNr in "${fldNrs[@]}"
do
eval val="\$$fldNr"
sum=$(( sum + val ))
done
printf "%3d" "$sum"
done
printf "\n"
fi
done < "$1"
$
$ time ./tst.sh file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.062s
user 0m0.031s
sys 0m0.046s
Note that it runs in roughly the same order of magnitude duration as the awk script (see comments elsethread). Caveat - I never write bash scripts for processing text files so I'm not claiming the above bash script is perfect, just an example of how to approach it in bash for comparison with the other script in this thread that I claimed should be rewritten!
This is not a one-liner. You can do it using Bash v4, Bash's associative arrays (dictionaries), and some shell tools.
Execute the script below with the name of the file to process as a parameter:
bash script_below.sh your_file
Here is the script:
declare -A coltofield
headerdone=0
# Take the first line of the input file and extract all fields
# and their position. Start with position value 2 because of the
# format of the following lines
while read line; do
colnum=$(echo $line | cut -d "=" -f 1)
field=$(echo $line | cut -d "=" -f 2)
coltofield[$colnum]=$field
done < <(head -n 1 $1 | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -v 2 -n ln | sed -e 's/[[:space:]]\+/=/g;')
# Read the rest of the file starting with the second line
while read line; do
declare -A computation
declare varname
# Turn the line in key value pair. The key is the position of
# the value in the line
while read value; do
vcolnum=$(echo $value | cut -d "=" -f 1)
vvalue=$(echo $value | cut -d "=" -f 2)
# The first value is the line variable name
# (s1, s2)
if [[ $vcolnum == "1" ]]; then
varname=$vvalue
continue
fi
# Get the name of the field by the column
# position
field=${coltofield[$vcolnum]}
# Add the value to the current sum for this field
computation[$field]=$((computation[$field]+${vvalue}))
done < <(echo $line | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -n ln | sed -e 's/[[:space:]]\+/=/g;')
if [[ $headerdone == "0" ]]; then
echo -e -n "\t"
for key in ${!computation[@]}; do echo -n -e "$key\t" ; done; echo
headerdone=1
fi
echo -n -e "$varname\t"
for value in ${computation[@]}; do echo -n -e "$value\t"; done; echo
computation=()
done < <(tail -n +2 $1)
Yet another AWK alternative:
$ cat f
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat f.awk
BEGIN {
OFS="\t";
}
NR==1 {
#need header for 1st column
for(f=NF; f>=1; --f)
$(f+1) = $f;
$1="";
for(f=1; f<=NF; ++f)
fld2hdr[f]=$f;
}
{
for(f=1; f<=NF; ++f)
if($f ~ /^[0-9]/)
colValues[fld2hdr[f]]+=$f;
else
colValues[fld2hdr[f]]=$f;
for (i in colValues)
row = row colValues[i] OFS;
print row;
split("", colValues);
row=""
}
$ awk -f f.awk f
x y
s1 9 14
s2 13 16
s3 10 3
$ awk 'BEGIN{print " x y"} a=$2+$4, b=$3+$5 {print $1, a, b}' file
x y
s1 9 14
s2 13 16
s3 10 3
No doubt there is a better way to display the heading but my awk is a little sketchy.
Here's a Perl solution, just for fun:
cat table.txt | perl -e'@h=grep{$_}split/\s+/,<>;while(@l=grep{$_}split/\s+/,<>){for$i(1..$#l){$t{$l[0]}{$h[$i-1]}+=$l[$i]}};printf " %s\n",(join" ",sort keys%{$t{(keys%t)[0]}});for$h(sort keys%t){printf"$h %s\n",(join " ",map{sprintf"%2d",$_}@{$t{$h}}{sort keys%{$t{$h}}})};'
