How to speed up reading txt file in bash script - bash

I am building a script that reads 24 hourly temperature data files per day and extracts a latitude-longitude region for a smaller domain. Each data file has three columns (temperature, longitude, latitude) and 188426 rows.
> ==> 20120810234500.txt <==
> 0.0362,-12.5000,33.5000
> -0.0188,-12.5000,33.5400
> -0.0732,-12.5000,33.5800
> -0.1263,-12.5000,33.6200
> -0.1778,-12.5000,33.6600
> -0.2278,-12.5000,33.7000
> -0.2761,-12.5000,33.7400
> -0.3226,-12.5000,33.7800
> -0.3677,-12.5000,33.8200
> -0.4115,-12.5000,33.8600
I have used for and while loops and the awk command to read the data, but it takes too long (at least for me) to read, extract and write the new, smaller file. Here is the relevant part of the script:
# Start 24 hours loop
lon1=-3
lon2=3
lat1=35
lat2=42
nhoras=24
n=1
while [ $n -le $nhoras ]
do
# File name (nom_file) and length (nstation=188426)
nom_file=`awk -v i=$n 'BEGIN { FS = ","} NR==i { print $1 }' lista_datos.txt`
nstation=`awk 'END{print NR}' $nom_file`
# Original data came from windows system and has carriage returns
dos2unix -q $nom_file
# Date, time values from file name
year=`echo $nom_file | cut -c 1-4`
month=`echo $nom_file | cut -c 5-6`
day=`echo $nom_file | cut -c 7-8`
hour=`echo $nom_file | cut -c 9-14`
# Part of the string to write in the new smaller file
var1=`echo $nom_file | awk '{print substr($0,1,4) " " substr($0,5,2) " " substr($0,7,2) " " substr($0,9,6)}'`
# Read rows 65000 to 125000 to gain processing time
m=65000
#while [ $m -le $nstation ] # Data extraction loop
while [ $m -le 125000 ] # Data extraction loop
do
station_id=$m
elevation=1.5
lat=`awk -v i=$m 'BEGIN { FS = ","} NR==i { print $3 }' $nom_file`
lon=`awk -v i=$m 'BEGIN { FS = ","} NR==i { print $2 }' $nom_file`
# As lon/lat are floating point I use this workaround to get a smaller region
lom1=`echo $lon'>'$lon1 | bc -l`
lom2=`echo $lon'<'$lon2 | bc -l`
lam1=`echo $lat'>'$lat1 | bc -l`
lam2=`echo $lat'<'$lat2 | bc -l`
if [ $lom1 -eq 1 ] && [ $lom2 -eq 1 ];
then
if [ $lam1 -eq 1 ] && [ $lam2 -eq 1 ];
then
# Second part of the string to write in the new smaller file
var2=`awk -v i=$m -v e=$elevation 'BEGIN { FS = ","} NR==i { print "'${station_id}' " $3 " " $2 " '${elevation}' 000 " $1 " 000" }' $nom_file`
# Paste
paste <(echo "$var1") <(echo "$var2") -d ' ' >> out.txt
fi # end of lat condition
fi # end of lon condition
m=$(( $m + 1 ))
done # End of extracting loop
# Save results
cat cabecera-dp-s.txt out.txt > dp-s$year-$month-$day-$hour
rm out.txt
n=$(( $n + 1 ))
done # End 24 hours loop
Right now it takes two hours to process a single input file. Is there any way to speed up the process?
Thanks in advance

Thanks for all the comments, and special thanks to @fedorqui.
With the right use of awk, processing speed has increased dramatically. My first attempt processed a single file in 2 hours; now 24 files are processed in 93 minutes. There should still be room for improvement, but for now this is fine for me. Thanks again.
I attach the script in case it is useful to someone:
#!/bin/bash
# PATHS
base=/home/meteo/PROJECTES/TERMED
dades=$base/DADES
files=$base/FILES
msg_data=$dades/MSG/Agosto
treball=$base/TREBALL
# START OF SCRIPT
cd $treball
rm *
# Header for final output
cp $files/cabecera-dp-s.txt ./
# Start of day loop
for dia
in 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
do
cp $msg_data/$dia/* ./
ls 2*.txt > lista_datos.txt
awk '{print substr($0,9,6)}' lista_datos.txt > lista_horas.txt
nhoras=`awk 'END{print NR}' lista_horas.txt`
# Start of hour loop
n=1
while [ $n -le $nhoras ]
do
# File name and size
nom_file=`awk -v i=$n 'BEGIN { FS = ","} NR==i { print $1 }' lista_datos.txt`
nstation=`awk 'END{print NR}' $nom_file`
# avoid carriage returns
dos2unix -q $nom_file
# Date values
year=`echo $nom_file | cut -c 1-4`
month=`echo $nom_file | cut -c 5-6`
day=`echo $nom_file | cut -c 7-8`
hour=`echo $nom_file | cut -c 9-14`
# Extract region, thanks fedorqui
awk -F, '$2>=-3 && $2<=3 && $3>=35 && $3<=42' $nom_file > output-$year$month$day$hour.txt
# Part 1 of the RAMS data line
var1=`echo $nom_file | awk '{print substr($0,1,4) " " substr($0,5,2) " " substr($0,7,2) " " substr($0,9,6)}'`
# station_id, latitude, longitude, elevation and temperature for each point
m=1
nstation=`awk 'END{print NR}' output-$year$month$day$hour.txt`
while [ $m -le $nstation ] # Data extraction loop
do
station_id=$m
elevation=1.5
# Part 2 of the RAMS data line
var2=`awk -v i=$m -v e=$elevation 'BEGIN { FS = ","} NR==i { print "'${station_id}' " $3 " " $2 " '${elevation}' 000 " $1 " 000" }' output-$year$month$day$hour.txt`
# Paste the two parts together to build the data line
paste <(echo "$var1") <(echo "$var2") -d ' ' >> out.txt
m=$(( $m + 1 ))
done # End of data extraction loop
# Save the output with the RAMS format and name
cat cabecera-dp-s.txt out.txt > dp-s$year-$month-$day-$hour
n=$(( $n + 1 ))
rm out.txt
done # End of hour loop
# Delete data to avoid conflicts with lista_horas, lista_datos
rm *txt
done # End of day loop
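The remaining room for improvement is probably the inner while loop, which still starts one awk process (plus a paste) per output row. A minimal sketch of how a single awk pass per file could strip the carriage return, filter the region, number the rows and format the line, assuming the same bounds and output layout as the script above. The sample file contents and the output name body.txt are made up for illustration:

```shell
# Hypothetical three-row sample in the question's format: temperature,longitude,latitude (CRLF endings).
nom_file=20120810234500.txt
printf '%s\r\n' '0.0362,-12.5000,33.5000' '1.2500,-2.5000,36.0000' '2.5000,2.9000,41.9600' > "$nom_file"

# One awk pass per file: no per-row subprocesses.
awk -F, -v f="$nom_file" -v elev=1.5 '
    # Date and time prefix taken from the file name: "YYYY MM DD HHMMSS"
    BEGIN { pre = substr(f,1,4) " " substr(f,5,2) " " substr(f,7,2) " " substr(f,9,6) }
    # Drop the Windows carriage return (replaces the dos2unix step)
    { sub(/\r$/, "") }
    # Keep rows inside the region; ++id numbers the rows kept (station_id)
    $2 >= -3 && $2 <= 3 && $3 >= 35 && $3 <= 42 {
        print pre, ++id, $3, $2, elev, "000", $1, "000"
    }' "$nom_file" > body.txt

cat body.txt
```

The header could then be prepended exactly as before, with cat cabecera-dp-s.txt body.txt.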

Related

"Integer expression expected" bash if statements

I'm trying to extract the xy coordinates of some earthquake occurrences along with their magnitudes from a file "seismic_c_am.txt", and plot them as circles of various sizes and colours based on the magnitude. Here is what I have so far:
25 i=`awk '{ FS = "|" ; print $11}' seismic_c_am.txt`
26
27 if [ "$i" -gt 7 ] ; then
28 awk 'NR%25==0 { FS = "|" ; print $4, $3}' seismic_c_am.txt | psxy $rgn $proj -Sc0.25c -Gred -O -K >> $psfile ;
29 fi
30
31 if [ "$i" -gt 5 ] && [ "$i" -le 7 ] ; then
32 awk 'NR%25==0 { FS = "|" ; print $4, $3}' seismic_c_am.txt | psxy $rgn $proj -Sc0.2c -Gorange -O -K >> $psfile ;
33 fi
34
35 if [ "$i" -le 5 ] ; then
36 awk 'NR%25==0 { FS = "|" ; print $4, $3}' seismic_c_am.txt | psxy $rgn $proj -Sc0.1c -Gyellow -O -K >> $psfile ;
37 fi
This script seems to just print all the magnitudes ($11) into the terminal, and the last line reads:
.
.
3.6
4.0
1.7
3.6 : integer expression expected
But I don't know which line this is referring to! Possibly line 27, 31 or 35? (see above)
Bash doesn't do floating point arithmetic, only integer arithmetic.
Since you're comparing with integers, you can make awk print the integer part.
i=`awk -F'|' '{ printf "%d", $11 }' seismic_c_am.txt`
If you want to know which line is causing these errors, add the command set -x to your script to turn on tracing mode: bash will print each script line before executing it. If you only want to trace part of the script, you can turn off tracing with set +x.
Since you're repeating the same snippet many times, you may want to restructure your script a bit.
i=`awk -F'|' '{ printf "%d", $11 }' seismic_c_am.txt`
if [ $i -ge 7 ]; then
sc_value=0.25 color=red
elif [ $i -ge 5 ]; then
sc_value=0.2 color=orange
else
sc_value=0.1 color=yellow
fi
awk -F'|' 'NR%25==0 { print $4, $3 }' seismic_c_am.txt |
psxy $rgn $proj -Sc${sc_value}c -G$color -O -K >> $psfile
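Note that the awk command substitution returns one magnitude per input row, so a single shell test cannot classify each event separately. Since awk compares floating-point values natively, one option is to let awk choose the symbol size and colour row by row. This is only a sketch: the sample file, its field layout ($3 and $4 as coordinates, $11 as magnitude) and the output file name are assumptions for illustration, and the psxy call is left out.

```shell
# Hypothetical sample: 11 pipe-separated fields per event.
printf '%s\n' \
    'a|b|10.5|20.5|e|f|g|h|i|j|7.2' \
    'a|b|11.0|21.0|e|f|g|h|i|j|5.5' \
    'a|b|12.0|22.0|e|f|g|h|i|j|3.6' > seismic_sample.txt

# -F sets the separator before the first record is read.
awk -F'|' '{
    if      ($11 > 7) { size = "0.25c"; color = "red" }
    else if ($11 > 5) { size = "0.2c";  color = "orange" }
    else              { size = "0.1c";  color = "yellow" }
    print $4, $3, size, color      # x, y, symbol size, fill colour
}' seismic_sample.txt > classified.txt

cat classified.txt
```

Each colour group could then be selected from classified.txt and piped to its own psxy call.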

awk on debian squeeze versus debian wheezy

The first part of my script works on a debian wheezy box:
OUTPUT_DIR=/share/es-ops/Build_Farm_Reports/WorkSpace_Reports
BASE=/export/ws
TODAY=`date +"%m-%d-%y"`
HOSTNAME=`hostname`
WORKSPACES=( "bob_avail" "bob_used" "mel_avail" "mel_used" "sideshow-ws2_avail" "sideshow-ws2_used" )
if ! [ -f $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == "sideshow" ]; then
echo "$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv # with a linebreak
separator="," # defined empty for the first value
for v in "${WORKSPACES[@]}"
do
echo -n "$separator$v" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
#separator="," # comma for the next values
done
echo >> $OUTPUT_DIR/$HOSTNAME.csv # add a linebreak (if you want it)
WORKSPACES2=( "bob" "mel" "sideshow-ws2" )
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
elif [ $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == "sideshow" ]; then
WORKSPACES2=( "bob" "mel" "sideshow-ws2" )
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
else
:
fi
and produces daily output like this each time the cron goes off at 3:00AM -8GMT:
sideshow
,bob_avail,bob_used,mel_avail,mel_used,sideshow-ws2_avail,sideshow-ws2_used
09-20-14,470400,1032124,661826,1032124,43443,1032108
09-20-15,470400,1032124,661826,1032124,43443,1032108
09-20-16,470400,1032124,661826,1032124,43443,1032108
But for some reason when I try to run it on these other 3 debian squeeze boxes I get triple commas between values:
case "$HOSTNAME" in
simpsons) WORKSPACES=(bart_avail bart_used homer_avail home_used lisa_avail lisa_used \
marge_avail marge_used releases_avail releases_used rt-private_avail rt-private_used \
simpsons-ws0_avail simpsons-ws0_used simpsons-ws1_avail simpsons-ws1_used simpsons-ws2_avail \
simpsons-ws2_used vsimpsons-ws_avail vsimpsons-ws_used) ;;
moes) WORKSPACES=(barney_avail barney_used carl_avail carl_used lenny_avail lenny_used moes-ws2_avail moes-ws2_used) ;;
flanders) WORKSPACES=(flanders-ws0_avail flanders-ws0_used flanders-ws1_avail flanders-ws1_used flanders-ws2_avail \
flanders-ws2_used maude_avail maude_used ned_avail ned_used rod_avail rod_used todd_avail \
todd_used to-delete_avail to-delete_used) ;;
esac
if ! [ -f $OUTPUT_DIR/$HOSTNAME.csv ]; then
echo "$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv # with a linebreak
separator="," # defined empty for the first value
for v in "${WORKSPACES[@]}"
do
echo -n "$separator$v" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
#separator="," # comma for the next values
done
echo >> $OUTPUT_DIR/$HOSTNAME.csv # add a linebreak (if you want it)
case "$HOSTNAME" in
simpsons) WORKSPACES2=(bart homer lisa marge releases rt-private simpsons-ws0 simpsons-ws1 simpsons-ws2 vsimpsons-ws) ;;
moes) WORKSPACES2=(barney carl lenny moes-ws2) ;;
flanders) WORKSPACES2=(flanders-ws0 flanders-ws1 flanders-ws2 maude ned rod todd to-delete) ;;
esac
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
elif [ $OUTPUT_DIR/$HOSTNAME.csv ]; then
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
else
:
fi
which looks like this:
simpsons
,bart_avail,bart_used,homer_avail,home_used,lisa_avail,lisa_used,marge_avail,marge_used,releases_avail,releases_used,rt-private_avail,rt-private_used,simpsons-ws0_avail,simpsons-ws0_used,simpsons-ws1_avail,simpsons-ws1_used,simpsons-ws2_avail,simpsons-ws2_used,vsimpsons-ws_avail,vsimpsons-ws_used
09-21-14,,,43417,1154259,,,2669,1195007,,,3427,1194249,,,2948,162602,,,128174,281377,,,967520,991870,,,85,168836,,,11995,1011937,,,780184,199511,,,14251,22408
Can you guys help me reduce the 3 commas to just 1 between values?
On these 3 boxes (simpsons, moes, and flanders), the only way to get the right avail and used values is to run awk like this:
df -m /export/ws/maude | awk '{if (NR!=1) {print $3, $2}}'
which looks like this:
492 163306
Otherwise if you run it like this:
df -m /export/ws/maude | awk '{print $3, $2}'
you get this:
Used 1M-blocks
492 163306
I fixed the triple comma issue with a work around:
OUTPUT_DIR=/share/es-ops/Build_Farm_Reports/WorkSpace_Reports
BASE=/export/ws
TODAY=`date +"%m-%d-%y"`
HOSTNAME=`hostname`
WORKSPACES=( "bob_avail" "bob_used" "mel_avail" "mel_used" "sideshow-ws2_avail" "sideshow-ws2_used" )
if ! [ -f $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == "sideshow" ]; then
echo "$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv # with a linebreak
separator="," # defined empty for the first value
for v in "${WORKSPACES[@]}"
do
echo -n "$separator$v" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
#separator="," # comma for the next values
done
echo >> $OUTPUT_DIR/$HOSTNAME.csv # add a linebreak (if you want it)
WORKSPACES2=( "bob" "mel" "sideshow-ws2" )
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
elif [ $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == "sideshow" ]; then
WORKSPACES2=( "bob" "mel" "sideshow-ws2" )
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
else
:
fi
case "$HOSTNAME" in
simpsons) WORKSPACES=(bart_avail bart_used homer_avail home_used lisa_avail lisa_used marge_avail marge_used releases_avail releases_used rt-private_avail rt-private_used simpsons-ws0_avail simpsons-ws0_used simpsons-ws1_avail simpsons-ws1_used simpsons-ws2_avail simpsons-ws2_used vsimpsons-ws_avail vsimpsons-ws_used) ;;
moes) WORKSPACES=(barney_avail barney_used carl_avail carl_used lenny_avail lenny_used moes-ws2_avail moes-ws2_used) ;;
flanders) WORKSPACES=(flanders-ws0_avail flanders-ws0_used flanders-ws1_avail flanders-ws1_used flanders-ws2_avail flanders-ws2_used maude_avail maude_used ned_avail ned_used rod_avail rod_used todd_avail todd_used to-delete_avail to-delete_used) ;;
esac
if ! [ -f $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == `hostname` ]; then
echo "$HOSTNAME" > $OUTPUT_DIR/$HOSTNAME.csv # with a linebreak
separator="," # defined empty for the first value
for v in "${WORKSPACES[@]}"
do
echo -n "$separator$v" >> $OUTPUT_DIR/$HOSTNAME.csv # append, concatenated, the separator and the value to the file
#separator="," # comma for the next values
done
echo >> $OUTPUT_DIR/$HOSTNAME.csv # add a linebreak (if you want it)
case "$HOSTNAME" in
simpsons) WORKSPACES2=(bart homer lisa marge releases rt-private simpsons-ws0 simpsons-ws1 simpsons-ws2 vsimpsons-ws) ;;
moes) WORKSPACES2=(barney carl lenny moes-ws2) ;;
flanders) WORKSPACES2=(flanders-ws0 flanders-ws1 flanders-ws2 maude ned rod todd to-delete) ;;
esac
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
sed -i s/,,,/,/g "$OUTPUT_DIR/$HOSTNAME.csv"
elif [ $OUTPUT_DIR/$HOSTNAME.csv ] && [ $HOSTNAME == `hostname` ]; then
case "$HOSTNAME" in
simpsons) WORKSPACES2=(bart homer lisa marge releases rt-private simpsons-ws0 simpsons-ws1 simpsons-ws2 vsimpsons-ws) ;;
moes) WORKSPACES2=(barney carl lenny moes-ws2) ;;
flanders) WORKSPACES2=(flanders-ws0 flanders-ws1 flanders-ws2 maude ned rod todd to-delete) ;;
esac
df -m "${WORKSPACES2[@]/#//export/ws/}" | awk '
BEGIN { "date +'%m-%d-%y'" | getline date;
printf "%s",date }
NR > 1 { printf ",%s,%s", $3, $2; }
END { printf "\n"}' >> "$OUTPUT_DIR/$HOSTNAME.csv"
sed -i s/,,,/,/g "$OUTPUT_DIR/$HOSTNAME.csv"
else
:
fi
I just put:
sed -i s/,,,/,/g "$OUTPUT_DIR/$HOSTNAME.csv"
I really wish I would have thought of how to do the awk part that eliminated the triple commas in the first place but at least my script works now.
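For what it's worth, a triple comma appears whenever awk processes a line on which both $2 and $3 are empty, because the printf then emits ",," just before the next record's leading comma. One plausible cause, which is an assumption and not confirmed here, is that df on these boxes wraps long device names onto a line of their own. A minimal sketch that skips such lines inside awk instead of patching the file with sed; the df output below is simulated, with made-up device names and numbers:

```shell
# Simulated df -m output in which each device name is wrapped onto its own line (hypothetical values).
df_output='Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/mapper/vg-long_device_name_maude
                        163306       492    154519   1% /export/ws/maude
/dev/mapper/vg-long_device_name_ned
                        168836        85    160175   1% /export/ws/ned'

# NF >= 3 skips lines that cannot supply both $2 and $3.
line=$(printf '%s\n' "$df_output" | awk -v date='09-21-14' '
    BEGIN { printf "%s", date }                     # fixed date keeps the sample reproducible
    NR > 1 && NF >= 3 { printf ",%s,%s", $3, $2 }
    END { printf "\n" }')

echo "$line"
```

In the real script the date would still come from the date command, and only the NF >= 3 condition is new.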

How to combine columns that have the same headers within 1 file using Awk or Bash

I would like to know how to combine columns with duplicate headers in a file using bash/sed/awk.
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
to :
x y
s1 9 14
s2 13 16
s3 10 3
$ cat file
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat tst.awk
NR==1 {
for (i=1;i<=NF;i++) {
flds[$i] = flds[$i] " " i+1
}
printf "%-3s",""
for (hdr in flds) {
printf "%3s",hdr
}
print ""
next
}
{
printf "%-3s",$1
for (hdr in flds) {
n = split(flds[hdr],fldNrs)
sum = 0
for (i=1; i<=n; i++) {
sum += $(fldNrs[i])
}
printf "%3d",sum
}
print ""
}
$ awk -f tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
$ time awk -f ./tst.awk file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.265s
user 0m0.030s
sys 0m0.108s
Adjust the printf lines in the obvious ways for different output formatting if you like.
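One caveat: awk visits array elements in an unspecified order in a for (hdr in flds) loop, so the output columns are not guaranteed to appear in header order. A sketch of a variant that records the order in which each header is first seen, under the same input layout as above. The names order, seen and col, and the file names, are illustrative only:

```shell
# Same sample data as in the question.
printf '%s\n' '   x y x y' 's1 3 4 6 10' 's2 3 9 10 7' 's3 7 1 3 2' > file_sample.txt

awk '
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if (!($i in seen)) { seen[$i]; order[++n] = $i }   # remember first-seen order
        col[i + 1] = $i            # data columns are shifted by one because of the row label
    }
    line = ""
    for (k = 1; k <= n; k++) line = line " " order[k]
    print "  " line
    next
}
{
    split("", sum)                 # portable way to clear the per-row sums
    for (i = 2; i <= NF; i++) sum[col[i]] += $i
    line = $1
    for (k = 1; k <= n; k++) line = line " " sum[order[k]]
    print line
}' file_sample.txt > summed.txt

cat summed.txt
```

The output here is separated by single spaces; printf with field widths can be substituted for aligned columns.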
Here's the bash equivalent in response to the comments elsethread. Do NOT use this, the awk solution is the right one, this is just to show how you should write it in bash IF you wanted to do that for some inexplicable reason:
$ cat tst.sh
declare -A flds
while IFS= read -r rec
do
lineNr=$(( lineNr + 1 ))
set -- $rec
if (( lineNr == 1 ))
then
fldNr=1
for fld
do
fldNr=$(( fldNr + 1 ))
flds[$fld]+=" $fldNr"
done
printf "%-3s" ""
for hdr in "${!flds[@]}"
do
printf "%3s" "$hdr"
done
printf "\n"
else
printf "%-3s" "$1"
for hdr in "${!flds[@]}"
do
fldNrs=( ${flds[$hdr]} )
sum=0
for fldNr in "${fldNrs[@]}"
do
eval val="\$$fldNr"
sum=$(( sum + val ))
done
printf "%3d" "$sum"
done
printf "\n"
fi
done < "$1"
$
$ time ./tst.sh file
x y
s1 9 14
s2 13 16
s3 10 3
real 0m0.062s
user 0m0.031s
sys 0m0.046s
Note that it runs in roughly the same order of magnitude duration as the awk script (see comments elsethread). Caveat - I never write bash scripts for processing text files so I'm not claiming the above bash script is perfect, just an example of how to approach it in bash for comparison with the other script in this thread that I claimed should be rewritten!
This is not a one-liner. You can do it using Bash v4, Bash's dictionaries (associative arrays), and some shell tools.
Execute the script below with the name of the file to process as a parameter:
bash script_below.sh your_file
Here is the script:
declare -A coltofield
headerdone=0
# Take the first line of the input file and extract all fields
# and their position. Start with position value 2 because of the
# format of the following lines
while read line; do
colnum=$(echo $line | cut -d "=" -f 1)
field=$(echo $line | cut -d "=" -f 2)
coltofield[$colnum]=$field
done < <(head -n 1 $1 | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -v 2 -n ln | sed -e 's/[[:space:]]\+/=/g;')
# Read the rest of the file starting with the second line
while read line; do
declare -A computation
declare varname
# Turn the line in key value pair. The key is the position of
# the value in the line
while read value; do
vcolnum=$(echo $value | cut -d "=" -f 1)
vvalue=$(echo $value | cut -d "=" -f 2)
# The first value is the line variable name
# (s1, s2)
if [[ $vcolnum == "1" ]]; then
varname=$vvalue
continue
fi
# Get the name of the field by the column
# position
field=${coltofield[$vcolnum]}
# Add the value to the current sum for this field
computation[$field]=$((computation[$field]+${vvalue}))
done < <(echo $line | sed -e 's/^[[:space:]]*//;' -e 's/[[:space:]]*$//;' -e 's/[[:space:]]\+/\n/g;' | nl -n ln | sed -e 's/[[:space:]]\+/=/g;')
if [[ $headerdone == "0" ]]; then
echo -e -n "\t"
for key in ${!computation[@]}; do echo -n -e "$key\t" ; done; echo
headerdone=1
fi
echo -n -e "$varname\t"
for value in ${computation[@]}; do echo -n -e "$value\t"; done; echo
computation=()
done < <(tail -n +2 $1)
Yet another AWK alternative:
$ cat f
x y x y
s1 3 4 6 10
s2 3 9 10 7
s3 7 1 3 2
$ cat f.awk
BEGIN {
OFS="\t";
}
NR==1 {
#need header for 1st column
for(f=NF; f>=1; --f)
$(f+1) = $f;
$1="";
for(f=1; f<=NF; ++f)
fld2hdr[f]=$f;
}
{
for(f=1; f<=NF; ++f)
if($f ~ /^[0-9]/)
colValues[fld2hdr[f]]+=$f;
else
colValues[fld2hdr[f]]=$f;
for (i in colValues)
row = row colValues[i] OFS;
print row;
split("", colValues);
row=""
}
$ awk -f f.awk f
x y
s1 9 14
s2 13 16
s3 10 3
$ awk 'BEGIN{print " x y"} NR>1 {print $1, $2+$4, $3+$5}' file
x y
s1 9 14
s2 13 16
s3 10 3
No doubt there is a better way to display the heading but my awk is a little sketchy.
Here's a Perl solution, just for fun:
cat table.txt | perl -e'@h=grep{$_}split/\s+/,<>;while(@l=grep{$_}split/\s+/,<>){for$i(1..$#l){$t{$l[0]}{$h[$i-1]}+=$l[$i]}};printf " %s\n",(join" ",sort keys%{$t{(keys%t)[0]}});for$h(sort keys%t){printf"$h %s\n",(join " ",map{sprintf"%2d",$_}@{$t{$h}}{sort keys%{$t{$h}}})};'

How can I align the columns of tables in Bash?

I want to format text as a table. I tried echoing with a '\t' separator, but it was misaligned.
Desired output:
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
Use the column command:
column -t -s' ' filename
printf is great, but people forget about it.
$ for num in 1 10 100 1000 10000 100000 1000000; do printf "%10s %s\n" $num "foobar"; done
         1 foobar
        10 foobar
       100 foobar
      1000 foobar
     10000 foobar
    100000 foobar
   1000000 foobar
$ for ((i=0; i<array_size; i++))
do
printf "%10s %10d %10s\n" "${stringarray[$i]}" "${numberarray[$i]}" "${anotherfieldarray[$i]}"
done
Notice I used %10s for strings. The s is the important part: it tells printf to format the argument as a string. The 10 is the minimum field width in characters, and values are right-aligned within it by default. %d is for decimal integers.
See man 1 printf for more info.
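For a table like the one in the question, where the text column is left-aligned, the - flag in the format specification pads on the right instead of the left. A small sketch using the question's sample values. The widths 30 and 10 and the output file name table.txt are arbitrary choices for illustration:

```shell
# %-30s pads the string on the right to 30 characters; %10d right-aligns the number in 10.
printf '%-30s %10d %s\n' 'a very long string..........' 112232432 anotherfield  > table.txt
printf '%-30s %10d %s\n' 'a smaller string'             123124343 anotherfield >> table.txt

cat table.txt
```

The width of the first column has to be at least the length of the longest string, otherwise the columns after it shift to the right.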
function printTable()
{
local -r delimiter="${1}"
local -r data="$(removeEmptyLines "${2}")"
if [[ "${delimiter}" != '' && "$(isEmptyString "${data}")" = 'false' ]]
then
local -r numberOfLines="$(wc -l <<< "${data}")"
if [[ "${numberOfLines}" -gt '0' ]]
then
local table=''
local i=1
for ((i = 1; i <= "${numberOfLines}"; i = i + 1))
do
local line=''
line="$(sed "${i}q;d" <<< "${data}")"
local numberOfColumns='0'
numberOfColumns="$(awk -F "${delimiter}" '{print NF}' <<< "${line}")"
# Add Line Delimiter
if [[ "${i}" -eq '1' ]]
then
table="${table}$(printf '%s#+' "$(repeatString '#+' "${numberOfColumns}")")"
fi
# Add Header Or Body
table="${table}\n"
local j=1
for ((j = 1; j <= "${numberOfColumns}"; j = j + 1))
do
table="${table}$(printf '#| %s' "$(cut -d "${delimiter}" -f "${j}" <<< "${line}")")"
done
table="${table}#|\n"
# Add Line Delimiter
if [[ "${i}" -eq '1' ]] || [[ "${numberOfLines}" -gt '1' && "${i}" -eq "${numberOfLines}" ]]
then
table="${table}$(printf '%s#+' "$(repeatString '#+' "${numberOfColumns}")")"
fi
done
if [[ "$(isEmptyString "${table}")" = 'false' ]]
then
echo -e "${table}" | column -s '#' -t | awk '/^\+/{gsub(" ", "-", $0)}1'
fi
fi
fi
}
function removeEmptyLines()
{
local -r content="${1}"
echo -e "${content}" | sed '/^\s*$/d'
}
function repeatString()
{
local -r string="${1}"
local -r numberToRepeat="${2}"
if [[ "${string}" != '' && "${numberToRepeat}" =~ ^[1-9][0-9]*$ ]]
then
local -r result="$(printf "%${numberToRepeat}s")"
echo -e "${result// /${string}}"
fi
}
function isEmptyString()
{
local -r string="${1}"
if [[ "$(trimString "${string}")" = '' ]]
then
echo 'true' && return 0
fi
echo 'false' && return 1
}
function trimString()
{
local -r string="${1}"
sed 's,^[[:blank:]]*,,' <<< "${string}" | sed 's,[[:blank:]]*$,,'
}
SAMPLE RUNS
$ cat data-1.txt
HEADER 1,HEADER 2,HEADER 3
$ printTable ',' "$(cat data-1.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
$ cat data-2.txt
HEADER 1,HEADER 2,HEADER 3
data 1,data 2,data 3
$ printTable ',' "$(cat data-2.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
| data 1 | data 2 | data 3 |
+-----------+-----------+-----------+
$ cat data-3.txt
HEADER 1,HEADER 2,HEADER 3
data 1,data 2,data 3
data 4,data 5,data 6
$ printTable ',' "$(cat data-3.txt)"
+-----------+-----------+-----------+
| HEADER 1 | HEADER 2 | HEADER 3 |
+-----------+-----------+-----------+
| data 1 | data 2 | data 3 |
| data 4 | data 5 | data 6 |
+-----------+-----------+-----------+
$ cat data-4.txt
HEADER
data
$ printTable ',' "$(cat data-4.txt)"
+---------+
| HEADER |
+---------+
| data |
+---------+
$ cat data-5.txt
HEADER
data 1
data 2
$ printTable ',' "$(cat data-5.txt)"
+---------+
| HEADER |
+---------+
| data 1 |
| data 2 |
+---------+
REF LIB at: https://github.com/gdbtek/linux-cookbooks/blob/master/libraries/util.bash
To have the exact same output as you need, you need to format the file like this:
a very long string..........\t 112232432\t anotherfield\n
a smaller string\t 123124343\t anotherfield\n
And then using:
$ column -t -s $'\t' FILE
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
It's easier than you might think.
If you are working with a separated-by-semicolon file and header too:
$ (head -n1 file.csv && sort file.csv | grep -v <header>) | column -s";" -t
If you are working with an array (using tab as separator):
for ((i=0; i<array_size; i++));
do
echo "${stringarray[$i]}" $'\t' "${numberarray[$i]}" $'\t' "${anotherfieldarray[$i]}" >> tmp_file.csv
done;
cat tmp_file.csv | column -t -s $'\t'
awk solution that deals with stdin
Since column is not POSIX, maybe this is:
mycolumn() (
file="${1:--}"
if [ "$file" = - ]; then
file="$(mktemp)"
cat > "${file}"
fi
awk '
NR == FNR {
for (i = 1; i <= NF; i++) {
l = length($i)
if (w[i] < l)
w[i] = l
}
next
}
{
for (i = 1; i <= NF; i++)
printf "%*s", w[i] + (i > 1 ? 1 : 0), $i
print ""
}
' "$file" "$file"
if [ "${1:--}" = - ]; then
rm "$file"
fi
)
Test:
printf '12 1234 1
12345678 1 123
1234 123456 123456
' > file
Test commands:
mycolumn file
mycolumn <file
mycolumn - <file
Output for all:
      12   1234      1
12345678      1    123
    1234 123456 123456
See also:
Using awk to align columns in text file?
AWK: go through the file twice, doing different tasks
I am not sure where you were running this, but the code you posted would not produce the output you gave, at least not in the Bash version that I'm familiar with.
Try this instead:
stringarray=('test' 'some thing' 'very long long long string' 'blah')
numberarray=(1 22 7777 8888888888)
anotherfieldarray=('other' 'mixed' 456 'data')
array_size=4
for((i=0;i<array_size;i++))
do
echo ${stringarray[$i]} $'\x1d' ${numberarray[$i]} $'\x1d' ${anotherfieldarray[$i]}
done | column -t -s$'\x1d'
Note that I'm using the group separator character (0x1D) instead of tab, because if you are getting these arrays from a file, they might contain tabs.
Just in case someone wants to do that in PHP, I posted a gist on GitHub:
https://gist.github.com/redestructa/2a7691e7f3ae69ec5161220c99e2d1b3
Simply call:
$output = $tablePrinter->printLinesIntoArray($items, ['title', 'chilProp2']);
You may need to adapt the code if you are using a PHP version older than 7.2.
After that, call echo or writeLine depending on your environment.
The below code has been tested and does exactly what is requested in the original question.
Parameters:
%30s: a 30-character column, text right-aligned.
%10d: integer notation; %10s will also work.
stringarray[0]="a very long string.........."
# 28Char (max length for this column)
numberarray[0]=1122324333
# 10digits (max length for this column)
anotherfield[0]="anotherfield"
# 12Char (max length for this column)
stringarray[1]="a smaller string....."
numberarray[1]=123124343
anotherfield[1]="anotherfield"
printf "%30s %10d %13s" "${stringarray[0]}" ${numberarray[0]} "${anotherfield[0]}"
printf "\n"
printf "%30s %10d %13s" "${stringarray[1]}" ${numberarray[1]} "${anotherfield[1]}"
# a var string with spaces has to be quoted
printf "\n Next line will fail \n"
printf "%30s %10d %13s" ${stringarray[0]} ${numberarray[0]} "${anotherfield[0]}"
a very long string.......... 1122324333 anotherfield
a smaller string..... 123124343 anotherfield
column -t skips empty fields when a line starts with a delimiter character or when there are two or more consecutive delimiter characters:
$ printf %s\\n a,b,c a,,c ,b,c|column -s, -t
a b c
a c
b c
Therefore I use this awk function instead (it requires gawk because it uses arrays of arrays):
$ tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf("%-"(b[i]+n)"s",a[h][i]);print""}}' n="${2-1}" "${1+FS=$1}"|sed 's/ *$//';}
$ printf %s\\n a,b,c a,,c ,b,c|tab ,
a b c
a c
b c
If your data doesn't contain the equal sign ("=") anywhere, you can use it as a shell-friendly delimiter for column without having to escape anything.
Setting FS to match either a tab ("\t") with any number of spaces (" ") or tabs on either side of it, or a contiguous run of 2 or more spaces, also lets the input data contain single spaces within each field:
echo "${inputdata2}" |
mawk NF=NF OFS== FS=' + |[ \t]*\t[ \t]*' |
column -s= -t
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
If the data does contain the equal sign, use a combined separator that is very unlikely to occur in typical data:
gawk -e NF=NF OFS='\301\372\5' FS=' + |[ \t]*\t[ \t]*' |
LC_ALL=C column -s$'\301\372\5' -t
a very long string.......... 112232432 anotherfield
a smaller string 123124343 anotherfield
And if your data only has 2 columns, and you have a ballpark sense of how wide the first field is, you can use this \r trick for nice on-screen formatting (but the separators don't become runs of spaces if you need to send the output down a pipe):
# each \t is 8-spaces at console terminal
mawk NF=2 FS=' + |[ \t]*\t[ \t]*' OFS='\r\t\t\t\t'
a very long string.......... 112232432
a smaller string 123124343

Unix bash scripting sort

I need help to calculate and display the largest and the average of a group of input numbers.
The program should accept a group of numbers, each of up to 3 digits.
For example, an input of 246, 321, 16, 10, 12345, 4, 274 and 0 should result in 321 as the largest and 145 as the average, with an error message indicating that 12345 is invalid.
Any ideas how to sort in bash? Sorry, I am not a developer at this low level; any help would be great :)
I see that you ask for a Bash solution but since you tagged it also Unix I suggest a pure awk solution (awk is just ideal for such problems):
awk '
{
if(length($1) <= 3 && $1 ~ /^[0-9]+$/) {
if($1 > MAX) {MAX = $1}
SUM+=$1
N++
print $1, N, SUM
} else {
print "Illegal Input " $1
}
}
END {
print "Average: " SUM / N
print "Max: " MAX
}
' < <(echo -e "246\n321\n16\n10\n12345\n4\n274\n0")
prints
246 1 246
321 2 567
16 3 583
10 4 593
Illegal Input 12345
4 5 597
274 6 871
0 7 871
Average: 124.429
Max: 321
However, I cannot comprehend why the above input yields 145 as average?
tmpfile=`mktemp`
while read line ; do
if [[ $line =~ ^[0-9]{1,3}$ ]] ; then
# valid input
if [ $line == "0" ] ; then
break
fi
echo $line >> $tmpfile
else
echo "ERROR: invalid input '$line'"
fi
done
awk ' { tot += $1; if (max < $1) max = $1; } \
END { print tot / NR; print max; } ' $tmpfile
rm $tmpfile
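The expected average of 145 is consistent with treating the trailing 0 as an end-of-input marker, as the loop above does: the six valid numbers before it sum to 871, and 871 / 6 ≈ 145.17. A compact sketch of the same idea in a single awk call, with no temporary file, assuming that the 0 is a sentinel and that the average should be shown as a truncated integer:

```shell
result=$(printf '%s\n' 246 321 16 10 12345 4 274 0 | awk '
    $1 == 0 { exit }                                   # assumed sentinel: stop reading at 0
    $1 !~ /^[0-9]+$/ || length($1) > 3 {               # reject non-numbers and more than 3 digits
        print "ERROR: invalid input " $1
        next
    }
    { sum += $1; n++; if ($1 > max) max = $1 }
    END { if (n) printf "Largest: %d\nAverage: %d\n", max, sum / n }')

echo "$result"
```

Using %.2f instead of the second %d would print the average with two decimals.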
A piped coreutils option with bc:
echo 246 321 16 10 12345 4 274 0 \
| grep -oE '\b[0-9]{1,3}\b' \
| tee >(sort -n | tail -n1 > /tmp/max) \
| tr '\n' ' ' \
| tee >(sed 's/ $//; s/ \+/+/g' > /tmp/add) \
>(wc -w > /tmp/len) > /dev/null
printf "Max: %d, Avg: %.2f\n" \
$(< /tmp/max) \
$( (echo -n '('; cat /tmp/add; echo -n ')/'; cat /tmp/len) | bc -l)
Output:
Max: 321, Avg: 124.43
grep enforces the number format constraint.
sort finds max, as suggested by chepner
sed and wc generate the sum and divisor.
Note that this generates 3 temporary files: /tmp/{max,add,len}, so you might want to use mktemp and/or deletion:
rm /tmp/{max,add,len}
Edit
Stick this into the front of the pipe if you want to know about invalid input:
tee >(tr ' ' '\n' \
| grep -vE '\b.{1,3}\b' \
| sed 's/^/Invalid input: /' > /tmp/err)
And do cat /tmp/err after the printf.
