I have output from a shell script like below
output1 ..... 1
output2 ..... 2
output3 ............3
I tried to format it with equal spacing inside the script, but the output still does not have uniform spacing.
I want to print the output like below.
output1 ..... 1
output2 ..... 2
output3 ......3
Are there any commands available to get this done? I use bash.
Here is the code:
lnode=abc
printf "server name ......... "$lnode""
printf "\nserver uptime and load details : ......... `uptime`"
printf "\n"
lcpu=`cat /proc/cpuinfo | grep -i process |wc -l`
printf "Total number of CPUs on this server : ......... $lcpu\n"
-Thanks.
The idea of printf is that you specify a format string that specifies column widths, etc:
$ cat script.sh
lnode=abc
printf "%-40s %s\n" "server name :" "......... $lnode"
printf "%-40s %s\n" "server uptime and load details :" "......... `uptime`"
lcpu=$(cat /proc/cpuinfo | grep -i process |wc -l)
printf "%-40s %s\n" "Total number of CPUs on this server :" "......... $lcpu"
The first directive in the format string, %-40s, is applied to the first argument that follows the format string. It tells printf to display that argument in a 40-character-wide column. If we had used %40s, it would be a right-aligned column. I specified %-40s so that it would be left-aligned.
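For example, the difference between the two alignments is easy to see with a quick test; the pipe characters are only there to make the column edges visible:
$ printf "|%-10s|\n" "abc"
|abc       |
$ printf "|%10s|\n" "abc"
|       abc|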
This produces output like:
$ bash script.sh
server name :                            ......... abc
server uptime and load details :         ......... 18:05:50 up 17 days, 20 users, load average: 0.05, 0.20, 0.33
Total number of CPUs on this server :    ......... 4
Documentation
Bash's printf command is similar to printf in other languages, particularly the C version. Details specific to bash are found in man bash. Detailed information about the available format options is found in man 3 printf. To begin, however, you are probably better served by one of the many printf tutorials available online.
Related
I am writing a bash script that gets 1) the number of lines in a file matching a pattern and 2) the total number of lines in the file.
a) To get the number of lines in a file within a directory that had a specific pattern I used grep -c "pattern" f*
b) For overall line count in each file within the directory I used
wc -l f*
I am trying to divide the output from 2 by 1. I have tried a for loop
for i in $a
do
printf "%f\n" $(($b/$a)
echo i
done
but that returns an error: syntax error in expression (error token is "first file in directory")
I also have tried
bc "$b/$a"
which does not work either
I am not sure if this is possible to do -- any advice appreciated. thanks!
Sample: grep -c "pattern" f* generates a list like this:
myfile1 500
myfile2 0
myfile3 14
myfile4 18
and wc -l f* generates a list like this:
myfile1 500
myfile2 500
myfile3 500
myfile4 238
I want my output to be the outcome of output for grep/wc divided so for example
myfile1 1
myfile2 0
myfile3 0.28
myfile4 0.07
bash only supports integer math so the following will print the (silently) truncated integer value:
$ a=3 b=5
$ printf "%f\n" $(($b/$a))
1.000000
bc is one solution and with a tweak of OP's current code:
$ bc <<< "scale=2;$b/$a"
1.66
# or
$ echo "scale=4;$b/$a" | bc
1.6666
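If you want the result rounded rather than truncated, one option (a small sketch) is to let bc compute a few extra digits and have printf do the rounding:
$ printf "%.2f\n" "$(bc <<< "scale=4;$b/$a")"
1.67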
If you happen to start with real/float numbers the printf approach will error (more specifically, the $(($b/$a)) will generate an error):
$ a=3.55 b=8.456
$ printf "%f\n" $(($b/$a))
-bash: 8.456/3.55: syntax error: invalid arithmetic operator (error token is ".456/3.55")
bc to the rescue:
$ bc <<< "scale=2;$b/$a"
2.38
# or
$ echo "scale=4;$b/$a" | bc
2.3819
NOTE: in OP's parent code there should be a test for $a being 0 and, if true, a decision on how to proceed (e.g., set the answer to 0; skip the calculation; print a warning message); otherwise this code will generate a divide-by-zero error.
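A minimal sketch of such a guard, assuming the counts in $a and $b are integers as above:
if [ "$a" -eq 0 ]; then
    echo 0    # or skip the calculation, or print a warning
else
    bc <<< "scale=2;$b/$a"
fi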
bash doesn't have builtin floating-point arithmetic, but it can be simulated to some extent. For instance, in order to truncate the value of the fraction a/b to two decimal places (without rounding):
q=$((100*a/b)) # hoping multiplication won't overflow
echo ${q:0:-2}.${q: -2}
The number of decimal places can be made parametric:
n=4
q=$((10**n*a/b))
echo ${q:0:-n}.${q: -n}
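One caveat: if a/b < 1, the quotient can have fewer than n digits and the substring expansions can fail or drop the leading zero. A hedged workaround (same variable names as above) is to zero-pad q to n+1 digits first:
n=4
q=$((10**n*a/b))
printf -v q "%0*d" $((n+1)) "$q"   # pad with leading zeros, e.g. 333 -> 00333
echo ${q:0:-n}.${q: -n}            # a=1 b=30 -> 0.0333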
This awk will do it all:
awk '/pattern/{a+=1}END{print a/NR}' f*
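Note that this prints a single combined ratio across all the files. If you want one ratio per file, as in the sample output above, a sketch of a per-file variant keyed on FILENAME:
awk '/pattern/{m[FILENAME]++} {t[FILENAME]++}
     END {for (f in t) printf "%s %.2f\n", f, m[f]/t[f]}' f*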
jot 93765431 |
mawk -v __='[13579]6$' 'BEGIN {
_^=__=_*=FS=__ }{ __+=_<NF } END { if (___=NR) {
printf(" %\47*.f / %\47-*.f ( %.*f %% )\n",
_+=++_*_*_++,__,_,___,_--,_*__/___*_) } }'
4,688,271 / 93,765,431 ( 4.99999941343 % )
filtering pattern = [13579]6$
I am stuck on this. I have a while-read loop within my code that takes very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But with neither was I able to tell which file would be the input to the loop, so that parallel would understand "take each of those xa* files and feed it to an individual loop, in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of all sequences it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, take the sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12h running on one processor to do it for all 45000 DNA sequence accessions. So, I would like to split input_file and do it on all the processors I have (14), because it would reduce the time spent on this.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, and run ncpu jobs in parallel.
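Since the output is a table, you probably also want the results in input order; GNU Parallel's -k (--keep-order) option keeps the output in the same order as the input blocks. A sketch (result.tsv here is just a placeholder name):
parallel -a input_file --pipepart -k doit > result.tsv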
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
As an alternative I threw together a quick test.
#!/usr/bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or a single awk program. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric characters, and if I nest 3 loops of that I get 36^3, or 46,656 lines, which roughly matches your file sizes. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a, finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
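Applied to your data, the same idea might look something like this; a hedged sketch, not a drop-in replacement: it assumes taxonomy.file.tsv is accession<TAB>taxonomy and that the accession is the first field of each input line:
awk -F'\t' 'FNR==NR {tax[$1]=$2; next}    # first file: load taxonomy into a hash
            {print $1 "\t" tax[$1]}       # second file: O(1) lookup per accession
' taxonomy.file.tsv input_file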
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
wait          # wait for the background workers to finish before removing their input
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead this uses native bash process spawning with &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
Need help with "printf" and "for" loop.
I have individual files each named after a user (e.g. john.txt, david.txt) that contain various commands that each user ran. Examples of commands are SUCCESS, TERMINATED, FAIL, etc. Files have multiple lines with various text, but each line contains one of the commands (1 command per line).
Sample:
command: sendevent "-F" "SUCCESS" "-J" "xxx-ddddddddddddd"
command: sendevent "-F" "TERMINATED" "-J" "xxxxxxxxxxx-dddddddddddddd"
I need to go through each file, count the number of each command and put it in another output file in this format:
==== John ====
SUCCESS - 3
TERMINATED - 2
FAIL - 4
TOTAL 9
==== David ====
SUCCESS - 1
TERMINATED - 1
FAIL - 2
TOTAL 4
P.S. This code can be made more compact, e.g. there is no need to use so many echos, etc., but the following structure is used to make it clear what's happening:
ls | grep .txt | sed 's/.txt//' > names
for s in $(cat names)
do
suc=$(grep "SUCCESS" "$s.txt" | wc -l)
termi=$(grep "TERMINATED" "$s.txt"|wc -l)
fail=$(grep "FAIL" "$s.txt"|wc -l)
echo "=== $s ===" >>docs
echo "SUCCESS - $suc" >> docs
echo "TERMINATED - $termi" >> docs
echo "FAIL - $fail" >> docs
echo "TOTAL $(($termi+$fail+$suc))">>docs
done
Output from my test files was like:
===new===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
===vv===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
Based on karafka's suggestions, instead of using the above lines for the for-loop you can directly use the following:
for f in *.txt
do
something
# in order to print the required name in the file without the .txt you can do a
printf "%s\n" "${f::(-4)}"
done
awk to the rescue!
$ awk -vOFS=" - " 'function pr() {s=0;
for(k in a) {s+=a[k]; print k,a[k]};
print "\nTOTAL "s"\n\n\n"}
NR!=1 && FNR==1 {pr(); delete a}
FNR==1 {print "==== " FILENAME " ===="}
{a[$4]++}
END {pr()}' file1 file2 ...
If your input file is not structured (the key is not always in the fourth field), you can do the same with a pattern match, as sketched below.
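For example, a hedged variant that keys on the first occurrence of SUCCESS, TERMINATED, or FAIL anywhere on the line would replace the {a[$4]++} rule with:
match($0, /SUCCESS|TERMINATED|FAIL/) {a[substr($0, RSTART, RLENGTH)]++}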
The following is the output of cat /proc/softirqs:
                     CPU0        CPU1        CPU2        CPU3
          HI:          24          13           7          54
       TIMER:   344095632   253285150   121234786   108207697
      NET_TX:     2366955         319         695      316044
      NET_RX:    16337920    16030558   250497436   117201444
       BLOCK:       19631        2747        2353     5067051
BLOCK_IOPOLL:           0           0           0           0
     TASKLET:         298          93         157       20965
       SCHED:    74354472    28133393    30646119    26217748
     HRTIMER:  4123645358  2409060621  2466360502   401470590
         RCU:    26083738    17708780    15330534    16857905
My other machine has 24 CPU cores and its output is hard to read.
I would like the output to contain only CPU0, CPU2, CPU4, CPU6, ....
I know cut or awk might be used to do that,
but I have no idea how to use them to get only the even-numbered columns.
Edit:
awk -F" " '{printf("%10s\t%s\n", $2,$4) }'
will get
      CPU1    CPU3
        24    7
 344095632    121234786
   2366955    695
  16337920    250497436
     19631    2353
         0    0
       298    157
  74354472    30646119
4123645358    2466360502
  26083738    15330534
Unfortunately, CPU1 should be CPU0 and CPU3 should be CPU2;
the first line has only 4 columns. Can I skip the first line
in this command?
Edit2:
watch -d "cat /proc/softirqs | awk -F" " '{printf("%10s\t%s\n",$2,$4)}' "
encounters errors like the following:
Every 2.0s: cat /proc/softirqs | awk -F '{print }' Tue Jun 21 10:23:22 2016
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options: (standard)
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
Short options: GNU long options: (extensions)
-b --characters-as-bytes
-c --traditional
-C --copyright
-d[file] --dump-variables[=file]
-e 'program-text' --source='program-text'
-E file --exec=file
-g --gen-pot
-h --help
-L [fatal] --lint[=fatal]
-n --non-decimal-data
-N --use-lc-numeric
-O --optimize
-p[file] --profile[=file]
-P --posix
-r --re-interval
-S --sandbox
-t --lint-old
-V --version
To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.
gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.
Examples:
gawk '{ sum += $1 }; END { print sum }' file
gawk -F: '{ print $1 }' /etc/passwd
What else should I try?
Edit3:
The final working command looks like this:
# define function encapsulating code; this prevents any need for extra layers of quoting
# or escaping.
run() {
awk 'NR>1{printf("%20s\t%10s\t%s\n",$1,$2,$4)}' </proc/softirqs|egrep 'TIMER|RX'
}
# export function
export -f run
# run function in subshell of watch, ensuring that that shell is bash
# (other shells may not honor exported functions)
watch -d "bash -c run"
One easy way to communicate code to a subprocess of watch that avoids escaping errors is to use an exported function:
# define function encapsulating code; this prevents any need for extra layers of quoting
# or escaping.
run() {
awk -F" " '{printf("%10s\t%s\n",$2,$4)}' </proc/softirqs
}
# export function
export -f run
# run function in subshell of watch, ensuring that that shell is bash
# (other shells may not honor exported functions)
watch "bash -c run"
To avoid the dependency on exported functions, one can also use declare -f to retrieve the function's source in an eval-able form, and printf %q to escape it so it survives processing by the outer shell invoked by watch:
run() {
awk -F" " '{printf("%10s\t%s\n",$2,$4)}' </proc/softirqs
}
printf -v run_str '%q' "$(declare -f run); run"
watch "bash -c $run_str"
To skip the first line, do:
awk -F" " 'NR>1{printf("%10s\t%s\n", $2,$4) }'
Why you need -F" " is a mystery to me, since a single space is already the default field separator. You can just as well write:
awk 'NR>1{printf("%10s\t%s\n", $2,$4) }'
(As for the watch part, see other answer/s.)
In the sections below, you'll see the shell script I am trying to run on a UNIX machine, along with a transcript.
When I run this program, it gives the expected output but it also gives an error shown in the transcript. What could be the problem and how can I fix it?
First, the script:
#!/usr/bin/bash
while read A B C D E F
do
E=`echo $E | cut -f 1 -d "%"`
if test $# -eq 2
then
I=`echo $2`
else
I=90
fi
if test $E -ge $I
then
echo $F
fi
done
And the transcript of running it:
$ df -k | ./filter.sh -c 50
./filter.sh: line 12: test: capacity: integer expression expected
/etc/svc/volatile
/var/run
/home/ug
/home/pg
/home/staff/t
/packages/turnin
$ _
Before the line that says:
if test $E -ge $I
temporarily place the line:
echo "[$E]"
and you'll find something very much non-numeric, and that's because the output of df -k looks like this:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 954316620 212723892 693109608 24% /
udev 10240 0 10240 0% /dev
: :
The offending line there is the first, which will have its fifth field Use% turned into Use, which is definitely not an integer.
A quick fix may be to change your usage to something like:
df -k | sed -n '2,$p' | ./filter.sh -c 50
or:
df -k | tail -n+2 | ./filter.sh -c 50
Either of those extra filters (sed or tail) will print only from line 2 onwards.
If you're open to not needing a special script at all, you could probably just get away with something like:
df -k | awk -vlimit=40 '$5+0>=limit&&NR>1{print $5" "$6}'
The way it works is to only operate on lines where both:
the fifth field, converted to a number, is at least equal to the limit passed in with -v; and
the record number (line) is two or greater.
Then it simply outputs the relevant information for those matching lines.
This particular example outputs the file system and usage (as a percentage like 42%) but, if you just want the file system as per your script, just change the print to output $6 on its own: {print $6}.
Alternatively, if you do the percentage but without the %, you can use the same method I used in the conditional: {print $5+0" "$6}.
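Putting those pieces together, a sketch of a one-liner equivalent to the original ./filter.sh -c 50 usage would be:
df -k | awk -v limit=50 'NR>1 && $5+0 >= limit {print $6}'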