Continuously-updated (running-count) output from a program reading from a pipeline - bash

How can I get continuously-updated output from a program that's reading from a pipeline? For example, let's say that this program were a version of wc:
$ ls | running_wc
So I'd like this to output instantly, e.g.
0 0 0
and then every time a new output line is received, it'd update again, e.g.
1 2 12
2 4 24
etc.
Of course my command isn't really ls, it's a process that slowly outputs data... I'd actually love to dynamically have it count matches and non matches, and sum this info up on a single line, e.g,
$ my_process | count_matches error
This would constantly update a single line of output with the matching and non matching counts, e.g.
$ my_process | count_matches error
0 5
then later on it might look like so, since it's found 2 matches and 10 non matching lines.
$ my_process | count_matches error
2 10

dd will print out statistics if it receives a SIGUSR1 signal, but neither wc nor grep does that. You'll need to re-implement them, more or less.
count_matches() {
local pattern=$1
local matches=0 nonmatches=0
local line
while IFS= read -r line; do
if [[ $line == *$pattern* ]]; then ((++matches)); else ((++nonmatches)); fi
printf '\r%s %s' "$matches" "$nonmatches"
done
printf '\n'
}
Printing a carriage return \r each time causes the printouts to overwrite each other.
Most programs will switch from line buffering to full buffering when used in a pipeline. Your slow-running program should flush its output after each line to ensure the results are available immediately. Or if you can't modify it, you can often use stdbuf -oL to force programs that use C stdio to line buffer stdout.
stdbuf -oL my_process | count_matches error

Using awk. First we create the "my_process":
$ for i in {1..10} ; do echo $i ; sleep 1 ; done # slowly prints lines
The match counter:
$ awk 'BEGIN {
print "match","miss" # print header
m=0 # reset match count
}
{
if($1~/(3|6)/) # match is a 3 or 6 (for this output)
m++ # increment match count
print m,NR-m # for each record output match / miss counts
}'
Running it:
$ for i in {1..10} ; do echo $i ; sleep 1 ; done | awk 'BEGIN{print "match","miss";m=0}{if($1~/(3|6)/)m++;print m,NR-m}'
match miss
0 1
0 2
1 2
1 3
1 4
2 4
2 5
2 6
2 7
2 8

Related

Is there a command for substituting a set of characters by a set of strings?

I'm would like to substitute a set of edit: single byte characters with a set of literal strings in a stream, without any constraint on the line size.
#!/bin/bash
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
chars_to_strings $'\a\b\t\v' '<bell>' '<backspace>' '<horizontal-tab>' '<vertical-tab>'
The expected output would be:
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>,<backspace>,<horizontal-tab>,<vertical-tab><bell>...
I can think of a bash function that would do that, something like:
chars_to_strings() {
local delim buffer
while true
do
delim=''
IFS='' read -r -d '.' -n 4096 buffer && (( ${#buffer} != 4096 )) && delim='.'
if [[ -n "${delim:+_}" ]] || [[ -n "${buffer:+_}" ]]
then
# Do the replacements in "$buffer"
# ...
printf "%s%s" "$buffer" "$delim"
else
break
fi
done
}
But I'm looking for a more efficient way, any thoughts?
Since you seem to be okay with using ANSI C quoting via $'...' strings, then maybe use sed?
sed $'s/\a/<bell>/g; s/\b/<backspace>/g; s/\t/<horizontal-tab>/g; s/\v/<vertical-tab>/g'
Or, via separate commands:
sed -e $'s/\a/<bell>/g' \
-e $'s/\b/<backspace>/g' \
-e $'s/\t/<horizontal-tab>/g' \
-e $'s/\v/<vertical-tab>/g'
Or, using awk, which replaces newline characters too (by customizing the Output Record Separator, i.e., the ORS variable):
$ printf '\a,\b,\t,\v\n' | awk -vORS='<newline>' '
{
gsub(/\a/, "<bell>")
gsub(/\b/, "<backspace>")
gsub(/\t/, "<horizontal-tab>")
gsub(/\v/, "<vertical-tab>")
print $0
}
'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab><newline>
For a simple one-liner with reasonable portability, try Perl.
for (( i = 1; i <= 0x7FFFFFFFFFFFFFFF; i++ ))
do
printf '\a,\b,\t,\v'
done |
perl -pe 's/\a/<bell>/g;
s/\b/<backspace>/g;s/\t/<horizontal-tab>/g;s/\v/<vertical-tab>/g'
Perl internally does some intelligent optimizations so it's not encumbered by lines which are longer than its input buffer or whatever.
Perl by itself is not POSIX, of course; but it can be expected to be installed on any even remotely modern platform (short of perhaps embedded systems etc).
Assuming the overall objective is to provide the ability to process a stream of data in real time without having to wait for a EOL/End-of-buffer occurrence to trigger processing ...
A few items:
continue to use the while/read -n loop to read a chunk of data from the incoming stream and store in buffer variable
push the conversion code into something that's better suited to string manipulation (ie, something other than bash); for sake of discussion we'll choose awk
within the while/read -n loop printf "%s\n" "${buffer}" and pipe the output from the while loop into awk; NOTE: the key item is to introduce an explicit \n into the stream so as to trigger awk processing for each new 'line' of input; OP can decide if this additional \n must be distinguished from a \n occurring in the original stream of data
awk then parses each line of input as per the replacement logic, making sure to append anything leftover to the front of the next line of input (ie, for when the while/read -n breaks an item in the 'middle')
General idea:
chars_to_strings() {
while read -r -n 15 buffer # using '15' for demo purposes otherwise replace with '4096' or whatever OP wants
do
printf "%s\n" "${buffer}"
done | awk '{print NR,FNR,length($0)}' # replace 'print ...' with OP's replacement logic
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1 # add some delay to data being streamed to chars_to_strings()
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
A variation on this idea using a named pipe:
mkfifo /tmp/pipeX
sleep infinity > /tmp/pipeX # keep pipe open so awk does not exit
awk '{print NR,FNR,length($0)}' < /tmp/pipeX &
chars_to_strings() {
while read -r -n 15 buffer
do
printf "%s\n" "${buffer}"
done > /tmp/pipeX
}
Take for a test drive:
for (( i = 1; i <= 20; i++ ))
do
printf '\a,\b,\t,\v'
sleep 0.1
done | chars_to_strings
1 1 15 # output starts printing right away
2 2 15 # instead of waiting for the 'for'
3 3 15 # loop to complete
4 4 15
5 5 13
6 6 15
7 7 15
8 8 15
9 9 15
# kill background 'awk' and/or 'sleep infinity' when no longer needed
don't waste FS/OFS - use the built-in variables to take 2 out of the 5 needed :
echo $' \t abc xyz \t \a \n\n ' |
mawk 'gsub(/\7/, "<bell>", $!(NF = NF)) + gsub(/\10/,"<bs>") +\
gsub(/\11/,"<h-tab>")^_' OFS='<v-tab>' FS='\13' ORS='<newline>'
<h-tab> abc xyz <h-tab> <bell> <newline><newline> <newline>
To have NO constraint on the line length you could do something like this with GNU awk:
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(foo,bar)
print
}'
That will read and process the input 100 chars at a time no matter which chars are present, whether it has newlines or not, and even if the input was one multi-terabyte line.
Replace gsub(foo,bar) with whatever substitution(s) you have in mind, e.g.:
$ printf '\a,\b,\t,\v' |
awk -v RS='.{1,100}' -v ORS= '{
$0 = RT
gsub(/\a/,"<bell>")
gsub(/\b/,"<backspace>")
gsub(/\t/,"<horizontal-tab>")
gsub(/\v/,"<vertical-tab>")
print
}'
<bell>,<backspace>,<horizontal-tab>,<vertical-tab>
and of course it'd be trivial to pass a list of old and new strings to awk rather than hardcoding them, you'd just have to sanitize any regexp or backreference metachars before calling gsub().

Grep all content of File 1 from File 2

This is regarding grepping all the Thread IDs which are mentioned in one file from the thread dump file in unix.
I also require at least 5 lines below each thread id from thread dump while grepping.
Like below:-
MAX_CPU_PID_TD_Ids.out:
1001
1003
MAX_CPU_PID_TD.txt:
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
............TDID=1002...................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Output should contain :-
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
If possible I would like to have the above output in the mail body.
I have tried the below code but it sends me the thread IDs in the body with thread dump file as an attachment
How ever I would like to have the description of each thread id in the body of the mail only
JAVA_HOME=/u01/oracle/products/jdk
MAX_CPU_PID=`ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -2 | sed -n '1!p' | awk '{print $1}'`
ps -eLo pid,ppid,tid,pcpu,comm | grep $MAX_CPU_PID > MAX_CPU_PID_SubProcess.out
cat MAX_CPU_PID_SubProcess.out | awk '{ print "pccpu: "$4" pid: "$1" ppid: "$2" ttid: "$3" comm: "$5}' |sort -n > MAX_CPU_PID_SubProcess_Sorted_temp1.out
rm MAX_CPU_PID_SubProcess.out
sort -k 2n MAX_CPU_PID_SubProcess_Sorted_temp1.out > MAX_CPU_PID_SubProcess_Sorted_temp2.out
rm MAX_CPU_PID_SubProcess_Sorted_temp1.out
awk '{a[i++]=$0}END{for(j=i-1;j>=0;j--)print a[j];}' MAX_CPU_PID_SubProcess_Sorted_temp2.out > MAX_CPU_PID_SubProcess_Sorted_temp3.out
rm MAX_CPU_PID_SubProcess_Sorted_temp2.out
awk '($2 > 15 ) ' MAX_CPU_PID_SubProcess_Sorted_temp3.out > MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out
rm MAX_CPU_PID_SubProcess_Sorted_temp3.out
awk '{ print $8 }' MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out > MAX_CPU_PID_SubProcess_Sorted_temp4.out
( echo "obase=16" ; cat MAX_CPU_PID_SubProcess_Sorted_temp4.out ) | bc > MAX_CPU_PID_TD_Ids_temp.out
rm MAX_CPU_PID_SubProcess_Sorted_temp4.out
$JAVA_HOME/bin/jstack -l $MAX_CPU_PID > MAX_CPU_PID_TD.txt
#grep -i -A 10 'error' data
awk 'BEGIN{print "The below thread IDs from the attached thread dump of OUD1 server are causing the highest CPU utilization. Please Analyze it further\n"}1' MAX_CPU_PID_TD_Ids_temp.out > MAX_CPU_PID_TD_Ids.out
rm MAX_CPU_PID_TD_Ids_temp.out
tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out | mailx -s "OUD1 MAX CPU Utilization Analysis" -a MAX_CPU_PID_TD.txt <My Mail ID>
Answer for the first part: How to extract the lines.
The solution with grep -F -f MAX_CPU_PID_TD_Ids.out -A 5 MAX_CPU_PID_TD.txt as proposed in a comment is much simpler, but it may fail if the lines Line 1 etc can contain the values from MAX_CPU_PID_TD_Ids.out. It may also print a non-matching TDID= line if there are not enough lines after the previous matching line.
For the grep solution it may be better to create a file with patterns like ...TDID=1001....
The following script will print the matching lines ...TDID=XYZ... and at most the following 5 lines. It will stop after fewer lines if a new ...TDID=XYZ... is found.
For simplicity an empty line is printed before every ...TDID=XYZ... line, i.e. also before the first one.
awk 'NR==FNR {ids[$1]=1;next} # from the first file save all IDs as array keys
/\.\.\.TDID=/ {
sel = 0; # stop any previous output
id=gensub(/\.*TDID=([^.]*)\.*/,"\\1",1); # extract ID
if(id in ids) { # select if ID is present in array
print "" # empty line as separator
sel = 1;
}
count = 0; # counter to limit number of lines
}
sel { # selected for output?
print;
count++;
if(count > 5) { # stop after ...TDID= + 5 more lines (change the number if necessary)
sel = 0
}
}' MAX_CPU_PID_TD_Ids.out MAX_CPU_PID_TD.txt > MAX_CPU_PID_TD.extract
Apart from the first empty line, this script produces the expected output from the example input as shown in the question. If it does not work with the real input or if there are additional requirements, update the question to show the problematic input and the expected output or the additional requirements.
Answer for the second part: Mail formatting
To get the resulting data into the mail body you simply have to pipe it into mailx instead of specifying the file as an attachment.
( tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out ; cat MAX_CPU_PID_TD.extract ) | mailx -s "OUD1 MAX CPU Utilization Analysis" <My Mail ID>

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

Find words in multiple files and sort in another

Need help with "printf" and "for" loop.
I have individual files each named after a user (e.g. john.txt, david.txt) and contains various commands that each user ran. Example of commands are (SUCCESS, TERMINATED, FAIL, etc.). Files have multiple lines with various text but each line contains one of the commands (1 command per line).
Sample:
command: sendevent "-F" "SUCCESS" "-J" "xxx-ddddddddddddd"
command: sendevent "-F" "TERMINATED" "-J" "xxxxxxxxxxx-dddddddddddddd"
I need to go through each file, count the number of each command and put it in another output file in this format:
==== John ====
SUCCESS - 3
TERMINATED - 2
FAIL - 4
TOTAL 9
==== David ====
SUCCESS - 1
TERMINATED - 1
FAIL - 2
TOTAL 4
P.S. This code can be made more compact, e.g there is no need to use so many echo's etc, but the following structure is being used to make it clear what's happening:
ls | grep .txt | sed 's/.txt//' > names
for s in $(cat names)
do
suc=$(grep "SUCCESS" "$s.txt" | wc -l)
termi=$(grep "TERMINATED" "$s.txt"|wc -l)
fail=$(grep "FAIL" "$s.txt"|wc -l)
echo "=== $s ===" >>docs
echo "SUCCESS - $suc" >> docs
echo "TERMINATED - $termi" >> docs
echo "FAIL - $fail" >> docs
echo "TOTAL $(($termi+$fail+$suc))">>docs
done
Output from my test files was like :
===new===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
===vv===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
based on karafka's suggestions instead of using the above lines for the for-loopyou can directly use the following:
for f in *.txt
do
something
#in order to print the required name in the file without the .txt you can do a
printf "%s\n" ${f::(-4)}
awk to the rescue!
$ awk -vOFS=" - " 'function pr() {s=0;
for(k in a) {s+=a[k]; print k,a[k]};
print "\nTOTAL "s"\n\n\n"}
NR!=1 && FNR==1 {pr(); delete a}
FNR==1 {print "==== " FILENAME " ===="}
{a[$4]++}
END {pr()}' file1 file2 ...
if your input file is not structured (key is not always on fourth field), you can do the same with pattern match.

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order such that I can pull all the similar lines (eg. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of ("cluster${i}.txt") files in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is a somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Resources