Split data using shell - shell

I'm new shell scripting. I need to get data between run and Automatic match counts using shell scripting. So that it can be processed as semi structured data. please advice

Using sed -n '/run/,/Automatic/p' filename.txt|sed '1d;$d'|sed '$d;s/ //g' - should clean up data (1st line, 2 last lines, and spaces in beginning)
shell script - split.sh:
#!/bin/bash
sed -n '/run/,/Automatic/p' $1|sed '1d;$d'|sed '$d;s/ //g'
run for any file as below to get output on console and in file:
shell> ./split.sh test.txt |tee splitted.dat
United Kingdom: 21/09/2012
Started: 08/02/2013 16:04:44
Finished: 08/02/2013 16:21:23
Time to process: 0 days 0 hours 16 mins 39 secs
Records processed: 37497
Throughput: 135124 records/hour
Time per record: 0.0266 secs
output will be stored in splitted.dat file:
shell> cat splitted.dat
United Kingdom: 21/09/2012
Started: 08/02/2013 16:04:44
Finished: 08/02/2013 16:21:23
Time to process: 0 days 0 hours 16 mins 39 secs
Records processed: 37497
Throughput: 135124 records/hour
Time per record: 0.0266 secs
shell>
Update:
#!/bin/bash
# p - print lines with specified conditions
# !p - print lines except specified in conditions (opposite of p)
# |(pipe) - passes output of first command to the next
# $d - delete last line
# 1d - delete first line ( nd - delete nth line)
# '/run/,/Automatic/!p' - print lines except lines between 'run' to 'Automatic'
# sed '1d;s/ //g'- use output from first sed command and delete the 1st line and replace spaces with nothing
sed -n '/run/,/Automatic/!p' $1 |sed '1d;s/ //g'
Output:
Verified Correct: 32426 (86.5%)
Good Match: 2102 ( 5.6%)
Good Premise Partial: 862 ( 2.3%)
Tentative Match: 1039 ( 2.8%)
Poor Match: 4 ( 0.0%)
Multiple Matches: 7 ( 0.0%)
Partial Match: 872 ( 2.3%)
Foreign Address: 2 ( 0.0%)
Unmatched: 183 ( 0.5%)

sed -n '/run/,/Automatic/ {//!p }' test.txt
This will print all lines (,) between run and Automatic.The //! removes the line run and Automatic match counts from the output.

Related

Processing of the data from a big number of input files

My AWK script processes each log file from the folder "${results}, from which it looks for a pattern (a number occurred on the first line of ranking table) and then print it in one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested for up to 50k logs), but does not work for the case of big number of the input logs (e.g. with 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able processing any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you might alter ARGC and ARGV to command GNU AWK to read additional files, consider following simple example, let filelist.txt content be
file1.txt
file2.txt
file3.txt
and content of these files to be respectively uno, dos, tres then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: when reading first file i.e. where number of row in file (FNR) is equal number of row globally (NR) I add to ARGV line as value under key being number of row plus one, as ARGV[1] is already filelist.txt and I increase ARGC by 1, I instruct GNU AWK to then go to next line so no other action is undertaken. For other files I print filename followed by whole line.
(tested in GNU Awk 5.0.1)

Grep all content of File 1 from File 2

This is regarding grepping all the Thread IDs which are mentioned in one file from the thread dump file in unix.
I also require at least 5 lines below each thread id from thread dump while grepping.
Like below:-
MAX_CPU_PID_TD_Ids.out:
1001
1003
MAX_CPU_PID_TD.txt:
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
............TDID=1002...................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Output should contain :-
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
If possible I would like to have the above output in the mail body.
I have tried the below code but it sends me the thread IDs in the body with thread dump file as an attachment
How ever I would like to have the description of each thread id in the body of the mail only
JAVA_HOME=/u01/oracle/products/jdk
MAX_CPU_PID=`ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -2 | sed -n '1!p' | awk '{print $1}'`
ps -eLo pid,ppid,tid,pcpu,comm | grep $MAX_CPU_PID > MAX_CPU_PID_SubProcess.out
cat MAX_CPU_PID_SubProcess.out | awk '{ print "pccpu: "$4" pid: "$1" ppid: "$2" ttid: "$3" comm: "$5}' |sort -n > MAX_CPU_PID_SubProcess_Sorted_temp1.out
rm MAX_CPU_PID_SubProcess.out
sort -k 2n MAX_CPU_PID_SubProcess_Sorted_temp1.out > MAX_CPU_PID_SubProcess_Sorted_temp2.out
rm MAX_CPU_PID_SubProcess_Sorted_temp1.out
awk '{a[i++]=$0}END{for(j=i-1;j>=0;j--)print a[j];}' MAX_CPU_PID_SubProcess_Sorted_temp2.out > MAX_CPU_PID_SubProcess_Sorted_temp3.out
rm MAX_CPU_PID_SubProcess_Sorted_temp2.out
awk '($2 > 15 ) ' MAX_CPU_PID_SubProcess_Sorted_temp3.out > MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out
rm MAX_CPU_PID_SubProcess_Sorted_temp3.out
awk '{ print $8 }' MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out > MAX_CPU_PID_SubProcess_Sorted_temp4.out
( echo "obase=16" ; cat MAX_CPU_PID_SubProcess_Sorted_temp4.out ) | bc > MAX_CPU_PID_TD_Ids_temp.out
rm MAX_CPU_PID_SubProcess_Sorted_temp4.out
$JAVA_HOME/bin/jstack -l $MAX_CPU_PID > MAX_CPU_PID_TD.txt
#grep -i -A 10 'error' data
awk 'BEGIN{print "The below thread IDs from the attached thread dump of OUD1 server are causing the highest CPU utilization. Please Analyze it further\n"}1' MAX_CPU_PID_TD_Ids_temp.out > MAX_CPU_PID_TD_Ids.out
rm MAX_CPU_PID_TD_Ids_temp.out
tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out | mailx -s "OUD1 MAX CPU Utilization Analysis" -a MAX_CPU_PID_TD.txt <My Mail ID>
Answer for the first part: How to extract the lines.
The solution with grep -F -f MAX_CPU_PID_TD_Ids.out -A 5 MAX_CPU_PID_TD.txt as proposed in a comment is much simpler, but it may fail if the lines Line 1 etc can contain the values from MAX_CPU_PID_TD_Ids.out. It may also print a non-matching TDID= line if there are not enough lines after the previous matching line.
For the grep solution it may be better to create a file with patterns like ...TDID=1001....
The following script will print the matching lines ...TDID=XYZ... and at most the following 5 lines. It will stop after fewer lines if a new ...TDID=XYZ... is found.
For simplicity an empty line is printed before every ...TDID=XYZ... line, i.e. also before the first one.
awk 'NR==FNR {ids[$1]=1;next} # from the first file save all IDs as array keys
/\.\.\.TDID=/ {
sel = 0; # stop any previous output
id=gensub(/\.*TDID=([^.]*)\.*/,"\\1",1); # extract ID
if(id in ids) { # select if ID is present in array
print "" # empty line as separator
sel = 1;
}
count = 0; # counter to limit number of lines
}
sel { # selected for output?
print;
count++;
if(count > 5) { # stop after ...TDID= + 5 more lines (change the number if necessary)
sel = 0
}
}' MAX_CPU_PID_TD_Ids.out MAX_CPU_PID_TD.txt > MAX_CPU_PID_TD.extract
Apart from the first empty line, this script produces the expected output from the example input as shown in the question. If it does not work with the real input or if there are additional requirements, update the question to show the problematic input and the expected output or the additional requirements.
Answer for the second part: Mail formatting
To get the resulting data into the mail body you simply have to pipe it into mailx instead of specifying the file as an attachment.
( tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out ; cat MAX_CPU_PID_TD.extract ) | mailx -s "OUD1 MAX CPU Utilization Analysis" <My Mail ID>

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

Find words in multiple files and sort in another

Need help with "printf" and "for" loop.
I have individual files each named after a user (e.g. john.txt, david.txt) and contains various commands that each user ran. Example of commands are (SUCCESS, TERMINATED, FAIL, etc.). Files have multiple lines with various text but each line contains one of the commands (1 command per line).
Sample:
command: sendevent "-F" "SUCCESS" "-J" "xxx-ddddddddddddd"
command: sendevent "-F" "TERMINATED" "-J" "xxxxxxxxxxx-dddddddddddddd"
I need to go through each file, count the number of each command and put it in another output file in this format:
==== John ====
SUCCESS - 3
TERMINATED - 2
FAIL - 4
TOTAL 9
==== David ====
SUCCESS - 1
TERMINATED - 1
FAIL - 2
TOTAL 4
P.S. This code can be made more compact, e.g there is no need to use so many echo's etc, but the following structure is being used to make it clear what's happening:
ls | grep .txt | sed 's/.txt//' > names
for s in $(cat names)
do
suc=$(grep "SUCCESS" "$s.txt" | wc -l)
termi=$(grep "TERMINATED" "$s.txt"|wc -l)
fail=$(grep "FAIL" "$s.txt"|wc -l)
echo "=== $s ===" >>docs
echo "SUCCESS - $suc" >> docs
echo "TERMINATED - $termi" >> docs
echo "FAIL - $fail" >> docs
echo "TOTAL $(($termi+$fail+$suc))">>docs
done
Output from my test files was like :
===new===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
===vv===
SUCCESS - 0
TERMINATED - 0
FAIL - 0
TOTAL 0
based on karafka's suggestions instead of using the above lines for the for-loopyou can directly use the following:
for f in *.txt
do
something
#in order to print the required name in the file without the .txt you can do a
printf "%s\n" ${f::(-4)}
awk to the rescue!
$ awk -vOFS=" - " 'function pr() {s=0;
for(k in a) {s+=a[k]; print k,a[k]};
print "\nTOTAL "s"\n\n\n"}
NR!=1 && FNR==1 {pr(); delete a}
FNR==1 {print "==== " FILENAME " ===="}
{a[$4]++}
END {pr()}' file1 file2 ...
if your input file is not structured (key is not always on fourth field), you can do the same with pattern match.

Make cat command to operate recursively looping through a directory

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the worlds simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is to stitch file2 to the previous file3. Look at this screenshot of the directory for better understanding.
See how these files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code, in fact it would be probably best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about and needs further explanation please let me know.
"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion w/substing removal say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
| d1 | | d2 |
then you simply subtract 1 from d1 into say date0 or d0 and then construct a previous filename out of d0 and d1 using _snip instead of _end. Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file to the previous _snip file. e.g.
#!/bin/bash
for i in *end; do ## find all _end files
d1="${i#*stuff_}" ## isolate first date in filename
d1="${d1%%T*}"
d2="${i%T*}" ## isolate second date
d2="${d2##*_}"
d0=$((d1 - 1)) ## subtract 1 from first, get snip d1
prev="${i/$d1/$d0}" ## create previous 'snip' filename
prev="${prev/$d2/$d1}"
prev="${prev%end}snip"
if [ -f "$prev" ] ## test that prev snip file exists
then
printf "paste to : %s\n" "$prev"
printf " from : %s\n\n" "$i"
fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
You could store the previous $file3 value in a variable (and do a check if it is not the first run with -z check):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in $destination*.ascii
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
if [ -z "$prev" ]; then
cat $prev $file2 > outfile
fi
prev=$file3
done

Resources