Better way of extracting data from file for comparison - bash

Problem: compare files from the pre-check status and post-check status of a node for specific parameters.
With some help from the community, I have written the following solution. It extracts the information from the files in the pre and post directories, keyed on the "Node-ID" (which happens to be unique and has to be extracted from the files as well). After extracting the data from the Pre/Post folders, I create folders based on the node-id and dump the extracted files into them.
My code to extract the data (the data is extracted from the Pre and Post folders):
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ##Generate the node-id
echo "Extracting Post check information for " $NODE
mkdir temp/$NODE-post ## create a temp directory
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck on comparing the $NODE-pre and $NODE-post files. I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data-extraction program very slow. I believe it's not the best possible way (using the least resources) to do this. Any suggestions?

Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
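For example, the node-id pipeline in the original script runs cat just to feed grep; these two commands produce the same output, but the second starts one process fewer:
cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'
grep -m 1 ">" $f | awk '{print $1}' | sed 's/[>]//g'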
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name '*.log')
do
    NODE=$(sed -n '/>/{ s/ .*//; s/>//g; p; q; }' "$f") ## Generate the node-id
    echo "Extracting Post check information for $NODE"
    mkdir temp/$NODE-post
    awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
        'BEGIN { RS = NODE"> " }
        /^param1/ { param1 = $0 }
        /^param2/ { param2 = $0 }
        /^param3/ { param3 = $0 }
        END {
            print RS param1 > (DIR "/param1.txt")
            print RS param2 > (DIR "/param2.txt")
            print RS param3 > (DIR "/param3.txt")
        }' "$f"
done
The NODE finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables, NODE and the directory name. The BEGIN is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
then : No difference
else diff temp/$NODE-pre/$file temp/$NODE-post/$file
fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
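One possible shape for that outer loop (a sketch, assuming every temp/<node>-pre directory built above has a matching temp/<node>-post):
for pre in temp/*-pre
do
    node=${pre#temp/}    # strip the leading temp/
    node=${node%-pre}    # strip the trailing -pre, leaving the node-id
    diff -r "temp/$node-pre" "temp/$node-post"
done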

Related

Parallelize an awk script with multiple input files and changing the name of the output file

I have a series of text files in a folder sub.yr_by_yr which I pass to a for loop to subset a Beagle file by its header. I want to parallelize this script to subset the Beagle file from the header values (which is done using my subbeagle.awk script). I use the name of each text file to build the output file name, using base pattern matching in bash (file11=${file1%.subbeagle.txt}) to get the desired output (MM.beagle.${file11}.gz).
for file1 in $(ls sub.yr_by_yr)
do
echo -e "Doing sub-samples \n $file1"
file11=${file1%.subbeagle.txt}
awk -f subbeagle.awk \
./sub.yr_by_yr/$file1 <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz
done
The for loop works, but takes forever... hence the need for parallelization. The folder sub.yr_by_yr contains >10 files, named
similar to this: sp.yrseries.site1.1.subbeagle.txt, sp.yrseries.site1.2.subbeagle.txt, sp.yrseries.site1.3.subbeagle.txt...
I've tried
parallel "file11=${{}%.subbeagle.txt}; awk -f $SUBBEAGLEAWKSCRIPT ./sub.yr_by_yr/{} <(zcat ../MajorMinor.beagle.gz) | gzip > sub.yr_by_yr_beagle.files/MM.beagle.${file11}.gz" ::: sub.yr_by_yr/*.subbeagle.txt
But it gives me 'bad substitution'
How could I use the awk script in parallel and rename the files accordingly?
Content of subbeagle.awk:
# Source: https://stackoverflow.com/questions/74451358/select-columns-based-on-their-names-from-a-file-using-awk
BEGIN { FS=OFS="\t" } # uncomment if input/output fields are tab delimited
FNR==NR { headers[$1]; next }
{ sep=""
for (i=1; i<=NF; i++) {
if (FNR==1 && ($i in headers)) {
fldids[i]
}
if (i in fldids) {
printf "%s%s",sep,$i
sep=OFS # if not set elsewhere (eg, in a BEGIN{}block) then default OFS == <space>
}
}
print ""
}
Content of MajorMinor.beagle.gz
marker allele1 allele2 FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID1_splitMerged FINCH_WB_ID2_splitMerged FINCH_WB_ID2_splitMerged
chr1_34273 G C 0.79924 0.20076 3.18183e-09 0.940649 0.0593509
chr1_34285 G A 0.79924 0.20076 3.18183e-09 0.969347 0.0306534
chr1_34291 G C 0.666111 0.333847 4.20288e-05 0.969347 0.0306534
chr1_34299 C G 0.000251063 0.999498 0.000251063 0.996035 0.00396529
UPDATE:
I was able to get this from this source:
parallel "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{/.}_test.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
The only fancy thing that needs to be removed is the .subbeagle part of the input file name...
So the parallel tutorial helped me here:
parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' "awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'" ::: sub.yr_by_yr/*.subbeagle.txt
Let's break this down:
--rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;'
--rpl will "define a shorthand replacement string" (see parallel tutorial and another example here)
{mymy} is my 'new' replacement string, which is defined by the expression that follows it.
s:.*/::; is the definition of {/} (see the parallel tutorial; search for "Perl expression replacement string", the last part of that section shows the definitions of the 7 'default' replacement strings)
s:\.[^.]+$::;s:\.[^.]+$::; removes 2 extensions (so .subbeagle.txt, where .txt is the first extension and .subbeagle is the second)
"awk -f subbeagle.awk {} <(zcat ../MajorMinor.beagle.gz) | gzip > 'sub.yr_by_yr_beagle.files/MM.beagle.{mymy}.gz'"
is the subsetting and compressing part of the script. Note that {mymy} is where the replacement will take place. As you can see, {} will be the input file. The rest is unchanged!
::: sub.yr_by_yr/*.subbeagle.txt will pass all the files to parallel as input.
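To check what a custom replacement string expands to before using it, you can run it through echo (this is just an illustration, not part of the original command); for one of the file names above it should print sp.yrseries.site1.1:
parallel --rpl '{mymy} s:.*/::; s:\.[^.]+$::;s:\.[^.]+$::;' echo {mymy} ::: sub.yr_by_yr/sp.yrseries.site1.1.subbeagle.txt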
It took ~2 hours to do ~5 files before; using 22 cores, I could do all the files in a fraction of the time (~20 minutes)!

Replace a string with a random number for every line, in every file, in a directory in Bash

#!/bin/bash
for file in ~/tdg/*.TXT
do
while read p; do
randvalue=`shuf -i 1-99999 -n 1`
sed -i -e "s/55555/${randvalue}/" $file
done < $file
done
This is my script. I'm attempting to replace 55555 with a different random number every time I find it. This currently works, but it replaces every instance of 55555 with the same random number. I have attempted to replace $file at the end of the sed command with $p but that just blows up.
Really though, even if I only get to the point where each instance on the same line gets the same random number, but a new random number is used for each line, then I'll be happy.
EDIT
I should have specified this. I would like to actually save the results of the replace in the file, rather than just printing the results to the console.
EDIT
The final working version of my script after JNevill's fantastic help:
#!/bin/bash
for file in ~/tdg/*.TXT
do
while read p;
do
gawk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file > ${file}.new
done < $file
mv -f ${file}.new $file
done
Since doing this in sed gets pretty awful pretty quickly, you may want to switch over to awk to perform this:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' $file
Using this, you can remove the inner loop as this will run across the entire file line-by-line as awk does.
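That means the outer loop can shrink to something like this (a sketch of the same idea; it writes to a temporary file and moves it back, since this version prints the result rather than editing the file in place):
for file in ~/tdg/*.TXT; do
    gawk '{ $0 = gensub(/55555/, int(rand()*99999), "g", $0) } 1' "$file" > "${file}.new" &&
    mv "${file}.new" "$file"
done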
You could just swap out the entire script and feed the wildcard filename to awk directly too:
awk '{$0=gensub(/55555/, int(rand()*99999), "g", $0)}1' ~/tdg/*.TXT
This is how to REALLY do what you're trying to do with GNU awk:
awk -i inplace '{ while(sub(/55555/,int(rand()*99999)+1)); print }' ~/tdg/*.TXT
No shell loops or temp files required and it WILL replace every 55555 with a different random number within and across all files.
With other awks it'd be:
seed="$RANDOM"
for file in ~/tdg/*.TXT; do
seed=$(awk -v seed="$seed" '
BEGIN { srand(seed) }
{ while(sub(/55555/,int(rand()*99999)+1)); print > "tmp" }
END { print int(rand()*99999)+1 }
' "$file") &&
mv tmp "$file"
done
A variation on JNevill's solution that generates a different set of random numbers every time you run the script ...
A sample data file:
$ cat grand.dat
abc def 55555
xyz-55555-55555-__+
123-55555-55555-456
987-55555-55555-.2.
.+.-55555-55555-==*
And the script:
$ cat grand.awk
{ $0=gensub(/55555/,int(rand()*seed),"g",$0); print }
gensub(...) : works the same as in JNevill's answer, while we mix up the rand() multiplier by using our seed value [you can throw any number in here you wish to help determine the size of the resulting value]
Keep in mind that this will replace all occurrences of 55555 on a single line with the same random value.
Script in action:
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 6939
xyz-8494-8494-__+
123-24685-24685-456
987-4442-4442-.2.
.+.-17088-17088-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4134
xyz-5060-5060-__+
123-14706-14706-456
987-2646-2646-.2.
.+.-10180-10180-==*
$ awk -f grand.awk seed=${RANDOM} grand.dat
abc def 4287
xyz-5248-5248-__+
123-15251-15251-456
987-2744-2744-.2.
.+.-10558-10558-==*
seed=$RANDOM : have the shell generate a random int for us and pass it into the awk script as the seed variable

Fast grep on huge csv files

I have a file (queryids.txt) with a list of 847 keywords to search. I have to grep the keywords from about 12 huge csv files (the biggest has 2,184,820,000 lines). Eventually we will load it into a database of some sort but for now, we just want certain keywords to be grep'ed.
My command is:
LC_ALL=C fgrep -f queryids.txt subject.csv
I am thinking of writing a bash script like this:
#!/bin/bash
for f in *.csv
do
( echo "Processing $f"
filename=$(basename "$f")
filename="${filename%.*}"
LC_ALL=C fgrep -f queryids.txt $f > $filename"_goi.csv" ) &
done
and I will run it using: nohup bash myscript.sh &
The queryids.txt looks like this:
ENST00000401850
ENST00000249005
ENST00000381278
ENST00000483026
ENST00000465765
ENST00000269080
ENST00000586539
ENST00000588458
ENST00000586292
ENST00000591459
The subject file looks like this:
target_id,length,eff_length,est_counts,tpm,id
ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109
ENST00000473358.1,712,5.39477E2,8.26564,0.244474,00065a62-5e18-4223-a884-12fca053a109
ENST00000469289.1,535,3.62675E2,4.82917,0.212463,00065a62-5e18-4223-a884-12fca053a109
ENST00000607096.1,138,1.92013E1,0,0,00065a62-5e18-4223-a884-12fca053a109
ENST00000417324.1,1187,1.01447E3,0,0,00065a62-5e18-4223-a884-12fca053a109
I am concerned this will take a long time. Is there a faster way to do this?
Thanks!
A few things I can suggest to improve the performance:
No need to spawn a sub-shell using ( .. ) &, you can use braces { ... } & if needed.
Use grep -F (non-regex or fixed string search) to make grep run faster
Avoid basename command and use bash string manipulation
Try this script:
#!/bin/bash
for f in *.csv; do
echo "Processing $f"
filename="${f##*/}"
LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
done
I suggest you run this on a smaller dataset to compare the performance gain.
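For example, to get a rough timing on a slice of one file before committing to the full run (sample.csv and the 1,000,000-line cut-off here are just placeholders):
head -n 1000000 subject.csv > sample.csv   # small sample; adjust the line count to taste
time LC_ALL=C grep -Ff queryids.txt sample.csv > /dev/null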
You could try this instead:
awk '
BEGIN {
while ( (getline line < "queryids.txt") > 0 ) {
re = ( re=="" ? "" : re "|") line
}
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$0 ~ re { print > out }
' *.csv
It's using a regexp rather than a string comparison; whether or not that matters and, if so, what we can do about it depends on the values in queryids.txt. In fact, there may be a vastly faster and more robust way to do this depending on what your files contain, so if you edit your question to include some examples of your file contents we could be of more help.
I see you have now posted some sample input and indeed we can do this much faster and more robustly by using a hash lookup:
awk '
BEGIN {
FS="."
while ( (getline line < "queryids.txt") > 0 ) {
ids[line]
}
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$1 in ids { print > out }
' *.csv

Looping input file and find out if line is used

I am using bash to loop through a large input file (contents.txt) that looks like:
searchterm1
searchterm2
searchterm3
...in an effort to remove search terms from the file if they are not used in a code base. I am trying to use grep and awk, but with no success. I also want to exclude the images and constants directories.
#!/bin/bash
while read a; do
output=`grep -R $a ../website | grep -v ../website/images | grep -v ../website/constants | grep -v ../website/.git`
if [ -z "$output" ]
then echo "$a" >> notneeded.txt
else echo "$a used $($output | wc -l) times" >> needed.txt
fi
done < constants.txt
The desired effect of this would be two files. One for showing all of the search terms that are found in the code base(needed.txt), and another for search terms that are not found in the code base(notneeded.txt).
needed.txt
searchterm1 used 4 times
searchterm3 used 10 times
notneeded.txt
searchterm2
I've tried awk as well in a similar fashion but I cannot get it to loop and output as desired
Not sure but it sounds like you're looking for something like this (assuming no spaces in your file names):
awk '
NR==FNR{ terms[$0]; next }
{
for (term in terms) {
if ($0 ~ term) {
hits[term]++
}
}
}
END {
for (term in terms) {
if (term in hits) {
print term " used " hits[term] " times" > "needed.txt"
}
else {
print term > "notneeded.txt"
}
}
}
' constants.txt $( find ../website -type f -print | egrep -v '\.\.\/website\/(images|constants|\.git)' )
There's probably some find option to make the egrep unnecessary.
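Something along these lines should do it (a sketch using find's -prune, assuming the same directories excluded by the egrep above):
find ../website \( -path '../website/images' -o -path '../website/constants' -o -path '../website/.git' \) -prune -o -type f -print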

How to convert HHMMSS to HH:MM:SS Unix?

I tried to convert the HHMMSS to HH:MM:SS and I am able to convert it successfully, but my script takes 2 hours to complete because of the file size. Is there any better (faster) way to complete this task?
Data File
data.txt
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,
My code : script.sh
#!/bin/bash
awk -F"," '{print $5}' Data.txt > tmp.txt # print first line first string before , to tmp.txt i.e. all Numbers will be placed into tmp.txt
sort tmp.txt | uniq -d > Uniqe_number.txt # unique values be stored to Uniqe_number.txt
rm tmp.txt # removes tmp file
while read line; do
echo $line
cat Data.txt | grep ",$line," > Numbers/All/$line.txt # grep the number and create a file for it individually
awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
done < Uniqe_number.txt
ls Numbers/Final > files.txt
dos2unix files.txt
bash time_replace.sh
When you execute the above script, it will call the time_replace.sh script.
My Code for time_replace.sh
#!/bin/bash
for i in `cat files.txt`
do
while read aline
do
TimeDep=`echo $aline | awk -F"," '{print $6}'`
#echo $TimeDep
finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeDep
##########
TimeAri=`echo $aline | awk -F"," '{print $7}'`
#echo $TimeAri
finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeAri
sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
############################
done < Numbers/Final/$i
done
Any better solution?
Appreciate any help.
Thanks
Sri
If there's a large quantity of files, then the pipelines are probably what is going to impact performance more than anything else; although processes can be cheap, if you're doing a huge amount of processing then cutting down the number of times you pass data through a pipeline can reap dividends.
So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitrary file, so the while loop in your first script could be replaced with an awk script that does this. You also don't need to use a temporary file.
I assume the sorting is just for tracking progress easily as you know how many numbers there are. But if you don't care for the sorting, you can simply do this:
#!/bin/sh
awk -F ',' '
{
print $5","$4","$7","$8","$9","$10","$11 > Numbers/Final/Final_$line.txt
}' datafile.txt
ls Numbers/Final > files.txt
Alternatively, if you need to sort you can do sort -t, -k5,5 -k4,4 -k10,10 (or whichever fields your sort keys actually need to be).
As for formatting the datetime, awk also supports functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis) ... (Note! Untested, so may contain vague syntax errors):
#!/usr/bin/awk -f
BEGIN {
FS=","
}
function formattime (t)
{
return substr(t,1,2)":"substr(t,3,2)":"substr(t,5,2)
}
{
print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > Numbers/Final/Final_$line.txt
}
which you can save, chmod 700, and call directly as:
./dostuff.awk filename
Other awk options include changing fields in situ, so if you want to maintain the entire original record but with formatted datetimes, you can do a modification of the above: also set OFS="," in the BEGIN block (so the rebuilt record keeps its commas) and change the print block to:
{
$10=formattime($10)
$11=formattime($11)
print $0
}
If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS.
while IFS=, read -r -a line ; do
echo -n ${line[0]},${line[1]},${line[2]},${line[3]},
echo -n ${line[4]},${line[5]},${line[6]},${line[7]},
echo -n ${line[8]},
if [ -n "${line[9]}" ]; then
echo -n ${line[9]:0:2}:${line[9]:2:2}:${line[9]:4:2}
fi
echo -n ,
if [ -n "${line[10]}" ]; then
echo -n ${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}
fi
echo ","
done < data.txt
The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
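For example, applied to one of the time values from the data above:
t=072200
echo "${t:0:2}:${t:2:2}:${t:4:2}"   # prints 07:22:00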
In Perl, that's close to child's play:
#!/usr/bin/env perl
use strict;
use warnings;
use English( -no_match_vars );
local($OFS) = ",";
while (<>)
{
my(@F) = split /,/;
$F[9] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
$F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
print @F;
}
If you don't want to use English, you can write local($,) = ","; instead; it controls the output field separator, setting it to a comma. The code reads each line in the file, splits it up on the commas, takes the tenth and eleventh fields ($F[9] and $F[10], counting from zero), and (if they're not empty) inserts colons in between the pairs of digits. I'm sure a 'Code Golf' solution could be made a lot shorter, but this is semi-legible if you know any Perl.
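A minimal illustration of the $, variable (the one $OFS names), showing the comma being placed between printed values:
perl -e 'local($,) = ","; print "a", "b", "c"; print "\n";'   # prints: a,b,c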
This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
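The same single-pass idea can also be sketched in awk (assuming, as above, that the times are always the 10th and 11th comma-separated fields); it produces the same output as the Perl shown below:
awk 'BEGIN { FS = OFS = "," }
{
    for (i = 10; i <= 11; i++)
        if ($i != "")
            $i = substr($i,1,2) ":" substr($i,3,2) ":" substr($i,5,2)
    print
}' data.txt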
The output on the sample data you gave is:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,
