Search for patterns in a file and output each pattern's results to its own file using awk, sed - shell
I have a file with one number per line:
$ cat test
700320947
700509217
701113187
701435748
701435889
701667717
701668467
702119126
702306577
702914910
I want to look up the details for each of these numbers in another, larger file with several comma-separated fields, and output the results to
700320947.csv
700509217.csv
701113187.csv
701435748.csv
701435889.csv
701667717.csv
701668467.csv
702119126.csv
702306577.csv
702914910.csv
Logic:
ls test | while read file; do zgrep $line *large*file*gz >> $line.csv ; done
Please assist.
Thanks
Since nothing is said about the structure of the large file, I'll just assume that the numbers in test are to be found in the second column of the large file; generalize as needed.
This can be done in a single pass through each of the files by using output redirection in awk:
    awk -F',' 'FILENAME == "test" { num[$1] = 1; next }
               num[$2]            { print > ($2 ".csv") }' test bigfile
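As a quick sanity check, here is the same one-pass approach run against a tiny sample (the filenames and data below are illustrative, assuming the numbers sit in the second comma-separated field):

```shell
# Build a tiny number list and a matching comma-separated "bigfile"
printf '%s\n' 700320947 701113187 > test
cat > bigfile <<'EOF'
a,700320947,details-1
b,999999999,unrelated
c,701113187,details-2
d,700320947,details-3
EOF

# One pass: collect the numbers from "test", then route matching
# rows of "bigfile" to per-number CSV files via awk's redirection
awk -F',' 'FILENAME == "test" { num[$1] = 1; next }
           num[$2] { print > ($2 ".csv") }' test bigfile

cat 700320947.csv   # two matching rows
cat 701113187.csv   # one matching row
```

Each output file collects every row whose second field equals that number, in input order.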
Unzip the large file first; using zgrep means unzipping on-the-fly for every line of the number file... very inefficient. After unzipping the big file, this will do it:
    for number in $(cat test); do grep "$number" bigfile > "$number.csv"; done
Edited:
To limit hits to whole words only (e.g. 702119126 won't match 1702119126), add word boundaries to the regex:

    for number in $(cat test); do grep "\b$number\b" bigfile > "$number.csv"; done
Related
Trying to create a script that counts the length of all the reads in a fastq file but getting no return
I am trying to count the length of each read in a fastq file from Illumina sequencing and output this to a tsv, or any sort of file, so I can then later also look at this and count the number of reads per file. So I need to cycle down the file and extract each line that has a read on it (every 4th line), then get its length and store this as an output:

    num=2
    for file in *.fastq
    do
        echo "counting $file"
        function file_length(){
            wc -l $file | awk '{print$FNR}'
        }
        for line in $file_length
        do
            awk 'NR==$num' $file | chrlen > ${file}read_length.tsv
            num=$((num + 4))
        done
    done

Currently all I get is the "counting $file" and no other output, but also no errors.
Your script contains a lot of errors in both syntax and algorithm. Please try shellcheck to see what the problems are. The biggest issue is the $file_length part: you presumably want to call a function file_length() there, but it is just an undefined variable, which is evaluated as null in the for loop. If you just want to count the length of the 4th line of *.fastq files, please try something like:

    for file in *.fastq; do
        awk 'NR==4 {print length}' "$file" > "${file}_length.tsv"
    done

Or if you want to put the results together in a single tsv file, try:

    tsvfile="read_length.tsv"
    for file in *.fastq; do
        echo -n -e "$file\t" >> "$tsvfile"
        awk 'NR==4 {print length}' "$file" >> "$tsvfile"
    done

Hope this helps.
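Since fastq sequence lines sit on every 4th line starting at line 2, a single awk pass per file can emit one length per read, which is what the question originally asked for. A minimal sketch (the demo file and output filename are illustrative):

```shell
# Create a tiny two-read fastq file for demonstration
cat > demo.fastq <<'EOF'
@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
IIII
EOF

# NR % 4 == 2 selects the sequence line of each 4-line fastq record;
# "length" with no argument is the length of the current line
awk 'NR % 4 == 2 { print length }' demo.fastq > demo.fastq.read_length.tsv
```

The number of reads per file is then just the line count of the resulting tsv.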
Grep list (file) from another file
I'm new to bash and trying to extract a list of patterns from a file:

File1.txt:

    ABC
    BDF
    GHJ

base.csv (tried comma-separated and tab-delimited):

    line 1,,,,"hfhf,ferf,ju,ABC"
    line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
    line 3 .."himk,n,hn.ujj., BDF"
    etc

Suggested output is something like:

    ABC
    line 1.. line 2.. (whole lines)
    BDF
    line 3..

and so on for each pattern from File1.txt. The code I tried was:

    #!/bin/bash
    for i in *.txt    # cycle through all files containing pattern lists
    do
        for q in "$i"    # cycle through list
        do
            echo $q >> output.${i}
            grep -f "${q}" base.csv >> output.${i}
            echo "\n"
        done
    done

But the output is only the filename and then some list of strings without pattern names, e.g.

    File1.txt
    line 1...
    line 2...
    line 3..

so I don't know which pattern each string belongs to, and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.

    grep -f File1.txt base.csv > output.txt

It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all. If you want the output to be grouped per pattern, here's a while loop which looks for one pattern at a time:

    while read -r pat; do
        echo "$pat"
        grep "$pat" *.txt
    done < File1.txt > output.txt

But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.

An additional concern is anchoring: grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk, which gives you more control over where exactly to look for a match in a structured line.

    awk '# Read patterns into memory
        NR==FNR { a[++i] = $1; next }
        # Loop across patterns
        { for(j=1; j<=i; ++j)
            if($0 ~ a[j]) {
                print FILENAME ":" FNR ":" $0 >> ("output." a[j])
                next
            }
        }' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:

    #!/bin/bash
    for i in *.txt    # cycle through all files containing pattern lists
    do
        while read -r q    # read file line by line
        do
            echo "$q" >> "output.${i}"
            grep -e "$q" base.csv >> "output.${i}"
            echo
        done < "${i}"
    done
Here is one that separates (with split, comma-separated, with quotes and spaces stripped off) words from file2 into an array (word[]) and stores the record names (line 1 etc.) in it comma-separated:

    awk '
    NR==FNR {
        n=split($0,tmp,/[" ]*(,|$)[" ]*/)    # split words
        for(i=2;i<=n;i++)                    # after first
            if(tmp[i]!="")                   # non-empties
                word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1]  # hash rownames
        record[tmp[1]]=$0                    # store records
        next
    }
    ($1 in word) {                           # word found
        n=split(word[$1],tmp,",")            # get record names
        print $1 ":"                         # output word
        for(i=1;i<=n;i++)                    # and records
            print record[tmp[i]]
    }' file2 file1

Output:

    ABC:
    line 1,,,,"hfhf,ferf,ju,ABC"
    line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
    BDF:
    line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends. I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it gets the names of the pattern blocks, e.g. ABC, BDF, but no lines). I gave up for a while and then eventually tried another way. While the base goal was to cycle through pattern list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:

    for i in *.txt    # cycle through files with patterns
    do
        grep -F -f "$i" bigfile.csv >> ${i}.out1    # greps all patterns from current file
        cut -f 2,3,4,7 ${i}.out1 >> ${i}.out2       # cuts columns of interest and writes them to another file
    done

I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names, as I initially requested.
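For completeness, the grep-then-label idea can also be collapsed into one awk pass that tags each output line with the pattern it matched. This is only a sketch under assumptions: the sample filenames are illustrative, and index() is used for fixed-string matching (the equivalent of grep -F):

```shell
# Sample pattern list and data file (illustrative)
printf '%s\n' ABC BDF > File1.txt
cat > base.csv <<'EOF'
line 1,,,,"hfhf,ferf,ju,ABC"
line 2,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3,"himk,n,hn.ujj.,BDF"
EOF

# First pass (NR==FNR) loads the patterns; second pass prints each line
# prefixed by every pattern it contains (index() = fixed-string match)
awk 'NR==FNR { pat[$0]; next }
     { for (p in pat) if (index($0, p)) print p ": " $0 }' File1.txt base.csv
```

A line matching several patterns is printed once per pattern, so nothing has to be matched twice or post-processed with cut unless you also want column extraction.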
Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.
I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates. In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0, where i starts at 1 and is incremented.

So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:

    #!/bin/bash
    fdir=$1"*"
    # for each fastq file
    for f in $fdir
    do
        (
        # find duplicated read names and write to file $f.txt
        sort $f | uniq -d | grep ^# > "$f".txt
        # loop over each duplicated read name
        while read in; do
            rname=$in
            i=1
            # while this read name still exists in the file, increment and replace
            while grep -q "$rname" $f; do
                replace=${rname/0:0/$i:0}
                sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
                let "i+=1"
            done
        done < "$f".txt
        rm "$f".txt
        rm "$f".bu
        echo "done" >> progress.txt
        ) &
        background=( $(jobs -p) )
        if (( ${#background[@]} == 40 )); then
            wait -n
        fi
    done

The problem with it is that it's impractically slow. I ran it on a 48-core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why. My question is where the bugs are coming from and how I can do this more efficiently. I'm open to using other programming languages or changing my approach.

EDIT

Strangely, the loop works fine on one file. Basically I ran:

    sort $f | uniq -d | grep ^# > "$f".txt
    while read in; do
        rname=$in
        i=1
        while grep -q "$rname" $f; do
            replace=${rname/0:0/$i:0}
            sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
            let "i+=1"
        done
    done < "$f".txt

To give you an idea of what the files look like, below are a few lines from one of them.
The thing is that even though it works for the one file, it's slow: multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach. With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when they were run in parallel?

Sample input:

    #D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
    GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
    +
    CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
    #D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
    CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

Sample output:

    #D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
    GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
    +
    CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
    #D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
    CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input. Again, it is assumed that your input file is sorted by the first value (up to the first space character).

    time awk '{
      #dbg if (dbg) print "#dbg:prev=" prev
      if (/^#/ && prev!=$1) { fixNum=0; if (dbg) print "prev!=$1=" prev "!=" $1 }
      if (/^#/ && (prev==$1 || NR==1) ) {
        prev=$1
        n=split($1,tmpArr,":") ; n++
        #dbg if (dbg) print "tmpArr[6]=" tmpArr[6] "\tfixNum=" fixNum
        fixNum++; tmpArr[6]=fixNum
        # magic to rebuild $1 here
        for (i=1;i<n;i++) {
          tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
        }
        $1=tmpFix ; $0=$0
        print $0
      }
      else { tmpFix=""; print $0 }
    }' file > fixedFile

output

    #D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
    GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
    +
    CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
    #D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
    CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.

Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file > file) will truncate the existing file and then try to write, SO you can lose all the data of a file trying to use the same name.

There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of the time command that should appear at the end of the run, i.e.

    real 0m0.18s
    user 0m0.03s
    sys 0m0.06s

IHTH
    #!/bin/bash
    i=4
    sort $1 | uniq -d | grep ^# > dups.txt
    while read in; do
        if [ $((i % 4)) -eq 0 ] && grep -q "$in" dups.txt; then
            x="$in"
            x=${x/"0:0 "/$i":0 "}
            echo "$x" >> $1"fixed.txt"
        else
            echo "$in" >> $1"fixed.txt"
        fi
        let "i+=1"
    done < $1
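Another way to sketch this, avoiding the per-line grep/sed entirely, is two awk passes over the same file: the first pass counts each '#' header, the second renumbers only the headers that occurred more than once, starting from 1 as the question asks. The sample file below is illustrative, and it is assumed that the first 0:0 in the header is the one to rewrite:

```shell
# Tiny illustrative input: one duplicated header and one unique one
cat > sample.txt <<'EOF'
#A:0:0 X
SEQ1
#A:0:0 X
SEQ2
#B:0:0 Y
SEQ3
EOF

# Pass 1 (NR==FNR) counts every '#' header line; pass 2 rewrites the
# first 0:0 of each header seen more than once, numbering 1, 2, ...
awk 'NR==FNR { if (/^#/) cnt[$0]++; next }
     /^#/ && cnt[$0] > 1 { sub(/0:0/, ++idx[$0] ":0") }
     { print }' sample.txt sample.txt > fixed.txt
```

Because awk holds only the header counts in memory and makes two sequential passes, this stays fast even on large files, unlike re-scanning the file with grep and sed for every duplicate.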
Splitting large text file on every blank line
I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:

    dasdas #42319 blaablaa 50 50
    content
    content
    more content
    content conclusion

    asdasd #92012 blaablaa 30 70
    content again
    more of it
    content conclusion

    asdasd #299 yadayada 60 40
    content
    content
    contend done

...and so on. A typical information table in my file has anywhere between 10-40 rows.

I would like this file to be split into n smaller files, where n is the number of content tables. That is,

    dasdas #42319 blaablaa 50 50
    content
    content
    more content
    content conclusion

would be its own separate file (whateverN.txt), and

    asdasd #92012 blaablaa 30 70
    content again
    more of it
    content conclusion

again a separate file, whateverN+1.txt, and so forth. It seems like awk or Perl are nifty tools for this, but having never used them before, the syntax is kinda baffling. I found these two questions that are almost correspondent to my problem: Split text file into multiple files & How can I split a text file into multiple text files? (on Unix & Linux), but failed to modify the syntax to fit my needs. How should one modify the command-line inputs to solve my problem?
Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:

    awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt

RS: This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.

    $ cat file.txt
    dasdas #42319 blaablaa 50 50
    content
    content
    more content
    content conclusion

    asdasd #92012 blaablaa 30 70
    content again
    more of it
    content conclusion

    asdasd #299 yadayada 60 40
    content
    content
    contend done
    $ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
    $ ls whatever-*.txt
    whatever-1.txt  whatever-2.txt  whatever-3.txt
    $ cat whatever-1.txt
    dasdas #42319 blaablaa 50 50
    content
    content
    more content
    content conclusion
    $ cat whatever-2.txt
    asdasd #92012 blaablaa 30 70
    content again
    more of it
    content conclusion
    $ cat whatever-3.txt
    asdasd #299 yadayada 60 40
    content
    content
    contend done
    $
You could use the csplit command:

    csplit \
        --quiet \
        --prefix=whatever \
        --suffix-format=%02d.txt \
        --suppress-matched \
        infile.txt /^$/ {*}

POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit. This is what the options do:

--quiet – suppress output of file sizes
--prefix=whatever – use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt – append .txt to the default two-digit suffix
--suppress-matched – don't include the lines matching the pattern on which the input is split
/^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
Perl has a useful feature called the input record separator, $/. This is the 'marker' for separating records when reading a file. So:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    local $/ = "\n\n";
    my $count = 0;
    while ( my $chunk = <> ) {
        open ( my $output, '>', "filename_" . $count++ ) or die $!;
        print {$output} $chunk;
        close ( $output );
    }

Just like that. The <> is the 'magic' filehandle, in that it reads piped data or from files specified on the command line (opens them and reads them). This is similar to how sed or grep work. This can be reduced to a one-liner:

    perl -00 -pe 'open ( $out, ">", "filename_" . ++$n ); select $out;' yourfilename_here
You can use this awk:

    awk 'BEGIN{file="content" (++i) ".txt"} !NF{file="content" (++i) ".txt";next} {print > file}' yourfile

(OR)

    awk 'BEGIN{++i} !NF{++i;next} {print > ("filename" i ".txt")}' yourfile

More readable format:

    BEGIN {
        file = "content" (++i) ".txt"
    }
    !NF {
        file = "content" (++i) ".txt"
        next
    }
    {
        print > file
    }
In case you get a "too many open files" error as follows...

    awk: whatever-18.txt makes too many open files
     input record number 18, file file.txt
     source line number 1

You may need to close each newly created file before creating the next one, as follows:

    awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
Since it's Friday and I'm feeling a bit helpful... :)

Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.

    use strict;
    use warnings;

    # slurp file
    local $/ = undef;
    open my $fh, '<', 'test.txt' or die $!;
    my $text = <$fh>;
    close $fh;

    # split on double newline
    my @chunks = split(/\n\n/, $text);

    # make new files from chunks
    my $count = 1;
    for my $chunk (@chunks) {
        open my $ofh, '>', "whatever$count.txt" or die $!;
        print $ofh $chunk, "\n";
        close $ofh;
        $count++;
    }

The perl docs can explain any individual commands you don't understand, but at this point you should probably look into a tutorial as well.
    awk -v RS="\n\n" '{print > (NR "")}' file.txt

Sets the record separator to a blank line and prints each record to a separate file numbered 1, 2, 3, etc. The last file (only) ends in a blank line.
Try this bash script also:

    #!/bin/bash
    i=1
    fileName="OutputFile_$i"
    while read line ; do
        if [ "$line" == "" ] ; then
            ((++i))
            fileName="OutputFile_$i"
        else
            echo $line >> "$fileName"
        fi
    done < InputFile.txt
You can also try split -p "^$" (note that the -p pattern option is BSD split; GNU split doesn't have it).
How to convert HHMMSS to HH:MM:SS in Unix?
I tried to convert HHMMSS to HH:MM:SS and I am able to convert it successfully, but my script takes 2 hours to complete because of the file size. Is there any better (faster) way to complete this task?

Data file data.txt:

    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,

My code: script.sh

    #!/bin/bash
    awk -F"," '{print $5}' Data.txt > tmp.txt    # extract field 5 of every line, i.e. all numbers go into tmp.txt
    sort tmp.txt | uniq -d > Uniqe_number.txt    # unique values are stored in Uniqe_number.txt
    rm tmp.txt                                   # remove tmp file
    while read line; do
        echo $line
        cat Data.txt | grep ",$line," > Numbers/All/$line.txt    # grep the number and create files individually
        awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
        mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
    done < Uniqe_number.txt
    ls Numbers/Final > files.txt
    dos2unix files.txt
    bash time_replace.sh

When you execute the above script, it will call the time_replace.sh script.

My code for time_replace.sh:

    #!/bin/bash
    for i in `cat files.txt`
    do
        while read aline
        do
            TimeDep=`echo $aline | awk -F"," '{print $6}'`
            #echo $TimeDep
            finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
            #echo $finalTimeDep
            ##########
            TimeAri=`echo $aline | awk -F"," '{print $7}'`
            #echo $TimeAri
            finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
            #echo $finalTimeAri
            sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
            sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
            ############################
        done < Numbers/Final/$i
    done

Any better solution? Appreciate any help. Thanks
Sri
If there's a large quantity of files, then the pipelines are probably what are going to impact performance more than anything else. Although processes can be cheap, if you're doing a huge amount of processing then cutting down the amount of time you pass data through a pipeline can reap dividends. So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitrary file, so the while loop in your first script could be replaced with an awk script that does this. You also don't need to use a temporary file.

I assume the sorting is just for tracking progress easily, as you know how many numbers there are. But if you don't care about the sorting, you can simply do this:

    #!/bin/sh
    awk -F ',' '
    { print $5","$4","$7","$8","$9","$10","$11 > ("Numbers/Final/Final_" $5 ".txt") }
    ' datafile.txt
    ls Numbers/Final > files.txt

Alternatively, if you need to sort, you can do sort -t, -k5,4,10 (or whichever fields your sort keys actually need to be).

As for formatting the datetime, awk also does functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis)... (Note! Untested, so may contain vague syntax errors):

    #!/usr/bin/awk -f
    BEGIN {
        FS=","
    }

    function formattime (t)
    {
        return substr(t,1,2)":"substr(t,3,2)":"substr(t,5,2)
    }

    {
        print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > ("Numbers/Final/Final_" $5 ".txt")
    }

which you can save, chmod 700, and call directly as:

    dostuff.awk filename

Other awk options include changing fields in situ, so if you want to maintain the entire original file but with formatted datetimes, you can do a modification of the above. Change the print block to:

    {
        $10=formattime($10)
        $11=formattime($11)
        print $0
    }

If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS (array indices 9 and 10, since bash arrays count from zero).

    while IFS=, read -a line ; do
        echo -n ${line[0]},${line[1]},${line[2]},${line[3]},
        echo -n ${line[4]},${line[5]},${line[6]},${line[7]},
        echo -n ${line[8]},
        if [ -n "${line[9]}" ]; then
            echo -n ${line[9]:0:2}:${line[9]:2:2}:${line[9]:4:2}
        fi
        echo -n ,
        if [ -n "${line[10]}" ]; then
            echo -n ${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}
        fi
        echo ","
    done < data.txt

The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
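The same transformation can also be sketched as a single awk pass. The field numbers (10 and 11) follow the sample data above and are an assumption; adjust them if your real file differs:

```shell
# Two illustrative records taken from the shape of the sample data
cat > data.txt <<'EOF'
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
EOF

# Rewrite fields 10 and 11 from HHMMSS to HH:MM:SS, leaving empty fields alone;
# OFS="," keeps the record comma-separated when awk rebuilds the line
awk -F',' -v OFS=',' '{
    for (i = 10; i <= 11; i++)
        if ($i != "")
            $i = substr($i,1,2) ":" substr($i,3,2) ":" substr($i,5,2)
    print
}' data.txt
```

Because the whole job runs in one awk process with no per-line subshells, this avoids the fork-per-field cost that made the original script take hours.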
In Perl, that's close to child's play:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use English( -no_match_vars );
    local($OFS) = ",";
    while (<>)
    {
        my(@F) = split /,/;
        $F[9]  =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
        $F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
        print @F;
    }

If you don't want to use English, you can write local($,) = ","; instead; it controls the output field separator, choosing to use comma. The code reads each line in the file, splits it up on the commas, takes the two time fields (counting from zero), and (if they're not empty) inserts colons in between the pairs of digits. I'm sure a 'Code Golf' solution would be made a lot shorter, but this is semi-legible if you know any Perl.

This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
The output on the sample data you gave is:

    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
    10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
    10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
    10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,