Slow text parsing in bash script, any advice? - bash

I have wrote a script below to parse a text file that effectively removes line returns. It will take input that looks like this:
TCP 0.0.0.0:135 SVR LISTENING 776
RpcSs
And return this to a new text document
TCP 0.0.0.0:135 SVR LISTENING 776 RpcSs
Some entries span more than two lines so I was not able to write a script that removes the line return from every other line so I came up with this approach below. It worked fine for small collects but a 7MB collect resulted in my computer running out of memory and it took quite a bit of time do this before it failed. I'm curious why it ran out of memory as well as hoping someone could educate me on a better way to do this.
#!/bin/bash
#
# VARS
writeOuput=""
#
while read line
do
curLine=$line #grab current line from document
varWord=$(echo $curLine | awk '{print $1}') #grab first word from each line
if [ "$varWord" == "TCP" ] || [ "$varWord" == "UDP" ]; then
#echo "$curLine" >> results.txt
unset writeOutput
writeOutput=$curLine
elif [ "$varWord" == "Active" ]; then #new session
printf "\n" >> results1.txt
printf "New Session" >> results1.txt
printf "\n" >> results1.txt
else
writeOutput+=" $curLine"
#echo "$writeOutput\n"
printf "$writeOutput\n" >> results1.txt
#sed -e '"$index"s/$/"$curLine"'
fi
done < $1

Consider replacing the line with the awk call with this line:
varWord=${curLine%% *} #grab first word from each line
This saves the fork that happens in each iteration by using Bash-internal functionality only and should make your program run several times faster. See also that other guy's comment linking to this answer for an explanation.

As others have noted, the main bottleneck in your script is probably the forking where you pass each line through its own awk instance.
I have created an awk script which I hope does the same as your bash script, and I suspect it should run faster. Initially I just thought about replacing newlines with spaces, and manually adding newlines in front of every TCP or UDP, like this:
awk '
BEGIN {ORS=" "};
$1~/(TCP|UDP)/ {printf("\n")};
{print};
END {printf("\n")}
' <file>
But your script removes the 'Active' lines from the output, and adds three new lines before the line. You could, of course, pipe this through a second `awk command:
awk '/Active/ {gsub(/Active /, ""); print("\nNew Session\n")}; {print}'
But this awk script is a bit closer to what you did with bash, but it should still be considerably faster:
$ cat join.awk
$1~/Active/ {print("\nNew Session\n"); next}
$1~/(TCP|UDP)/ {if (output) print output; output = ""}
{if (output) output = output " " $0; else output = $0}
END {print output}
$ awk -f join.awk <file>
First, it checks whether the line begins with the word "Active", if it does, it prints the three lines, and goes on to the next input line.
Otherwise it checks for the presence of TCP or UDP as the first word. If it finds them, it prints what has accumulated in writeOutput (provided there is something in the variable), and clears it.
It then adds whatever it finds in the line to writeOutput
At the end, it prints what has accumulated since the last TCP or UDP.

Related

Convert multi-line csv to single line using Linux tools

I have a .csv file that contains double quoted multi-line fields. I need to convert the multi-line cell to a single line. It doesn't show in the sample data but I do not know which fields might be multi-line so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't how much data so performance isn't a consideration.
I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.
The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.
I tried something like this and it kind of works on one row but fails on multiple rows.
sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv
CSV example
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
The output I want is
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
etc.
First my apologies for getting here 7 months late...
I came across a problem similar to yours today, with multiple fields with multi-line types. I was glad to find your question but at least for my case I have the complexity that, as more than one field is conflicting, quotes might open, close and open again on the same line... anyway, reading a lot and combining answers from different posts I came up with something like this:
First I count the quotes in a line, to do that, I take out everything but quotes and then use wc:
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
If you think of a single multi-line field, knowing if the quotes are 1 or 2 is enough. In a more generic scenario like mine I have to know if the number of quotes is odd or even to know if the line completes the record or expects more information.
To check for even or odd you can use the mod operand (%), in general:
even % 2 = 0
odd % 2 = 1
For the first line:
Odd means that the line expects more information on the next line.
Even means the line is complete.
For the subsequent lines, I have to know the status of the previous one. for instance in your sample text:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
You can say line 1 (John,Doe,"Country) has 1 quote (odd) what means the status of the record is incomplete or open.
When you go to line 2, there is no quote (even). Nevertheless this does not mean the record is complete, you have to consider the previous status... so for the lines following the first one it will be:
Odd means that record status toggles (incomplete to complete).
Even means that record status remains as the previous line.
What I did was looping line by line while carrying the status of the last line to the next one:
incomplete=0
cat file.csv | while read line; do
quotes=`echo $line | tr -cd '"' | wc -c` # Counts the quotes
incomplete=$((($quotes+$incomplete)%2)) # Check if Odd or Even to decide status
if [ $incomplete -eq 1 ]; then
echo -n "$line " >> new.csv # If line is incomplete join with next
else
echo "$line" >> new.csv # If line completes the record finish
fi
done
Once this was executed, a file in your format generates a new.csv like this:
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345
I like one-liners as much as everyone, I wrote that script just for the sake of clarity, you can - arguably - write it in one line like:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...
Recovering the multi-line fields
Every need is different, in my case I wanted the records in one line to further process the csv to add some bash-extracted data, but I would like to keep the csv as it was. To accomplish that, instead of joining the lines with a space I used a code - likely unique - that I could then search and replace:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l";done >new.csv
the code is ~newline~, this is totally arbitrary of course.
Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:
sed -i 's/ ~newline~ /\n/g' new.csv
References:
Ternary operator: https://stackoverflow.com/a/3953666/6316852
Count char occurrences: https://stackoverflow.com/a/41119233/6316852
Other peculiar cases: https://www.linuxquestions.org/questions/programming-9/complex-bash-string-substitution-of-csv-file-with-multiline-data-937179/
TL;DR
Run this:
i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv
... and collect results in new.csv
I hope it helps!
If Perl is your option, please try the following:
perl -e '
while (<>) {
$str .= $_;
}
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
if (($el = $&) =~ /^".*"$/s) {
$el =~ s/^"//s; $el =~ s/"$//s;
$el =~ s/""/"/g;
$el =~ s/\s+(?!$)/ /g;
}
push(#ary, $el);
}
foreach (#ary) {
print /\n$/ ? "$_" : "$_,";
}' sample.csv
sample.csv:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
John,Doe,"Country
City
Street",67890
Result:
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
This might work for you (GNU sed):
sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file
Test each line to see that it contains the correct number of fields (in the example that was 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) by spaces and finally remove the "'s.
N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
Try cat -v file.csv. When the file was made with Excel, you might have some luck: When the newlines in a field are a simple \n and the newline at the end is a \r\n (which will look like ^M), parsing is simple.
# delete all newlines and replace the ^M with a new newline.
tr -d "\n" < file.csv| tr "\r" "\n"
# Above two steps with one command
tr "\n\r" " \n" < file.csv
When you want a space between the joined line, you need an additional step.
tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
EDIT: #sjaak commented this didn't work is his case.
When your broken lines also have ^M you still can be a lucky (wo-)man.
When your broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join 2 lines when the first line has exactly one double quote.
sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
Explanation:
-z don't use \n as line endings
:a label for repeating the step after successful replacement
(\n|^) Search after a newline or the very first line
([^"]*) Substring without a "
ta Go back to label a and repeat
awk pattern matching is working.
answer in one line :
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
if you'd like to drop quotes, you could use:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'
but I prefer to keep it.
to explain the code:
/Pattern/ : find pattern in current line.
ORS : indicates the output line record.
$0 : indicates the whole of the current line.
's/OldPattern/NewPattern/': substitude first OldPattern with NewPattern
/g : does the previous action for all OldPattern
/w : write the result to Newfile

Grep list (file) from another file

Im new to bash and trying to extract a list of patterns from file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is smth like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so i don`t know to what pattern belongs each string and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a for loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >>output.a[j]
next }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that separates (with split, comma-separatd with quotes and spaces stripped off) words from file2 to an array (word[]) and stores the record names (line 1 etc.) to it comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ( "do" expected) or misbehavior ( gets names of pattern blocks, eg ABC, BDF, but no lines.
Gave up for a while and then eventually tried another way
While base goal were to cycle through pattern list files, search for patterns in huge file and write out specific columns from lines found - i simply wrote
for *i in *txt # cycle throughfiles w/ patterns
do
grep -F -f "$i" bigfile.csv >> ${i}.out1 #greps all patterns from current file
cut -f 2,3,4,7 ${i}.out1>> ${i}.out2 # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is, hope it`ll help somebody in similar situation. You can easily add some echoes to write out pattern list names as i initially requested

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here my best attempt so far.
filename="testfasta.fa"
while read -r line
do
if [[ "$line" == ">"* ]]; then
id=$(cut -c2-4<<<"$line")
printf $line"|population:"$id"\n" >>outfile
else
printf $line"\n">>outfile
fi
done <"$filename"
This produces a file with the original headers and following line each on a single line.
Can someone tell me where I'm going wrong? My if and else loop aren't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a > The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ will perform a regex test, and ^> means "Starts with >". Making the test: ($1~/^>/)
If so it will set the first field to the output you are looking for (using substr() to get the bits of the string you want. {$1=$1"|population:"substr($1,2,3)}
Finally it will print out the entire record (with the changes if applicable): {}1 which is shorthand for {print $0} or.. print the entire record.

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
(
#find duplicated read names and write to file $f.txt
sort $f | uniq -d | grep ^# > "$f".txt
#loop over each duplicated readname
while read in; do
rname=$in
i=1
#while this readname still exists in the file increment and replace
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
rm "$f".txt
rm "$f".bu
done
echo "done" >> progress.txt
)&
background=( $(jobs -p) )
if (( ${#background[#]} ==40)); then
wait -n
fi
done
The problem with it is that its impractically slow. I ran it on a 48 core computer for over 3 days and it hardly got through 30 files. It also seemed to have removed about 10 files and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
rname=$in
i=1
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
To give you an idea of what the files look like below are a few lines from one of them. The thing is that even though it works for the one file, it's slow. Like multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs I have no idea what was happening Maybe it was running into memory collisions or something when they were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
Again, it is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
#dbg if (dbg) print "#dbg:prev=" prev
if (/^#/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
if (/^#/ && (prev==$1 || NR==1) ) {
prev=$1
n=split($1,tmpArr,":") ; n++
#dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
fixNum++;tmpArr[6]=fixNum;
# magic to rebuild $1 here
for (i=1;i<n;i++) {
tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
}
$1=tmpFix ; $0=$0
print $0
}
else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh, you should be able to copy/paste the code block into a terminal window cmd-line and replace file > fixFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ...file>file) will truncate the existing file and then try to write, SO you can lose all the data of a file trying to use the same name.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort $1 | uniq -d | grep ^# > dups.txt
while read in; do
if [ $((i%4))=0 ] && grep -q "$in" dups.txt; then
x="$in"
x=${x/"0:0 "/$i":0 "}
echo "$x" >> $1"fixed.txt"
else
echo "$in" >> $1"fixed.txt"
fi
let "i+=1"
done < $1

How to convert HHMMSS to HH:MM:SS Unix?

I tried to convert the HHMMSS to HH:MM:SS and I am able to convert it successfully but my script takes 2 hours to complete because of the file size. Is there any better way (fastest way) to complete this task
Data File
data.txt
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,
My code : script.sh
#!/bin/bash
awk -F"," '{print $5}' Data.txt > tmp.txt # print first line first string before , to tmp.txt i.e. all Numbers will be placed into tmp.txt
sort tmp.txt | uniq -d > Uniqe_number.txt # unique values be stored to Uniqe_number.txt
rm tmp.txt # removes tmp file
while read line; do
echo $line
cat Data.txt | grep ",$line," > Numbers/All/$line.txt # grep Number and creats files induvidtually
awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
done < Uniqe_number.txt
ls Numbers/Final > files.txt
dos2unix files.txt
bash time_replace.sh
when you execute above script it will call time_replace.sh script
My Code for time_replace.sh
#!/bin/bash
for i in `cat files.txt`
do
while read aline
do
TimeDep=`echo $aline | awk -F"," '{print $6}'`
#echo $TimeDep
finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeDep
##########
TimeAri=`echo $aline | awk -F"," '{print $7}'`
#echo $TimeAri
finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
#echo $finalTimeAri
sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
############################
done < Numbers/Final/$i
done
Any better solution?
Appreciate any help.
Thanks
Sri
If there's a large quantity of files, then the pipelines are probably what are going to impact performance more than anything else - although processes can be cheap, if you're doing a huge amount of processing then cutting down the amount of time you do pass data through a pipeline can reap dividends.
So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitary file, so the while lop in your first file could be replaced with an awk script that does this. You also don't need to use a temporary file.
I assume the sorting is just for tracking progress easily as you know how many numbers there are. But if you don't care for the sorting, you can simply do this:
#!/bin/sh
awk -F ',' '
{
print $5","$4","$7","$8","$9","$10","$11 > Numbers/Final/Final_$line.txt
}' datafile.txt
ls Numbers/Final > files.txt
Alternatively, if you need to sort you can do sort -t, -k5,4,10 (or whichever field your sort keys actually need to be).
As for formatting the datetime, awk also does functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis) ... (Note! Untested, so may contain vauge syntax errors):
#!/usr/bin/awk
BEGIN {
FS=","
}
function formattime (t)
{
return substr(t,1,2)":"substr(t,3,2)":"substr(t,5,2)
}
{
print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > Numbers/Final/Final_$line.txt
}
which you can save, chmod 700, and call directly as:
dostuff.awk filename
Other awk options include changing fields in-situ, so if you want to maintain the entire original file but with formatted datetimes, you can do a modification of the above. Change the print block to:
{
$10=formattime($10)
$11=formattime($11)
print $0
}
If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS.
while IFS=, read -a line ; do
echo -n ${line[0]},${line[1]},${line[2]},${line[3]},
echo -n ${line[4]},${line[5]},${line[6]},${line[7]},
echo -n ${line[8]},${line[9]},
if [ -n "${line[10]}" ]; then
echo -n ${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}
fi
echo -n ,
if [ -n "${line[11]}" ]; then
echo -n ${line[11]:0:2}:${line[11]:2:2}:${line[11]:4:2}
fi
echo ""
done < data.txt
The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
In Perl, that's close to child's play:
#!/usr/bin/env perl
use strict;
use warnings;
use English( -no_match_vars );
local($OFS) = ",";
while (<>)
{
my(#F) = split /,/;
$F[9] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
$F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
print #F;
}
If you don't want to use English, you can write local($,) = ","; instead; it controls the output field separator, choosing to use comma. The code reads each line in the file, splits it up on the commas, takes the last two fields, counting from zero, and (if they're not empty) inserts colons in between the pairs of digits. I'm sure a 'Code Golf' solution would be made a lot shorter, but this is semi-legible if you know any Perl.
This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
The output on the sample data you gave is:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,

Resources