Fast grep on huge csv files - bash
I have a file (queryids.txt) with a list of 847 keywords to search for. I have to grep the keywords from about 12 huge csv files (the biggest has 2,184,820,000 lines). Eventually we will load it into a database of some sort, but for now we just want certain keywords to be grep'ed.
My command is:
LC_ALL=C fgrep -f queryids.txt subject.csv
I am thinking of writing a bash script like this:
#!/bin/bash
for f in *.csv
do
( echo "Processing $f"
filename=$(basename "$f")
filename="${filename%.*}"
LC_ALL=C fgrep -f queryids.txt $f > $filename"_goi.csv" ) &
done
and I will run it using: nohup bash myscript.sh &
The queryids.txt looks like this:
ENST00000401850
ENST00000249005
ENST00000381278
ENST00000483026
ENST00000465765
ENST00000269080
ENST00000586539
ENST00000588458
ENST00000586292
ENST00000591459
The subject file looks like this:
target_id,length,eff_length,est_counts,tpm,id
ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109
ENST00000473358.1,712,5.39477E2,8.26564,0.244474,00065a62-5e18-4223-a884-12fca053a109
ENST00000469289.1,535,3.62675E2,4.82917,0.212463,00065a62-5e18-4223-a884-12fca053a109
ENST00000607096.1,138,1.92013E1,0,0,00065a62-5e18-4223-a884-12fca053a109
ENST00000417324.1,1187,1.01447E3,0,0,00065a62-5e18-4223-a884-12fca053a109
I am concerned this will take a long time. Is there a faster way to do this?
Thanks!
A few things I can suggest to improve the performance:
There is no need to spawn a subshell using ( ... ) &; you can use a brace group { ...; } & instead if you still want the jobs run in the background (a sketch of that is at the end of this answer).
Use grep -F (fixed-string, i.e. non-regex, search); it is equivalent to the now-deprecated fgrep.
Avoid the basename command and use bash string manipulation instead.
Try this script:
#!/bin/bash
for f in *.csv; do
echo "Processing $f"
filename="${f##*/}"
LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
done
I suggest you run this on a smaller dataset to compare the performance gain.
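If you do still want each file handled by its own background job, as in your original script, the brace-group form from the first suggestion would look something like this sketch (untested at this scale; whether running the greps concurrently actually helps depends on whether the disk or the CPU is the bottleneck):
#!/bin/bash
for f in *.csv; do
    {
        echo "Processing $f"
        LC_ALL=C grep -Ff queryids.txt "$f" > "${f%.*}_goi.csv"
    } &    # brace group instead of a subshell, backgrounded per file
done
wait       # block until every background grep has finished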
You could try this instead:
awk '
BEGIN {
    while ( (getline line < "queryids.txt") > 0 ) {
        re = ( re=="" ? "" : re "|") line
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$0 ~ re { print > out }
' *.csv
It's using a regexp rather than a string comparison; whether that matters and, if so, what we can do about it depends on the values in queryids.txt. In fact there may be a vastly faster and more robust way to do this depending on what your files contain, so if you edit your question to include some examples of your file contents we could be of more help.
I see you have now posted some sample input and indeed we can do this much faster and more robustly by using a hash lookup:
awk '
BEGIN {
FS="."
while ( (getline line < "queryids.txt") > 0 ) {
ids[line]
}
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$1 in ids { print > out }
' *.csv
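Two practical notes on the hash-lookup version (these are my additions, not part of the answer above). First, comparing the whole first field avoids the accidental substring matches that a plain grep -f can produce; a tiny made-up illustration, where the second id merely contains the first as a prefix:
printf '%s\n' 'ENST00000401850.1,68,...' 'ENST000004018503.1,712,...' |
grep -F 'ENST00000401850'                      # substring match: prints BOTH lines
printf '%s\n' 'ENST00000401850.1,68,...' 'ENST000004018503.1,712,...' |
awk -F. '$1 == "ENST00000401850"'              # exact match on field 1: prints only the first line
Second, at these file sizes it may be worth saving the awk program (the part between the single quotes) to a file - extract_goi.awk is just a name I chose - checking it against a truncated sample, and only then starting the full run in the background:
head -n 100000 subject.csv > /tmp/sample.csv
awk -f extract_goi.awk /tmp/sample.csv   # writes /tmp/sample_goi.csv; eyeball it first
nohup awk -f extract_goi.awk *.csv &     # then let the real job run unattended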
Related
Unscramble words Challenge - improve my bash solution
There is a Capture the Flag challenge. I have two files; one with scrambled text like this, with about 550 entries:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries:
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of each word, which is contained in the dictionary file. My approach is to sort the characters of the first word from the first file and then check whether the first word from the second file has the same length. If so, I sort that too and compare them. This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
    var=0
    ro=$(echo $p | perl -F -lane 'print sort @F')
    len_ro=${#ro}
    while IFS="" read -r o || [ -n "$o" ]
    do
        ro2=$(echo $o | perl -F -lane 'print sort @F')
        len_ro2=${#ro2}
        let "var+=1"
        if [ $len_ro == $len_ro2 ]; then
            if [ $ro == $ro2 ]; then
                echo $o >> new.txt
                echo $var >> whichline.txt
            fi
        fi
    done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns may produce the same sum.
[edit] For the record:
- no anagrams contained in the dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to the CTF for the guy who wanted the files: https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file. Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)). The solution I propose executes two loops of at most 9,000 passes each (O(N+M)). Bonus: it finds all possible solutions at no extra cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );

my $dict_qfn      = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";

sub key { join "", sort split //, $_[0] }

my %dict;
{
    open(my $fh, "<", $dict_qfn)
        or die("Can't open \"$dict_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        push @{ $dict{key($_)} }, $_;
    }
}

{
    open(my $fh, "<", $scrambled_qfn)
        or die("Can't open \"$scrambled_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        my $matches = $dict{key($_)};
        say "$_ matches @$matches" if $matches;
    }
}
I wouldn't be surprised if this takes only a millionth of the time of your solution for the sizes you provided (and it scales much better than yours if you were to increase the sizes).
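If you also need the whichline.txt output from your original script, a sketch of the same idea extended to track line numbers (my assumption, not part of the answer above) could look like this:
#!/usr/bin/perl
use strict;
use warnings;

# Same sorted-letters key as above.
sub key { join "", sort split //, $_[0] }

# Build the lookup, remembering each dictionary word's line number ($.).
my %dict;
open(my $dfh, "<", "dictionary.txt") or die("Can't open dictionary.txt: $!\n");
while (<$dfh>) {
    chomp;
    push @{ $dict{key($_)} }, { word => $_, line => $. };
}
close($dfh);

open(my $sfh,   "<", "scrambled-words.txt") or die("Can't open scrambled-words.txt: $!\n");
open(my $new,   ">", "new.txt")             or die("Can't write new.txt: $!\n");
open(my $which, ">", "whichline.txt")       or die("Can't write whichline.txt: $!\n");
while (<$sfh>) {
    chomp;
    for my $m ( @{ $dict{key($_)} || [] } ) {
        print $new   "$m->{word}\n";
        print $which "$m->{line}\n";
    }
}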
I would do something like this with gawk:
gawk '
NR == FNR {
    dict[csort()] = $0
    next
}
{ print dict[csort()] }
function csort(    chars, sorted) {
    split($0, chars, "")
    asort(chars)
    for (i in chars)
        sorted = sorted chars[i]
    return sorted
}' dictionary.txt scrambled-words.txt
Here's a perl-free solution I came up with using sort and join:
sort_letters() {
    # Splits each letter onto a line, sorts the letters, then joins them
    # e.g. "hello" becomes "ehllo"
    echo "${1}" | fold -b1 | sort | tr -d '\n'
}

# For each input file...
for input in "dict.txt" "words.txt"; do
    # Convert each line to "[sorted] [original]",
    # then sort and save the results with a .sorted extension
    while read -r original; do
        sorted=$(sort_letters "${original}")
        echo "${sorted} ${original}"
    done < "${input}" | sort > "${input}.sorted"
done

# Join the two files on the [sorted] word,
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
I tried something very similar, but a bit different.
#!/bin/bash

exec 3<scrambled-words.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done >scrambled-words_sorted.txt
exec 3>&-

exec 3<dictionary.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done >dictionary_sorted.txt
exec 3>&-

printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
    counter="$((++counter))"
    grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
    printf "\n" >>whichline.txt
done
exec 3>&-

As you can see, I don't create a new.txt file; instead I only create whichline.txt, with a blank line where the word doesn't match. You can easily paste them together to create new.txt. The logic behind the script is nearly the same as yours, except that I call perl fewer times and I save two support files. I think (but I am not sure) that creating them and cycling through only one file will be better than ~5 million calls of perl; this way perl is called "only" ~10k times. Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and by searching for the entire line the length check is intrinsic in the regex. Please note that what @benjamin-w said is still valid and, in that case, grep will behave badly and I have not handled it. I hope this helps [:
Looping input file and find out if line is used
I am using bash to loop through a large input file (contents.txt) that looks like:
searchterm1
searchterm2
searchterm3
...in an effort to remove search terms from the file if they are not used in a code base. I am trying to use grep and awk, but with no success. I also want to exclude the images and constants directories.
#/bin/bash
while read a; do
    output=`grep -R $a ../website | grep -v ../website/images | grep -v ../website/constants | grep -v ../website/.git`
    if [ -z "$output" ]
    then
        echo "$a" >> notneeded.txt
    else
        echo "$a used $($output | wc -l) times" >> needed.txt
    fi
done < constants.txt
The desired effect of this would be two files: one showing all of the search terms that are found in the code base (needed.txt), and another for search terms that are not found in the code base (notneeded.txt).
needed.txt
searchterm1 used 4 times
searchterm3 used 10 times
notneeded.txt
searchterm2
I've tried awk as well in a similar fashion, but I cannot get it to loop and output as desired.
Not sure, but it sounds like you're looking for something like this (assuming no spaces in your file names):
awk '
NR==FNR { terms[$0]; next }
{
    for (term in terms) {
        if ($0 ~ term) {
            hits[term]++
        }
    }
}
END {
    for (term in terms) {
        if (term in hits) {
            print term " used " hits[term] " times" > "needed.txt"
        }
        else {
            print term > "notneeded.txt"
        }
    }
}
' constants.txt $( find ../website -type f -print | egrep -v '\.\.\/website\/(images|constants|\.git)' )
There's probably some find option to make the egrep unnecessary.
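That find option does exist; one way to drop the egrep is -prune (directory names taken from the question, untested against that tree):
find ../website \
    \( -path '../website/images' -o -path '../website/constants' -o -path '../website/.git' \) -prune \
    -o -type f -print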
Better way of extracting data from file for comparison
Problem: comparison of files from the pre-check status and post-check status of a node for specific parameters. With some help from the community, I have written the following solution, which extracts the information from files in the pre and post directories based on the "Node-ID" (which happens to be unique and is to be extracted from the files as well). After extracting the data from the pre/post folders, I create folders based on the node-id and dump the files into them. My code to extract the data (the data is extracted from the Pre and Post folders):
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
    NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ## Generate the node-id
    echo "Extracting Post check information for " $NODE
    mkdir temp/$NODE-post ## create a temp directory
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure like:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on. Now I am stuck on comparing the $NODE-pre and $NODE-post files. I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff? Moreover, I find the above data extraction program very slow. I believe it's not the best possible way (using the least resources) to do so. Any suggestions?
Look askance at any instance of cat one-file; you could use I/O redirection on the next command in the pipeline instead. You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name *.log)
do
    NODE=$(sed '/>/{ s/ .*//; s/>//g; p; q; }' $f) ## Generate the node-id
    echo "Extracting Post check information for $NODE"
    mkdir temp/$NODE-post
    awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
        'BEGIN { RS = NODE"> " }
         /^param1/ { param1 = $0 }
         /^param2/ { param2 = $0 }
         /^param3/ { param3 = $0 }
         END {
             print RS param1 > DIR "/param1.txt"
             print RS param2 > DIR "/param2.txt"
             print RS param3 > DIR "/param3.txt"
         }' $f
done
The NODE-finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere. The main processing of the log file should be done once; a single awk command is sufficient. The awk script is passed two variables, NODE and the directory name. The BEGIN block is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves it in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
    if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
    then : No difference
    else diff temp/$NODE-pre/$file temp/$NODE-post/$file
    fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
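A sketch of that outer per-node loop (the temp/<node>-pre and temp/<node>-post layout comes from the question; the loop itself is my assumption):
for post in temp/*-post
do
    NODE=${post#temp/}    # strip the leading directory
    NODE=${NODE%-post}    # strip the -post suffix, leaving the node-id
    pre="temp/$NODE-pre"
    if [ -d "$pre" ]
    then diff -r "$pre" "$post"
    else echo "No pre-check directory for $NODE" >&2
    fi
done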
How to convert HHMMSS to HH:MM:SS Unix?
I tried to convert HHMMSS to HH:MM:SS and I am able to convert it successfully, but my script takes 2 hours to complete because of the file size. Is there any better (faster) way to complete this task?
Data file data.txt:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,071600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,072200,072200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,072600,072600,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073200,073200,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,073500,073500,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,073700,073700,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,073900,073900,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,074400,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,090200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,090900,090900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,091500,091500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,091900,091900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092500,092500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,092900,092900,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,093200,093200,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,093500,093500,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,094500,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,170100,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,170400,170400,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,170700,170700,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171000,171000,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,171500,171500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,171900,171900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172500,172500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,172900,172900,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,173500,173500,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,174100,,
My code, script.sh:
#!/bin/bash
awk -F"," '{print $5}' Data.txt > tmp.txt  # print the 5th comma-separated field of every line to tmp.txt, i.e. all numbers will be placed into tmp.txt
sort tmp.txt | uniq -d > Uniqe_number.txt  # unique values are stored in Uniqe_number.txt
rm tmp.txt                                 # removes tmp file
while read line; do
    echo $line
    cat Data.txt | grep ",$line," > Numbers/All/$line.txt  # grep the number and create files individually
    awk -F"," '{print $5","$4","$7","$8","$9","$10","$11}' Numbers/All/$line.txt > Numbers/All/tmp_$line.txt
    mv Numbers/All/tmp_$line.txt Numbers/Final/Final_$line.txt
done < Uniqe_number.txt
ls Numbers/Final > files.txt
dos2unix files.txt
bash time_replace.sh
When you execute the above script, it will call the time_replace.sh script. My code for time_replace.sh:
#!/bin/bash
for i in `cat files.txt`
do
    while read aline
    do
        TimeDep=`echo $aline | awk -F"," '{print $6}'`
        #echo $TimeDep
        finalTimeDep=`echo $TimeDep | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
        #echo $finalTimeDep
        ##########
        TimeAri=`echo $aline | awk -F"," '{print $7}'`
        #echo $TimeAri
        finalTimeAri=`echo $TimeAri | awk '{for(i=1;i<=length($0);i+=2){printf("%s:",substr($0,i,2))}}'|awk '{sub(/:$/,"")};1'`
        #echo $finalTimeAri
        sed -i 's/',$TimeDep'/',$finalTimeDep'/g' Numbers/Final/$i
        sed -i 's/',$TimeAri'/',$finalTimeAri'/g' Numbers/Final/$i
        ############################
    done < Numbers/Final/$i
done
Any better solution? Appreciate any help. Thanks, Sri
If there's a large quantity of files, then the pipelines are probably what is going to impact performance more than anything else - although processes can be cheap, if you're doing a huge amount of processing then cutting down the number of times you pass data through a pipeline can reap dividends. So you're probably going to be better off writing the entire script in awk (or perl). For example, awk can send output to an arbitrary file, so the while loop in your first script could be replaced with an awk script that does this. You also don't need to use a temporary file. I assume the sorting is just for tracking progress easily, as you know how many numbers there are. But if you don't care about the sorting, you can simply do this:
#!/bin/sh
awk -F ',' '{ print $5","$4","$7","$8","$9","$10","$11 > ("Numbers/Final/Final_" $5 ".txt") }' datafile.txt
ls Numbers/Final > files.txt
Alternatively, if you need to sort, you can do sort -t, -k5,5 -k4,4 -k10,10 (or whichever fields your sort keys actually need to be). As for formatting the datetime, awk also does functions, so you could actually have an awk script that looks like this. This would replace both of your scripts above whilst retaining the same functionality (at least, as far as I can make out with a quick analysis). (Note: untested, so it may contain vague syntax errors.)
#!/usr/bin/awk -f
BEGIN { FS="," }
function formattime(t) {
    return substr(t,1,2)":"substr(t,3,2)":"substr(t,5,2)
}
{ print $5","$4","$7","$8","$9","formattime($10)","formattime($11) > ("Numbers/Final/Final_" $5 ".txt") }
which you can save, chmod 700, and call directly as:
dostuff.awk filename
Other awk options include changing fields in situ, so if you want to keep the entire original file but with formatted datetimes, you can do a modification of the above. Change the print block to:
{
    $10=formattime($10)
    $11=formattime($11)
    print $0
}
If this doesn't do everything you need it to, hopefully it gives some ideas that will help the code.
It's not clear what all your sorting and uniq-ing is for. I'm assuming your data file has only one entry per line, and you need to change the 10th and 11th comma-separated fields from HHMMSS to HH:MM:SS.
while IFS=, read -a line ; do
    echo -n ${line[0]},${line[1]},${line[2]},${line[3]},
    echo -n ${line[4]},${line[5]},${line[6]},${line[7]},
    echo -n ${line[8]},${line[9]},
    if [ -n "${line[10]}" ]; then
        echo -n ${line[10]:0:2}:${line[10]:2:2}:${line[10]:4:2}
    fi
    echo -n ,
    if [ -n "${line[11]}" ]; then
        echo -n ${line[11]:0:2}:${line[11]:2:2}:${line[11]:4:2}
    fi
    echo ""
done < data.txt
The operative part is the ${variable:offset:length} construct that lets you extract substrings out of a variable.
In Perl, that's close to child's play:
#!/usr/bin/env perl
use strict;
use warnings;
use English( -no_match_vars );

local($OFS) = ",";
while (<>)
{
    my(@F) = split /,/;
    $F[9]  =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[9];
    $F[10] =~ s/(\d\d)(\d\d)(\d\d)/$1:$2:$3/ if defined $F[10];
    print @F;
}
If you don't want to use English, you can write local($,) = ","; instead; it controls the output field separator, choosing to use a comma. The code reads each line in the file, splits it up on the commas, takes the last two fields (counting from zero), and, if they're not empty, inserts colons in between the pairs of digits. I'm sure a 'Code Golf' solution could be made a lot shorter, but this is semi-legible if you know any Perl.
This will be quicker by far than the script, not least because it doesn't have to sort anything, but also because all the processing is done in a single process in a single pass through the file. Running multiple processes per line of input, as in your code, is a performance disaster when the files are big.
The output on the sample data you gave is:
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,,07:16:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:22:00,07:22:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TAB,07:26:00,07:26:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:32:00,07:32:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:35:00,07:35:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,MRO,07:37:00,07:37:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,CPT,07:39:00,07:39:00,
10,SRI,AA,20091210,8503,ABCXYZ,D,N,TMP,07:44:00,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,,09:02:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:09:00,09:09:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:15:00,09:15:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TAB,09:19:00,09:19:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:25:00,09:25:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:29:00,09:29:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,MRO,09:32:00,09:32:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,CPT,09:35:00,09:35:00,
10,SRI,AA,20091210,8505,ABCXYZ,D,N,TMP,09:45:00,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,CPT,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,MRO,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TAB,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8506,ABCXYZ,U,N,TMP,,,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,,17:01:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,CPT,17:04:00,17:04:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,MRO,17:07:00,17:07:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:10:00,17:10:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:15:00,17:15:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TAB,17:19:00,17:19:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:25:00,17:25:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:29:00,17:29:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:35:00,17:35:00,
10,SRI,AA,20091210,8510,ABCXYZ,U,N,TMP,17:41:00,,
Bash script optimisation
This is the script in question: for file in `ls products` do echo -n `cat products/$file \ | grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \ | head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'` done I'm going to run it on 50000+ files, which would take about 12 hours with this script. The algorithm is as follows: Find only lines containing table cells (<td>) that do not contain any of 'img', 'href', or 'input'. Select the first of them, then extract the data between the tags. The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.
Looks like that can all be replaced by one gawk command:
gawk '
/<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
    sub(/^ *<td>/,""); sub(/<.*/,"")
    print
    nextfile
}
' products/*
This uses the gawk extension nextfile. If the wildcard expansion is too big, then:
find products -type f -print | xargs gawk '...'
Here's some quick perl to do the whole thing; it should be a lot faster.
#!/usr/bin/perl

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files($)
{
    my $dirpath = shift;
    my $dh;
    opendir($dh, $dirpath) or die "Cant readdir $dirpath. $!";
    # get a list of files
    my @files;
    do {
        @files = readdir($dh);
        foreach my $ent ( @files ){
            if ( -f "$dirpath/$ent" ){
                get_first_text_cell("$dirpath/$ent");
            }
        }
    } while ($#files > 0);
    closedir($dh);
}

# return the content of the first html table cell
# that does not contain img, href or input tags
sub get_first_text_cell($)
{
    my $filename = shift;
    my $fh;
    open($fh,"<$filename") or die "Cant open $filename. $!";
    my $found = 0;
    while ( ( my $line = <$fh> ) && ( $found == 0 ) ){
        ## capture html and text inside a table cell
        if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
            my $cell = $1;
            ## omit anything with the following tags
            if ( $cell !~ /<(img|href|input)/ ){
                $found++;
                print "$cell\n";
            }
        }
    }
    close($fh);
}
Simply invoke it by passing the directory to be searched as the first argument:
$ perl parse.pl /html/documents/
What about this (it should be much faster and clearer):
for file in products/*; do
    grep -P -o '(?<=<td>).*(?=<\/td>)' $file | grep -vP -m 1 '(img|input|href)'
done
The for will look at every file in products; note the difference from your syntax. The first grep will output just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line. Finally, the second grep will output just the first line (which is what I believe you wanted to achieve with that head -1) of those lines which don't contain img, href or input (and it will exit right then, reducing the overall time and allowing the next file to be processed faster). I would have loved to use just a single grep, but then the regex would be really awful. :-) Disclaimer: of course I haven't tested it.