Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge.
I have two files; the first contains scrambled words (about 550 entries) like this:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries:
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters of a word from the first file, then check whether a word from the second file has the same length; if so, I sort its characters too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
var=0
ro=$(echo $p | perl -F -lane 'print sort @F')
len_ro=${#ro}
while IFS="" read -r o || [ -n "$o" ]
do
ro2=$(echo $o | perl -F -lane 'print sort @F')
len_ro2=${#ro2}
let "var+=1"
if [ $len_ro == $len_ro2 ]; then
if [ $ro == $ro2 ]; then
echo $o >> new.txt
echo $var >> whichline.txt
fi
fi
done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns can produce the same sum.
[edit]
For the record:
- no anagrams contained in dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag); see the sketch after this list
- link to the CTF for anyone who wants the files: https://challenges.reply.com/tamtamy/user/login.action
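For that last step, here is a rough sketch of producing the flag, assuming the blob is simply the unscrambled words concatenated in order and the hash is SHA-256 (both of those are guesses; the challenge may expect a different separator or hash variant):
# Sketch only: new.txt, the no-newline concatenation and sha256sum are all assumptions here.
tr -d '\n' < new.txt | sha256sum | cut -d ' ' -f 1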

You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );
my $dict_qfn = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";
sub key { join "", sort split //, $_[0] }
my %dict;
{
open(my $fh, "<", $dict_qfn)
or die("Can't open \"$dict_qfn\": $!\n");
while (<$fh>) {
chomp;
push @{ $dict{key($_)} }, $_;
}
}
{
open(my $fh, "<", $scrambled_qfn)
or die("Can't open \"$scrambled_qfn\": $!\n");
while (<$fh>) {
chomp;
my $matches = $dict{key($_)};
say "$_ matches #$matches" if $matches;
}
}
I wouldn't be surprised if this takes only one millionth of the time of your solution for the sizes you provided (and it scales so much better than yours if you were to increase the sizes).
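If you would rather stay in bash, the same lookup-table idea can be sketched with a bash 4 associative array keyed by the sorted letters. This still forks a small pipeline per word (roughly 9,550 times), so it is slower than the Perl above, but it stays linear in the number of words:
#!/bin/bash
# Sketch of the same idea in bash 4+ (assumes fold, sort and tr are available).
declare -A dict
key() { fold -w1 <<< "$1" | sort | tr -d '\n'; }   # "hello" -> "ehllo"
while IFS= read -r word; do
    dict[$(key "$word")]+="$word "                 # group anagrams under one key
done < dictionary.txt
while IFS= read -r word; do
    k=$(key "$word")
    [ -n "${dict[$k]}" ] && echo "$word matches ${dict[$k]}"
done < scrambled-words.txt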

I would do something like this with gawk
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
print dict[csort()]
}
function csort(    chars, sorted, n, i) {
n = split($0, chars, "")
asort(chars)
# iterate by numeric index; "for (i in chars)" does not guarantee order
for (i = 1; i <= n; i++)
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt

Here's a perl-free solution I came up with using sort and join:
sort_letters() {
# Splits each letter onto a line, sorts the letters, then joins them
# e.g. "hello" becomes "ehllo"
echo "${1}" | fold-b1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
# Convert each line to [sorted] [original]
# then sort and save the results with a .sorted extension
while read -r original; do
sorted=$(sort_letters "${original}")
echo "${sorted} ${original}"
done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
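One caveat worth adding: join expects both inputs sorted under the same collation it uses itself, and a mismatched locale can produce "input is not in sorted order" complaints. Pinning the locale for the whole run is a cheap safeguard (an addition of mine, not something the answer above strictly needs):
# Put this near the top of the script so sort(1) and join(1) agree on byte-wise collation.
export LC_ALL=C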

I tried something very similar, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>scrambled-words_sorted.txt
exec 3>&-
exec 3<dictionary.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>dictionary_sorted.txt
exec 3>&-
printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
counter="$((++counter))"
grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >> whichline.txt
printf "\n" >> whichline.txt
done
exec 3>&-
As you can see I don't create a new.txt file; instead I only create whichline.txt, with a blank line where the word doesn't match. You can easily paste them together to create new.txt.
The logic behind the script is nearly the same as yours, except that I call perl fewer times and I save two support files.
I think (but I am not sure) that creating them and cycling over only one file is better than ~5 million perl calls; this way perl is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and by anchoring the entire line the length check is implicit in the regex.
Please note that what @benjamin-w said is still valid and, in that case, grep would misbehave; I did not handle that case!
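Since each sorted word is handed to grep as a regular expression here, a slightly safer variant is to match it as a fixed string against whole lines, which also makes the ^…$ anchors unnecessary (a small tweak on the loop body above, not something plain lowercase words strictly need):
grep -n -x -F "${line}" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >> whichline.txt
printf "\n" >> whichline.txt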
I hope this could help [:

Related

How to merge two or more lines if they start with the same word?

I have a file like this:
AAKRKA HIST1H1B AAGAGAAKRKATGPP
AAKRKA HIST1H1E RKSAGAAKRKASGPP
AAKRLN ACAT1 LMTADAAKRLNVTPL
AAKRLN SUCLG2 NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA
I would like to kind of merge 2 lines if they are exactly the same in the 1st column. The desired output is:
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA
Sometimes there could be more than two lines starting with the same word. How could I reach the desired output with bash/awk?
Thanks for help!
Since this resembles SQL-like group operations, you can use sqlite, which you can call from bash.
With the given input:
$ cat aqua.txt
AAKRKA HIST1H1B AAGAGAAKRKATGPP
AAKRKA HIST1H1E RKSAGAAKRKASGPP
AAKRLN ACAT1 LMTADAAKRLNVTPL
AAKRLN SUCLG2 NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA
$
Script:
$ cat ./sqlite_join.sh
#!/bin/sh
sqlite3 << EOF
create table data(a,b,c);
.separator ' '
.import $1 data
select a, group_concat(b) , group_concat(c) from data group by a;
EOF
$
Results
$ ./sqlite_join.sh aqua.txt
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA
$
This is a two-liner in awk; the first line stores the second and third fields in associative arrays indexed by the first field, accumulating fields with identical indices with leading commas before each field, and the second line iterates over the two arrays, deleting the leading comma on output:
{ second[$1] = second[$1] "," $2; third[$1] = third[$1] "," $3 }
END { for (i in second) print i, substr(second[i],2), substr(third[i],2) }
I made no assumptions about the order of the input or the output. If you want sorted output, pipe the output through sort. You can run the program at https://ideone.com/sbgLNk.
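To run it locally, save those two lines as an awk program (merge.awk below is just a placeholder name) and point it at the input, for example the aqua.txt sample from the other answer; pipe through sort if you want ordered output:
awk -f merge.awk aqua.txt | sort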
try this:
DATAFILE=data.txt
cut -d " " -f1 < $DATAFILE | sort | uniq |
while read key; do
column1="$key"
column2=""
column3=""
grep "$key" $DATAFILE |
while read line; do
set -- $line
[ -n "$column2" ] && [ -n "$2" ] && column2="$column2,"
[ -n "$column3" ] && [ -n "$3" ] && column3="$column3,"
column2="$column2$2"
column3="$column3$3"
echo "$column1 $column2 $column3"
done | tail -n1
done

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and creates the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
? do {
shift;
my $backup_extension = $1;
my $backup_name = $backup_extension =~ /\*/
? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
: sub { shift . $backup_extension };
my $oldargv = '-';
sub {
if ( $ARGV ne $oldargv ) {
rename( $ARGV, $backup_name->($ARGV) );
open( ARGVOUT, '>', $ARGV );
select(ARGVOUT);
$oldargv = $ARGV;
}
};
}
: sub { };
die "$0: File with replacements required." unless #ARGV;
my ( $re, %replace );
do {
my $filename = shift;
open my $fh, '<', $filename;
%replace = map { chomp; split ',', $_, 2 } <$fh>;
close $fh;
$re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
};
while (<>) {
$in_place->();
s/$re/$replace{$1}/g;
}
continue {print}
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with backup with a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you wanna do.
BTW, sorry, but I do not understand what is wrong with your script, which should work unless your CSV file has more than 2 columns.
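One hedge on the generated sed.script above: if any of the CSV strings contain characters that are special to sed (the / delimiter in particular), the generated s/// commands will break. Escaping the delimiter while generating the script avoids that; a rough sketch:
# Escape "/" in both columns before emitting the s/// commands.
awk -F, '{ gsub(/\//, "\\/", $1); gsub(/\//, "\\/", $2); print "s/" $1 "/" $2 "/g" }' replace.csv > sed.script
sed -i -f sed.script long_file_name1.txt long_file_name2.txt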

Bash script processing too slow

I have the following script where I'm parsing 2 CSV files to find matches. The files have 10,000 lines each, but the processing is taking a long time!!! Is this normal?
My script:
#!/bin/bash
IFS=$'\n'
CSV_FILE1=$1;
CSV_FILE2=$2;
sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2
echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2" >> 'mapping.csv'
while read lineCSV1 #Parse 1st CSV file
do
PATH1=`echo $lineCSV1 | awk '{print $1}'`
NAME1=`echo $lineCSV1 | awk '{print $3}'`
SIZE1=`echo $lineCSV1 | awk '{print $7}'`
CKSUM1=`echo $lineCSV1 | awk '{print $9}'`
while read lineCSV2 #Parse 2nd CSV file
do
PATH2=`echo $lineCSV2 | awk '{print $1}'`
NAME2=`echo $lineCSV2 | awk '{print $3}'`
SIZE2=`echo $lineCSV2 | awk '{print $7}'`
CKSUM2=`echo $lineCSV2 | awk '{print $9}'`
# Test if NAME1 MATCHES NAME2
if [[ $NAME1 == $NAME2 ]]; then
#Test checksum OF THE MATCHING NAME
if [[ $CKSUM1 != $CKSUM2 ]]; then
#MAPPING OF THE MATCHING LINES
echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
fi
break #When it's a match, break the while loop and go to the next row of the 1st CSV file
fi
done < Sorted_CSV2 #Done CSV2
done < Sorted_CSV1 #Done CSV1
This is quadratic. Also, see Tom Fenech's comment: you are calling awk several times inside a loop inside another loop. Instead of using awk to extract the fields of every line, try setting the IFS shell variable to ";" and reading the fields directly in the read commands:
IFS=";"
while read FIELD11 FIELD12 FIELD13; do
while read FIELD21 FIELD22 FIELD23; do
...
done <Sorted_CSV2
done <Sorted_CSV1
Still, this would remain O(N^2) and very inefficient. It seems you are matching the two files on a common field. This task is easier and faster to accomplish with the join command-line utility, which reduces the order from O(N^2) to O(N).
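A hedged sketch of that join approach, assuming (as the awk extraction in the question implies) whitespace-separated records with the path in field 1, the name in field 3, the size in field 7 and the checksum in field 9:
#!/bin/bash
export LC_ALL=C   # keep sort and join collation consistent
# Reduce each file to "name;cksum;path;size", sorted on the name.
prep() { awk '{ print $3 ";" $9 ";" $1 ";" $7 }' "$1" | sort -t';' -k1,1; }
# Join on the name and keep only pairs whose checksums differ,
# reproducing the mapping.csv layout from the question.
join -t';' -j 1 <(prep "$1") <(prep "$2") |
    awk -F';' '$2 != $5 { print $3 " ; " $1 " ; " $4 " ; " $2 " ; " $6 " ; " $1 " ; " $7 " ; " $5 }' >> mapping.csv
Invoke it as ./join_match.sh file1.csv file2.csv (the script name is arbitrary).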
Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).
An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.
In your loop of a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times. Your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 x 10,000 times you go through that inner loop. That's potentially looping 100 million times through that inner loop. Think you can see why your program might be a little slow?
In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on a MS-DOS system with 256K. It just wouldn't work). So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.
Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.
Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk might be faster, but many versions don't have associative arrays. This is really a job for a higher level scripting language like Python or Perl.
Since I can't do a Bash answer, here's a Perl answer. Maybe this will help. Or, maybe this will inspire someone who has Bash 4.x can give an answer in Bash.
I basically open the first file and create an associative array keyed by the checksum. If this is a sha1 checksum, it should be unique for all files (unless they're an exact match). If you don't have a sha1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.
Once I have the associative array figured out, I then open file #2 and simply see if the checksum already exists in the file. If it does, I know I have a matching line, and print out the two matches.
I have to loop 10,000 times in the first file, and 10,000 times in the second. That's only 20,000 iterations instead of 100 million, about 5,000 times less looping, which means the program will run thousands of times faster. So, if it takes 2 full days for your program to run with a double loop, an associative array solution should finish in well under a minute.
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE1 => "file1.txt",
FILE2 => "file2.txt",
MATCHING => "csv_matches.txt",
};
#
# Open the first file and create the associative array
#
my %file_data;
open my $fh1, "<", FILE1;
while ( my $line = <$fh1> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# The main key is "check_sum" which **should** be unique, especially if it's a sha1
#
$file_data{$check_sum}->{PATH} = $path;
$file_data{$check_sum}->{NAME} = $name;
$file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;
#
# Now, we have the associative array keyed by the data we want to match, read file 2
#
open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# If there is a matching checksum in file1, we know we have a matching entry
#
if ( exists $file_data{$check_sum} ) {
printf {$csv_fh} "%s;%s:%s:%s:%s:%s\n",
$file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
$path, $name, $size;
}
}
close $fh2;
close $csv_fh;
BUGS
(A good manpage always lists issues!)
This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
This assumes a sha1/sha256 or equivalent checksum. With such a checksum, it is extremely unlikely that two files will have the same checksum unless they match. A 16-bit checksum from the historic sum command may have collisions.
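Since the answer above explicitly invites a Bash 4 version, here is a minimal sketch of the same checksum-keyed lookup with a bash associative array, under the same field-layout assumptions as the Perl (whitespace-separated fields; path, name, size and checksum in fields 1, 3, 7 and 9):
#!/bin/bash
# Requires bash 4+ for associative arrays.
declare -A by_cksum
# Pass 1: remember path, name and size for each checksum in file1.
while read -r path _ name _ _ _ size _ cksum _; do
    by_cksum[$cksum]="$path;$name:$size"
done < file1.txt
# Pass 2: any checksum from file2 that already exists in the array is a match.
# The output format mirrors the Perl's printf above.
while read -r path _ name _ _ _ size _ cksum _; do
    if [[ -n ${by_cksum[$cksum]} ]]; then
        printf '%s:%s:%s:%s\n' "${by_cksum[$cksum]}" "$path" "$name" "$size"
    fi
done < file2.txt > csv_matches.txt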
Although a proper database engine would make a much better tool for this, it is still very well possible to do it with awk.
The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.
In detail:
Insert two columns in both CSV files
Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.
awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2
Concatenate the files, then sort the lines
sort tmpfile1 tmpfile2 > tmpfile3
Scan the result, report the mismatches
awk -F';' -f scan.awk tmpfile3
Where scan.awk contains:
BEGIN {
origin = 3;
}
$1 == name && $2 > origin && $6 != checksum {
print record;
}
{
name = $1;
origin = $2;
checksum = $6;
sub(/^[^;]*;.;/, "");
record = $0;
}
Putting it all together
Crammed together into a Bash oneliner, without explicit temporary files:
(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'
Notes:
If the same name appears more than once in csvfile1, then all but the last one are ignored.
If the same name appears more than once in csvfile2, then all but the first one are ignored.

Use awk to parse source code

I'm looking to create documentation from source code that I have. I've been looking around and something like awk seems like it will work, but I've had no luck so far. The information is split in two files, file1.c and file2.c.
Note: I've set up an automatic build environment for the program. This detects changes in the source and builds it. I would like to generate a text file containing a list of any variables which have been modified since the last successful build. The script I'm looking for would be a post-build step, and would run after compilation
In file1.c I have a list of function calls (all the same function) that have a string name to identify them such as:
newFunction("THIS_IS_THE_STRING_I_WANT", otherVariables, 0, &iAlsoNeedThis);
newFunction("I_WANT_THIS_STRING_TOO", otherVariable, 0, &iAnotherOneINeed);
etc...
The fourth parameter in the function call contains the value of the string name in file2. For example:
iAlsoNeedThis = 25;
iAnotherOneINeed = 42;
etc...
I'm looking to output the list to a txt file in the following format:
THIS_IS_THE_STRING_I_WANT = 25
I_WANT_THIS_STRING_TOO = 42
Is there any way to do this?
Thanks
Here is a start:
NR==FNR { # Only true when we are reading the first file
split($1,s,"\"") # Get the string in quotes from the first field
gsub(/[^a-zA-Z]/,"",$4) # Remove the non-alpha chars from the fourth field
m[$4]=s[2] # Create array
next
}
$1 in m { # Match field four from file1 with field one of file2
sub(/;/,"") # Get rid of the ;
print m[$1],$2,$3 # Print output
}
Saving this script.awk and running it with your example produces:
$ awk -f script.awk file1 file2
THIS_IS_THE_STRING_I_WANT = 25
I_WANT_THIS_STRING_TOO = 42
Edit:
The modifications you require affect the first line of the script:
NR==FNR && $3=="0," && /start here/,/end here/ {
You can do it in the shell like so.
#!/bin/sh
eval $(sed 's/[^a-zA-Z0-9=]//g' file2.c)
while read -r line; do
case $line in
(newFunction*)
set -- $line
string=${1#*\"}
string=${string%%\"*}
while test $# -gt 1; do shift; done
x=${1#&}
x=${x%);}
eval x=\$$x
printf '%s = %s\n' $string $x
esac
done < file1.c
Assumptions: newFunction is at the start of the line. Nothing follows the );. Whitespace exactly as in your samples. Output
THIS_IS_THE_STRING_I_WANT = 25
I_WANT_THIS_STRING_TOO = 42
You can source the file file2.c so the variables will be defined in bash. Then you will just have to print $iAlsoNeedThis to get the value from iAlsoNeedThis = 25;
It can be done with . file2.c.
Then, what you can do is:
while read line;
do
name=$(echo $line | cut -d"\"" -f2);
value=$(echo $line | cut -d"&" -f2 | cut -d")" -f1);
echo $name = ${!value};
done < file1.c
to get the THIS_IS_THE_STRING_I_WANT, I_WANT_THIS_STRING_TOO text.

Bash script optimisation

This is the script in question:
for file in `ls products`
do
echo -n `cat products/$file \
| grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
| head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done
I'm going to run it on 50000+ files, which would take about 12 hours with this script.
The algorithm is as follows:
Find only lines containing table cells (<td>) that do not contain any of 'img', 'href', or 'input'.
Select the first of them, then extract the data between the tags.
The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.
Looks like that can all be replaced by one gawk command:
gawk '
/<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
sub(/^ *<td>/,""); sub(/<.*/,"")
print
nextfile
}
' products/*
This uses the gawk extension nextfile.
If the wildcard expansion is too big, then
find products -type f -print | xargs gawk '...'
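If any of the product file names contain spaces or other awkward characters, the null-delimited pairing is the safer form of that fallback (extract.awk here is just a placeholder for the gawk program above saved to a file):
find products -type f -print0 | xargs -0 gawk -f extract.awk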
Here's some quick perl to do the whole thing that should be a lot faster.
#!/usr/bin/perl
process_files($ARGV[0]);
# process each file in the supplied directory
sub process_files($)
{
my $dirpath = shift;
my $dh;
opendir($dh, $dirpath) or die "Cant readdir $dirpath. $!";
# get a list of files
my @files;
do {
@files = readdir($dh);
foreach my $ent ( @files ){
if ( -f "$dirpath/$ent" ){
get_first_text_cell("$dirpath/$ent");
}
}
} while ($#files > 0);
closedir($dh);
}
# return the content of the first html table cell
# that does not contain img,href or input tags
sub get_first_text_cell($)
{
my $filename = shift;
my $fh;
open($fh,"<$filename") or die "Cant open $filename. $!";
my $found = 0;
while ( ( my $line = <$fh> ) && ( $found == 0 ) ){
## capture html and text inside a table cell
if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
my $cell = $1;
## omit anything with the following tags
if ( $cell !~ /<(img|href|input)/ ){
$found++;
print "$cell\n";
}
}
}
close($fh);
}
Simply invoke it by passing the directory to be searched as the first argument:
$ perl parse.pl /html/documents/
What about this (should be much faster and clearer):
for file in products/*; do
grep -P -o '(?<=<td>).*(?=<\/td>)' "$file" | grep -vP -m 1 '(img|input|href)'
done
The for loop will visit every file in products. Note the difference from your syntax.
The first grep outputs just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line.
Finally, the second grep outputs just the first of those lines that doesn't contain img, href or input (which is what I believe you wanted to achieve with that head -1), and exits right then, reducing the overall time and letting the next file be processed sooner.
I would have loved to use just a single grep, but then the regex will be really awful. :-)
Disclaimer: of course I haven't tested it
