Efficient substring parsing of large fixed length text file in Bash - bash

I have a large text file (millions of records) of fixed-length data and need to extract unique substrings and create a number of arrays with those values. I have a working version; however, I'm wondering if performance can be improved since I need to run the script iteratively.
$_file5 looks like:
138000010065011417865201710152017102122
138000010067710416865201710152017102133
138000010131490417865201710152017102124
138000010142349413865201710152017102154
138400010142356417865201710152017102165
130000101694334417865201710152017102176
Here is what I have so far:
while IFS='' read -r line || [[ -n "$line" ]]; do
    _in=0
    _set=${line:15:6}
    _startDate=${line:21:8}
    _id="$_account-$_set-$_startDate"
    for element in "${_subsets[@]}"; do
        if [[ $element == "$_set" ]]; then
            _in=1
            break
        fi
    done
    # If we find a new one and it's not 504721
    if [ $_in -eq 0 ] && [ $_set != "504721" ]; then
        _subsets=("${_subsets[@]}" "$_set")
        _ids=("${_ids[@]}" "$_id")
    fi
done < $_file5
And this yields:
_subsets=("417865","416865","413865")
_ids=("9899-417865-20171015", "9899-416865-20171015", "9899-413865-20171015")
I'm not sure if sed or awk would be better here and can't find a way to implement either. Thanks.
EDIT: Benchmark Tests
So I benchmarked my original solution against the two provided. Ran this over 10 times and all results were similar to below.
# Bash read
real 0m8.423s
user 0m8.115s
sys 0m0.307s
# Using sort -u (@randomir)
real 0m0.719s
user 0m0.693s
sys 0m0.041s
# Using awk (@shellter)
real 0m0.159s
user 0m0.152s
sys 0m0.007s
Looks like awk wins this one. Regardless, the performance improvement from my original code is substantial. Thank you both for your contributions.

I don't think you can beat the performance of sort -u with bash loops (except in corner cases, as this one turned out to be, see footnote✻).
To reduce the list of strings you have in file to a list of unique strings (set), based on a substring:
sort -k1.16,1.21 -u file >set
Then, to filter out the unwanted id, 504721, starting at position 16, you can use grep -v:
grep -vE '.{15}504721' set
Finally, reformat the remaining lines and store them in arrays with cut/sed/awk/bash.
So, to populate the _subsets array, for example:
$ _subsets=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | cut -c16-21))
$ printf "%s\n" "${_subsets[@]}"
413865
416865
417865
or, to populate the _ids array:
$ _ids=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/'))
$ printf "%s\n" "${_ids[@]}"
9899-413865-20171015
9899-416865-20171015
9899-417865-20171015
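If word splitting or globbing in the unquoted $( ... ) is a concern, mapfile (Bash 4+) can read the same pipeline output more safely; a minimal sketch for _ids:
mapfile -t _ids < <(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/')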
✻ If the input file is huge, but it contains only a small number (~40) of unique elements (for the relevant field), then it makes perfect sense for the awk solution to be faster. sort needs to sort a huge file (O(N*logN)), then filter the dupes (O(N)), all for a large N. On the other hand, awk needs to pass through the large input only once, checking for dupes along the way via set membership testing. Since the set of uniques is small, membership testing takes only O(1) (on average, but for such a small set, practically constant even in worst case), making the overall time O(N).
If there were fewer dupes (i.e. more unique elements), awk would have O(N*log(N)) amortized and O(N^2) worst case, not to mention the higher constant per-instruction overhead.
In short: you have to know what your data looks like before choosing the right tool for the job.

Here's an awk solution embedded in a bash script:
#!/bin/bash
fn_parser() {
  awk '
    BEGIN { _account = "9899" }
    {
      _set = substr($0, 16, 6)
      _startDate = substr($0, 22, 8)
      #dbg print "#dbg:_set=" _set "\t_startDate=" _startDate
      if (_set != "504721") {
        _id = _account "-" _set "-" _startDate
        ids[_id] = _id
        sets[_set] = _set
      }
    }
    END {
      printf "_subsets=("
      for (s in sets) { printf("%s\"%s\"", (commaCtr++ ? "," : ""), sets[s]) }
      print ");"
      printf "_ids=("
      for (i in ids) { printf("%s\"%s\"", (commaCtr2++ ? "," : ""), ids[i]) }
      print ")"
    }
  ' "${@}"
}
#dbg set -vx
eval $( echo $(fn_parser *.txt) )
echo "_subsets="$_subsets
echo "_ids="$_ids
output
_subsets=413865,417865,416865
_ids=9899-416865-20171015,9899-413865-20171015,9899-417865-20171015
Which I believe would be the same output your script would get if you did an echo on your variable names.
I didn't see that _account was being extracted from your file, and assume it is passed in from a previous step in your batch. But until I know whether that is a critical piece, I'll have to come back to figuring out how to pass a variable in to a function that calls awk.
People won't like using eval, but hopefully no one will embed /bin/rm -rf / into your data set ;-)
I use the eval so that the data extracted is available via the shell variables. You can uncomment the #dbg before the eval line to see how the code is executing in the "layers" of function, eval, var=value assignments.
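If you'd rather not use eval at all, one alternative (just a sketch, not part of the original answer) is to have awk print one value per line and read each list with mapfile:
mapfile -t _subsets < <(awk '{ s = substr($0, 16, 6); if (s != "504721" && !(s in seen)) { seen[s] = 1; print s } }' *.txt)
The same pattern works for _ids by printing the assembled id string instead of s.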
Hopefully, you see how the awk script is a transcription of your code into awk.
It does rely on the fact that arrays can contain only 1 copy of a key/value pair.
I'd really appreciate if you post timings for all solutions submitted. (You could reduce the file size by 1/2 and still have a good test). Be sure to run each version several times, and discard the first run.
IHTH

Related

How can I compare floats in bash

Working on a script and I am currently stuck. (Still pretty new at this)
First off I have my data file, the file I am searching inside.
The first field is the name, the second is money paid, and the third is money owed.
customerData.txt
name1,500.00,1000
name2,2000,100
name3,100,100.00
Here is my bash file. Basically, if the owe amount is greater than the paid amount, then print the name. It works fine for anything that's not a float. I also understand that bash doesn't handle floats and the only way to handle them is with the bc utility, but I have had no luck.
#!/bin/bash
while IFS="," read name paid owe; do
    #due=$(echo "$owe - $paid" | bc -l)
    #echo $due
    if [ $owe -gt $paid ]; then
        echo $name
    fi
done < customerData.txt
To print all lines for which the third column is larger than the second:
$ awk -F, '$3>$2' customerData.txt
name1,500.00,1000
How it works
-F, tells awk that the columns are comma-separated.
$3>$2 tells awk to print any line for which the third column is larger than the second.
In more detail, $3>$2 is a condition: it evaluates to true or false. If it evaluates to true, then the action is performed. Since we didn't specify any action, awk performs the default action which is to print the line.
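If you do want to keep the bash loop and compare the floats with bc, as you attempted, here is a minimal sketch (bc prints 1 when the comparison is true):
while IFS="," read -r name paid owe; do
    if [ "$(echo "$owe > $paid" | bc -l)" -eq 1 ]; then
        echo "$name"
    fi
done < customerData.txt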

awk store a pattern result to a shell array variable

I am trying to store the result of a pattern matched by awk to a shell array variable. Here's a simplified example of the same:
#!/bin/bash
declare -a array1=()
declare -a array2=()
READ_FILE="directory1/read_file.csv"
WRITE_FILE="directory2/results.csv"
#variable for counting array index
count1=0
count2=0
#
#
# need help with line below
# $2 below is the second set of characters which is a floating point number
awk -F 'string1_to_search' '{$array1[count1++] = $2}' $READ_FILE
awk -F 'string2_to_search' '{$array2[count2++] = $2}' $READ_FILE
#count++ indicates post increment of count variable
#do something with the array
.
.
#end
any suggestions would be helpful.
Something roughly like this, then?
awk '/string1_to_search/ {
count["id1"]++; sum["id1"] += $2 }
/string2_too/ {
count["id2"]++; sum["id2"] += $2 }
# ...
END { for (k in count) printf("%s: sum %f/count %i = avg %f\n", k, sum[k], count[k], sum[k]/count[k]) }' inputfile
I seem to recall there was a clever way to calculate a rolling variance without keeping the entire input set in memory; or else just collect the values space-separated value["id"] = value["id"] " " $2 and split into a list and loop over it near the end. Alternatively, simplify this to only examine one search string at a time and run it multiple times (let's hope then the input isn't very big). Or switch to Perl, which will easily let you collect lists of lists and other nested structures.
Obviously break out common functionality into separate functions so you don't have repeated code ... I suppose it's actually clearer like this, but if you find bugs, or need other changes, you only want to have to change one place in the code.
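For what it's worth, a rough sketch of that collect-and-split idea for a single search string (population variance; untested against your data):
awk '/string1_to_search/ { n++; sum += $2; vals = vals " " $2 }
END {
    if (n) {
        mean = sum / n
        m = split(vals, v, " ")
        for (i = 1; i <= m; i++) ss += (v[i] - mean) ^ 2
        printf("count %d, mean %f, variance %f\n", n, mean, ss / m)
    }
}' inputfile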
Another method is to make awk print the matched number and read it into a bash array, like this:
mapfile -t array1 < <( awk -F 'string1_to_search' '{print $2}' "$READ_FILE" )
Later, to compute the mean, variance and SD, we can use the bc tool from within bash.
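A small sketch of that last step, assuming array1 was already filled by the mapfile call above and is non-empty:
sum=0
for v in "${array1[@]}"; do
    sum=$(echo "$sum + $v" | bc -l)
done
echo "mean = $(echo "$sum / ${#array1[@]}" | bc -l)"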

Bash: nested loop one way comparison

I have one question about nested loops in bash.
I have an input file with one file name per line (full path).
I read this file and then I make a nested loop:
for i in $filelines ; do
    echo $i
    for j in $filelines ; do
        ./program $i $j
    done
done
The program I run within the loop is pretty slow.
Basically it compares file A with file B.
I want to skip the A vs A comparison (i.e. comparing one file with itself) AND
I want to avoid permutations (i.e. for files A and B, only perform A against B and not B against A).
What is the simplest way to perform this?
Version 2: this one takes care of permutations
#!/bin/bash
tmpunsorted="/tmp/compare_unsorted"
tmpsorted="/tmp/compare_sorted"

>$tmpunsorted

while read linei
do
    while read linej
    do
        if [ $linei != $linej ]
        then
            echo $linei $linej | tr " " "\n" | sort | tr "\n" " " >>$tmpunsorted
            echo >>$tmpunsorted
        fi
    done <filelines
done <filelines

sort $tmpunsorted | uniq > $tmpsorted

while read linecompare
do
    echo "./program $linecompare"
done <$tmpsorted

# Cleanup
rm -f $tmpunsorted
rm -f $tmpsorted
What is done here:
I use while loops to read each line twice, as i and j
if the values of the lines are the same, forget them, no use considering them
if they are different, output them into a file ($tmpunsorted). They are sorted in alphabetical order before going to the $tmpunsorted file. This way the arguments are always in the same order, so a b and b a will be the same in the unsorted file.
I then apply sort | uniq on $tmpunsorted, so the result is a list of individual argument pairs.
finally loop on the $tmpsorted file, and call the program on each individual pair.
Since I do not have your program, I did an echo, which you should remove to use the script.
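For reference (not part of the script above): if the file names contain no embedded whitespace, you can also skip the self-comparisons and permutations directly with an index-based loop over an array, a minimal sketch:
mapfile -t files < filelines
for ((i = 0; i < ${#files[@]}; i++)); do
    for ((j = i + 1; j < ${#files[@]}; j++)); do
        ./program "${files[$i]}" "${files[$j]}"
    done
done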

Bash script processing too slow

I have the following script where I'm parsing 2 CSV files to find a MATCH. The files have 10000 lines each. But the processing is taking a long time!!! Is this normal?
My script:
#!/bin/bash
IFS=$'\n'
CSV_FILE1=$1;
CSV_FILE2=$2;
sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2
echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2" >> 'mapping.csv'
while read lineCSV1 #Parse 1st CSV file
do
    PATH1=`echo $lineCSV1 | awk '{print $1}'`
    NAME1=`echo $lineCSV1 | awk '{print $3}'`
    SIZE1=`echo $lineCSV1 | awk '{print $7}'`
    CKSUM1=`echo $lineCSV1 | awk '{print $9}'`
    while read lineCSV2 #Parse 2nd CSV file
    do
        PATH2=`echo $lineCSV2 | awk '{print $1}'`
        NAME2=`echo $lineCSV2 | awk '{print $3}'`
        SIZE2=`echo $lineCSV2 | awk '{print $7}'`
        CKSUM2=`echo $lineCSV2 | awk '{print $9}'`
        # Test if NAME1 MATCHES NAME2
        if [[ $NAME1 == $NAME2 ]]; then
            # Test checksum OF THE MATCHING NAME
            if [[ $CKSUM1 != $CKSUM2 ]]; then
                # MAPPING OF THE MATCHING LINES
                echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
            fi
            break # When it's a match, break the while loop and go to the next row of the 1st CSV file
        fi
    done < Sorted_CSV2 # Done CSV2
done < Sorted_CSV1 # Done CSV1
This is quadratic order. Also, see Tom Fenech's comment: you are calling awk several times inside a loop inside another loop. Instead of using awk for the fields in every line, try setting the IFS shell variable to ";" and reading the fields directly with read commands:
IFS=";"
while read FIELD11 FIELD12 FIELD13; do
while read FIELD21 FIELD22 FIELD23; do
...
done <Sorted_CSV2
done <Sorted_CSV1
Though, this would still be O(N^2) and very inefficient. It seems you are matching two files on a common field. This task is easier and faster to accomplish by using the join command-line utility, and would reduce the order from O(N^2) to O(N).
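A rough sketch of that join approach, assuming each file is semicolon-delimited as PATH;NAME;SIZE;CKSUM (an assumption about your exact layout): sort both files on the name field, join on it, and keep only the rows whose checksums differ:
sort -t';' -k2,2 "$CSV_FILE1" > sorted1
sort -t';' -k2,2 "$CSV_FILE2" > sorted2
join -t';' -1 2 -2 2 sorted1 sorted2 | awk -F';' '$4 != $7'
After the join, field 4 is the first file's checksum and field 7 is the second file's, hence the final awk filter.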
Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).
An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.
In your loop of a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times. Your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 x 10,000 times you go through that inner loop. That's potentially looping 100 million times through that inner loop. Think you can see why your program might be a little slow?
In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on a MS-DOS system with 256K. It just wouldn't work). So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.
Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.
Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk might be faster, but many versions don't have associative arrays. This is really a job for a higher level scripting language like Python or Perl.
Since I can't do a Bash answer, here's a Perl answer. Maybe this will help. Or maybe this will inspire someone who has Bash 4.x to give an answer in Bash.
I basically open the first file and create an associative array keyed by the checksum. If this is a sha1 checksum, it should be unique for all files (unless they're an exact match). If you don't have a sha1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.
Once I have the associative array figured out, I then open file #2 and simply see if the checksum already exists in the file. If it does, I know I have a matching line, and print out the two matches.
I have to loop 10,000 times in the first file, and 10,000 times in the second. That's only 20,000 loops instead of 100 million, roughly 5,000 times less looping, which means the program will run thousands of times faster. So, if it takes 2 full days for your program to run with a double loop, an associative array solution will finish in well under a minute.
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    FILE1    => "file1.txt",
    FILE2    => "file2.txt",
    MATCHING => "csv_matches.txt",
};

#
# Open the first file and create the associative array
#
my %file_data;
open my $fh1, "<", FILE1;
while ( my $line = <$fh1> ) {
    chomp $line;
    my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
    #
    # The main key is "check_sum" which **should** be unique, especially if it's a sha1
    #
    $file_data{$check_sum}->{PATH} = $path;
    $file_data{$check_sum}->{NAME} = $name;
    $file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;

#
# Now, we have the associative array keyed by the data we want to match, read file 2
#
open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
    chomp $line;
    my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
    #
    # If there is a matching checksum in file1, we know we have a matching entry
    #
    if ( exists $file_data{$check_sum} ) {
        printf {$csv_fh} "%s;%s:%s:%s:%s:%s\n",
            $file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
            $path, $name, $size;
    }
}
close $fh2;
close $csv_fh;
BUGS
(A good manpage always lists issues!)
This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
This assumes a sha256 or equivalent checksum. In such a checksum, it is extremely unlikely that two files will have the same checksum unless they match. A 16-bit checksum from the historic sum command may have collisions.
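For anyone on Bash 4.x, here is a rough sketch of the same associative-array idea in Bash (untested; it assumes the same whitespace-separated field positions the question's awk calls use: path in field 1, name in 3, size in 7, checksum in 9):
#!/bin/bash
declare -A path1 name1 size1
# Pass 1: index file1 by checksum
while read -r p _ n _ _ _ s _ c; do
    path1[$c]=$p; name1[$c]=$n; size1[$c]=$s
done < file1.txt
# Pass 2: report any line in file2 whose checksum was seen in file1
while read -r p _ n _ _ _ s _ c; do
    if [[ -n ${path1[$c]+x} ]]; then
        echo "${path1[$c]} ; ${name1[$c]} ; ${size1[$c]} ; $p ; $n ; $s ; $c" >> csv_matches.txt
    fi
done < file2.txt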
Although a proper database engine would make a much better tool for this, it is still entirely possible to do it with awk.
The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.
In detail:
Insert two columns in both CSV files
Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.
awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2
Concatenate the files, then sort the lines
sort tmpfile1 tmpfile2 > tmpfile3
Scan the result, report the mismatches
awk -F';' -f scan.awk tmpfile3
Where scan.awk contains:
BEGIN {
    origin = 3;
}
$1 == name && $2 > origin && $6 != checksum {
    print record;
}
{
    name = $1;
    origin = $2;
    checksum = $6;
    sub(/^[^;]*;.;/, "");
    record = $0;
}
Putting it all together
Crammed together into a Bash oneliner, without explicit temporary files:
(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'
Notes:
If the same name appears more than once in csvfile1, then all but the last one are ignored.
If the same name appears more than once in csvfile2, then all but the first one are ignored.

combining multiple grep searches and making my script more efficient

I have a file called Type1.txt, that looks like this:
$ cat Type1.txt
ID.580.G3C0
TTTTTTTTTTT
ID.580.G3C8
ATTATATC-AAA
ID.580.GXC16
ATTATTTC-ACG-TTTTTCCTA
ID.694.G9C3
ATTATATC-ACG-AAATCCTA
ID.694.G9C3
etc...
I want to write a bash script to count the instances of each ID and export it into another file that provides a summary, something like this:
ID.580 = 3
ID.694 = 1
etc...
So far the script is messy and unusable.
For the above I have the following:
#!/bin/bash
for Count in `grep -c "ID.580" Type1.txt`; do
    echo $Count=ID.580
done > Result.txt # Allows counting only for that single ID.
I have over a thousand ID.XXX, making this code unusable since it's not practical to add an individual ID.XXX for each search. Thank you for the help!
Shell
The code below uses the standard UNIX utilities, and does not assume that the second part of the ID is exactly 3 characters, but will find ID.1.123123123 and ID.1234.123123 and properly take only the part up to the second dot. As it is:
grep '^ID\.[0-9]' Type1.txt | cut -d . -f 1-2 | sort \
| uniq -c | awk '{ print $2" = "$1 }'
grep filters only lines beginning with ID. followed by 1 digit (at least)
cut uses . as the field delimiter, and outputs only fields 1 and 2, thus removing
everything after and including the second . on the line.
sort sorts the lines for uniq to work
uniq prints each line from its input prefixed with a count
the awk part reverses these fields and prints them separated with =.
If the first part of the ID can contain letters too, change the [0-9] at the end of the regular expression to [0-9A-Z], for example.
The pipeline outputs
ID.580 = 3
ID.694 = 2
Python
As Python is popular among biologists, you might want to hone your Python skills instead:
from collections import Counter

counter = Counter()

with open('Type1.txt') as f:
    for line in f:
        if line.startswith('ID.'):
            top_id = '.'.join(line.split('.', 2)[:2])
            counter[top_id] += 1

for top_id, count in sorted(counter.items()):
    print("%s = %d" % (top_id, count))
The results are exactly identical.
grep '^ID.[0-9][0-9][0-9]' input_file | cut -c1-6 | sort | uniq -c
works?
TL;DR
Given your particular corpus and grouping strategy, there's more than one way to get the results you need. Here are two alternative solutions, one in awk, and one in Ruby.
GNU awk
One way is to use GNU awk to perform the following steps:
match just the ID lines
split matching input lines into fields
select and print the fields you need
sort the lines in the filtered result
count the adjacent duplicates
perform any specialized formatting on the result
For example:
$ awk '/^ID/ {split($0, a, "."); print a[1] "." a[2]}' /tmp/foo |
sort | uniq --count | awk '{print $2 " = " $1}'
ID.580 = 3
ID.694 = 2
With the corpus you provided in your question, this takes an average of 8 ms on my system. A larger corpus will take longer, of course, but unless you have a really huge data set this should be fast enough for most purposes.
Ruby
Ruby offers what I consider a more elegant solution, but is in fact slower. The idea here is to store the relevant portion of your IDs as hash keys, and increment a counter each time you encounter a given ID. For example, consider this Ruby one-liner:
$ ruby -ne 'BEGIN { id = Hash.new(0) }
id[$&] += 1 if /\AID\.\d+/
END { id.each_pair do |k,v| puts "#{k} = #{v}" end }' /tmp/foo
ID.580 = 3
ID.694 = 2
This solution takes around 45 ms to process the same corpus, so I wouldn't recommend it over the awk pipeline just for transforming output. The main advantage to doing it this way is that you have an actual data structure (e.g. a Hash object) that you could manipulate in a more full-featured program.
Here is awk one liner:
$ awk -F. '$1=="ID"{a[$2,$3]++}END{for (i in a) {split(i,ind,SUBSEP); r[ind[1]]++}for (i in r) print "ID."i" = "r[i]}' file
ID.694 = 1
ID.580 = 3
And here is a pure bash solution:
#!/bin/bash
while IFS=. read -r pre id code rest
do
    [[ $pre == ID ]] || continue
    [[ ${a[$id]} =~ \."$code"\. ]] || {
        a[$id]="${a[$id]}.$code."
        ((count[$id]++));
    }
done < file
for i in "${!count[@]}"
do
    echo "ID.$i = ${count[$i]}"
done
$ ./script.sh
ID.580 = 3
ID.694 = 1
awk might work too...
awk '/ID.580/{x++}END{print x}' test.txt
You can put this in a for loop
for i in ID.580 ID.694
do
    awk '/'$i'/{x++}END{print x}' test.txt
done
