Efficient search pattern in large CSV file - performance

I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers, and the one by user #anubhava was the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to #anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz (4 cores, 8 logical processors), 16GB of memory, and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 Terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.

with sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)
or with gawk (unlimited number of open FDs):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv

This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is that it:
- doesn't rely on the 2nd field of the header line coming before the data values when sorted,
- doesn't require awk to test NR > 1 for every line of input,
- doesn't need an array to store $2s or any other values,
- only keeps 1 output file open at a time (the more files open at once, the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks, as gawk then has to start opening/closing the files in the background as needed),
- doesn't require you to empty existing output files before you run it (it does that automatically), and
- only does the string concatenation to create the output file name once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder the input lines that have the same $2 value; add -s if that's undesirable and you have GNU sort. With other sorts you would need to replace the tail with a different awk command and add another sort argument.
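If you do want the stable sort, here is a minimal sketch of that variant (assuming GNU sort for the -s flag and using the question's dataset.csv name):
tail -n +2 dataset.csv | sort -s -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'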

Related

how rand() works in awk

I am trying to sample the 2nd column of a CSV file (any number of samples is fine) using awk and rand(). But I noticed that I always end up with the same number of samples:
cat toy.txt | awk -F',' 'rand()<0.2 {print $2}' | wc -l
I explored, and it seems rand() is not working as I expected. For example, a in the following always seems to be 1:
cat toy.txt | awk -F',' 'a=rand() a<0.2 {print a}'
Why?
From the documentation:
CAUTION: In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk. Thus, a program generates the same results each time you run it. The numbers are random within one awk run but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that is different in each run. To do this, use srand().
So, to apply what's been pointed out in the man page, and duplicated all over this forum and elsewhere on the Internet, I like to use:
awk -v rseed=$RANDOM 'BEGIN{srand(rseed);}{print rand()" "$0}'
The rseed variable is optional, but it is included here because sometimes it helps me to have a deterministic/repeatable random series for simulations when other variables can change, etc.
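Applied to the original sampling problem, a hedged sketch might look like this (toy.txt and the ~20% rate come from the question; $RANDOM assumes bash):
awk -F',' -v rseed=$RANDOM 'BEGIN{srand(rseed)} rand()<0.2 {print $2}' toy.txt | wc -l
Because the seed now changes between runs, the number of sampled lines should vary from run to run; fix rseed to a constant when you want a repeatable sample.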

Performant way of displaying the number of unique column entries in a set of files?

I'm attempting to pipe a large number of files into a sequence of commands that displays the number of unique entries in a given column of said files. I'm inexperienced with the shell, but after a short while I was able to come up with this:
awk '{print $5}' | sort | uniq | wc -l
This sequence of commands works fine for a small number of files, but takes an unacceptable amount of time to execute on my target set. Is there a set of commands that can accomplish this more efficiently?
You can count unique occurrences of values in the fifth field in a single pass with awk:
awk '{if (!seen[$5]++) ++ctr} END {print ctr}'
This creates an array of the values in the fifth field and increments the ctr variable if the value has never been seen before. The END rule prints the value of the counter.
With GNU awk, you can alternatively just check the length of the associative array in the end:
awk '{seen[$5]++} END {print length(seen)}'
Benjamin has supplied the good oil, but depending on just how much data is to be stored in the array, it may pay to pass the data to wc anyway:
awk '!_[$5]++' file | wc -l
The shortest and fastest version (that I could come up with) using awk, not far from #BenjaminW's previous version. I think it is a bit faster (the difference would only be noticeable on a very huge file) because the test is made earlier in the process:
awk '!E[$5]++{c++}END{print c}' YourFile
It works with all awk versions.
GNU datamash has a countunique operation for columns:
datamash -W countunique 5
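A hedged usage sketch (the file names are placeholders; -W treats runs of whitespace as the column delimiter):
cat file1 file2 | datamash -W countunique 5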

Fastest way to extract a column and then find its uniq items in a large delimited file

Hoping for help. I have a 3 million line file, data.txt, delimited with "|", e.g.,
"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"
"5"|"IMPEX ESSEN VERTRIEB VON WERKZEUGEN GMBH"|"DE"|"0"
I need to extract the 3rd column ("DE") and then limit it to its unique values. Here is what I've come up with (gawk and gsort as I'm running MacOS and only had the "--parallel" option via GNU sort):
gawk -F "|" '{print $3}' data.txt \
| gsort --parallel=4 -u > countries.uniq
This works, but it isn't very fast. I have similar tasks coming up with some even larger (11M record) files, so I'm wondering if anyone can point out a faster way.
I hope to stay in shell, rather than say, Python, because some of the related processing is much easier done in shell.
Many thanks!
awk is tailor-made for such tasks. Here is a minimal bit of awk logic that could do the trick for you:
awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' logFile
The logic is: as awk processes every line, it adds an entry for the value in $3 only if it has not seen it before. The command above prints both the unique lines and, at the end, the unique entries from $3.
If you want the unique lines only, you can drop the END block:
awk -F"|" '!($3 in arr){print} {arr[$3]++}' logFile > uniqueLinesOnly
If you want only the unique values from the file, remove the inner print:
awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' logFile > uniqueEntriesOnly
You can see how fast it is for an 11M-record file. You can write the output to a new file using the redirection operator, as in the examples above.
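Applied to the question's own files, a hedged usage sketch (data.txt and countries.uniq are the names from the question) would be:
awk -F"|" '!($3 in arr){arr[$3]++} END{for (i in arr) print i}' data.txt > countries.uniq
This makes a single pass over data.txt and avoids the external sort entirely.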

Bash: how to optimize/parallelize a search through two large files to replace strings?

I'm trying to figure out a way to speed up a pattern search and replace between two large text files (>10Mb). File1 has two columns with unique names in each row. File2 has one column that contains one of the shared names in File1, in no particular order, with some text underneath that spans a variable number of lines. They look something like this:
File1:
uniquename1 sharedname1
uqniename2 sharedname2
...
File2:
>sharedname45
dklajfwiffwf
flkewjfjfw
>sharedname196
lkdsjafwijwg
eflkwejfwfwf
weklfjwlflwf
My goal is to use File1 to replace the sharedname variables with their corresponding uniquename, as follows:
New File2:
>uniquename45
dklajfwif
flkewjfj
>uniquename196
lkdsjafwij
eflkwejf
This is what I've tried so far:
while read -r uniquenames sharednames; do
sed -i "s/$sharednames/$uniquenames/g" $File2
done < $File1
It works, but it's ridiculously slow, trudging through those big files. The CPU usage is the rate-limiting step, so I was trying to parallelize the modification to use the 8 cores at my disposal, but couldn't get it to work. I also tried splitting File1 and File2 into smaller chunks and running them in batches simultaneously, but I couldn't get that to work, either. How would you implement this in parallel? Or do you see a different way of doing it?
Any suggestions would be welcomed.
UPDATE 1
Fantastic! Great answers thanks to #Cyrus and #JJoao and suggestions by other commentators. I implemented both in my script, on the recommendation of #JJoao to test the compute times, and it's an improvement (~3 hours instead of ~5). However, I'm just doing text file manipulation so I don't see how it should be taking any more than a couple of minutes. So, I'm still working on making better use of the available CPUs, so I'm tinkering with the suggestions to see if I can speed it up further.
UPDATE 2: correction to UPDATE 1
I included the modifications in my script and ran it as such, but a chunk of my code was slowing it down. Instead, I ran the suggested bits of code individually on the target intermediary files. Here's what I saw:
Time for #Cyrus' sed to complete
real 70m47.484s
user 70m43.304s
sys 0m1.092s
Time for #JJoao's Perl script to complete
real 0m1.769s
user 0m0.572s
sys 0m0.244s
Looks like I'll be using the Perl script. Thanks for helping, everyone!
UPDATE 3
Here's the time taken by #Cyrus' improved sed command:
time sed -f <(sed -E 's|(.*) (.*)|s/^\2/>\1/|' File1 | tr "\n" ";") File2
real 21m43.555s
user 21m41.780s
sys 0m1.140s
With GNU sed and bash:
sed -f <(sed -E 's|(.*) (.*)|s/>\2/>\1/|' File1) File2
Update:
An attempt to speed it up:
sed -f <(sed -E 's|(.*) (.*)|s/^>\2/>\1/|' File1 | tr "\n" ";") File2
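To see why this helps: the inner sed turns each File1 row into one s/// command, and tr joins them into a single one-line sed program, so File2 is scanned only once instead of once per name. For the first sample File1 row in the question, the generated command would be:
s/^>sharedname1/>uniquename1/;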
#!/usr/bin/perl
use strict;
my $file1=shift;
my %dic=();
open(F1,$file1) or die("can't find replacement file\n");
while(<F1>){                       # slurp File1 into %dic
  if(/(.*)\s+(.*)/){$dic{$2}=$1}
}
while(<>){                         # for all File2 lines
  s/(?<=>)(.*)/ $dic{$1} || $1/e;  # sub ">id" by ">dic{id}"
  print
}
I prefer #Cyrus's solution, but if you need to do this often you can use the previous Perl script (chmod + install) as a dict-replacement command.
Usage: dict-replacement File1 File* > output
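For example, a hedged sketch (the dict-replacement and NewFile2 names are placeholders):
chmod +x dict-replacement
./dict-replacement File1 File2 > NewFile2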
It would be nice if you could tell us the time of the various solutions...

Grepping progressively through large file

I have several large data files (~100MB-1GB of text) and a sorted list of tens of thousands of timestamps that index data points of interest. The timestamp file looks like:
12345
15467
67256
182387
199364
...
And the data file looks like:
Line of text
12345 0.234 0.123 2.321
More text
Some unimportant data
14509 0.987 0.543 3.600
More text
15467 0.678 0.345 4.431
The data in the second file is all in order of timestamp. I want to grep through the second file using the time stamps of the first, printing the timestamp and fourth data item in an output file. I've been using this:
grep -wf time.stamps data.file | awk '{print $1 "\t" $4 }' >> output.file
This is taking on the order of a day to complete for each data file. The problem is that this command searches through the entire data file for every line in time.stamps, but I only need the search to pick up from the last data point. Is there any way to speed up this process?
You can do this entirely in awk …
awk 'NR==FNR{a[$1]++;next}($1 in a){print $1,$4}' timestampfile datafile
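The same one-liner restated with comments for readability (behavior unchanged):
awk 'NR==FNR { a[$1]++; next }     # 1st file (timestampfile): remember each timestamp
     ($1 in a) { print $1, $4 }    # 2nd file (datafile): print timestamp and 4th field on a match
' timestampfile datafile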
JS웃's awk solution is probably the way to go. If join is available and the first field of the irrelevant "data" is not numeric, you could exploit the fact that the files are in the same order and avoid a sorting step. This example uses bash process substitution on Linux:
join -o2.1,2.4 -1 1 -2 1 key.txt <(awk '$1 ~ /^[[:digit:]]+$/' data.txt)
grep has a little-used option, -f filename, which gets the patterns from filename and does the matching. It is likely to beat the awk solution, and your timestamps would not have to be sorted.

Resources