sort and sum for large data files - sorting

I have to process a file that sort does not seem to be able to handle.
The files are approx. 3 GB each.
The input is as follows:
last-j nmod+j+n year-n 9492
last-j nmod+j+n night-n 8075
first-j nmod+j+n-the time-n 7749
same-j nmod+j+n-the time-n 7530
other-j nmod+j+n-the hand-n 5319
ast-j nmod+j+n year-n 1000
last-j nmod+j+n night-n 5000
first-j nmod+j+n-the time-n 1000
same-j nmod+j+n-the time-n 3000
other-j nmod+j+n-the hand-n 200
I need to sum the numbers of corresponding duplicates, so the desired output would be as follows:
last-j nmod+j+n year-n 10492
last-j nmod+j+n night-n 13075
first-j nmod+j+n-the time-n 8749
same-j nmod+j+n-the time-n 10530
other-j nmod+j+n-the hand-n 5519
I have been trying this sort command, which should do the trick
sort input | uniq -c | awk '{print $2 "\t" $3 "\t" $1*$4}'
but it runs out of memory. Any suggestions for something a bit more optimized to handle larger data files? Thanks.

Using an array in awk you can do it all together, no need to sort and uniq:
$ awk '{a[$1,$2,$3]+=$4} END{for (i in a) print i, a[i]}' file
first-jnmod+j+n-thetime-n 8749
ast-jnmod+j+nyear-n 1000
same-jnmod+j+n-thetime-n 10530
last-jnmod+j+nnight-n 13075
last-jnmod+j+nyear-n 9492
other-jnmod+j+n-thehand-n 5519
As this uses columns 1, 2 and 3 together as the array index, the key is printed with the fields run together (awk joins them with the non-printing SUBSEP character). This can be solved by keeping the original text in another array:
$ awk '{a[$1,$2,$3]+=$4; b[$1,$2,$3]=$1" "$2" "$3} END{for (i in a) print b[i], a[i]}' file
first-j nmod+j+n-the time-n 8749
ast-j nmod+j+n year-n 1000
same-j nmod+j+n-the time-n 10530
last-j nmod+j+n night-n 13075
last-j nmod+j+n year-n 9492
other-j nmod+j+n-the hand-n 5519
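
A further sketch, not from the answer above: if the key is built with literal spaces instead of awk's SUBSEP, the index itself is printable and the second array is unnecessary:
$ awk '{key = $1 " " $2 " " $3; sum[key] += $4} END {for (k in sum) print k, sum[k]}' file
As with the loops above, for (k in sum) visits the keys in arbitrary order; pipe the output through sort if the order matters.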

sort and the other purely magical UNIX tools are about as optimized as they can be. If you're counting entries in a file and their unique occurrences don't fit in memory, loading them all into memory is not a good solution; otherwise, the in-memory approach is the fastest.
Apart from that, sorting the file, which is O(n log n), and then counting the entries, which is O(n), will certainly be the best solution, unless you keep a map of k entries in memory and swap data from memory to disk whenever a (k+1)-th key has to be added to the map. Considering this, your solution (the one-liner with sort + uniq + awk) just needs a small tweak.
Try sorting the file externally, using sort's magical ability to do exactly that; after the sort, the counting step needs to keep at most one entry in memory, which pretty much addresses your problem. The final two-liner could be something like:
sort -T <directory_for_temp_files> <input> > <output>
awk '{
  key = $1 " " $2 " " $3
  if (key == cur) { freq += $4 }
  else { if (cur != "") printf "%s %d\n", cur, freq; cur = key; freq = $4 }
}
END { if (cur != "") printf "%s %d\n", cur, freq }' < <output> > <final_output>
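The same idea can also be chained as a single pipeline, skipping the intermediate file (a sketch; the temp directory is a placeholder):
sort -T /path/to/temp_dir input |
awk 'NR > 1 && ($1 " " $2 " " $3) != cur { print cur, freq; freq = 0 }
     { cur = $1 " " $2 " " $3; freq += $4 }
     END { if (NR) print cur, freq }' > final_output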

If this is running out of memory, it is because of sort; uniq and awk only consume constant amounts of memory. You can run multiple sorts in parallel with GNU parallel, e.g. this example from the manual:
cat bigfile | parallel --pipe --files sort | parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
Here bigfile is split into blocks of around 1MB, each block ending in
'\n' (which is the default for --recend). Each block is passed to sort
and the output from sort is saved into files. These files are passed
to the second parallel that runs sort -m on the files before it
removes the files. The output is saved to bigfile.sort.
When the file is sorted you can stream it through the uniq/awk pipe you were using, e.g.:
cat bigfile.sort | uniq -c | awk '{print $2 "\t" $3 "\t" $1*$4}'
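A side note, not from the answer above: plain GNU sort already spills to temporary files when the input doesn't fit in RAM, and its buffer size, temp directory and thread count can be tuned directly, which may be enough on its own. The values below are just examples:
sort -S 2G -T /path/to/tmp --parallel=4 bigfile > bigfile.sort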

Related

Subsetting a CSV based on a percentage of unique values

I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal:
I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
1 - I'm pulling all unique values from column 11 and writing those to a text file:
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
2 - Next, I'm sampling a random 10% of the values in uniqueValues.txt:
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0}' > uniqueValues.10pct.txt
3 - Next, I'm pulling the rows in File1.csv which have column 11 matching values from uniqueValues.10pct.txt:
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
Any suggestions on how to improve the efficiency?
Avoid sort in the 1st step, as the 2nd and 3rd steps do not care about order; you can do your whole 1st step with a single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11]++{print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to get the number of occurrences of the value in the 11th column and use ! to negate it, so 0 becomes true whilst 1 and greater become false. If this holds true, I print the 11th column.
Please test the above against your 1st step for your data and then select whichever is faster.
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don't MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure the times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use the one which proves to be fastest.
¹Be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{a[$1]; next} $11 in a' - File1.csv
You could try replacing that first cut | sort; numUnq=$(wc...) with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster but I doubt it since cut, sort, and wc are all very fast while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
Create a sample *.csv file:
for ((i=1;i<=10;i++))
do
for ((j=1;j<=100;j++))
do
echo "a,b,c,d,e,f,g,h,i,j,${j},k,l,m"
done
done > large.csv
NOTES:
1,000 total lines
100 unique values in the 11th field
each unique value shows up 10 times in the file
We'll look at a couple awk ideas that:
keep track of unique values as we find them
apply the random percentage check as we encounter a new (unique) value
require just a single pass through the source file
NOTE: both of these awk scripts (below) replace all of OP's current code (cat/cut/sort/uniq/cat/awk/awk)
First idea applies our random percentage check each time we find a new unique value:
awk -F',' '
BEGIN { srand() }
!seen[$11]++ { if (rand() <= 0.10) # if this is the 1st time we have seen this value and rand() is <= 10% then ...
keep[$11] # add the value to our keep[] array
}
$11 in keep # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
one drawback to this approach is that the total number of unique values is not guaranteed to always be exactly 10% since we're at the mercy of the rand() function, for example ...
several sample runs generated 70, 110, 100, 140 and 110 lines (ie, 7, 11, 10, 14 and 11 unique values) in small.csv
A different approach: we pre-generate a random set of modulo-100 values (ie, 0 to 99); as we find a new unique value we check the count (of unique values) modulo 100, and if we find a match in our pre-generated set then we print the row:
awk -F',' -v pct=10 '
BEGIN { srand()
delete mods # force awk to treat all "mods" references as an array and not a scalar
while (length(mods) < pct) # repeat loop until we have "pct" unique indices in the mods[] array
mods[int(rand() * 100)] # generate random integers between 0 and 99
}
!seen[$11]++ { if ((++uniqcnt % 100) in mods) # if this is the 1st time we have seen this value then increment our unique value counter and if "modulo 100" is an index in the mods[] array then ...
keep[$11] # add the value to our keep[] array
}
$11 in keep # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
for a large pct this assumes the rand() results are evenly distributed between 0 and 1 so that the mods[] array is populated in a timely manner
this has the benefit of printing lines that represent almost exactly 10% of the possible unique values (depending on the number of unique values, the actual percentage will be 10% +/- 1%)
a half dozen sample runs all generated exactly 100 lines (ie, 10 unique values) in small.csv
If OP still needs to generate the two intermediate (sorted) files (uniqueValues.txt and uniqueValues.10pct.txt) then this could be done in the same awk script via an END {...} block, eg:
END { PROCINFO["sorted_in"]="#ind_num_asc" # this line of code requires GNU awk otherwise OP can sort the files at the OS/bash level
for (i in seen)
print i > "uniqueValues.txt"
for (i in keep)
print i > "uniqueValues.10pct.txt" # use with 1st awk script
# print i > "uniqueValues." pct "pct.txt" # use with 2nd awk script
}

Stream filter large number of lines that are specified by line number from stdin

I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now with the disadvantage of having a 60GB file lie around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR> maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime
To clarify my previous comment. I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
My conclusion: sed using q is comparable with the awk solution. For readability and maintainability I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I tested this against the real compressed file, with heavy disk I/O.
EDIT by Ed Morton:
Any speed test that results in all output values that are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only prints 4 lines rather than the 1000s of lines the OP said they'd have to select, so it doesn't exercise the difference between the sed and awk solutions that makes the sed solution much slower than the awk one: the sed solution has to test every target line number against every line of input, while the awk solution just does a single hash lookup on the current line number. It's an order(N) vs order(1) algorithm on each line of the input file.
Here's a better example showing printing every 100th line from a 1000000 line file (i.e. will select 1000 lines) rather than just 4 lines from any size file:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably getting more realistic for the OP's usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!
If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same with a decompression from a file.
$: tar xOzf x.tgz | sed -nf selector
10
14
1499

mawk syntax appropriate for a >1000-field file, for counting non-numeric data column-wise?

The following silly hard-coding of what ought to be some kind of loop or parallel construct works nominally, but it is poor mawk syntax. My attempts at better mawk syntax have all failed, using for loops in mawk (not shown) and gnu parallel (not shown).
It really needs to read the CSV file from disk just 1 time, not one time per column, because I have a really big CSV file (millions of rows, thousands of columns). My original code worked fine-ish (not shown) but it read the whole disk file again for every column and it was taking hours and I killed it after realizing what was happening. I have a fast solid state disk using a GPU connector slot so disk reads are blazing fast on this device. Thus CPU is the bottleneck here. Code sillyness is even more of a bottleneck if I have to hard-code 4000 lines of basically the same statements except for column number.
The code is column-wise making counts of non-numeric values. I need some looping (for-loop) or parallel (preferred) because while the following works correctly on 2 columns, it is not a scalable way to write mawk code for thousands of columns.
tail -n +1 pht.csv | awk -F"," '(($1+0 != $1) && ($1!="")){cnt1++}; (($2+0 != $2) && ($2!="")){cnt2++} END{print cnt1+0; print cnt2+0}'
2
1
How can the "column 1 processing; column 2 processing;" duplicate code be reduced? How can looping be introduced? How can gnu parallel be introduced? Thanks much. New to awk, I am. Not new to other languages.
I keep expecting some clever combo of one or more of the following bash commands is going to solve this handily, but here I am many hours later with nothing to show. I come with open hands. Alms for the code-poor?
seq 1 2 ( >>2 for real life CSV file)
tail (to skip the header or not as needed)
mawk (nice-ish row-wise CSV file processing, with that handy syntax I showed you in my demo for finding non-numerics easily in a supposedly all-numeric CSV datafile of jumbo dimensions)
tr (removes newline which is handy for transpose-ish operations)
cut (to grab a column at a time)
parallel (fast is good, and I have mucho cores needing something to work on, and phat RAM)
Sorry, I am absolutely required to not use CSV specific libraries like python pandas or R dataframes. My hands are tied here. Sorry. Thank you for being so cool about it. I can only use bash command lines in this case.
My mawk can handle 32000+ columns so NF is not a problem here, unlike some other awk I've seen. I have less than 32000 columns (but not by that much).
Datafile pht.csv contains the following 3x2 dataset:
cat pht.csv
8,T1,
T13,3,T1
T13,,-6.350818276405334473e-01
I don't have access to mawk, but you can do something equivalent to this:
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) a[i]++}
END {for(i=1;i in a;i++) print a[i]}' file
It shouldn't take more than a few minutes even for a million records.
For recognizing exponential notation the regex test is not going to work and you need to revert to the $1+0 != $1 test, as mentioned in the comments. Note that you don't have to check the null string separately.
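For example, a sketch of the same per-column loop using the numeric test instead of the character-class regex (it keeps the OP's empty-field check so blank cells are not counted, skips a header line with NR>1 as above, and prints a count, possibly 0, for every column; NF in the END block is taken from the last record, which is fine for a regular CSV):
awk -F, 'NR>1 {for (i=1; i<=NF; i++) if ($i != "" && $i+0 != $i) a[i]++}
         END  {for (i=1; i<=NF; i++) print a[i]+0}' file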
None of the solutions so far parallelize. Let's change that.
Assume you have a solution that works in serial and can read from a pipe:
doit() {
# This solution gives 5-10 MB/s depending on system
# Changed so it now also treats '' as zero
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
export -f doit
doit() {
# Somewhat faster - but regards empty fields as zero
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit
To parallelize this we need to split the big file into chunks and pass each chunk to the serial solution:
# This will spawn a process for each core
parallel --pipe-part -a pht.csv --block -1 doit > blocksums
(You need version 20161222 or later to use '--block -1').
To deal with the header we compute the result for the header line, but negate it:
head -n1 pht.csv | doit | perl -pe 's/(^|,)/$1-/g' > headersum
Now we can simply sum up the headersum and the blocksums:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = ",";print "#s\n" }'
Or if you prefer the output line by line:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = "\n";print "#s\n" }'
Is this what you're trying to do?
$ awk -v RS='[\n,]' '($1+0) != $1' file | sort | uniq -c
1 T1
2 T13
The above uses GNU awk for multi-char RS and should run in seconds for an input file like you describe. If you don't have GNU awk you could do:
$ tr ',' $'\n' < file | awk '($1+0) != $1' | sort | uniq -c
1 T1
2 T13
I'm avoiding the approach of using , as the FS since then you'd have to use $i in a loop, which would make awk do field splitting for every input line and add time, but you could try it:
$ awk -F, '{for (i=1;i<=NF;i++) if (($i+0) != $i) print $i}' file | sort | uniq -c
1 T1
2 T13
You could do the unique counting all in awk with an array indexed by the non-numeric values but then you potentially have to store a lot of data in memory (unlike with sort which uses temp swap files as necessary) so YMMV with that approach.
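For example, a sketch of that all-in-awk variant (GNU awk for the multi-char RS, as above; empty fields are skipped, and the output order is arbitrary unless piped through sort):
$ awk -v RS='[\n,]' '$1 != "" && ($1+0) != $1 {cnt[$1]++} END {for (v in cnt) print cnt[v], v}' file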
I solved it independently. What finally did it for me was the dynamic variable creation examples at the following URL. http://cfajohnson.com/shell/cus-faq-2.html#Q24
Here is the solution I developed. Note: I have added another column with some missing data for a more complete unit test. Mine is not necessarily the best solution, which is TBD. It works correctly on the small csv shown is all I know at the moment. The best solution will also need to run really fast on a 40 GB csv file (not shown haha).
$ cat pht.csv
8,T1,
T13,3,T1
T13,,0
$ tail -n +1 pht.csv | awk -F"," '{ for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }'
2
1
1
ps. Honestly I am not satisfied with my own answer. They say that premature optimization is the root of all evil. Well that maxim does not apply here. I really, really want gnu parallel in there, instead of the for-loop if possible, because I have a need for speed.
Final note: Below I am sharing performance timings of sequential and parallel versions, and best available unit test dataset. Special thank you to Ole Tange for big help developing code to use his nice gnu parallel command in this application.
Unit test datafile, final version:
$ cat pht2.csv
COLA99,COLB,COLC,COLD
8,T1,,T1
T13,3,T1,0.03
T13,,-6.350818276405334473e-01,-0.036
Timing on big data (not shown) for sequential version of column-wise non-numeric counts:
ga#ga-HP-Z820:/mnt/fastssd$ time tail -n +2 train_all.csv | awk -F"," '{ for(i=1; i<=NF; i++){ cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }' > /dev/null
real 35m37.121s
Timing on big data for parallel version of column-wise non-numeric counts:
# Correctness - 2 1 1 1 is the correct output.
#
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m14.253s
doit1() {
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m59.960s
doit2() {
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit1
parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums
if [ $csvheader -eq 1 ]
then
head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
cat headersum blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
else
cat blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
fi
NEW: Here is the ROW-wise (not column-wise) counts in sequential code:
tail -n +2 train_all.csv | awk -F"," '{ cnt=0; for(i=1; i<=NF; i++){ cnt+=(($i+0)!=$i) && ($i!="") } print cnt; }' > train_all_cnt_nonnumerics_rowwwise.out.txt
Context: Project is machine learning. This is part of a data exploration. ~25x parallel speedup seen on Dual Xeon 32 virtual / 16 physical core shared memory host using Samsung 950 Pro SSD storage: (32x60) seconds sequential time, 74 sec parallel time. AWESOME!

How to use grep -c to count ocurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column, so I can copy it directly to a spreadsheet. Ex.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the cities (I assume it's structured, as it's a CSV file).
For example, here is how often each shell is used on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accepted the grep -c answer I'll assume you actually only care about the latter. Don't use grep and read the file multiple times; count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize the counters. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
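If the list of cities keeps growing, here is a sketch that keeps the single pass but reads the cities from a variable instead of hard-coding one rule per city; the city list and its comma delimiter are just an example:
awk -v cities='Chicago,Washington,New York' '
BEGIN { n = split(cities, c, ",") }                            # split the list into c[1..n]
{ for (i = 1; i <= n; i++) if (index($0, c[i])) cnt[c[i]]++ }  # count lines containing each city
END { for (i = 1; i <= n; i++) print cnt[c[i]] + 0 }           # one count per line, in list order; +0 prints 0 when absent
' input-file
Like grep -c, this counts lines containing each string, not the total number of occurrences.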

bash - shuffle a file that is too large to fit in memory

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other; I need all of the lines to be shuffled). Are there any options other than rolling my own solution?
Using a form of the decorate-sort-undecorate pattern and awk, you can do something like:
$ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
8
5
1
9
6
3
7
2
10
4
For a file, you would do:
$ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT
or cat the file at the start of the pipeline.
This works by generating a column of random numbers between 000000 and 999999 inclusive (decorate); sorting on that column (sort); then deleting the column (undecorate). That should work on platforms where sort does not understand numerics by generating a column with leading zeros for lexicographic sorting.
You can increase that randomization, if desired, in several ways:
If your platform's sort understands numerical values (POSIX, GNU and BSD do) you can do awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to use a near double float for random representation.
If you are limited to a lexicographic sort, just combine two calls to rand into one column like so: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort -n | cut -f 2- which gives a composite 12 digits of randomization.
Count lines (wc -l) and generate a list of numbers corresponding to line numbers, in a random order - perhaps by generating a list of numbers in a temp file (use /tmp/, which is in RAM typically, and thus relatively fast). Then copy the line corresponding to each number to the target file in the order of the shuffled numbers.
This would be time-inefficient, because of the amount of seeking for newlines in the file, but it would work on almost any size of file.
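A rough sketch of that idea (the file names are placeholders, and it assumes GNU shuf and sed are available):
n=$(wc -l < bigfile.txt)
seq "$n" | shuf > /tmp/order.txt          # every line number, in random order
while read -r ln; do
  sed -n "${ln}{p;q}" bigfile.txt         # print line $ln then quit (still one scan per line)
done < /tmp/order.txt > shuffled.txt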
Have a look at https://github.com/alexandres/terashuf . From page:
terashuf implements a quasi-shuffle algorithm for shuffling
multi-terabyte text files using limited memory
First of all, I would say, this is not a strict global shuffle solution.
Generally, my idea is to split the large file into smaller ones, and then do the shuffle.
Split large file into pieces:
split --bytes=500M large_file small_file_
This will split your large_file into small_file_aa, small_file_ab....
Shuffle:
shuf small_file_aa > small_file_aa.shuf
You may try to blend the files several times to get a result that approximates a global shuffle.
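One possible blend pass, as a sketch (not from the answer; it assumes the unshuffled pieces from the previous round have been removed, and uses --line-bytes so the re-split never breaks a line):
cat $(ls small_file_*.shuf | shuf) > blended    # concatenate the shuffled pieces in a random order
split --line-bytes=500M blended small_file_     # re-split on line boundaries
rm blended small_file_*.shuf
for f in small_file_??; do shuf "$f" > "$f.shuf"; rm "$f"; done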
If the file is within a few orders of magnitude of what can fit in memory, one option is to randomly distribute the lines among (say) 1000 temporary files, then shuffle each of those files and concatenate the result:
perl -we ' my $NUM_FILES = 1000;
my @fhs;
for (my $i = 0; $i < $NUM_FILES; ++$i) {
open $fhs[$i], "> tmp.$i.txt"
or die "Error opening tmp.$i.txt: $!";
}
while (<>) {
$fhs[int rand $NUM_FILES]->print($_);
}
foreach my $fh (@fhs) {
close $fh;
}
' < input.txt \
&& \
for tmp_file in tmp.*.txt ; do
shuf ./"$tmp_file" && rm ./"$tmp_file"
done > output.txt
(Of course, there will be some variation in the sizes of the temporary files — they won't all be exactly one-thousandth the size of the original file — so if you use this approach, you need to give yourself some buffer by erring on the side of more, smaller files.)
How about:
perl <large-input-file -lne 'print rand(), "\t", $_' | sort | perl -lpe 's/^.*?\t//' >shuffled-output-file
