Appropriate mawk syntax for counting non-numeric data column-wise in a >1000-field file? - bash

The following silly hard-coding of what ought to be some kind of loop or parallel construct works nominally, but it is poor mawk syntax. My attempts at good mawk syntax have all failed, using for loops in mawk (not shown) and GNU Parallel (not shown).
It really needs to read the CSV file from disk just once, not once per column, because I have a really big CSV file (millions of rows, thousands of columns). My original code worked fine-ish (not shown), but it read the whole disk file again for every column; it was taking hours and I killed it after realizing what was happening. I have a fast solid-state disk in a GPU connector slot, so disk reads are blazing fast on this device. Thus CPU is the bottleneck here. Code silliness is even more of a bottleneck if I have to hard-code 4000 lines of basically the same statements, differing only in column number.
The code makes column-wise counts of non-numeric values. I need some looping (for-loop) or parallel (preferred) construct, because while the following works correctly on 2 columns, it is not a scalable way to write mawk code for thousands of columns.
tail -n +1 pht.csv | awk -F"," '(($1+0 != $1) && ($1!="")){cnt1++}; (($2+0 != $2) && ($2!="")){cnt2++} END{print cnt1+0; print cnt2+0}'
2
1
How can the "column 1 processing; column 2 processing;" duplicate code be reduced? How can looping be introduced? How can gnu parallel be introduced? Thanks much. New to awk, I am. Not new to other languages.
I keep expecting some clever combo of one or more of the following bash commands is going to solve this handily, but here I am many hours later with nothing to show. I come with open hands. Alms for the code-poor?
seq 1 2 (the upper bound would be >>2 for the real-life CSV file)
tail (to skip the header or not as needed)
mawk (nice-ish row-wise CSV file processing, with that handy syntax I showed you in my demo for finding non-numerics easily in a supposedly all-numeric CSV datafile of jumbo dimensions)
tr (removes newline which is handy for transpose-ish operations)
cut (to grab a column at a time)
parallel (fast is good, and I have mucho cores needing something to work on, and phat RAM)
Sorry, I am absolutely required to not use CSV specific libraries like python pandas or R dataframes. My hands are tied here. Sorry. Thank you for being so cool about it. I can only use bash command lines in this case.
My mawk can handle 32000+ columns so NF is not a problem here, unlike some other awk I've seen. I have less than 32000 columns (but not by that much).
Datafile pht.csv contains the following 3x2 dataset:
cat pht.csv
8,T1,
T13,3,T1
T13,,-6.350818276405334473e-01

I don't have access to mawk, but you can do something equivalent to this:
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) a[i]++}
         END  {for(i=1;i<=NF;i++) print a[i]+0}' file
It shouldn't take more than a few minutes even for a million records.
For recognizing exponential notation the regex test is not going to work, so you need to revert to the $i+0 != $i test mentioned in the comments. Note that you don't have to check the null string separately.
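Combining the two, a sketch of the same loop with the numeric test substituted for the regex (my addition, not part of the original answer; it keeps the NR>1 header skip, so it assumes the first line is a header):

awk -F, 'NR>1 {for(i=1;i<=NF;i++) if( ($i+0)!=$i && $i!="" ) a[i]++}
         END  {for(i=1;i<=NF;i++) print a[i]+0}' file

For the 4-column pht2.csv with a header that appears further down in this thread, this should print 2, 1, 1 and 1, one count per line.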

None of the solutions so far parallelize. Let's change that.
Assume you have a solution that works in serial and can read from a pipe:
doit() {
  # This solution gives 5-10 MB/s depending on system
  # Changed so it now also treats '' as zero
  perl -F, -ane 'for(0..$#F) {
      # Perl has no beautiful way of matching scientific notation
      $s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
    }
    END { $" = ","; print "@s\n" }';
}
export -f doit

doit() {
  # Somewhat faster - but regards empty fields as zero
  mawk -F"," '{
      for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
    }
    END { for(i=1;i<NF;i++) printf "%d,", cnt[i]; print cnt[NF]+0 }';
}
export -f doit
To parallelize this we need to split the big file into chunks and pass each chunk to the serial solution:
# This will spawn a process for each core
parallel --pipe-part -a pht.csv --block -1 doit > blocksums
(You need version 20161222 or later to use '--block -1').
To deal with the header we compute the result for the header line, but we negate it:
head -n1 pht.csv | doit | perl -pe 's/(^|,)/$1-/g' > headersum
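As a worked illustration (mine, not part of the original answer), take the 4-column pht2.csv with header COLA99,COLB,COLC,COLD that appears later in this thread; every header name is non-numeric, so:

head -n1 pht2.csv | doit
# -> 1,1,1,1
head -n1 pht2.csv | doit | perl -pe 's/(^|,)/$1-/g'
# -> -1,-1,-1,-1

The parallel pass over the whole file counts the header line once in blocksums, so adding the negated headersum cancels that contribution and leaves only the data rows' counts.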
Now we can simply sum up the headersum and the blocksums:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = ",";print "#s\n" }'
Or if you prefer the output line by line:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = "\n";print "#s\n" }'

Is this what you're trying to do?
$ awk -v RS='[\n,]' '($1+0) != $1' file | sort | uniq -c
1 T1
2 T13
The above uses GNU awk for multi-char RS and should run in seconds for an input file like you describe. If you don't have GNU awk you could do:
$ tr ',' $'\n' < file | awk '($1+0) != $1' | sort | uniq -c
1 T1
2 T13
I'm avoiding the approach of using , as the FS since then you'd have to use $i in a loop, which would cause awk to do field splitting for every input line and adds time, but you could try it:
$ awk -F, '{for (i=1;i<=NF;i++) if (($i+0) != $i) print $i}' file | sort | uniq -c
1 T1
2 T13
You could do the unique counting all in awk with an array indexed by the non-numeric values, but then you potentially have to store a lot of data in memory (unlike with sort, which uses temp swap files as necessary), so YMMV with that approach.
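For example, a minimal sketch of that all-in-awk variant (my addition, not from the original answer; it still needs GNU awk for the multi-char RS, and skips empty fields explicitly):

awk -v RS='[\n,]' '($1+0) != $1 && $1 != "" { cnt[$1]++ }
                   END { for (v in cnt) print cnt[v], v }' file

The trade-off described above applies: the cnt array holds every distinct non-numeric value in memory.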

I solved it independently. What finally did it for me was the dynamic variable creation examples at the following URL: http://cfajohnson.com/shell/cus-faq-2.html#Q24
Here is the solution I developed. Note: I have added another column with some missing data for a more complete unit test. Mine is not necessarily the best solution, which is TBD. All I know at the moment is that it works correctly on the small CSV shown. The best solution will also need to run really fast on a 40 GB CSV file (not shown, haha).
$ cat pht.csv
8,T1,
T13,3,T1
T13,,0
$ tail -n +1 pht.csv | awk -F"," '{ for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }'
2
1
1
P.S. Honestly, I am not satisfied with my own answer. They say that premature optimization is the root of all evil. Well, that maxim does not apply here. I really, really want GNU Parallel in there instead of the for-loop if possible, because I have a need for speed.

Final note: Below I am sharing performance timings of the sequential and parallel versions, and the best available unit test dataset. A special thank you to Ole Tange for his big help developing code to use his nice GNU Parallel command in this application.
Unit test datafile, final version:
$ cat pht2.csv
COLA99,COLB,COLC,COLD
8,T1,,T1
T13,3,T1,0.03
T13,,-6.350818276405334473e-01,-0.036
Timing on big data (not shown) for sequential version of column-wise non-numeric counts:
ga#ga-HP-Z820:/mnt/fastssd$ time tail -n +2 train_all.csv | awk -F"," '{ for(i=1; i<=NF; i++){ cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }' > /dev/null
real 35m37.121s
Timing on big data for parallel version of column-wise non-numeric counts:
# Correctness - 2 1 1 1 is the correct output.
#
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m14.253s
doit1() {
  perl -F, -ane 'for(0..$#F) {
      # Perl has no beautiful way of matching scientific notation
      $s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
    }
    END { $" = ","; print "@s\n" }';
}
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m59.960s
doit2() {
  mawk -F"," '{
      for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
    }
    END { for(i=1;i<NF;i++) printf "%d,", cnt[i]; print cnt[NF]+0 }';
}
export -f doit1
parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums
if [ $csvheader -eq 1 ]
then
    head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
    cat headersum blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n"; print "@s\n" }' > "$outfile"
else
    cat blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n"; print "@s\n" }' > "$outfile"
fi
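For completeness, a hypothetical end-to-end driver tying the pieces above together (my sketch, not the OP's actual script; fn, csvheader and outfile are the variables used in the snippet above but never set there, and sumblocks is just a name I invented for the summation one-liner):

#!/usr/bin/env bash
fn="pht2.csv"            # input CSV
csvheader=1              # 1 = first line is a header, 0 = no header
outfile="colcounts.txt"  # result: one non-numeric count per column, one per line

doit1() {
  perl -F, -ane 'for(0..$#F) {
      $s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
    }
    END { $" = ","; print "@s\n" }';
}
export -f doit1

sumblocks() {
  perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
                 END { $" = "\n"; print "@s\n" }';
}

parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums

if [ "$csvheader" -eq 1 ]; then
    head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
    cat headersum blocksums | sumblocks > "$outfile"
else
    sumblocks < blocksums > "$outfile"
fi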
NEW: Here are the ROW-wise (not column-wise) counts in sequential code:
tail -n +2 train_all.csv | awk -F"," '{ cnt=0; for(i=1; i<=NF; i++){ cnt+=(($i+0)!=$i) && ($i!="") } print cnt; }' > train_all_cnt_nonnumerics_rowwwise.out.txt
Context: the project is machine learning, and this is part of data exploration. A ~25x parallel speedup was seen on a dual-Xeon 32 virtual / 16 physical core shared-memory host using Samsung 950 Pro SSD storage: roughly (32x60) seconds sequential time vs. 74 seconds parallel time. AWESOME!

Related

Stream filter large number of lines that are specified by line number from stdin

I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now with the disadvantage of having a 60GB file lie around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
    # process first file
    FNR==NR { nums[$1]
              maxFNR = ($1 > maxFNR ? $1 : maxFNR)
              next
            }
    # process subsequent file(s):
    FNR > maxFNR { exit }
    FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR> maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests (for example along the lines of the timing sketch below) to see if there's a (noticeable) difference in overall runtime
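A minimal A/B timing sketch for such a test (my addition, not part of the original answer; it assumes the select.txt and huge.txt.xz names from the question):

# Variant 1: plain hash lookup only
time awk 'FNR==NR {nums[$1]; next} FNR in nums' \
    select.txt <(xz -dcq huge.txt.xz) > /dev/null

# Variant 2: with the FNR > maxFNR early exit
time awk 'FNR==NR {nums[$1]; maxFNR=($1>maxFNR?$1:maxFNR); next}
          FNR > maxFNR {exit}
          FNR in nums' \
    select.txt <(xz -dcq huge.txt.xz) > /dev/null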
To clarify my previous comment, I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing the sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
In my conclusion, the sed solution using q is comparable with the awk solution. For readability and maintainability I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I tested this against the real compressed file, with heavy disk I/O.
EDIT by Ed Morton:
Any speed test where all of the resulting times are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only tries to print 4 lines rather than the 1000s of lines that the OP said they'd have to select, so it doesn't exercise the difference between the sed and the awk solutions that causes the sed solution to be much slower than the awk one: the sed solution has to test every target line number against every line of input, while the awk solution just does a single hash lookup of the current line. It's an O(N) vs O(1) algorithm on each line of the input file.
Here's a better example showing printing every 100th line from a 1000000 line file (i.e. it will select 10000 lines) rather than just 4 lines from any size file:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably getting more realistic for the OPs usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!
If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same when decompressing from a file:
$: tar xOzf x.tgz | sed -nf selector
10
14
1499
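Applied to the pipeline from the original question, this would presumably look like:

xz -dcq huge.txt.xz | sed -nf selector > filtered.txt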

How to work around open files limit when demuxing files?

I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that the rate of demuxing larger files (which correlates with more unique barcodes used) is significantly slower (10-20X). Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.
Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?
I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
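For concreteness, here is the kind of partitioned run described above (an illustrative sketch only, not a tested solution; barcodes.txt, the barcode_chunk. prefix and the /tmp/demux output directory are assumed names, and it relies on the same GNU awk match() call and record layout as the demux.awk script below):

# Split the barcode list into chunks of at most 2000 barcodes each
# (2000 barcodes x 2 output files = 4000 open files, under the 4096 limit),
# then run one awk per chunk; each pass scans the whole FASTQ but only
# writes the files for its own subset of barcodes.
split -l 2000 barcodes.txt barcode_chunk.
for chunk in barcode_chunk.*; do
    awk -v outdir="/tmp/demux" -v prefix="batch" -v bclist="$chunk" '
        BEGIN { while ((getline bc < bclist) > 0) keep[bc]; close(bclist) }
        NR%4 == 1 { match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx) }
        (bx[1] in keep) {
            if (NR%4 == 1) print bx[2] >> outdir"/"prefix"."bx[1]".umi"
            print >> outdir"/"prefix"."bx[1]".fastq"
        }
    ' cells.1K.fastq &
done
wait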
Is there a better strategy?
I'm not married to awk, so approaches in other scripting or compiled languages are welcome.
Specific Example
Data Generation (FASTQ with barcodes)
The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18 character word using the non-ambiguous DNA alphabet.
1024 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq
16384 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
awk script for demultiplexing
Note that in this case I'm writing to 2 files for each unique barcode.
demux.awk
#!/usr/bin/awk -f
BEGIN {
    if (length(outdir) == 0 || length(prefix) == 0) {
        print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
        exit 1;
    }
    print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
    if (NR%4 == 1) {
        match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
        print bx[2] >> outdir"/"prefix"."bx[1]".umi";
    }
    print >> outdir"/"prefix"."bx[1]".fastq";
    if (NR%40000 == 0) {
        printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
    }
}
END {
    printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}
Usage
awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
or similarly for the cells.16K.fastq.
Assuming you're the only one running awk, you can verify the approximate number of open files using
lsof | grep "awk" | wc -l
Observed Behavior
Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.
Without seeing any sample input/output or the script you're currently executing it's very much guesswork, but if you currently have the barcode in field 1 and are doing (assuming GNU awk, so you don't have your own code managing the open files):
awk '{print > $1}' file
then, if managing open files really is your problem, you'll get a significant improvement if you change it to:
sort file | awk '$1!=f{close(f); f=$1} {print > f}'
The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., etc., since you haven't shown us any of that yet.
If that's not all you need, then edit your question to include the missing MCVE.
Given your updated question with your script and the info that the input is 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record, using NUL to separate the 4-line blocks, and then using NUL as the record separator for sort and the subsequent awk:
$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"
mkdir -p "$outdir" || exit 1
awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
    NR%4 == 1 { print $2 OFS $3 OFS }
    { print $0 (NR%4 ? RS : "\0") }
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
    BEGIN {
        if ( (outdir == "") || (prefix == "") ) {
            print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
            exit 1
        }
        print "[INFO] Initiating demuxing..." | "cat>&2"
        outBase = outdir "/" prefix "."
    }
    {
        bx1 = $1
        bx2 = $2
        fastq = $3
        if ( bx1 != prevBx1 ) {
            close(umiOut)
            close(fastqOut)
            umiOut   = outBase bx1 ".umi"
            fastqOut = outBase bx1 ".fastq"
            prevBx1  = bx1
        }
        print bx2 > umiOut
        print fastq > fastqOut
        if (NR%10000 == 0) {
            printf "[INFO] %d reads processed\n", NR | "cat>&2"
        }
    }
    END {
        printf "[INFO] %d total reads processed\n", NR | "cat>&2"
    }
'
When run against input files generated as you describe in your question:
$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq
the results are:
$ time ./tst.sh cells.1K.fastq 2>/dev/null
real 0m55.333s
user 0m56.750s
sys 0m1.277s
$ ls cells.1K.fastq_out | wc -l
2048
$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total
$ time ./tst.sh cells.16K.fastq 2>/dev/null
real 1m6.815s
user 0m59.058s
sys 0m5.833s
$ ls cells.16K.fastq_out | wc -l
32768
$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
The beauty of awk is that you can do this in one awk line:
awk '{ print > ("file"c".txt") }
(NR % 4 == 0) { ++c }
(c == 10001) { exit }' <file>
This can be slightly more optimized and file-handling friendly (cf. James Brown):
awk 'BEGIN{f="file0.txt" }
{ print > f }
(NR % 4 == 0) { close(f); f="file"++c".txt" }
(c == 10001) { exit }' <file>
Why did your script fail?
The reason your script is failing is that you used single quotes and tried to pass a shell variable into the awk program. Your lines should read:
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
Why is your script slow?
The way you are processing your file is by doing a loop of 10001 iterations. Per iteration, you perform 4 awk calls. Each awk call reads the full file completely and writes out a single line. So in the end you read your file 40004 times.
To optimise your script step by step, I would do the following:
Terminate awk to stop reading the file after the line is printed:
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR <= 4*i)    {next}   # skip line
             (NR > 4*(i+1)) {exit}   # exit awk
             1' table2.txt > file"$i".txt   # print line
done
remove the final loop (see top of this answer)
This is functionally the same as #JamesBrown's answer, just written more awk-ishly, so don't accept this; I only posted it to show the more idiomatic awk syntax, as you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
With just bash you can do it very simply:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically, head reads the first 40,000 lines (10,000 chunks of 4) and split breaks them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: #kvantour).
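Assuming GNU split (which is what -d and --additional-suffix imply), the resulting files would be named:

file00000.txt  file00001.txt  file00002.txt  ...  file09999.txt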
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk '{
    if(NR%4==1) {              # use mod to recognize first record of group
        if(i==10000)           # exit after 10000 files
            exit               # test with 1
        close(f)               # close previous file
        f="file" i++ ".txt"    # make a new filename
    }
    print > f                  # output record to file
}' file

Another approach to apply RIPEMD in CSV file

I am looking for another approach to apply RIPEMD-160 to the second column of a csv file.
Here is my code
awk -F "," -v env_var="$key" '{
tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
if ( (tmp | getline cksum) > 0 ) {
$3 = toupper(cksum)
}
close(tmp)
print
}' /test/source.csv > /ziel.csv
I run it on a big CSV file (1 GB); after 2 days it has produced only 100 MB, which means I would need to wait a month to get my whole new CSV.
Can you help me with another idea or approach to get my data faster?
Thanks in advance
You can use GNU Parallel to increase the speed of output by executing the awk command in parallel (see the GNU Parallel documentation for an explanation):
cat /test/source.csv | parallel --pipe awk -F "," -v env_var="$key" '{
    tmp = "echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
    if ( (tmp | getline cksum) > 0 ) {
        $3 = toupper(cksum)
    }
    close(tmp)
    print
}' > /ziel.csv
# prepare a batch (to avoir fork from awk)
awk -F "," -v env_var="$key" '
BEGIN {
print "if [ -r /tmp/MD160.Result ];then rm /tmp/MD160.Result;fi"
}
{
print "echo \"\$( echo -n \047" $2 env_var "\047 | openssl ripemd160 )\" >> /tmp/MD160.Result"
} ' /test/source.csv > /tmp/MD160.eval
# eval the MD for each line with batch fork (should be faster)
. /tmp/MD160.eval
# take result and adapt for output
awk '
# load MD160
FNR == NR { m[NR] = toupper($2); next }
# set FS to ","
FNR == 1 { FS = ","; $0 = $0 "" }
# adapt original line
{ $3 = m[FNR]; print}
' /tmp/MD160.Result /test/source.csv > /ziel.csv
Note:
not tested (so the print may need some tuning with escapes)
no error treatment (it assumes everything is OK). I advise adding some tests (like including a line reference in the reply and checking it in the second awk).
forking at batch level will be a lot faster than forking from within awk, which includes the fork for the pipe and catching the reply
I am not a specialist in openssl ripemd160, but there may be another way to treat the elements in a bulk process without opening a fork every time from the same file/source
Your solution hits Cygwin where it hurts the most: spawning new programs. Cygwin is terribly slow at this.
You can make this faster by using all the cores in your computer, but it will still be very slow.
You need a program that does not start other programs to compute the RIPEMD sum. Here is a small Python script that takes the CSV on standard input and outputs the CSV on standard output with the second column replaced with the RIPEMD sum.
riper.py:
#!/usr/bin/python
import hashlib
import fileinput
import os

key = os.environ['key']

for line in fileinput.input():
    # Naive CSV reader - split on ,
    col = line.rstrip().split(",")
    # Compute RIPEMD on column 2
    h = hashlib.new('ripemd160')
    h.update(col[1] + key)
    # Update column 2 with the hexdigest
    col[1] = h.hexdigest().upper()
    print ','.join(col)
Now you can run:
cat source.csv | key=a python riper.py > ziel.csv
This will still only use a single core of your system. To use all cores, GNU Parallel can help. If you do not have GNU Parallel 20161222 or newer in your package system, it can be installed as:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
You will need Perl installed to run GNU Parallel:
key=a
export key
parallel --pipe-part --block -1 -a source.csv -k python riper.py > ziel.csv
This will chop source.csv on the fly into one block per CPU core and run the python script for each block. On my 8-core machine this processes a 1 GB file with 139482000 lines in 300 seconds.
If you need it faster still, you will need to convert riper.py to a compiled language (e.g. C).

How to efficiently sum two columns in a file with 270,000+ rows in bash

I have two columns in a file, and I want to automate summing both values per row
for example
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the column headers per row, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
echo $line_num
line_num=$[$line_num+1]
arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of the modulus function:
awk '!(NR%2){print $1+$2}' infile
awk is probably faster, but the idiomatic bash way to do this is something like:
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
Using the <( ) process substitution in case variables set in the while loop are required out of scope of the while loop. Otherwise a | pipe could be used.
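For instance, a sketch (mine, not part of the original answer) of how the loop above might be extended to also track the maximum sum the OP ultimately wants; max survives past the loop only because of the process substitution, and max.txt is an assumed output file name:

max=0
while read -a line; do
    sum=$(( ${line[0]} + ${line[1]} ))
    (( sum > max )) && max=$sum
done < <(grep -v READ input.txt)
echo "$max" > max.txt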
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input like so:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
Why not run:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
You can use Perl or Python instead of awk if you prefer.
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:
awk '
BEGIN { max = 0 }
{
    if( NR%2 == 0 ){
        sum = $1 + $2;
        if( sum > max ) { max = sum }
    }
}
END { print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN { max = 0 }
{
    sum = $1 + $2;
    if( sum > max ) { max = sum }
}
END { print max }'
