Another approach to apply RIPEMD to a CSV file - bash

I am looking for another approach to apply RIPEMD-160 to the second column of a CSV file.
Here is my code:
awk -F "," -v env_var="$key" '{
tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
if ( (tmp | getline cksum) > 0 ) {
$3 = toupper(cksum)
}
close(tmp)
print
}' /test/source.csv > /ziel.csv
I run it on a big CSV file (1 GB); after 2 days it has produced only 100 MB, which means I would need to wait about a month to get the whole new CSV.
Can you help me with another idea or approach to get my data faster?
Thanks in advance

You can use GNU Parallel to speed this up by executing the awk command in parallel. For an explanation, check here.
cat /test/source.csv | parallel --pipe awk -F "," -v env_var="$key" '{
tmp="echo -n \047" $2 env_var "\047 | openssl ripemd160 | cut -f2 -d\047 \047"
if ( (tmp | getline cksum) > 0 ) {
$3 = toupper(cksum)
}
close(tmp)
print
}' > /ziel.csv

# prepare a batch (to avoid forking from awk)
awk -F "," -v env_var="$key" '
BEGIN {
print "if [ -r /tmp/MD160.Result ];then rm /tmp/MD160.Result;fi"
}
{
print "echo \"\$( echo -n \047" $2 env_var "\047 | openssl ripemd160 )\" >> /tmp/MD160.Result"
} ' /test/source.csv > /tmp/MD160.eval
# eval the MD for each line with batch fork (should be faster)
. /tmp/MD160.eval
# take result and adapt for output
awk '
# load MD160
FNR == NR { m[NR] = toupper($2); next }
# set FS to ","
FNR == 1 { FS = ","; $0 = $0 "" }
# adapt original line
{ $3 = m[FNR]; print}
' /tmp/MD160.Result /test/source.csv > /ziel.csv
Note:
not tested (so the print statements may need some tuning of the escaping)
no error handling (it assumes everything is OK). I advise adding some checks (like including a line reference in the reply and testing it in the second awk).
forking at the batch level will be a lot faster than forking from awk, which adds a pipe fork and catching the reply
I am not a specialist in openssl ripemd160, but there may be another way to treat the elements as a bulk process without opening a fork every time from the same file/source

Your solution hits Cygwin where it hurts the most: spawning new programs. Cygwin is terribly slow at this.
You can make this faster by using all the cores in your computer, but it will still be very slow.
You need a program that does not start other programs to compute the RIPEMD sum. Here is a small Python script that takes the CSV on standard input and outputs the CSV on standard output with the second column replaced with the RIPEMD sum.
riper.py:
#!/usr/bin/python
import hashlib
import fileinput
import os
key = os.environ['key']
for line in fileinput.input():
# Naive CSV reader - split on ,
col = line.rstrip().split(",")
# Compute RIPEMD on column 2
h = hashlib.new('ripemd160')
h.update(col[1]+key)
# Update column 2 with the hexdigest
col[1] = h.hexdigest().upper();
print ','.join(col)
Now you can run:
cat source.csv | key=a python riper.py > ziel.csv
This will still only use a single core of your system. To use all cores, GNU Parallel can help. If you do not have GNU Parallel 20161222 or newer in your package system, it can be installed as:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
You will need Perl installed to run GNU Parallel:
key=a
export key
parallel --pipe-part --block -1 -a source.csv -k python riper.py > ziel.csv
This will chop source.csv on the fly into one block per CPU core and run the Python script on each block. On my 8-core machine this processes a 1 GB file with 139482000 lines in 300 seconds.
If you need it faster still, you will need to convert riper.py to a compiled language (e.g. C).

Related

Bash: Working with CSV file to build a loop and save the result

Using Bash, I want to get a list of email addresses from a CSV file, run a recursive grep for each one over a bunch of directories looking for matches in specific metadata XML files, and then tally up how many results I find for each address throughout the directory tree (i.e. update the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
address#somewhere.com,21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
something#that.com,17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1
For example, if we put address#somewhere.com in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost on how to do the looping and also how to write the result back to the CSV file...
So for instance, with another#here.com if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
#!/bin/bash
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<identifier>SomeTitleNameGoesHere</identifier>
<mediatype>audio</mediatype>
<collection>opensource_movies</collection>
<description>example <br /></description>
<subject>testing</subject>
<title>Some Title Name Goes Here</title>
<uploader>another#here.com</uploader>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator]email#address.com[/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
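As a toy illustration of such an assignment (a made-up invocation, not part of the original answer, using the sample row from the question):
echo 'another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1' |
  awk -F, -v OFS=, '{ $6 = 17; print }'
# -> another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17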
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing a few minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
The following script increments each tally by the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
cmd | getline c;
close(cmd);
$6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w options to your grep command.
-F searches literal strings; without it, searching for a.b#c would give false positives for things like axb#c and a-b#c.
-w matches only whole words; without it, searching for b#c would give a false positive for ab#c. This isn't 100% safe, as a-b#c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
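To then write the updated rows back, one option (the same temporary-file pattern used in a later answer below) is, assuming the awk program above is saved as tally.awk (hypothetical filename):
awk -F, -v OFS=, -v q=\' -f tally.awk accounts.csv > accounts.tmp && mv accounts.tmp accounts.csv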
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
NR == FNR {
# store the filenames for each email
if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
next
}
FNR > 1 {$4 = length(tally[$1])}
1
' - accounts.csv
Here is a solution using a single awk command. It will perform much better than the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. It also does not invoke any external command or spawn a subshell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
tmp = s
gsub(/[&+.]/, "\\\\&", tmp)
return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
# save escaped email address in array em skipping header row
if (FNR > 1)
em[esc($1)] = 0
# save each row in rec array
rec[++n] = $0
next
}
# this block will execute for each XML file
{
# loop over each email and add the count of matches to array em
# note: gsub returns the number of substitutions
for (i in em)
em[i] += gsub(i, "&")
}
END {
# print header row
print rec[1]
# from 2nd row onwards split row into columns using comma
for (i=2; i<=n; ++i) {
split(rec[i], a, FS)
# 6th column is the count of occurrence from array em
print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
}
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and replaces the data in accounts.new.csv for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=${file_old/csv/new.csv}
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
# Skip first line
if [[ $x -gt 1 ]]; then
# Read data into variables
IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
# Reset tally
tally=$cnt
# Change line number $x in new file
sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
-i ${file_new}
fi
((x++))
done < ${file_old}
The input and output:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
2 address#somewhere.com
1 something#that.com
$ cat accounts.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,-1,blah
something#that.com,bar2,foo3,-1,blah
another#here.com,bar4,foo5,-1,blah
# output
$ ./test.sh
$ cat accounts.new.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,2,blah
something#that.com,bar2,foo3,1,blah
another#here.com,bar4,foo5,0,blah

How to work around open files limit when demuxing files?

I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that the rate of demuxing larger files (which correlates with more unique barcodes used) is significantly slower (10-20X). Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.
Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?
I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
Is there a better strategy?
I'm not married to awk, so approaches in other scripting or compiled languages are welcome.
Specific Example
Data Generation (FASTQ with barcodes)
The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18 character word using the non-ambiguous DNA alphabet.
1024 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq
16384 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "#batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
awk script for demultiplexing
Note that in this case I'm writing to 2 files for each unique barcode.
demux.awk
#!/usr/bin/awk -f
BEGIN {
if (length(outdir) == 0 || length(prefix) == 0) {
print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
exit 1;
}
print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
if (NR%4 == 1) {
match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
print bx[2] >> outdir"/"prefix"."bx[1]".umi";
}
print >> outdir"/"prefix"."bx[1]".fastq";
if (NR%40000 == 0) {
printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
}
}
END {
printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}
Usage
awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
or similarly for the cells.16K.fastq.
Assuming you're the only one running awk, you can verify the approximate number of open files using
lsof | grep "awk" | wc -l
Observed Behavior
Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.
Without seeing any sample input/output or the script you're currently executing it's very much guesswork but if you currently have the barcode in field 1 and are doing (assuming GNU awk so you don't have your own code managing the open files):
awk '{print > $1}' file
then if managing open files really is your problem you'll get a significant improvement if you change it to:
sort file | awk '$1!=f{close(f); f=$1} {print > f}'
The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., etc. since you haven't shown us any of that yet.
If that's not all you need then edit your question to include the missing MCVE.
Given your updated question with your script and the info that the input is 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record and using NUL to separate the 4-line blocks then using NUL as the record separator for sort and the subsequent awk:
$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"
mkdir -p "$outdir" || exit 1
awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
NR%4 == 1 { print $2 OFS $3 OFS }
{ print $0 (NR%4 ? RS : "\0") }
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
BEGIN {
if ( (outdir == "") || (prefix == "") ) {
print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
exit 1
}
print "[INFO] Initiating demuxing..." | "cat>&2"
outBase = outdir "/" prefix "."
}
{
bx1 = $1
bx2 = $2
fastq = $3
if ( bx1 != prevBx1 ) {
close(umiOut)
close(fastqOut)
umiOut = outBase bx1 ".umi"
fastqOut = outBase bx1 ".fastq"
prevBx1 = bx1
}
print bx2 > umiOut
print fastq > fastqOut
if (NR%10000 == 0) {
printf "[INFO] %d reads processed\n", NR | "cat>&2"
}
}
END {
printf "[INFO] %d total reads processed\n", NR | "cat>&2"
}
'
When run against input files generated as you describe in your question:
$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq
the results are:
$ time ./tst.sh cells.1K.fastq 2>/dev/null
real 0m55.333s
user 0m56.750s
sys 0m1.277s
$ ls cells.1K.fastq_out | wc -l
2048
$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total
$ time ./tst.sh cells.16K.fastq 2>/dev/null
real 1m6.815s
user 0m59.058s
sys 0m5.833s
$ ls cells.16K.fastq_out | wc -l
32768
$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total

mawk syntax appropriate for a >1000-field file for counting non-numeric data column-wise?

The following silly hard-coding of what ought to be some kind of loop or parallel construct, works nominally, but it is poor mawk syntax. My good mawk syntax attempts have all failed, using for loops in mawk (not shown) and gnu parallel (not shown).
It really needs to read the CSV file from disk just 1 time, not one time per column, because I have a really big CSV file (millions of rows, thousands of columns). My original code worked fine-ish (not shown) but it read the whole disk file again for every column and it was taking hours and I killed it after realizing what was happening. I have a fast solid state disk using a GPU connector slot so disk reads are blazing fast on this device. Thus CPU is the bottleneck here. Code sillyness is even more of a bottleneck if I have to hard-code 4000 lines of basically the same statements except for column number.
The code is column-wise making counts of non-numeric values. I need some looping (for-loop) or parallel (preferred) because while the following works correctly on 2 columns, it is not a scalable way to write mawk code for thousands of columns.
tail -n +1 pht.csv | awk -F"," '(($1+0 != $1) && ($1!="")){cnt1++}; (($2+0 != $2) && ($2!="")){cnt2++} END{print cnt1+0; print cnt2+0}'
2
1
How can the "column 1 processing; column 2 processing;" duplicate code be reduced? How can looping be introduced? How can gnu parallel be introduced? Thanks much. New to awk, I am. Not new to other languages.
I keep expecting some clever combo of one or more of the following bash commands is going to solve this handily, but here I am many hours later with nothing to show. I come with open hands. Alms for the code-poor?
seq 1 2 ( >>2 for real life CSV file)
tail (to skip the header or not as needed)
mawk (nice-ish row-wise CSV file processing, with that handy syntax I showed you in my demo for finding non-numerics easily in a supposedly all-numeric CSV datafile of jumbo dimensions)
tr (removes newline which is handy for transpose-ish operations)
cut (to grab a column at a time)
parallel (fast is good, and I have mucho cores needing something to work on, and phat RAM)
Sorry, I am absolutely required to not use CSV specific libraries like python pandas or R dataframes. My hands are tied here. Sorry. Thank you for being so cool about it. I can only use bash command lines in this case.
My mawk can handle 32000+ columns so NF is not a problem here, unlike some other awk I've seen. I have less than 32000 columns (but not by that much).
Datafile pht.csv contains the following 3x2 dataset:
cat pht.csv
8,T1,
T13,3,T1
T13,,-6.350818276405334473e-01
I don't have access to mawk, but you can do something equivalent to this:
awk -F, 'NR>1 {for(i=1;i<=NF;i++) if($i~/[[:alpha:]]/) a[i]++}
END {for(i=1;i in a;i++) print a[i]}' file
It shouldn't take more than a few minutes even for a million records.
For recognizing exponential notation the regex test is not going to work and you need to revert to the $1+0 != $1 test, as mentioned in the comments. Note that you don't have to check the null string separately.
None of the solutions so far parallelize. Let's change that.
Assume you have a solution that works in serial and can read from a pipe:
doit() {
# This solution gives 5-10 MB/s depending on system
# Changed so it now also treats '' as zero
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d*(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
export -f doit
doit() {
# Somewhat faster - but regards empty fields as zero
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit
To parallelize this we need to split the big file into chunks and pass each chunk to the serial solution:
# This will spawn a process for each core
parallel --pipe-part -a pht.csv --block -1 doit > blocksums
(You need version 20161222 or later to use '--block -1').
To deal with the header we compute the result of the header, but negate it:
head -n1 pht.csv | doit | perl -pe 's/(^|,)/$1-/g' > headersum
Now we can simply sum up the headersum and the blocksums:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = ",";print "#s\n" }'
Or if you prefer the output line by line:
cat headersum blocksums |
perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] }
END { $" = "\n";print "#s\n" }'
Is this what you're trying to do?
$ awk -v RS='[\n,]' '($1+0) != $1' file | sort | uniq -c
1 T1
2 T13
The above uses GNU awk for multi-char RS and should run in seconds for an input file like you describe. If you don't have GNU awk you could do:
$ tr ',' $'\n' < file | awk '($1+0) != $1' | sort | uniq -c
1 T1
2 T13
I'm avoiding the approach of using , as a FS since then you'd have to use $i in a loop which would cause awk to do field splitting for every input line which adds on time but you could try it:
$ awk -F, '{for (i=1;i<=NF;i++) if (($i+0) != $i) print $i}' file | sort | uniq -c
1 T1
2 T13
You could do the unique counting all in awk with an array indexed by the non-numeric values but then you potentially have to store a lot of data in memory (unlike with sort which uses temp swap files as necessary) so YMMV with that approach.
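For reference, a minimal sketch of that all-in-awk variant (same memory caveat; it assumes GNU awk for the multi-character RS, as above):
awk -v RS='[\n,]' '($1+0) != $1 && $1 != "" { cnt[$1]++ }
                   END { for (v in cnt) print cnt[v], v }' file
Output order is arbitrary here, unlike with sort | uniq -c.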
I solved it independently. What finally did it for me were the dynamic variable creation examples at the following URL: http://cfajohnson.com/shell/cus-faq-2.html#Q24
Here is the solution I developed. Note: I have added another column with some missing data for a more complete unit test. Mine is not necessarily the best solution; that is TBD. All I know at the moment is that it works correctly on the small CSV shown. The best solution will also need to run really fast on a 40 GB CSV file (not shown, haha).
$ cat pht.csv
8,T1,
T13,3,T1
T13,,0
$ tail -n +1 pht.csv | awk -F"," '{ for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }'
2
1
1
P.S. Honestly, I am not satisfied with my own answer. They say that premature optimization is the root of all evil; well, that maxim does not apply here. I really, really want GNU Parallel in there instead of the for-loop if possible, because I have a need for speed.
Final note: below I am sharing performance timings of the sequential and parallel versions, along with the best available unit-test dataset. Special thanks to Ole Tange for his big help in developing the code that uses his nice GNU Parallel command in this application.
Unit test datafile, final version:
$ cat pht2.csv
COLA99,COLB,COLC,COLD
8,T1,,T1
T13,3,T1,0.03
T13,,-6.350818276405334473e-01,-0.036
Timing on big data (not shown) for sequential version of column-wise non-numeric counts:
ga#ga-HP-Z820:/mnt/fastssd$ time tail -n +2 train_all.csv | awk -F"," '{ for(i=1; i<=NF; i++){ cnt[i]+=(($i+0)!=$i) && ($i!="") } } END { for(i=1;i<=NF;i++) print cnt[i] }' > /dev/null
real 35m37.121s
Timing on big data for parallel version of column-wise non-numeric counts:
# Correctness - 2 1 1 1 is the correct output.
#
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m14.253s
doit1() {
perl -F, -ane 'for(0..$#F) {
# Perl has no beautiful way of matching scientific notation
$s[$_] += $F[$_] !~ /^-?\d+(?:\.\d*)?(?:[eE][+\-]?\d+)?$/m
}
END { $" = ","; print "#s\n" }';
}
# pht2.csv: 2 1 1 1 :GOOD
# train_all.csv:
# real 1m59.960s
doit2() {
mawk -F"," '{
for(i=1;i<=NF;i++) { cnt[i]+=(($i+0)!=$i) && ($i!="") }
}
END { for(i=1;i<NF;i++) printf cnt[i]","; print cnt[NF] }';
}
export -f doit1
parallel --pipe-part -a "$fn" --block -1 doit1 > blocksums
if [ $csvheader -eq 1 ]
then
head -n1 "$fn" | doit1 | perl -pe 's/(^|,)/$1-/g' > headersum
cat headersum blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
else
cat blocksums | perl -F, -ane 'for(0..$#F) { $s[$_] += $F[$_] } END { $" = "\n";print "@s\n" }' > "$outfile"
fi
NEW: Here are the ROW-wise (not column-wise) counts in sequential code:
tail -n +2 train_all.csv | awk -F"," '{ cnt=0; for(i=1; i<=NF; i++){ cnt+=(($i+0)!=$i) && ($i!="") } print cnt; }' > train_all_cnt_nonnumerics_rowwwise.out.txt
Context: the project is machine learning, and this is part of a data exploration. A ~25x parallel speedup was seen on a dual-Xeon 32 virtual / 16 physical core shared-memory host using Samsung 950 Pro SSD storage: (32x60) seconds sequential time, 74 seconds parallel time. AWESOME!

Tiny utility decoding base64-encoded file names

For convenience and speed of debugging my R code, I decided to create a tiny AWK script. All it has to do is to decode all base64-encoded names of files (.RData) in a particular directory. I've tried my best in two attempts. The following are my results so far. Any help will be appreciated!
The first attempt is an AWK script embedded in a shell command:
ls -1 ../cache/SourceForge | awk 'BEGIN {FS="."; print ""} {printf("%s", $1); printf("%s", " -> "); print $1 | "base64 -d -"; print ""} END {print ""}'
The resulting output is close to what is needed; however, instead of printing each decoded filename on the same line as the original encoded one, this one-liner prints all the decoded names at the end of processing, with no output separator at all:
cHJqTGljZW5zZQ== ->
cHViUm9hZG1hcA== ->
dG90YWxEZXZz ->
dG90YWxQcm9qZWN0cw== ->
QWxsUHJvamVjdHM= ->
Y29udHJpYlBlb3BsZQ== ->
Y29udHJpYlByb2Nlc3M= ->
ZG1Qcm9jZXNz ->
ZGV2TGlua3M= ->
ZGV2U3VwcG9ydA== ->
prjLicensepubRoadmaptotalDevstotalProjectsAllProjectscontribPeoplecontribProcessdmProcessdevLinksdevSupport
The second attempt is the following self-contained AWK script:
#!/usr/bin/gawk -f
BEGIN {FS="."; print ""; files = "ls -1 ../cache/SourceForge"}
{
decode = "base64 -d -";
printf("%s", $1); printf("%s", " -> "); print $1 | decode; print ""
}
END {print ""}
However, this script's behavior is surprising in that, firstly, it waits for input, and, secondly, upon receiving EOF (Ctrl-D), it doesn't produce any output.
A mostly bash solution:
for f in ../cache/SourceForge/*; do
base=$(basename $f .RData)
echo "$base => $(base64 -d <<<$base)"
done
Or, using more bash:
for f in ../cache/SourceForge/*; do
f=${f##*/}; f=${f%%.*}
echo "$f => $(base64 -d <<<$f)"
done
In both cases, you could use ../cache/SourceForge/*.RData to be more specific about which filenames you want. In the second one, using f=${f%.*} will cause only one extension to be removed. Or f=${f%.RData} will cause only the .RData extension to be removed. But it probably makes little difference in that specific application.
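For instance, with one of the encoded names above (a hypothetical path, just to show the difference):
f=../cache/SourceForge/cHJqTGljZW5zZQ==.RData
f=${f##*/}           # cHJqTGljZW5zZQ==.RData  (directory part stripped)
echo "${f%%.*}"      # cHJqTGljZW5zZQ==        (everything from the first dot stripped)
echo "${f%.RData}"   # cHJqTGljZW5zZQ==        (only the .RData extension stripped)
The two expansions coincide here because the name contains a single dot; they would differ for names with embedded dots.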
while read
do
base64 -d <<< $REPLY
echo
done < infile.txt
Result
prjLicense
pubRoadmap
totalDevs
totalProjects
AllProjects
contribPeople
contribProcess
dmProcess
devLinks
devSupport
You need to close the process you are writing to between each line or awk sends all the printed lines to the same process (and it only prints output when it finishes I guess). Add close("base64 -d -") to the end of that action block (same exact command string). For example:
ls | awk -F. '{ printf("%25s -> ", $1); print $1 | "base64 -d -"; close("base64 -d -"); print "" }'
Your second snippet isn't running that ls command. It is just assigning it to a variable and doing nothing with that. You need to pipe the output from ls to awk -f <yourscript> or ./your-script.awk or similar to get it to work. (This is why it is waiting for input from you by the way, you haven't given it any.)
To actually run the ls from awk you need to use getline.
Something like awk 'BEGIN {while ( ("ls -1" | getline) > 0 ) {print}}'
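Putting the two fixes together (the close() call and feeding ls via getline), a sketch of a self-contained gawk script for this task might look like the following; it is untested here and assumes the same ../cache/SourceForge directory:
#!/usr/bin/gawk -f
BEGIN {
    list   = "ls -1 ../cache/SourceForge"
    decode = "base64 -d -"
    # read each filename from ls and decode its first dot-separated part
    while ((list | getline line) > 0) {
        split(line, parts, ".")
        printf("%s -> ", parts[1])
        fflush()            # flush our prefix before base64 writes its output
        print parts[1] | decode
        close(decode)       # run base64 now so the decoded name appears on this line
        print ""
    }
    close(list)
}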

Calling an executable program using awk

I have a program in C that I want to call by using awk in shell scripting. How can I do something like this?
From the AWK man page:
system(cmd)
executes cmd and returns its exit status
The GNU AWK manual also has a section that, in part, describes the system function and provides an example:
system("date | mail -s 'awk run done' root")
A much more robust way would be to use awk's getline function to read the output of a command from a pipe into a variable. In the form cmd | getline result, cmd is run, then its output is piped to getline. It returns 1 if it got output, 0 on EOF, and -1 on failure.
First construct the command to run in a variable in the BEGIN clause if the command is not dependent on the contents of the file, e.g. a simple date or an ls.
A simple example of the above would be
awk 'BEGIN {
cmd = "ls -lrth"
while ( ( cmd | getline result ) > 0 ) {
print result
}
close(cmd);
}'
When the command to run depends on the columnar content of a file, you generate the cmd string in the main {..} block, as below. E.g. consider a file whose $2 contains a filename that you want replaced by the md5sum of that file. You can do:
awk '{ cmd = "md5sum "$2
while ( ( cmd | getline md5result ) > 0 ) {
$2 = md5result
}
close(cmd);
}1'
Another frequent use of external commands in awk is date processing, when your awk does not support time functions like mktime() and strftime() out of the box.
Consider a case where you have a Unix epoch timestamp stored in a column and you want to convert it to a human-readable date format. Assuming GNU date is available:
awk '{ cmd = "date -d #" $1 " +\"%d-%m-%Y %H:%M:%S\""
while ( ( cmd | getline fmtDate) > 0 ) {
$1 = fmtDate
}
close(cmd);
}1'
For an input line such as
1572608319 foo bar zoo
the above command produces the output
01-11-2019 07:38:39 foo bar zoo
The command can be tailored to modify the date fields in any of the columns of a given line. Note that -d is a GNU-specific extension; the *BSD variants support -f (though it is not exactly the same as -d).
More information about getline can be found in the AllAboutGetline article on the awk.freeshell.org page.
There are several ways.
awk has a system() function that will run a shell command:
system("cmd")
You can print to a pipe:
print "blah" | "cmd"
You can have awk construct commands, and pipe all the output to the shell:
awk 'some script' | sh
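A small, hedged illustration of that last form (made-up numbers, harmless echo commands):
printf '1 2\n3 4\n' | awk '{ print "echo sum is", $1+$2 }' | sh
# awk emits the commands "echo sum is 3" and "echo sum is 7"; sh runs them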
Something as simple as this will work
awk 'BEGIN{system("echo hello")}'
and
awk 'BEGIN { system("date"); close("date")}'
I use the power of awk to delete some of my stopped docker containers. Observe carefully how I construct the cmd string first before passing it to system.
docker ps -a | awk '$3 ~ "/bin/clish" { cmd="docker rm "$1;system(cmd)}'
Here, I match the 3rd column against the pattern "/bin/clish", then extract the container ID from the first column to construct my cmd string and pass that to system.
It really depends :) One of the handy Linux utilities is xargs. If you are using awk you probably have a more involved use case in mind - your question is not very detailed.
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
Will execute touch 2 4. Here touch could be replaced by your program. More info at info xargs and man xargs (really, read these).
I believe you would like to replace touch with your program.
Breakdown of the aforementioned script:
printf "1 2\n3 4"
# Output:
1 2
3 4
# The pipe (|) makes the output of the left command the input of
# the right command (simplified)
printf "1 2\n3 4" | awk '{ print $2 }'
# Output (of the awk command):
2
4
# xargs will execute a command with arguments. The arguments
# are made up taking the input to xargs (in this case the output
# of the awk command, which is "2 4").
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
# No output, but executes: `touch 2 4` which will create (or update
# timestamp if the files already exist) files with the name "2" and "4"
Update: In the original answer, I used echo instead of printf. However, printf is the better and more portable alternative, as was pointed out by a comment (where great links with discussions can be found).
#!/usr/bin/awk -f
BEGIN {
command = "ls -lh"
command | getline    # reads the first line of the command's output into $0
}
Runs "ls -lh" in an awk script
You can easily call commands with parameters via the system function.
For example, to kill jobs matching a certain string (there are other ways, of course):
ps aux | grep my_searched_string | awk '{system("kill " $2)}'
I was able to get this done via the method below:
cat ../logs/em2.log.1 | grep -i 192.168.21.15 | awk '{system("date"); print $1}'
awk has a function called system; it enables you to execute any Linux shell command from within awk.
