For convenience and speed of debugging my R code, I decided to create a tiny AWK script. All it has to do is to decode all base64-encoded names of files (.RData) in a particular directory. I've tried my best in two attempts. The following are my results so far. Any help will be appreciated!
The first attempt is an AWK script embedded in a shell command:
ls -1 ../cache/SourceForge | awk 'BEGIN {FS="."; print ""} {printf("%s", $1); printf("%s", " -> "); print $1 | "base64 -d -"; print ""} END {print ""}'
The resulting output is close to what is needed; however, instead of printing each decoded filename on the same line as the original encoded one, this one-liner prints all decoded names at the end of processing with no output separator at all:
cHJqTGljZW5zZQ== ->
cHViUm9hZG1hcA== ->
dG90YWxEZXZz ->
dG90YWxQcm9qZWN0cw== ->
QWxsUHJvamVjdHM= ->
Y29udHJpYlBlb3BsZQ== ->
Y29udHJpYlByb2Nlc3M= ->
ZG1Qcm9jZXNz ->
ZGV2TGlua3M= ->
ZGV2U3VwcG9ydA== ->
The second attempt is the following self-contained AWK script:
#!/usr/bin/gawk -f
BEGIN {FS="."; print ""; files = "ls -1 ../cache/SourceForge"}
{
    decode = "base64 -d -";
    printf("%s", $1); printf("%s", " -> "); print $1 | decode; print ""
}
END {print ""}
However, this script's behavior is surprising in that, firstly, it waits for input and, secondly, upon receiving EOF (Ctrl-D), it doesn't produce any output.
A mostly bash solution:
for f in ../cache/SourceForge/*; do
    base=$(basename "$f" .RData)
    echo "$base => $(base64 -d <<<"$base")"
done
Or, using more bash:
for f in ../cache/SourceForge/*; do
    f=${f##*/}; f=${f%%.*}
    echo "$f => $(base64 -d <<<"$f")"
done
In both cases, you could use ../cache/SourceForge/*.RData to be more specific about which filenames you want. In the second one, using f=${f%.*} will cause only one extension to be removed. Or f=${f%.RData} will cause only the .RData extension to be removed. But it probably makes little difference in that specific application.
while read
do
    base64 -d <<< "$REPLY"
done < infile.txt
You need to close the process you are writing to between each line or awk sends all the printed lines to the same process (and it only prints output when it finishes I guess). Add close("base64 -d -") to the end of that action block (same exact command string). For example:
ls | awk -F. '{ printf("%25s -> ", $1); print $1 | "base64 -d -"; close("base64 -d -"); print "" }'
Your second snippet isn't running that ls command. It is just assigning it to a variable and doing nothing with that. You need to pipe the output from ls to awk -f <yourscript> or ./your-script.awk or similar to get it to work. (This is why it is waiting for input from you by the way, you haven't given it any.)
To actually run the ls from awk you need to use getline.
Something like awk 'BEGIN {while ( ("ls -1" | getline) > 0 ) {print}}'
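Putting the two points together, one option is to read each decoded name back with getline, so that both halves of every line are printed by awk itself. This is only a minimal sketch, not the asker's script: decode_names is a hypothetical helper, it assumes a base64 that accepts -d (as GNU coreutils does), and it relies on base64 text containing no characters that are special inside double quotes.

```shell
# decode_names DIR: hypothetical helper; lists DIR and prints "<encoded> -> <decoded>"
decode_names() {
    ls -1 "$1" | awk -F. '{
        # Build one command per name; $1 is the part before the first dot.
        cmd = "printf \"%s\" \"" $1 "\" | base64 -d"
        decoded = ""
        cmd | getline decoded      # read the decoded text back into awk
        close(cmd)                 # same command string, so the pipe is released
        printf "%s -> %s\n", $1, decoded
    }'
}
```

It could be called as decode_names ../cache/SourceForge. Because the command is closed after every record, each name gets a fresh base64 process.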
I've got a file with several columns, like so:
13:46:48 user1
13:46:49 user2
13:48:07 user3
I'd like to transform one of the columns by passing it as input to a program:
echo "" | transformExternalIp
I wrote a small bit of awk to do this:
awk '{ ("echo " $2 " | transformExternalIp") | getline output; $2=output; print}'
But what I got surprised me. Initially, it looked like it was working as expected, but then I started to see weird repeated values. In order to debug, I removed my fancy "transformExternalIp" program in case it was the problem and replaced it with echo and cat, which means literally nothing should change:
awk '{ ("echo " $2 " | cat") | getline output; print $2 " - " output}' connections.txt
For the first thousand lines or so, the left and right sides matched, but then after that, the right side frequently stopped changing: - - -
# .... (okay for a long while) - - - -
What the heck have I done wrong? I'm guessing that I'm misunderstanding something about awk.
Close the command after each invocation to ensure a new copy of the command is run for the next set of input, e.g.:
awk '{ ("echo " $2 " | transformExternalIp") | getline output
       close("echo " $2 " | transformExternalIp")
       $2 = output; print }' connections.txt
# or, to reduce issues from making a typo:
awk '{ cmd = "echo " $2 " | transformExternalIp"
       (cmd) | getline output
       close(cmd)
       $2 = output; print }' connections.txt
For more details see this and this.
During my testing with a dummy script (echo $RANDOM; sleep .1) I could generate similar results as OP ... some good/expected lines and then a bunch of duplicates.
I noticed that as soon as the duplicates started occurring, the dummy script wasn't actually being called any more and instead awk was treating the system call as a static result (i.e., it kept re-using the value from the last 'good' call); it was quite noticeable because the sleep .1 was no longer being called, so the output from the awk script sped up significantly.
Can't say that I understand 100% what's happening under the covers ... perhaps an issue with how the script (my dummy script; OP's transformExternalIp) behaves with multiple lines of input when expecting one line of input ... an issue with a limit on the number of open/active process handles ... shrug
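One way to make this kind of failure visible is to check what getline returns instead of trusting the variable, since getline leaves the variable untouched when it reads nothing. A hedged sketch follows: transform_column is a hypothetical wrapper, its argument merely stands in for transformExternalIp, and it assumes the values in column 2 contain no shell metacharacters.

```shell
# transform_column CMD: hypothetical wrapper; pipes column 2 of each input line through CMD
transform_column() {
    awk -v prog="$1" '{
        cmd = "printf \"%s\\n\" \"" $2 "\" | " prog
        status = (cmd | getline output)   # 1 = got a line, 0 = EOF, -1 = error
        close(cmd)                        # release the pipe before the next record
        if (status <= 0) output = "<no output>"
        print $2 " - " output
    }'
}

# tr is used here only as a stand-in transformation
printf '13:46:48 alpha\n13:46:49 beta\n' | transform_column "tr a-z A-Z"
```

With the status check in place, a record whose command produced nothing is flagged rather than silently repeating the previous value.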
("echo " $2 " | cat") creates a fork almost every time that you use it.
Then, when the above instruction reaches some kind of fork limit, the output variable isn't updated by getline anymore; that's what's happening here.
If you're using GNU awk then you can fix the issue with a Coprocess:
awk '
    BEGIN { cmd = "cat" }
    {
        print $2 |& cmd
        cmd |& getline output
        print $2 " - " output
    }
' connections.txt
Awk is a tool to manipulate text. A shell is a tool to sequence calls to other tools. Don't use a shell to call awk to sequence calls to transformExternalIp as if it were a shell, just use a shell:
while read -r _ old_ip _; do
new_ip=$(printf '%s\n' "$old_ip" | transformExternalIp)
printf '%s - %s\n' "$old_ip" "$new_ip"
done < connections.txt
When you're using awk you're spawning a subshell for every call to transformExternalIp so it's no more efficient (probably a bit less efficient) than just staying in shell.
Using Bash, I'm wanting to get a list of email addresses from a CSV file to do a recursive grep search on it for a bunch of directories looking for a match in specific metadata XML files, and then also tallying up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
,21/04/2015,,,"blah blah",5
,17/06/2015,,,"lah yah",0
,7/08/2017,,,"wah wah",1
For example, if we put in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost how to do the looping and also the writing of the result back to the CSV file...
So for instance, if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:,7/08/2017,,,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<description>example <br /></description>
<title>Some Title Name Goes Here</title>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator][/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
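As a small illustration of assigning to a field (a sketch only; set_tally and the sample row are made up for the example): setting OFS alongside -F matters, because awk rebuilds the line with OFS once a field has been assigned.

```shell
# set_tally N: hypothetical helper; sets the 6th field of every data row to N
set_tally() {
    awk -F, -v OFS=, -v n="$1" 'NR > 1 { $6 = n } 1'
}

printf 'email,date,bar,URL,note,tally\nsomeone,7/08/2017,,,wah,1\n' | set_tally 17
```

The header row is left alone by the NR > 1 condition, and the trailing 1 prints every line, changed or not.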
awk is a great tool for many scenarios. You can probably save a lot of time in the future by investing a few minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
The following script sets each tally to the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
    cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
    cmd | getline c;
    close(cmd);
    $6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w option to your grep command.
-F searches literal strings; without it, searching for a.b@c would give false positives for things like axb@c and a-b@c.
-w matches only whole words; without it, searching for b@c would give a false positive for ab@c. This isn't 100% safe, as a-b@c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
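The effect of the two options can be checked on a throwaway file. This is just a sketch with made-up, email-like strings; grep_counts is a hypothetical helper that prints how many of the three sample lines match with no option, with -F, and with -Fw.

```shell
# grep_counts: hypothetical demo; prints three match counts separated by spaces
grep_counts() {
    sample=$(mktemp)
    printf '%s\n' 'axb@c' 'a.b@c' 'xa.b@c' > "$sample"
    plain=$(grep -c 'a.b@c' "$sample")     # "." acts as a regex wildcard here
    fixed=$(grep -cF 'a.b@c' "$sample")    # literal string match
    word=$(grep -cFw 'a.b@c' "$sample")    # literal and whole-word match
    rm -f "$sample"
    echo "$plain $fixed $word"
}

grep_counts
```

Each added option should lower the count, since it rules out one kind of false positive.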
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
    NR == FNR {
        # store the filenames for each email
        if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
        next
    }
    FNR > 1 {$6 = length(tally[$1])}
    1
' - accounts.csv
Here is a solution using a single awk command. It should perform much better than the other solutions because it scans each XML file only once for all the email addresses found in the first column of the CSV file. It also does not invoke any external command or spawn a subshell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
    tmp = s
    gsub(/[&+.]/, "\\\\&", tmp)
    return tmp
}
BEGIN { FS = OFS = "," }
# while processing csv file
NR == FNR {
    # save escaped email address in array em, skipping the header row
    if (FNR > 1)
        em[esc($1)] = 0
    # save each row in rec array
    rec[++n] = $0
    next
}
# this block will execute for each XML file
{
    # loop over each email and save the count of matches in array em
    # PS: gsub returns the number of substitutions
    for (i in em)
        em[i] += gsub(i, "&")
}
END {
    # print header row
    print rec[1]
    # from 2nd row onwards split row into columns using comma
    for (i=2; i<=n; ++i) {
        split(rec[i], a, FS)
        # 6th column is the count of occurrences from array em
        print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
    }
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and writes the updated data to a copy of the file for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=accounts.new.csv   # output file (name assumed)
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
    # Skip first line
    if [[ $x -gt 1 ]]; then
        # Read data into variables
        IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
        cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
        # Reset tally
        tally=${cnt}
        # Change line number $x in new file
        sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
            -i ${file_new}
    fi
    ((x++))
done < ${file_old}
The input and output:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
$ cat accounts.csv
# output
$ ./
$ cat
I have a list of CSV files, I have to print a variable name (dynamically; it will change), to last column in the CSV files.
Here is the code:
addProgramtypeID () {
    for csv in $1
    do
        echo $file_name
        f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
        echo $f
        k=`grep -i $f Program_type.csv | cut -d ',' -f3`
        echo $k
        awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
    done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now, the value of the variable k is being printed in the 1st column of the CSV file, and it also removes the first 2 characters of the first column. My requirement is that the variable's value should always appear as the last column in the CSV file.
input :
if suppose $k=2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
    cat "$csv_file" | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last column of all the CSV files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side. It is completely unaware of things like quoted strings or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything) and then start tweaking.
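To see the quoted-string problem concretely, here is a tiny sketch (last_field is a hypothetical helper; the sample rows are invented). A comma inside quotes is treated like any other separator, so the "last field" of the second row is only the tail of the quoted value.

```shell
# last_field: hypothetical helper; prints the last comma-separated field of each line
last_field() {
    awk -F, '{ print $NF }'
}

printf 'a,b,c\n' | last_field       # plain row
printf 'a,"b,c"\n' | last_field     # row with a quoted comma: the quoted value is split
```

If your files contain such rows, a CSV-aware tool is a safer choice than a bare -F, split.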
It looks like what you want is just:
$ cat
addProgramtypeID () {
    local csv="$1"
    awk -v csv="$csv" '
        BEGIN{ FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
        NR==FNR { if ($0 ~ f) { k = $3 }; next }
        { print $0, k }
    ' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
$ cat T_H_EDC.csv
$ ./
$ cat T_H_EDC.csv
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f), I just copied what your original grep f <file> logic would do.
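A self-contained sketch of the same two-file lookup, using the stricter $1 == key test, might look like the following. Everything here is an assumption made for illustration: append_program_id is a hypothetical name, the lookup file is assumed to hold the program name in column 1 and the ID in column 3, and the sub() call is only a guess that the symptom in the question (the value landing at the start of the line) comes from Windows line endings.

```shell
# append_program_id LOOKUP DATA KEY: hypothetical helper; appends the ID found
# for KEY in LOOKUP (column 3) as a new last column of every line in DATA
append_program_id() {
    awk -v key="$3" 'BEGIN { FS = OFS = "," }
        { sub(/\r$/, "") }                          # drop a trailing carriage return, if any
        NR == FNR { if ($1 == key) id = $3; next }  # first file: remember the ID
        { print $0, id }                            # second file: append it
    ' "$1" "$2"
}
```

Redirecting the output to a temporary file and moving it over the original, as in the answer above, would update the CSV in place.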
I have a program in C that I want to call by using awk in shell scripting. How can I do something like this?
From the AWK man page:
system(cmd): executes cmd and returns its exit status
The GNU AWK manual also has a section that, in part, describes the system function and provides an example:
system("date | mail -s 'awk run done' root")
A much more robust way is to use awk's getline to read a command's output from a pipe into a variable. In the form cmd | getline result, cmd is run and its output is piped to getline, which returns 1 if it read a record, 0 on end of file, and -1 on failure.
First construct the command to run in a variable in the BEGIN clause if the command does not depend on the contents of the file, e.g. a simple date or an ls.
A simple example of the above would be
awk 'BEGIN {
    cmd = "ls -lrth"
    while ( ( cmd | getline result ) > 0 ) {
        print result
    }
    close(cmd)
}'
When the command to run is part of the columnar content of a file, you generate the cmd string in the main {..} as below. E.g. consider a file whose $2 contains the name of the file and you want it to be replaced with the md5sum hash content of the file. You can do
awk '{ cmd = "md5sum "$2
       while ( ( cmd | getline md5result ) > 0 ) {
           $2 = md5result
       }
       close(cmd)
     }1'
Another frequent use of external commands in awk is date processing, when your awk lacks built-in time functions such as mktime() and strftime().
Consider a case when you have Unix EPOCH timestamp stored in a column and you want to convert that to a human readable date format. Assuming GNU date is available
awk '{ cmd = "date -d @" $1 " +\"%d-%m-%Y %H:%M:%S\""
       while ( ( cmd | getline fmtDate) > 0 ) {
           $1 = fmtDate
       }
       close(cmd)
     }1'
for an input string as
1572608319 foo bar zoo
the above command produces an output as
01-11-2019 07:38:39 foo bar zoo
The command can be tailored to modify the date fields on any of the columns in a given line. Note that -d is a GNU specific extension, the *BSD variants support -f ( though not exactly similar to -d).
More information about getline can be found in the AllAboutGetline article.
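The return values described above can be observed directly. In this sketch, getline_status is a hypothetical helper that runs one command and prints getline's return value followed by whatever was read; it is meant as a quick experiment rather than production code.

```shell
# getline_status CMD: hypothetical helper; prints "<return value>:<line read>"
getline_status() {
    awk -v cmd="$1" 'BEGIN {
        status = (cmd | getline line)   # 1 = record read, 0 = end of file, -1 = failure
        close(cmd)
        print status ":" line
    }'
}

getline_status "echo hello"   # a command that prints one line
getline_status "true"         # a command that prints nothing
```

Note that a command which runs but prints nothing yields 0, not -1, so checking for > 0 is what guards the loop bodies shown above.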
There are several ways.
awk has a system() function that will run a shell command:
You can print to a pipe:
print "blah" | "cmd"
You can have awk construct commands, and pipe all the output to the shell:
awk 'some script' | sh
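The three approaches can be tried side by side. This is a minimal sketch with made-up helper names (via_system, via_pipe, via_sh); tr and echo simply stand in for whatever program you actually want to call.

```shell
# 1. system(): runs the command and hands back its exit status
via_system() {
    awk 'BEGIN { rc = system("exit 3"); print rc }'
}

# 2. print to a pipe: awk output becomes the input of the command
via_pipe() {
    awk 'BEGIN { print "blah" | "tr a-z A-Z"; close("tr a-z A-Z") }'
}

# 3. let awk write shell commands and have sh execute them
via_sh() {
    printf 'one\ntwo\n' | awk '{ print "echo got " $1 }' | sh
}
```

The third form is the easiest to debug, because dropping the final | sh shows exactly which commands would be run.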
Something as simple as this will work
awk 'BEGIN{system("echo hello")}'
awk 'BEGIN { system("date"); close("date")}'
I use the power of awk to delete some of my stopped docker containers. Observe carefully how I construct the cmd string first before passing it to system.
docker ps -a | awk '$3 ~ "/bin/clish" { cmd="docker rm "$1;system(cmd)}'
Here, I use the 3rd column having the pattern "/bin/clish" and then I extract the container ID in the first column to construct my cmd string and passed that to system.
It really depends :) One of the handy linux core utils (info coreutils) is xargs. If you are using awk you probably have a more involved use-case in mind - your question is not very detailed.
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
Will execute touch 2 4. Here touch could be replaced by your program. More info at info xargs and man xargs (really, read these).
I believe you would like to replace touch with your program.
Breakdown of the aforementioned script:
printf "1 2\n3 4"
# Output:
1 2
3 4
# The pipe (|) makes the output of the left command the input of
# the right command (simplified)
printf "1 2\n3 4" | awk '{ print $2 }'
# Output (of the awk command):
2
4
# xargs will execute a command with arguments. The arguments
# are made up taking the input to xargs (in this case the output
# of the awk command, which is "2 4").
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
# No output, but executes: `touch 2 4` which will create (or update
# timestamp if the files already exist) files with the name "2" and "4"
Update In the original answer, I used echo instead of printf. However, printf is the better and more portable alternative as was pointed out by a comment (where great links with discussions can be found).
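To preview what xargs would run without creating any files, echo can be put in front of the command. The sketch below uses a hypothetical helper, second_column, and also shows -n 1, which makes xargs invoke the command once per argument instead of once for all of them.

```shell
# second_column: hypothetical helper; prints the second field of each input line
second_column() {
    awk '{ print $2 }'
}

printf '1 2\n3 4\n' | second_column | xargs echo touch        # one call with all arguments
printf '1 2\n3 4\n' | second_column | xargs -n 1 echo touch   # one call per argument
```

Once the printed commands look right, removing echo makes xargs execute them for real.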
#!/usr/bin/awk -f
BEGIN {
    command = "ls -lh"
    while ((command | getline line) > 0) print line
    close(command)
}
Runs "ls -lh" in an awk script
You can easily call a command with parameters via system().
For example, to kill the jobs matching a certain string (there are other ways, of course):
ps aux | grep my_searched_string | awk '{system("kill " $2)}'
I was able to have this done via below method
cat ../logs/em2.log.1 |grep -i |awk '{system("date"); print $1}'
awk has a function called system() that lets you run any shell command from within an awk program.
In bash, is there a way to chain multiple commands, all taking the same input from stdin? That is, one command reads stdin, does some processing, writes the output to a file. The next command in the chain gets the same input as what the first command got. And so on.
For example, consider a large text file to be split into multiple files by filtering the content. Something like this:
cat food_expenses.txt | grep "coffee" > coffee.txt | grep "tea" > tea.txt | grep "honey cake" > cake.txt
This obviously does not work, because the second grep gets the first grep's output, not the original text file. I tried inserting tee's but that does not help. Is there some bash magic that can cause the first grep to send its input to the pipe, not the output?
And by the way, splitting a file was a simple example. Consider splitting (filtering by pattern search) a continuous live text stream coming over a network and writing the output to different named pipes or sockets. I would like to know if there is an easy way to do it using a shell script.
(This question is a cleaned-up version of my earlier one, based on responses that pointed out that it was unclear.)
For this example, you should use awk as semiuseless suggests.
But in general to have N arbitrary programs read a copy of a single input stream, you can use tee and bash's process output substitution operator:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt)
Note that >(command) is a bash extension.
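Where process substitution is not available, or where the script must wait until every reader has finished, the same fan-out can be sketched with named pipes. This is an illustration under stated assumptions, not the answer's method: split_stream is a hypothetical function, the output directory must already exist, and the search words must not contain whitespace.

```shell
# split_stream OUTDIR WORD...: hypothetical helper; copies stdin to one grep per
# WORD and writes the matching lines to OUTDIR/WORD.txt
split_stream() {
    outdir=$1; shift
    fifos=""
    for word in "$@"; do
        fifo="$outdir/$word.fifo"
        mkfifo "$fifo"
        # each reader runs in the background and blocks until tee opens the fifo
        grep -- "$word" < "$fifo" > "$outdir/$word.txt" &
        fifos="$fifos $fifo"
    done
    tee $fifos > /dev/null   # duplicate stdin into every fifo
    wait                     # let all greps finish before returning
    rm -f $fifos
}
```

It could be used as split_stream /tmp/out coffee tea < food_expenses.txt, after which /tmp/out/coffee.txt and /tmp/out/tea.txt hold the matching lines.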
The obvious question is why do you want to do this within one command ?
If you don't want to write a script, and you want to run stuff in parallel, bash supports the concept of subshells, and these can run in parallel. By putting each command in parentheses and backgrounding it with &, you can run your greps (or whatever) concurrently, e.g.
$ (grep coffee food_expenses.txt > coffee.txt) & (grep tea food_expenses.txt > tea.txt) & wait
Note that in the above your cat may be redundant since grep takes an input file argument.
You can (instead) play around with redirecting output through different streams. You're not limited to stdout/stderr but can assign new streams as required. I can't advise more on this other than direct you to examples here
I like Stephen's idea of using awk instead of grep.
It ain't pretty, but here's a command that uses output redirection to keep all data flowing through stdout:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} {print $0}' 2> tea.txt
As you can see, it uses awk to send all lines matching 'coffee' to stderr, and all lines regardless of content to stdout. Then stderr is fed to a file, and the process repeats with 'tea'.
If you wanted to filter out content at each step, you might use this:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} $0 !~ /coffee/ {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} $0 !~ /tea/ {print $0}' 2> tea.txt
You could use awk to split into up to two files:
awk '/Coffee/ { print "Coffee" } /Tea/ { print "Tea" > "/dev/stderr" }' inputfile > coffee.file.txt 2> tea.file.txt
I am unclear why the filtering needs to be done in different steps. A single awk program can scan all the incoming lines, and dispatch the appropriate lines to individual files. This is a very simple dispatch that can feed multiple secondary commands (i.e. persistent processes that monitor the output files for new input, or the files could be sockets that are setup ahead of time and written to by the awk process.).
If there is a reason to have every filter see every line, then just remove the "next;" statements, and every filter will see every line.
$ cat split.awk
/^coffee/ {
    print $0 >> "/tmp/coffee.txt" ;
    next;
}
/^tea/ {
    print $0 >> "/tmp/tea.txt" ;
    next;
}
{ # default
    print $0 >> "/tmp/other.txt" ;
}
END {}
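For comparison, the variant in which every filter sees every line (no next statements) can be sketched as below. The function name dispatch and its output-directory parameter are assumptions made for the example, and the patterns are left unanchored so that a line mentioning both words lands in both files.

```shell
# dispatch OUTDIR: hypothetical helper; reads stdin and copies each line to
# every file whose pattern it matches
dispatch() {
    awk -v dir="$1" '
        /coffee/ { print > (dir "/coffee.txt") }
        /tea/    { print > (dir "/tea.txt") }
    '
}
```

A call such as dispatch /tmp < food_expenses.txt would then write /tmp/coffee.txt and /tmp/tea.txt in a single pass.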
Here are two bash scripts without awk. The second one doesn't even use grep!
With grep:
tail -F food_expenses.txt | \
while read line; do
    for word in "coffee" "tea" "honey cake"; do
        if [[ $line != ${line#*$word*} ]]; then
            echo "$line"|grep "$word" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Without grep:
tail -F food_expenses.txt | \
while read line; do
    for word in "coffee" "tea" "honey cake"; do
        if [[ $line != ${line#*$word*} ]]; then # does the line contain the word?
            echo "$line" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Here's an AWK method:
awk 'BEGIN {
    list = "coffee tea";
    split(list, patterns)
}
{
    for (pattern in patterns) {
        if ($0 ~ patterns[pattern]) {
            print > (patterns[pattern] ".txt")
        }
    }
}' food_expenses.txt
Working with patterns which include spaces remains to be resolved.
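One possible way to handle patterns containing spaces is to separate the list with a character that cannot appear in a pattern and to derive the file name separately. The sketch below is only one option: split_by_patterns is a hypothetical function, the comma separator and the underscore in file names are arbitrary choices, and index() is used so that patterns are matched as plain substrings rather than regular expressions.

```shell
# split_by_patterns OUTDIR LIST: hypothetical helper; LIST is a comma-separated
# list of literal patterns, e.g. "coffee,tea,honey cake"
split_by_patterns() {
    awk -v dir="$1" -v list="$2" '
        BEGIN { n = split(list, patterns, ",") }
        {
            for (i = 1; i <= n; i++) {
                if (index($0, patterns[i])) {      # literal substring test
                    name = patterns[i]
                    gsub(/ /, "_", name)           # "honey cake" -> honey_cake.txt
                    print > (dir "/" name ".txt")
                }
            }
        }'
}
```

For example, split_by_patterns . "coffee,tea,honey cake" < food_expenses.txt would write coffee.txt, tea.txt and honey_cake.txt in the current directory.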
You can probably write a simple AWK script to do this in one shot. Can you describe the format of your file a little more?
Is it space/comma separated?
do you have the item descriptions on a specific 'column' where columns are defined by some separator like space, comma or something else?
If you can afford multiple grep runs this will work,
grep coffee food_expenses.txt > coffee.txt
grep tea food_expenses.txt > tea.txt
and, so on.
Assuming that your input is not infinite (as in the case of a network stream that you never plan on closing) I might consider using a subshell to put the data into a temp file, and then a series of other subshells to read it. I haven't tested this, but maybe it would look something like this
( cat inputstream > tempfile )
( grep tea tempfile > tea.txt )
( grep coffee tempfile > coffee.txt )
I'm not certain of an elegant solution to the file getting too large if your input stream is not bounded in size however.