getting errors on simple awk script for-in declaration - bash

I wrote a simple awk function to print every element of an array, but awk returns the following errors and I don't know what could be wrong.
function print_streaks_info(arr) {
for(index in arr){
print "starting index: " index
}
}
awk: streak_script.awk: line 2: syntax error at or near in
awk: streak_script.awk: line 5: syntax error at or near }
I didn't try much since I am just starting out with awk, and the code is copied and pasted from a tutorial.

As mark-fuso stated in the comments, the issue is your use of the name "index", which awk reserves for a built-in function that returns the position of a substring within a string.
If you replace every instance of "index" in your script with "indx", the script will work without problems.
Below is a demo script incorporating your original function (modified) and using Carlo Costa's test array (also modified):
echo "" |
awk '
function print_streaks_info(arr){
	for( indx in arr ){
		print "starting index: " indx ;
	} ;
}
{
	Customer["14587"] = "Neil Johnson" ;
	Customer["8953"] = "Ella binte Nazir" ;
	Customer["23455"] = "Bruce Hyslop" ;
	Customer["6335"] = "Isabella" ;
	print_streaks_info(Customer) ;
}'
echo -e "\n NOTE: The 'in' operator reports the array indexes in an order chosen by awk,\n\t NOT necessarily in the order in which they were assigned.\n"
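To see why the name is rejected, it can help to call index() directly. The snippet below is only an illustrative sketch (the wrapper name show_index and the sample strings are made up here): it prints the position at which the second string first occurs in the first.

```shell
# Hypothetical helper: demonstrates that index() is a built-in awk function.
show_index() {
    # index(s, t) returns the 1-based position of t within s, or 0 if absent.
    awk 'BEGIN { print index("streak_script", "script") }'
}

show_index
```

Because awk already treats index as a function name, it cannot also serve as the loop variable in for (index in arr).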


Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like the output to look like, using my little example:
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
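If building the echo string by hand feels fragile, one alternative (a sketch, not the method used in this answer) is to let awk feed the sequence to the command through a pipe, so no shell quoting of s is needed. print ... | cmd and close() are standard awk; the wrapper name revcomp_lines is made up for this illustration, and it assumes tr and rev are installed.

```shell
# Hypothetical wrapper: reverse-complements every input line via an awk pipe.
revcomp_lines() {
    awk '
    function processSeq(s,    cmd) {
        cmd = "tr ACGTacgt TGCAtgca | rev"
        print s | cmd   # awk writes s to the standard input of cmd
        close(cmd)      # wait for cmd to finish so its output appears now
    }
    { processSeq($0) }
    '
}

printf 'ACCTTGAAA\n' | revcomp_lines
```

The trade-off is the same as with system(): each call still starts external processes, so it is slower than doing the work inside awk.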
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure prints the whole line (print $0) if the pattern /^>/ && /C$/ is true for that line, where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) to the record number NR when the C line is encountered, and to add a later pattern that matches the record whose number is one more than line.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to the custom function processSeq(), which uses system(), is made at the beginning of the C title action block if the variable seq holds an accumulated sequence (and in the END block, for a relevant sequence at the end of the file, where no title line follows).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling system() creates a subshell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0) {if (flag=="do") {processSeq(seq)} else {print seq}} seq=""; line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0) {if (flag=="do") {processSeq(seq)} else {print seq}} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) {if (flag=="do") {processSeq(seq)} else {print seq}}}
' seq.fasta
Using any awk:
$ cat tst.awk
/^>/ {
    if ( NR > 1 ) {
        prt()
    }
    head = $0
    tail = ""
    next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }

function prt(   type) {
    type = substr(head,length(head),1)
    tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
    print head ORS tail
}

function tr(oldStr,trStr,       i,lgth,char,newStr) {
    if ( !_trSeen[trStr]++ ) {
        lgth = (length(trStr) - 1) / 2
        for ( i=1; i<=lgth; i++ ) {
            _trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth+1+i,1)
        }
    }
    lgth = length(oldStr)
    for (i=1; i<=lgth; i++) {
        char = substr(oldStr,i,1)
        newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
    }
    return newStr
}

function rev(oldStr,    i,lgth,char,newStr) {
    lgth = length(oldStr)
    for ( i=1; i<=lgth; i++ ) {
        char = substr(oldStr,i,1)
        newStr = char newStr
    }
    return newStr
}
$ awk -f tst.awk file
This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
Since the OP has amended the output, another solution is to complement the whole of the sequence and then reverse it. Here is a solution that I believe meets these criteria.
sed -nE ':a;p;/^>.*C$/!b
:d;y/%/\n/;p;z;x;$!ba' file

awk or other shell to convert delimited list into a table

So what I have is a huge csv akin to this:
Which is not easily readable. With there being only 4 types of events, I'm using spreadsheets to convert this into the following:
Events are limited to 4, but pools and shards can be unlimited, really. Events may also be missing from the lines: not all pools/shards have all 4 events every day.
So I tried doing this with awk in the shell script that gathers the csv in the first place, but I'm failing spectacularly; no working code can even be shown since it's producing zero results.
Basically, I tried sorting the CSV, reading the first two fields of a row and comparing them to the previous row; if they match, comparing the third field against a fixed set of event strings and storing the fourth field in a variable for that event; and once the first two fields no longer match, finally printing the whole line including the variables.
Sorry for the one-liner; I was testing and experimenting directly on the command line. It's embarrassing, it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'
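You could simply use an associative array (this version needs GNU awk for its arrays of arrays): awk -F, -f parse.awk input.csv with parse.awk being:
{
    sub(/Event/, "", $3);           # "Event1" -> "1", and so on
    res[$1 "," $2][$3] = $4;        # value per pool,shard and event number
}
END {
    for (name in res) {
        printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
    }
}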
The order of the output lines is not guaranteed, because awk visits the keys of for (name in res) in an unspecified order, but my test output is:
PS: Please use an editor to write awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try to get it "right"... ;)
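For awks without arrays of arrays, roughly the same idea can be sketched with ordinary multi-key subscripts. This is an assumption-laden illustration rather than the answer's own code: the function name pivot_events, the helper arrays seen, order and val, and the sample rows are all invented here, and the input is assumed to be pool,shard,event,value with events named Event1 to Event4 as in the question's one-liner.

```shell
# Hypothetical sketch: pivot pool,shard,event,value rows into one row per pool,shard.
pivot_events() {
    awk -F, '
    {
        key = $1 "," $2
        # Remember the order in which each pool,shard pair is first seen.
        if (!(key in seen)) { seen[key] = 1; order[++n] = key }
        val[key, $3] = $4
    }
    END {
        # Missing events simply print as empty fields.
        for (i = 1; i <= n; i++) {
            key = order[i]
            printf "%s,%s,%s,%s,%s\n", key, val[key, "Event1"], val[key, "Event2"], val[key, "Event3"], val[key, "Event4"]
        }
    }'
}

printf '%s\n' 'pool1,shard1,Event2,7' 'pool2,shard1,Event1,3' 'pool1,shard1,Event1,5' | pivot_events
```

Keeping a separate order array makes the output follow first appearance in the input, so no pre-sorting is needed, at the cost of holding every value in memory until END.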
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
    if ( NR>1 ) {
        print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
        delete f
    }
    prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
$ sort file | awk -f tst.awk

awk: data missed while parsing file

I have written a script to parse hourly log files to extract "CustomerId, Marketplace, StartTime, and DealIdClicked" data. The log file structure is like so:
Size=0 bytes
Here is the script I have written to parse the log.
function readServiceLog() {
local _logfile="$1"
local _csvFile="$2"
local _logFileName=$(getLogFileName "$_logfile")
parseLogFile "$_logfile" "$_csvFile"
echo "$_logFileName" >>"$SCRIPT_PATH/excludeFile.txt"
# Function to match regex and extract required data.
function parseLogFile() {
local _logfile=$1
local _csvFile=$2
zcat <"$_logfile" | awk -v csvFilePath="$_csvFile" '
dealIdRegex = "Deals_Content_DealIdClicked_"
delete RECORD
if (match(logLine,InfoRegex)) {
after = substr(logLine,RSTART+RLENGTH);
if(match(after, dealIdRegex)) {
afterDeal = substr(after,RSTART+RLENGTH);
dealId = substr(afterDeal, 1, index(afterDeal,",")-1)
RECORD[0] = dealId
if (match(logLine,customerIdRegex)) {
after = substr(logLine,RSTART+RLENGTH);
customerid = substr(after, 1, length(after))
RECORD[1] = customerid
if (match(logLine,startTimeRegex)) {
after = substr(logLine,RSTART+RLENGTH);
startTime = substr(after, 1, length(after))
RECORD[2] = startTime
if (match(logLine,marketplaceIdRegex)) {
after = substr(logLine,RSTART+RLENGTH);
marketplaceId = substr(after, 1, length(after))
RECORD[3] = marketplaceId
if (match(logLine,EOERegex)) {
if(length(RECORD) == 4) {
printf("%s,%s,%s,%s\n", RECORD[0],RECORD[1],RECORD[2],RECORD[3]) >> csvFilePath
delete RECORD
function processHourlyFile() {
local _currentProcessingFolder=$1
local _outputFolder=$(getOutputFolderName) # getOutputFolderName function is from util class.
mkdir -p "$_outputFolder"
local _csvFileName="$_outputFolder/${_currentProcessingFolder##*/}.csv"
for entry in "$_currentProcessingFolder"/*; do
if [[ "$entry" == *"$SERVICE_LOG"* ]]; then
readServiceLog "$entry" "$_csvFileName"
# Main execution to spawn new processes for parallel parsing.
function main() {
local _processCount=1
for entry in $INPUT_LOG_PATH/*; do
processHourlyFile $entry &
# wait for all pids
for pid in ${pids[*]}; do
wait $pid
printf '\nFinished!\n'
Expected output:
A comma separated file.
The script spawns 24 processes to parse the 24 hourly logs for an entire day. After parsing the files, I verified the record count, and sometimes it doesn't match the record count of the original log file.
I have been stuck on this for the last two days with no luck. Any help would be appreciated.
Thanks in advance.
awk -F= '
{ a[$1]=$2 }
/^Info/ {
  sub(/.*DealIdClicked_/, "")
  sub(/,.*/, "")
  print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
  delete a
}' OFS=, filename
When run on your input file, the above produces the desired output:
How it works
-F= tells awk to use = as the field separator on input.
{ a[$1]=$2 } tells awk to save the second field, $2, in associative array a under the key $1.
/^Info/ { ... } tells awk to perform the commands in curly braces whenever the line starts with Info. Those commands are:
sub(/.*DealIdClicked_/, "") removes all parts of the line up to and including DealIdClicked_.
sub(/,.*/, "") tells awk to remove from what's left of the line everything from the first comma to the end of the line.
The remainder of the line, still called $0, is the "DealId" that we want.
print $0, a["CustomerId"], a["StartTime"], a["Marketplace"] tells awk to print the output that we want.
delete a deletes array a so we start over clean on the next record.
OFS=, tells awk to use a comma as the field separator on output.
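As a concrete trace of the steps above, here is a small sketch run on a made-up record. The record layout, the values and the wrapper name extract_deals are assumptions for illustration only (the real log format is not reproduced here); the sketch also assumes the Info line arrives after the CustomerId, StartTime and Marketplace lines of the same record.

```shell
# Hypothetical wrapper around the same awk program, reading from standard input.
extract_deals() {
    awk -F= -v OFS=, '
    { a[$1] = $2 }                      # remember every key=value line
    /^Info/ {
        sub(/.*DealIdClicked_/, "")     # drop everything up to the deal id
        sub(/,.*/, "")                  # drop everything after the deal id
        print $0, a["CustomerId"], a["StartTime"], a["Marketplace"]
        delete a                        # start clean for the next record
    }'
}

# Invented sample record, not taken from the real logs.
printf '%s\n' \
    'StartTime=1591888800' \
    'Marketplace=US' \
    'CustomerId=A123' \
    'Info=Deals_Content_DealIdClicked_abc123,rest' \
    'EOE' | extract_deals
```

With this sample the single output line is abc123,A123,1591888800,US. If the Info line could precede the other keys in a real record, the print would have to move to the end-of-record marker instead.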

Parse out key=value pairs into variables

I have a bunch of different kinds of files I need to look at periodically, and what they have in common is that the lines have a bunch of key=value type strings. So something like:
Version=2 Len=17 Hello Var=Howdy Other
I would like to be able to reference the names directly from awk... so something like:
cat some_file | ... | awk '{print Var, $5}' # prints Howdy Other
How can I go about doing that?
The closest you can get is to parse the variables into an associative array first thing every line. That is to say,
awk '{ delete vars; for(i = 1; i <= NF; ++i) { n = index($i, "="); if(n) { vars[substr($i, 1, n - 1)] = substr($i, n + 1) } } Var = vars["Var"] } { print Var, $5 }'
More readably:
{
  delete vars;                   # clean up previous variable values
  for(i = 1; i <= NF; ++i) {     # walk through fields
    n = index($i, "=");          # search for =
    if(n) {                      # if there is one:
      # remember value by name. The reason I use
      # substr over split is the possibility of
      # something like Var=foo=bar=baz (that will
      # be parsed into a variable Var with the
      # value "foo=bar=baz" this way).
      vars[substr($i, 1, n - 1)] = substr($i, n + 1)
    }
  }

  # if you know precisely what variable names you expect to get, you can
  # assign to them here:
  Var = vars["Var"]
  Version = vars["Version"]
  Len = vars["Len"]
}
{
  print Var, $5                  # then use them in the rest of the code
}
$ cat file | sed -r 's/[[:alnum:]]+=/\n&/g' | awk -F= '$1=="Var"{print $2}'
Howdy Other
Or, avoiding the useless use of cat:
$ sed -r 's/[[:alnum:]]+=/\n&/g' file | awk -F= '$1=="Var"{print $2}'
Howdy Other
How it works
sed -r 's/[[:alnum:]]+=/\n&/g'
This places each key,value pair on its own line.
awk -F= '$1=="Var"{print $2}'
This reads the key-value pairs. Since the field separator is chosen to be =, the key ends up as field 1 and the value as field 2. Thus, we just look for lines whose first field is Var and print the corresponding value.
Since discussion in commentary has made it clear that a pure-bash solution would also be acceptable:
case $BASH_VERSION in
  ''|[0-3].*) echo "ERROR: Bash 4.0 required" >&2; exit 1;;
esac

while read -r -a words; do                # iterate over lines of input
  declare -A vars=( )                     # refresh variables for each line
  set -- "${words[@]}"                    # update positional parameters
  for word; do
    if [[ $word = *"="* ]]; then          # if a word contains an "="...
      vars[${word%%=*}]=${word#*=}        # ...then set it as an associative-array key
    fi
  done
  echo "${vars[Var]} $5"                  # Here, we use content read from that line.
done <<<"Version=2 Len=17 Hello Var=Howdy Other"
The <<<"Input Here" could also be <file.txt, in which case lines in the file would be iterated over.
If you wanted to use $Var instead of ${vars[Var]}, then substitute printf -v "${word%%=*}" %s "${word#*=}" in place of vars[${word%%=*}]=${word#*=}, and remove references to vars elsewhere. Note that this doesn't allow for a good way to clean up variables between lines of input, as the associative-array approach does.
I will try to explain you a very generic way to do this which you can adapt easily if you want to print out other stuff.
Assume you have a string which has a format like this:
key1=value1 key2=value2 key3=value3
or more generic
With fs1 and fs2 two different field separators.
You would like to make a selection or some operations with these values. To do this, the easiest is to store these in an associative array:
array["key1"] => value1
array["key2"] => value2
array["key3"] => value3
array["key1","full"] => "key1=value1"
array["key2","full"] => "key2=value2"
array["key3","full"] => "key3=value3"
This can be done with the following function in awk:
function str2map(str,fs1,fs2,map,   n,tmp) {
   n=split(str,map,fs1)
   for (;n>0;n--) {
     split(map[n],tmp,fs2);
     map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
     delete map[n]
   }
}
So, after processing the string, you have the full flexibility to do operations in any way you like:
awk '
function str2map(str,fs1,fs2,map,   n,tmp) {
   n=split(str,map,fs1)
   for (;n>0;n--) {
     split(map[n],tmp,fs2);
     map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
     delete map[n]
   }
}
{ str2map($0," ","=",map) }
{ print map["Var","full"] }
' file
The advantage of this method is that you can easily adapt your code to print any other key you are interested in, or even make selections based on this, example:
(map["Version"] < 3) { print map["var"]/map["Len"] }
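To make the behaviour concrete, here is a minimal sketch that wraps a str2map-style function in a shell helper and looks up one key. The body of the function is written out under the assumption that it splits the string on fs1 and then each piece on fs2; the helper name kv_lookup and the want variable are invented for this illustration.

```shell
# Hypothetical helper: print the value and the full key=value text for one key.
kv_lookup() {
    awk -v want="$1" '
    function str2map(str, fs1, fs2, map,    n, tmp) {
        n = split(str, map, fs1)            # map[1..n] holds the raw pieces
        for (; n > 0; n--) {
            split(map[n], tmp, fs2)         # tmp[1] is the key, tmp[2] the value
            map[tmp[1]] = tmp[2]
            map[tmp[1], "full"] = map[n]
            delete map[n]                   # drop the numeric entry
        }
    }
    { str2map($0, " ", "=", map); print map[want], map[want, "full"] }
    '
}

echo 'Version=2 Len=17 Hello Var=Howdy Other' | kv_lookup Var
```

For the sample line this prints Howdy Var=Howdy. Words without a separator, such as Hello, end up as keys with an empty value.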
The simplest and easiest way is to use the string substitution like this:
echo "'$name' : '$value'"
The output is:
'' : '1234567890=='
Using bash's set command, we can split the line into positional parameters like awk.
For each word, we'll try to read a name value pair delimited by =.
When we find a value, assign it to the variable named $key using bash's printf -v feature.
#!/usr/bin/env bash
line='Version=2 Len=17 Hello Var=Howdy Other'
set $line
for word in "$@"; do
    IFS='=' read -r key val <<< "$word"
    # use a fixed format so a "%" in the value is not interpreted
    test -n "$val" && printf -v "$key" '%s' "$val"
done
echo "$Var $5"
Howdy Other
an awk-based solution that doesn't require manually checking the fields to locate the desired key pair :
approach being avoid splitting unnecessary fields or arrays - only performing regex match via function call when needed
only returning FIRST occurrence of input key value. Subsequent matches along the row are NOT returned
i just called it S() cuz it's the closest letter to $
I only included an array (_) of the 3 test values for demo purposes. Those aren't needed. In fact, no state information is being kept at all
caveat being : key-match must be exact - this version of the code isn't for case-insensitive or fuzzy/agile matching
Tested and confirmed working on
- gawk 5.1.1
- mawk 1.3.4
- mawk-2/
- macos nawk
# gawk profile, created Fri May 27 02:07:53 2022
{m,n,g}awk '
function S(__,_) {
return \
! match($(_=_<_), "(^|["(_="[:blank:]]")")"(__)"[=][^"(_)"*") \
? "^$" \
: substr(__=substr($-_, RSTART, RLENGTH), index(__,"=")+_^!_)
}
BEGIN { OFS = "\f" # This array is only for testing
_["Version"] _["Len"] _["Var"] # purposes. Feel free to discard at will
} {
for (__ in _) {
print __, S(__) } }'
So either call the fields in BAU fashion
- $5, $0, $NF, etc
or call S(QUOTED_KEY_VALUE), case-sensitive, like
S("Version") to get back 2.
As a safeguard, to prevent mis-interpreting null strings
or invalid inputs as $0, a non-match returns ^$
instead of empty string
As a bonus, it can safely handle values in multibyte unicode, both for values and even for keys, regardless of whether ur awk is UTF-8-aware or not :
1 ✜
2 Version
3 Var
4 Len
5 ✜=🤡 Version=2 Len=17 Hello Var=Howdy Other
I know this question is specifically about awk, but I am mentioning this because many people come here looking for ways to break down name=value pairs (with or without awk).
I found the approach below simple, straightforward, and effective at handling multiple spaces or commas as well:
changes="foo=red bar=green baz=blue"
#use below if var is in CSV (instead of space as delim)
changes=`echo $changes | tr ',' ' '`
for change in $changes; do
    set -- `echo $change | tr '=' ' '`
    echo "variable name == $1 and variable value == $2"
    #can assign value to a variable like below
    eval my_var_$1=$2;
done

awk script error on solaris

I have following script and trying to run it:
start = 0
if (match($0, "<WorkflowProcess ")) {
if ((startTag < 2) || (endTag == startTag)) {
if (match($0, "</WorkflowProcess>")) {
However I always get this error:
awk: syntax error near line 6
awk: illegal statement near line 6
awk: syntax error near line 10
awk: bailing out near line 10
Any thoughts? I have tried to convert it via dos2unix and also with tr -d '\r', but it's still the same issue. The input parameter is, in my opinion, correct, since I am sending a full path with file name and extension (/export/home/test/file.txt). All files have 0777.
How do you try to run that program?
If you use awk "... all that program ...", then the shell will expand $0 to its own path, which probably has a leading /... Although, now that I look at it, that should fail earlier with the internal ". Still, it would be useful to see the precise command line.
By the way, why are you calling match? It would be much more idiomatic to write:
awk '
/<WorkflowProcess / { ++startTag }
startTag < 2 || startTag == endTag { print }
/<\/WorkflowProcess>/ { ++endTag }
' /export/home/test/file.txt
which avoids the explicit use of $0 altogether.
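As a rough check of what these three rules do, the sketch below runs them on a tiny invented document. The wrapper name first_workflow and the sample XML are assumptions made for illustration; note that the slash inside the closing-tag regex is escaped so that it does not end the regex literal.

```shell
# Hypothetical wrapper: keep everything except the second and later WorkflowProcess blocks.
first_workflow() {
    awk '
    /<WorkflowProcess /                 { ++startTag }   # count opening tags
    startTag < 2 || startTag == endTag  { print }        # first block, or outside any block
    /<\/WorkflowProcess>/               { ++endTag }     # count closing tags
    '
}

# Invented sample input.
printf '%s\n' \
    '<Package>' \
    '<WorkflowProcess Id="1">' \
    'a' \
    '</WorkflowProcess>' \
    '<WorkflowProcess Id="2">' \
    'b' \
    '</WorkflowProcess>' \
    '</Package>' | first_workflow
```

With this input the first block and the surrounding Package lines are printed, while the second block is dropped.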
On SunOS, the default /usr/bin/awk is the old pre-POSIX awk, which does not know match(), so nawk is often the better choice:
nawk -f script.awk /export/home/test/file.txt
Just an idea, in the BEGIN rule you initialize start, not startTag, but then you increment startTag in the next rule. I know, this works in GNU awk and all, but maybe you should try initializing startTag.
