Initialize an array inside an AWK command and use the array to print

I'm trying to compare the data in two files and print certain output from the comparison.
My main objective here is to initialize an array containing some values inside the same awk statement and use it for printing.
Below is the command I am using, which I think contains a syntax error.
Please help with the awk part: how should I define the array, and how can I use it?
Command tried:
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '{array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY")}' '{c=NF/2;for(i=1;i<=c;i++)if($i!=$(i+c))printf "%s|%s|%s|%s\n",$1,${array[i]},$i,$(i+c)}'
SAMPLE INPUT FILE
filedata.txt
A|1|2|3
B|2|3|4
tabdata.txt
A|1|4|3
B|2|3|7
The output I want is:
A|FH_RECORDTYPE|2|4
B|FILECATEGORY|4|7
The output consists of the differences, in this format:
PRIMARYKEY|COLUMNNAME|FILE1DATA|FILE2DATA
I want the array to be initialized inside awk as array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY"); its elements correspond to the column names.
The column name is fetched from the array when ($i!=$(i+c)): whichever "i"th position mismatches, I will print the "i"th element of the array.
The part that finds the differences works perfectly if I remove the array from my command, but I want to initialize an array containing the column names and print from it within the same awk statement.
I just need help with how to incorporate the array part within awk.

Unfortunately, arrays in AWK cannot be assigned the way you expect. As an alternative, you can use the split function:
split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")
(The third argument " " is normally optional, but it is needed here because -F '|' overrides FS, which split would otherwise use as its separator.)
Then your command will look like:
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '
BEGIN {split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")}
{
c= NF/2;
for(i=1; i<=c; i++)
if ($i != $(i+c))
printf "%s|%s|%s|%s\n", $1, array[i], $i, $(i+c);
}'
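As a quick check, the command above can be wrapped in a function and run against the sample data from the question. This is only a sketch: diff_columns is a hypothetical helper name, the array is called names here, and the files are written to a temporary directory rather than the asker's real locations.

```shell
#!/usr/bin/env bash
# Sketch: report differing columns between two pipe-delimited files.
# diff_columns is a hypothetical name; the column names come from the question.
diff_columns() {
  paste -d '|' "$1" "$2" | awk -F '|' '
    BEGIN {
      # split() fills names[1..4]; the explicit " " stops it from using FS ("|")
      split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", names, " ")
    }
    {
      c = NF / 2    # first half of the fields is file 1, second half is file 2
      for (i = 1; i <= c; i++)
        if ($i != $(i + c))
          printf "%s|%s|%s|%s\n", $1, names[i], $i, $(i + c)
    }'
}

# Sample data from the question, written to a temporary directory.
tmp=$(mktemp -d)
printf 'A|1|2|3\nB|2|3|4\n' > "$tmp/filedata.txt"
printf 'A|1|4|3\nB|2|3|7\n' > "$tmp/tabdata.txt"
diff_columns "$tmp/filedata.txt" "$tmp/tabdata.txt"
rm -rf "$tmp"
```

With the sample files this should print the two lines requested in the question, one per mismatching column.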

Related

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but it's in dd-mm-yyyy and d-m-yyyy format and sometimes there's a word before like word-dd-mm-yyyy. I would like to create a function to sort the values like any other language so it ignores the first word, casts the date to a common format and compares that format. In javascript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched for this on Stack Overflow and none of the answers helped me, because all of them are oriented toward sorting dates that are already in a consistent format, and that's not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs://organization-dumps/ambient/", "");
if (! $0) next;
sub("/$", "");
d = $0;
sub(/^[^0-9][^-]*-/, "", d);
sub(/[^0-9]*$/, "", d);
split(d, w, "-");
printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort -n | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction contains some assumptions which seem to be valid for your test data, but may need additional tweaking if you have real data with corner cases like if the label before the date could sometimes contain a number with dashes around it, too.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
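To try the decorate-sort-undecorate idea without access to gsutil, here is a minimal sketch that reads the names from standard input instead. The function name sort_by_date is hypothetical, and the sketch assumes every line contains exactly one d-m-yyyy or dd-mm-yyyy date; lines without a date are silently dropped.

```shell
#!/usr/bin/env bash
# Decorate-sort-undecorate sketch; sort_by_date is a hypothetical name.
# Pipeline: awk prepends a yyyymmdd key and a tab, sort orders numerically
# on that key, and cut removes the key again.
sort_by_date() {
  awk '
    match($0, /[0-9][0-9]?-[0-9][0-9]?-[0-9][0-9][0-9][0-9]/) {
      # d[1] = day, d[2] = month, d[3] = year
      split(substr($0, RSTART, RLENGTH), d, "-")
      printf "%04d%02d%02d\t%s\n", d[3], d[2], d[1], $0
    }' |
  sort -k1,1n |
  cut -f2-
}

printf '%s\n' 8-1-2021 something-7-5-2020.dump 23-09-2020 3-9-2020 | sort_by_date
```

Lines that share a date come out in whatever order sort uses to break ties, which depends on the locale.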
1. Preprocess the input into the format you want for sorting.
2. Sort.
3. Remove the artifacts from step 1.
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, its just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Check that lines contain a date
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date in the array map based on "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the month and day with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'
Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. For each line read, we extract the date. The date is always located before the <dot> character and is thus always in field 1 when the dot is the field separator (FS="."). We split the first field on the hyphen and use the total number of pieces to locate the date. We convert the date simplistically to a number (YYYY*10000 + MM*100 + DD; this works because DD < 100 and MM*100 < 10000) and ask awk to sort by that number. Note that two lines with the same date produce the same index, so only the last of them is kept.
It is now possible to combine the full pipe-line in a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

Convert a bash array into an awk array

I have an array in bash and want to use this array in an awk script. How can I pass the array from bash to awk?
The keys of the awk array should be the indices of the bash array. For simplicity, we can assume that the bash array is dense, that is, the array is not sparse like a=([3]=x [5]=y).
The elements inside the array can have any value. Besides strange unicode symbols and ascii control characters they may contain spaces or even newlines. Also, there might be empty ("") entries which should be retained. As an example consider the following array:
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
Extending approach #1 provided by Socowi, it is possible to address the shortcoming that he identified using the awk split function. Note that this solution does not use the stdin - it uses command line options - allowing awk to process stdin, files, etc.
The solution will convert the bash array 'a' into the awk array 'a', using the intermediate file AVF (a process substitution). This is a workaround for the bash limitation that prevents NUL from being stored in a string.
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
awk -v AVF=<(printf '%s\0' "${a[@]}") '
BEGIN {
# Temporary RS to allow reading the array with a single read.
saveRS=RS
RS=""
getline AV < AVF
RS = saveRS
na=split(AV, a, "\\0")
# Remove trailing empty element (printf adds a trailing separator).
delete a[na]
na-- ; for (i=1 ; i<=na ; i++ ) print "AV#", i, "=" a[i]
}{
# Use a[x]
}
'
Output:
1 AB
2 C D
3 E
F G
4 ¼ẞ🍕
5
Previous solution: For practical reasons, this uses the '\001' character as the separator, which makes the script much simpler (any other character sequence known not to appear in the array can be used instead). Bash command substitution does not allow the NUL character. Hopefully this is not a major issue, as this control character is rarely used in normal files. I believe it is possible to solve this, but I'm not sure how.
The solution will convert the bash array 'a' into the awk array 'a', using the intermediate awk variable 'AV'.
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
awk -v AV="$(printf '%s\1' "${a[@]}")" '
BEGIN {
na=split(AV, a, "\\1")
# Remove trailing empty element (printf adds a trailing separator).
delete a[na]
na--
for (i=1 ; i<=na ; i++ ) print "AV#", i, "=" a[i]
}
{
# Use a[x]
}
'
Approach 1: Reading in awk
Since the array elements can contain any character but the null byte (\0) we have to delimit them by \0. This is done with printf. For simplicity we assume that the array has at least one entry.
Due to the \0 we can no longer pass the string to awk as an argument but have to use (or emulate) a file instead. We then read that file in awk using \0 as the record separator RS (may require GNU awk).
awk 'BEGIN {RS="\0"} {a[n++]=$0; next}' <(printf %s\\0 "${a[@]}")
This reliably constructs the awk array a from the bash array a. The length of a is stored in n.
This approach is ugly when you actually want to use it. There is no simple step-by-step instruction on how to incorporate this approach into your existing awk script. Normally, your awk script would read another file afterwards, therefore you have to change the record separator RS after the array file was read. This can be done with NR>FNR. However, if your awk script already reads multiple files and relies on something like NR==FNR things get complicated.
Approach 2: Generating awk Code with bash
Instead of parsing the array in awk we hard-code the array by generating awk code. This code will be injected at the beginning of an existing awk script and initialize the array. This approach also supports sparse arrays and associative arrays and should work with all awk versions, not only GNU.
For the code generation we have to correctly quote all strings. For example, the code generator echo "a[0]=\"${a[0]}\"" would fail if ${a[0]} was ", resulting in the code a[0]=""". POSIX awk supports octal escape sequences (\012), which can encode all bytes. We simply encode everything. That way we cannot forget any special symbols (even though the generated code is a bit inefficient).
octString() {
printf %s "$*" | od -bvAn | tr ' ' '\\' | tr -d '\n'
}
arrayToAwk() {
printf 'BEGIN{'
n=0
for key in "${!a[@]}"; do
printf 'a["%s"]="%s";' "$(octString "$key")" "$(octString "${a[$key]}")"
((n++))
done
echo "n=$n}"
}
The function arrayToAwk converts the bash array a (which can be sparse or associative) into a BEGIN block. After inserting the generated code block at the beginning of your existing awk program, you can use the awk array a anywhere inside awk without having to adapt anything (assuming that the variable names a and n were unused before). n is the size of the awk array a.
For awk commands of the form awk ... 'program' ... use
awk ... "$(arrayToAwk)"'program' ...
For big arrays this might result in the error Argument list too long. You can circumvent this problem using a program file:
awk ... -f <(arrayToAwk; echo 'program') ...
For awk commands of the form awk ... -f progfile ... use
awk ... -f <(arrayToAwk; cat progfile) ...
I'd like to point out that this can be extremely simple if you do not mind using ARGV and deleting all the non-file arguments. One way:
>cat awk_script.sh
#!/bin/awk -f
BEGIN{
i=1
while(ARGV[i] != "--" && i < ARGC) {
print ARGV[i]
delete ARGV[i]
i++
}
if(i < ARGC)
delete ARGV[i]
} {
print "File 1 contains at 1",$1
}
Then run it with:
>./awk_script.sh "${a[@]}" -- file1
AB
C D
E
F G
¼ẞ�
File 1 contains at 1 a
Obviously I'm missing some symbols.
Note while I like this method it assumes -- is not in the array, as pointed out by Oguz Ismail. They give a great alternate solution of having the first argument the length of your list.
This can be a one-liner of the form
awk 'BEGIN{... get and delete first arguments ...}{process files}END{if wanted}' "${a[@]}" file1 file2...
but will become unreadable very quickly.
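The length-first variant mentioned above can be sketched as follows. This is an illustration under assumptions rather than the commenter's exact code: show_in_awk is a hypothetical helper, the element count is passed as the first argument, and the awk array is indexed from 1. Because no "--" sentinel is needed, the elements may contain any value that can be passed as a command-line argument.

```shell
#!/usr/bin/env bash
# Sketch: pass a bash array to awk through ARGV, with the element count first.
# show_in_awk is a hypothetical name; it prints each element in angle brackets.
show_in_awk() {
  awk '
    BEGIN {
      n = ARGV[1] + 0          # first argument: number of array elements
      delete ARGV[1]
      for (i = 1; i <= n; i++) {
        a[i] = ARGV[i + 1]     # copy element i into the awk array
        delete ARGV[i + 1]     # so awk does not treat it as a file name
      }
      for (i = 1; i <= n; i++) printf "a[%d]=<%s>\n", i, a[i]
    }' "$#" "$@"
}

a=(AB " C D " "")
show_in_awk "${a[@]}"
```

Any file names would follow the array elements on the command line and be left in ARGV for the main rules to process.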

awk store a pattern result to a shell array variable

I am trying to store the result of a pattern matched by awk to a shell array variable. Here's a simplified example of the same:
#!/bin/bash
declare -a array1=()
declare -a array2=()
READ_FILE="directory1/read_file.csv"
WRITE_FILE="directory2/results.csv"
#variable for counting array index
count1=0
count2=0
#
#
# need help with line below
# $2 below is the second set of characters which is a floating point number
awk -F 'string1_to_search' '{$array1[count1++] = $2}' $READ_FILE
awk -F 'string2_to_search' '{$array2[count2++] = $2}' $READ_FILE
#count++ indicates post increment of count variable
#do something with the array
.
.
#end
Any suggestions would be helpful.
Something roughly like this, then?
awk '/string1_to_search/ {
count["id1"]++; sum["id1"] += $2 }
/string2_too/ {
count["id2"]++; sum["id2"] += $2 }
# ...
END { for (k in count) printf("%s: sum %f/count %i = avg %f\n", k, sum[k], count[k], sum[k]/count[k]) }' inputfile
I seem to recall there was a clever way to calculate a rolling variance without keeping the entire input set in memory; or else just collect the values space-separated value["id"] = value["id"] " " $2 and split into a list and loop over it near the end. Alternatively, simplify this to only examine one search string at a time and run it multiple times (let's hope then the input isn't very big). Or switch to Perl, which will easily let you collect lists of lists and other nested structures.
Obviously break out common functionality into separate functions so you don't have repeated code ... I suppose it's actually clearer like this, but if you find bugs, or need other changes, you only want to have to change one place in the code.
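The rolling variance alluded to above is commonly done with Welford's online algorithm, which keeps only a count, a running mean, and a running sum of squared deviations. The following is a minimal sketch, assuming one number per input line; running_stats is a hypothetical name and the sketch reports the population variance.

```shell
#!/usr/bin/env bash
# Sketch of Welford's online mean/variance in awk.
# running_stats is a hypothetical name; it expects one number per line on stdin.
running_stats() {
  awk '
    {
      n++
      delta = $1 - mean
      mean += delta / n             # update the running mean
      m2   += delta * ($1 - mean)   # accumulate the sum of squared deviations
    }
    END {
      if (n > 0) {
        var = m2 / n                # population variance; m2 / (n - 1) gives the sample variance
        printf "n=%d mean=%.2f var=%.2f sd=%.2f\n", n, mean, var, sqrt(var)
      }
    }'
}

printf '%s\n' 2 4 4 4 5 5 7 9 | running_stats
```

To keep separate statistics per search string, the three scalars could become arrays indexed by an id, in the same way count and sum are indexed in the answer above.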
Another method is to have awk print the numbers, which can then be read into a bash array variable like this:
mapfile -t array1 < <( awk -F 'string1_to_search' '{print $2}' "$READ_FILE" )
Later, to compute the mean, variance, and standard deviation, we can use the bc tool from within bash.

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the entries in parse.txt in the order specified in refer.txt.
For example, output.txt should be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
I have tried the following code:
sort -nru refer.txt parse.txt
but had no luck.
Please assist me. TIA.
You can do that using gnu-awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/ # Use field separator as /
-v RS=',|\n' # Use record separator as comma or newline
NR == FNR { # While processing parse.txt
a[$1]=(a[$1])?a[$1] ","$0:$0 # create an array with 1st field as key and value as all the
# records with keys julie, remo, rob etc.
}
{ # while processing the second file refer.txt
s = (s)?s "," a[$1]:a[$1] # aggregate all values by reading key from 2nd file
}
END {print s } # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt
# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[@]}"; do
key=${value%%/*}
if [[ ${kv[$key]} ]]; then
kv[$key]+=",$value" # already exists, comma-separate
else
kv[$key]="$value"
fi
done
# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[@]}"; do
out+=( "${kv[$value]}" )
done
# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
tr , "\n" <refer.txt | cat -n >person_id.txt # 'cat -n' is not POSIX; use sed and paste instead
cat person_id.txt | while read person_id person_key
do
print "$person_id" > $person_key
done
tr , "\n" <parse.txt | sed -E 's/(^[^\/]*)(\/.*)$/\1 \1\2/' >person_data.txt
cat person_data.txt | while read foreign_key person_data
do
person_id="$(<$foreign_key)"
print "$person_id" " " "$person_data" >>merge.txt
done
sort merge.txt >output.txt
A text book data processing approach, a person id table, a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
This actually adds a step to the otherwise simple but not very efficient task of grepping parse.txt with each value in refer.txt - which is more efficient I'm not sure.
NB: The above code is very unlikely to work out of the box.
NBB: On reflection, probably a better way of doing this would be to use the file system to create a random access file of parse.txt (essentially an index), and to then consider refer.txt as a batch file, submitting it as a job as such, printing out from the parse.txt random access file the data for each of the names read in from refer.txt in turn:
# 1) index data file on required field
cat person_data.txt | while read data
do
key="$(print "$data" | sed 's/(^[^\/]*)/\1/')" # alt. `cut -d'/' -f1` ??
print "$data" >>./person_data/"$key"
done
# 2) run batch job
cat refer_data.txt | while read key
do
print ./person_data/"$key"
done
However having said that, using egrep is probably just as rigorous a solution or at least for small datasets, I would most certainly use this approach given the specific question posed. (Or maybe not! The above could well prove faster as well as being more robust.)
Command
while read line; do
grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, commas are translated to newlines using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that your records are separated by newlines instead of commas.
while read will read in every name from refer.txt (i.e., julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed from the start of the string and not in the middle (thanks to @CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.

How to combine two lines that share the same keyword?

Let's say I have a file looking somewhat like this:
X NeedThis1 KEYWORD
.
.
NeedThis2 X KEYWORD
And I need to combine the two lines into one like this:
NeedThis2 NeedThis1 KEYWORD
This needs to be done for every pair of lines in the file that share the same KEYWORD, but it must not combine two lines that look like this (X in the same position in both):
X NeedThis1 KEYWORD
X NeedThis2 KEYWORD
I consider myself a bash noob, so any advice on whether this can be done with something like awk or sed would be appreciated.
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}
' file
Try this:
awk '
$1=="X" {key = $NF; value = $2; next}
$2=="X" && $NF==key {print value, $1, key}' file
Explanation:
When the first field of a line is X, store the last field as the key and the second field as the value.
Look for the next line where the second field is X and the last field matches the key stored by the previous action.
When found, print the value from the last matched line along with the first field of the current line and the key.
This will most definitely break if your data does not match the sample you have shown (if it has more spaces or fields in between), so feel free to adjust as per your needs.
I won't give you the full answer, but if you have some way to identify "KEYWORD" (not in your problem statement), then use a BASH associative array:
declare -A keys
while IFS= read -u3 -r line
do
set -- $line
eval keyword=\$$#
keys[$keyword]+=${line%$keyword}
done
You'll certainly have to do some more fiddling, but your problem statement is incomplete and some of the work needs to be an exercise for the reader.
