Parsing data from function in POSIX - bash

I'm using POSIX. I have a function called get_data which returns:
4;Fix README;feature4;develop;URL5
2;Fix file3;feature2;develop;URL2
5;Fix README;feature2;develop;URL3
1;Fix file2;feature1;develop;URL1
I want to get the URL (the last field) of the latest feature2 entry, where "latest" means the highest index in the first field. In the above example, the result should be URL3, because that line has feature2 in the third field and 5 > 2 in the first field.
The first thing I tried is:
url=$(get_data | grep feature2)
But I don't like this solution, because other lines can also contain feature2 in other fields. In Bash I would use BASH_REMATCH with a regex, but here I'm not sure what the most elegant way to get that URL is.
Could you suggest how to do it?

Use awk:
url=$(get_data | awk -F";" '$3 == "feature2" && $1 > idx {idx=$1; url=$5} END {print url}')
After splitting each line into ;-delimited fields, save the fifth field from a line whose third field is the desired feature, if the index is greater than the one you last saved. Once you have checked each line, output the final value of url.
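As a quick check, the same approach can be run against the sample data. This is only a sketch: get_data here is a stand-in that prints the four sample lines, and the awk variable name want is an assumption made for illustration.

```shell
# Hypothetical stand-in for the real get_data; it just prints the sample lines.
get_data() {
    printf '%s\n' \
        '4;Fix README;feature4;develop;URL5' \
        '2;Fix file3;feature2;develop;URL2' \
        '5;Fix README;feature2;develop;URL3' \
        '1;Fix file2;feature1;develop;URL1'
}

# Keep the URL of the matching line with the highest index seen so far.
url=$(get_data | awk -F';' -v want=feature2 '
    $3 == want && $1 > idx { idx = $1; url = $5 }
    END { print url }')

printf '%s\n' "$url"
```

Passing the feature name with -v keeps the awk program itself free of hard-coded values.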

Using sort and awk, we can do
url=$(get_data | sort -t ";" -k 1,1nr | awk -F";" '$3 == "feature2"{print $5;exit}')


grep for duration=N while N is longer than X, the position of this sentence changes between lines

I have a very long file in the format column_name=column_val, column_name2=column_val2, and so on.
The columns are not in a fixed order. For example, say I have this file:
bar=x moshe=foo test=x duration=5
moshe=foo2 test=y duration=0 bar=y
duration=3 moshe=foo3 bar=z test=x
I want to return only the lines where duration is greater than 2.
As far as I know, awk is not an option, since I can't tell where the columns are located in each line.
On IRC, in the #bash channel, someone recommended using gawk's match(). There too I had trouble seeing how to handle duration appearing at a different position on each line.
Any ideas?
Thanks
You can use duration= as field separator:
# showing field content in numeric context
$ awk -F'duration=' '{print +$2}' ip.txt
5
0
3
# use required numeric comparison to get desired output
$ awk -F'duration=' '+$2 > 2' ip.txt
bar=x moshe=foo test=x duration=5
duration=3 moshe=foo3 bar=z test=x
See https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html for conversion details
Unary + works on GNU awk, not sure about other versions. 0+$2 should work everywhere to force numeric context.
Note that if you have multiple duration= in a line, only the first one will be tested.
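The portable 0+$2 form mentioned above can be tried on the sample lines directly. This sketch feeds the three sample lines through printf instead of reading ip.txt; the variable name matches is only for illustration.

```shell
# Split each line on the literal text "duration=" and force field 2 to a number.
matches=$(printf '%s\n' \
    'bar=x moshe=foo test=x duration=5' \
    'moshe=foo2 test=y duration=0 bar=y' \
    'duration=3 moshe=foo3 bar=z test=x' |
    awk -F'duration=' '0+$2 > 2')

printf '%s\n' "$matches"
```

When duration= is not the last column, $2 also contains the rest of the line, but the numeric conversion only uses the leading digits.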
Extract the value with a regex and compare (gensub is specific to GNU awk):
awk '0+gensub(".*duration=([0-9]*).*", "\\1", "1") > 2'
Edit: as above, the 0+ is needed to convert the string to a number.
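If gensub is not available, a similar extraction can be sketched with the standard match() and substr() functions. The offset 9 is the length of the text "duration="; the sample input and the variable name matches are assumptions for illustration.

```shell
# match() sets RSTART and RLENGTH; the digits start 9 characters after RSTART.
matches=$(printf '%s\n' \
    'bar=x moshe=foo test=x duration=5' \
    'moshe=foo2 test=y duration=0 bar=y' \
    'duration=3 moshe=foo3 bar=z test=x' |
    awk 'match($0, /duration=[0-9]+/) &&
         substr($0, RSTART + 9, RLENGTH - 9) + 0 > 2')

printf '%s\n' "$matches"
```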
If you want to use grep:
grep -E 'duration=([3-9]|[1-9][0-9]+)([^0-9]|$)' "file"
awk '/duration=/ {
    for (counter = 1; counter <= NF; counter++) {
        if ($counter ~ /^duration=/) {
            # substr() returns a string, so add 0 to force a numeric comparison
            value = substr($counter, index($counter, "=") + 1) + 0
            if (value > 2) {
                print $0
            }
        }
    }
}' inputfile

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but the dates are in dd-mm-yyyy and d-m-yyyy format, and sometimes there's a word in front, like word-dd-mm-yyyy. I would like to create a function that sorts the values as in any other language: it ignores the leading word, converts the date to a common format, and compares that format. In JavaScript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched Stack Overflow, and none of the answers helped me, because all of them are oriented toward sorting dates that are already in a consistent format, and that's not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs://organization-dumps/ambient/", "");
    if ($0 == "") next;
    sub("/$", "");
    d = $0;
    sub(/^[^0-9][^-]*-/, "", d);
    sub(/[^0-9]*$/, "", d);
    split(d, w, "-");
    printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction contains some assumptions which seem to be valid for your test data, but may need additional tweaking if you have real data with corner cases like if the label before the date could sometimes contain a number with dashes around it, too.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
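The decorate, sort, undecorate steps can be tried without gsutil. This is a sketch on four of the sample names fed in with printf; the variable name sorted is only for illustration.

```shell
# Decorate each line with a yyyy-mm-dd key, sort on it, then strip the key.
sorted=$(printf '%s\n' \
    'something-7-5-2020.dump' \
    '8-1-2021' \
    'another2-4-5-2020.dump' \
    '23-09-2020' |
    awk '{ d = $0
           sub(/^[^0-9][^-]*-/, "", d)   # drop a leading word, if any
           sub(/[^0-9]*$/, "", d)        # drop a trailing extension, if any
           split(d, w, "-")
           printf "%04d-%02d-%02d\t%s\n", w[3], w[2], w[1], $0 }' |
    sort | cut -f2-)

printf '%s\n' "$sorted"
```

Because the key is zero-padded and ordered year, month, day, a plain textual sort puts the lines in date order.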
1. Preprocess input to be in the format you want it to be for sorting.
2. Sort
3. Remove artifacts from step 1
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, it's just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Check that lines contain a date
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date in the array map based on "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the month and day with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="#ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'
Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. For each line read, we extract the date. The date is always located before the dot character, and thus always in field 1 when the dot is the field separator (FS="."). We split the first field on the hyphen and use the number of resulting parts to locate the date at the end. We then convert the date to a simple number (YYYY*10000 + MM*100 + DD, which orders correctly because DD < 100 and MM*100 < 10000) and ask awk to traverse the array by that number. Note that the number is used as the array index, so lines that share the same date overwrite one another and only the last one per date is printed.
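One way to keep lines that share a date is to print the numeric key in front of each line and let sort do the ordering, instead of using the key as an array index. This is a sketch on a few sample names, not part of the original answer; the variable name sorted is only for illustration.

```shell
# Prefix each line with YYYY*10000 + MM*100 + DD, sort numerically, drop the key.
sorted=$(printf '%s\n' \
    'something-7-5-2020.dump' \
    'another-7-5-2020.dump' \
    '1-12-2020' \
    '8-1-2021' |
    awk -F. '{ n = split($1, t, "-")
               printf "%d\t%s\n", t[n]*10000 + t[n-1]*100 + t[n-2], $0 }' |
    sort -n | cut -f2-)

printf '%s\n' "$sorted"
```

Both 7-5-2020 lines survive here, and this variant does not depend on GNU awk's PROCINFO["sorted_in"].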
It is now possible to combine the full pipe-line in a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
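Without gensub, the same idea can be sketched in standard awk by splitting the last field on ; and looking for the entry that starts with AF=. The sample lines are the two from the question; the array name kv and the variable name kept are assumptions for illustration.

```shell
# Split the last whitespace-separated field into key=value pairs and test AF.
kept=$(printf '%s\n' \
    '1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV' \
    '4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV' |
    awk '{ n = split($NF, kv, ";")
           for (i = 1; i <= n; i++)
               if (kv[i] ~ /^AF=/ && substr(kv[i], 4) + 0 > 0.1) { print; break } }')

printf '%s\n' "$kept"
```

This does not depend on AF being at a fixed position within the last field.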
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
Note that this also matches AF=0.10, which is equal to 0.1 rather than greater. You could add a + after the second character group to support additional digits (i.e. 0.NNNNN), but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports string (lexicographic) comparisons, you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
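-F';' splits the line into fields by ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields). Because this is a string comparison, it relies on the AF values being written in a consistent 0.NN style.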

AWK between 2 patterns - first occurrence

I have this example of an ini file. I need to extract the names between the two patterns Name_Z1 and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this. I would prefer awk to do the job.
I would greatly appreciate any help. Thank you!
For Wintermute solution: The [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A gawk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
gawk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. It is also part of the condition, so x must be non-zero to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after it up to the |. Stores the capture in a. Again a condition, so it must succeed to continue.
gsub(";","\n",a[1]) - Replaces every ; with a newline in the capture group a[1].
{print a[1];exit} - If all conditions are met, then print a[1] and exit.
This way should work in other awks such as mawk:
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
An awk solution would be:
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field separator to ;
/Name_Z1/{f++} matches lines containing the pattern /Name_Z1/ and, on a match, increments f
f==1 && /Names/{print $2,$3,$4} means: if f == 1 and the line matches the pattern Names, print columns 2, 3 and 4 (delimited by ;)
OFS="\n" sets the output field separator to \n (newline)
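The flag idea can also be sketched without assuming exactly three names, by looping over every field after Names. The input below is a shortened copy of the sample ini data, and the flag name inblock and the variable name names are assumptions for illustration.

```shell
# Set a flag at the exact [Name_Z1] header, then print every name on the
# first Names line that follows and stop.
names=$(awk -F';' '
    $0 == "[Name_Z1]"        { inblock = 1; next }
    inblock && $1 == "Names" { for (i = 2; i <= NF; i++) print $i; exit }
' <<'EOF'
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
OBJ=Name_Z1
[Name_Z1_Phone]
Names;Bill;Stan;Mike
OBJ=Name_Z1_Phone
EOF
)

printf '%s\n' "$names"
```

Comparing the whole line with == avoids matching [Name_Z1_Phone], and exit stops after the first block.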
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data arranged in groups of blocks. It assumes the blocks are separated by blank lines.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" '      # By setting RS to nothing, one record equals one block. Then FS is set to one line as a field
/^\[Name_Z1\]/ {        # Search for block with [Name_Z1]
    n=split($3,a,";")   # Split field 3, the names and store number of fields in variable n
    for (i=2;i<=n;i++)  # Loop from second to last field
        print a[i]      # Print the fields
    exit                # Exits after first find
}
' file
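Paragraph mode only splits records at blank lines, so the blocks must be separated by an empty line for this to work. A minimal sketch with two such blocks, fed in with printf (the empty string argument produces the blank line; the variable name names is only for illustration):

```shell
# RS= puts awk in paragraph mode: each blank-line-separated block is one record.
names=$(printf '%s\n' \
    '[Name_Z5]' 'random;text' 'Names;Jesus;Tom;Miguel' 'OBJ=Name_Z5' \
    '' \
    '[Name_Z1]' 'random;text' 'Names;Jhon;Alex;Smith' 'OBJ=Name_Z1' |
    awk -v RS= -F'\n' '/^\[Name_Z1\]/ {
        n = split($3, a, ";")               # field 3 is the Names line
        for (i = 2; i <= n; i++) print a[i]
        exit
    }')

printf '%s\n' "$names"
```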
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names;/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith

Get lines by a unique portion of the line, and display only the first occurrence of that unique portion

I'm trying to write a script that looks at a part of a line, does a sort -u or something to look for unique occurrences, and then displays the output, sorted by the ORIGINAL ordering of the lines. In other words, only the FIRST occurrence of that part of the line would show up.
I managed to do it using cut, but my output just displays the cut portion of the data. How could I do it so that it gets the entire line?
Here's what I've got so far:
cut -d, -f6 infile.txt | cut -c4-11 | grep -n . | sort -t: -k2,2 -u | sort -t: -k1n,1 | cut -d: -f2-
I know the data doesn't have an extra : or a , in a place that would break this script. But this only outputs the data that was unique. How can I get the entire line? I would prefer to stay away from perl, but awk is okay (though I don't know it very well).
Sample:
If the input file is this (note, the ABCDEFGH is not real, I just put it there to illustrate what I mean):
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
C....,....,...........,.....,....,...20130718......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
F....,....,...........,.....,....,...20130714......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
H....,....,...........,.....,....,...20130718......,.........,...........,......
My program outputs:
20130718
20130714
20130719
20130713
20130630
I want to see:
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
Yes, awk is your best bet. Here's a mysterious example:
awk -F, '!seen[substr($6,4,8)]++' infile.txt
Explanation:
options:
    -F,              set the field separator to ,
condition:
    substr($6,4,8)   up to 8 characters starting at the fourth character
                     of the sixth field
    seen[...]++      seen is an associative array (dictionary). Increment the
                     value associated with ..., and return the old value
    !seen[...]++     if there was no old value, perform the action
action:
    There is no action, only a condition, so the default action is
    performed if the test succeeds. The default action is to print
    the line. So the line will be printed if the relevant characters of
    the sixth field haven't yet been seen.
Test:
$ awk -F, '!seen[substr($6,4,8)]++' <<EOF
> A....,....,...........,.....,....,...20130718......,.........,...........,......
> B....,....,...........,.....,....,...20130714......,.........,...........,......
> C....,....,...........,.....,....,...20130718......,.........,...........,......
> D....,....,...........,.....,....,...20130719......,.........,...........,......
> E....,....,...........,.....,....,...20130713......,.........,...........,......
> F....,....,...........,.....,....,...20130714......,.........,...........,......
> G....,....,...........,.....,....,...20130630......,.........,...........,......
> H....,....,...........,.....,....,...20130718......,.........,...........,......
> EOF
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
$
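The same idiom on a smaller, made-up input may make the mechanics easier to see. Here the key is simply the first comma-separated field; the data and the variable name first_seen are assumptions for illustration.

```shell
# Print a line only the first time its key (field 1) is encountered.
first_seen=$(printf '%s\n' 'a,1' 'b,2' 'a,3' 'c,4' 'b,5' |
    awk -F, '!seen[$1]++')

printf '%s\n' "$first_seen"
```

The original line order is preserved because awk decides line by line, with no sorting involved.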
