Processing a delimited line in bash

I am given a single line of input with n space-delimited arguments. The arguments themselves vary, and the input comes from an external file.
I want to move specific elements into variables depending on regular expressions. To do that, I was thinking of first declaring a pointer variable to keep track of where I am on the line. In addition, the assignment to variables is independent of numerical order, and depending on the input some variables may be skipped entirely.
My current method is to use
awk '{print $1}' file.txt
However, not all elements are fixed and I need to account for elements that may be absent, or may have multiple entries.
UPDATE: I found another method.
file=$(cat /file.txt)
for i in ${file[@]}; do
echo $i >> split.txt;
done
This way, instead of a single line with multiple arguments, we get multiple lines with a single argument each. As such, we can now use var#=$(grep --regexp="[pattern]" split.txt). Now I just need to figure out how best to use regular expressions to filter this mess.
Let me take an example.
My input strings are:
RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32
RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00
RON KKND 2234Z CON 342523G CLS 01/M12 RMK
So the variable assignment for each of the above would be:
var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE var6=253985G varC=2 varC1=034SRT varC2=134OVC var7=04/32
var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE var6=143623G72K varC=4 varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var7=13/00
var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE var6=342523G varC=0 var7=01/M12
So, the fourth argument might be var4, var5, or var6.
The fifth argument might be var5, var6, or match another criterion.
The sixth argument may or may not be var6. Telling var6 and var7 apart can be done by matching each argument against */*.
Boiling this down even more: the positions of var1, var2 and var3 in the input are fixed, but after that I need to compare, order, and assign. In addition, the arguments themselves can vary in character length. The relative position of each section to be divided is fixed in relation to its neighbors. For example, var7 will never come before var6 in the input, and if var4 and var5 are both true, then the 4th and 5th arguments will always be 'AUTO CON'. Some segments will always be one argument, and others more than one. The relative position of each is known. As for the patterns, some have a specific character in a specific location, and others may have nothing to identify them aside from their position in the sequence.
So I need awk to recognize a pointer variable, as every argument needs to be checked until a specific match is found:
#Check to see if var4 or var5 exists. if so, flag and increment pointer
pointer=4
if (awk '{print $$pointer}' file.txt) == "AUTO" ; then
var4="TRUE"
pointer=$pointer+1
else
var4="FALSE"
fi
if (awk '{print $$pointer}' file.txt) == "CON" ; then
var5="TRUE"
pointer=$pointer+1
else
var5="FALSE"
fi
#position of var6 is fixed once var4 and var5 are determined
var6=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1
#Count the arguments between var6 and var7 (there may be up to ten)
#and separate each to decode later. varC[0-9] is always three upcase
# letters followed by three numbers. Use this counter later when decoding.
varC=0
until (awk '{print $$pointer}' file.txt) == "*/*" ; do
varC($varC+1)=(awk '{print $$pointer}' file.txt)
varC=$varC+1
pointer=$pointer+1
done
#position of var7 is fixed after all arguments of varC are handled
var7=$(awk '{print $$pointer}' file.txt)
pointer=$pointer+1
I know the above syntax is incorrect. The question is how do I fix it.
var7 is not always at the end of the input line. Arguments after var7 however do not need to be processed.
Actually interpreting the patterns I haven't gotten to yet. I intend to handle that using case statements that compare the variables against regular expressions. I don't want to use awk to interpret the patterns directly, as that would get very messy. I have contemplated using for n in $string, but that would mean comparing every argument to every possible combination directly (and there are multiple segments, each with multiple patterns), and as such is impractical. I'm trying to make this a two-step process.

Please try the following:
#!/bin/bash
# template for variable names
declare -a namelist1=( "var1" "var2" "var3" "var4" "var5" "var6" "varC" )
declare -a ary
# read each line and assign ary to the elements
while read -r -a ary; do
if [[ ${ary[3]} = AUTO ]]; then
ary=( "${ary[@]:0:3}" "TRUE" "FALSE" "${ary[4]}" "" "${ary[@]:5:3}" )
elif [[ ${ary[3]} = CON ]]; then
ary=( "${ary[@]:0:3}" "FALSE" "TRUE" "${ary[4]}" "" "${ary[@]:5:3}" )
else
ary=( "${ary[@]:0:3}" "FALSE" "FALSE" "${ary[3]}" "" "${ary[@]:4:5}" )
fi
# initial character of the 7th element
ary[6]=${ary[7]:0:1}
# locate the index of */* entry in the ary and adjust the variable names
for (( i=0; i<${#ary[@]}; i++ )); do
if [[ ${ary[$i]} == */* ]]; then
declare -a namelist=( "${namelist1[@]}" )
for (( j=1; j<=i-7; j++ )); do
namelist+=( "$(printf "varC%d" "$j")" )
done
namelist+=( "var7" )
fi
done
# assign variables to array elements
for (( i=0; i<${#ary[@]}; i++ )); do
# echo -n "${namelist[$i]}=${ary[$i]} " # for debugging
declare -n p="${namelist[$i]}"
p="${ary[$i]}"
done
# echo "var1=$var1 var2=$var2 var3=$var3 ..." # for debugging
done < file.txt
Note that the script above just assigns bash variables and does not print anything
unless you explicitly echo or printf the variables.
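The pointer idea from the question can also be written directly with an array index. The following is a minimal sketch rather than the method used above: parse_line and the varCs array are hypothetical names, and the group pattern (three digits followed by three uppercase letters) is assumed from the sample input.

```bash
#!/usr/bin/env bash
# Hypothetical helper: walk one space-delimited line with an index ("pointer").
# Sets the globals var1..var7, varC (group count) and varCs (the groups).
parse_line() {
    local -a f
    read -r -a f <<< "$1"
    local p=3                              # 4th field; bash arrays are 0-based
    var1=${f[0]} var2=${f[1]} var3=${f[2]}
    var4=FALSE var5=FALSE
    if [[ ${f[p]} == AUTO ]]; then var4=TRUE; p=$((p + 1)); fi
    if [[ ${f[p]} == CON ]]; then var5=TRUE; p=$((p + 1)); fi
    var6=${f[p]}; p=$((p + 1))
    varC=0; varCs=()
    # Advance until the first field that contains a slash (var7).
    while (( p < ${#f[@]} )) && [[ ${f[p]} != */* ]]; do
        if [[ ${f[p]} =~ ^[0-9]{3}[A-Z]{3}$ ]]; then
            varCs+=("${f[p]}")
            varC=$((varC + 1))
        fi
        p=$((p + 1))
    done
    var7=${f[p]:-}                         # empty if no slash field was found
}

parse_line "RON KKND 2234Z CON 342523G CLS 01/M12 RMK"
echo "var4=$var4 var5=$var5 var6=$var6 varC=$varC var7=$var7"
# prints: var4=FALSE var5=TRUE var6=342523G varC=0 var7=01/M12
```

Fields that sit between var6 and var7 but do not look like a group (CLS in the third sample line) are skipped, and anything after var7 is ignored.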

Updated: this code shows how to decide each variable's value by pattern matching, applied repeatedly.
There are two code blocks: one in pure bash and the other in gawk.
The bash code block relies on array support and declare -g, which are not available in very early bash versions.
grep is also required for the pattern matching.
Tested with GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu) and grep (GNU grep) 2.20.
I stick to printf rather than echo after reading "Why is printf better than echo?".
When using bash, I consider it good practice to be more defensive.
#!/bin/bash
declare -ga outVars
declare -ga lineBuf
declare -g NF
#force valid index starts from 1
#consistent with var* name pattern
outVars=(unused var1 var2 var3 var4 var5 var6 varC var7)
((numVars=${#outVars[@]} - 1))
declare -gr numVars
declare -r outVars
function e_unused {
return
}
function e_var1 {
printf "%s" "${lineBuf[1]}"
}
function e_var2 {
printf "%s" "${lineBuf[2]}"
}
function e_var3 {
printf "%s" "${lineBuf[3]}"
}
function e_var4 {
if [ "${lineBuf[4]}" == "AUTO" ] ;
then
printf "TRUE"
else
printf "FALSE"
fi
}
function e_var5 {
if [ "${lineBuf[4]}" == "CON" ] ;
then
printf "TRUE"
else
printf "FALSE"
fi
}
function e_varC {
local var6_idx=4
if [ "${lineBuf[4]}" == "AUTO" -o "${lineBuf[4]}" == "CON" ] ;
then
var6_idx=5
fi
local var7_idx=$NF
local i
local count=0
for ((i=NF;i>=1;i--));
do
if [ $(grep -cE '^.*/.*$' <<<${lineBuf[$i]}) -eq 1 ];
then
var7_idx=$i
break
fi
done
((varC = var7_idx - var6_idx - 1))
if [ $varC -eq 0 ];
then
printf 0
return;
fi
local cFamily=""
local append
for ((i=var6_idx;i<=var7_idx;i++));
do
if [ $(grep -cE '^[0-9]{3}[A-Z]{3}$' <<<${lineBuf[$i]}) -eq 1 ];
then
((count++))
cFamily="$cFamily varC$count=${lineBuf[$i]}"
fi
done
printf "%s %s" $count "$cFamily"
}
function e_var6 {
if [ "${lineBuf[4]}" == "AUTO" -o "${lineBuf[4]}" == "CON" ] ;
then
printf "%s" "${lineBuf[5]}"
else
printf "%s" "${lineBuf[4]}"
fi
}
function e_var7 {
local i
for ((i=NF;i>=1;i--));
do
if [ $(grep -cE '^.*/.*$' <<<${lineBuf[$i]}) -eq 1 ];
then
printf "%s" "${lineBuf[$i]}"
return
fi
done
}
while read -a lineBuf ;
do
NF=${#lineBuf[@]}
lineBuf=(unused ${lineBuf[@]})
for ((i=1; i<=numVars; i++));
do
printf "%s=" "${outVars[$i]}"
(e_${outVars[$i]})
printf " "
done
printf "\n"
done <file.txt
The awk code below uses the gawk-specific extension for indirect function calls.
The code assigns a function name to every desired output variable.
A different pattern or other transformation can be applied in each variable's own function.
This avoids long if-else-if-else chains
and is also easier to read and extend.
For the special varC family, the function pick_varC plays a trick:
once varC is determined, its value consists of multiple output fields.
If varC=2, the value of varC is returned as 2 varC1=034SRT varC2=134OVC,
that is, the actual value of varC followed by all of its members.
gawk '
BEGIN {
keys["var1"] = "pick_var1";
keys["var2"] = "pick_var2";
keys["var3"] = "pick_var3";
keys["var4"] = "pick_var4";
keys["var5"] = "pick_var5";
keys["var6"] = "pick_var6";
keys["varC"] = "pick_varC";
keys["var7"] = "pick_var7";
}
function pick_var1 () {
return $1;
}
function pick_var2 () {
return $2;
}
function pick_var3 () {
return $3;
}
function pick_var4 () {
for (i=1;i<=NF;i++) {
if ($i == "AUTO") {
return "TRUE";
}
}
return "FALSE";
}
function pick_var5 () {
for (i=1;i<=NF;i++) {
if ($i == "CON") {
return "TRUE";
}
}
return "FALSE";
}
function pick_varC () {
for (i=1;i<=NF;i++) {
if (($i=="AUTO" || $i=="CON")) {
break;
}
}
var6_idx = 5;
if ( i!=4 ) {
var6_idx = 4;
}
var7_idx = NF;
for (i=1;i<=NF;i++) {
if ($i~/.*\/.*/) {
var7_idx = i;
}
}
varC = var7_idx - var6_idx - 1;
if ( varC == 0) {
return varC;
}
count = 0;
cFamily = "";
for (i = 1; i<=varC;i++) {
if ($(var6_idx+i)~/[0-9]{3}[A-Z]{3}/) {
cFamily = sprintf("%s varC%d=%s",cFamily,i,$(var6_idx+i));
count++;
}
}
varC = sprintf("%d %s",count,cFamily);
return varC;
}
function pick_var6 () {
for (i=1;i<=NF;i++) {
if (($i=="AUTO" || $i=="CON")) {
break;
}
}
if ( i!=4 ) {
return $4;
} else {
return $5
}
}
function pick_var7 () {
for (i=1;i<=NF;i++) {
if ($i~/.*\/.*/) {
return $i;
}
}
}
{
for (k in keys) {
pickFunc = keys[k];
printf("%s=%s ",k,@pickFunc());
}
printf("\n");
}
' file.txt
test input
RON KKND 1534Z AUTO 253985G 034SRT 134OVC 04/32
RON KKND 5256Z 143623G72K 034OVC 074OVC 134SRT 145PRT 13/00
RON KKND 2234Z CON 342523G CLS 01/M12 RMK
script output
var1=RON var2=KKND var3=1534Z var4=TRUE var5=FALSE varC=2 varC1=034SRT varC2=134OVC var6=253985G var7=04/32
var1=RON var2=KKND var3=5256Z var4=FALSE var5=FALSE varC=4 varC1=034OVC varC2=074OVC varC3=134SRT varC4=145PRT var6=143623G72K var7=13/00
var1=RON var2=KKND var3=2234Z var4=FALSE var5=TRUE varC=0 var6=342523G var7=01/M12

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
return 1
else
echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, by the way, which has caused problems with some of the solutions I've seen posted elsewhere, as they are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
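Wrapped as a complete function, the slurp-and-match approach might look like the sketch below. The name append_if_missing is hypothetical, and treating a missing file as empty is an assumption that the original snippet does not make.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: append $1 to file $2 unless $2 already contains it.
# Returns 1 when the text is already present, 0 after appending.
append_if_missing() {
    local text=$1 file=$2 contents=
    if [[ -f $file ]]; then
        # read reports failure at end of file when no NUL byte is found,
        # so its exit status is deliberately ignored here.
        IFS= read -r -d '' contents <"$file" || true
    fi
    if [[ $contents == *"$text"* ]]; then
        return 1
    fi
    printf '%s\n' "$text" >>"$file"
}

tmpfile=$(mktemp)
append_if_missing $'sometext\nthat spans\n\tmultiple lines' "$tmpfile"
append_if_missing $'sometext\nthat spans\n\tmultiple lines' "$tmpfile" ||
    echo "already present"
```

Single quotes keep \n and \t literal, so a genuinely multi-line argument needs ANSI-C quoting ($'...') or a literal line break inside the quotes.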
Using awk:
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
If necessary, one would need to properly escape any metacharacters inside FS | OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

Count occurrences in a csv with Bash

I have to create a script that, given a country and a sport, reports the number of medalists and medals won after reading a CSV file.
The CSV is called "athletes.csv" and has this header:
id|name|nationality|sex|date_of_birth|height|weight|sport|gold|silver|bronze|info
When you call the script, you have to pass the nationality and the sport as parameters.
The script I have created is this one:
#!/bin/bash
participants=0
medals=0
while IFS=, read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
participants=$participants++
medals=$(($medals+${array[8]}+${array[9]}+${array[10]))
fi
done < athletes.csv
echo $participants
echo $medals
where array[3] is the nationality, array[8] is the sport and array[9] to [11] are the numbers of medals won.
When I run the script with the correct parameters I get 0 participants and 0 medals.
Could you help me understand what I'm doing wrong?
Note I cannot use awk nor grep
Thanks in advance
Try this:
#! /bin/bash -p
nation_arg=$1
sport_arg=$2
declare -i participants=0
declare -i medals=0
declare -i line_num=0
while IFS=, read -r _ _ nation _ _ _ _ sport ngold nsilver nbronze _; do
(( ++line_num == 1 )) && continue # Skip the header
[[ $nation == "$nation_arg" && $sport == "$sport_arg" ]] || continue
participants+=1
medals+=ngold+nsilver+nbronze
done <athletes.csv
declare -p participants
declare -p medals
The code uses named variables instead of numbered positional parameters and array indexes to try to improve readability and maintainability.
Using declare -i means that strings assigned to the declared variables are treated as arithmetic expressions. That reduces clutter by avoiding the need for $(( ... )).
The code assumes that the field separator in the CSV file is ,, not | as in the header. If the separator is really |, replace IFS=, with IFS='|'.
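The effect of declare -i described above can be seen in isolation. This is a small sketch with made-up medal counts; plain is a hypothetical variable used only for contrast.

```bash
#!/usr/bin/env bash
declare -i medals=0      # integer attribute: assigned text is evaluated arithmetically
ngold=2 nsilver=1 nbronze=0
medals+=ngold+nsilver+nbronze
echo "$medals"           # prints 3

plain=0                  # no integer attribute: += appends text instead
plain+=1
echo "$plain"            # prints 01
```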
I'm assuming that the field delimiter of your CSV file is a comma but you can set it to whatever character you need.
Here's a fixed version of your code:
#!/bin/bash
participants=0
medals=0
{
# skip the header
read
# process the records
while IFS=',' read -ra array
do
if [[ "${array[2]}" == $1 && "${array[7]}" == $2 ]]
then
(( participants++ ))
medals=$(( medals + array[8] + array[9] + array[10] ))
fi
done
} < athletes.csv
echo "$participants" "$medals"
remark: As $1 and $2 are left unquoted they are subject to glob matching (right side of [[ ... == ... ]]). For example you'll be able to show the total number of medals won by the US with:
./script.sh 'US' '*'
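The difference between a quoted and an unquoted right-hand side in [[ ... == ... ]] can be checked with a short sketch; matches_glob and matches_literal are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Right side unquoted: $2 is treated as a glob pattern.
matches_glob()    { [[ $1 == $2 ]]; }
# Right side quoted: $2 is compared literally.
matches_literal() { [[ $1 == "$2" ]]; }

matches_glob    athletics '*' && echo "glob: match"
matches_literal athletics '*' || echo "literal: no match"
```

Quoting the parameters, as in [[ "${array[7]}" == "$2" ]], therefore gives exact matching, at the cost of the wildcard feature.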
But I have to say, doing text processing in pure shell isn't considered good practice; there exist dedicated tools for that. Here's an example with awk:
awk -v FS=',' -v country="$1" -v sport="$2" '
BEGIN {
participants = medals = 0
}
NR == 1 { next }
$3 == country && $8 == sport {
participants++
medals += $9 + $10 + $11
}
END { print participants, medals }
' athletes.csv
There's also a potential problem remaining: the CSV format might need a real CSV parser to be read accurately. There exist a few awk libraries for that, but IMHO it's simpler to use a CSV-aware tool that provides the functionality you need.
Here's an example with Miller:
mlr --icsv --ifs=',' filter -s country="$1" -s sport="$2" '
begin {
@participants = 0;
@medals = 0;
}
$nationality == @country && $sport == @sport {
@participants += 1;
@medals += $gold + $silver + $bronze;
}
false;
end { print @participants, @medals; }
' athletes.csv

Matching a number against a comma-separated sequence of ranges

I'm writing a bash script which takes a number, and also a comma-separated sequence of values and ranges, e.g.: 3,15,4-7,19-20. I want to check whether the number is contained in the set corresponding to the sequence. For simplicity, assume no comma-separated elements intersect, and that the elements are sorted in ascending order.
Is there a simple way to do this in bash other than the brute-force naive way? Some shell utility which does something like that for me, maybe something related to lpr which already knows how to process page range sequences etc.
Is awk cheating?:
$ echo -n 3,15,4-7,19-20 |
awk -v val=6 -v RS=, -F- '(NF==1&&$1==val) || (NF==2&&$1<=val&&$2>=val)' -
Output:
4-7
Another version:
$ echo 19 |
awk -v ranges=3,15,4-7,19-20 '
BEGIN {
split(ranges,a,/,/)
}
{
for(i in a) {
n=split(a[i],b,/-/)
if((n==1 && $1==a[i]) || (n==2 && $1>=b[1] && $1<=b[2]))
print a[i]
}
}' -
Outputs:
19-20
The latter is better as you can feed it more values from a file etc. Then again the former is shorter. :D
Pure bash:
check() {
IFS=, a=($2)
for b in "${a[@]}"; do
IFS=- c=($b); c+=(${c[0]})
(( $1 >= c[0] && $1 <= c[1] )) && break
done
}
$ check 6 '3,15,4-7,19-20' && echo "yes" || echo "no"
yes
$ check 42 '3,15,4-7,19-20' && echo "yes" || echo "no"
no
As bash is tagged, why not just
inrange() { for r in ${2//,/ }; do ((${r%-*}<=$1 && $1<=${r#*-})) && break; done; }
Then test it as usual:
$ inrange 6 3,15,4-7,19-20 && echo yes || echo no
yes
$ inrange 42 3,15,4-7,19-20 && echo yes || echo no
no
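The one-liner relies on two parameter expansions: ${r%-*} removes the shortest trailing "-..." part and ${r#*-} removes the shortest leading "...-" part, and both leave an element without a dash unchanged. A more verbose sketch of the same idea follows; in_range_seq is a hypothetical name, and unlike the one-liner it returns 1 for an empty sequence.

```bash
#!/usr/bin/env bash
# Hypothetical expanded form: succeed if number $1 is in range sequence $2.
in_range_seq() {
    local n=$1 r lo hi
    local -a parts
    IFS=, read -r -a parts <<< "$2"
    for r in "${parts[@]}"; do
        lo=${r%-*}   # text before the dash; the whole element if there is no dash
        hi=${r#*-}   # text after the dash; the whole element if there is no dash
        if (( lo <= n && n <= hi )); then
            return 0
        fi
    done
    return 1
}

in_range_seq 6 3,15,4-7,19-20 && echo yes || echo no
in_range_seq 42 3,15,4-7,19-20 && echo yes || echo no
```

As with the original, this assumes non-negative decimal integers without leading zeros, since a leading minus sign or zero would change how the element is split or evaluated.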
A function based on @JamesBrown's method:
function match_in_range_seq {
(( $# == 2 )) && [[ -n "$(echo -n "$2" | awk -v val="$1" -v RS=, -F- '(NF==1&&$1==val) || (NF==2&&$1<=val&&$2>=val)' - )" ]]
}
Will return 0 (in $?) if the second argument (the range sequence) contains the first argument, 1 otherwise.
Another awk idea using two input (-v) variables:
# use of function wrapper is optional but cleaner for the follow-on test run
in_range() {
awk -v value="$1" -v r="$2" '
BEGIN { n=split(r,ranges,",")
for (i=1;i<=n;i++) {
low=high=ranges[i]
if (ranges[i] ~ "-") {
split(ranges[i],x,"-")
low=x[1]
high=x[2]
}
if (value >= low && value <= high) {
print value,"found in the range:",ranges[i]
exit
}
}
}'
}
NOTE: the exit assumes no overlapping ranges, ie, value will not be found in more than one 'range'
Take for a test spin:
ranges='3,15,4-7,19-20'
for value in 1 6 15 32
do
echo "########### value = ${value}"
in_range "${value}" "${ranges}"
done
This generates:
########### value = 1
########### value = 6
6 found in the range: 4-7
########### value = 15
15 found in the range: 15
########### value = 32
NOTES:
OP did not mention what to generate as output if no range match is found; code could be modified to output a 'not found' message as needed
in a comment OP mentioned possibly running the search for a number of values; code could be modified to support such a requirement but would need more input (eg, format of list of values, desired output and how to be used/captured by calling process, etc)

How to iterate a dictionary through indirection

I want to implement in bash the following pseudocode
function gen_items() {
dict=$1 # $1 is a name of a dictionary declared globally
for key in $dict[@]
do
echo $key ${dict[$key]}
# process the key and its value in the dictionary
done
}
The best I have come by is
function gen_items() {
dict=$1
tmp="${dict}[@]"
for key in "${!tmp}"
do
echo $key
done
}
This actually only gets the values from the dictionary, but I need the keys as well.
Use a nameref:
show_dict() {
((BASH_VERSINFO[0] < 4 || ((BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 3)))) &&
{ printf '%s\n' "Need Bash version 4.3 or above" >&2; exit 1; }
declare -n hash=$1
for key in "${!hash[@]}"; do
echo key=$key
done
}
declare -A h
h=([one]=1 [two]=2 [three]=3)
show_dict h
Output:
key=two
key=three
key=one
See:
How can I use variable variables (indirect variables, pointers, references) or associative arrays?
Shell Parameters
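Since the question asks for the values as well as the keys, the nameref approach can be extended as sketched below. This assumes Bash 4.3 or later; gen_items follows the question's function name, while dict_ref is a hypothetical local name that must differ from the name of the array being passed in.

```bash
#!/usr/bin/env bash
# Print "key value" for every entry of the associative array named by $1.
gen_items() {
    local -n dict_ref=$1     # nameref: dict_ref aliases the caller's array
    local key
    for key in "${!dict_ref[@]}"; do
        printf '%s %s\n' "$key" "${dict_ref[$key]}"
    done
}

declare -A h=([one]=1 [two]=2 [three]=3)
gen_items h | sort           # sort only makes the order predictable
```

Associative arrays have no defined iteration order, which is why the output is piped through sort here.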

Aggregating csv file in bash script

I have csv file with multiple lines. Each line has the same number of columns. What I need to do is to group those lines by a few specified columns and aggregate data from other columns. Example of input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For above example I need to group lines by first two columns. From 3rd column I need to choose the min value, for 4th column max value, and 5th column should have the sum. So, for such input file I need output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
# if associative array is still empty for this entry
# fill with current strings/value
if [[ -z ${p[$p1,$p2]} ]]; then
p[$p1,$p2]="$p1,$p2"
ds[$p1,$p2]="$d1"
de[$p1,$p2]="$d2"
sum[$p1,$p2]="$s"
continue
fi
# compare strings, set new strings and sum value
if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
[[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
[[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
sum[$p1,$p2]=sum[$p1,$p2]+s
fi
done < file
# print content of all associative arrays with keys from associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
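The date comparison in the script rests on turning each date into a fixed-width number that sorts chronologically. A standalone sketch of that conversion is shown below; the 10# prefix is an addition, assuming that days may appear with a leading zero (the sample data has none), in which case 08 and 09 would otherwise be rejected as invalid octal numbers.

```bash
#!/usr/bin/env bash
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6
                      [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)

# Convert D-Mon-YYYY into YYYYMMDD, e.g. 5-Jun-2011 into 20110605.
date2num() {
    local d m y
    IFS=- read -r d m y <<< "$1"
    # 10#$d forces base 10 so that a day such as 09 is not parsed as octal.
    printf '%d%02d%02d\n' "$y" "${month2num[$m]}" "$((10#$d))"
}

date2num 5-Jun-2011    # prints 20110605
date2num 09-Sep-2017   # prints 20170909
```

Because every result has eight digits, comparing two results as strings with < and > inside [[ ... ]] gives the same answer as comparing the dates.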
With awk + sort
awk -F',|-' '
BEGIN{
A["Jan"]="01"
A["Feb"]="02"
A["Mar"]="03"
A["Apr"]="04"
A["May"]="05"
A["Jun"]="06"
A["Jul"]="07"
A["Aug"]="08"
A["Sep"]="09"
A["Oct"]="10"
A["Nov"]="11"
A["Dec"]="12"
}
{
B[$1","$2]=B[$1","$2]+$9
z=sprintf( "%.2d",$3)
y=sprintf("%s",$5 A[$4] z)
if(!start[$1$2])
{
end[$1$2]=0
start[$1$2]=99999999
}
if (y < start[$1$2])
{
start[$1$2]=y
C[$1","$2]=$3"-"$4"-"$5
}
x=sprintf( "%.2d",$6)
w=sprintf("%s",$8 A[$7] x)
if(w > end[$1$2] )
{
end[$1$2]=w
D[$1","$2]=$6"-"$7"-"$8
}
}
END{
for (i in B)print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
Extended GNU awk solution:
awk -F, 'function parse_date(d_str){
split(d_str, d, "-");
t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
return t
}
BEGIN{ m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
k=$1 SUBSEP $2;
if (k in a){
if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
} else {
a[k]["min"]=$3; a[k]["max"]=$4
}
a[k]["sum"]+= $5
}
END{
for (i in a) {
split(i, j, SUBSEP);
print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
}
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

Resources