increment values in column within file with bash, sed and awk

Please find below an excerpt from one of my files.
1991;1;-7;-3;-9;-4;-7
1991;1;-7;-3;-9;-4;-7
1991;1;-7;-3;-9;-4;-7
1991;2;-14;-11;-14;-4;-14
1991;2;-14;-11;-14;-4;-14
1991;2;-14;-11;-14;-4;-14
1991;3;-7;-3;-15;5;-7
1991;3;-7;-3;-15;5;-7
1991;3;-7;-3;-15;5;-7
1991;4;-15;-9;-21;1;-16
1991;4;-15;-9;-21;1;-16
1991;4;-15;-9;-21;1;-16
1992;1;-12;-6;-19;-2;-12
1992;1;-12;-6;-19;-2;-12
1992;1;-12;-6;-19;-2;-12
1992;2;-16;-7;-22;-12;-15
1992;2;-16;-7;-22;-12;-15
1992;2;-16;-7;-22;-12;-15
1992;3;-22;-15;-25;-16;-24
1992;3;-22;-15;-25;-16;-24
I'm trying, through sed and/or awk, to turn the second column into a running counter: it should increase by 1 on each successive row for as long as the year in the first column remains the same, and restart when the year changes.
The results would be the following:
1991;1;-7;-3;-9;-4;-7
1991;2;-7;-3;-9;-4;-7
1991;3;-7;-3;-9;-4;-7
1991;4;-14;-11;-14;-4;-14
1991;5;-14;-11;-14;-4;-14
1991;6;-14;-11;-14;-4;-14
1991;7;-7;-3;-15;5;-7
1991;8;-7;-3;-15;5;-7
1991;9;-7;-3;-15;5;-7
1991;10;-15;-9;-21;1;-16
1991;11;-15;-9;-21;1;-16
1991;12;-15;-9;-21;1;-16
1992;1;-12;-6;-19;-2;-12
1992;2;-12;-6;-19;-2;-12
1992;3;-12;-6;-19;-2;-12
1992;4;-16;-7;-22;-12;-15
1992;5;-16;-7;-22;-12;-15
1992;6;-16;-7;-22;-12;-15
1992;7;-22;-15;-25;-16;-24
1992;8;-22;-15;-25;-16;-24
I've seen countless examples on Stack Overflow but nothing that gets me close to a solution.
I welcome any suggestions.
Best,

If you always want the 2nd column to be 1 for the line in which the year first appears in column 1, then:
awk -F\; '$1!=l{c=0}{$2=++c}{l=$1}1' OFS=\; input
If you want to maintain whatever was in column 2:
awk -F\; '$1!=l{c=$2}{$2=c++}{l=$1}1' OFS=\; input
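The same logic, spelled out with comments (purely an expanded form of the first one-liner above):
awk -F';' '
$1 != l { c = 0 }     # year in column 1 changed: reset the counter
        { $2 = ++c }  # overwrite column 2 with the incremented counter
        { l = $1 }    # remember the year for the next line
1                     # shorthand for: print the (modified) record
' OFS=';' input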

This could be done more tersely with awk, but pure bash works fine:
last_year=
counter_val=
while IFS=';' read -r year old_counter rest; do
    if [[ $year = "$last_year" ]]; then
        (( ++counter_val ))
    else
        counter_val=1
        last_year=$year
    fi
    printf -v result '%s;' "$year" "$counter_val" "$rest"
    printf '%s\n' "${result%;}"
done <input.txt >output.txt

You simply want to renumber your second column, not add one to it? That is, do you want the second column to count from one onward no matter what it currently contains?
awk -F\; '{
    if ( NR == 1 || year != $1 ) {   # first record, or the year changed
        year = $1
        base = NR - 1                # line number where this year started
    }
    for (count = 1; count <= NF; count++) {
        if ( count == 2 )
            printf "%s", NR - base   # running counter within the year
        else
            printf "%s", $count
        printf "%s", (count < NF ? ";" : "\n")
    }
}' test.txt
Awk is a natural tool to use here because it runs your program inside an implicit loop over the input lines. Plus, its arithmetic is more natural than plain shell's.
NR means Number of Records and NF means Number of Fields. Fields are separated by my -F\; parameter, and the record is the line number in my file. The rest of the program is pretty obvious.
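A quick throwaway example showing NR and NF (not part of the solution):
$ printf '1991;1;-7\n1991;2;-14;-11\n' | awk -F\; '{ print "record " NR " has " NF " fields" }'
record 1 has 3 fields
record 2 has 4 fields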

Using awk, set the FS (field separator) and OFS (output field separator) to ';'. For each new year record, set the val counter to the starting column 2 value, then increment val for each further line with that year.
awk -F';' 'BEGIN{ OFS=";"; y=0 }
{
    if (y != $1)
        { y=$1; val=$2; print }
    else
        { val++; print $1, val, $3, $4, $5, $6, $7 }
}' data_file


How to assign awk result variable to an array and is it possible to use awk inside another awk in loop

I've started to learn bash and am totally stuck on this task. I have a comma-separated csv file with records like:
id,location_id,organization_id,service_id,name,title,email,department
1,1,,,Name surname,department1 department2 department3,,
2,1,,,name Surname,department1,,
3,2,,,Name Surname,"department1 department2, department3",,
etc.
I need to format it this way: the name and surname must start with a capital letter; an email record must be added that consists of the first letter of the name plus the full surname, in lowercase; and a new csv must be created with the records from the old csv with the corrected fields.
I split the csv into fields using awk (because some fields contain commas between quotes, like "department1 department2, department3").
#!/bin/bash
input="$HOME/test.csv"
exec 0<$input
while read line; do
    awk -v FPAT='"[^"]*"|[^,]*' '{
    ...
    }' $input)
done
inside awk {...} (NF=8 for each record), I tried to use certain field values ($1 $2 $3 $4 $5 $6 $7 $8):
#it doesn't work
IFS=' ' read -a name_surname<<<$5 # Field 5 match to *name* in heading of csv
# Could I use inner awk with field values of outer awk ($5) to separate the field value of outer awk $5 ?
# as an example:
# $5="${awk '{${1^}${2^}}' $5}"
# where ${1^} and ${2^} fields of inner awk
name_surname[0]=${name_surname[0]^}
name_surname[1]=${name_surname[1]^}
$5="${name_surname[0]}' '${name_surname[1]}"
email_name=${name_surname[0]:0:1}
email_surname=${name_surname[1]}
domain='@domain'
$7="${email_name,}${email_surname,,}$domain" # match to field 7 *email* in heading of csv
How do I add the field values ($1 $2 $3 $4 $5 $6 $7 $8) to an array and call the join function on each loop iteration to add the record to the new csv file?
function join { local IFS="$1"; shift; echo "$*"; }
result=$(join , ${arr[@]})
echo $result >> new.csv
This may be what you're trying to do (using gawk for FPAT, as you already were doing), but without more representative sample input and the expected output it's a guess:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN {
    OFS = ","
    FPAT = "[^"OFS"]*|\"[^\"]*\""
}
NR > 1 {
    n = split($5,name,/\s+/)
    $7 = tolower(substr(name[1],1,1) name[n]) "@example.com"
    print
}
' "${@:--}"
$ ./tst.sh test.csv
1,1,,,Name surname,department1 department2 department3,nsurname@example.com,
2,1,,,name Surname,department1,nsurname@example.com,
3,2,,,Name Surname,"department1 department2, department3",nsurname@example.com,
I put the awk script inside a shell script since that looks like what you want; obviously you don't need to do that, you could just save the awk script in a file and invoke it with awk -f.
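As a side note, FPAT describes what a field looks like rather than what separates fields, which is exactly what makes the quoted-comma case work. A minimal illustration (gawk only):
$ echo 'a,"b,c",d' | gawk 'BEGIN{ FPAT = "[^,]*|\"[^\"]*\"" } { print NF; print $2 }'
3
"b,c"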
Ed Morton's answer above works completely.
In case it is helpful for someone, I added one more check: if the CSV file contains more than one email address with the same name, an index number is appended to the email's local part, and the output is sent to a file.
#!/usr/bin/env bash
input="$HOME/test.csv"
exec 0<$input
awk '
BEGIN {
    OFS = ","
    FPAT = "[^"OFS"]*|\"[^\"]*\""
}
(NR == 1) { print }                        # header of csv
(NR > 1) {
    if (length($0) > 1) {                  # exclude empty lines
        count = 0
        n = split($5, name, /\s+/)
        email_local_part = tolower(substr(name[1],1,1) name[n])
        # array stores emails from csv file
        a[i++] = email_local_part
        # find amount of occurrences of the same email address
        for (el in a) {
            ret = match(a[el], email_local_part)
            if (ret == 1) { count++ }
        }
        # add number of occurrence to email address
        if (count == 1) { $7 = email_local_part "@abc.com" }
        else { --count; $7 = email_local_part count "@abc.com" }
        print
    }
}
' "${@:--}" > new.csv

SUM up all values of each row and write the results in a new column using Bash

I have a big file (many columns) that generally looks like:
Gene,A,B,C
Gnai3,2,3,4
P53,5,6,7
H19,4,4,4
I want to sum every row of the data frame and add it as a new column as below:
Gene,A,B,C,total
Gnai3,2,3,4,9
P53,5,6,7,18
H19,4,4,4,12
I tried awk -F, '{sum=0; for(i=1; i<=NF; i++) sum += $i; print sum}' but then I am not able to make a new column for the total counts.
Any help would be appreciated.
Could you please try the following.
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print $0,"total"
  next
}
{
  for(j=2;j<=NF;j++){
    sum+=$j
  }
  $(NF+1)=sum
  sum=""
}
1
' Input_file
2nd solution, adding a solution as per the OP's comment, to print the first column and the sum only:
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print $1,"total"
  next
}
{
  for(j=2;j<=NF;j++){
    sum+=$j
  }
  print $1,sum
  sum=""
}
' Input_file
Can use perl here:
perl -MList::Util=sum0 -F, -lane '
    print $_, ",", ($. == 1 ? "total" : sum0(@F[1..$#F]));
' file
(-a autosplits each line on the -F separator into @F; sum0 returns 0 for an empty list, so a row with no numeric fields gets a 0 total.)
To add a new column, just increment number of columns and assign the new column a value:
NF++; $NF=sum
do:
awk -v OFS=, -F, 'NR>1{sum=0; for(i=1; i<=NF; i++) sum += $i; NF++; $NF=sum } 1'
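Run against the sample above (assuming the data is in a file named file; note that since only NR>1 lines are touched, the header passes through without a total column):
$ awk -v OFS=, -F, 'NR>1{sum=0; for(i=1; i<=NF; i++) sum += $i; NF++; $NF=sum } 1' file
Gene,A,B,C
Gnai3,2,3,4,9
P53,5,6,7,18
H19,4,4,4,12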
Using only bash:
#!/bin/bash
while read -r row; do
    sum=
    if [[ $row =~ (,[0-9]+)+ ]]; then
        numlist=${BASH_REMATCH[0]}
        sum=,$((${numlist//,/+}))
    fi
    echo "$row$sum"
done < datafile
There are a few assumptions here about the rows in the data file: the numeric fields to be summed are non-negative integers, and the first field is not numeric (it does not participate in the sum even if it is). Also, the numeric fields are consecutive, that is, there is no non-numeric field between two numeric fields. And the sum won't overflow.
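To see the arithmetic trick on one row of the sample:
$ row='Gnai3,2,3,4'
$ [[ $row =~ (,[0-9]+)+ ]] && echo "${BASH_REMATCH[0]}"
,2,3,4
$ numlist=',2,3,4'; echo "${numlist//,/+}"
+2+3+4
$ echo $(( +2+3+4 ))
9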

Turning multi-line string into single comma-separated list in Bash

I have this format:
host1,app1
host1,app2
host1,app3
host2,app4
host2,app5
host2,app6
host3,app1
host4... and so on.
I need it like this format:
host1;app1,app2,app3
host2;app4,app5,app6
I have tried this: awk -vORS=, '{ print $2 }' data | sed 's/,$/\n/'
and it gives me this:
app1,app2,app3, without the host in front.
I do not want to show duplicates.
I do not want this:
host1;app1,app1,app1,app1...
host2;app1,app1,app1,app1...
I want this format:
host1;app1,app2,app3
host2;app2,app3,app4
host3;app2,app3
With input sorted on the first column (as in your example; otherwise just pipe it to sort), you can use the following awk command:
awk -F, 'NR == 1 { currentHost=$1; currentApps=$2 }
NR > 1 && currentHost == $1 { currentApps=currentApps "," $2 }
NR > 1 && currentHost != $1 { print currentHost ";" currentApps; currentHost=$1; currentApps=$2 }
END { print currentHost ";" currentApps }'
It has the advantage over the other solutions posted as of this edit that it avoids holding the whole data in memory. This comes at the cost of needing the input to be sorted (and sorting is what would require holding lots of data in memory if the input weren't sorted already).
Explanation:
the first line initializes the currentHost and currentApps variables to the values from the first line of the input
the second line handles a line with the same host as the previous one: the app mentioned on the line is appended to the currentApps variable
the third line handles a line with a different host than the previous one: the info for the previous host is printed, then we reinitialize the variables to the values of the current line of input
the last line prints the info of the current host when we have reached the end of the input
It probably can be refined (so much redundancy!), but I'll leave that to someone more experienced with awk.
$ awk '
BEGIN { FS=","; ORS="" }                              # empty ORS: we control all line breaks ourselves
$1!=prev { print ors $1; prev=$1; ors=RS; OFS=";" }   # new host: line break (skipped before the first host), then the host; next separator is ";"
{ print OFS $2; OFS=FS }                              # separator plus app; subsequent apps on the same host use ","
END { print ors }                                     # final newline (prints nothing on empty input)
' file
host1;app1,app2,app3
host2;app4,app5,app6
host3;app1
Maybe something like this:
#!/bin/bash
declare -A hosts
while IFS=, read -r host app
do
    [ -z "${hosts["$host"]}" ] && hosts["$host"]="$host;"
    hosts["$host"]+=$app,
done < testfile
printf "%s\n" "${hosts[@]%,}" | sort
The script reads the sample data from testfile and writes the result to stdout. Note that associative arrays (declare -A) require bash 4 or later.
You could try this awk script:
awk -F, '{a[$1]=($1 in a?a[$1]",":"")$2}END{for(i in a) printf "%s;%s\n",i,a[i]}' file
The script creates an entry in the array a for each unique element in the first column and appends to that entry every element from the second column.
Once the file is parsed, the content of the array is printed. Note that for (i in a) visits the entries in an unspecified order.

Find which row has fewer columns using awk

I have a file where 4 fields are expected for each row. If a row has fewer fields, then I want to write that information to a logfile together with the row number.
Filed1line1| Filed2line1| Filed3line1| Filed4line1
Filed1line2| Filed2line2|
Filed1line3| Filed2line3| Filed3line3| Filed4line3
Something like - Row number 2 is having 3 fields for file a.txt
Can we achieve this using awk?
Actually I am using the below code snippet. If the number of fields is not 4, then I write the row to a bad file; that part works fine. But I am unable to write the NR value to the log.
awk -F'|' -v DGFNM="$IN_DIR$DGFNAME" -v DBFNM="$IN_DIR$DBFNAME" '
$1 == "DTL" {
    if (NF == 4) {
        print substr($0, 5) > DGFNM
    } else {
        print > DBFNM
        print NR >> $logfile
    }
}
' "$IN_DIR$IN_FILE"
Easy: NF is the number of fields in the record and NR is the record number.
Something like: awk '{ if (NF < 4) { print "Row " NR " has " NF " fields"; } }' - there are shorter ways, but I prefer longer code that is easier to read ;-)
See this question for some info on printing to different output files: is it possible to print different lines to different output files using awk
To answer your edited question: $logfile is inside the single quotes, so it is not expanded to your shell variable logfile, and it is not an awk variable either. Try print NR >> "some_file"; in the awk script, and then rename some_file to $logfile later.
Another option would be to generate the awk script with the expanded $logfile already in place, instead of trying to do it inline. Or pass the name into awk as a variable with -v.
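For example, a sketch of the variable-passing route (the logfile path and message wording here are assumptions, adapt to your naming):
logfile="$IN_DIR/bad_rows.log"   # assumed path
awk -F'|' -v logfile="$logfile" '
NF != 4 {
    print "Row number " NR " is having " NF " fields for file " FILENAME > logfile
}
' a.txt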

Awk number comparison in Bash

I'm trying to print the records from line number 10 to 15 of the input file number_src. I tried the code below, but it prints all records irrespective of line number.
awk '{
count++
if ( $count >= 10 ) AND ( $count <= 15 )
{
printf "\n" $0
}
}' number_src
awk is not bash, just like C is not bash; they are separate tools/languages with their very own syntax and semantics:
awk 'NR>=10 && NR<=15' number_src
Get the book Effective Awk Programming, by Arnold Robbins.
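If the file is large, a small variant stops reading as soon as the range has been printed (same output for this task):
awk 'NR > 15 { exit } NR >= 10' number_src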
Two issues explain why your script is not working:
the logical AND should be &&.
use count as the variable name when referencing it, not $count.
Here is a working version:
awk '{
    count++
    if ( count >= 10 && count <= 15 )
    {
        print $0
    }
}' number_src
As stated in the other answer, NR is the awk way to do this very task.
For further information, please see the relevant documentation entries about boolean expressions and using variables.
