Add unique value from first column before each group - bash

I have following file contents:
T12 19/11/19 2000
T12 18/12/19 2040
T15 19/11/19 2000
T15 18/12/19 2080
How can I get the following output with awk, bash, etc.? I searched for similar examples but didn't find one so far:
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Thanks,
S

Could you please try the following. This code will print the output in the same order in which the first field occurs in Input_file.
awk '
!a[$1]++ && NF{
b[++count]=$1
}
NF{
val=$1
$1=""
sub(/^ +/,"")
c[val]=(c[val]?c[val] ORS:"")$0
}
END{
for(i=1;i<=count;i++){
print b[i] ORS c[b[i]]
}
}
' Input_file
Output will be as follows.
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!a[$1]++ && NF{ ##Checking condition if $1 is NOT present in array a and line is NOT NULL then do following.
b[++count]=$1 ##Creating an array named b whose index is variable count(every time its value increases cursor comes here) and its value is first field of current line.
} ##Closing BLOCK for this condition now.
NF{ ##Checking condition if a line is NOT NULL then do following.
val=$1 ##Creating variable named val whose value is $1 of current line.
$1="" ##Nullifying $1 here of current line.
sub(/^ +/,"") ##Substituting initial space with NULL now in line.
c[val]=(c[val]?c[val] ORS:"")$0 ##Creating an array c whose index is variable val and its value is keep concatenating to its own value with ORS value.
} ##Closing BLOCK for this condition here.
END{ ##Starting END block for this awk program here.
for(i=1;i<=count;i++){ ##Starting a for loop which runs from i=1 to till value of variable count.
print b[i] ORS c[b[i]] ##Printing array b whose index is i and array c whose index is array b value with index i.
}
} ##Closing this program END block here.
' Input_file ##Mentioning Input_file name here.
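The core of the approach above is the (c[val]?c[val] ORS:"")$0 idiom, which appends each line to a per-key string. A stripped-down sketch (the sample data and single key here are illustrative):

```shell
# Sketch: accumulate lines per key, separated by ORS, after stripping
# the key from the front of each line. Sample data is illustrative.
out=$(printf 'T12 a\nT12 b\n' |
  awk '{k=$1; $1=""; sub(/^ +/,""); c[k]=(c[k]?c[k] ORS:"")$0}
       END{print "T12" ORS c["T12"]}')
echo "$out"
```

Nullifying $1 rebuilds the record with OFS, which is why the leading space has to be trimmed before appending.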

Here is a quick awk:
$ awk 'BEGIN{RS="";ORS="\n\n"}{printf "%s\n",$1; gsub($1" +",""); print}' file
How does it work?
Awk knows the concepts of records and fields.
Files are split in records where consecutive records are split by the record separator RS. Each record is split in fields, where consecutive fields are split by the field separator FS.
By default, the record separator RS is set to be the <newline> character (\n) and thus each record is a line. The record separator has the following definition:
RS:
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="".
By default, the field separator is set to be any sequence of blanks. So $1 will point to that particular word we want on the separate line. So we print it with printf, and then we remove any reference to it with gsub.
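A quick way to see RS="" (paragraph mode) in action; the blank line in the input below is what delimits the two records (the groups.txt name is illustrative):

```shell
# Sketch: with RS="" records are separated by blank lines, and newlines
# also act as field separators, so $1 is the first word of each group.
printf 'T12 19/11/19 2000\nT12 18/12/19 2040\n\nT15 19/11/19 2000\nT15 18/12/19 2080\n' > groups.txt
out=$(awk 'BEGIN{RS=""} {print "record " NR " starts with " $1}' groups.txt)
echo "$out"
```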

awk is very flexible and provides a number of ways to solve the same problem. The answers you already have are excellent. Another way to approach the problem is to simply keep a single variable that holds the current field 1 as its value (unset by default). When the first field changes, you simply output the first field as the current heading. Otherwise you output the 2nd and 3rd fields. If a blank line is encountered, simply output a newline.
awk -v h= '
NF < 3 {print ""; next}
$1 != h {h=$1; print $1}
{printf "%s %s\n", $2, $3}
' file
Above are the three rules. If the line is empty (checked with the number of fields being less than three, NF < 3), then output a newline and skip to the next record. The second rule checks whether the first field differs from the current heading variable h; if it does, set h to the new heading and output it. All non-empty records have their 2nd and 3rd fields output.
Result
Just paste the command above at the command line and you will get the desired result, e.g.
awk -v h= '
> NF < 3 {print ""; next}
> $1 != h {h=$1; print $1}
> {printf "%s %s\n", $2, $3}
> ' file
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080

Related

Add Extra Strings Based on count of fields - Sed/Awk

I have data in below format in a text file.
null,"ABC:MNO"
"hjgy","ABC:PQR"
"mn","qwe","ABC:WER"
"mn","qwe","mno","ABC:WER"
All rows should have 4 fields, like row 4. I want the data in the below format.
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
If the row starts with null then null should be replaced by "","","",
If there are only 2 fields then "","", should be added after the 1st string.
If there are 3 fields then "", should be added after the 2nd string.
If there are 4 fields then do nothing.
I am able to handle the 1st scenario by using sed 's/null/\"\",\"\",\"\"/' test.txt
But I don't know how to handle the next 2 scenarios.
Regards.
With perl:
$ perl -pe 's/^null,/"","","",/; s/.*,\K/q("",) x (3 - tr|,||)/e' ip.txt
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
s/^null,/"","","",/ takes care of the null field first
.*,\K matches till last , in the line
\K is helpful to avoid having to put this matching portion back
3 - tr|,|| will give you how many fields are missing (tr's return value is the number of occurrences of , here)
q("",) here q() is used to represent single quoted string, so that escaping " isn't needed
x is the string replication operator
e flag allows you to use Perl code in replacement section
If rows starting with null will always have two fields, then you can also use:
perl -pe 's/.*,\K/q("",) x (3 - tr|,||)/e; s/^null,/"",/'
Similar logic with awk:
awk -v q='"",' 'BEGIN{FS=OFS=","} {sub(/^null,/, q q q);
c=4-NF; while (c--) $NF = q $NF} 1'
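To see the 4-NF padding arithmetic in action, here is a hedged one-off run with two of the sample lines inlined:

```shell
# Sketch: pad each CSV line to 4 fields by prepending "", to the last
# field until the count reaches 4 (mirrors the awk one-liner above).
out=$(printf 'null,"ABC:MNO"\n"mn","qwe","ABC:WER"\n' |
  awk -v q='"",' 'BEGIN{FS=OFS=","} {sub(/^null,/, q q q); c=4-NF; while (c--) $NF = q $NF} 1')
echo "$out"
```

Note that the sub() on $0 re-splits the record, so the null row already has 4 fields by the time c is computed.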
With your shown samples only, please try the following.
awk '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,"\"\",\"\",\"\"")
}
NF==2{
$1=$1",\"\",\"\""
}
NF==3{
$2=$2",\"\""
}
1' Input_file
Or, making " a variable, one could try the following too:
awk -v s1="\"\"" '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,s1 "," s1","s1)
}
NF==2{
$1=$1"," s1 "," s1
}
NF==3{
$2=$2"," s1
}
1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS and OFS to comma here.
}
{
sub(/^null/,"\"\",\"\",\"\"") ##Substituting a leading null with "","","" in the current line.
}
NF==2{ ##If number of fields are 2 then do following.
$1=$1",\"\",\"\"" ##Adding ,"","" after 1st field value here.
}
NF==3{ ##If number of fields are 3 here then do following.
$2=$2",\"\"" ##Adding ,"" after 2nd field value here.
}
1 ##Printing current line here.
' Input_file ##Mentioning Input_file name here.
A solution using awk:
awk -F "," 'BEGIN{ OFS=FS }
{ gsub(/^ /,"",$1)
if($1=="null") print "\x22\x22","\x22\x22","\x22\x22", $2
else if(NF==2) print $1,"\x22\x22","\x22\x22",$2
else if(NF==3) print $1,$2,"\x22\x22",$3
else print $0 }' input
This might work for you (GNU sed):
sed 's/^\s*null,/"",/;:a;ta;s/,/&/3;t;s/.*,/&"",/;ta' file
If the line begins with null replace that field by an empty one i.e. "",.
Reset the substitute success flag by going back to :a using ta (this will only be the case when the first field is null and has been substituted).
If the 3rd field separator exists then all done.
Otherwise, insert an empty field before the last field separator and repeat.
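The loop can be checked with a quick pipe (GNU sed assumed, for \s and the inline :a/ta flow):

```shell
# Sketch: run the sed loop above on two of the sample lines and show
# that short rows are padded with "", before the last field.
out=$(printf 'null,"ABC:MNO"\n"mn","qwe","ABC:WER"\n' |
  sed 's/^\s*null,/"",/;:a;ta;s/,/&/3;t;s/.*,/&"",/;ta')
echo "$out"
```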

Match columns between files and generate file with combination of data in terminal/powershell/command line Bash

I have two .txt files of different lengths and would like to do the following:
If a value in column 1 of file 1 is present in column 1 of file 2, print column 2 of file 2 and then the whole corresponding line from file 1.
I have tried permutations of awk, however I am so far unsuccessful!
Thank you!
File 1:
MARKERNAME EA NEA BETA SE
10:1000706 T C -0.021786390809225 0.519667838651725
1:715265 G C 0.0310128798578049 0.0403763946716293
10:1002042 CCTT C 0.0337857775471699 0.0403300629299562
File 2:
CHR:BP SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO
1:715265 rs12184267 1 715265 0.0039411 G C 0.964671
1:715367 rs12184277 1 715367 0.00394384 A G 0.964588
Desired File 3:
SNP MARKERNAME EA NEA BETA SE
rs12184267 1:715265 G C 0.0310128798578049 0.0403763946716293
Attempted:
awk -F'|' 'NR==FNR { a[$1]=1; next } ($1 in a) { print $3, $0 }' file1 file2
awk 'NR==FNR{A[$1]=$2;next}$0 in A{$0=A[$0]}1' file1 file2
With your shown samples, could you please try the following.
awk '
FNR==1{
if(++count==1){ col=$0 }
else{ print $2,col }
next
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr){
print $2,arr[$1]
}
' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line of file(s).
if(++count==1){ col=$0 } ##Checking if count is 1 then set col as current line.
else{ print $2,col } ##Checking if above is not true then print 2nd field and col here.
next ##next will skip all further statements from here.
}
FNR==NR{ ##This will be TRUE when file1 is being read.
arr[$1]=$0 ##Creating arr with 1st field index and value is current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field present in arr then do following.
print $2,arr[$1] ##Printing 2nd field, arr value here.
}
' file1 file2 ##Mentioning Input_files name here.
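The FNR==NR lookup pattern at the heart of this answer can be sketched with two tiny inline files (the f1/f2 names are illustrative):

```shell
# Sketch: read f1 into an array keyed by column 1, then, while reading
# f2, print column 2 of f2 followed by the matching full line of f1.
printf '1:715265 G C\n10:1002042 CCTT C\n' > f1
printf '1:715265 rs12184267\n1:715367 rs12184277\n' > f2
out=$(awk 'FNR==NR{arr[$1]=$0; next} ($1 in arr){print $2, arr[$1]}' f1 f2)
echo "$out"
```

FNR resets for each input file while NR does not, which is why FNR==NR is true only while the first file is read.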

Append delimiters for implied blank fields

I am looking for a simple solution that gives each line in a CSV file the same number of commas.
e.g.
example of file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
The line with the largest number of commas has 5 commas in this case (line #2), so I want to add commas to all the other lines so that each line has 5 commas.
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
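The trick relies on a general awk rule: assigning to a field, even to itself, forces awk to rebuild $0 with OFS, creating any missing empty fields up to that index. A minimal sketch:

```shell
# Sketch: touching $6 extends NF to 6 and rebuilds the record with
# OFS="," so the missing trailing fields appear as empty strings.
out=$(printf '1,1\n4,4,4,4\n' | awk 'BEGIN{FS=OFS=","} {$6=$6} 1')
echo "$out"
```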
Another take on making all lines in a CSV file have the same number of fields. The number of fields need not be known; the max fields will be calculated and a substring of the needed commas appended to each record, e.g.
awk -F, -v max=0 '{
lines[n++] = $0 # store lines indexed by line number
fields[lines[n-1]] = NF # store number of field indexed by $0
if (NF > max) # find max NF value
max = NF
}
END {
for(i=0;i<max;i++) # form string with max commas
commastr=commastr","
for(i=0;i<n;i++) # loop appended substring of commas
printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
> lines[n++] = $0 # store lines indexed by line number
> fields[lines[n-1]] = NF # store number of field indexed by $0
> if (NF > max) # find max NF value
> max = NF
> }
> END {
> for(i=0;i<max;i++) # form string with max commas
> commastr=commastr","
> for(i=0;i<n;i++) # loop appended substring of commas
> printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
Could you please try the following, a more generic way. This code will work even when the number of fields is not the same throughout your Input_file: reading the file the first time, it gets the maximum number of fields in the whole file; reading it the second time, it resets each line's fields (because OFS is set to ,, if the current line has fewer fields than the nf value, that many commas will be added to the line). Enhanced version of @oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Creating variable nf whose value is set as per the condition: if nf is greater than NF then keep it as it is, else set it to NF.
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Assigning $nf to itself rebuilds the current line and adds comma(s) at the end of the line if NF is less than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
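The two-pass idea can be demonstrated end to end on a small ragged file (the ragged.csv name is illustrative):

```shell
# Sketch: pass 1 (FNR==NR) records the maximum field count; pass 2
# touches $nf on every line so short lines are padded with commas.
printf '1,1\nA,B,C,D,E,F\n3,3,3\n' > ragged.csv
out=$(awk 'BEGIN{FS=OFS=","} FNR==NR{nf=nf>NF?nf:NF; next} {$nf=$nf} 1' ragged.csv ragged.csv)
echo "$out"
```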

awk to input column data from one file to another based on a match

Objective
I am trying to fill out $9 (booking ref) and $10 (client) in file1.csv with information pulled from $2 (booking ref) and $3 (client) of file2.csv, using "CAMPAIGN ID" ($5 in file1.csv and $1 in file2.csv). So where I have a match between the two files based on "CAMPAIGN ID", I want to print the columns of file2.csv into the matching rows of file1.csv.
File1.csv
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,,
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,,
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,,
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,,
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,,
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,,
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,,
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,,
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,,
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,,
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,,
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,,
File2.csv
CAMPAIGN ID,Booking Ref,client
20572431,ref1,1
21489000,ref2,1
23030494,ref3,1
22973424,ref4,1
23150347,ref5,1
20572149,ref6,1
20578983,ref7,2
22243554,ref8,2
20589101,ref9,3
23077785,ref10,3
21811100,ref11,3
23194965,ref12,3
Desired Output
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3
What I've tried
From the research I've done online this appears to be possible using awk and join (How to merge two files using AWK? got me the closest out of what I found online).
I've tried various awk codes I've found online and I can't seem to get them to achieve my goal. Below is the code I've been trying to massage into shape that gets me the closest. At present the code is set up to try and populate just the booking ref, as I presume I can rinse and repeat for the client column. With this code I was able to get it to populate the booking ref, but it required me to move CAMPAIGN ID to $1, and all it did was replace the values.
NOTE: The order for file1.csv won't sync with file2.csv. All rows may be in a different order as shown in this example.
current code
awk -F"," -v OFS=',' 'BEGIN { while (getline < "file2.csv") { f[$1] = $2; } } {print $0, f[$1] }' file1.csv
Can someone confirm where I'm going wrong with this code as I've tried altering the columns in this - and the file - without success? Maybe it's just how I'm understanding the code itself.
Like this:
awk 'BEGIN{FS=OFS=","} NR==FNR{r[$1]=$2;c[$1]=$3;next} NR>1{$9=r[$5];$10=c[$5]} 1' \
file2.csv file1.csv
Explanation in multi line form:
# Set input and output field delimiter to ,
BEGIN{
FS=OFS=","
}
# Total row number is the same as the row number in file
# as long as we are reading the first file, file2.csv
NR==FNR{
# Store booking ref and client id indexed by campaign id
r[$1]=$2
c[$1]=$3
# Skip blocks below
next
}
# From here code runs only on file1.csv
NR>1{
# Set booking ref and client id according to the campaign id
# in field 5
$9=r[$5]
$10=c[$5]
}
# Print the modified line of file1.csv (includes the header line)
{
print
}
Could you please try the following.
awk '
BEGIN{
FS=OFS=","
print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client"
}
FNR==NR && FNR>1{
val=$1
$1=""
sub(/^,/,"")
a[val]=$0
next
}
($5 in a) && FNR>1{
sub(/,*$/,"")
print $0,a[$5]
}
' file2.csv file1.csv
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
FS=OFS="," ##Setting FS(field separator) and OFS(output field separator)as comma here.
print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client"
} ##Closing BEGIN section of this program now.
FNR==NR && FNR>1{ ##Checking condition FNR==NR which will be true when file2.csv is being read.
val=$1 ##Creating variable val whose value is $1 here.
$1="" ##Nullifying $1 here.
sub(/^,/,"") ##Substitute initial comma with NULL in this line.
a[val]=$0 ##Creating an array a whose index is val and value is $0.
next ##next will skip all further statements from here.
} ##Closing BLOCK for condition FNR==NR here.
($5 in a) && FNR>1{ ##Checking if $5 is present in array a this condition will be checked when file1.csv is being read.
sub(/,*$/,"") ##Substituting all commas at last of line with NULL here.
print $0,a[$5] ##Printing current line and value of array a with index $5 here.
} ##Closing BLOCK for above ($5 in a) condition here.
' file2.csv file1.csv ##Mentioning Input_file names here.
Output will be as follows.
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3

Divide each row by max value in awk

I am trying to divide each row by the max value in that row (rows having all columns as NA are left aside), so that from
r1 r2 r3 r4
a 0 2.3 1.2 0.1
b 0.1 4.5 9.1 3.1
c 9.1 8.4 0 5
I get
r1 r2 r3 r4
a 0 1 0.52173913 0.043478261
b 0.010989011 0.494505495 1 0.340659341
c 1 0.923076923 0 0.549450549
I tried to calculate max of each row by executing
awk '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}' file.txt > max.txt
then pasted it as the last column to the file.txt as
paste file.txt max.txt > file1.txt
I am trying to execute a code where the last column will divide all the columns in that line, but first I needed to format each line, hence I am stuck at
awk '{for(i=1;i<NF;i++) printf "%s " $i,$NF}' file1.txt
I am trying to print each combination for that line and then print the next line's combinations on a new line. But I want to know if there is a better way to do this.
awk to the rescue!
$ awk 'NR>1 {m=$2; for(i=3;i<=NF;i++) if($i>m) m=$i;
for(i=2;i<=NF;i++) $i/=m}1' file
r1 r2 r3 r4
a 0 1 0.521739 0.0434783
b 0.010989 0.494505 1 0.340659
c 1 0.923077 0 0.549451
The following awk may help you with the same:
awk '
FNR==1{
print;
next
}
{
len=""
for(i=2;i<=NF;i++){
len=len>$i?len:$i};
printf("%s%s", $1, OFS)
}
{
for(i=2;i<=NF;i++){
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)}
}
' Input_file
Explanation: Adding an explanation here with the solution now:
awk '
FNR==1{ ##FNR==1 is a condition where it will check if it is first line of Input_file then do following:
print; ##printing the current line then.
next ##next is awk out of the box keyword which will skip all further statements now.
}
{
len="" ##variable named len(which contains the greatest value in a line here)
for(i=2;i<=NF;i++){ ##Starting a for loop here starting from 2nd field to till value of NF which means it will cover all the fields on a line.
len=len>$i?len:$i}; ##Creating a variable named len here whose value is $1 if it is NULL and if it is greater than current $1 then it remains same else will be $1
printf("%s%s", $1, OFS) ##Printing the 1st column value here along with space.
}
{
for(i=2;i<=NF;i++){ ##Starting a for loop here whose value starts from 2 to till the value of NF it covers all the field of current line.
printf("%s%s",$i>0?$i/len:0,i==NF?RS:FS)} ##Printing the current field divided by the value of the len variable (which has the maximum value of the current line); it also checks a condition: if the value of i equals NF then print a newline, else print a space.
}
' Input_file ##mentioning the Input_file name here.
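The len=len>$i?len:$i idiom above is a running maximum; a one-line check on row b of the sample data:

```shell
# Sketch: scan fields 2..NF and keep the largest, which for this row
# should be 9.1.
out=$(echo 'b 0.1 4.5 9.1 3.1' |
  awk '{len=""; for(i=2;i<=NF;i++) len=len>$i?len:$i; print len}')
echo "$out"
```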
