I'd like to convert my source file which has the following values:
col1|col2|col3
1|a|desc1
1|a|desc2
1|b|desc3
1|b|desc4
1|b|desc5
2|c|desc6
2|c|desc7
into:
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7
It's basically a removal of duplicates on column 1 and column 2, but with their column 3 values merged. I'm looking for awk, perl, sed, or bash code - whichever consumes the least processing power is preferred.
$ cat tst.awk
BEGIN { FS="|" }
{
    curr = $1 FS $2
    if (curr == prev) {
        rec = rec " " $3
    }
    else {
        if (rec) print rec
        rec = $0
    }
    prev = curr
}
END { if (rec) print rec }
$ awk -f tst.awk file
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7
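Note this relies on duplicate col1/col2 pairs being adjacent, as in your sample. If the input might not be grouped, one option (a sketch) is to sort the body first while keeping the header line in place:
$ (head -n 1 file; tail -n +2 file | sort -t'|' -k1,1 -k2,2) | awk -f tst.awk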
Here is an awk one-liner that does the merge and keeps the order:
awk -F'|' '{k=$1FS$2;if(a[k])a[k]=a[k] OFS $3;else{a[k]=$0;b[++i]=k}}
END{for(x=1;x<=i;x++)print a[b[x]]}' file
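For reference, a[k] collects each group's merged record keyed by col1|col2, while b[] remembers first-seen key order, so the output preserves the input order:
$ awk -F'|' '{k=$1FS$2;if(a[k])a[k]=a[k] OFS $3;else{a[k]=$0;b[++i]=k}} END{for(x=1;x<=i;x++)print a[b[x]]}' file
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7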
Here's a perl solution:
open $fh, "<", "yourfile.txt";
%h = ();
$head = <$fh>;
while (<$fh>) {
    if ($_ =~ /(\d\|[a-z]\|)(.*)/) {
        $h{$1} .= "$2 ";
    }
}
print $head;
foreach (sort keys %h) {
    print "$_$h{$_}\n";
}
You can use a regex to get the combined col1 and col2 and store it as a hash key, then append the col3 values to that key whilst looping over the rest of the file.
Perl from the command line:
perl -lne'
    /(.+\|)(.+)/ or next;
    $h{$1} or push @r, $1;
    push @{ $h{$1} }, $2;
    END { print $_, "@{$h{$_}}" for @r }
' file
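With the sample input, this reproduces the desired output (the header line happens to match the same regex, so it is carried through first):
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7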
I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header, save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file each column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
Edit: I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced.
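For reference, with the sample files, paste alone leaves the original unprefixed headers on the first line, so duplicate column names like id can't be told apart:
$ paste -d, file1.csv file2.csv | head -n 1
id,max,mean,90,ogc_fid,id,_sum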
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    # strip the extension from the current file name ...
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)
    # ... and prefix every header field with it
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    # read only the first (header) line of each input file
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
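A note on how that command line works: the awk script prints only the combined, prefixed header (it reads just the first line of each file with getline, then exits), while paste supplies all the data rows and tail -n +2 drops the files' original, unprefixed header line.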
I have an output from a program that I need to convert to columns.
I know you can do this with awk or sed but I can't seem to figure out how.
This is how the output looks:
insert_job: aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa job_type: c
box_name: sss-eee-ess-saturday
command: $${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess
machine: qqqq-eee-cntl
owner: system_uu_gggg_p#ad
permission: gx,wx
condition: s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess)
std_out_file: >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out
std_err_file: >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err
max_run_alarm: 420
alarm_if_fail: 1
application: qqqq-M9887
I need it to look like this:
insert_job: job_type: box_name: command: machine:
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa c sss-eee-ess-saturday $${qqqq-eee-eat-cmd} qqqq-eee-cntl
Or like this:
insert_job:;job_type:;box_name:;command:;machine:;
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa;c;sss-eee-ess-saturday;$${qqqq-eee-eat-cmd};qqqq-eee-cntl;
Basically either TAB-separated or in CSV format.
Thanks for the help
You haven't shown us what the actual expected output looks like, so I've assumed you want it tab-separated and unquoted, and I've made some other assumptions about how your input records are separated, etc.:
$ cat tst.awk
BEGIN { OFS="\t" }
{
    if ( numTags == 0 ) {
        # the first line of each record holds two tag/value pairs
        tag = $1
        val = $2
        sub(/:$/,"",tag)
        tags[++numTags] = tag
        tag2val[tag] = val
        sub(/[^:]+: +[^ ]+ +/,"")
    }
    tag = val = $0
    sub(/: .*/,"",tag)
    sub(/[^:]+: /,"",val)
    tags[++numTags] = tag
    tag2val[tag] = val
}
tag == "application" {
    # "application" ends a record: print the header once, then the values
    if ( !cnt++ ) {
        for (tagNr=1; tagNr<=numTags; tagNr++ ) {
            tag = tags[tagNr]
            printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    }
    for (tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    numTags = 0
}
$ awk -f tst.awk file
insert_job job_type box_name command machine owner permission condition std_out_file std_err_file max_run_alarm alarm_if_fail application
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa c sss-eee-ess-saturday $${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess qqqq-eee-cntl system_uu_gggg_p#ad gx,wx s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess) >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err 420 1 qqqq-M9887
If you want something easier for Excel to handle, this will produce a CSV that Excel will be able to open just by double clicking on the output file name:
$ cat tst.awk
BEGIN { OFS="," }
{
    if ( numTags == 0 ) {
        tag = $1
        val = $2
        sub(/:$/,"",tag)
        tags[++numTags] = tag
        tag2val[tag] = val
        sub(/[^:]+: +[^ ]+ +/,"")
    }
    tag = val = $0
    sub(/: .*/,"",tag)
    sub(/[^:]+: /,"",val)
    tags[++numTags] = tag
    tag2val[tag] = val
}
tag == "application" {
    if ( !cnt++ ) {
        for (tagNr=1; tagNr<=numTags; tagNr++ ) {
            tag = tags[tagNr]
            printf "\"%s\"%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    }
    for (tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
    }
    numTags = 0
}
$ awk -f tst.awk file
"insert_job","job_type","box_name","command","machine","owner","permission","condition","std_out_file","std_err_file","max_run_alarm","alarm_if_fail","application"
"aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa","c","sss-eee-ess-saturday","$${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess","qqqq-eee-cntl","system_uu_gggg_p#ad","gx,wx","s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess)",">E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out",">E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err","420","1","qqqq-M9887"
If this is a one-off job and you don't have to worry about double quotes in the source data, try something like this. I have assumed you want comma-separated values to put in a spreadsheet and that the data is in a file called foo.txt.
echo $(sed 's/^\([^:]*\): \(.*\)$/"\1",/g' foo.txt)
echo $(sed 's/^\([^:]*\): \(.*\)$/"\2",/g' foo.txt)
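One caveat: the first input line holds two key/value pairs (insert_job and job_type), and the regex splits only on the first colon, so job_type: c ends up inside the first value. A rough pre-processing step (a sketch, assuming GNU sed's \n in replacements) could break that line in two before running the commands above:
sed 's/ \(job_type:\)/\n\1/' foo.txt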
My files are as below:
File 1:
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5S'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:23'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
File 2:
COL1|COL2|COL3|COL4|COL5
'SR'|'2017-09-01 00:19:13'|'+05:30'|'1A3LA7015L5Q'|'5042449536906016501541'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701550'
'SR'|'2017-09-01 00:19:20'|'+05:30'|'1A3LA7015L6I'|'5042449603146028701555'
Here the primary key is my 5th column; say it is always stored in a variable $var.
The output I want is as follows:
PrimaryKey|Column|File1Value|File2Value
'5042449536906016501541'|COL4|'1A3LA7015L5S'|'1A3LA7015L5Q'
'5042449603146028701555'|COL2|'2017-09-01 00:19:23'|'2017-09-01 00:19:20'
I have tried using the following code:
paste -d '|' File1 File2 | awk -F '|' -v col="pk1" \
'{c=NF/2;for(i=1;i<=c;i++)if($i!=$(i+c))printf "%s|%s|%s|%s \n",$(i+$col+1),$i,$(i+c),$i-$(i+c)}'
However, this is not working as expected.
A script like this using GNU AWK might work:
script.awk
BEGIN { # set up the field separator and sorting:
    FS=OFS="|"
    PROCINFO["sorted_in"]="@ind_str_asc"
}
# skip header lines
FNR == 1 { next }
# store the first file
FNR == NR {
    f1[$5] = $0
    # skip processing of the other rules and read the next line from input
    next
}
# store the second file
{
    f2[$5] = $0
    if( ! ($5 in f1) ) {
        f1[$5] = ""
    }
}
# compare and print the stored information
END {
    print "PrimaryKey", "Column", "File1Value", "File2Value"
    for( k in f1 ) {
        n1 = split( f1[k], arr1, "|" )
        n2 = split( f2[k], arr2, "|" )
        # loop over the field count, not the string length
        for( c = 1; c <= (n1 > n2 ? n1 : n2); c++ ) {
            if( arr1[c] != arr2[c] ) {
                print k, "COL" c, arr1[c], arr2[c]
            }
        }
    }
}
You would run a command like this: awk -f script.awk yourfile1 yourfile2
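With the sample files above, that should produce exactly the desired output (gawk is required, since PROCINFO["sorted_in"] is gawk-specific):
$ awk -f script.awk File1 File2
PrimaryKey|Column|File1Value|File2Value
'5042449536906016501541'|COL4|'1A3LA7015L5S'|'1A3LA7015L5Q'
'5042449603146028701555'|COL2|'2017-09-01 00:19:23'|'2017-09-01 00:19:20'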
Suppose I have a file which I created from two files (one old and one updated) by using cat & sort on the primary key.
File1
102310863||7097881||6845193||271640||06007709532577||||
102310863||7097881||6845123||271640||06007709532577||||
102310875||7092992||6840808||023740||10034500635650||||
102310875||7092992||6840818||023740||10034500635650||||
So the pattern of this file is: line 1 = old value, line 2 = updated value, and so on.
Now I want to process the file in such a way that awk first processes the first two lines of the file, finds the differences, and then moves on to the next two lines.
The process for each field is:
if ($[old record] != $[new record])
    i = [new record]#[old record];
Desired output
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
$ cat tst.awk
BEGIN { FS="[|][|]"; OFS="||" }
NR%2 { split($0,old); next }
{
    for (i=1; i<=NF; i++) {
        if (old[i] != $i) {
            $i = $i "#" old[i]
        }
    }
    print
}
$ awk -f tst.awk file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
This awk could help:
$ awk -F'\\|\\|' '{
    getline new;                        # read the updated line of the pair
    split(new, new_array, "\\|\\|");
    for(i=1;i<=NF;i++) {
        if($i != new_array[i]) {
            $i = new_array[i]"#"$i;     # new#old
        }
    }
} 1' OFS="||" < input_file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
I think you know enough awk to understand the above code, so I'll skip the explanation.
Updated version, and thanks @martin for the double | trick:
$ cat join.awk
BEGIN { new=0; FS="[|]{2}"; OFS="||" }
new==0 {
    split($0, old_data, "[|]{2}")
    new=1
    next
}
new==1 {
    split($0, new_data, "[|]{2}")
    for (i = 1; i <= 7; i++) {
        if (new_data[i] != old_data[i]) new_data[i] = new_data[i] "#" old_data[i]
    }
    print new_data[1], new_data[2], new_data[3], new_data[4], new_data[5], new_data[6], new_data[7]
    new = 0
}
$ awk -f join.awk data.txt
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
I have the following input:
adm.cd.rrn.vme.abcd.name = foo
adm.cd.rrn.vme.abcd.test = no
adm.cd.rrn.vme.abcd.id = 123456
adm.cd.rrn.vme.abcd.option = no
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
adm.cd.rrn.vme.xxxx.name = blah
adm.cd.rrn.vme.xxxx.test = no
adm.cd.rrn.vme.xxxx.id = 666666
adm.cd.rrn.vme.xxxx.option = no
How can I extract all the values associated with a specific id?
For example, if I have id == 324523, I'd like it to print the values of name, test, and option:
bar no yes
Is it possible to achieve in a single awk command (or anything similar in bash)?
EDIT: Based on input, here's my solution until now:
MYID=$(awk -F. '/'"${ID}"'/{print $5}' ${TMP_LIST})
awk -F'[ .]' '{
    if ($5 == "'${MYID}'") {
        if ($6 == "name") {name=$NF}
        if ($6 == "test") {test=$NF}
        if ($6 == "option") {option=$NF}
    }
} END {print name,test,option}' ${TMP_LIST}
Thanks
$ cat tst.awk
{ rec = rec $0 RS }
/option/ {
    if (rec ~ "id = "tgt"\n") {
        printf "%s", rec
    }
    rec = ""
    next
}
$ awk -v tgt=324523 -f tst.awk file
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
or if you prefer:
$ cat tst.awk
BEGIN { FS="[. ]" }
$(NF-2) == "id" { found = ($NF == tgt ? 1 : 0); next }
{ rec = (rec ? rec OFS : "") $NF }
$(NF-2) == "option" { if (found) print rec; rec = ""; next }
$ awk -v tgt=324523 -f tst.awk file
bar no yes
First, I convert each record into a single line with xargs, then I look for lines that contain the regular expression and print the columns I want:
cat input | xargs -n 12 | awk '{if($0~/id\s=\s324523\s/){ print $3, $6, $12}}'
A more general solution:
awk 'BEGIN{ FS="\\.|\\s" }  # field separator is a dot \\. or whitespace \\s
{
    a[$5"."$6]=$8;              # store records in associative array a
    if($8=="324523" && $6=="id"){
        reg[$5]=1;              # if the record is found, add it to associative array reg
    }
}END{
    for(k2 in reg){
        s=""
        for(k in a){
            if(k~"^"k2"\\."){   # if the record belongs to "reg" then add it to output "s"
                s=k":"a[k]" "s
            }
        }
        print s;
    }
}' input
If your input format is fixed, you can do it this way:
grep -A1 -B2 'id\s*=\s*324523$' file|awk 'NR!=3{printf "%s ",$NF}END{print ""}'
You can add -F'=' to the awk part too.
It could be done by awk alone, but grep saves some typing...
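For completeness, a rough awk-only equivalent (a sketch that hardcodes the id and assumes the same fixed name/test/id/option ordering as the sample input):
awk -F' = ' '
    /\.name = /   { name = $2 }                            # remember the latest name value
    /\.test = /   { test = $2 }                            # and test value
    /\.id = /     { id = $2 }                              # id arrives before option
    /\.option = / { if (id == "324523") print name, test, $2 }
' file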