Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have an output from a program that I need to convert to columns.
I know you can do this with awk or sed but I can't seem to figure out how.
This is how the output looks:
insert_job: aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa job_type: c
box_name: sss-eee-ess-saturday
command: $${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess
machine: qqqq-eee-cntl
owner: system_uu_gggg_p#ad
permission: gx,wx
condition: s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess)
std_out_file: >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out
std_err_file: >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err
max_run_alarm: 420
alarm_if_fail: 1
application: qqqq-M9887
I need it to look like this:
insert_job: job_type: box_name: command: machine:
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa c sss-eee-ess-saturday $${qqqq-eee-eat-cmd} qqqq-eee-cntl
Or like this:
insert_job:;job_type:;box_name:;command:;machine:;
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa;c;sss-eee-ess-saturday;$${qqqq-eee-eat-cmd};qqqq-eee-cntl;
Basically either already with TAB separated or in CSV format.
Thanks for the help
You haven't shown us what the actual expected output looks like so I've assumed you want it tab-separated and unquoted and I've made some other assumptions about how your input records are separated, etc.:
$ cat tst.awk
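BEGIN { OFS="\t" }
{
    if ( numTags == 0 ) {
        # The first line of a record holds two pairs: "insert_job: X job_type: Y"
        tag = $1
        val = $2
        sub(/:$/,"",tag)
        tags[++numTags] = tag
        tag2val[tag] = val
        # Strip the first pair so only "job_type: Y" is left in $0
        sub(/[^:]+: +[^ ]+ +/,"")
    }
    # Every line (or remainder of the first line) is now a single "tag: value" pair
    tag = val = $0
    sub(/: .*/,"",tag)
    sub(/[^:]+: /,"",val)
    tags[++numTags] = tag
    tag2val[tag] = val
}
# "application" is assumed to be the last line of each record
tag == "application" {
    if ( !cnt++ ) {
        # Print the header row once, before the first record
        for (tagNr=1; tagNr<=numTags; tagNr++ ) {
            tag = tags[tagNr]
            printf "%s%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    }
    for (tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "%s%s", val, (tagNr<numTags ? OFS : ORS)
    }
    numTags = 0
}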
$ awk -f tst.awk file
insert_job job_type box_name command machine owner permission condition std_out_file std_err_file max_run_alarm alarm_if_fail application
aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa c sss-eee-ess-saturday $${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess qqqq-eee-cntl system_uu_gggg_p#ad gx,wx s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess) >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out >E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err 420 1 qqqq-M9887
If you want something easier for Excel to handle, this will produce a CSV that Excel will be able to open just by double clicking on the output file name:
$ cat tst.awk
BEGIN { OFS="," }
{
    if ( numTags == 0 ) {
        tag = $1
        val = $2
        sub(/:$/,"",tag)
        tags[++numTags] = tag
        tag2val[tag] = val
        sub(/[^:]+: +[^ ]+ +/,"")
    }
    tag = val = $0
    sub(/: .*/,"",tag)
    sub(/[^:]+: /,"",val)
    tags[++numTags] = tag
    tag2val[tag] = val
}
tag == "application" {
    if ( !cnt++ ) {
        for (tagNr=1; tagNr<=numTags; tagNr++ ) {
            tag = tags[tagNr]
            printf "\"%s\"%s", tag, (tagNr<numTags ? OFS : ORS)
        }
    }
    for (tagNr=1; tagNr<=numTags; tagNr++ ) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
    }
    numTags = 0
}
$ awk -f tst.awk file
"insert_job","job_type","box_name","command","machine","owner","permission","condition","std_out_file","std_err_file","max_run_alarm","alarm_if_fail","application"
"aaa-bbb-ess-qqqqqqq-aaaaaa-aaaaaa","c","sss-eee-ess-saturday","$${qqqq-eee-eat-cmd} $${qqqq-eee-nas-cntrl-dir}\eee\CMS\CMS_C3.xml $${qqqq-eee-nas-log}\eee\AFG\AFG_Build_Qwer.log buildProcess","qqqq-eee-cntl","system_uu_gggg_p#ad","gx,wx","s(qqqq-rtl-etl-40-datamart-load-cms) & s(qqqq-eee-ess)",">E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.out",">E\:\gggg\logs\qqqq-eee-ess-saturday-cms-build.err","420","1","qqqq-M9887"
If this is a one-off job, and you don't have to worry about double quotes in the source data, try something like this. I have assumed you want comma-separated values to put in a spreadsheet and the data is in a file called foo.txt.
echo $(sed 's/^\([^:]*\): \(.*\)$/"\1",/g' foo.txt)
echo $(sed 's/^\([^:]*\): \(.*\)$/"\2",/g' foo.txt)
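If the data can contain double quotes, one possible approach is to double them, which is the usual CSV convention. The following is only a sketch under stated assumptions: every line holds exactly one "tag: value" pair, the file holds a single record, and the sample input and the csv variable name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample input: one "tag: value" pair per line, one record.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
box_name: sss-eee-ess-saturday
command: run "quoted arg" now
permission: gx,wx
EOF

csv=$(awk '
    {
        tag = val = $0
        sub(/: .*/, "", tag)        # keep the text before the first ": "
        sub(/^[^:]+: /, "", val)    # keep the text after the first ": "
        gsub(/"/, "\"\"", val)      # CSV convention: double any embedded quote
        hdr = hdr sep "\"" tag "\""
        row = row sep "\"" val "\""
        sep = ","
    }
    END { print hdr; print row }
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$csv"
```

For the sample above this prints a header line followed by one data line in which the quoted argument appears as ""quoted arg"".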
I have an awk script (tst.awk):
NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
{
id = $1
sub(/^[^[:space:]]+[[:space:]]+/,"")
vals[id,numFiles] = $0
gsub(/[^[:space:],]+/,"NA")
naVal[numFiles] = $0
}
END {
for ( idNr=1; idNr<=numIds; idNr++) {
id = ids[idNr]
printf "%s%s", id, OFS
for (fileNr=1; fileNr<=numFiles; fileNr++) {
val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
}
}
}
That is called on the command line with:
awk -f tst.awk master file1 file2 file3 > output.file
(note: there can be a variable number of arguments)
How can I change this script, and command line code, to run it as a bash script?
I have tried (tst_awk.sh):
#!/bin/bash
awk -f "$1" "$2" "$3" "$4"
'NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
{
id = $1
sub(/^[^[:space:]]+[[:space:]]+/,"")
vals[id,numFiles] = $0
gsub(/[^[:space:],]+/,"NA")
naVal[numFiles] = $0
}
END {
for ( idNr=1; idNr<=numIds; idNr++) {
id = ids[idNr]
printf "%s%s", id, OFS
for (fileNr=1; fileNr<=numFiles; fileNr++) {
val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
}
}
}' > output_file
called on command line with:
./tst_awk.sh master file1 file2 file3
I have also tried (tst_awk2.sh):
#!/bin/bash
awk -f master file1 file2 file3
'NR==FNR {
ids[++numIds] = $1","
next
}
FNR==1 { numFiles++ }
...
}
}
}' > output_file
called on command line with:
./tst_awk2.sh
-f needs to be followed by the name of the awk script. You're putting the first argument of the shell script after it.
You can use "$@" to get all the script arguments, so you're not limited to just 4 arguments.
#!/bin/bash
awk -f /path/to/tst.awk "$@" > output_file
Use an absolute path to the awk script so you can run the shell script from any directory.
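If hard-coding an absolute path is inconvenient, one alternative is to let the wrapper work out its own directory and look for the awk script there. This is a sketch, not part of the original answer: it assumes bash (for BASH_SOURCE) and that the awk script sits next to the wrapper. The demo directory, the stand-in awk script and the data files are all illustrative.

```shell
#!/bin/bash
# Demo in a throwaway directory; every file name here is illustrative.
demo=$(mktemp -d)

# A stand-in awk script (not the tst.awk from the question).
printf '%s\n' '{ print FILENAME ": " $0 }' > "$demo/tst.awk"

# The wrapper: find tst.awk beside the wrapper and pass all arguments through.
cat > "$demo/tst_awk.sh" <<'EOF'
#!/bin/bash
script_dir=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" >/dev/null && pwd)
awk -f "$script_dir/tst.awk" "$@"
EOF

# Run the wrapper from a different directory with two file arguments.
mkdir "$demo/data"
echo one > "$demo/data/file1"
echo two > "$demo/data/file2"
result=$(cd "$demo/data" && bash ../tst_awk.sh file1 file2)

printf '%s\n' "$result"
rm -rf "$demo"
```

The wrapper finds the awk script even though it is started from the data directory, and it accepts any number of file arguments.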
If you don't want to use the separate tst.awk, you just put the script as the literal first argument to awk.
#!/bin/bash
awk 'NR==FNR {
    ids[++numIds] = $1","
    next
}
FNR==1 { numFiles++ }
{
    id = $1
    sub(/^[^[:space:]]+[[:space:]]+/,"")
    vals[id,numFiles] = $0
    gsub(/[^[:space:],]+/,"NA")
    naVal[numFiles] = $0
}
END {
    for ( idNr=1; idNr<=numIds; idNr++) {
        id = ids[idNr]
        printf "%s%s", id, OFS
        for (fileNr=1; fileNr<=numFiles; fileNr++) {
            val = ((id,fileNr) in vals ? vals[id,fileNr] : naVal[fileNr])
            printf "%s%s", val, (fileNr<numFiles ? OFS : ORS)
        }
    }
}' "$@" > output_file
You can also make the awk script itself executable by adding a shebang line (adjust the path if awk lives elsewhere on your system, e.g. /usr/bin/awk):
#! /bin/awk -f
NR==FNR {
ids[++numIds] = $1","
next
}...
Don't forget to chmod +x tst.awk
and then run:
$ ./tst.awk master file1 file2 file3 > outfile
I'd prefer a solution that uses bash rather than converting to a dataframe in python, etc as the files are quite big
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on stackoverflow but I guess it's not that common to have CSV with the same id on the same line number
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want but the header needs to be replaced
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
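For a whole folder of files, the header rewrite and the paste step could be wrapped in one shell function. The sketch below assumes every file has exactly one header line and the same number of rows, and that no field contains an embedded comma or newline. merge_csvs is a hypothetical helper name and the two sample files are made up. Unlike the getline version above, the awk part reads each file to the end, but it only keeps the header in memory.

```shell
#!/bin/bash
# merge_csvs FILE...   (hypothetical helper name)
# Prints one header line with every column prefixed by its file's base name,
# then the data rows of all files pasted side by side.
merge_csvs() {
    awk -F, -v OFS=, '
        FNR == 1 {
            fname = FILENAME
            sub(/.*\//, "", fname)        # drop any directory part
            sub(/\.[^.]+$/, "", fname)    # drop the extension
            for (i = 1; i <= NF; i++) $i = fname "_" $i
            hdr = (hdr == "" ? "" : hdr OFS) $0
        }
        END { print hdr }
    ' "$@"
    paste -d, "$@" | tail -n +2           # data rows only, header skipped
}

# Illustrative sample files in a throwaway directory.
demo=$(mktemp -d)
printf '%s\n' 'id,mean' '2870316.0,111.5' '2870317.0,63.8' > "$demo/file1.csv"
printf '%s\n' 'ogc_fid,id' '"1","2870316"' '"2","2870317"' > "$demo/file2.csv"

merged=$(merge_csvs "$demo"/file1.csv "$demo"/file2.csv)
printf '%s\n' "$merged"
rm -rf "$demo"
```

In a real folder the call would be something like merge_csvs *.csv > ../merged.csv, writing the result outside the folder so it is not picked up by the glob.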
I want to get the value (usually a string) of var1 in a file my_file.dat and save this value to x.
I managed to do this using the following command:
x=`awk '$1 == "var1" {print $2}' my_file.dat`
It now turns out that there can be several occurrences of var1 in my_file.dat, e.g.:
Series1
var1 = temp/data/
Series2
var1 = lost/oldfiles/
My question is then how can I get only the value of the 'var1' which is located right after the line 'Series1', such that 'x' returns 'temp/data/'?
Given the sample you posted all you need is:
x=$(awk 'prev=="Series1" {print $NF} {prev=$0}' file)
but more robustly:
x=$(awk '
    { name=value=$0; sub(/[[:space:]]*=.*/,"",name); sub(/[^=]+=[[:space:]]*/,"",value) }
    (prev=="Series1") && (name=="var1") { print value }
    { prev=$0 }
' file)
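The same idea can be parameterized so that the series and variable names are passed in rather than hard-coded. This is a sketch only: get_var is a hypothetical helper name, and it assumes the wanted "name = value" line comes directly after the series line, as in the sample from the question.

```shell
#!/bin/bash
# Sample input taken from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Series1
var1 = temp/data/
Series2
var1 = lost/oldfiles/
EOF

# get_var SERIES NAME FILE   (hypothetical helper name)
# Prints the value of NAME on the line directly after the line SERIES.
get_var() {
    awk -v series="$1" -v want="$2" '
        {
            name = value = $0
            sub(/[[:space:]]*=.*/, "", name)         # text before the "="
            sub(/^[^=]+=[[:space:]]*/, "", value)    # text after the "=" and blanks
        }
        prev == series && name == want { print value; exit }
        { prev = $0 }
    ' "$3"
}

x=$(get_var Series1 var1 "$tmp")
y=$(get_var Series2 var1 "$tmp")
echo "$x"
echo "$y"
rm -f "$tmp"
```

Comparing the whole previous line with == means that Series1 does not accidentally match a line such as Series10.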
What about a two state machine to solve the problem:
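#!/bin/awk -f
BEGIN {
    state = 0;
}
{
    if( state == 0 )
    {
        # state 0: looking for the requested "Series<nseries>" line
        if( index( $1, "Series" nseries ) )
        {
            state = 1
        }
    }
    else
    {
        # state 1: inside the requested series; stop at the next series
        if( index( $1, "Series" ) > 0 )
        {
            exit
        }
        if( index( $1, "var1" ) > 0 )
        {
            # print everything after the first "=", minus leading blanks
            idx = index( $0, "=" )
            str = substr( $0, idx + 1 )
            gsub(/^[ \t]+/, "", str )
            print str
            exit
        }
    }
}
# eof #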
Test file:
Series1
var1 = temp/data/
Series2
var1 = lost/oldfiles/
Series3
var1 = foo/bar/
Series4
var1 = alpha/betta/
Series5
var1 = /foo/this=bad
Series6
var1 = /foo/Series/
Reading var1 from Series1:
x=$(awk -v nseries=1 -f solution.awk -- ./my_file.dat)
echo $x
temp/data/
Reading var1 from Series5:
x=$(awk -v nseries=5 -f solution.awk -- ./my_file.dat)
echo $x
/foo/this=bad
Reading var1 from Series6:
x=$(awk -v nseries=6 -f solution.awk -- ./my_file.dat)
echo $x
/foo/Series/
Hope it Helps!
I have the following input:
adm.cd.rrn.vme.abcd.name = foo
adm.cd.rrn.vme.abcd.test = no
adm.cd.rrn.vme.abcd.id = 123456
adm.cd.rrn.vme.abcd.option = no
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
adm.cd.rrn.vme.xxxx.name = blah
adm.cd.rrn.vme.xxxx.test = no
adm.cd.rrn.vme.xxxx.id = 666666
adm.cd.rrn.vme.xxxx.option = no
How can I extract all the values associated with a specific id?
For example, if I have id == 324523, I'd like it to print the values of name, test, and option:
bar no yes
Is it possible to achieve in a single awk command (or anything similar in bash)?
EDIT: Based on input, here's my solution until now:
MYID=$(awk -F. '/'"${ID}"$'/{print $5}' ${TMP_LIST})
awk -F'[ .]' '{
if ($5 == "'${MYID}'") {
if ($6 == "name") {name=$NF}
if ($6 == "test") {test=$NF}
if ($6 == "option") {option=$NF}
}
} END {print name,test,option}' ${TMP_LIST}
Thanks
$ cat tst.awk
{ rec = rec $0 RS }
/option/ {
if (rec ~ "id = "tgt"\n") {
printf "%s", rec
}
rec = ""
next
}
$ awk -v tgt=324523 -f tst.awk file
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
or if you prefer:
$ cat tst.awk
BEGIN { FS="[. ]" }
$(NF-2) == "id" { found = ($NF == tgt ? 1 : 0); next }
{ rec = (rec ? rec OFS : "") $NF }
$(NF-2) == "option" { if (found) print rec; rec = ""; next }
$ awk -v tgt=324523 -f tst.awk file
bar no yes
First, I use xargs to join each record (4 lines of 3 words each, so 12 words) into a single line; then I select the lines that match the regular expression and print the wanted columns:
cat input | xargs -n 12 | awk '{if($0~/id\s=\s324523\s/){ print $3, $6, $12}}'
A more general solution:
awk 'BEGIN{FS="\\.|\\s"; } #field separator is point \\. or space \\s
{
a[$5"."$6]=$8; #store records in associative array a
if($8=="324523" && $6=="id"){
reg[$5]=1; #if is record found, add to associative array reg
}
}END{
for(k2 in reg){
s=""
for(k in a){
if(k~"^"k2"\\."){ #if record is an element of "reg" then add to output "s"
s=k":"a[k]" "s
}
}
print s;
}
}' input
If your input format is fixed, you can do it this way:
grep -A1 -B2 'id\s*=\s*324523$' file|awk 'NR!=3{printf "%s ",$NF}END{print ""}'
You can add -F'=' to the awk part too.
It could be done with awk alone, but grep saves some typing.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'd like to convert my source file which has the following values:
col1|col2|col3
1|a|desc1
1|a|desc2
1|b|desc3
1|b|desc4
1|b|desc5
2|c|desc6
2|c|desc7
into:
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7
It's basically a column 1 and column 2 duplicate removal but their column 3 will be merged. Looking into awk or perl or sed or bash code - whichever has the minimal processing power consumption is preferred.
$ cat tst.awk
BEGIN { FS="|" }
{
    curr = $1 FS $2
    if (curr == prev) {
        rec = rec " " $3
    }
    else {
        if (rec) print rec
        rec = $0
    }
    prev = curr
}
END { if (rec) print rec }
$ awk -f tst.awk file
col1|col2|col3
1|a|desc1 desc2
1|b|desc3 desc4 desc5
2|c|desc6 desc7
Here is an awk one-liner that does the merge and keeps the original order:
awk -F'|' '{k=$1FS$2;if(a[k])a[k]=a[k] OFS $3;else{a[k]=$0;b[++i]=k}}
END{for(x=1;x<=i;x++)print a[b[x]]}' file
Here's a perl solution:
open $fh, "<", "yourfile.txt";
%h = ();
$head = <$fh>;
while (<$fh>) {
if ($_ =~ /(\d\|[a-z]\|)(.*)/) {
$h{$1} .= "$2 ";
}
}
print $head;
foreach (sort keys %h) {
print "$_$h{$_}\n";
}
You can use a regex to get the combined col1 and col2 and store it as a hash key, then append the col3 values to that key whilst looping over the rest of the file.
Perl from the command line:
perl -lne'
  /(.+\|)(.+)/ or next;
  $h{$1} or push @r, $1;
  push @{ $h{$1} }, $2;
  END { print $_, "@{$h{$_}}" for @r }
' file