awk parse data based on value in middle of text - bash

I have the following input:
adm.cd.rrn.vme.abcd.name = foo
adm.cd.rrn.vme.abcd.test = no
adm.cd.rrn.vme.abcd.id = 123456
adm.cd.rrn.vme.abcd.option = no
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
adm.cd.rrn.vme.xxxx.name = blah
adm.cd.rrn.vme.xxxx.test = no
adm.cd.rrn.vme.xxxx.id = 666666
adm.cd.rrn.vme.xxxx.option = no
How can I extract all the values associated with a specific id?
For example, if I have id == 324523, I'd like it to print the values of name, test, and option:
bar no yes
Is it possible to achieve this in a single awk command (or something similar in bash)?
EDIT: Based on the input, here's my solution so far:
MYID=$(awk -F. '/'"${ID}"'/{print $5}' "${TMP_LIST}")
awk -F'[ .]' '{
if ($5 == "'"${MYID}"'") {
if ($6 == "name") {name=$NF}
if ($6 == "test") {test=$NF}
if ($6 == "option") {option=$NF}
}
} END {print name,test,option}' "${TMP_LIST}"
Thanks
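For what it's worth, the lookup can also be done in a single pass by handing the id to awk with -v instead of shell interpolation (a sketch, assuming ID and TMP_LIST are set as in the snippet above):
awk -F'[ .]' -v id="${ID}" '
$6 == "id" && $NF == id { myid = $5 }       # remember which group has the target id
{ v[$5"."$6] = $NF }                        # store every value, keyed by group.field
END { print v[myid".name"], v[myid".test"], v[myid".option"] }
' "${TMP_LIST}"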

$ cat tst.awk
{ rec = rec $0 RS }     # collect the lines of the current record
/option/ {              # "option" is the last line of each record
    if (rec ~ "id = "tgt"\n") {
        printf "%s", rec
    }
    rec = ""
    next
}
$ awk -v tgt=324523 -f tst.awk file
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
or if you prefer:
$ cat tst.awk
BEGIN { FS="[. ]" }
$(NF-2) == "id" { found = ($NF == tgt ? 1 : 0); next }        # does this record have the target id?
{ rec = (rec ? rec OFS : "") $NF }                            # append each value to the record
$(NF-2) == "option" { if (found) print rec; rec = ""; next }  # last line: print and reset
$ awk -v tgt=324523 -f tst.awk file
bar no yes

First, I convert each record into a single line with xargs; then I look for lines that match the regular expression and print the relevant columns:
cat input | xargs -n 12 | awk '{if($0~/id\s=\s324523\s/){ print $3, $6, $12}}'
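Note that \s in the regular expression is a GNU awk extension; a portable sketch of the same idea compares the ninth whitespace-separated token (the id value) directly:
xargs -n 12 < input | awk '$9 == "324523" {print $3, $6, $12}'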
A more general solution:
awk 'BEGIN{FS="\\.|\\s"} # field separator is a dot (\.) or whitespace (\s, a GNU awk extension)
{
    a[$5"."$6]=$8; # store records in associative array a
    if($8=="324523" && $6=="id"){
        reg[$5]=1; # record found: remember its group in associative array reg
    }
}
END{
    for(k2 in reg){
        s=""
        for(k in a){
            if(k~"^"k2"\\."){ # if the record belongs to a group in "reg", add it to the output "s"
                s=k":"a[k]" "s
            }
        }
        print s;
    }
}' input

If your input format is fixed, you can do it this way:
grep -A1 -B2 'id\s*=\s*324523$' file|awk 'NR!=3{printf "%s ",$NF}END{print ""}'
You can add -F'=' to the awk part too.
It could be done with awk alone, but grep saves some typing...
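For reference, here is a rough awk-only equivalent under the same fixed-format assumption (my sketch, not from the original answer):
awk -F' = ' '$1 ~ /\.name$/{n=$2} $1 ~ /\.test$/{t=$2} $1 ~ /\.id$/{id=$2}
             $1 ~ /\.option$/ && id=="324523"{print n, t, $2}' file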

Related

bash - add columns to csv rewrite headers with prefix filename

I'd prefer a solution that uses bash rather than converting to a dataframe in python, etc., as the files are quite big.
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions, so I need to rewrite the name of each added column with the filename as a prefix, to keep track of which file each column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want, but the header needs to be replaced
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {                        # first line of each file: prefix its header fields
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)    # strip the extension
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }   # paste this file's line onto the row
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {   # read just the header line of each file
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
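If you'd rather build the header without awk at all, here is a rough shell-only sketch of the same split (my assumption: the filenames shown above):
for f in file1.csv file2.csv; do
    p="${f%.*}"                                     # filename without extension
    head -n1 "$f" | sed -e "s/^/${p}_/" -e "s/,/,${p}_/g"
done | paste -sd, -                                 # join the prefixed header lines into one
paste -d, file1.csv file2.csv | tail -n +2          # then append the data rows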

how to find common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now, I want to extract from file1 only those columns whose headers match the ones in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with the same.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
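Run it with the header file first, the same argument order as the one-liner above:
$ awk -f tst.awk File2 File1
id|name|country
1|abc|xyz
2|asd|uio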
With (lots of) unix pipes, as Doug McIlroy intended... here p numbers a file's header fields, join matches the two header lists by name, and the matched field positions feed cut:
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
Solution using bash>4:
# assumes $file1 (the data file) and $file2 (the header file) are already set
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find the idxes we want to output, i.e. the mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
    for j in $(seq 0 $((${#headers1[@]}-1))); do
        if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
            idx+=($j)
            break
        fi
    done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -r -a vals; do
    # ignore the first (header) line
    if $isfirst; then
        isfirst=false
        continue
    fi
    # keep from the line only the columns at the idx indices
    tmp=()
    for i in "${idx[@]}"; do
        tmp+=("${vals[$i]}")
    done
    # join output with '|'
    join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of columns in file1 (I changed the order to demonstrate):
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
n=split($0,a)
next
}
NR==2 { # file2 header
for(i=1;i<=NF;i++)
b[$i]=i
}
{ # output part
for(i=1;i<=n;i++)
printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version using cut for the output can be found in the revision history.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on its first line.
It does the output differently, however. Instead of constructing a string, it shifts the columns to the left when it does not want to include a particular field, then shrinks NF. (Note that decreasing NF to drop trailing fields works in gawk and mawk, but POSIX leaves the behavior unspecified.)
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}
{
    skip = 0
    pos = 1
    for (i = 1; i <= NF; ++i)
        if (!select[i]) {            # we don't want this field
            ++skip
            $pos = $(pos + skip)     # shift fields left
        } else
            ++pos
    NF -= skip                       # adjust number of fields
    print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

AWK: increment a field based on values from previous line

Given the following input for AWK:
10;20;20
8;41;41
15;52;52
How could I increase/decrease the values so that:
$1 = remains unchanged
$2 = $2 of previous line + $1 of previous line + 1
$3 = $3 of previous line + $1 of previous line + 1
So the desired output would be:
10;20;20
8;31;31
15;40;40
I need to auto-increment and loop over the lines, using associative arrays, but it's confusing for me.
Surely, this doesn't work as desired:
#!/bin/awk -f
BEGIN { FS = ";" }
{
print ln, st, of
ln=$1
st=$2 + ln + 1
of=$3 + ln + 1
}
With awk:
awk -F";" -v OFS=";" 'NR!=1{ $2=a[2]+a[1]+1; $3=a[3]+a[1]+1 } { split($0,a,FS) } 1' file
We split each line into an array so that, when processing the next line, the stored values are available. For example, on the second line a holds (10, 20, 20), so $2 becomes 20+10+1 = 31.
Test:
10;20;20
8;31;31
15;40;40
The following awk may help you with the same.
awk -F";" '
FNR==1{
val=$1;
val1=$2;
val2=$3;
print;
next
}
{
$2=val+val1+1;
$3=val+val2+1;
print;
val=$1;
val1=$2;
val2=$3;
}' OFS=";" Input_file
For your given Input_file, the output will be as follows.
10;20;20
8;31;31
15;40;40
awk 'BEGIN{
    FS = OFS = ";"
}
FNR>1{
    $2 = p2 + p1 + 1
    $3 = p3 + p1 + 1
}
{
    p1=$1; p2=$2; p3=$3
}1
' infile
Input:
$ cat infile
10;20;20
8;41;41
15;52;52
Output:
awk 'BEGIN{FS=OFS=";"}FNR>1{$2=p2+p1+1; $3=p3+p1+1 }{p1=$1; p2=$2; p3=$3}1' infile
10;20;20
8;31;31
15;40;40
Or store only the fields of interest:
awk -v myfields="2,3" '
BEGIN{
    FS=OFS=";";
    split(myfields,t,/,/)
}
{
    for(i in t)
    {
        if(FNR>1)
        {
            $(t[i]) = a[t[i]] + a[1] + 1
        }
        a[t[i]] = $(t[i])
    }
    a[1] = $1
}1' infile
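The myfields variable picks which fields get adjusted; for example (the same script as a one-liner, my illustration), myfields="3" rewrites only the third field:
$ awk -v myfields="3" 'BEGIN{FS=OFS=";"; split(myfields,t,/,/)} {for(i in t){if(FNR>1){$(t[i])=a[t[i]]+a[1]+1} a[t[i]]=$(t[i])} a[1]=$1}1' infile
10;20;20
8;41;31
15;52;40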

Find strings from one file in another file

I want to find the keywords from FILE2 in each column of FILE1, and print an empty field if a keyword from FILE2 is not in FILE1, regardless of the separators.
FILE1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
FILE2
TRS-0
TWR
UJU
GFT-8
OUTPUT
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
SCRIPT so far
(This script finds exact matches of FILE2 keys against FILE1 columns, with = as the separator. I can't figure out how to test whether a string from FILE2 is contained in a column of FILE1.)
BEGIN{FS=OFS="|"}
NR==FNR{a[++i]=$1;next}
{
    d=""
    delete b
    for(j=1;j<=NF;j++){
        split($j,c,"-")
        b[c[1]]=$j;
    }
    for(j=1;j<=i;j++){
        d=d (d==""?"":OFS) (a[j] in b?b[a[j]]:"")
    }
    print d
}
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { sub(/[^[:alpha:]].*/,""); keys[++numKeys]=$0; next }   # keep just the leading alpha part of each keyword
{
    delete key2val
    for (i=1;i<=NF;i++) {
        for (keyNr=1; keyNr<=numKeys; keyNr++) {
            key = keys[keyNr]
            if ( $i ~ ("="key"-|^"key"=") ) {    # matches XYZ=TRS-000 style or UJU=909 style fields
                key2val[key] = $i
            }
        }
    }
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s%s", key2val[key], (keyNr<numKeys?OFS:ORS)
    }
}
$ awk -f tst.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
Based on your sample input, there are two patterns for determining the keyword after the split. This solution sets index i accordingly.
BEGIN {FS=OFS="|"}
NR==FNR {a[$1]=NR; cols=NR; next}
{
    delete out
    for (f=1;f<=NF;++f) {
        i = ($f ~ /^.*=.*-.*$/) ? 2 : 1
        split($f, b, /[=-]/)
        if (b[i] in a) {
            out[a[b[i]]] = $f
        }
    }
    printf "%s", out[1]
    for (j=2; j<=cols; ++j) {
        printf "%s%s", OFS, out[j]
    }
    print ""
}
$ cat file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
$ cat file2
TRS
TWR
UJU
GFT
$ awk -f utl.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879

awk to process the first two lines then the next two and so on

Suppose I have a file which I created from two files (one old and one updated) by using cat & sort on the primary key.
File1
102310863||7097881||6845193||271640||06007709532577||||
102310863||7097881||6845123||271640||06007709532577||||
102310875||7092992||6840808||023740||10034500635650||||
102310875||7092992||6840818||023740||10034500635650||||
So the pattern of this file is: line 1 = old value, line 2 = updated value, and so on.
Now I want to process the file in such a way that awk first processes the first two lines of the file and finds the differences, then moves on to the next two lines.
The process for each field is:
if ($[old record] != $[new record])
    i = [new record]#[old record];
Desired output
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
$ cat tst.awk
BEGIN { FS="[|][|]"; OFS="||" }
NR%2 { split($0,old); next }    # odd lines: save the old record's fields
{
    for (i=1;i<=NF;i++) {
        if (old[i] != $i) {
            $i = $i "#" old[i]  # changed field: output as new#old
        }
    }
    print
}
$
$ awk -f tst.awk file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
This awk could help:
$ awk -F '\\|\\|' '{
    getline new;                         # read the updated (second) line of the pair
    split(new, new_array, "\\|\\|");
    for(i=1;i<=NF;i++) {
        if($i != new_array[i]) {
            $i = new_array[i]"#"$i;
        }
    }
} 1' OFS="||" < input_file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
I think you know enough awk to understand the above code, so I'll skip the explanation.
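One caveat worth noting (my addition, not from the original answer): if the input somehow ends with an unpaired line, a bare getline fails silently and keeps the previous record's value, so guarding the call is slightly safer, e.g. replacing the getline line with:
if ((getline new) <= 0) next;   # skip an unpaired trailing line instead of reusing stale data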
Updated version, and thanks @martin for the double | trick:
$ cat join.awk
BEGIN {new=0; FS="[|]{2}"; OFS="||"}
new==0 {
    split($0, old_data, "[|]{2}")
    new=1
    next
}
new==1 {
    split($0, new_data, "[|]{2}")
    for (i = 1; i <= 7; i++) {
        if (new_data[i] != old_data[i]) new_data[i] = new_data[i] "#" old_data[i]
    }
    print new_data[1], new_data[2], new_data[3], new_data[4], new_data[5], new_data[6], new_data[7]
    new = 0
}
$ awk -f join.awk data.txt
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
