I have two files, both with headers. Every line in both files starts with a date in the first column, in the same format, and the separator is a semicolon.
In the 9th column of the first file, the only possible ids are UK, JPN or EUR.
I need to extend file1 with the intel from file2 for the corresponding date and id.
I could do it with a bash script and a "for" loop of course, but I'm sure that, resource-wise, awk or some other shell command would be better... if possible!
Thanks in advance for any hints.
PS: I tried unsuccessfully to adapt this method: https://unix.stackexchange.com/questions/428861/vlookup-equivalent-in-awk-scripting
The first file:
Date;$2;$3;$4;$5;$6;$7;$8;Id
2018-01-01; ;UK
2018-01-02; ;JPN
2018-01-03; ;EUR
2018-01-04; ;JPN
The second file:
Date;UKDIR;JPNDIR;EURDIR
2018-01-01;1;2;3
2018-01-02;4;5;6
2018-01-03;7;8;9
2018-01-04;11;10;12
Expected return
Date;$2;$3;$4;$5;$6;$7;$8;Id ;Intel
2018-01-01; ;UK ;1
2018-01-02; ;JPN ;5
2018-01-03; ;EUR ;9
2018-01-04; ;JPN ;10
You may use this awk:
awk -F';' -v OFS='; ' 'NR==1 { for (i=2; i<=NF; i++) h[i]=$i; next }
FNR==NR { for (i=2; i<=NF; i++) a[$1,h[i]]=$i; next }
FNR==1 { print $0, "Intel"; next }
{ print $0, a[$1,$NF "DIR"] }' file2 file1
Date;$2;$3;$4;$5;$6;$7;$8;Id; Intel
2018-01-01; ;UK; 1
2018-01-02; ;JPN; 5
2018-01-03; ;EUR; 9
2018-01-04; ;JPN; 10
Could you please try the following.
awk '
BEGIN{
count=count1=1
FS=OFS=";"
}
FNR!=NR && FNR==1{
print $0 OFS "Intel"
}
FNR==NR && /^[0-9]/{
a[$1]=$(++count)
count=count==4?1:count
next
}
NF && /^[0-9]/{
print $0 OFS a[$1]
count1=count1==4?1:count1
}
' second_file first_file
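Output will be as follows. Note that this script picks the file2 column by cycling through positions rather than by the Id in file1, so the last line gets 11 instead of the expected 10.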
Date;$2;$3;$4;$5;$6;$7;$8;Id;Intel
2018-01-01; ;UK;1
2018-01-02; ;JPN;5
2018-01-03; ;EUR;9
2018-01-04; ;JPN;11
$ cat tst.awk
BEGIN { FS=OFS=";" }
NR==FNR {
if (NR == 1) {
for (fldNr=2; fldNr<=NF; fldNr++) {
fldName = $fldNr
sub(/DIR/,"",fldName)
fldNr2name[fldNr] = fldName
}
}
else {
for (fldNr=2; fldNr<=NF; fldNr++) {
fldName = fldNr2name[fldNr]
dateFldName2val[$1,fldName] = $fldNr
}
}
next
}
{
print $0, (FNR>1 ? dateFldName2val[$1,$NF] : "Intel")
}
$ awk -f tst.awk file2 file1
Date;$2;$3;$4;$5;$6;$7;$8;Id;Intel
2018-01-01; ;UK;1
2018-01-02; ;JPN;5
2018-01-03; ;EUR;9
2018-01-04; ;JPN;10
I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2 (only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now I want to extract from file1 only those columns whose names appear in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with this.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
for (i=1; i<=NF; i++) {
names[$i]
}
next
}
FNR==1 {
for (i=1; i<=NF; i++) {
if ($i in names) {
f[++numFlds] = i
}
}
}
{
for (i=1; i<=numFlds; i++) {
printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
}
}
With (lots of) unix pipes, as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
Solution using bash>4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
for j in $(seq 0 $((${#headers1[@]}-1))); do
if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
idx+=($j)
break
fi
done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -a vals; do
# ignore first (header line)
if $isfirst; then
isfirst=false
continue;
fi;
# filter from line only columns with idx indices
tmp=()
for i in "${idx[@]}"; do
tmp+=("${vals[$i]}")
done
# join output with '|'
join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of columns in file1, changed the order:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
n=split($0,a)
next
}
NR==2 { # file2 header
for(i=1;i<=NF;i++)
b[$i]=i
}
{ # output part
for(i=1;i<=n;i++)
printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version, using cut for outputting, is in the revisions.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
However, it produces the output differently. Instead of constructing a string, it shifts the columns to the left whenever it does not want to include a particular field.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
for (i = 1; i <= NF; ++i)
select[i] = ($i in header)
}
{
skip = 0
pos = 1
for (i = 1; i <= NF; ++i)
if (!select[i]) { # we don't want this field
++skip
$pos = $(pos + skip) # shift fields left
} else
++pos
NF -= skip # adjust number of fields
print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio
I want to match each keyword in FILE2 against the columns of FILE1 and print <BLANK> (an empty column) when a keyword from FILE2 is not found in FILE1, regardless of the separators.
FILE1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
FILE2
TRS-0
TWR
UJU
GFT-8
OUTPUT
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
SCRIPT so far
(This script finds exact matches between FILE2 entries and FILE1 columns (with = as separator). I can't figure out how to test whether a string from FILE2 is contained in a column of FILE1.)
BEGIN{FS=OFS="|"}
NR==FNR{a[++i]=$1;next}
{
d=""
delete b
for(j=1;j<=NF;j++){
split($j,c,"-")
b[c[1]]=$j;
}
for(j=1;j<=i;j++){
d=d (d==""?"":OFS) (a[j] in b?b[a[j]]:"")
}
print d
}
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { sub(/[^[:alpha:]].*/,""); keys[++numKeys]=$0; next }
{
delete key2val
for (i=1;i<=NF;i++) {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
if ( $i ~ ("="key"-|^"key"=") ) {
key2val[key] = $i
}
}
}
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key2val[key], (keyNr<numKeys?OFS:ORS)
}
}
$ awk -f tst.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
Based on your sample input, there are two patterns for determining the keyword after the split. This solution sets index i accordingly.
BEGIN {FS=OFS="|"}
NR==FNR {a[$1]=NR; cols=NR; next}
{
delete out
for (f=1;f<=NF;++f) {
i = ($f ~ /^.*=.*-.*$/) ? 2 : 1
split($f, b, /[=-]/)
if (b[i] in a) {
out[a[b[i]]] = $f
}
}
printf "%s", out[1]
for (j=2; j<=cols; ++j) {
printf "%s%s", OFS, out[j]
}
print ""
}
$ cat file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879|JKP=908
XYZ=TRS-000|XYZ=TWR-000|GFT=879|JKP=908
$ cat file2
TRS
TWR
UJU
GFT
$ awk -f utl.awk file2 file1
XYZ=TRS-000|XYZ=TWR-000|UJU=909|GFT=879
XYZ=TRS-000|XYZ=TWR-000||GFT=879
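To see how the index i is chosen in the script above, here is a minimal sketch that applies the same test and the same split to one field at a time. The function name keyword_of is made up for illustration; it reads fields from stdin, one per line.

```shell
# keyword_of: hypothetical helper, prints the keyword part of each field read from stdin
keyword_of() {
    awk '{
        # "XYZ=TRS-000" has both = and -, so the keyword is the 2nd piece;
        # "UJU=909" has only =, so the keyword is the 1st piece
        i = ($0 ~ /^.*=.*-.*$/) ? 2 : 1
        split($0, b, /[=-]/)
        print b[i]
    }'
}

printf '%s\n' 'XYZ=TRS-000' 'UJU=909' | keyword_of
```

This prints TRS and then UJU, which are the keys the script looks up in the array built from file2.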
My input file:
SMOKE_TEST_FIMS,"['a', 'b', 'c']",2015-08-01 14:00:00+0000,100
4.AIQM-B,,2015-04-16 12:04:21+0000,102
000TEST2,['1.034820'],2015-11-19 05:00:00+0000,130
I want to parse the string in such a way that output file will look like below:
Expected output:
'SMOKE_TEST_FIMS', 'a', '2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','b', '2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','c', '2015-08-01 14:00:00+0000','100'
'4.AIQM-B','','2015-04-16 12:04:21+0000','102'
'000TEST2','1.034820','2015-11-19 05:00:00+0000','130'
I was able to parse the single column data ['a','b','c'] to
'a'
'b'
'c'
sed -i "s/ *\"/'/g;s/ *[^0-9]*\('[^']*'\)\]*'*/\1/g;s/\(.\)''/\1'\n'/g;" updatebomStatement2.cql
If you are OK with a GNU awk solution, here is such a script:
script.awk
BEGIN { FPAT = "(\"[^\"]+\")|(\\[[^\\]]+\\])|([^,]*)"
OFS = ","
}
{ if ( $2~/\[[^\]]+/ ) {
# sanitize input: strip ", [, ]:
gsub(/[\[\]\"]/, "", $2)
# split at "," into parts: and print them
split($2, parts, ",")
for( ind in parts ) {
# further normalize input
gsub(/^ ?'/, "", parts[ind])
gsub(/'$/, "", parts[ind])
tmp=sprintf("'%s','%s','%s','%s'", $1, parts[ind], $3, $4)
print tmp
}
}
else {
tmp=sprintf("'%s','%s','%s','%s'", $1, $2, $3, $4)
print tmp
}
}
Run it like this: awk -f script.awk yourfile.
IMHO GNU awk, with its FPAT feature and its control statements, is much better suited to your requirements than sed.
The first line with the FPAT describes what makes up a field in your input. It is either
something inside double quotes "
something inside brackets [ ... ]
or something separated by comma
The if statement matches that bracket case which has to be split into several lines.
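To check how that FPAT carves up a line, here is a small sketch that only lists the fields gawk finds. The helper name show_fields is an assumption for illustration, and since FPAT is a gawk extension the demo is skipped when gawk is not installed.

```shell
# show_fields: hypothetical helper, lists the fields gawk finds with the FPAT above
show_fields() {
    gawk '
        BEGIN { FPAT = "(\"[^\"]+\")|(\\[[^\\]]+\\])|([^,]*)" }
        { for (i = 1; i <= NF; i++) print i ": " $i }
    '
}

# FPAT needs gawk, so only run the demo when gawk is available
if command -v gawk >/dev/null 2>&1; then
show_fields <<'EOF'
SMOKE_TEST_FIMS,"['a', 'b', 'c']",2015-08-01 14:00:00+0000,100
EOF
fi
```

With gawk, the quoted list should come out as a single field 2 even though it contains commas, which is what lets the script split it further on its own terms.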
sed is for simple substitutions on individual lines, that is all. For anything more interesting you should be using awk:
$ cat tst.awk
BEGIN { FS=",?\"?[][]\"?,?"; OFS="," }
{
if (split($2,a,/\047/)) {
for (j=2; j in a; j+=2) {
$2 = a[j]
prt()
}
}
else {
prt()
}
}
function prt( out) {
out = "\047" $0 "\047"
gsub(OFS,"\047,\047",out)
print out
}
$ awk -f tst.awk file
'SMOKE_TEST_FIMS','a','2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','b','2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','c','2015-08-01 14:00:00+0000','100'
'4.AIQM-B','','2015-04-16 12:04:21+0000','102'
'000TEST2','1.034820','2015-11-19 05:00:00+0000','130'
or building on @karakfa's idea:
$ cat tst.awk
BEGIN { FS="([][ \"\047])*,([][ \"\047])*"; OFS="\047,\047" }
{
for(i=2; i<=(NF-2); i++) {
print "\047" $1, $i, $(NF-1), $NF "\047"
}
}
$ awk -f tst.awk file
'SMOKE_TEST_FIMS','a','2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','b','2015-08-01 14:00:00+0000','100'
'SMOKE_TEST_FIMS','c','2015-08-01 14:00:00+0000','100'
'4.AIQM-B','','2015-04-16 12:04:21+0000','102'
'000TEST2','1.034820','2015-11-19 05:00:00+0000','130'
alternative hacky awk
$ awk -F, -v OFS=, -v q="'" '{gsub(/[ "\]\[]/, "");
for(i=2;i <=NF-2; i++)
{$i=$i?$i:q q;
print q $1 q, $i, q $(NF-1) q,q $NF q}}' file
'SMOKE_TEST_FIMS','a','2015-08-0114:00:00+0000','100'
'SMOKE_TEST_FIMS','b','2015-08-0114:00:00+0000','100'
'SMOKE_TEST_FIMS','c','2015-08-0114:00:00+0000','100'
'4.AIQM-B','','2015-04-1612:04:21+0000','102'
'000TEST2','1.034820','2015-11-1905:00:00+0000','130'
I have the following input:
adm.cd.rrn.vme.abcd.name = foo
adm.cd.rrn.vme.abcd.test = no
adm.cd.rrn.vme.abcd.id = 123456
adm.cd.rrn.vme.abcd.option = no
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
adm.cd.rrn.vme.xxxx.name = blah
adm.cd.rrn.vme.xxxx.test = no
adm.cd.rrn.vme.xxxx.id = 666666
adm.cd.rrn.vme.xxxx.option = no
How can I extract all the values associated with a specific id?
For example, if I have id == 324523, I'd like it to print the values of name, test, and option:
bar no yes
Is it possible to achieve in a single awk command (or anything similar in bash)?
EDIT: Based on the input received, here's my solution so far:
MYID=$(awk -F. '/'"${ID}"$'/{print $5}' ${TMP_LIST})
awk -F'[ .]' '{
if ($5 == "'${MYID}'") {
if ($6 == "name") {name=$NF}
if ($6 == "test") {test=$NF}
if ($6 == "option") {option=$NF}
}
} END {print name,test,option}' ${TMP_LIST})
Thanks
$ cat tst.awk
{ rec = rec $0 RS }
/option/ {
if (rec ~ "id = "tgt"\n") {
printf "%s", rec
}
rec = ""
next
}
$ awk -v tgt=324523 -f tst.awk file
adm.cd.rrn.vme.asfa.name = bar
adm.cd.rrn.vme.asfa.test = no
adm.cd.rrn.vme.asfa.id = 324523
adm.cd.rrn.vme.asfa.option = yes
or if you prefer:
$ cat tst.awk
BEGIN { FS="[. ]" }
$(NF-2) == "id" { found = ($NF == tgt ? 1 : 0); next }
{ rec = (rec ? rec OFS : "") $NF }
$(NF-2) == "option" { if (found) print rec; rec = ""; next }
$ awk -v tgt=324523 -f tst.awk file
bar no yes
First, I convert each record into a single line with xargs; then I look for lines that match the regular expression and print the columns of interest:
cat input | xargs -n 12 | awk '{if($0~/id\s=\s324523\s/){ print $3, $6, $12}}'
A more general solution:
awk 'BEGIN{FS="\\.|\\s"; } #field separator is point \\. or space \\s
{
a[$5"."$6]=$8; #store records in associative array a
if($8=="324523" && $6=="id"){
reg[$5]=1; #if is record found, add to associative array reg
}
}END{
for(k2 in reg){
s=""
for(k in a){
if(k~"^"k2"\\."){ #if record is an element of "reg" then add to output "s"
s=k":"a[k]" "s
}
}
print s;
}
}' input
If your input format is fixed, you can do it this way:
grep -A1 -B2 'id\s*=\s*324523$' file|awk 'NR!=3{printf "%s ",$NF}END{print ""}'
You can add -F'=' to the awk part too.
It could be done with awk alone, but grep saves some typing...
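For comparison, an awk-only version might look like the sketch below. It assumes the same fixed layout as the grep approach (name, test, id, option, in that order), and the function name lookup_id is made up for illustration.

```shell
# lookup_id TARGET FILE: hypothetical helper, prints "name test option" for the given id
lookup_id() {
    awk -v tgt="$1" '
        want { print out, $NF; want = 0 }      # line right after the id line is "option"
        $1 ~ /\.id$/ && $NF == tgt {           # the id line we are looking for
            out = p2 OFS p1                    # the two values just before it: name, test
            want = 1
        }
        { p2 = p1; p1 = $NF }                  # remember the last two values seen
    ' "$2"
}
```

Called as lookup_id 324523 file on the sample input, this prints bar no yes. It reads the file once and needs no context options, at the cost of a few more lines than the grep pipeline.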