Compare Joining different fields - bash

I have 2 files
file1.txt [syntax field1:field2:field3:...]
123456:07102015174037:100 --> this should be matched
123457:03102015174037:354
123456:03102015174037:1
1234556:03102015174037:0
file2.txt [syntax field3:field4:field1:...]
100:03102015174037:123456 --> this should be matched
101:03145415174037:1234556
I wanted to check whether the combination of field1 & field2 from file1.txt exists in file2.txt at field3 & field1.
In the end I want to print only the file1.txt content that matched.
So I ended up doing the following [to get only the matched columns]:
awk -F ':' '{print $1,$2}' file1.txt >> tmpfile1.txt
awk -F ':' '{print $1,$3}' file2.txt >> tmpfile2.txt
grep -f tmpfile1.txt tmpfile2.txt > match.txt
grep -f match.txt file1.txt >> file1updated.txt
cat file1updated.txt
100:03102015174037:123456
Is there a one-step & efficient way of doing this? [It's like joins on columns in SQL.]

You can run this inline, of course, but I prefer to create an awk script such as this (join.awk):
NR==FNR {
a[$1 FS $3] = $0; next
}
$3 FS $1 in a {print a[$3 FS $1]}
Then you can get your result by running
awk -F: -f join.awk file2.txt file1.txt
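As a quick sanity check, the same logic can be run inline against the sample data (a sketch; file names as in the question):

```shell
# Recreate the sample files from the question
printf '%s\n' '123456:07102015174037:100' '123457:03102015174037:354' \
              '123456:03102015174037:1' '1234556:03102015174037:0' > file1.txt
printf '%s\n' '100:03102015174037:123456' '101:03145415174037:1234556' > file2.txt

# Hash file2.txt on its first and third fields, then look up each
# file1.txt line by its own field3:field1 pair
awk -F: 'NR==FNR { a[$1 FS $3] = $0; next }
         $3 FS $1 in a { print a[$3 FS $1] }' file2.txt file1.txt
```

This prints the matching file2.txt line, 100:03102015174037:123456, as in the question's match output.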

Related

Write specific columns of one file into another file; who can give me a more concise solution?

I have a troublesome problem about writing specific columns of one file into another file. More precisely: I have file1 as below, and I need to write its first column, excluding the first row, to file2 as a single line with fields separated by the '|' sign. I have a solution using sed and awk, but it is missing the last step of inserting the result at the top of file2. Given how powerful awk, sed, etc. are, I still believe there should be a more concise solution. So, who can offer me a more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[@]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize all lines into one, delimited by |'s.
Using the result from the paste command, append file2 using the cat command.
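The pipeline can be checked against the sample data (a sketch; tail -n +2 is the portable spelling of tail +2):

```shell
# Recreate the sample files from the question
printf '%s\n' 'col_name data_type comment' 'aaa string null' \
              'bbb int null' 'ccc int null' > file1
printf '%s\n' 'xxx ccc(whatever is this)' > file2

# Drop the header, keep column 1, join the lines with |, then prepend to file2's contents
tail -n +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
```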
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true while we are reading the first input file: the overall line number NR equals the line number within the current file FNR. The final 1 is a common idiom for printing all input lines that make it this far into the script (the next in the first block prevents lines from the first file from reaching this point).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.
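As a sanity check, the two-file script produces the requested output on the sample data (a sketch; file names as in the question):

```shell
# Recreate the sample files from the question
printf '%s\n' 'col_name data_type comment' 'aaa string null' \
              'bbb int null' 'ccc int null' > file1
printf '%s\n' 'xxx ccc(whatever is this)' > file2

# NR==FNR is true only while reading file1; the final 1 prints every file2 line
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|" } next }
     FNR == 1 { printf "\n" } 1' file1 file2
```

This prints aaa|bbb|ccc followed by the unchanged contents of file2.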

Why does awk not work in a script file while selecting something between two files?

In my project, I have two files.
The content of file1 is like:
bme-zhangyl
chem-abbott
chem-hef
chem-lijun
chem-liuch
chem-lix
chem-nisf
chem-quanm
chem-sunli
chem-taohq
chem-wanggc
chem-wangyg
The content of file2 is like:
bme-zhangyl bme-zhangmm
phy-dongert phy-zhangwq
chem-lijun phy-zhangwq
ls-liulj bio-chenw
phy-zhangyb phy-zhangwq
mee-xingw mee-rongym
cs-likm cs-hisao
cs-nany cs-hisao
cs-pengym cs-hisao
chem-quanm cs-hisao
cs-likq cs-hisao
cs-wujx cs-liuyp
mse-mar mse-liangyy
ccse-xiezy ccse-xiezy
maad-chensm maad-wanmp
Now I have a script file whose content is:
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '($1=='"$i"'){print $2}' file2`
echo $groupname
done
But unluckily, it displays nothing.
I have tried another way:
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '{if($1=='"$i"')print $2}' file2`
echo $groupname
done
and
#!/bash/sh
for i in $(cat file1)
do
groupname=`awk '{if($1==$i)print $2}' file2`
echo $groupname
done
They all fail. Nothing seems wrong; who can help me?
The correct output should be:
bme-zhangmm
phy-zhangwq
cs-hisao
Using bare awk:
$ awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
Output:
bme-zhangmm
phy-zhangwq
cs-hisao
Explained:
$ awk '
NR==FNR { # hash file1 strings
a[$1]
next
}
$1 in a { # if file2 field 1 keyword was hashed from file1
print $2 # output word from field 2
}' file1 file2
Updated: as a script:
#!/bin/sh
awk 'NR==FNR{a[$1];next}$1 in a{print $2}' file1 file2
I have tested:
groupname=`awk '{if($1==" '$i' ") print $2}' UGfrompwdguprst`
and it works OK.
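For completeness, the quoting problem in the original loop can also be sidestepped by passing the shell variable to awk with -v (a sketch on a subset of the sample data):

```shell
# Two users from file1 and a few lines of file2 (sample data from the question)
printf '%s\n' 'bme-zhangyl' 'chem-lijun' > file1
printf '%s\n' 'bme-zhangyl bme-zhangmm' 'phy-dongert phy-zhangwq' \
              'chem-lijun phy-zhangwq' > file2

while IFS= read -r i; do
  # -v passes the value in as an awk string, so no fragile inline quoting is needed
  awk -v user="$i" '$1 == user { print $2 }' file2
done < file1
```

This prints bme-zhangmm and phy-zhangwq, though the single-pass hash version above avoids re-reading file2 for every user.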

Splitting csv file into multiple files with 2 columns in each file

I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column. It appears that I am iterating over the first row and then moving onto the second row skipping every other element of that specific row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk be sure to set the OFS so that the columns remain a CSV file:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$ cat f2
4,5
c,d
e,r
d,f
c,v
Is there a quick and dirty way of renaming the resulting files with the first row and first column (like the first file would be 1.csv, the second file would be 4.csv)?
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, '{ for (i=1; i < NF; i+=2) print $i, $(i+1) > i ".csv"}' tes.csv
works for me. I was trying to get the output in bash, which was all jumbled up.
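One portability note on that loop: some awk implementations need the redirection target parenthesized, and setting OFS keeps the output comma-separated. A sketch (input file name assumed):

```shell
# Two rows of the sample input
printf '%s\n' '1,2,4,5,6,7,8,9' 'a,b,c,d,e,f,g,h' > testfile.csv

# Output files are named by the starting column index: 1.csv, 3.csv, 5.csv, 7.csv
awk -F, -v OFS=, '{ for (i = 1; i < NF; i += 2) print $i, $(i+1) > (i ".csv") }' testfile.csv
cat 1.csv 3.csv
```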
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[@]} + 1) / 2; i++)); do
slice_file="${f%.csv}$((i+1)).csv"
cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed:
sed -r '
h
s/(.,.),.*/\1/w file1.txt
g
s/.,.,(.,.),.*/\1/w file2.txt' file.txt

Compare columns in two text files and match lines

I want to compare the second column (delimited by a whitespace) in file1:
n01443537/n01443537_481.JPEG n01443537
n01629819/n01629819_420.JPEG n01629819
n02883205/n02883205_461.JPEG n02883205
With the second column (delimited by a whitespace) in file2:
val_8447.JPEG n09256479
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
val_8480.JPEG n03089624
If there is a match, I would like to print out the corresponding line of file2.
Desired output in this example:
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
I tried the following, but the output file is empty:
awk -F' ' 'NR==FNR{c[$2]++;next};c[$2] > 0' file1.txt file2.txt > file3.txt
Also tried this, but the result was the same (empty output file):
awk 'NR==FNR{a[$2];next}$2 in a' file1 file2 > file3.txt
GNU join exists for this purpose.
join -o "2.1 2.2" -j 2 <(sort -k 2 file1) <(sort -k 2 file2)
Using awk:
awk 'FNR==NR{a[$NF]; next} $NF in a' file1 file2
val_68.JPEG n01443537
val_1054.JPEG n01629819
val_1542.JPEG n02883205
Here is a grep alternative with process substitution:
grep -f <(awk '{print " " $NF "$"}' file1) file2
Using print " " $NF "$" to create a regex like " n01443537$" so that grep matches only the last column.
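Checked against the sample data (file names assumed), the generated patterns match exactly the three expected lines:

```shell
# Recreate the sample files from the question
printf '%s\n' 'n01443537/n01443537_481.JPEG n01443537' \
              'n01629819/n01629819_420.JPEG n01629819' \
              'n02883205/n02883205_461.JPEG n02883205' > file1
printf '%s\n' 'val_8447.JPEG n09256479' 'val_68.JPEG n01443537' \
              'val_1054.JPEG n01629819' 'val_1542.JPEG n02883205' \
              'val_8480.JPEG n03089624' > file2

# Each pattern is anchored to the end of the line, e.g. " n01443537$"
grep -f <(awk '{print " " $NF "$"}' file1) file2
```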

sed: to merge contents of one CSV to another based on variable match

Attempting to correlate GEO IP data from one CSV to an access log of another.
Sample lines of data:
CSV1
Bob,App1,8-Jan-15,8.8.8.8
April,App3,2-Jan-15,5.5.5.5
George,App2,1-Feb-15,8.8.8.8
CSV2
8.8.8.8,US,United States,CA,California,Mountain View,94040,America/Los_Angeles
5.5.5.5,US,United States,FL,Florida,Miami
I want to search CSV1 for any IP listed in CSV2 and append fields 1,2,4 to CSV1 when the IP matches.
So far I have the following, but I believe the errors are in the sed portion.
#!/bin/bash
for LINE in $( cat CSV2 | awk -F',' '{print $1 "," $2 "," $4}' )
do
$IP = $( echo $LINE | cut -d, -f1 )
sed -i.bak "s/"$IP/\""$LINE\"" CSV1
done
Desired Output:
Bob,App1,8-Jan-15,8.8.8.8,United States,CA
Dawn,App3,2-Jan-15,5.5.5.5,United States,FL
George,App2,1-Feb-15,8.8.8.8,United States,CA
Using the join command:
$ join -t , -1 4 -2 1 -o 1.1,1.2,1.3,1.4,2.3,2.4 <(sort -t, -k4,4 CSV1) <(sort -t, CSV2)
Bob,App1,8-Jan-15,8.8.8.8,United States,CA
Using sort is overkill here, but for files with more than one line, join requires the input to be sorted on the join key.
With awk
$ awk -F, -v OFS=, 'NR == FNR {a[$1] = $3 OFS $4; next} $4 in a {print $0, a[$4]}' CSV2 CSV1
Bob,App1,8-Jan-15,8.8.8.8,United States,CA
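Running the same awk over the full sample data (note the question's CSV1 has April where the desired output shows Dawn, so April appears here):

```shell
# Recreate the sample files from the question
printf '%s\n' 'Bob,App1,8-Jan-15,8.8.8.8' 'April,App3,2-Jan-15,5.5.5.5' \
              'George,App2,1-Feb-15,8.8.8.8' > CSV1
printf '%s\n' '8.8.8.8,US,United States,CA,California,Mountain View,94040,America/Los_Angeles' \
              '5.5.5.5,US,United States,FL,Florida,Miami' > CSV2

# Hash CSV2 by IP (field 1), then append country and region to each matching CSV1 line
awk -F, -v OFS=, 'NR == FNR { a[$1] = $3 OFS $4; next }
                  $4 in a { print $0, a[$4] }' CSV2 CSV1
```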
