Compare two files (file1 & file2) and add one column from file2 to file1 if the first column of the two files matches - shell

I have two files (file1 and file2)
file1
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/abc/patch142007
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/def/patch143005
DEF=14.3.0.5.SAMPLE2=git.calypso/plugins/gitiles/+/refs/heads/clientpatch/def/patch14300-calib
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/hij/patch120000sp3
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/mno/patch161028
.......(150 lines)
file2
IJK = open
ABC = closed
PQR = closed
DEF = open
HIJ = open
LMN = closed
MNO = closed
PQR = open
......(> 150 lines)
output file
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/abc/patch142007=closed
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/def/patch143005=open
DEF=14.3.0.5.SAMPLE2=git.xyz/plugins/gitiles/+/refs/heads/client/def/patch14300-calib=open
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/client/hij/patch120000sp3=open
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/mno/patch161028=closed
I have tried the following script, but it gives me no output: it prints nothing and reports no errors.
while IFS= read -r line
do
key1=`echo $line | awk -F "=" '{print $1}'` < file1
key2=`echo $line | awk -F "=" '{print $2}'` < file1
key3=`echo $line | awk -F "=" '{print $3}'` < file1
key4=`echo $line | awk -F "=" '{print $1}'` < file2
value3=`echo $line | awk -F "=" '{print $2}'` < file2
if [ "$key1" == "$key4" ]; then
echo "$key1=$key2=$key3=$value3"
fi
done
Here is a brief description of how the code should work.
The code should compare the first columns of the two files (file1 and file2). If the names match, it should produce the output file listed above; otherwise it should go to the next line. I should get the output whether my two files are sorted or unsorted.
Any help will be appreciated. Thank you.

Or another approach with awk that stores the file2 values in an array and then appends the correct state to the appropriate line in file1:
awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0"="a[$1]}' file2 FS="=" file1
Example Use/Output
$ awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0"="a[$1]}' file2 FS="=" file1
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/abc/patch142007=closed
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/def/patch143005=open
DEF=14.3.0.5.SAMPLE2=git.calypso/plugins/gitiles/+/refs/heads/clientpatch/def/patch14300-calib=open
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/hij/patch120000sp3=open
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/mno/patch161028=closed
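One thing worth checking with the one-liner above is what happens to a file1 key that has no entry in file2. The following is a minimal sketch, assuming shortened sample lines and a hypothetical XYZ key that are not in the original data. It compares the one-liner as written with a variant that adds a `$1 in a` guard.

```shell
# Shortened sample data; the XYZ line and the path/... values are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'ABC = closed' 'DEF = open' > "$workdir/file2"
printf '%s\n' 'ABC=14.2.0.7.SAMPLE=path/abc' 'XYZ=1.0.0.0.SAMPLE=path/xyz' > "$workdir/file1"

# As written: every file1 line is printed, so an unmatched key gets an empty state.
unguarded=$(awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0"="a[$1]}' \
    "$workdir/file2" FS="=" "$workdir/file1")

# With a guard: only file1 lines whose key exists in file2 are printed.
guarded=$(awk -F' = ' 'NR==FNR {a[$1]=$2; next} $1 in a {print $0"="a[$1]}' \
    "$workdir/file2" FS="=" "$workdir/file1")

printf 'unguarded:\n%s\n\nguarded:\n%s\n' "$unguarded" "$guarded"
rm -rf "$workdir"
```

The `FS="="` operand between the two file names is an awk command-line assignment; it takes effect after file2 has been read and before file1 is opened, which is why each file can use its own separator.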

Could you please try following.
awk '
BEGIN{
OFS="="
}
FNR==NR{
a[$1]=$NF
next
}
($1 in a){
print $0,a[$1]
}
' Input_file2 FS="=" Input_file1
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
OFS="=" ##Setting OFS as = here for all lines.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file2 is being read.
a[$1]=$NF ##Creating an array a with index $1 and value is last field.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if $1 of current line is present in array a then do following.
print $0,a[$1] ##Printing current line and value of array a with index $1.
}
' file2 FS="=" file1 ##Mentioning Input_file file2 and file1 and setting FS="=" for file1 here.
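As for why the loop in the question prints nothing: the `while read` loop has no input redirected into it, so it waits on standard input, and the `< file1` placed after each assignment does not feed `$line`. If a shell loop is still preferred, it could look roughly like the sketch below. This assumes the sample data from the question (shortened to a few lines); `merge_with_loop` is a hypothetical helper name. It starts one awk process per line, so the single-awk answers above will generally be faster on large files.

```shell
# Sample data taken from the question, shortened.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/abc/patch142007
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/def/patch143005
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/mno/patch161028
EOF
cat > "$workdir/file2" <<'EOF'
IJK = open
ABC = closed
DEF = open
MNO = closed
EOF

# merge_with_loop FILE1 FILE2 (hypothetical helper)
merge_with_loop() {
    # The loop gets its input from the redirection after "done".
    while IFS= read -r line; do
        key=${line%%=*}    # text before the first "="
        # Look the key up in file2 and stop at the first match.
        state=$(awk -F' = ' -v k="$key" '$1 == k {print $2; exit}' "$2")
        # Print only the lines whose key was found in file2.
        if [ -n "$state" ]; then
            printf '%s=%s\n' "$line" "$state"
        fi
    done < "$1"
}

result=$(merge_with_loop "$workdir/file1" "$workdir/file2")
printf '%s\n' "$result"
rm -rf "$workdir"
```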

Related

Write specific columns of a file into another file. Who can give me a more concise solution?

I have a troublesome problem writing specific columns of one file into another file. In more detail: I have file1 as shown below, and I need to write its first column, excluding the first row, to file2 as a single line separated by the '|' sign. I have a partial solution using sed and awk, but it is missing the last step of inserting the result at the top of file2. I also believe there should be a more concise solution, given how powerful awk, sed, etc. are. Can anyone offer a more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[@]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail -n +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize the lines into one, delimited by |'s.
Using the results from the paste command, append file2 using the cat command.
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true while we are reading the first input file: the overall line number NR is equal to the line number within the current file FNR. The final 1 is a common idiom for printing all input lines which make it this far into the script (the next in the first block prevents lines from the first file from reaching this far).
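To see the two counters side by side, here is a small sketch that prints them for every input line. The file names `first` and `second` and their contents are made up for the illustration.

```shell
# Two throwaway input files (hypothetical names and contents).
workdir=$(mktemp -d)
printf '%s\n' a b > "$workdir/first"
printf '%s\n' c d e > "$workdir/second"

# NR keeps counting across files, FNR restarts at 1 for each file.
counters=$(cd "$workdir" && awk '{
    print FILENAME, NR, FNR, (NR == FNR ? "first-file" : "later-file")
}' first second)

printf '%s\n' "$counters"
rm -rf "$workdir"
```

One caveat: if the first file is empty, NR == FNR also holds while the second file is being read, so the idiom then treats the second file as if it were the first.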
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.
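Note that the Awk commands above print the merged result to standard output and leave file2 untouched. If file2 itself has to be updated and neither `sed -i` nor `sponge` is available, a temporary file is one option. The following is a minimal sketch under that assumption; `prepend_first_column` is a hypothetical helper name, and the sample data is the one from the question.

```shell
# Sample data from the question.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
col_name data_type comment
aaa string null
bbb int null
ccc int null
EOF
printf '%s\n' 'xxx ccc(whatever is this)' > "$workdir/file2"

# prepend_first_column TABLE TARGET (hypothetical helper)
prepend_first_column() {
    tmp=$(mktemp) || return 1
    {
        # Column 1 of every row after the header, joined with "|".
        awk 'NR > 1 {print $1}' "$1" | paste -sd '|' -
        # Followed by the current content of the target file.
        cat "$2"
    } > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy back instead of mv so the target keeps its permissions.
    cat "$tmp" > "$2"
    rm -f "$tmp"
}

prepend_first_column "$workdir/file1" "$workdir/file2"
updated=$(cat "$workdir/file2")
printf '%s\n' "$updated"
rm -rf "$workdir"
```

The explicit `-` operand tells paste to read standard input; some paste implementations require it.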

How to compare two files and print the values from both files where they differ

There are 2 files. I need to sort them first, then compare them, and for each difference print the values from both File 1 and File 2.
file1:
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
file2:
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
Output should be:
pair,inputbid,inputask,outputbid,outtputask
AED/MYR,3.918000,3.918000,4.918000,4.918000
The only difference between the 2 files is AED/MYR, which has different bid/ask rates. How can I print the differing values from file 1 and file 2?
I tried using below commands:
nawk -F, 'NR==FNR{a[$1]=$4;a[$2]=$5;next} !($4 in a) || !($5 in a) {print $1 FS a[$1] FS a[$2] FS $4 FS $5}' file1 file2
Result output as below:
pair,bid,ask,bid,ask
AUD/CAD,3.918000,3.918000,3.918000,3.918000
AUD/CHF,3.918000,3.918000,3.918000,3.918000
AUD/CNH,3.918000,3.918000,3.918000,3.918000
AUD/CNY,3.918000,3.918000,3.918000,3.918000
AED/MYR,3.918000,3.918000,4.918000,4.918000
We are still not able to get only the difference.
Could you please try following, written and tested in GNU awk with shown samples.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" '
BEGIN{
FS=OFS=","
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr) && arr[$1]!=$0{
val=$1
$1=""
sub(/^,/,"")
if(!found){
print header
found=1
}
print arr[val],$0
}' Input_file1 Input_file2
Explanation: Adding detailed explanation for above.
awk -v header="pair,inputbid,inputask,outputbid,outtputask" ' ##Starting awk program from here and setting this to header value here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file1 is being read.
arr[$1]=$0 ##Creating arr with index $1 and keep value as current line.
next ##next will skip all further statements from here.
}
($1 in arr) && arr[$1]!=$0{ ##Checking condition if first field is present in arr and its value NOT equal to $0
val=$1 ##Creating val which has the first field (the pair name) of current line in it.
$1="" ##Nullifying first field here.
sub(/^,/,"") ##Substitute starting , with NULL here.
if(!found){ ##Checking if found is NULL then do following.
print header ##Printing header here only once.
found=1 ##Setting found here.
}
print arr[val],$0 ##Printing arr with index of val and current line here.
}' Input_file1 Input_file2 ##Mentioning Input_files here.
With bash process substitution, then join, and then filtering with awk:
# print header
printf "%s\n" "pair,inputbid,inputask,outputbid,outtputask"
# remove first line from both files, then sort them on first field
# then join them on first field and output first 5 fields
join -t, -1 1 -2 1 -o 1.1,1.2,1.3,2.2,2.3 <(tail -n +2 file1 | sort -t, -k1,1) <(tail -n +2 file2 | sort -t, -k1,1) |
# output only those lines whose columns differ
awk -F, '$2 != $4 || $3 != $5'
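For a quick check against the sample data from the question, here is a compact awk sketch that needs no sorting. It assumes each pair appears at most once per file, and it stores bid and ask in separate arrays (the array names and the `printed` flag are choices made for this illustration). Pairs present in only one file are not reported.

```shell
# Sample data from the question.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
EOF
cat > "$workdir/file2" <<'EOF'
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
EOF

diff_rates=$(awk -F, -v OFS=, '
    FNR == 1 { next }                                # skip the header row of each file
    NR == FNR { bid[$1] = $2; ask[$1] = $3; next }   # remember the rates from file1
    ($1 in bid) && (bid[$1] != $2 || ask[$1] != $3) {
        # Print the header once, before the first differing pair.
        if (!printed++) print "pair,inputbid,inputask,outputbid,outtputask"
        print $1, bid[$1], ask[$1], $2, $3
    }' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$diff_rates"
rm -rf "$workdir"
```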

Compare two files and combine different columns of two files together into a single file using shell

I have two files file1.txt and file2.txt.
file1.txt
Amal=123=amal#gmail.com
Anil=342=anil#gmail.com
Ajith=548=ajith#gmail.com
Aravind=998=arav#gmail.com
file2.txt
Anil=Active
Amal=Active
Ajith=Inactive
Aravind=Active
Midhun=Active
I need to add an extra column to file1.txt from file2.txt stating whether each of them is active or inactive, and also to drop lines from file2.txt whose names are not present in file1.txt. (For example, Midhun is not present in file1.txt, so I need to remove Midhun from file2.txt.)
My output file should be
output.txt
Amal=123=Active
Anil=342=Active
Ajith=548=Inactive
Aravind=998=Active
I tried the following. But it is not working.
while IFS= read -r line
do
key=`echo $line | awk -F "=" '{print $1}'` < file1.txt
key2=`echo $line | awk -F "=" '{print $2}'` < file1.txt
value=`echo $line | awk -F "=" '{print $2}'` < file2.txt
echo "$key=$key2=$value"
done
EDIT: Since the OP changed the requirement, adding this solution now.
awk '
BEGIN{
FS=OFS="="
}
FNR==NR{
a[$1]=$2
next
}
($1 in a){
$3=""
sub(/=$/,"")
print $0,a[$1]
}
' Input_file2 Input_file1
This should be a simple task for awk, please try following.
awk 'BEGIN{FS=OFS="="} FNR==NR{a[$1]=$2;next} ($1 in a){print $0,a[$1]}' file2 file1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section for this program from here.
FS=OFS="=" ##Setting FS and OFS value as = here for all lines.
} ##Closing BLOCK for BEGIN here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[$1]=$2 ##Creating array a with index $1 and value $2.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if $1 of current line(from file1) is present in array a then do following.
print $0,a[$1] ##Printing current line and value of array a with index $1 of current line here.
}
' file2 file1 ##Mentioning Input_file names here.
No need for scripting. Sort the files and then it's a simple join.
join -t= <(sort file1.txt) <(sort file2.txt)
To comply with the OP's update, let's cut only the first two fields of file1:
join -t= <(sort file1.txt | cut -d= -f-2) <(sort file2.txt)
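Two things to keep in mind with join: it expects both inputs sorted on the join field in the same collation order, and its output follows that sorted order rather than the original order of file1.txt. The sketch below runs the idea on the sample data from the question. It is written with temporary files instead of process substitution so it also runs in a plain POSIX shell, and it pins the locale with LC_ALL=C and sorts on the key field only (`-k1,1`); both are choices made for this illustration, not part of the original answer.

```shell
# Sample data from the question.
workdir=$(mktemp -d)
cat > "$workdir/file1.txt" <<'EOF'
Amal=123=amal#gmail.com
Anil=342=anil#gmail.com
Ajith=548=ajith#gmail.com
Aravind=998=arav#gmail.com
EOF
cat > "$workdir/file2.txt" <<'EOF'
Anil=Active
Amal=Active
Ajith=Inactive
Aravind=Active
Midhun=Active
EOF

# Sort both inputs on the first field, keeping only fields 1-2 of file1.
LC_ALL=C sort -t= -k1,1 "$workdir/file1.txt" | cut -d= -f1,2 > "$workdir/left"
LC_ALL=C sort -t= -k1,1 "$workdir/file2.txt" > "$workdir/right"

# Midhun has no partner in file1.txt, so join drops that line.
joined=$(LC_ALL=C join -t= "$workdir/left" "$workdir/right")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

If the original line order of file1.txt matters, the awk answers above keep it.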

Replace the first column in a file with another column in different file using shell

I have two files file1 and file2
file1
Shyam=123=12.3.4.5=user#gmail.com
Shyam=123=12.2.5.4=user#gmail.com
Joshwa=234=14.3.4.67=user#gmail.com
Anil=879=15.3.4.98=user#gmail.com
Anil=765=15.4.5.65=user#gmail.com
.......
file2
Shyam=ShyamLal
Joshwa=JoshwaSam
Anil=AnilAcharya
....
"=" is used as the separator in file1 and file2.
I want to update the names as given in file2, i.e., Shyam will be replaced with ShyamLal, Joshwa with JoshwaSam, and Anil with AnilAcharya. I don't want to use if-else conditions, because I have a large amount of data.
My output should be like:
ShyamLal=123=12.3.4.5=user#gmail.com
ShyamLal=123=12.2.5.4=user#gmail.com
JoshwaSam=234=14.3.4.67=user#gmail.com
AnilAcharya=879=15.3.4.98=user#gmail.com
AnilAcharya=765=15.4.5.65=user#gmail.com.
I tried this, but I don't know whether I am doing it right:
while IFS= read -r line
do
key=`echo $line | awk -F "=" '{print $1}'` < file1.txt
value=`echo $line | awk -F "=" '{print $2}' < file2.txt`
cat file1.txt | sed 's/$key/$value/g'
done
How can I proceed?
Could you please try following.
awk '
BEGIN{
FS=OFS="="
}
FNR==NR{
a[$1]=$2
next
}
($1 in a){
$1=a[$1]
}
1
' Input_file2 Input_file1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section here.
FS=OFS="=" ##Setting FS and OFS as = for all lines here.
} ##Closing BLOCK for BEGIN section of this program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file file2 is being read.
a[$1]=$2 ##Creating an array named a with index $1 with value of $2 of current line.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if $1 is present in array a this will be done when Input_file1 is being read.
$1=a[$1] ##Setting $1 to array a value with index $1 of current line.
}
1 ##1 will print edited/non-edited line here.
' file2 file1 ##Mentioning Input_file names here.
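A small check of how the script treats names that have no replacement: because of the trailing `1`, such lines are printed unchanged. This sketch uses shortened sample data from the question plus a hypothetical Ravi line that is not in the original files.

```shell
# Sample data from the question; the Ravi line is hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
Shyam=123=12.3.4.5=user#gmail.com
Ravi=555=10.0.0.1=user#gmail.com
Anil=879=15.3.4.98=user#gmail.com
EOF
cat > "$workdir/file2" <<'EOF'
Shyam=ShyamLal
Anil=AnilAcharya
EOF

renamed=$(awk '
    BEGIN { FS = OFS = "=" }
    FNR == NR { a[$1] = $2; next }   # file2: short name -> full name
    ($1 in a) { $1 = a[$1] }         # file1: replace the name if a mapping exists
    1                                # print every file1 line, changed or not
' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$renamed"
rm -rf "$workdir"
```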

using awk to print header name and a substring

I tried using this code to print a gene name as a header and then pull a substring based on its location, but it doesn't work:
>output_file
cat input_file | while read row; do
echo $row > temp
geneName=`awk '{print $1}' tmp`
startPos=`awk '{print $2}' tmp`
endPOs=`awk '{print $3}' tmp`
for i in temp; do
echo ">${geneName}" >> genes_fasta ;
echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
done
done
input_file
nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548
fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............
and this is my wrong output file
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
output should look like that:
>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....
You can do this with a single call to awk which will be orders of magnitude more efficient than looping in a shell script and calling awk 4-times per-iteration. Since you have bash, you can simply use command substitution and redirect the contents of fasta to an awk variable and then simply output the heading and the substring containing the beginning through ending characters from your fasta file.
For example:
awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
or using getline within the BEGIN rule:
awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
Example Input Files
Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:
$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79
and the first 129-characters of your example fasta
$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
Example Use/Output
$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg
Look things over and let me know if I understood your question's requirements. Also let me know if you have further questions on the solution.
If I'm understanding correctly, how about:
awk 'NR==FNR {fasta = fasta $0; next}
{
printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
}' fasta input_file > genes_fasta
It first reads fasta file and stores the sequence in a variable fasta.
Then it reads input_file line by line, extracts the substring of fasta starting at $2 and of length $3 - $2 + 1. (Note that the 3rd argument to substr function is length, not endpos.)
Hope this helps.
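The difference between passing a length and passing an end position to substr is easy to see on a short string. A minimal sketch, assuming a made-up 20-character sequence and made-up positions 3 to 8 (inclusive):

```shell
# Hypothetical short sequence and positions, for illustration only.
fasta=atgcatgcatgcatgcatgc
start=3
end=8

# Third argument as a length: characters 3 through 8, six in total.
inclusive=$(awk -v seq="$fasta" -v s="$start" -v e="$end" \
    'BEGIN { print substr(seq, s, e - s + 1) }')

# Third argument mistaken for an end position: eight characters starting at 3.
by_endpos=$(awk -v seq="$fasta" -v s="$start" -v e="$end" \
    'BEGIN { print substr(seq, s, e) }')

printf 'length = end - start + 1: %s\n' "$inclusive"
printf 'length = end:             %s\n' "$by_endpos"
```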
made it work!
this is the script for pulling substrings from a fasta file
cat genes_and_bounderies1 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
done
done
