Comparing file content in two different directories - shell

I have four files in two directories: 1.txt and 2.txt in one directory and 3.txt and 4.txt in another one. I want to compare the first line starting with the word "Query" in these text files and match up the files across the two directories.
How can I do it?
Example:
1.txt
ABC
Query : JKLTER
2.txt
ABC
Query : PCA
3.txt
Query :JKLTER
XYSH
Query : ABC
4.txt
GFHHH
Using a command, I would like to derive these two files from the directories based only on the first matched pattern (the line starting with Query).
Output :
Matched files : 1.txt 3.txt

I have something that is hopefully close enough - else you can diddle around with it a bit to get it closer.
So, if you use GNU awk to find the first line containing the word Query in all the files in a directory and then print the last word on that line and the name of the current file, you will get this for your first directory d1:
awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d1/*txt
JKLTER d1/1.txt
PCA d1/2.txt
And this for the second directory d2:
awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d2/*txt
JKLTER d2/3.txt
You can then pass the output of each of those commands to join to have it join lines wherein the first field matches:
join <(awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d1/*txt) <(awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' d2/*txt)
Output
JKLTER d1/1.txt d2/3.txt
You can get rid of the leading directory by changing into each directory before running awk (the cd happens inside the process substitution's subshell, so your current shell's directory is unaffected):
join <(cd d1; awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' *txt) <(cd d2; awk -F'[ :]*' '/Query/{print $NF,FILENAME; nextfile}' *txt)
Output
JKLTER 1.txt 3.txt
You can get rid of the common field used by join like this (clearing $1 leaves a leading field separator behind, so strip that too):
join <(...) <(...) | awk '{$1=""; sub(/^ /,""); print}'
Output
1.txt 3.txt
If you only have text files and nothing else in each subdirectory, and there are actually spaces after the colon following the word Query, my solution can be simplified to:
join <(cd d1; awk '/Query/{print $NF,FILENAME; nextfile}' *) <(cd d2; awk '/Query/{print $NF,FILENAME; nextfile}' *) | awk '{print $2,"matches",$3}'
Output
1.txt matches 3.txt
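If you use this often, here is a minimal sketch that factors the repeated awk into a shell function (the name firstquery is my own). Functions defined in the current shell are visible inside the process substitutions, since those run in subshells:
firstquery() { awk -F'[ :]*' '/Query/{print $NF, FILENAME; nextfile}' "$@"; }
join <(cd d1 && firstquery *txt) <(cd d2 && firstquery *txt) | awk '{print $2, "matches", $3}'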

Related

Extracting unique values between 2 files with awk

I need to get the unique lines when comparing 2 files. These files use ':' as a field separator, and only the part of each line before the ':' should be compared.
The file1 contains these lines
apple:tasty
apple:red
orange:nice
kiwi:awesome
kiwi:expensive
banana:big
grape:green
orange:oval
banana:long
The file2 contains these lines
orange:nice
banana:long
The output file should be (the 2 occurrences of orange and the 2 occurrences of banana are deleted):
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
So only the strings before the ':' should be compared. Is it possible to complete this task in one command?
I tried to complete the task this way, but the field separator does not do what I want in this situation:
awk -F: 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2 > outputfile
You basically had it, but $0 refers to the whole line when you want to deal with only the first field, which is $1.
Also you need to take care with the order of the input files. To use the values from file2 for deciding which lines to include from file1, process file2 first:
$ awk -F: 'FNR==NR {a[$1]++; next} !a[$1]' file2 file1
apple:tasty
apple:red
kiwi:awesome
kiwi:expensive
grape:green
One comment: the awk approach keeps its whole lookup table in memory. In real life, with big files, you might prefer something like:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | grep -h -f /dev/stdin f1 f2
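Note that comm indents its second output column with a tab and grep -f patterns match anywhere in the line, so the pipeline above can misfire. A slightly more defensive sketch, assuming the keys contain no regex metacharacters:
comm -3 <(cut -d : -f 1 f1 | sort -u) <(cut -d : -f 1 f2 | sort -u) | tr -d '\t' | sed 's/^/^/; s/$/:/' | grep -h -f /dev/stdin f1 f2
The tr strips comm's column indentation, and the sed anchors each key to the start of the line, up to its ':'.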

Compare the first string ending with ; in 2 txt files and get only those lines that are unique

I'm trying to compare 2 txt files and get only the lines that are unique. The problem is that I want to compare the lines based only on one field, which ends with a ';', because that is all that interests me.
This is the example line:
000000423B;Name;26.46;32.55;0;06;pc.
I need to find out whether 000000423B also appears in the other text file, and if it does not, display it or save it to a file.
awk 'NR==FNR {exclude[$0]; next} !($0 in exclude)' 1.txt 2.txt
and
grep -xvFf 1.txt 2.txt > 3.txt
They give nice results, but they compare the whole line, and I need to compare only up to the first ';' character.
Any idea?
My input
1.txt:
000000423B;Name;27.47;33.79;0;06;szt.
000010001;Name2;4.42;5.44;0;08;szt.
000010001D;Name3;1.68;2.06;0;06;szt.
2.txt:
000000423B;Name;97.47;33.79;0;06;szt.
000010001;Name2;4.99;5.44;0;08;szt.
000010001D;Name3:8778;1.68;2.06;0;06;szt.
009999999;Name4:99999;1.68;2.06;0;96;szt.
I want to get this result:
009999999;Name4:99999;1.68;2.06;0;96;szt.
In 1.txt and 2.txt the first three lines have the same "product id" but different prices, which I don't care about. I only need to find new "product ids", i.e. the leading digits up to the ';' character.
To fix your command and make it work only for the first column, you can do this :
awk -F';' 'NR==FNR {exclude[$1]; next} !($1 in exclude)' 1.txt 2.txt
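Run against the sample files, this prints only the line whose product id does not occur in 1.txt:
009999999;Name4:99999;1.68;2.06;0;96;szt.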
Another option is join:
join -t';' -11 -21 -v1 -v2 <(sort 1.txt) <(sort 2.txt)
sort both files first. Note there is no need to give sort -t';' -k1.1, because we are joining on the first field of each file.
join the sorted files:
-t';' use ';' as the field separator
-11 -21 join on the first field of both files
-v1 -v2 print the unmatched lines from the first and the second file. Actually -v2 alone would be enough; I don't know whether you are also interested in unmatched lines from the first file. If not, remove -v1, as in the sketch below.
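With -v1 dropped and run against the sample files, this prints just the line with the new product id from 2.txt:
join -t';' -11 -21 -v2 <(sort 1.txt) <(sort 2.txt)
009999999;Name4:99999;1.68;2.06;0;96;szt.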

Merging 2 sorted files(with similar content) based on Timestamp in shellscript

I have 2 files with the same format and the content below:
File1:
1,Abhi,Ban,20180921T09:09:01,EmpId1,SalaryX
4,Bbhi,Dan,20180922T09:09:03,EmpId2,SalaryY
7,Cbhi,Ean,20180923T09:09:05,EmpId3,SalaryZ
9,Dbhi,Fan,20180924T09:09:09,EmpId4,SalaryQ
File2:
11,Ebhi,Gan,20180922T09:09:02,EmpId5,SalaryA
12,Fbhi,Han,20180923T09:09:04,EmpId6,SalaryB
3,Gbhi,Ian,20180924T09:09:06,EmpId7,SalaryC
5,Hbhi,Jan,20180925T09:09:08,EmpId8,SalaryD
I want to merge File1's and File2's content into one output, based on the date in ascending order.
Outcome:
1,Abhi,Ban,20180921T09:09:01,EmpId1,SalaryX
11,Ebhi,Gan,20180922T09:09:02,EmpId5,SalaryA
4,Bbhi,Dan,20180922T09:09:03,EmpId2,SalaryY
12,Fbhi,Han,20180923T09:09:04,EmpId6,SalaryB
7,Cbhi,Ean,20180923T09:09:05,EmpId3,SalaryZ
3,Gbhi,Ian,20180924T09:09:06,EmpId7,SalaryC
9,Dbhi,Fan,20180924T09:09:09,EmpId4,SalaryQ
5,Hbhi,Jan,20180925T09:09:08,EmpId8,SalaryD
You can use the awk construct below to do this:
awk -F "," 'NR==FNR{print $4, $0;next} NR>FNR{print $4, $0;}' f1.txt f2.txt | sort | awk '{print $2}'
Explanation:
Prefix the date column ($4) to every line ($0) of both files.
sort the result, then print $2, which is the whole original line (the lines contain no spaces, so each decorated line has exactly two space-separated fields).
The printed lines come out sorted by date.
f1.txt and f2.txt are the two file names.
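Since the timestamp field is in a fixed format that sorts lexicographically in chronological order, a sort-only sketch would also work; and because both inputs are already sorted on that field, sort -m can merge them directly:
sort -m -t, -k4,4 File1 File2
This avoids the decorate/undecorate passes entirely.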
You can also try the following command:
awk 'FNR==NR{a[FNR]=$0;next}{print a[FNR]"\n"$0}' file1 file2
It stores file1's lines in array a, keyed by FNR, then prints each stored line followed by the corresponding line of file2. Note that this simply interleaves the two files line by line: it happens to produce the desired order for this sample, but it is not a general merge by timestamp.

bash: using 2 variables from same file and sed

I have 2 files:
file1.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T 45000079
rs111285978:45000103:A:AT 45000103
rs190363568:45000168:C:T 45000168
file2.txt
rs142159069:45000079:TACTTCTTGGACATTTCC:T rs142159069
rs111285978:45000103:A:AT rs111285978
rs190363568:45000168:C:T rs190363568
Using file2.txt, I want to replace the long names (column 1 of file1.txt, which is also column 1 of file2.txt) with the short name from column 2 of file2.txt. The output file would then be:
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
I have tried reading in the columns of file2.txt, but without success:
while read -r a b
do
cat file1.txt | sed s'/$a/$b/'
done < file2.txt
I am quite new to bash. Also, I am not sure how to write the result to an output file with my command. Any help would be deeply appreciated.
In your case, using awk or perl would be easier, if you are willing to accept an answer without sed:
awk '(NR==FNR){out[$1]=$2;next}{out[$1]=out[$1]" "$2}END{for (i in out){print out[i]} }' file2.txt file1.txt > output.txt
output.txt :
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168
Note: this assumes all names in column 1 are unique, and that they are all present in both files.
explanation:
(NR==FNR){out[$1]=$2;next} : while parsing the first file (file2.txt), create a map with the name from the first column as the key
{out[$1]=out[$1]" "$2} : while parsing the second file (file1.txt), append the value from its second column
END{for (i in out){print out[i]}} : print all the values in the map
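One caveat: awk's for (i in out) visits keys in an unspecified order, so the lines may not come out in file1.txt's original order. A sketch that preserves file1.txt's line order by building the map from file2.txt first and printing as file1.txt is read:
awk '(NR==FNR){map[$1]=$2; next}{print map[$1], $2}' file2.txt file1.txt > output.txt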
Apparently $2 of file2 is already contained in $1 of file1 (it is the part before the first ':'), so you could also use awk on file1 alone and redefine FS:
$ awk -F"[: ]" '{print $1,$NF}' file1
rs142159069 45000079
rs111285978 45000103
rs190363568 45000168

ksh shell script to print and delete matched line based on a string

I have 2 files like below. I need a script that finds each string from File2 in File1, deletes the line containing it from File1, and writes the remaining lines to another file (Output1.txt). It should also write to Output2.txt each string from File2 that doesn't exist in File1.
File1:
Apple
Boy: Goes to school
Cat
File2:
Boy
Dog
I need output like below.
Output1.txt:
Apple
Cat
Output2.txt:
Dog
Can anyone help, please?
If you have awk available on your system:
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File2 File1 > Output1.txt
awk -v FS='[ :]' 'NR==FNR{a[$1]}NR>FNR&&!($1 in a){print $1}' File1 File2 > Output2.txt
The script stores in array a the first field ($1) of each line of the first file given as an argument.
If the first field of a line from the second file is not in the array, it prints that field.
Note that the field delimiter is either a space or a ':'.
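An alternative sketch using grep's fixed-string (-F) and whole-word (-w) matching, which should also work from a ksh script:
grep -vwFf File2 File1 > Output1.txt
while read -r s; do grep -qwF "$s" File1 || printf '%s\n' "$s"; done < File2 > Output2.txt
The first command keeps the File1 lines that contain no word from File2; the loop prints each File2 string that appears nowhere in File1.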
