Compare one to one lines in 2 different files using shell scripting - shell

I have 2 files:
File1 --------------------------------------->File2
abc -----------------------------------------> abc
cde -----------------------------------------> cde,xyz,efg,hij,...,n
efg -----------------------------------------> lmn,opq,weq,...n
I want to compare File1 line 1 with File2 line 1, line 2 with line 2, and so on.
However, a single line in File2 can have multiple entries separated by commas.
If the entry in File1 matches any of the entries on the corresponding line of File2, the result should be OK.
Otherwise, show the difference.
For example:
FILE1 ---------------------- FILE2
cde ---------------------- cde,xyz,efg,hij,opt
The result should be OK because cde exists in both files.
Can you please help me write a shell script for this?
I tried sdiff, but it also reported the extra entries as differences.

Consider these two test files:
$ cat file1
abc
cde
efg
$ cat file2
abc
cde,xyz,efg,hij,n
lmn,opq,weq,n
Consider the command:
$ awk -F, 'FNR==NR{a[NR]=$1;next} {f=0;for (i=1;i<=NF;i++)if($i==a[FNR])f=1;if(f)print "OK";else print a[FNR]" -----> " $0}' file1 file2
OK
OK
efg -----> lmn,opq,weq,n
This prints OK on every line for which the key in file1 is found anywhere on the corresponding line in file2. If it is not, it prints both lines as shown.
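For readability, the same logic can be laid out as a commented script. This is only a sketch: the scratch directory and the variable names (tmp, result, key, found) are illustrative, and the data is the test input shown above.

```shell
# Illustrative setup: recreate the two test files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' abc cde efg > "$tmp/file1"
printf '%s\n' abc cde,xyz,efg,hij,n lmn,opq,weq,n > "$tmp/file2"

result=$(awk -F, '
    # First file: remember the key found on each line number.
    FNR == NR { key[FNR] = $1; next }
    # Second file: look for that key among the comma-separated fields.
    {
        found = 0
        for (i = 1; i <= NF; i++)
            if ($i == key[FNR]) { found = 1; break }
        if (found) print "OK"
        else       print key[FNR] " -----> " $0
    }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The comparison is an exact string match per field, so surrounding spaces in file2 entries would count as a difference.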
Another example
From the comments, consider these two files in which all lines have a match:
$ cat f1
abc
cde
mno
$ cat f2
abc
efg,cde,hkl
mno
$ awk -F, 'FNR==NR{a[NR]=$1;next} {f=0;for (i=1;i<=NF;i++)if($i==a[FNR])f=1;if(f)print "OK";else print a[FNR]" -----> " $0}' f1 f2
OK
OK
OK

Related

bash - Concatenate files in different subfolders into a single file and have each file name in the first column

I am trying to concatenate a few thousand files that are in different subfolders into a single file and also have the name of each concatenated file inserted as the first column so that I know which file each data row came from. Essentially starting with something like this:
EDIT: I neglected to mention that each file has the same header so I updated the request accordingly.
Folder1
file1.txt
A B C
123 010 ...
456 020 ...
789 030 ...
Folder2
file2.txt
A B C
abc 100 ...
efg 200 ...
hij 300 ...
and outputting this:
CombinedFile.txt
A B C
file1 123 010 ...
file1 456 020 ...
file1 789 030 ...
file2 abc 100 ...
file2 efg 200 ...
file2 hij 300 ...
After reading this post, I have tried the following code, but end up with a syntax error (apologies, I'm super new to awk!)
shopt -s globstar
for filename in path/**/*.txt; do
awk '{print FILENAME "\t" $0}' *.txt > CombinedFile.txt
done
Thanks for your help!
This single awk should be able to do it without any looping:
shopt -s globstar
awk 'FNR == 1 {
    f = FILENAME
    gsub(/^.*\/|\.[^.]+$/, "", f)
    if (NR > 1) # show header for first file only
        next
}
{
    print f, $0
}' path/**/*.txt > CombinedFile.txt
cat CombinedFile.txt
file1 123 010
file1 456 020
file1 789 030
file2 abc 100
file2 efg 200
file2 hij 300
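In the command above, the first file's header line also falls through to the final print, so it comes out prefixed with the file name. If the header should appear once and unprefixed, as in the requested output, a variant along these lines may help. It is a sketch under assumptions: the scratch directory, the sample rows, and the tab output separator are illustrative, and the two explicit paths stand in for the path/**/*.txt glob.

```shell
# Illustrative setup: two folders, each holding a file with the same header.
tmp=$(mktemp -d)
mkdir -p "$tmp/Folder1" "$tmp/Folder2"
printf '%s\n' 'A B C' '123 010' '456 020' > "$tmp/Folder1/file1.txt"
printf '%s\n' 'A B C' 'abc 100' 'efg 200' > "$tmp/Folder2/file2.txt"

combined=$(awk -v OFS='\t' '
    FNR == 1 {
        f = FILENAME
        gsub(/^.*\/|\.[^.]+$/, "", f)   # drop the directory and the extension
        if (NR == 1) print              # header of the very first file, unprefixed
        next                            # never prefix or repeat a header line
    }
    { print f, $0 }
' "$tmp/Folder1/file1.txt" "$tmp/Folder2/file2.txt")

printf '%s\n' "$combined"
rm -rf "$tmp"
```

Because the header test relies on NR, all files have to be passed to a single awk invocation for the header to be printed only once.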

Trying to merge 2 files but ignore new lines

I'm trying to merge 2 lists together: copy over only the lines they have in common, but ignore new lines. It might be easier to explain with an example:
a.txt   b.txt
abc     123
def     abc.^$234,~12
ghi     abcdd
jkl     asdf
mnn     ghi.^$321,~11
opq     jkl
        mnn^$qws
        zxy
Becomes:
output.txt:
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq
I'm trying to combine two lists, copying the common lines while dropping new lines.
This might work for you (GNU sed):
sed -nE '1{x;s/.*/cat file2/e;x};G;s/^([^\n]+)(\n.*)*\n(\1\>[^\n]*).*/\3/;P' file1
Slurp file2 into the hold space and then append it to each line in file1.
If the word in file1 matches a word in file2, print the contents of that line in file2. Otherwise, print the current line in file1.
You could try the diff and patch commands; they might help you.
diff -u old_file new_file > change.diff
patch new_file < change.diff
Your requirements aren't at all clear, but this produces the expected output you posted from the sample input you posted, so it may be what you're looking for:
$ awk -F'[^[:alnum:]]' 'NR==FNR{a[$1]=$0; next} {print ($1 in a ? a[$1] : $1)}' b.txt a.txt
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq
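The lookup behind that one-liner can be spelled out as a commented sketch. The scratch directory and the names tmp, merged, and full are illustrative. As a deliberate substitution, the field separator is written as [^A-Za-z0-9] rather than [^[:alnum:]], which is equivalent for ASCII input and is also understood by awk versions lacking POSIX character classes.

```shell
# Illustrative setup: the sample a.txt and b.txt from the question.
tmp=$(mktemp -d)
printf '%s\n' abc def ghi jkl mnn opq > "$tmp/a.txt"
cat > "$tmp/b.txt" <<'EOF'
123
abc.^$234,~12
abcdd
asdf
ghi.^$321,~11
jkl
mnn^$qws
zxy
EOF

merged=$(awk -F'[^A-Za-z0-9]' '
    # b.txt (read first): index each full line by its leading word.
    NR == FNR { full[$1] = $0; next }
    # a.txt: print the b.txt line for a known word, else the word itself.
    {
        out = ($1 in full) ? full[$1] : $1
        print out
    }
' "$tmp/b.txt" "$tmp/a.txt")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

If two lines of b.txt start with the same word, the later one wins, since it overwrites the earlier array entry.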
Using awk:
$ awk '
NR==FNR {
    a[$0]
    next
}
{
    for(i in a)
        if(index(i,$0)) {
            print i
            next
        }
    print
}' b a
Output:
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq

Add sequential number at the beginning of files

I have 5 files. I want to add a sequential number and a tab at the beginning of each line, but the numbering in the second file should continue from the last number of the first file, and so on. Here's an example:
file1
line1
line2
....
line13
file2
line1
line2
file5
line1
line2
Output file1
1 line1
........
13 line13
output file2
14 line1
15 line2
And so on
If you want to concatenate the files and number the lines, use cat:
cat -n file1 file2 file3 file4 file5
If you want to create a separate output file for each input file, use awk:
awk '{
printf "%d\t%s\n",NR,$0 > ("output_"FILENAME)
}' file1 file2 file3 file4 file5
This reads file1 through file5, numbers the lines, and writes them to output_file1 through output_file5. Note that with too many input files the awk command above can fail with an error about too many open files. In that case use the following, which closes the previous output file whenever the input file changes:
awk '
FILENAME!=f{close("output_"f);f=FILENAME}
{printf "%d\t%s\n",NR,$0 > ("output_"f)}
' file1 file2 file3 file4 file5
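A small run can confirm that NR keeps counting across input files. This is a sketch with made-up sample files in a scratch directory. It runs awk from inside that directory because "output_" FILENAME only forms a usable name when FILENAME contains no slash, and it skips the close() call while f is still empty.

```shell
# Illustrative setup: a three-line file followed by a two-line file.
tmp=$(mktemp -d)
printf '%s\n' line1 line2 line3 > "$tmp/file1"
printf '%s\n' line1 line2 > "$tmp/file2"

(
    cd "$tmp" || exit 1
    awk '
        # On a new input file, close the previous output and switch names.
        FILENAME != f { if (f != "") close("output_" f); f = FILENAME }
        # NR is the overall record number, so it does not restart per file.
        { printf "%d\t%s\n", NR, $0 > ("output_" f) }
    ' file1 file2
)

second=$(cat "$tmp/output_file2")
printf '%s\n' "$second"
rm -rf "$tmp"
```

Here output_file2 starts at 4, directly after the three lines of file1.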

Non matching word from file1 to file2

I have two files, file1 and file2.
file1 contains only words, for example:
ABC
YUI
GHJ
I8O
..................
file2 contains many paragraphs:
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
...................
I am using the command below to get the lines of file2 that contain a word from file1:
grep -Ff file1 file2
(This outputs the lines of file2 in which words from file1 are found.)
I also need the words that are not found in file2, and I am unable to find those unmatched words.
Can anyone help me get the output below?
YUI
I8O
I am looking for a one-liner (via grep, awk, or sed), as I am running it through pssh and can't use while or for loops.
You can print only the matched parts with -o.
$ grep -oFf file1 file2
ABC
GHJ
Use that output as a list of patterns for a search in file1. Process substitution <(cmd) simulates a file containing the output of cmd. With -v you can print lines that did not match. If file1 contains two lines such that one line is a substring of another line you may want to add -x (only match whole lines) to prevent false positives.
$ grep -vxFf <(grep -oFf file1 file2) file1
YUI
I8O
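Where process substitution is not available (for example when the remote shell is a plain sh), the same idea can be written as a pipeline. This sketch assumes a grep that supports -o and accepts "-f -" to read the pattern list from standard input, as GNU grep does. The scratch directory and variable names are illustrative.

```shell
# Illustrative setup: the sample file1 and file2 from the question.
tmp=$(mktemp -d)
printf '%s\n' ABC YUI GHJ I8O > "$tmp/file1"
printf '%s\n' 'dfghjo ABC kll njjgg bla bla' \
              'GHJ njhjckhv chasjvackvh ..' \
              'ihbjhi hbhibb jh jbiibi' > "$tmp/file2"

# Stage 1 prints every word of file1 that occurs in file2.
# Stage 2 uses those words as fixed, whole-line patterns and prints
# the lines of file1 that match none of them.
unmatched=$(grep -oFf "$tmp/file1" "$tmp/file2" | grep -vxFf - "$tmp/file1")

printf '%s\n' "$unmatched"
rm -rf "$tmp"
```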
Using Perl - both matched/non-matched in same one-liner
$ cat sinw.txt
ABC
YUI
GHJ
I8O
$ cat sin_in.txt
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
$ perl -lne '
BEGIN { %x=map{chomp;$_=>1} qx(cat sinw.txt); $w="\\b(?:".join("|",map{quotemeta} keys %x).")\\b"}
print "$&" and delete($x{$&}) if /$w/ ;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
ABC
GHJ
non-matched
I8O
YUI
$
Getting only the non-matched
$ perl -lne '
BEGIN {
%x = map { chomp; $_=>1 } qx(cat sinw.txt);
$w = "\\b(?:" . join("|", map { quotemeta } keys %x) . ")\\b"
}
delete($x{$&}) if /$w/;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
non-matched
I8O
YUI
$
Note that even a single use of $& variable used to be very expensive for the whole program, in Perl versions prior to 5.20.
Assuming the words in file1 are spread over more than one line:
while read -r line
do
    for word in $line    # unquoted on purpose: split the line into words
    do
        if ! grep -qF -- "$word" file2
        then echo "$word not found"
        fi
    done
done < file1
For the unmatched words, here's one GNU awk solution:
awk 'NR==FNR{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1
YUI
I8O
Or !($0 in a); it's the same. Because I set RS='[ \n]', every space also acts as a record separator.
Note that file2 is read first, and then file1.
If file2 could be empty, replace NR==FNR with a different file check, such as ARGIND==1 in GNU awk, FILENAME=="file2", or FILENAME==ARGV[1].
Same mechanism for only the matched one too:
awk 'NR==FNR{a[$0];next} $0 in a' RS='[ \n]' file2 file1
ABC
GHJ
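For an awk without regular-expression record separators, a similar result can be obtained by looping over the fields of file2 instead. This is a sketch rather than the answer's method. The scratch directory and the names tmp, unmatched, and seen are illustrative, and it uses the FILENAME==ARGV[1] check mentioned above so that an empty file2 does not confuse the two passes.

```shell
# Illustrative setup: the sample file1 and file2 from the question.
tmp=$(mktemp -d)
printf '%s\n' ABC YUI GHJ I8O > "$tmp/file1"
printf '%s\n' 'dfghjo ABC kll njjgg bla bla' \
              'GHJ njhjckhv chasjvackvh ..' \
              'ihbjhi hbhibb jh jbiibi' > "$tmp/file2"

unmatched=$(awk '
    # First argument (file2): record every whitespace-separated word.
    FILENAME == ARGV[1] { for (i = 1; i <= NF; i++) seen[$i]; next }
    # Second argument (file1): print the lines that were never seen.
    !($0 in seen)
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$unmatched"
rm -rf "$tmp"
```

Like the RS-based version, this compares whole words, so a word from file1 that only occurs inside a longer word of file2 is reported as unmatched.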

BASH - Split file into several files based on conditions

I have a file (input.txt) with the following structure:
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
...
I would like to split this file into multiple files (day.txt; month.txt; ...). Each new text file would contain all "header" lines (the one starting with >) and their content (lines between two header lines).
day.txt would therefore be:
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
and month.txt:
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
I cannot use split -l in this case because the amount of lines is not the same for each category (day, month, etc.). However, each sub-category has the same number of lines (=3).
EDIT: Adding one more solution, as requested by the OP.
awk -F'[>_]' '/^>/{file=$2".txt"} {print > file}' Input_file
Explanation:
awk -F'[>_]' ' ##Use > or _ as the field separator.
/^>/{ file=$2".txt" } ##On a line that starts with >, set the variable file to the 2nd field plus ".txt".
{ print > file } ##Write the current line to the file named by the variable file.
' Input_file ##The name of the input file.
The following awk command may also help:
awk '/^>day/{file="day.txt"} /^>month/{file="month.txt"} {print > file}' Input_file
You can set the record separator to > and then just set the file name based on the category given by $1.
$ awk -v RS=">" 'NF {f=$1; sub(/_.*$/, ".txt", f); printf ">%s", $0 > f}' input.txt
$ cat day.txt
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
$ cat month.txt
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
Here's a generic solution for the >name_number format:
$ awk 'match($0, /^>[^_]+_/){k = substr($0, RSTART+1, RLENGTH-2);
if(!(k in a)){close(op); a[k]; op=k".txt"}}
{print > op}' ip.txt
match($0, /^>[^_]+_/): true if the line starts with >name_
k = substr($0, RSTART+1, RLENGTH-2): save the name portion
if(!(k in a)): if the key is not yet in the array
close(op): close the previous output file, in case there are too many files to write
a[k]: add the key to the array
op=k".txt": set the output file name
print > op: print the input record to the file named in op
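The command above only switches to a new output file the first time a name is seen, so it relies on all blocks of one category being adjacent in the input. If categories may be interleaved, a variant such as the following sketch could be used. It switches on every change of category and appends with >>, so a file reopened after close() is not truncated. The sample input, the scratch directory, and the names out and name are illustrative, and the output files are assumed not to exist beforehand.

```shell
# Illustrative setup: an input in which the day blocks are not adjacent.
tmp=$(mktemp -d)
cat > "$tmp/input.txt" <<'EOF'
>day_1
ABC
>month_1
BCD
>day_2
JKL
EOF

(
    cd "$tmp" || exit 1
    awk '
        /^>/ {
            name = substr($0, 2)          # drop the leading ">"
            sub(/_.*/, "", name)          # keep the part before the first "_"
            name = name ".txt"
            if (name != out) {            # category changed: switch output file
                if (out != "") close(out)
                out = name
            }
        }
        out != "" { print >> out }        # ">>" appends when a file is reopened
    ' input.txt
)

day=$(cat "$tmp/day.txt")
month=$(cat "$tmp/month.txt")
printf '%s\n' "$day"
rm -rf "$tmp"
```

Only one output file is open at any time, which keeps the number of open file descriptors constant regardless of how many categories there are.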
Since each subcategory is composed of the same number of lines, you can use grep's -A / --after-context option to specify how many lines to print after each matching header.
So if you know the list of categories in advance, you just have to grep the headers of their subcategories and redirect them, with their content, to the correct file:
lines_by_subcategory=3 # number of lines *after* a subcategory's header
for category in "month" "day"; do
    grep -A "$lines_by_subcategory" "^>${category}_" input.txt >> "$category.txt"
done
Note that this isn't the most efficient solution, as it must read the input once per category. Other solutions can instead read the content once and redirect each subcategory to its respective file in a single pass.
