Trying to merge 2 files but ignore new lines - bash

I'm trying to merge 2 lists together: only copy over the lines whose leading word is common to both files (taking the version from b.txt), but ignore lines that are new in b.txt. It might be easier to explain with an example:
a.txt:
abc
def
ghi
jkl
mnn
opq
b.txt:
123
abc.^$234,~12
abcdd
asdf
ghi.^$321,~11
jkl
mnn^$qws
zxy
Becomes:
output.txt:
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq
In short: trying to combine two lists, copying the common lines while dropping the new ones.

This might work for you (GNU sed):
sed -nE '1{x;s/.*/cat file2/e;x};G;s/^([^\n]+)(\n.*)*\n(\1\>[^\n]*).*/\3/;P' file1
Slurp file2 into the hold space and then append it to each line in file1.
If the word in file1 matches a word in file2, print the contents of that line in file2. Otherwise, print the current line in file1.

You could try the diff and patch commands; they might help you:
diff -u old_file new_file > change.diff
patch old_file < change.diff

Your requirements aren't at all clear, but this produces the expected output you posted given the sample input you posted, so it may be what you're looking for:
$ awk -F'[^[:alnum:]]' 'NR==FNR{a[$1]=$0; next} {print ($1 in a ? a[$1] : $1)}' b.txt a.txt
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq

Using awk:
$ awk '
NR==FNR {                 # first file (b): remember every line
    a[$0]
    next
}
{                         # second file (a): look for a b line containing this line
    for (i in a)
        if (index(i, $0)) {
            print i
            next
        }
    print                 # no match in b: print the a line unchanged
}' b a
Output:
abc.^$234,~12
def
ghi.^$321,~11
jkl
mnn^$qws
opq

Related

Partial string search between two files using AWK

I have been trying to rewrite an egrep command using awk to improve performance but haven't been successful. The egrep command performs a simple case-insensitive search of the records in file1 against (partial matches in) file2. Below is the command and sample output.
file1 contains:
Abc
xyz
123
blah
hh
a,b
file2 contains:
abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c
Original command :
egrep -i -f file1 file2
Original (egrep) command output :
$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
I would like to use AWK to rewrite the command to do the same operation. I have tried the below, but it is performing a full-record match and not a partial one like grep does.
Modified command in awk :
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
Modified command (awk) output:
$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah
This excludes the records which had partial matches for the string "abc". Any help to fix the awk command please? Thanks in advance.
Use index like this for a partial literal match:
awk '
NR == FNR {
    needles[tolower($0)]
    next
}
{
    haystack = tolower($0)
    for (needle in needles) {
        if (index(haystack, needle)) {
            print
            break
        }
    }
}' file1 file2
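With the sample file1 and file2 above, this should select the same lines as the original egrep command:
abc de
xyz
123
blah
abc,def,123
a,b,c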
I would be a bit surprised if it's significantly faster than egrep, but you can try this:
$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
If you have GNU awk you can use its IGNORECASE variable instead of tolower:
$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
And with GNU awk it could be that forcing the type of r to regexp instead of string leads to better performance. The manual says:
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that
you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string
constant, 'awk' must first convert the string into this internal
form and then perform the pattern matching.
In order to do this you can try:
$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh
abc de
xyz
123
blah
abc,def,123
(r=@// forces variable r to be of type regexp, and sub(//,s,r) does not change this)
Note: just like with your egrep attempts, the lines of file1 are considered as regular expressions, not simple text strings to search for. So, if one line in file1 is .*, all lines in file2 will match, not just the lines containing substring .*.

Searching a string and replacing another string above the searched string

I have a file with the lines below
123
456
123
789
abc
efg
xyz
I need to search for abc and replace the nearest 123 above it with 111. This is the requirement: abc occurs only once in the file, but 123 can occur multiple times, and it can be at any position above abc.
Please help me.
I have tried with below sed command
sed -i.bak "/abc/!{x;1!p;d;};x;s/123/111/" filename
With the above command, it only replaces 123 if 123 is the line just above abc; if 123 is 2 lines above abc, the replace fails.
There's more than one way to do it. Here's one:
sed -i.bak '1{h;d;};/123/{x;p;d;};/abc/{x;s/123/111/;p;d;};H;${x;p;};d' filename
ed comes in handy for complex editing of files in scripts:
ed -s file <<EOF
/^abc$/;?^123$?;.c
111
.
w
EOF
This: Sets the current line to the first one matching abc (/^abc$/;). Then changes the first line before that point that matches 123 to 111 (?XXX? searches backwards for a matching regular expression, and ?^123$?;. selects that single line for c to change) and finally saves the modified file.
This is a classic case where you keep track of your previous line and change things depending on conditions satisfied by the current line. Generally, such an awk program looks like this:
awk '(FNR==1){prev=$0; next}
(condition_on_$0) { action_on_prev }
{ print prev; prev = $0 }
END { print $0 }'
So in the case of the OP, this would read:
awk '(FNR==1){prev=$0; next}
$0 == "abc" { if (prev == "123") prev = "111" }
{ print prev; prev = $0 }
END { print $0 }'
This might work for you (GNU sed):
sed -Ez 's/(.*)(\n123.*\nabc)/\1\n111\2/' file
This slurps the file into memory and inserts 111 in front of the last occurrence of 123 before abc.
A less memory intensive solution:
sed -E '/^123$/{:a;N;/\n123$/{h;s///p;g;s/.*\n//;ba};/\nabc$/!ba;s/^/111\n/}' file
This gathers up lines following a line containing 123. If another line containing 123 is encountered it offloads all lines before it and begins gathering lines again. If it finds a line containing abc it inserts 111 at the front of the lines gathered so far.
Another alternative:
sed '/abc/{x;/./{s/^/111\n/p;z};x;b};/123/{x;/./p;x;h;$!d;b};x;/./{x;H;$!d};x' file
Another option is to reverse the file, replace the first 123 found after abc, and reverse it back:
$ tac file | awk 'f && sub(/123/,"111"){f=0} /abc/{f=1} 1' | tac
123
456
111
789
abc
efg
xyz

Non matching word from file1 to file2

I have two files - file1 & file2.
file1 contains (only words), say:
ABC
YUI
GHJ
I8O
..................
file2 contains many paragraphs:
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
...................
I am using the command below to get the lines in file2 that contain a word from file1:
grep -Ff file1 file2
(This outputs the lines where words from file1 are found in file2.)
I also need the words that are not found in file2, but I'm unable to get those non-matching words.
Can anyone help in getting the output below?
YUI
I8O
I am looking for a one-liner (via grep, awk, or sed), as I am using the pssh command and can't use while/for loops.
You can print only the matched parts with -o.
$ grep -oFf file1 file2
ABC
GHJ
Use that output as a list of patterns for a search in file1. Process substitution <(cmd) simulates a file containing the output of cmd. With -v you can print lines that did not match. If file1 contains two lines such that one line is a substring of another line you may want to add -x (only match whole lines) to prevent false positives.
$ grep -vxFf <(grep -oFf file1 file2) file1
YUI
I8O
Using Perl - both matched/non-matched in same one-liner
$ cat sinw.txt
ABC
YUI
GHJ
I8O
$ cat sin_in.txt
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
$ perl -lne '
BEGIN { %x=map{chomp;$_=>1} qx(cat sinw.txt); $w="\\b".join("\|",keys %x)."\\b"}
print "$&" and delete($x{$&}) if /$w/ ;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
ABC
GHJ
non-matched
I8O
YUI
$
Getting only the non-matched
$ perl -lne '
BEGIN {
%x = map { chomp; $_=>1 } qx(cat sinw.txt);
$w = "\\b" . join("\|",keys %x) . "\\b"
}
delete($x{$&}) if /$w/;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
non-matched
I8O
YUI
$
Note that even a single use of the $& variable used to be very expensive for the whole program in Perl versions prior to 5.20.
Assuming your "words" in file1 are spread over more than one line:
while read -r line
do
    for word in $line
    do
        if ! grep -q "$word" file2
        then echo "$word not found"
        fi
    done
done < file1
For the un-matched words, here's one GNU awk solution:
awk 'NR==FNR{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1
YUI
I8O
Or !($0 in a), it's the same. Since I set RS='[ \n]', every space acts as a record separator too.
And note that I read file2 first, and then file1.
If file2 could be empty, you should change NR==FNR to a different file-checking method, like ARGIND==1 for GNU awk, FILENAME=="file2", or FILENAME==ARGV[1], etc.; see the sketch below.
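For example, a minimal sketch of the FILENAME-based check (it hard-codes the name file2 and still assumes GNU awk for the regexp RS):
awk -v RS='[ \n]' 'FILENAME=="file2"{a[$0];next} !($1 in a)' file2 file1
This behaves like the one-liner above, but it keeps working when file2 is empty because it no longer relies on comparing record counts.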
Same mechanism for only the matched one too:
awk 'NR==FNR{a[$0];next} $0 in a' RS='[ \n]' file2 file1
ABC
GHJ

BASH - Split file into several files based on conditions

I have a file (input.txt) with the following structure:
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
...
I would like to split this file into multiple files (day.txt, month.txt, ...). Each new text file would contain all "header" lines (the ones starting with >) and their content (the lines between two header lines).
day.txt would therefore be:
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
and month.txt:
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
I cannot use split -l in this case because the number of lines is not the same for each category (day, month, etc.). However, each sub-category has the same number of lines (=3).
EDIT: As per the OP's request, adding one more solution now.
awk -F'[>_]' '/^>/{file=$2".txt"} {print > file}' Input_file
Explanation:
awk -F'[>_]' '            ##Set the field separator to either > or _.
/^>/{ file=$2".txt" }     ##On a line starting with >, set variable file to the 2nd field followed by ".txt".
{ print > file }          ##Print the current line to that file name (creating the file if needed).
' Input_file              ##Mention the Input_file name here.
The following awk may help you with the same.
awk '/^>day/{file="day.txt"} /^>month/{file="month.txt"} {print > file}' Input_file
You can set the record separator to > and then just set the file name based on the category given by $1.
$ awk -v RS=">" 'NF {f=$1; sub(/_.*$/, ".txt", f); printf ">%s", $0 > f}' input.txt
$ cat day.txt
>day_1
ABC
DEF
GHI
>day_2
JKL
MNO
PQR
>day_3
STU
VWX
YZA
$ cat month.txt
>month_1
BCD
EFG
HIJ
>month_2
KLM
NOP
QRS
Here's a generic solution for >name_number format
$ awk 'match($0, /^>[^_]+_/){k = substr($0, RSTART+1, RLENGTH-2);
if(!(k in a)){close(op); a[k]; op=k".txt"}}
{print > op}' ip.txt
match($0, /^>[^_]+_/) if line matches >name_ at start of line
k = substr($0, RSTART+1, RLENGTH-2) save the name portion
if(!(k in a)) if the key is not found in array
a[k] add key to array
op=k".txt" output file name
close(op) in case there are too many files to write
print > op print input record to filename saved in op
Since each subcategory is composed of the same number of lines, you can use grep's -A / --after-context option to specify the number of lines to match after a header.
So if you know in advance the list of categories, you just have to grep the headers of their subcategories to redirect them with their content to the correct file :
lines_by_subcategory=3 # number of lines *after* a subcategory's header
for category in "month" "day"; do
    grep ">$category" -A "$lines_by_subcategory" input.txt >> "$category.txt"
done
Note that this isn't the most efficient solution, as it must scan the input once for each category. Other solutions (like the awk ones above) instead read the content once and redirect each subcategory to its respective file in a single pass.

Simpler way to count the number of duplicated rows in a text file

I have a text file that looks like this:
abc
bcd
abc
efg
bcd
abc
And the expected output is this:
3 abc
2 bcd
1 efg
I know there is an existing solution for this:
sort -k2 < inFile |
awk '!z[$1]++{a[$1]=$0;} END {for (i in a) print z[i], a[i]}' |
sort -rn -k1 > outFile
The code sorts, counts the duplicates, sorts again by count, and prints the expected output.
However, is there a simpler way to express the !z[$1]++{a[$1]=$0;} part? More "basic", I mean.
More basic:
$ sort inFile | uniq -c
3 abc
2 bcd
1 efg
More basic awk
When one is used to awk's idioms, the expression !z[$1]++{a[$1]=$0;} is clear and concise. For those used to programming in other languages, other forms might be more familiar, such as:
awk '{if (z[$1]++ == 0) a[$1]=$0;} END {for (i in a) print z[i], a[i]}'
Or,
awk '{if (z[$1] == 0) a[$1]=$0; z[$1]+=1} END {for (i in a) print z[i], a[i]}'
If your input file contains billions of lines and you want to avoid sort, then you can just do:
awk '{a[$0]++} END{for(x in a) print a[x],x}' file.txt
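Note that for (x in a) visits the keys in no particular order; if you want the counts sorted like the expected output, you can append a sort (a small addition, not part of the original one-liner):
awk '{a[$0]++} END{for(x in a) print a[x],x}' file.txt | sort -rn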
