Grep for Keyword1Keyword2 but not Keyword1TEXTKeyword2 - Very large grep - bash

I want to grep for exact matches only, excluding lines that have other text between the words I am searching for. For example:
egrep -i "^cat|^dog" list.txt >> startswith.txt
egrep -i "home$|house$" startswith.txt >> final.txt
I want this to return matches for cathome, cathouse, doghome and doghouse, but not cathasahome, catneedsahouse, etc. Note that the files are far too big for me to go through and write ^word1word2$ for every combination.
Is there a way to do this within grep or egrep?

Use grouping to specify both parts of your pattern. The anchors (^ and $) then apply to the groups.
$ cat list.txt
cathome
cathouse
catindahouse
dogindahome
doghouse
doghome
$ egrep -i "^(dog|cat)(home|house)$" list.txt
cathome
cathouse
doghouse
doghome
You could try the same thing in Perl regex mode, with non-capturing groups (since you don't care about capturing them):
$ grep -Pi "^(?:dog|cat)(?:home|house)$" list.txt
No idea if that'll make a difference either way, but doesn't hurt to try.
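For longer word lists, the same grouping idea can be driven from shell variables. The sketch below is only an illustration: the variable names prefixes and suffixes are assumptions, the sample data is copied from the answer above, and grep's -x option (match the whole line) stands in for the explicit ^ and $ anchors.

```shell
# Sample data from the answer above, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' cathome cathouse catindahouse dogindahome doghouse doghome > "$tmp/list.txt"

# Hypothetical variables holding the two alternations.
prefixes='cat|dog'
suffixes='home|house'

# -E: extended regex, -i: ignore case, -x: the pattern must match the entire line.
result=$(grep -E -i -x "($prefixes)($suffixes)" "$tmp/list.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because -x anchors the whole pattern, lines such as catindahouse are rejected without writing ^ and $ by hand.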

You didn't provide any sample input or expected output so this is an untested guess but this is probably what you're looking for:
awk '
BEGIN {
split("cat dog",beg)
split("home house",end)
for (i in beg)
for (j in end)
matches[beg[i] end[j]]
}
tolower($0) in matches
' file
e.g.:
$ cat file
acathome
CatHome
catinhouse
CATHOUSE
doghomes
dogHOME
dogathouse
DOGhouse
$ awk '
BEGIN {
split("cat dog",beg)
split("home house",end)
for (i in beg)
for (j in end)
matches[beg[i] end[j]]
}
tolower($0) in matches
' file
CatHome
CATHOUSE
dogHOME
DOGhouse
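A similar exact-string lookup can be sketched with grep alone by generating every prefix/suffix combination into a pattern file. This is an illustration under assumptions, not part of the answer above: the file name patterns.txt is made up and the word lists are hard-coded in the loops.

```shell
# Sample input from the answer above.
tmp=$(mktemp -d)
printf '%s\n' acathome CatHome catinhouse CATHOUSE doghomes dogHOME dogathouse DOGhouse > "$tmp/file"

# Build every prefix+suffix combination, one per line (patterns.txt is a hypothetical name).
for b in cat dog; do
  for e in home house; do
    printf '%s%s\n' "$b" "$e"
  done
done > "$tmp/patterns.txt"

# -F: fixed strings, -x: whole-line match, -i: ignore case, -f: read patterns from a file.
result=$(grep -F -x -i -f "$tmp/patterns.txt" "$tmp/file")
printf '%s\n' "$result"

rm -rf "$tmp"
```

With fixed strings and whole-line matching no regex is evaluated at all, which may help on very large inputs, although that is not measured here.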

Related

replace string with exact match in bash script

I have a file with many repeated entries like the ones below. These are the only unique values.
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="
I want to replace the empty field with "Null", so that the output is:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="Null"
This is what I have come up with:
#First find the matching content
cat file.txt | egrep 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_contain.txt
# Find the content where given string are not there
cat file.txt | egrep -v 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_donot_contain.txt
# Replace the string in content not found file
sed -i 's/CHECKSUM="/CHECKSUM="Null"/g' file_donot_contain.txt
# Merge the files
cat file_contain.txt file_donot_contain.txt > output.txt
But I don't think this is an efficient way of doing it. Any other suggestions?
To achieve this, anchor the pattern to the end of the line with $ (and optionally to the start of the line with ^), so that it matches the whole line rather than just part of it:
sed -i 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt
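The effect of the anchors can be checked on a throwaway copy of the sample data. This is a minimal sketch: the temporary file is an assumption, and the result is written to standard output instead of using -i so the input stays untouched.

```shell
# Recreate the four unique values from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'CHECKSUM="Y"' 'CHECKSUM="N"' 'CHECKSUM="U"' 'CHECKSUM="' > "$tmp"

# ^ and $ force the pattern to cover the whole line, so only the empty field is rewritten.
result=$(sed 's/^CHECKSUM="$/CHECKSUM="Null"/' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Lines that already carry a value are left alone because they do not end directly after the opening quote.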

How to process tr across all files in a directory and output to a different name in another directory?

mpu3$ echo * | xargs -n 1 -I {} | tr "|" "/n"
which outputs:
#.txt
ag.txt
bg.txt
bh.txt
bi.txt
bid.txt
dh.txt
dw.txt
er.txt
ha.txt
jo.txt
kc.txt
lfr.txt
lg.txt
ng.txt
pb.txt
r-c.txt
rj.txt
rw.txt
se.txt
sh.txt
vr.txt
wa.txt
is what I have so far. What is missing is the output; I get none. What I really want is to get a list of txt files, take each name up to the extension, replace every "|" in the file with a line break (LF/CR), and write the result to another directory as [old-name].ics. HALP. THX in advance. - Idiot me.
You can loop over the files and use sed to process the file:
for i in *.txt; do
sed -e 's/|/\n/g' "$i" > other_directory/"${i%.txt}".ics
done
There is no need for xargs, especially fed by echo, which risks the filenames being word-split and glob-expanded, so it could well do the wrong thing.
sed's s command substitutes | with \n, and the g flag makes the replacement global. The output is redirected into the other directory, using bash's parameter expansion to strip the .txt from the end of the name.
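Since the question asks about tr specifically, here is a sketch of the same loop with tr in place of sed. The directories and sample contents are made up for the demonstration; only the loop body reflects the technique.

```shell
# Hypothetical source and destination directories with two small input files.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'a|b|c\n' > "$src/ag.txt"
printf 'x|y\n'   > "$src/bg.txt"

for f in "$src"/*.txt; do
  base=${f##*/}                                  # strip the directory part
  tr '|' '\n' < "$f" > "$dst/${base%.txt}.ics"   # ${base%.txt} drops the extension
done

listing=$(ls "$dst")
converted=$(cat "$dst/ag.ics")
printf '%s\n' "$listing" "$converted"

rm -rf "$src" "$dst"
```

tr reads only standard input, which is why each file is redirected into it rather than passed as an argument.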
Here's an awk solution:
$ awk '
FNR==1 { # for first record of every file
close(f) # close previous file f
f="path_to_dir/" FILENAME # new filename with path
sub(/txt$/,"ics",f) } # replace txt with ics
{
gsub(/\|/,"\n") # replace | with \n
print > f }' *.txt # print to new file

sed multiple replacements with line range

I have a file with below records
user1,fuser1,luser1,user1#test.com,data,user1
user2,fuser2,luser2,user2#test.com,data,user2
user3,fuser3,luser3,user3#test.com,data,user3
I wanted to perform some text replacements from
user1,fuser1,luser1,user1#test.com,data,user1
to
New_user1,New_fuser1,New_luser1,New_user1#test.com,data,New_user1
so I wrote below sed script.
sed -i -e 's/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g' file
This works perfectly. Now I need to apply the replacements only to a specific line range.
start=2
end=3
sed -i -e ''${start},${end}'s/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g' file
But this command replaces the pattern in all lines. Example output:
user1,New_fuser1,New_luser1,user1#test.com,data,New_user1
user2,New_fuser2,New_luser2,user2#test.com,data,New_user2
user3,New_fuser3,New_luser3,user3#test.com,data,New_user3
It looks like the range is applied only to the first expression, and the remaining expressions are applied to the whole file. How can I apply the range to all expressions?
You can use awk variables for this, to control which rows and columns are replaced:
awk -vFS="," -vOFS="," -v columnStart=2 -v columnEnd=3 -v rowStart=1 -v rowEnd=2 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
where the awk variables columnStart, columnEnd, rowStart and rowEnd determine which columns and rows to replace, with , as the delimiter.
For your input file:
$ cat input-file
user1,fuser1,luser1,user1#test.com,data,user1
user2,fuser2,luser2,user2#test.com,data,user2
user3,fuser3,luser3,user3#test.com,data,user3
Assuming I want to do the replacement in lines 2 and 3, columns 3-4, I can set up awk as
awk -vFS="," -vOFS="," -v columnStart=3 -v columnEnd=4 -v rowStart=2 -v rowEnd=3 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
user2,fuser2,New_luser2,New_user2#test.com,data,user2
user3,fuser3,New_luser3,New_user3#test.com,data,user3
To apply it to just one column, set columnStart and columnEnd to the same value, e.g. column 6 (the last column) on the last line only:
awk -vFS="," -vOFS="," -v columnStart=6 -v columnEnd=6 -v rowStart=3 -v rowEnd=3 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
user3,fuser3,luser3,user3#test.com,data,New_user3
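Note that the commands above print only the rows inside the range. If the untouched rows should be kept as well, one possible variant moves print into its own rule. This is a sketch under that assumption, reusing the answer's variable names on a temporary copy of the sample data.

```shell
# Sample records from the question.
tmp=$(mktemp)
printf '%s\n' \
  'user1,fuser1,luser1,user1#test.com,data,user1' \
  'user2,fuser2,luser2,user2#test.com,data,user2' \
  'user3,fuser3,luser3,user3#test.com,data,user3' > "$tmp"

result=$(awk -F, -v OFS=, -v columnStart=3 -v columnEnd=4 -v rowStart=2 -v rowEnd=3 '
  # Prefix the selected columns, but only on rows inside the range.
  NR >= rowStart && NR <= rowEnd {
    for (i = columnStart; i <= columnEnd; i++) $i = "New_" $i
  }
  # Print every row, modified or not.
  { print }
' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Row 1 passes through unchanged, while rows 2 and 3 get the prefix in columns 3 and 4.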
GNU sed (present on Ubuntu, probably Debian, and probably others) has a feature which makes this easy:
https://www.gnu.org/software/sed/manual/sed.html#Common-Commands
A group of commands may be enclosed between { and } characters. This
is particularly useful when you want a group of commands to be
triggered by a single address (or address-range) match.
Example: perform substitution then print the second input line:
$ seq 3 | sed -n '2{s/2/X/ ; p}'
X
Given the original question, this should do the trick:
sed -i -e '2,3 {s/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g}' file
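To take the range from shell variables, the group can also be written with one -e per command. The sketch below makes two assumptions that are not in the original: it works on a temporary copy of the sample data without -i, and it anchors each substitution to a field boundary (start of line or a comma), because after an unanchored s/user/New_user/g the text fuser1 has already become fNew_user1 and a later s/fuser/.../ no longer matches.

```shell
# Sample records from the question.
tmp=$(mktemp)
printf '%s\n' \
  'user1,fuser1,luser1,user1#test.com,data,user1' \
  'user2,fuser2,luser2,user2#test.com,data,user2' \
  'user3,fuser3,luser3,user3#test.com,data,user3' > "$tmp"

start=2
end=3

# Everything between { and } runs only on lines start..end.
result=$(sed -e "${start},${end}{" \
             -e 's/^user/New_user/' \
             -e 's/,user/,New_user/g' \
             -e 's/,fuser/,New_fuser/g' \
             -e 's/,luser/,New_luser/g' \
             -e '}' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Splitting the group across several -e options keeps the closing } on its own line, which avoids relying on how a particular sed parses a one-line group.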
The following works for me:
START=2
NUM=1
sed -i -e "$START,+${NUM} s/user/New_user/g; $START,+${NUM} s/fuser/New_fuser/g; $START,+${NUM} s/luser/New_luser/g" file
As you can see, there are several changes:
The line range has to be present on each expression.
The range is given (in this case) as the start line number and a number of following lines, so NUM+1 lines are affected.
Your original command contained extra apostrophes.
Using a single s command:
start=1
end=2
sed -e "$start,$end s/\([fl]*\)user/New_\1user/g" file
[fl]*user matches user optionally preceded by f or l (strictly, by any run of f and l characters).
Output:
New_user1,New_fuser1,New_luser1,New_user1#test.com,data,New_user1
New_user2,New_fuser2,New_luser2,New_user2#test.com,data,New_user2
user3,fuser3,luser3,user3#test.com,data,user3

Multi-line grep with positive and negative filtering

I need to grep for a multi-line string that doesn't include one string, but does include others. This is what I'm searching for in some HTML files:
<not-this>
<this> . . . </this>
</not-this>
In other words, I want to find files that contain <this> and </this> on the same line, but only where that line is not surrounded by <not-this> tags on the lines before and/or after. Here is some shorthand logic for what I want to do:
grep 'this' && '/this' && !('not-this')
I've seen answers with the following...
grep -Er -C 2 '.*this.*this.*' . | grep -Ev 'not-this'
...but this just erases the line(s) containing the "not" portion, and displays the other lines. What I'd like is for it to not pull those results at all if "not-this" is found within a line or two of "this".
Is there a way to accomplish this?
P.S. I'm using Ubuntu and gnome-terminal.
It sounds like an awk script might work better here:
$ cat input.txt
<not-this>
<this>BAD! DO NOT PRINT!</this>
</not-this>
<yes-this>
<this>YES! PRINT ME!</this>
</yes-this>
$ cat not-this.awk
BEGIN {
notThis=0
}
/<not-this>/ {notThis=1}
/<\/not-this>/ {notThis=0}
/<this>.*<\/this>/ {if (notThis==0) print}
$ awk -f not-this.awk input.txt
<this>YES! PRINT ME!</this>
Or, if you'd prefer, you can squeeze this awk script onto one long line:
$ awk 'BEGIN {notThis=0} /<not-this>/ {notThis=1} /<\/not-this>/ {notThis=0} /<this>.*<\/this>/ {if (notThis==0) print}' input.txt
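Since the question asks for the files that qualify, the same state-flag idea can be wrapped in a shell loop that prints file names instead of matching lines. This is a sketch only: the file names bad.html and good.html and their contents are invented for the demonstration, and awk's exit status is used to signal whether a file qualified.

```shell
# Two hypothetical HTML files: one with <this> inside <not-this>, one without.
dir=$(mktemp -d)
printf '%s\n' '<not-this>' '<this>BAD</this>' '</not-this>' > "$dir/bad.html"
printf '%s\n' '<yes-this>' '<this>GOOD</this>' '</yes-this>' > "$dir/good.html"

result=$(
  for f in "$dir"/*.html; do
    awk '
      /<not-this>/   { inside = 1 }                       # entering a forbidden block
      /<\/not-this>/ { inside = 0 }                       # leaving it
      /<this>.*<\/this>/ && !inside { found = 1; exit }   # first good hit is enough
      END { exit !found }                                 # status 0 only if a hit was found
    ' "$f" && printf '%s\n' "${f##*/}"
  done
)
printf '%s\n' "$result"

rm -rf "$dir"
```

Only good.html is listed, because the match in bad.html occurs while the inside flag is set.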

Grep (Bash) error

I have a file like this called new.samples.dat
-4.5000000000E-01 8.0000000000E+00 -1.3000000000E-01
5.0000000000E-02 8.0000000000E+00 3.4000000000E-01
...
I need to look up all the numbers from this file in another file called Remaining.Simulations.dat and copy the matches into a third file. This is what I did:
for sample_index in $(seq 1 100)
do
sample=$(awk 'NR=='$sample_index'' new.samples.dat)
grep "$sample" Remaining.Simulations.dat >> Previous.Training.dat
done
It almost works, but it does not copy every $sample into Previous.Training.dat, even though I am sure they are all in Remaining.Simulations.dat.
These errors appear:
grep: invalid option -- '.'
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
Do you have any idea how to solve it? Thank you.
It's because you're trying to grep for something like -4.5 and grep is treating that as an option rather than a search string. If you use -- to indicate there are no more options, this should work okay:
pax> echo -4.5000000000E-01 | grep -4.5000000000E-01
grep: invalid option -- '.'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
pax> echo -4.5000000000E-01 | grep -- -4.5000000000E-01
-4.5000000000E-01
In addition, if you pass the string 7.2 to grep, it will match any line containing 7 followed by any character followed by 2 since:
Regular expressions treat . as a special character; and
Without start and end markers, 7.2 will also match 47.2, 7.25 and so on.
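Both points can be addressed together with grep's -F option (treat the pattern as a fixed string, so . and + are not special) and -f (read the patterns from a file, so a leading - is never parsed as an option). The sketch below is an illustration with made-up data: the fourth column in Remaining.Simulations.dat and the non-matching row are assumptions.

```shell
tmp=$(mktemp -d)

# Two sample rows as in the question.
printf '%s\n' \
  '-4.5000000000E-01 8.0000000000E+00 -1.3000000000E-01' \
  '5.0000000000E-02 8.0000000000E+00 3.4000000000E-01' > "$tmp/new.samples.dat"

# Hypothetical simulation file: the same rows plus an extra column, and one unrelated row.
printf '%s\n' \
  '-4.5000000000E-01 8.0000000000E+00 -1.3000000000E-01 1.0' \
  '7.0000000000E-02 8.0000000000E+00 3.4000000000E-01 2.0' \
  '5.0000000000E-02 8.0000000000E+00 3.4000000000E-01 3.0' > "$tmp/Remaining.Simulations.dat"

# One grep call replaces the loop: every line of the sample file is a fixed-string pattern.
matches=$(grep -F -f "$tmp/new.samples.dat" "$tmp/Remaining.Simulations.dat")
printf '%s\n' "$matches"

# For a single pattern held in a variable, -- still ends option parsing.
sample='-4.5000000000E-01 8.0000000000E+00 -1.3000000000E-01'
count=$(grep -c -F -- "$sample" "$tmp/Remaining.Simulations.dat")

rm -rf "$tmp"
```

Fixed-string matching still finds substrings, so a sample line would also match inside a longer number; add -x if whole-line equality is what is needed.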
With awk you can try something like:
awk '
NR==FNR {
for (i=1;i<=NF;i++) {
numbers[$i]++
}
next
}
{
for (number in numbers)
if (index ($0,number) > 0) {
print $0
}
}' new.samples.dat Remaining.Simulations.dat > anotherfile
