Remove values in a file that match patterns from another file [duplicate] - bash

This question already has answers here:
Bash, Linux, Need to remove lines from one file based on matching content from another file
(3 answers)
Closed 7 years ago.
I have a list of values in one file:
item2
item3
item4
and I want to remove the entire line from another file when the rows look like this:
item1|XXXX|ABCD
item2|XXXX|ABCD
item3|XXXX|ABCD
item4|XXXX|ABCD
item5|XXXX|ABCD
So that I'm left with:
item1|XXXX|ABCD
item5|XXXX|ABCD
Is there a bash sequence to do this?

grep -vf can do the job:
grep -vFf file1 file2
item1|XXXX|ABCD
item5|XXXX|ABCD
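For a self-contained check (file names file1 and file2 assumed from the question):

```shell
# Recreate the sample files from the question
printf '%s\n' item2 item3 item4 > file1
printf '%s\n' 'item1|XXXX|ABCD' 'item2|XXXX|ABCD' 'item3|XXXX|ABCD' \
              'item4|XXXX|ABCD' 'item5|XXXX|ABCD' > file2

# -v invert the match, -F treat patterns as fixed strings,
# -f read the patterns from file1
grep -vFf file1 file2
```

Note that -F still does substring matching: a pattern item2 would also remove a line starting with item20. Adding -w (match whole words) avoids that.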

awk to the rescue!
$ awk -F"|" 'NR==FNR{a[$1];next} !($1 in a)' remove items
item1|XXXX|ABCD
item5|XXXX|ABCD
where the item list to be removed is in file "remove" and data in file "items"
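A reproducible run of the same idea (file names "remove" and "items" as above). NR==FNR is only true while awk reads the first file, so its keys land in the array before any line of the second file is tested:

```shell
# Recreate the sample files
printf '%s\n' item2 item3 item4 > remove
printf '%s\n' 'item1|XXXX|ABCD' 'item2|XXXX|ABCD' 'item3|XXXX|ABCD' \
              'item4|XXXX|ABCD' 'item5|XXXX|ABCD' > items

# First file: record keys. Second file: print lines whose
# first |-separated field is not a recorded key.
awk -F'|' 'NR==FNR{a[$1]; next} !($1 in a)' remove items
```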

If your distinctive marker is that |XXXX|ABCD string, you can just grep it out:
$ grep -vF '|XXXX|ABCD' input > output
It's safer to use option -F (fixed strings) because your pattern is dangerously close to containing regex metacharacters. In your case that's the | character: it is not special in grep's default (BRE) syntax, but you don't want to worry about that when you're working with simple patterns.
If your distinctive pattern is the rest of the line, you can use a whole file as a pattern list with grep's -f option:
$ grep -vFf item_list < input > output


Find words from file a in file b and output the missing word matches from file a

I have two files that I am trying to run a find/grep/fgrep on. I have been trying several different commands to try to get the following results:
File A
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
File B
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
*About the files: date=20170802 - most have this date format, some have a different one*
FileA is my control file. I want to search fileb for each whole word from filea (hostnamea through hostnamef), and output the non-matches from filea on the terminal, to be used in a shell script.
For this example I made it so hostnamee is not within fileb. I want to run an fgrep/grep/awk - whatever can work for this - and output only the missing hostnamee from filea.
I can get this to work, but it does not quite do what I need, and if I swap it around I get nothing.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o
hostnamea
hostnameb
hostnamec
hostnamed
HOSTNAMEF
Cool - I get the matches in File-B. But what if I try to reverse it?
host#host:/netops/backups/scripts$ fgrep -f fileb filea -i -w -o
host#host:/netops/backups/scripts$
I have tried several different commands but cannot seem to get it right. I am using -i to ignore case, -w to match whole words, and -o to print only the matched part of each line.
I have found some sort of workaround but was hoping there was a more elegant way of doing this with a single command either awk,egrep,fgrep or other.
user#host:/netops/backups/scripts$ fgrep -f filea fileb -i -w -o > test
user#host:/netops/backups/scripts$ diff filea test -i
5d4
< hostnamee
You can
look for "only-matches", i.e. -o, of a in b
use the result as patterns to look for in a, i.e. -f-
only list what does not match, i.e. -v
Code:
grep -of a.txt b.txt | grep -f- -v a.txt
Output:
hostnamee
hostnamef
Case-insensitive code:
grep -oif a.txt b.txt | grep -f- -vi a.txt
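A quick reproduction with the sample data (the inline comments from the question stripped out of b.txt):

```shell
# Recreate the sample files
printf '%s\n' hostnamea hostnameb hostnamec hostnamed hostnamee hostnamef > a.txt
printf '%s\n' hostnamea-20170802 hostnameb-20170802 hostnamec-20170802.xml \
              020214-_hostnamed-20170208.tar HOSTNAMEF-20170802 > b.txt

# pass 1: print only the parts of b that match a (case-insensitively);
# pass 2: use those as patterns and print the lines of a they do NOT cover
grep -oif a.txt b.txt | grep -f- -vi a.txt
```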
Output:
hostnamee
Edit:
Responding to the interesting input by Ed Morton, I have made the sample input somewhat "nastier" to test robustness against substring matches and regex-active characters (e.g. "."):
a.txt:
hostnamea
hostnameb
hostnamec
hostnamed
hostnamee
hostnamef
ostname
lilihostnamec
hos.namea
b.txt:
hostnamea-20170802
hostnameb-20170802
hostnamec-20170802.xml # some files have extensions
020214-_hostnamed-20170208.tar # some files have different extensions and have different date structure
HOSTNAMEF-20170802
lalahostnamef
hostnameab
stnam
This makes things more interesting.
I provide this case insensitive solution:
grep -Fwoif a.txt b.txt | grep -f- -Fviw a.txt
additional -F, meaning "no regex tricks"
additional -w, meaning "whole word matching"
I find the output quite satisfying, assuming that the following change of the "requirements" is accepted:
Hostnames in "a" only match parts of "b" if the adjoining _ (and other "word characters") are always considered part of the hostname.
(Note the additional output line of hostnamed, which is now not found in "b" anymore, because in "b", it is preceded by an _.)
To match possible occurrences of valid hostnames which are preceded/followed by other word characters, the list in "a" would have to explicitly name those variations. E.g. "_hostnamed" would have to be listed in order to not have "hostnamed" in the output.
(With a little luck, this might even be acceptable for OP, then this extended solution is recommended; for robustness against "EdMortonish traps". Ed, please consider this a compliment on your interesting input, it is not meant in any way negatively.)
Output for "nasty" a and b:
hostnamed
hostnamee
ostname
lilihostnamec
hos.namea
I am not sure whether the changed handling of _ still matches OPs goal (if not, within OPs scope the first case insensitive solution is satisfying).
_ is one of the "word characters" that -w ("whole word only" matching) treats as part of a word. More detailed regex control at some point gets beyond grep; as Ed Morton has mentioned, using awk or perl (sed for masochistic brain exercise, the kind I enjoy) is then appropriate.
With GNU grep 2.5.4 on Windows. The files a.txt and b.txt have your content; I made sure, however, that they have UNIX line endings. That is important (at least for a, possibly not for b).
$ cat tst.awk
NR==FNR {
    gsub(/^[^_]+_|-[^-]+$/,"")
    hostnames[tolower($0)]
    next
}
!(tolower($0) in hostnames)
$ awk -f tst.awk fileB fileA
hostnamee
$ awk -f tst.awk b.txt a.txt
hostnamee
ostname
lilihostnamec
hos.namea
The only assumption in the above is that your host names don't contain underscores and anything after the last - on the line is a date. If that's not the case and there's a better definition of what the optional hostname prefix and suffix strings in fileB can be then just tweak the gsub() to use an appropriate regexp.
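The gsub() is the part worth testing in isolation; a one-liner sketch of what it strips from one of the "nasty" lines:

```shell
# ^[^_]+_  removes a prefix up to (and including) the first underscore;
# -[^-]+$  removes the last dash and everything after it
echo '020214-_hostnamed-20170208.tar' |
awk '{ gsub(/^[^_]+_|-[^-]+$/, ""); print }'
```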

How to get a string out of a plain text file [duplicate]

This question already has answers here:
How do I split a string on a delimiter in Bash?
(37 answers)
Closed 6 years ago.
I have a .txt file that has a list containing a hash and a password, so it looks like this:
00608cbd5037e18593f96995a080a15c:9e:hoboken
00b108d5d2b5baceeb9853b1ad9aa9e5:1c:wVPZ
Out of this txt file I need to extract only the passwords and add them in a new text file so that I have a list that would look like this:
hoboken
wVPZ
etc.
How to do this in bash, a scripting language or simply with a text editor?
Given your examples, the following use of cut would achieve what you want:
cut -f3 -d':' /folder/file >> /folder/result
The command above keeps only the third :-separated field of each line, i.e. everything after the second colon, which works for your examples. The result is appended to /folder/result (use > instead of >> if you want to overwrite it).
Edit: I edited this answer to make it simpler.
I suggest to use awk to get always last column from your file:
awk -F ':' '{print $NF}' file
Output:
hoboken
wVPZ
With sed, to remove the string up to ::
sed 's/.*://' file
You could also use grep:
$ grep -o '[^:]*$' file
hoboken
wVPZ
-o print only the matching part
[^:] any character except :
* zero or more of the preceding
$ end of line
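If you'd rather stay in pure bash with no external tools, parameter expansion can do the same job (a sketch; the input file name file is assumed from the answers above):

```shell
# Recreate the sample input
printf '%s\n' '00608cbd5037e18593f96995a080a15c:9e:hoboken' \
              '00b108d5d2b5baceeb9853b1ad9aa9e5:1c:wVPZ' > file

# ${line##*:} strips the longest prefix ending in ':', leaving the password
while IFS= read -r line; do
    printf '%s\n' "${line##*:}"
done < file
```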

How to use the shell to join every three lines of a file?

I have a question:
file:
154891
145690
165211
190189
135901
290134
I want the output to look like this (every three uids joined by commas):
154891,145690,165211
190189,135901,290134
How can I do it?
You can use pr:
pr -3 -s, -l1 file
Print in 3 columns, with commas as separators, with a 'page length' of 1.
154891,145690,165211
190189,135901,290134
With GNU sed (0~3 is a GNU-only step address):
sed ':1;N;s/\n/,/;0~3b;t1' file
or
awk 'ORS=NR%3?",":"\n"' file
There could be many ways to do that; pick one you like, with or without the comma separator:
$ awk '{printf "%s%s",$0,(NR%3?",":RS)}' file
154891,145690,165211
190189,135901,290134
$ xargs -n3 -a file
154891 145690 165211
190189 135901 290134
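One more standard-tool option, not mentioned above: paste reads one line from stdin for every - operand, so three dashes group three lines at a time:

```shell
# Recreate the sample input
printf '%s\n' 154891 145690 165211 190189 135901 290134 > file

# -d, joins the three columns with commas
paste -d, - - - < file
```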

Bash grep in file which is in another file

I have 2 files; one contains this:
file1.txt
632121S0 126.78.202.250 1
131145S0 126.178.20.250 1
the other contains this: file2.txt
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
313359S2 126.137.37.250 OBS
I want to end up with a third file which contains :
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
Only the lines which start with the same string in both files. I can't remember how to do it. I tried several grep, egrep and find commands, but I still cannot use them properly...
Can you help please ?
You can use this awk:
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
It is based on the idea of two-file processing, looping through the files like this:
first loop through first file, storing the first field in the array a.
then loop through second file, checking if its first field is in the array a. If that is true, the line is printed.
To do this with grep, you need to use a process substitution:
grep -f <(cut -d' ' -f1 file1.txt) file2.txt
grep -f uses a file as a list of patterns to search for within file2. In this case, instead of passing file1 unaltered, process substitution is used to output only the first column of the file.
If you have a lot of these lines, then the utility join would likely be useful.
join - join lines of two files on a common field
Here's a set of examples.
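A minimal sketch of the join route (sample file names from the question; note that join requires both inputs sorted on the join field, so the output comes back in sorted order, not the original file order):

```shell
# Recreate the sample files
printf '%s\n' '632121S0 126.78.202.250 1' '131145S0 126.178.20.250 1' > file1.txt
printf '%s\n' '632121S0 126.78.202.250 OBS' '131145S0 126.178.20.250 OBS' \
              '313359S2 126.137.37.250 OBS' > file2.txt

# join on field 1 (the default); -o keeps only file2's columns
join -o 2.1,2.2,2.3 <(sort file1.txt) <(sort file2.txt)
```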

Bash script compare values from 2 files and print output values from one file

I have two files like this;
File1
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
File2
cl_id=1B3O7M6C8T4O1b559i2g930m0_1165d
cl_id=1X3J7M6J0W5S9535180h90302_101p5
cl_id=1G3D7X6V6A7R81356e3g527m9_101nl
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
cl_id=1Q3Y7Q7J0M3E62953e5g3g5k0_117p6
I want to compare the cl_id values and find those that exist in file1 but not in file2, then print the first value from file1 (the IP address).
it should be like this
114.4.21.198
114.4.21.205
114.4.21.205
114.4.21.213
114.4.23.70
114.4.21.201
114.4.21.211
120.172.168.36
I have tried awk, grep, diff, comm, but nothing comes close. Please tell me the correct command to do this. Thanks.
One proper way to do that is this:
grep -vFf file2 file1 | sed 's|,cl_id.*$||'
I do not see how you get your output. Where does 120.172.168.36 come from?
Here is one awk solution that compares the cl_id fields:
awk -F, 'NR==FNR {a[$0]; next} !($2 in a) {print $1}' file2 file1
114.4.21.205
(With the posted sample data, only the second 114.4.21.205 line carries a cl_id that is absent from file2.)
Feed both files into AWK or perl with field separator=",". If there are two fields, add the fields to a dictionary/map/two arrays/whatever ("file1Lines"). If there is just one field (this is file 2), add it to a set/list/array/whatever ("file2Lines"). After reading all input:
Loop over the file1Lines. For each element, check whether the key part is present in file2Lines. If not, print the value part.
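A sketch of that described approach in awk, keying on the number of comma-separated fields to tell the two files apart (file names as in the question; file2 here is a truncated copy of the sample):

```shell
# Recreate the sample files (file2 truncated to three entries)
printf '%s\n' \
  '114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps' \
  '114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf' \
  '114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1' \
  '114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h' > file1
printf '%s\n' \
  'cl_id=1J3W7P7H0S3L6g85900g736h6_101ps' \
  'cl_id=1W3C7Z7W0U3J6795197g177j9_117p1' \
  'cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h' > file2

awk -F, '
NF == 2 { ip[++n] = $1; key[n] = $2; next }  # file1: IP plus cl_id
        { seen[$0] }                         # file2: cl_id only
END     { for (i = 1; i <= n; i++)
              if (!(key[i] in seen)) print ip[i] }
' file1 file2
```

Indexed arrays (rather than a map keyed by IP) preserve duplicate IPs from file1 in their original order.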
This seems like what you want to do and might work, efficiently:
grep -Ff file2.txt file1.txt | cut -f1 -d,
First the grep takes the lines from file2.txt to use as patterns, and finds the matching lines in file1.txt. The -F is to use the patterns as literal strings rather than regular expressions, though it doesn't really matter with your sample.
Finally the cut takes the first column from the output, using , as the column delimiter, resulting in a list of IP addresses.
The output is not exactly the same as your sample, but the sample didn't make sense anyway, as it contains text that was not in any of the input files. Not sure if this is what you wanted or something more.
