shell: Get line from FILE1 by content in FILE2 - bash

I have a file (maillog) like this:
Feb 22 23:53:39 info postfix[102]: connect from APVLDPDF01[...
Feb 22 23:53:39 info postfix[101]: BA1D7805A1: client=APVLDPDF01[...
Feb 22 23:53:39 info postfix[103]: BA1D7805A1: message-id
Feb 22 23:53:39 info opendkim[139]: BA1D7805A1: DKIM-Signature field added
Feb 22 23:53:39 info postfix[763]: ED6F3805B9: to=<CORREO1#GM.COM>, relay...
Feb 22 23:53:39 info postfix[348]: ED6F3805B9: removed
Feb 22 23:53:39 info postfix[348]: BA1D7805A1: from=<correo#prueba.com>,...
Feb 22 23:53:39 info postfix[102]: disconnect from APVLDPDF01...
Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<CO2#GM.COM>,status=sent
Feb 22 23:53:39 info postfix[348]: 59AE0805B4: removed
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<CO3#GM.COM>, status=sent
Feb 22 23:53:41 info postfix[348]: BA1D7805A1: removed
and a second file (mailids) like this:
6DBDD8039F:
3B15BC803B:
BA1D7805A1:
2BD19803B4:
I want to get an output file that contains something like this:
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<CO3#GM.COM>, status=sent
Just the lines whose ID exists in the second file; in this example only the ID BA1D7805A1: appears in both files. But there's another condition: the line must have the form "ID to=<",
meaning that only lines which contain both "to=<" and an ID from file two should be output.
I've found different solutions, but I have a huge problem with performance.
The maillog file is 2GB, about 10 million lines, and the mailids file has around 32,000 lines.
The process takes too long, and I've never seen it finish.
I've tried awk and grep commands, but I can't find the best way.

grep -F -f mailids maillog | grep 'to=<'
From the grep man page:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)

better to add -w option
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
Here is the command I commonly use:
grep -Fwf mailids maillog | grep 'to=<'
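If that is still slow on a 2GB file, two common tweaks worth benchmarking on your own data are forcing the C locale (which considerably speeds up GNU grep's matching) and running the cheap to=< filter first, so the 32,000-pattern match only sees candidate lines (matches.log is just a placeholder name):
LC_ALL=C grep 'to=<' maillog | LC_ALL=C grep -Fwf mailids > matches.log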
If the ID is always in column 6, try this awk one-liner:
awk 'NR==FNR{a[$1];next} /to=</&&$6 in a ' mailids maillog
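For readability, here is the same one-liner written out with comments; it assumes the ID, trailing colon included, always sits in field 6, and matches.log is a placeholder output name:
awk '
    NR == FNR { ids[$1]; next }    # first file (mailids): remember each ID, colon included
    /to=</ && ($6 in ids)          # maillog: print lines containing to=< whose field 6 is a known ID
' mailids maillog > matches.log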

Related

What is the Exact Use and Meaning of "IFS=!"

I was trying to understand the usage of IFS but there is something I couldn't find any information about.
My example code:
#!/bin/sh
# (C) 2016 Ergin Bilgin
IFS=!
for LINE in $(last -a | sed '$ d')
do
echo $LINE | awk '{print $1}'
done
unset IFS
I use this code to print last users line by line. I totally understand the usage of IFS, and in this example, when I use the default IFS, it reads word by word inside my loop. And when I use IFS=! it reads line by line, as I wish. The problem is that I couldn't find anything about that "!" anywhere. I don't remember where I learned it. When I google how to achieve the same kind of behaviour, I see other values, which are usually strings.
So, what is the meaning of that "!", and how does it give me the result I want?
Thanks
IFS=! merely sets IFS to a value that does not occur in the input, so that you can iterate over the input line by line. Having said that, using a for loop here is not recommended; it is better to use read in a while loop like this to print the first column, i.e. the username:
last | sed '$ d' | while read -r u _; do
echo "$u"
done
As you are aware, if the output of last had a !, the script would split the input lines on that character.
The output format of last is not standardized (not in POSIX for instance), but you are unlikely to find a system where the first column contains anything but the name of whatever initiated an action. For instance, I see this:
tom pts/8 Wed Apr 27 04:25 still logged in michener.jexium-island.net
tom pts/0 Wed Apr 27 04:15 still logged in michener.jexium-island.net
reboot system boot Wed Apr 27 04:02 - 04:35 (00:33) 3.2.0-4-amd64
tom pts/0 Tue Apr 26 16:23 - down (04:56) michener.jexium-island.net
continuing to
reboot system boot Fri Apr 1 15:54 - 19:03 (03:09) 3.2.0-4-amd64
tom pts/0 Fri Apr 1 04:34 - down (00:54) michener.jexium-island.net
wtmp begins Fri Apr 1 04:34:26 2016
with Linux, and different date-formats, origination, etc., on other machines.
By setting IFS=!, the script sets the field separator to a value which is unlikely to occur in the output of last, so the text is read into LINE without being split into words. Normally, the expansion would be split on spaces, tabs, and newlines.
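Here is a minimal sketch of that effect, independent of last (the data string is just an example):
#!/bin/sh
data="one two
three four"
IFS=!                       # '!' does not occur in $data,
for w in $data; do          # so the unquoted expansion is not split:
    printf '[%s]\n' "$w"    # the loop body runs exactly once
done
unset IFS                   # restore the default (space, tab, newline)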
However, as you see, the output of last normally uses spaces for separating columns, and it is fed into awk which splits the line anyway — with spaces. The script could be simplified in various ways, e.g.,:
#!/bin/sh
for LINE in $(last -a | sed -e '$ d' -e 's/ .*//')
do
echo $LINE
done
which is (starting from the example in the question) adequate if the number of logins is not large enough to exceed your command-line. While checking for variations in last output, I noticed one machine with about 9800 lines from several years. (The other usual motivations given for not using for-loops are implausible in this instance). As a pipe:
#!/bin/sh
last -a | sed -e 's/ .*//' -e '/^$/d' | while IFS= read LINE
do
echo $LINE
done
I changed the sed expression (which OP likely copied from some place such as Bash - remove the last line from a file) because it does not work.
Finally, using the -a option of last is unnecessary, since all of the additional information it provides is discarded.
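Putting those observations together, the loop can arguably be dropped entirely. A sketch, where sed '$ d' removes the trailing wtmp begins line as in the question and NF skips the blank line before it:
last | sed '$ d' | awk 'NF { print $1 }'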

How to sed an exact-match URL string in an input file and remove it?

I have an input file containing lines with URLs, like this:
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:25:32 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:27:34 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015
I want to match only the exact string; for example, if I sed for the last line,
which is https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015,
it should delete the last line only.
Note: lines may contain the same text at the front but a different date and time, so I need an exact match.
Thanks for understanding.
Do you just want to replace the whole line with the URL at the beginning of the line?
There are two parts to the regex in the command below:
1. Match the URL:
^([a-zA-Z0-9:/.?=]*)
2. Match the rest:
.*
Then replace the whole line with the URL part.
echo "https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015" | sed 's/^\([a-zA-Z0-9:/.?=]*\).*/\1/'
the result is
https://www.youtube.com/watch?v=VXw6OdVmMpw
Is this what you want?
The required output can be achieved by using:
grep -v 'text to grep'
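Since the question asks to delete one exact line, a more direct approach (a sketch; input and output are placeholder file names) is to combine -v with -x, which matches whole lines only, and -F, which treats the pattern as a fixed string so the ? and . in the URL are not regex metacharacters:
grep -vxF 'https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015' input > output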

Check if a line does not start with a specific string with grep [duplicate]

This question already has answers here: Linux, Print all lines in a file, NOT starting with (5 answers). Closed 8 years ago.
I have a file app.log
Oct 06 03:51:43 test test
Nov 06 15:04:53 text text text
more text more text
Nov 06 15:06:43 text text text
Nov 06 15:07:33
more text more text
Nov 06 15:14:23 test test
more text more text
some more text
Nothing but text
some extra text
Nov 06 15:34:31 test test test
How do I grep all the lines that do not begin with Nov 06?
I have tried
grep -En "^[^Nov 06]" app.log
I am not able to get lines which have 06 in them.
The bracket expression [^Nov 06] matches any single character other than N, o, v, space, 0, or 6; it does not negate the string Nov 06. Simply use the below grep command instead,
grep -v '^Nov 06' file
From grep --help,
-v, --invert-match select non-matching lines
Another hack through regex,
grep -P '^(?!Nov 06)' file
Regex Explanation:
^ Asserts that we are at the start.
(?!Nov 06) This negative lookahead asserts that the string Nov 06 does not follow the line start. If so, the match succeeds at the empty boundary before the first character of the line, and grep prints that line.
Another regex-based solution, through the PCRE verbs (*SKIP)(*F):
grep -P '^Nov 06(*SKIP)(*F)|^' file
The first alternative matches lines starting with Nov 06, then (*SKIP)(*F) forces the match to fail without retrying, so only the bare ^ alternative can match, on all the other lines.
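For comparison, an awk equivalent that needs no PCRE:
awk '!/^Nov 06/' app.log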

Remove lines where next line matches certain pattern

I have the following simple script for parsing dates out of IRC logs (created by irssi):
#!/bin/bash
query=$1
grep -n $query logfile > matches.log
grep -n "Day changed" logfile >> matches.log
cat matches.log | sort -n
It produces output like:
--- Day changed Tue Jul 03 2012
--- Day changed Wed Jul 04 2012
--- Day changed Thu Jul 05 2012
16:54 <#Hamatti> who let the dogs out
--- Day changed Fri Jul 06 2012
--- Day changed Sat Jul 07 2012
--- Day changed Sun Jul 08 2012
12:11 <#Hamatti> dogs are fun
But since I'm only interested in finding out dates for actual matches, I'd like to filter out all those
--- Day changed XXX XXX dd dddd
lines that are not followed by a timestamp on the next line. So the example should output
--- Day changed Thu Jul 05 2012
16:54 <#Hamatti> who let the dogs out
--- Day changed Sun Jul 08 2012
12:11 <#Hamatti> dogs are fun
to get rid of all the extra lines that aren't useful.
Edit: after the answer by T. Zelieke, I realised that I could make this more of a one-liner, so I now use the following to save logfile from being iterated over twice.
query=$1
egrep "$query|Day changed" logfile |grep -B1 "^[^-]" |sed '/^--$/d'
grep -B1 "^[^-]" data |sed '/^--$/d'
This uses grep to filter lines that do NOT start with a dash ("^[^-]"); -B1 prints the line immediately before each match.
Unfortunately, grep then separates each match (a pair of lines) with a -- line. Therefore the output is piped through sed to get rid of those superfluous lines.
Here's one using awk.
awk -v query="$1" '/^--- Day changed/{day=$0;next} $0 ~ query {if (day!=p) {print day;p=day}; print}'
Every time it finds a "Day changed" line, it stores it in the variable day. Then when it finds a match to the query, it outputs the currently stored day line first. In case there are multiple matches in the same day, the variable p is used to determine if the day-line has been printed already.
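Here is the same one-liner written out with comments (logfile as in the question's script):
awk -v query="$1" '
    /^--- Day changed/ { day = $0; next }     # remember the most recent day header
    $0 ~ query {
        if (day != p) { print day; p = day }  # print the header once per day
        print                                 # print the matching line itself
    }
' logfile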

Remove duplicate entries in a Bash script [duplicate]

This question already has answers here: How to delete duplicate lines in a file without sorting it in Unix (9 answers). Closed 7 years ago.
I want to remove duplicate entries from a text file, e.g:
kavitha= Tue Feb 20 14:00 19 IST 2012 (duplicate entry)
sree=Tue Jan 20 14:05 19 IST 2012
divya = Tue Jan 20 14:20 19 IST 2012
anusha=Tue Jan 20 14:45 19 IST 2012
kavitha= Tue Feb 20 14:00 19 IST 2012 (duplicate entry)
Is there any possible way to remove the duplicate entries using a Bash script?
Desired output
kavitha= Tue Feb 20 14:00 19 IST 2012
sree=Tue Jan 20 14:05 19 IST 2012
divya = Tue Jan 20 14:20 19 IST 2012
anusha=Tue Jan 20 14:45 19 IST 2012
You can use sort -u, which sorts and de-duplicates in one step:
$ sort -u input.txt
Or use awk:
$ awk '!a[$0]++' input.txt
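Unlike sort -u, the awk version preserves the original line order. It works because a[$0]++ evaluates to the number of times the current line has been seen before (0 on the first occurrence), so the pattern !a[$0]++ is true only the first time a line appears, and awk's default action prints it. The same idea extends to de-duplicating on a single field, for example the key before the = sign (a sketch):
awk -F= '!seen[$1]++' input.txt    # keep only the first line for each key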
This sed one-liner deletes duplicate, consecutive lines from a file (emulating uniq): the first line in a run of duplicates is kept, the rest are deleted. Note that it only removes adjacent duplicates, so it would miss the non-adjacent kavitha lines in this question's example.
sed '$!N; /^\(.*\)\n\1$/!P; D'
A Perl one-liner similar to the awk solution above:
perl -ne 'print if ! $a{$_}++' input
This variation removes trailing whitespace before comparing:
perl -lne 's/\s*$//; print if ! $a{$_}++' input
This variation edits the file in-place:
perl -i -ne 'print if ! $a{$_}++' input
This variation edits the file in-place and makes a backup, input.bak:
perl -i.bak -ne 'print if ! $a{$_}++' input
This might work for you (cat -n numbers the lines, sort -u -k2,7 de-duplicates on the content fields, sort -n restores the original order, and the sed strips the line numbers and everything after the four-digit year):
cat -n file.txt |
sort -u -k2,7 |
sort -n |
sed 's/.*\t/ /;s/\([0-9]\{4\}\).*/\1/'
or this:
awk '{line=substr($0,1,match($0,/[0-9][0-9][0-9][0-9]/)+3);sub(/^/," ",line);if(!dup[line]++)print line}' file.txt
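And the awk alternative written out with comments:
awk '{
    # keep the text up to and including the first run of four digits (the year),
    # which drops trailing annotations such as " (duplicate entry)"
    line = substr($0, 1, match($0, /[0-9][0-9][0-9][0-9]/) + 3)
    sub(/^/, " ", line)              # prepend a space, as the sed pipeline above does
    if (!dup[line]++) print line     # print only the first occurrence of each line
}' file.txt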
