Remove lines where next line matches certain pattern - bash

I have the following simple script for parsing dates out of IRC logs (created by irssi):
#!/bin/bash
query=$1
grep -n "$query" logfile > matches.log
grep -n "Day changed" logfile >> matches.log
cat matches.log | sort -n
It produces output like:
--- Day changed Tue Jul 03 2012
--- Day changed Wed Jul 04 2012
--- Day changed Thu Jul 05 2012
16:54 <#Hamatti> who let the dogs out
--- Day changed Fri Jul 06 2012
--- Day changed Sat Jul 07 2012
--- Day changed Sun Jul 08 2012
12:11 <#Hamatti> dogs are fun
But since I'm only interested in the dates of actual matches, I'd like to filter out all the
--- Day changed XXX XXX dd dddd
lines that are not followed by a timestamp on the next line. So the example should output
--- Day changed Thu Jul 05 2012
16:54 <#Hamatti> who let the dogs out
--- Day changed Sun Jul 08 2012
12:11 <#Hamatti> dogs are fun
to get rid of all the day-change lines that aren't useful.
Edit:
After the answer by T. Zelieke I realised that I could make this more of a one-liner, so I now use the following, which avoids reading the logfile twice.
query=$1
egrep "$query|Day changed" logfile |grep -B1 "^[^-]" |sed '/^--$/d'

grep -B1 "^[^-]" data |sed '/^--$/d'
This uses grep to select lines that do NOT start with a dash ("^[^-]"). -B1 asks grep to also print the line immediately before each match.
Unfortunately, grep then separates each match (a pair of two lines) with a -- line. Therefore I pipe the output through sed to get rid of those superfluous lines.
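A small run of that pipeline on made-up log lines may make the -- separator easier to see. This is only a sketch: the sample data and the function names (sample, with_separators, without_separators) are invented for illustration, and -B is a GNU/BSD grep extension rather than POSIX.

```shell
#!/bin/bash
# Hypothetical sample: day markers mixed with lines that already matched the query.
sample() {
cat <<'EOF'
--- Day changed Tue Jul 03 2012
--- Day changed Thu Jul 05 2012
16:54 <Hamatti> who let the dogs out
--- Day changed Fri Jul 06 2012
--- Day changed Sun Jul 08 2012
12:11 <Hamatti> dogs are fun
EOF
}

# Keep every line that does not start with a dash, plus the line before it.
with_separators() { grep -B1 '^[^-]'; }

# Same, then delete the "--" group separators that grep inserts.
without_separators() { grep -B1 '^[^-]' | sed '/^--$/d'; }

sample | with_separators
echo "=== after sed ==="
sample | without_separators
```

The first command shows one -- line between the two groups; the second shows only the four wanted lines.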

Here's one using awk.
awk -v query="$1" '/^--- Day changed/{day=$0;next} $0 ~ query {if (day!=p) {print day;p=day}; print}'
Every time it finds a "Day changed" line, it stores it in the variable day. Then when it finds a match to the query, it outputs the currently stored day line first. In case there are multiple matches in the same day, the variable p is used to determine if the day-line has been printed already.
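To try the awk approach without a real log, something like the following should work. It is a sketch under assumptions: the sample lines and the names sample_log, matches_with_dates and printed are made up here, and the query is treated as an awk regular expression.

```shell
#!/bin/bash
# Hypothetical sample log in the format shown in the question.
sample_log() {
cat <<'EOF'
--- Day changed Thu Jul 05 2012
16:54 <Hamatti> who let the dogs out
--- Day changed Fri Jul 06 2012
--- Day changed Sat Jul 07 2012
12:11 <Hamatti> dogs are fun
12:15 <Hamatti> more dogs
EOF
}

# $1 = query (an awk regex); the log is read from stdin.
matches_with_dates() {
  awk -v query="$1" '
    /^--- Day changed/ { day = $0; next }   # remember the most recent day line
    $0 ~ query {
      if (day != printed) {                 # print each day line only once
        print day
        printed = day
      }
      print
    }'
}

sample_log | matches_with_dates dogs
```

The Jul 06 line is dropped because no match follows it, and the Jul 07 line is printed once even though two matches follow it.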

Related

Prepend to lines of a program as they come in

I'm running xinput test and trying to timestamp the data.
From another question, I'm using:
xinput test $KEYBOARD_ID | (echo -n $(date +"$date_format") && cat) > $LOGFILE_NAME
However, that dates the first line, not every line.
If I do a while loop:
while IFS= read -r line
do
(echo -n $(date +"$date_format") && cat)
done < $(xinput test $KEYBOARD_ID)
The loop exits right away, since xinput test is yet to generate any text.
Process substitution fails as well, only dating the first line of the file.
while IFS= read -r line
do
(echo -n $(date +"$date_format") && cat) > $LOGFILE_NAME
done < <(xinput test $KEYBOARD_ID)
Writing to file and post-processing won't work, because I need the timestamp when each line was processed.
I feel like I'm making a small error, but I can't find it, any input?
The following GNU awk command is equivalent to @karakfa's answer, but launches fewer processes, so it could be faster if the device is generating a lot of events:
xinput test "$KEYBOARD_ID" | gawk '{print strftime(), $0}' > "$LOGFILE_NAME"
perhaps this will help...
$ seq 10 | xargs -n1 -I {} echo $(date) {}
Wed May 10 14:43:09 EDT 2017 1
Wed May 10 14:43:09 EDT 2017 2
Wed May 10 14:43:09 EDT 2017 3
Wed May 10 14:43:09 EDT 2017 4
Wed May 10 14:43:09 EDT 2017 5
Wed May 10 14:43:09 EDT 2017 6
Wed May 10 14:43:09 EDT 2017 7
Wed May 10 14:43:09 EDT 2017 8
Wed May 10 14:43:09 EDT 2017 9
Wed May 10 14:43:09 EDT 2017 10
Note that, as commented below, this timestamp won't be updated for each line, because $(date) is expanded once by the shell before xargs starts. If you want to timestamp each new line, use the gawk solution by user000001.
I feel like I'm making a small error, but I can't find it
Yep. It's the cat. It reads the rest of the input and puts it there. Instead, you should just write the current line, and append it to the file:
while IFS= read -r line
do
(echo "$(date +"$date_format") $line") >> $LOGFILE_NAME
done < <(xinput test $KEYBOARD_ID)
Which can more canonically be written as
while IFS= read -r line
do
echo "$(date +"$date_format") $line"
done < <(xinput test $KEYBOARD_ID) > "$LOGFILE_NAME"
I would go for @user000001's shorter and more efficient solution though.
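For anyone without xinput at hand, the per-line loop can be tried with any command that writes lines. This is a minimal sketch: the function name timestamp_lines, the default format and the fake input lines are assumptions, not part of the question.

```shell
#!/bin/bash
# Prefix every line read from stdin with the time at which it was read.
# $1 = optional date(1) format string (hypothetical default below).
timestamp_lines() {
  fmt=${1:-%H:%M:%S}
  while IFS= read -r line; do
    # date runs once per line, so each line gets its own timestamp
    printf '%s %s\n' "$(date +"$fmt")" "$line"
  done
}

# Stand-in for `xinput test "$KEYBOARD_ID"`: two made-up event lines.
printf 'key press 38\nkey release 38\n' | timestamp_lines '%Y-%m-%d %H:%M:%S'
```

In real use the producer would be piped in the same way, e.g. xinput test "$KEYBOARD_ID" | timestamp_lines "$date_format" > "$LOGFILE_NAME". Forking date for every line is the cost that the gawk version avoids.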

How sed exact match from url strings in input file and remove it?

I have an input file containing lines with URLs like this:
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:25:32 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:27:34 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015
I want to match only the exact string. For example, if I run sed for the last line,
which is https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015,
it should delete the last line only.
Note: lines may start with the same text but have a different date and time, so I need an exact match.
Thanks for understanding.
Do you just want to replace the whole line with the URL at its beginning?
The regex in the command below has two parts:
1. Match the URL:
^([a-zA-Z0-9:/.?=]*)
2. Match the rest of the line:
.*
Then replace the whole line with the URL part.
echo "https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015" | sed 's/^\([a-zA-Z0-9:/.?=]*\).*/\1/'
the result is
https://www.youtube.com/watch?v=VXw6OdVmMpw
Is this what you want?
The required output can be achieved by using:
grep -v 'text to grep'
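Since the question asks for an exact match, one possible refinement is to add -x (the whole line must match) and -F (treat the text as a fixed string, so the ? and . in the URL are not regex operators). The sketch below assumes that reading of the question; the function name remove_exact_line and the sample data are made up.

```shell
#!/bin/bash
# Hypothetical input, copied from the lines shown in the question.
sample_urls() {
cat <<'EOF'
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:25:32 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:27:34 2015
https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015
EOF
}

# Print stdin without the lines that are exactly equal to $1.
#   -v  invert the match      -x  match whole lines only
#   -F  fixed string, no regex interpretation
remove_exact_line() {
  grep -vxF -e "$1"
}

sample_urls |
  remove_exact_line 'https://www.youtube.com/watch?v=VXw6OdVmMpw Mon Nov 2 10:28:23 2015'
```

Only the 10:28:23 line is removed; the other two lines share the URL but differ in the time, so they do not match as whole lines.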

shell: Get line from FILE1 by content in FILE2

I have a file (maillog) like this:
Feb 22 23:53:39 info postfix[102]: connect from APVLDPDF01[...
Feb 22 23:53:39 info postfix[101]: BA1D7805A1: client=APVLDPDF01[...
Feb 22 23:53:39 info postfix[103]: BA1D7805A1: message-id
Feb 22 23:53:39 info opendkim[139]: BA1D7805A1: DKIM-Signature field added
Feb 22 23:53:39 info postfix[763]: ED6F3805B9: to=<CORREO1#GM.COM>, relay...
Feb 22 23:53:39 info postfix[348]: ED6F3805B9: removed
Feb 22 23:53:39 info postfix[348]: BA1D7805A1: from=<correo#prueba.com>,...
Feb 22 23:53:39 info postfix[102]: disconnect from APVLDPDF01...
Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<CO2#GM.COM>,status=sent
Feb 22 23:53:39 info postfix[348]: 59AE0805B4: removed
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<CO3#GM.COM>, status=sent
Feb 22 23:53:41 info postfix[348]: BA1D7805A1: removed
and a second file (mailids) like this:
6DBDD8039F:
3B15BC803B:
BA1D7805A1:
2BD19803B4:
I want to get an output file that contains something like this:
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<CO3#GM.COM>, status=sent
Only the lines whose ID exists in the second file; in this example only the ID BA1D7805A1: appears in file one. But there's another condition: the line must have the form "ID to=<",
meaning that only lines that contain "to=<" and an ID from file two should be output.
I've found different solutions, but I have a huge performance problem.
The maillog file is 2 GB, about 10 million lines, and the mailids file has around 32,000 lines.
The process takes too long, and I've never seen it finish.
I've tried awk and grep commands, but I can't find the best way.
grep -F -f mailids maillog | grep 'to=<'
From the grep man page:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
It is better to add the -w option:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
Here is the command I commonly use:
grep -Fwf mailids maillog |grep 'to=<'
and if the ID is always in column 6, try this awk one-liner:
awk 'NR==FNR{a[$1];next} /to=</&&$6 in a ' mailids maillog
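The awk one-liner reads the ID file into an array first and then does one array lookup per log line. Here is the same idea spelled out on miniature files. It is a sketch only: the file contents, the e-mail addresses, and the names workdir, filter_delivered and ids are invented for the example.

```shell
#!/bin/bash
# Hypothetical miniature versions of the two files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

cat > "$workdir/mailids" <<'EOF'
6DBDD8039F:
BA1D7805A1:
EOF

cat > "$workdir/maillog" <<'EOF'
Feb 22 23:53:39 info postfix[103]: BA1D7805A1: message-id
Feb 22 23:53:39 info postfix[842]: 59AE0805B4: to=<a@example.com>,status=sent
Feb 22 23:53:41 info postfix[918]: BA1D7805A1: to=<b@example.com>, status=sent
EOF

# $1 = file with one ID per line, $2 = log file with the ID in field 6.
filter_delivered() {
  awk '
    NR == FNR { ids[$1]; next }   # first file: remember every ID
    /to=</ && ($6 in ids)         # second file: print lines with to=< and a known ID
  ' "$1" "$2"
}

filter_delivered "$workdir/mailids" "$workdir/maillog"
```

Only the last log line is printed: the first has a known ID but no to=<, and the second has to=< but an ID that is not in the list.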

sed: convert time(3) seconds in a table into printable date (spamdb)

I get the following from spamdb, where the third field represents the time in seconds since the Epoch.
Cns# spamdb | fgrep TRAPPED
TRAPPED|113.163.117.129|1360836903
TRAPPED|113.171.216.201|1360837481
TRAPPED|122.177.159.61|1360844596
TRAPPED|36.231.9.231|1360865649
TRAPPED|37.146.207.209|1360832096
TRAPPED|212.156.98.210|1360837015
TRAPPED|59.99.160.62|1360839785
TRAPPED|86.127.116.162|1360840492
TRAPPED|92.83.139.194|1360843056
TRAPPED|219.71.12.150|1360844704
I want to sort this table by the time, and print the time field with date -r, such that it's presentable and clear when the event has occurred.
How do I do this in tcsh on OpenBSD?
Sorting with sort is easy, and so is editing with sed; but how do I make sed execute date -r or equivalent?
There are indeed a few obstacles here: first, you basically have to separate the data, and then one part of it is presented as-is, whereas another part has to be passed down to date -r for date formatting, prior to being presented to the user.
Another obstacle is making sure the output is aligned: apparently, it's quite difficult to handle the tab character in the shell, possibly only on the BSDs:
sed replace literal TAB
Replacing / with TAB using sed
Also, as we end up piping this to sh for execution, we have to use a different separator for the fields other than the pipe character, |.
So far, this is the best snippet I could come up with, it seems to work great in my tcsh:
Cns# spamdb | fgrep TRAPPED | sort -n -t '|' -k 3 | sed -E -e 's#\|#@#g' \
-e 's#^([A-Z]+)@([0-9.]+)@([0-9]+)$#"echo -n \2_"; "date -r \3"#g' | \
xargs -n1 sh -c | awk '{gsub("_","\t",$0); print;}'
37.146.207.209 Thu Feb 14 00:54:56 PST 2013
113.163.117.129 Thu Feb 14 02:15:03 PST 2013
212.156.98.210 Thu Feb 14 02:16:55 PST 2013
113.171.216.201 Thu Feb 14 02:24:41 PST 2013
59.99.160.62 Thu Feb 14 03:03:05 PST 2013
86.127.116.162 Thu Feb 14 03:14:52 PST 2013
92.83.139.194 Thu Feb 14 03:57:36 PST 2013
122.177.159.61 Thu Feb 14 04:23:16 PST 2013
219.71.12.150 Thu Feb 14 04:25:04 PST 2013
36.231.9.231 Thu Feb 14 10:14:09 PST 2013
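A possibly simpler variant is to let sh split the fields itself with read instead of generating commands with sed. This is a sketch in sh syntax, not tcsh, so from tcsh it would have to be saved to a file and run with sh. The names epoch_to_date, format_trapped and sample_spamdb are made up, the output is forced to UTC with a fixed format so that it is reproducible, and the GNU fallback is there only so the snippet also runs on systems where date -r expects a file.

```shell
#!/bin/sh
# Convert seconds since the Epoch to a UTC date string.
# BSD date (as on OpenBSD) takes -r SECONDS; GNU date takes -d @SECONDS.
epoch_to_date() {
  date -u -r "$1" '+%Y-%m-%d %H:%M:%S' 2>/dev/null ||
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Read spamdb output on stdin; print "IP<TAB>date", sorted by time.
format_trapped() {
  grep '^TRAPPED|' |
    sort -t '|' -k 3,3n |
    while IFS='|' read -r _state ip ts; do
      # printf handles the tab, so no literal TAB is needed in the script
      printf '%s\t%s\n' "$ip" "$(epoch_to_date "$ts")"
    done
}

# Three entries copied from the question, standing in for `spamdb`.
sample_spamdb() {
cat <<'EOF'
TRAPPED|113.163.117.129|1360836903
TRAPPED|36.231.9.231|1360865649
TRAPPED|37.146.207.209|1360832096
EOF
}

sample_spamdb | format_trapped
```

On the real system the last line would be spamdb | format_trapped. Dropping -u and the format string gives the local-time output of plain date -r shown above.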

Remove duplicate entries in a Bash script [duplicate]

I want to remove duplicate entries from a text file, e.g:
kavitha= Tue Feb 20 14:00 19 IST 2012 (duplicate entry)
sree=Tue Jan 20 14:05 19 IST 2012
divya = Tue Jan 20 14:20 19 IST 2012
anusha=Tue Jan 20 14:45 19 IST 2012
kavitha= Tue Feb 20 14:00 19 IST 2012 (duplicate entry)
Is there any possible way to remove the duplicate entries using a Bash script?
Desired output
kavitha= Tue Feb 20 14:00 19 IST 2012
sree=Tue Jan 20 14:05 19 IST 2012
divya = Tue Jan 20 14:20 19 IST 2012
anusha=Tue Jan 20 14:45 19 IST 2012
You can sort then uniq:
$ sort -u input.txt
Or use awk:
$ awk '!a[$0]++' input.txt
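The two commands differ in one visible way: sort -u reorders the lines, while the awk version keeps the first occurrence of each line in its original position. A small sketch with made-up input (the function name dedupe_keep_order and the array name seen are hypothetical):

```shell
#!/bin/bash
# Print each distinct line once, in order of first appearance.
# seen[$0]++ is 0 (false) the first time a line is met, so !seen[$0]++ is true
# only for the first occurrence, and awk's default action prints it.
dedupe_keep_order() {
  awk '!seen[$0]++'
}

printf '%s\n' b a b c a | dedupe_keep_order   # keeps input order: b, a, c
echo "--- sort -u ---"
printf '%s\n' b a b c a | sort -u             # sorted order: a, b, c
```

Note that both compare whole lines, so two entries that differ only in spacing are not treated as duplicates.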
This sed one-liner deletes duplicate, consecutive lines from a file (emulates "uniq").
The first line in a set of duplicate lines is kept; the rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'
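Because only neighbouring lines are compared, a duplicate that is separated from its first occurrence survives, which matters for the example in the question where the repeated entry is not adjacent. A quick check with made-up input (the function name dedupe_adjacent is hypothetical):

```shell
#!/bin/bash
# Collapse runs of identical adjacent lines, like uniq.
dedupe_adjacent() {
  sed '$!N; /^\(.*\)\n\1$/!P; D'
}

# The adjacent pair of "a" lines collapses; the later "a" is kept.
printf '%s\n' a a b a | dedupe_adjacent
```

This prints a, b, a. Sorting first would make all duplicates adjacent, at the price of losing the original order.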
Perl one-liner similar to @kev's awk solution:
perl -ne 'print if ! $a{$_}++' input
This variation removes trailing whitespace before comparing:
perl -lne 's/\s*$//; print if ! $a{$_}++' input
This variation edits the file in-place:
perl -i -ne 'print if ! $a{$_}++' input
This variation edits the file in-place, and makes a backup input.bak
perl -i.bak -ne 'print if ! $a{$_}++' input
This might work for you:
cat -n file.txt |
sort -u -k2,7 |
sort -n |
sed 's/.*\t/ /;s/\([0-9]\{4\}\).*/\1/'
or this:
awk '{line=substr($0,1,match($0,/[0-9][0-9][0-9][0-9]/)+3);sub(/^/," ",line);if(!dup[line]++)print line}' file.txt

Resources