grep with negative condition, and more than a certain line number - shell

I'm trying to write a grep/sed/awk command that matches a condition in a file, combined with a negative condition (does not contain xxx), and I also want to grep all the lines that are beyond a certain line number.

Awk should deal with that nicely:
/condition/ && ! /negative condition/ { print $0; outputdone = 1 }
{
    if (NR > certain_line_number && !outputdone) print $0
    outputdone = 0
}
I wasn't quite certain whether all the conditions were meant to apply together. I guessed that you always want to print lines beyond some point, but that up to that point the positive and negative conditions apply.
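For instance, with illustrative values (the pattern ERROR, the exclusion DEBUG, line 100 as the threshold, and "logfile" as a placeholder filename):

awk '/ERROR/ && !/DEBUG/ { print $0; outputdone = 1 }
     {
         if (NR > 100 && !outputdone) print $0
         outputdone = 0
     }' logfile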

tail -<number of lines that you want from end of file> filename | grep -v xxx
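If "greater than a certain line number" is meant literally, note that tail -n +K (print from line K onward) may be a closer fit than counting back from the end of the file; for example, to print every line after line 100 that does not contain xxx:

tail -n +101 filename | grep -v xxx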

Parsing unstructured text file with grep

I am trying to analyze this IDS log file from MIT, found here.
Summarized attack: 41.084031
IDnum Date StartTime Duration Destination Attackname insider? manual? console? success? aDump? oDump? iDump? BSM? SysLogs? FSListing? Stealthy? New? Category OS
41.08403103/29/1999 08:18:35 00:04:07 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:19:37 00:01:56 209.154.098.104ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:29:27 00:00:43 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
41.08403103/29/1999 08:40:14 00:24:26 172.016.112.050ps out rem succ aDmp oDmp iDmp BSM SysLg FSLst Stlth Old llU2R
I am trying to write commands that do two things:
First, parse through the entire file and determine the number of distinct "summarized attacks" that begin with 4x.xxxxx. I have accomplished this with:
grep -o -E "Summarized attack: 4". It returns 80.
Second, for each of the "Summarized attacks" found by the above command, parse the table and count the IDnum rows, returning the total number of rows (i.e., attacks) across all "Summarized attack" finds. I would imagine that number is somewhere around 200.
However, I am struggling to count the individual IDs, i.e., the entries in the IDnum column of this text file.
Since it is a text file with technically no structure, how can I parse this .txt file as if it had a tabular structure, to retrieve the total entries in the IDnum column for each "Summarized attack" matched by the above grep command's search text?
The desired output would be a count of all IDnums for the "Summarized attacks" found by the above command. I don't know the count, but I would imagine an integer output, similar to the 80 returned for grep -o -E "Summarized attack: 4". The output would be <int>, where <int> is the number of "attacks", as defined by rows in the IDnum column, across all 80 of the "Summarized attacks" found by the above grep command.
If another command other than grep is better suited, that is OK.
To count matches you can use grep -c:
grep -cE '(^Summarized.attack:.4[0-9]\.[0-9]+$)'
You can use the colon as the delimiter for cut -d:
(if you loop over the results, the leading whitespace does not matter)
grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' | cut -d: -f2
Example loop:
file="path/to/master-listfile-condensed.txt"
for var in $(grep -oE '(^Summarized.attack:.4[0-9]\.[0-9]+$)' "$file" | cut -d: -f2)
do
    printf "Summarized attacks: %s: %s\n" "$var" \
        "$(grep -cE "(^.${var}[0-9]+/[0-9]{2}/[0-9]{4})" "$file")"
done
^ start of line
$ end of line
. any single character (in this case a single space)
\. a literal dot (escaped)
[0-9] a single digit
+ one or more occurrences
{4} exactly four occurrences
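Putting the legend together, a quick sanity check of the pattern against one of the sample lines (illustrative):

$ printf 'Summarized attack: 41.084031\n' | grep -oE '^Summarized.attack:.4[0-9]\.[0-9]+$'
Summarized attack: 41.084031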
Assuming you have more than one "Summarized attack:" in your input file, this may be what you're looking for:
$ cat tst.awk
/^Summarized attack:/ {
    prt()                          # flush the count for the previous block
    atk = ($3 ~ /^4/ ? $3 : 0)     # only track blocks whose ID starts with 4
    cnt = 0
}
atk { cnt++ }                      # count every line inside a tracked block
END {
    prt()                          # flush the final block
    print "TOTAL", tot
}
function prt() {
    if ( atk ) {
        cnt -= 2                   # discount the "Summarized attack:" and IDnum header lines
        print atk, cnt
    }
    tot += cnt
}
$ awk -f tst.awk file
For your first part, fgrep -c "Summarized attack: 4" (equivalently, grep -Fc "Summarized attack: 4") is sufficient.
If I understand your second part, for each of those blocks, you want to add up the attack rows and print a grand total. You can do that with
gawk '/^Summarized attack: 4/ { on=1; next } /^ 4[0-9.]*/ { if (on) ++ids; next } /^ IDnum/ { next } /^ *$/ { next } { on=0 } END { print ids }' < master-listfile-condensed.txt
The first statement says: search (/.../) for every line that begins with (^) "Summarized attack: 4", and upon finding it, turn on the "on" flag and go to the next line. The second statement says: if this is an attack record (i.e. it begins with a space, then 4, followed by a run of digits and dots), check the flag; if it is on, count it. Basically, we want the flag to be on while we are in a stanza of target attack records. The next two statements say: for every line that starts with " IDnum" or is all whitespace (sometimes blank lines are inserted), go to the next line; this is needed to counteract the final statement, which says that if this is a line matching none of the previous patterns, turn off the "on" flag. This prevents us from counting attacks outside the target stanzas. Finally, END means: at the end, print the grand total. I get 757, which is pretty far out of your range, but I think it is correct.
But a far easier way, assuming the Summarized timestamp is always repeated in the IDnum at least to the first significant digit, would be to use
grep -Ec '^ 4' master-listfile-condensed.txt
That means count all the lines that begin with space-4. In this case it gives us the correct result.
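If you want both numbers in a single pass, here is a minimal awk sketch (my own, not from the answers above), assuming the layout shown earlier: each "Summarized attack: 4x" header is followed by a column-header line and then rows beginning with a space and the attack ID:

awk '/^Summarized attack: 4/ { blocks++; in4 = 1; next }   # start of a target block
     /^Summarized attack:/   { in4 = 0; next }             # any other block ends it
     in4 && /^ 4[0-9.]/      { rows++ }                    # ID rows inside a target block
     END { print blocks, rows }' master-listfile-condensed.txt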

bash: using the awk command to print lines if certain characters satisfy a condition

Before starting to explain my issue, I have to say that it's the first time I'm using bash and the awk command.
I have a file containing a lot of lines and I am interested in printing some of them if certain characters of the line satisfy a condition. I already have a simple method that works, but I intend to try awk to see if it can be faster. The command I'm trying was inspired by a colleague at work, but I don't fully understand it.
My file looks like :
# 15247.479
1 23775U 96005A 18088.90328565 -.00000293 +00000-0 +00000-0 0 9992
2 23775 014.2616 019.1859 0018427 174.9850 255.8427 00.99889926081074
# 15250.479
1 23775U 96005A 18088.35358271 -.00000295 +00000-0 +00000-0 0 9990
2 23775 014.2614 019.1913 0018425 174.9634 058.1812 00.99890136081067
The 4th field refers to a date, and I want to print the lines starting with 1 and 2 if that number (e.g. 18088.90328565) is greater than startDate and less than endDate.
I am trying with :
< $file awk ' BEGIN {ok=0}
{date=substring($0,19,10) if ($date>='$firstTime' && $date<= '$lastTime' ) {print; ok=1} else ok=0;next}
{if (ok) print}'
This returns a syntax error but I fear it is not the only problem. I don't really understand what the $0 in substring refers to.
Thanks everyone for the help !
Per the question about $0:
Awk is a language built for processing tables and has language features specific to both filtering and manipulating tabular data. One language feature is automatic field splitting.
If you see a $ in front of a variable or constant, it is referring to a "field." When awk sees $field_number being used in a variable context, awk splits the current record buffer based upon what is in the FS variable and allows you to work on that just as you would any other variable -- just that the backing store for that variable is the record buffer.
$0 is a special field referring to the whole of the record buffer. There are some interesting notes in the awk documentation about the side effects on $0 of assigning $field_number variables, FS and OFS that are worth an in depth read.
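A tiny illustration of the difference (any shell):

$ echo 'alpha beta gamma' | awk '{ print "record:", $0; print "field 2:", $2 }'
record: alpha beta gamma
field 2: beta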
Here is my answer to your application:
(1) First, LC_ALL may help us for speed. I'm using ll/ul for lower and upper limits -- the reason for which will be apparent later. Specifying them as variables outside the script helps our readability. It is good practice to properly quote shell variables.
(2) It is good practice to use BEGIN { ... }, as you did in your attempt, to formally initialize variables. If using gawk, we can use LINT = 1 to test things like this.
(3) /^#/ is probably the simplest (and fastest) pattern for our reset. We use next because we never want to apply the limits to this line and we never want to see this line in our output (even if ll = ul = "").
(4) It is surprisingly easy to make a mistake on limits. Implement limits consistently one way, and our readers will thank us. We remember to check corner cases where ll and/or ul are blank. One corner case is where we have already triggered our limits and we are waiting for /^#/ -- we don't want to rescan the limits again while ok.
(5) The default action of a pattern is to print.
(6) Remembering to quote our filename variable will save us someday when we inevitably encounter the stray "$file" with spaces in the name.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" ' # (1)
BEGIN { ok = 0 } # (2)
/^#/ { ok = 0; next } # (3)
!ok { ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul) } # (4)
ok # <- print if ok # (5)
' "$file" # (6)
You're missing a ; between the variable assignment and if. And instead of concatenating shell variables, assign them to awk variables. There's no need to initialize ok=0; uninitialized variables are automatically treated as falsey. And if you want to access a field of the input, use $n, where n is the field number, rather than substr().
You need to set ok=0 when you get to the next line beginning with #, otherwise you'll just keep printing the rest of the file.
awk -v firstTime="$firstTime" -v lastTime="$lastTime" '
NF > 3 && $4 > firstTime && $4 <= lastTime { print; ok=1; next }
$1 == "#" { ok = 0 }
ok { print }' "$file"
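For instance, with made-up bounds run against the sample file above, this prints only the two element lines of the first block, since 18088.90328565 falls inside the window and 18088.35358271 does not:

firstTime=18088.5 lastTime=18089.0
awk -v firstTime="$firstTime" -v lastTime="$lastTime" '
NF > 3 && $4 > firstTime && $4 <= lastTime { print; ok=1; next }
$1 == "#" { ok = 0 }
ok { print }' "$file"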
This answer is based upon my original, but takes into account some new information that @clem sent us in a comment: we now know that the line we need to test is always immediately subsequent to the line matching /^#/. Therefore, when we match in this new solution, we immediately do a getline to grab the next line, and set ok based upon that next line's data. We now only check against the limits on the line subsequent to our match, and we do not check against the limits on lines where we shouldn't.
LC_ALL=C awk -v ll="$firstTime" -v ul="$lastTime" '
BEGIN { ok = 0 }
/^#/ {
getline
ok = (ll == "" || ll <= $4) && (ul == "" || $4 <= ul)
}
ok # <- print if ok
' "$file"

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline+1;
        myline = myline+1;
    else
        myline = myline+1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '(NR > 1 && $2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for every line. A BEGIN block is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on; the full line is in $0.
Here I compare the second field to the previous value; if it does not match, I print the whole previous line (the NR > 1 guard avoids printing a blank line before the first record). In all cases I store the current line into line and the second field into prev.
And if you really want it right, be careful with the float comparisons: to treat two values as equal you want something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure whether awk converts to numbers for equality testing; if not, you're safe with the string comparisons.
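A sketch of that idea (eps is an arbitrary tolerance; the print condition becomes "differs by at least eps"):

<data awk -v eps=1e-6 '
    function abs(x) { return x < 0 ? -x : x }      # awk has no built-in abs
    NR > 1 && abs($2 - prev) >= eps { print line }
    { line = $0; prev = $2 }'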
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try the following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them on the next iteration with the current line's values. The && field check is useful to avoid a blank line at the beginning of the file, where $2 != field would match because the variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

Detecting the range of the matches - shell script

I have files A and B.
A contains log entries.
B contains some line numbers, which in turn are specific line numbers referring to the log file.
I have an awk program like this:
echo "($(awk 'NR==1{r1=$1;next}
NR>2{printf"||"}
{printf"(NR>%d&&NR<%d)",r1,$1;r1=$1}' B))&&/mypattern/"|
awk -f- A
(courtesy of Jeff Y from Stack Overflow)
In this link you can find the requirement from which the above script was born.
The part before the "|" generates the ranges to be searched in file A,
e.g.: (NR>385&&NR<537)||(NR>537&&NR<539)||(NR>539&&NR<547)
These are some of the ranges which can be generated by the part before the "|".
I need to store the ranges in which "mypattern" has a match, in a separate file or in a variable at the end.
Say, e.g., NR>385&&NR<537: this range in file A may have no match for my pattern, whereas NR>539&&NR<547 may have one.
So, any idea how to check whether a match has happened, and if so, how to store the corresponding NR values in a file?
Given your clarification in comments, I would propose a solution that only takes one pass through FILE_A with awk, rather than trying to extract and use line numbers and ranges numerically (FILE_B) in multiple passes:
awk '/Exception/ && !/ExceptionUnparseable date/ {
         haveEx = "yes"; ex = $0; exDate = last; haveMatch = ""
     }
     haveEx && /tms/ {
         haveMatch = "yes"; print exDate; print ex; haveEx = ""
     }
     haveMatch && /tms/ { print }
     { last = $0 }' FILE_A
Or, if you don't need to see the lines that do match the pattern:
awk '/Exception/ && !/ExceptionUnparseable date/ {
         haveEx = "yes"; ex = $0; exDate = last
     }
     haveEx && /tms/ {
         print exDate; print ex; haveEx = ""
     }
     { last = $0 }' FILE_A
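If you do still want the literal NR ranges recorded, here is a sketch of that approach (assuming FILE_B holds ascending line numbers, one per line; matched_ranges.txt is just an example output name):

awk 'NR == FNR { b[++n] = $1; next }            # first file: collect the boundaries
     /mypattern/ {                              # log file: remember ranges containing a hit
         for (i = 1; i < n; i++)
             if (FNR > b[i] && FNR < b[i+1]) hit[i] = 1
     }
     END {
         for (i = 1; i < n; i++)
             if (hit[i]) print "NR>" b[i] "&&NR<" b[i+1]
     }' FILE_B FILE_A > matched_ranges.txt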

Show different context on different grep keyword?

I know -A -B -C could be used to show context around the grep keyword.
My question is, how to show different context on different keyword?
For example, how do I show -A 5 for cat, -B 4 for dog, and -C 1 for monkey:
egrep -A3 "cat|dog|monkey" <file>
// this just shows 3 trailing lines for each keyword
I don't think there's any way to do it with a single grep call, but you could run it through grep once for each variable and concatenate the output:
var=$(grep -n -A 5 cat file)$'\n'$(grep -n -B 4 dog file)$'\n'$(grep -n -C 1 monkey file)
var=$(sort -un <(echo "$var"))
Now echo "$var" will produce the same output as you would have gotten from your single command, plus line numbers and context indicators (the : prefix indicates a line that matched the pattern exactly, and the - prefix indicates a line included because of the -A, -B and/or -C options).
The reason I included the line numbers is to preserve the order of the results you would have seen had you managed to do this in one statement. If you like them, great; if not, you can use the following line to cut them out:
var=$(cut -d: -f2- <(echo "$var") | cut -d- -f2-)
This passes the output through cut once to strip the exact matches' prefixes, then again to strip the context matches' prefixes.
Pretty? No. But it works.
I'm afraid grep won't do that. You'll have to use a different tool. Perhaps write your own program.
Something like this would do it:
awk '
BEGIN { ARGV[ARGC++] = ARGV[1] }   # queue the file a second time, for a second pass
function prtB(nr) { for (i=FNR-nr; i<FNR; i++) print a[i] }     # context before the match
function prtA(nr) { for (i=FNR+1; i<=FNR+nr; i++) print a[i] }  # context after the match
NR==FNR { a[NR] = $0; next }       # first pass: save every line
/cat/ { print; prtA(5) }
/dog/ { prtB(4); print }
/monkey/ { prtB(1); print; prtA(1) }
' file
Check the math on the loops in the functions. You didn't say how you'd want to handle lines that contain monkey AND dog, for example.
EDIT: here's an untested solution that would print the maximum context around any match and let you specify the contexts on the command line and won't use as much memory as the above cheap and cheerful solution:
awk -v cxts="cat:0:5\ndog:4:0\nmonkey:1:1" '
BEGIN {
    ARGV[ARGC++] = ARGV[1]                # queue the file a second time, for a second pass
    numCxts = split(cxts,cxtsA,RS)
    for (i=1; i<=numCxts; i++) {
        regex = cxtsA[i]
        n = split(regex,rangeA,/:/)
        sub(/:[^:]+:[^:]+$/,"",regex)     # strip ":before:after", leaving the bare regex
        endA[regex] = rangeA[n]           # context lines after a match
        startA[regex] = rangeA[n-1]       # context lines before a match
        regexA[regex]
    }
}
NR==FNR {
    for (regex in regexA) {
        if ($0 ~ regex) {
            start = NR - startA[regex]
            end = NR + endA[regex]
            for (i=start; i<=end; i++) {
                prt[i]                    # mark line i for printing on the second pass
            }
        }
    }
    next
}
FNR in prt
' file
Separate the searched-for patterns in the cxts variable with whatever your RS value is (newline by default).
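For example, assuming the script above were saved as different_context.awk (a hypothetical name), adding a fourth pattern with its own context is just one more RS-separated entry:

awk -v cxts="cat:0:5\ndog:4:0\nmonkey:1:1\nfish:0:2" -f different_context.awk file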
