I want to sift through text files looking for suspicious activity. I'm a little familiar with bash scripting, including grep, sed and awk. My research on stackoverflow.com, tldp.org, etc., and talks with colleagues show that perl and python are best suited for my task, but I have zero experience with those scripting languages.
Answers in various scripting, compiled or interpreted languages are welcome. Due to my limitations, please add comments to the code, enabling me to quickly understand and learn the language.
Ok, the task is sorting and counting items sorted in columns. I /can/ accomplish part of this using grep, awk and sed. Unfortunately, the recursive aspects (as I perceive the problem) have me stumped.
The input text is sorted, two columns of ip addresses (simplified in my example, below) and a column for a destination port (of all possible values). The size of this file may be several megabytes, but probably never more than 250MB at a time, so absolute efficiency isn't necessary. Simplicity is.
SIP DIP DPt
111.100 200.150 80
111.100 200.150 443
111.100 200.155 22
111.100 200.155 80
111.100 200.155 443
111.100 200.160 80
111.100 200.165 139
111.100 200.165 443
111.100 200.165 512
115.102 225.150 80
115.102 225.150 137
115.102 225.150 443
120.125 250.175 23
120.135 250.145 23
125.155 250.165 80
125.155 250.165 139
125.155 250.175 1023
The code I have working (drafting this from memory ... not currently at my linux box) is similar to this ...
#!/bin/bash
declare -i counter=0
SIP=null # current source ip.
SIP_last=null # for last ip address processed.
SIP_next=null # not found a use for this, yet.
# sorting usually reqs three vars, so here it is.
# awk prints only the source column; $3 < 1024 keeps privileged ports.
for SIP in $(zcat textfile.gz | awk '$3 < 1024 { print $1 }'); do
    # Ensure I count the first item. This was problematic at first.
    if [[ "$SIP_last" == null ]]; then
        SIP_last=$SIP
    fi
    # Do something useful. As shown, it works.
    if [[ "$SIP" == "$SIP_last" ]]; then
        counter=counter+1 # counter=+ didn't work reliably; declare -i makes this arithmetic.
    fi
    if [[ "$SIP" != "$SIP_last" ]]; then
        echo "SIP: $SIP_last Counter: $counter" # DIP code has not yet been added.
        SIP_last=$SIP
        counter=1 # restart the tally for the new source ip.
    fi
done
# Ensure I always catch the last item.
echo "SIP: $SIP_last Counter: $counter"
Using the input provided above, the output should look something like this ...
SIP DIP Ct Ports
> 2 < 1024
111.100 200.150 80, 443
111.100 200.155 22, 80, 443
111.100 200.165 139, 443, 512
115.102 225.150 80, 137, 443
Looking at the output you can see the crux of the matter is only reporting DIP counts > 2 and Ports < 1024. Limiting the ports to < 1024 is simple enough using the provided awk statement. It's matching up the DIPs to the SIPs and keeping a running tally of the DPts that's the kicker.
Again, this is from memory, so forgive the coding errors. Thanks for your assistance.
Allen.
With your posted sample input file:
$ awk '
NR==1 { print; next }
$3 < 1024 {
key = $1 "\t" $2
if (!seen[key,$3]++) {
cnt[key]++
vals[key] = vals[key] sep[key] $3
sep[key] = ", "
}
}
END { for (key in cnt) if (cnt[key] > 1) print key "\t" vals[key] }
' file
SIP DIP DPt
111.100 200.155 22, 80, 443
111.100 200.165 139, 443, 512
125.155 250.165 80, 139
115.102 225.150 80, 137, 443
111.100 200.150 80, 443
If that's not what you're looking for, please clarify.
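And since you specifically asked for commented Perl or Python: the same grouping can be sketched in Perl. This is a sketch under the same assumptions as the awk answer above (report SIP/DIP pairs with more than one distinct port below 1024; flows.txt is just a stand-in name, pipe from zcat for the real .gz file):

```shell
# recreate the sample input from the question (hypothetical file name)
cat > flows.txt <<'EOF'
SIP DIP DPt
111.100 200.150 80
111.100 200.150 443
111.100 200.155 22
111.100 200.155 80
111.100 200.155 443
111.100 200.160 80
111.100 200.165 139
111.100 200.165 443
111.100 200.165 512
115.102 225.150 80
115.102 225.150 137
115.102 225.150 443
120.125 250.175 23
120.135 250.145 23
125.155 250.165 80
125.155 250.165 139
125.155 250.175 1023
EOF

perl -wnE'
  next if $. == 1;                           # skip the "SIP DIP DPt" header
  my ($sip, $dip, $dpt) = split;             # whitespace-separated columns
  next unless defined $dpt && $dpt < 1024;   # keep privileged ports only
  $seen{"$sip $dip"}{$dpt} = 1;              # dedupe ports per SIP/DIP pair
  END {
    for my $k (sort keys %seen) {            # one report line per pair
      my @p = sort { $a <=> $b } keys %{ $seen{$k} };
      say "$k\t" . join(", ", @p) if @p > 1; # pairs with more than one port
    }
  }
' flows.txt > report.txt
cat report.txt
```

For the real gzipped file, replace the filename argument with a pipe: zcat textfile.gz | perl ... (reading standard input).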
I am trying to find the lines in a file in which none of the numbers appear in the preceding line. This file has around 400000 lines. This is an example of the input file:
320 5120
240 326 5120
240 326 5120
241 333 514
240 326 5120
240 326 5120
320 5120
240
100 112
240 326 5120
240 326 5120
320 5120
The expected output is:
241 333 514
240 326 5120
240
100 112
240 326 5120
So far I could find this command:
$ awk '!seen[$1]++' file
320 5120
240 326 5120
241 333 514
100 112
which gives me the unique numbers of column 1, and I can do the same separately for the other columns. Can I somehow get the information I want from this command? Any help would be appreciated.
A Perl command-line program ("one"-liner), assuming there may be things other than numbers in the file
perl -wnE'
@n = /([0-9]+)/g;
say "@n" if not grep { exists $seen_nums{$_} } @n;
%seen_nums = map { $_ => 1 } @n
' data.txt
This prints the desired output. It also prints the very first line (correctly). Since the program parses lines for numbers it can be used for files with headers, text-only (comment?) lines, etc.
But if the data is sure to have only numbers then we can use Perl's -a switch with which words on each line are available in the @F array. Also shrunk a little to actually fit on a line
perl -wlanE'grep exists $n{$_}, @F or say; %n = map { $_=>1 } @F' data.txt
A brief explanation of switches (see docs linked above)
-w turns on warnings
-l strips the newline, and can tack it back on, with few more subtleties
-a turns on "autosplit" (when used with -n or -p), so that @F is available in the program which contains words on the line. On newer Perls this sets -n as well
-n Critical for processing files or STDIN -- opens the resource and sets up a loop over lines. Run with -MO=Deparse to see what it does
-E The -e is what makes it evaluate everything between the following quotes as Perl code. With capital (E) it also turns on features, which I use mostly for say. (Doing this has drawbacks, since it enables all features and makes things not backwards compatible anymore.)
Note: The first line can be omitted by adding the condition $. != 1 to the print
Here is an awk solution:
$ awk 'NR>1{p=1; for (i=1;i<=NF;i++){if($i in a)p=0}} {delete a; for (i=1;i<=NF;i++)a[$i]} p' file
241 333 514
240 326 5120
240
100 112
240 326 5120
How it works
NR>1{...}
Perform the commands in braces for all except the first line. Those commands are:
p=1
Initialize p to true (nonzero)
for (i=1;i<=NF;i++){if($i in a)p=0}
If any field is a key in array a, then set p to false (zero).
delete a
Delete array a.
for (i=1;i<=NF;i++)a[$i]
Create a key in array a for every field on the current line.
p
If p is true, print the line.
Multiple line version
Or, for those who prefer their code spread over multiple lines:
awk '
NR>1{
p=1
for (i=1;i<=NF;i++){
if($i in a)p=0}
}
{
delete a
for (i=1;i<=NF;i++)
a[$i]
}
p' file
Here's a perl one-liner:
$ perl -M-warnings -lane 'print unless @F ~~ %prev; %prev = map { $_ => 1 } @F;' input.txt
320 5120
241 333 514
240 326 5120
240
100 112
240 326 5120
It uses the frowned-upon smart match operator in the name of conciseness. With smartmatch, ARRAY ~~ HASH returns true if any elements of the array are keys in the hash, which is perfect for this use case. If this was a standalone script and not a one-liner I'd probably use a different approach, though.
(Is there a reason the first line of your sample input isn't in your expected output even though it meets the criteria?)
Here is a perl solution that does that. It tests whether any of the numbers were seen on the previous line.
This includes printing the first line, as noted by Shawn, which might be needed. If not, just exclude the print join(... line in the code.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'any';
open my $fh, '<', 'f0.txt' or die $!;
my @nums = split ' ', <$fh>;
my %seen = map{ $_ => 1} @nums;
print join(' ', @nums), "\n"; # print the first line
while (<$fh>) {
@nums = split;
print unless any {$seen{$_}} @nums;
%seen = map{ $_ => 1} @nums;
}
close $fh or die $!;
Output is:
320 5120
241 333 514
240 326 5120
240
100 112
240 326 5120
A simple awk that checks, by means of a regex match, whether the number is in the previous line. The idea is:
the previous line is stored in variable t
if any of the fields is matched to the previous line, we can skip to the next line.
This is done in the following way:
$ awk '{for(i=1;i<=NF;++i) if (FS t FS ~ FS $i FS) {t=$0; next}; t=$0}1' file
320 5120
241 333 514
240 326 5120
240
100 112
240 326 5120
The trick to make it work is to ensure that the line starts and stops with a field separator. If we would do the test t ~ $i we could match the number 25 against the number 255. But by ensuring that all numbers are sandwiched between field separators, we can just do the test FS t FS ~ FS $i FS.
note: if you don't want the first line to be printed, replace the last 1 by (FNR>1)
Given your updated input:
$ awk '$0 !~ p; {gsub(/ /,"|"); p="(^| )("$0")( |$)"}' file
241 333 514
240 326 5120
240
100 112
240 326 5120
The above just converts the previous line read into a regexp like (^| )(320|5120)( |$) and then does a regexp comparison to see if the current line matches it, printing the current line if it doesn't match the modified previous line. This approach would only lead to false matches if your fields contained RE metacharacters, which obviously yours don't since they're all digits.
I need to take all numbers that appear within a book index and add 22 to them. The index data looks like this (for example):
Ubuntu, 120, 143, 154
Yggdrasil, 144, 170-171
Yood, Charles, 6
Young, Bob, 178-179
Zawinski, Jamie, 204
I am trying to do this with awk using this script:
#!/bin/bash
filename="index"
while read -r line
do
echo $line | awk -v n=22 '{printf($1)}{printf(" " )}{for(i=2;i<=NF;i++)printf(i%2?$i+n:$i+n)", "};{print FS}'
done < "$filename"
It comes close to working but has the following problems:
It doesn't work for page numbers that are part of a range (e.g., "170-171") rather than individual numbers.
For entries where the index term is more than one word (e.g., "X Windows" and "Young, Bob") the output displays only the first word in the term. The second word ends up being output as the number 22. (I know why this is happening -- my awk command treats $2 as a number, and if it's a string it assumes it has a value of 0.) But I can't figure out how to solve it.
Disclosure: I'm by no means an awk expert. I'm just looking for a quick way to modify the page numbers in my index (which is due in a few days) because my publisher decided to change the pagination in the manuscript after I had already prepared the index. awk seems like the best tool for the job to me, but I'm open to other suggestions if someone has a better idea. Basically, I just need a way to say "take all numbers in this file and add 22 to them; don't change anything else."
With GNU awk for multi-char RS and RT:
$ awk -v RS='[0-9]+' '{ORS=(RT=="" ? "" : RT+22)}1' file
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
For example:
perl -plE 's/\b(\d+)\b/$1+22/ge' index
output
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
but it isn't awk
You can use this gnu awk command:
awk 'BEGIN {FS="\f";RS="(, |-|\n)";} /^[0-9]+$/ {$1 = $1 +22} { printf("%s%s", $1, RT);}' yourfile
there is a bit of abuse of FS and RS to get awk to handle each token in each line as a record of its own, so you don't have to loop over the fields and test each field whether or not it is numerical
RS="(, |-|\n)" configures dash, newline and ", " as record separators
on "records" consisting only of digits: 22 is added
the printf prints the token together with its RT to reconstruct the line from the file
Consider using the following GNU awk script (add_number.awk); it uses match() with a third array argument, which is gawk-specific:
BEGIN{ FS=OFS=", "; if (!n) n=22; } # if `n` variable hasn't been passed the default is 22
{
for (i=1;i<=NF;i++) { # traversing fields
if ($i~/^[0-9]+$/) { # if a field contains a single number
$i+=n;
}
else if (match($i, /^([0-9]+)-([0-9]+)$/, arr)) { # if a field contains `range of numbers`
$i=(arr[1]+n)"-"(arr[2]+n);
}
}
print;
}
Usage:
awk -v n=22 -f add_number.awk testfile
The output:
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
This question already has an answer here:
cidr converter using sed or awk
(1 answer)
Closed 6 years ago.
I am trying to print all IPs from an IP address range such as 72.21.206.0/23 on the command line, preferably with a single command.
I have tried several commands with awk & cut in combination but was not able to achieve the desired result.
For example if I have the following in file3:
72.21.110.0/16
72.21.206.0/23
and I would like to extract all IPs from 72.21.206.0/23 and print them in separate lines on the screen. I have only reached this point due to my basic knowledge:
awk -F'/' 'NR==2{print $1+1}' file3
which, by my assumptions, is supposed to print the following, but it does not:
72.21.206.1
Could you please help.
If you have nmap available you can just run something like:
nmap -n -sL 72.21.110.0/16
This will produce output along the lines of:
Nmap scan report for 72.21.0.0
Nmap scan report for 72.21.0.1
Nmap scan report for 72.21.0.2
[...]
Nmap scan report for 72.21.255.253
Nmap scan report for 72.21.255.254
Nmap scan report for 72.21.255.255
Nmap done: 65536 IP addresses (0 hosts up) scanned in 33.42 seconds
Answers to this question suggest a solution using ipcalc. And having found that, I guess I'm marking this as a duplicate...
Update
A solution in awk, just for you:
BEGIN {
FS="/"
}
{
split($1, octets, ".");
base=lshift(octets[1], 24) + lshift(octets[2], 16) + lshift(octets[3], 8) + octets[4];
max=lshift(1, 32-$2);
for (i=0; i<max; i++) {
addr = base + i;
addr = sprintf("%s.%s.%s.%d", rshift(addr, 24),
rshift(and(addr, 0x00FF0000), 16),
rshift(and(addr, 0x0000FF00), 8),
and(addr, 0xFF))
print addr
}
}
Given input like this:
$ echo 192.168.0.0/28 | awk -f ipranger.awk
You get output like this:
192.168.0.0
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
192.168.0.5
192.168.0.6
192.168.0.7
192.168.0.8
192.168.0.9
192.168.0.10
192.168.0.11
192.168.0.12
192.168.0.13
192.168.0.14
192.168.0.15
I created a script that will auto-login to a router and check the current CPU load; if the load exceeds a certain threshold, it needs to print the current CPU value to standard output.
I would like to search the script output for a certain pattern (the value 80 in this case, which is the threshold for high CPU load) and then, for each instance of the pattern, check whether the current value is greater than 80 or not; if true, it should print the 5 lines before the pattern followed by the current line with the pattern.
Question1: how to loop over each instance of the pattern and apply some code on each of them separately?
Question2: How to print n lines before the pattern followed by x lines after the pattern?
For example, I used awk to search for the pattern "health" and print 6 lines after it as below:
awk '/health/{x=NR+6}(NR<=x){print}' ./logs/CpuCheck.log
I would like to do the same for the pattern "80", and this time print 5 lines before it and one line after, but only if $3 (representing current CPU load) exceeds the value 80.
below is the output of auto-login script (file name: CpuCheck.log)
ABCD-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 39 36 36 47
WXYZ-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 29 31 31 43
Thanks in advance for the help
Rather than use awk, you could use the -B and -A switches to grep, which print a number of lines before and after a pattern is matched:
grep -E -B 5 -A 1 '^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])' CpuCheck.log
The pattern matches lines which start with some numbers, followed by spaces, followed by 80, followed by a number between 81 and 100. The -E switch enables extended regular expressions (EREs), which are needed if you want to use the + character to mean "one or more". If your version of grep doesn't support EREs, you can instead use the slightly more verbose \{1,\} syntax:
grep -B 5 -A 1 '^[0-9]\{1,\}[[:space:]]\{1,\}80[[:space:]]\{1,\}\(100\|9[0-9]\|8[1-9]\)' CpuCheck.log
If grep isn't an option, one alternative would be to use awk. The easiest way would be to store all of the lines in a buffer:
awk 'f-->0;{a[NR]=$0}/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/{for(i=NR-5;i<=NR;++i)print a[i];f=1}'
This stores every line in an array a. When the third column is greater than 80, it prints the previous 5 lines from the array. It also sets the flag f to 1, so that f-->0 is true for the next line, causing it to be printed.
Originally I had opted for a comparison $3>80 instead of the regular expression but this isn't a good idea due to the varying format of the lines.
If the log file is really big, meaning that reading the whole thing into memory is unfeasible, you could implement a circular buffer so that only the previous 5 lines were stored, or alternatively, read the file twice.
Unfortunately, awk is stream-oriented and doesn't have a simple way to get the lines before the current line. But that doesn't mean it isn't possible:
awk '
BEGIN {
bufferSize = 6;
}
{
buffer[NR % bufferSize] = $0;
}
$2 == 80 && $3 > 80 {
# print the five lines before the match and the line with the match
for (i = 1; i <= bufferSize; i++) {
print buffer[(NR + i) % bufferSize];
}
}
' ./logs/CpuCheck.log
I think the easiest way with awk is by reading the file twice.
This should use essentially 0 memory except whatever is used to store the line numbers.
If there is only one occurrence
awk 'NR==FNR&&$2=="80"{to=NR+1;from=NR-5}NR!=FNR&&FNR<=to&&FNR>=from' file{,}
If there is more than one occurrence
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' file{,}
Input/output
Input
1
2
3
4
5
6
7
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
19
20
Output
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
How it works
NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
In the first pass through the file, if the second field is 80, set to and from to the record number plus or minus whatever offsets you want.
Increment the occurrence variable x.
NR!=FNR
In the second file
for(i in to)
For each occurrence
if(FNR<=to[i]&&FNR>=from[i]){print;next}
If the current record number (in this file) is between this occurrence's to and from, then print the line. The next prevents the line from being printed multiple times if occurrences of the pattern are close together.
file{,}
Use the file twice as two args. The {,} expands to file file
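A quick way to see the expansion for yourself (this is bash brace expansion, not POSIX sh):

```shell
echo file{,}          # in bash this prints: file file
echo CpuCheck.log{,}  # in bash this prints: CpuCheck.log CpuCheck.log
```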
I'm trying to use awk to read a file and only display lines that do not begin with a + or - 4 or more times in a row. gawk would be fine too. Each grouping is separated by a blank line.
Here's a sample from the file, these are the lines I do not want printed:
+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION
+23/tcp open telnet
+9100/tcp open jetdirect

-Host is up.
-Not shown: 99 closed ports
-PORT STATE SERVICE VERSION
-5900/tcp open vnc
A sample from the file which I do want printed (not 4 or more in a row):
-Not shown: 76 closed ports, 18 filtered ports
+Not shown: 93 closed ports
PORT STATE SERVICE VERSION
+514/tcp open shell
I'm learning how to use awk at the moment as I've been reading O'Reilly's awk & sed but I'm a little stumped on this problem. Also, if anyone cares to, I wouldn't mind seeing non-awk ways of solving this problem with a shell script.
Thanks!
If I understood your question, the input file has records as paragraphs, so you will need to separate them with blank lines. I assumed that for the next script:
Content of script.awk:
BEGIN {
## Separate records by one or more blank lines.
RS = ""
## Each line will be one field. Both for input and output.
FS = OFS = "\n"
}
## For every paragraph...
{
## Flag to check if I will print the paragraph to output.
## If 1, print.
## If 0, don't print.
output = 1
## Count how many consecutive rows have '+' or '-' as first
## character.
j = 0
## Traverse all rows.
for ( i = 1; i <= NF; i++ ) {
if ( substr( $i, 1, 1 ) ~ /[+-]/ ) {
++j;
}
else {
j = 0
}
if ( j >= 4 ) {
output = 0
break
}
}
if ( output == 1 ) {
print $0 "\n"
}
}
Assuming the following test input file as infile (records separated by blank lines):
+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION

+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION
+23/tcp open telnet
+9100/tcp open jetdirect

-Host is up.
-Not shown: 99 closed ports
-PORT STATE SERVICE VERSION
-5900/tcp open vnc

-Not shown: 76 closed ports, 18 filtered ports
+Not shown: 93 closed ports
PORT STATE SERVICE VERSION
+514/tcp open shell
Run the script like:
awk -f script.awk infile
With the following output (the first record because it doesn't reach four consecutive rows, and the second record because it has a different line in between):
+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION

-Not shown: 76 closed ports, 18 filtered ports
+Not shown: 93 closed ports
PORT STATE SERVICE VERSION
+514/tcp open shell
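And since you said you wouldn't mind seeing non-awk ways: Perl's paragraph mode (-00) reads blank-line-separated records in one go, so the whole check can be a single regex for four consecutive lines starting with + or -. A sketch, using the same test input (with its blank record separators):

```shell
cat > infile <<'EOF'
+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION

+Host is up.
+Not shown: 95 closed ports, 3 filtered ports
+PORT STATE SERVICE VERSION
+23/tcp open telnet
+9100/tcp open jetdirect

-Host is up.
-Not shown: 99 closed ports
-PORT STATE SERVICE VERSION
-5900/tcp open vnc

-Not shown: 76 closed ports, 18 filtered ports
+Not shown: 93 closed ports
PORT STATE SERVICE VERSION
+514/tcp open shell
EOF

# -00 reads paragraph by paragraph; drop any paragraph that contains
# four consecutive lines beginning with + or -
perl -00 -ne 'print unless /^(?:[+-].*\n){4}/m' infile
```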
awk '{if(NF>3 &&( $0 ~ /\+/ || $0 ~ /-/) ) print $0}' test.txt