Format grep Output in Columns - bash

I have a large log file from which I need to extract some specific data; more precisely, the values of distinct fields that appear repeatedly. That is, I need to pull certain information from many CDRs, such as the call type, origination number, etc.
The original text is formatted as below:
Reason Code:"XXX", Result Code:XXX, Desc: "XXX"
..
A_NUMBER.ADDRESS = XXX
..
Using egrep I have managed to get the required lines, which look like:
Reason Code:"XXX", Result Code:XXX, Desc: "XXX"
RECORD_IDENTIFICATION.FILE_ID: XXX
A_NUMBER.ADDRESS = XXX
Call is from XXXX, VDATE=XXXX.
but I am unable to format them in a tabular style, grouped under Reason, File_ID, A_Num and Call Date acting as column heads, like:
Reason Code | File_ID | A_Number | Date
xxxx | xxxx | xxxx | xxxx |
I am not really interested in the appearance; I just want the elements belonging to the same call to end up next to each other.
I have messed with different variants of awk, sed and printf, but nothing seems to work.
I have tried passing the total character count as a width parameter to printf:
printf "%-205s\n" $(grep -E 'Reason Code|RECORD_IDENTIFICATION.FILE_ID|A_NUMBER.ADDRESS|Call is from' file.err)
or
printf "%-65s | %-65s | %-65s | %-65s" $(grep -E 'Reason Code|RECORD_IDENTIFICATION.FILE_ID|A_NUMBER.ADDRESS' file.err | awk 'FS = "\n" {print $1}')
but the values in output are scrambled and unusable.
In my opinion the solution may lie in some sort of loop, which awk seems to support, but I have not been able to work it out.
Any help would be very much appreciated.
Thank you.

You can transform the output of your grep command with sed:
sed 'N;N;N;s/Reason Code:"\([^"]*\).*FILE_ID: \([^\n]*\).*A_NUMBER.ADDRESS = \([^\n]*\).*VDATE=\([^.]*\).*/\1 \2 \3 \4/'
 
$ echo ''' Reason Code:"XXX", Result Code:XXX, Desc: "XXX"
RECORD_IDENTIFICATION.FILE_ID: XXX
A_NUMBER.ADDRESS = XXX
Call is from XXXX, VDATE=XXXX.''' | sed 'N;N;N;s/Reason Code:"\([^"]*\).*FILE_ID: \([^\n]*\).*A_NUMBER.ADDRESS = \([^\n]*\).*VDATE=\([^.]*\).*/\1 \2 \3 \4/'
XXX XXX XXX XXXX
However, it would be best to avoid using grep and let sed also do the filtering. I can't propose such a solution since you didn't post the format of your unfiltered data.
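If it helps, here is a minimal awk sketch along the same lines. It is only an illustration built on an assumption (that the four matched lines always appear, in the order shown, for each CDR), since the unfiltered format wasn't posted; it collects the four values and prints one pipe-separated row per call under a header:
awk '
    BEGIN { print "Reason Code | File_ID | A_Number | Date" }
    /Reason Code:/ { s = $0; sub(/.*Reason Code:"/, "", s); sub(/".*/, "", s); reason = s }
    /RECORD_IDENTIFICATION\.FILE_ID:/ { fid = $NF }
    /A_NUMBER\.ADDRESS =/ { anum = $NF }
    /VDATE=/ { s = $0; sub(/.*VDATE=/, "", s); sub(/\..*/, "", s)
               print reason " | " fid " | " anum " | " s }
' file.err
Because the VDATE line is the last of the four, it doubles as the trigger to print the completed row.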

Related

Command line: retrieving specific column from CSV file

I have a CSV file called articles.csv with headers as follows:
article_id, article_title, article_shares, article_date.
The first row of data in the file is found with $ cat articles.csv | sed "1 d", and this returns: "895", "Trump, Clinton, America. Who will win, who will lose?", "100", "01/05/2016".
I want to return the fourth column of data (the date of the article) so I use the following code:
$ cat articles.csv | sed "1 d" | cut -d , -f 4
However I don't get the date; I get America. Who will win instead. How do I get the output of the fourth column, given that some fields contain commas?
A quick and dirty solution:
... | awk -F'",' '{print $4}'
A slow but clean solution:
... | ruby -ne $'require "csv"; print CSV.parse($_)[0][3]'
Note: CSV format should not have spaces between fields, so change your record to:
"895","Trump, Clinton, America. Who will win, who will lose?","100","01/05/2016"

Elegant way to replace tr '\n' '\0' (Null byte generating warnings at runtime)

I doubt I am using grep in the best way in my code, and would like to find a better and cleaner way of extracting the session ID and security level from my cookie file:
cat mycookie
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_127.0.0.1 FALSE / FALSE 0 PHPSESSID 1hjs18icittvqvpa4tm2lv9b12
#HttpOnly_127.0.0.1 FALSE /mydir/ FALSE 0 security medium
The expected output is the SSID hash:
1hjs18icittvqvpa4tm2lv9b12
Piping grep into tr '\n' '\0' works like a charm on the command line, but generates warnings ("warning: command substitution: ignored null byte in input") when the bash code is executed. Here is the related code (with warnings):
ssid=$(grep -Po 'PHPSESSID.*' path/sessionFile | grep -Po '[a-z]|[0-9]' | tr '\n' '\0')
I am using bash 4.4.12 (x86_64-pc-linux-gnu) and read this crystal clear explanation:
Bash variables are stored as C strings. C strings are NUL-terminated.
They thus cannot store NULs by definition.
I have seen, here and there, coding solutions using read for both cases:
# read content from stdin into array variable and a scalar variable "suffix"
array=( )
while IFS= read -r -d '' line; do
    array+=( "$line" )
done < <(process that generates NUL stream here)
suffix=$line # content after last NUL, if any

# emit recorded content
printf '%s\0' "${array[@]}"; printf '%s' "$suffix"
I don't want to use arrays or a while loop for this specific case, or others. I found this workaround using sed:
ssid=$(grep -Po 'PHPSESSID.*' path/sessionFile | grep -Po '[a-z]|[0-9]' | tr '\n' '_' | sed -e 's/_//g')
My two questions are:
1) Is there a better way to substitute for tr '\n' '\0', without using read in a while loop?
2) Is there a better way to properly extract the SSID and security level?
Thanks
It looks like you're trying to get rid of the newlines in the output from grep, but turning them into nulls doesn't do this. Nulls aren't visible in your terminal, but are still there and (like many other nonprinting characters) will wreak havoc if they get treated as part of your actual data. If you want to get rid of the newlines, just tell tr to delete them for you with ... | tr -d '\n'. But if you're trying to get the PHPSESSID value from a Netscape-format cookie file, there's a much much better way:
ssid=$(awk '($6 == "PHPSESSID") {print $7}' path/sessionFile)
This looks for "PHPSESSID" only in the sixth field (not in e.g. the path or cookie values -- both places it could legally appear), and specifically prints the seventh field of matching lines (not just anything after "PHPSESSID" that happens to be a digit or lowercase letter).
You could also try this, if you don't want to use awk:
ssid=$(grep -P '\bPHPSESSID\b' you_cookies_file)
echo $ssid # for debug only
which outputs something like
#HttpOnly_127.0.0.1 FALSE / FALSE 0 PHPSESSID 1hjs18icittvqvpa4tm2lv9b12
Then with cut(1) extract the relevant field:
echo $ssid |cut -d" " -f7
which outputs
1hjs18icittvqvpa4tm2lv9b12
Of course you should capture the last echo.
UPDATE
If you don't want to use cut, it is possible to emulate it with:
echo $ssid | (read a1 b2 c3 d4 e5 f6 g7; echo $g7)
Demonstration to capture in a variable:
$ field=$(echo $ssid | (read a1 b2 c3 d4 e5 f6 g7; echo $g7))
$ echo $field
1hjs18icittvqvpa4tm2lv9b12
$
Another way is to use positional parameters passing the string to a function which then refers to $7. Perhaps cleaner. Otherwise, you can use an array:
array=($(echo $ssid))
echo ${array[6]} # outputs the 7th field
It should also be possible to use regular expressions and/or string manipulation in bash, but they seem a little more difficult to me.
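For completeness, here is a small sketch of that bash-regex route (my own guess at a pattern, reusing the same cookie file path as above):
ssid_line=$(grep -P '\bPHPSESSID\b' path/sessionFile)
if [[ $ssid_line =~ PHPSESSID[[:space:]]+([[:alnum:]]+) ]]; then
    ssid=${BASH_REMATCH[1]}    # first capture group: the value following PHPSESSID
fi
echo "$ssid"
[[ ... =~ ... ]] uses POSIX extended regular expressions, and the captured groups land in the BASH_REMATCH array, so no external cut or awk is needed.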

How to append lots of variables to one variable with a simple command

I want to stick all the variables into one variable
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
#^^lets pretend theres 100 more of these ^^
#Variable composition
#after AAA, is AAB then AAC then AAD etc etc, does that 100 times
I want them all placed into this MASTER variable
#MASTER=${A}${AA}${AAA} (<-- insert AAB, AAC and 100 more variables here)
I obviously don't want to type 100 variables in this expression because there's probably an easier way to do this. Plus I'm gonna be doing more of these, so I need it automated.
I'm relatively new to sed and awk; is there a way to append those 100 variables to the master variable?
For this specific purpose I DO NOT want an array.
You can use a simple one-liner, quite straightforward, though more expensive:
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
set lists all the variables in name=value format
grep filters out the variables we need
sort puts them in the right order (probably optional since set gives a sorted output)
cut extracts the values, removing the variable names
tr removes the newlines
Let's test it.
A=1
AA=2
AAA=3
AAB=4
AAC=5
AAD=6
AAAA=99 # just to make sure we don't pick this one up
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
echo "$master"
Output:
123456
With my best guess, how about:
#!/bin/bash
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
# to be continued ..
for varname in A AA A{A..D}{A..Z}; do
    value=${!varname}
    if [ -n "$value" ]; then
        MASTER+=$value
    fi
done
echo $MASTER
which yields:
blahblah2blah3blah4blah5...
Although I'm not sure whether this is what the OP wants.
echo {a..z}{a..z}{a..z} | tr ' ' '\n' | head -n 100 | tail -n 3
adt
adu
adv
tells us that, counting from AAA, it takes 100 names to reach ADV, or 103 to reach ADY.
echo A{A..D}{A..Z} | sed 's/ /}${/g'
AAA}${AAB}${AAC}${AAD}${AAE}${AAF}${AAG}${AAH}${AAI}${AAJ}${AAK}${AAL}${AAM}${AAN}${AAO}${AAP}${AAQ}${AAR}${AAS}${AAT}${AAU}${AAV}${AAW}${AAX}${AAY}${AAZ}${ABA}${ABB}${ABC}${ABD}${ABE}${ABF}${ABG}${ABH}${ABI}${ABJ}${ABK}${ABL}${ABM}${ABN}${ABO}${ABP}${ABQ}${ABR}${ABS}${ABT}${ABU}${ABV}${ABW}${ABX}${ABY}${ABZ}${ACA}${ACB}${ACC}${ACD}${ACE}${ACF}${ACG}${ACH}${ACI}${ACJ}${ACK}${ACL}${ACM}${ACN}${ACO}${ACP}${ACQ}${ACR}${ACS}${ACT}${ACU}${ACV}${ACW}${ACX}${ACY}${ACZ}${ADA}${ADB}${ADC}${ADD}${ADE}${ADF}${ADG}${ADH}${ADI}${ADJ}${ADK}${ADL}${ADM}${ADN}${ADO}${ADP}${ADQ}${ADR}${ADS}${ADT}${ADU}${ADV}${ADW}${ADX}${ADY}${ADZ
The final cosmetic touches are easily made by hand.
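If you would rather not finish it by hand, here is a hedged sketch (my own addition, using eval with the usual caveats) that builds the full ${A}${AA}${AAA}...${ADZ} expression and evaluates it:
rhs='${A}${AA}${'"$(echo A{A..D}{A..Z} | sed 's/ /}${/g')"'}'
eval "MASTER=$rhs"    # expands every variable in the generated expression and concatenates the values
echo "$MASTER"
Unset variables simply expand to nothing, so gaps in the AAA..ADZ range are harmless.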
One-liner using a for loop:
for n in A AA A{A..D}{A..Z}; do str+="${!n}"; done; echo ${str}
Output:
blahblah2blah3blah4blah5
Say you have the input file inputfile.txt with arbitrary variable names and values:
name="Joe"
last="Doe"
A="blah"
AA="blah2
then do:
master=$(eval echo $(grep -o "^[^=]\+" inputfile.txt | sed 's/^/\$/;:a;N;$!ba;s/\n/$/g'))
This will concatenate the values of all variables in inputfile.txt into the master variable. So you will have:
>echo $master
JoeDoeblahblah2

Get lines by a unique portion of the line, and display only the first occurrence of that unique portion

I'm trying to write a script that looks at a part of a line, does a sort -u or something to look for unique occurrences, and then displays the output, sorted by the ORIGINAL ordering of the lines. In other words, only the FIRST occurrence of that part of the line would show up.
I managed to do it using cut, but my output just displays the cut portion of the data. How could I do it so that it gets the entire line?
Here's what I've got so far:
cut -d, -f6 infile.txt | cut -c4-11 | grep -n . | sort -t: -k2,2 -u | sort -t: -k1n,1 | cut -d: -f2-
I know the data doesn't have an extra : or a , in a place that would break this script. But this only outputs the data that was unique. How can I get the entire line? I would prefer to stay away from perl, but awk is okay (though I don't know it very well).
Sample:
If the input file is this (note, the ABCDEFGH is not real, I just put it there to illustrate what I mean):
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
C....,....,...........,.....,....,...20130718......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
F....,....,...........,.....,....,...20130714......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
H....,....,...........,.....,....,...20130718......,.........,...........,......
My program outputs:
20130718
20130714
20130719
20130713
20130630
I want to see:
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
Yes, awk is your best bet. Here's a mysterious example:
awk -F, '!seen[substr($6,4,8)]++' infile.txt
Explanation:
options:
    -F,              set the field separator to ,
condition:
    substr($6,4,8)   up to 8 characters starting at the fourth character of the sixth field
    seen[...]++      seen is an associative array (dictionary). Increment the value associated with ..., and return the old value
    !seen[...]++     if there was no old value, perform the action
action:
    There is no action, only a condition, so the default action is performed if the test succeeds. The default action is to print the line. So the line will be printed if the relevant characters of the sixth field haven't yet been seen.
Test:
$ awk -F, '!seen[substr($6,4,8)]++' <<EOF
> A....,....,...........,.....,....,...20130718......,.........,...........,......
> B....,....,...........,.....,....,...20130714......,.........,...........,......
> C....,....,...........,.....,....,...20130718......,.........,...........,......
> D....,....,...........,.....,....,...20130719......,.........,...........,......
> E....,....,...........,.....,....,...20130713......,.........,...........,......
> F....,....,...........,.....,....,...20130714......,.........,...........,......
> G....,....,...........,.....,....,...20130630......,.........,...........,......
> H....,....,...........,.....,....,...20130718......,.........,...........,......
> EOF
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
$

Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address in each line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses |uniq |while read ip
do
echo -n $ip" "
grep -c $ip ip_addresses
done
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB. So sort is not an efficient solution, neither is reading the file more than once.
I liked the hashtable-like solution - anybody can provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why would I bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine I had to do this perl wasn't available for me. It was a custom built linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)
sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
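A quick illustration of that point (counts are per run of adjacent identical lines, so unsorted input gives fragmented counts):
$ printf '%s\n' 10.0.10.1 10.0.10.3 10.0.10.1 | uniq -c
      1 10.0.10.1
      1 10.0.10.3
      1 10.0.10.1
$ printf '%s\n' 10.0.10.1 10.0.10.3 10.0.10.1 | sort | uniq -c
      2 10.0.10.1
      1 10.0.10.3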
For summing up multiple fields, grouped by a set of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000
The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
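If memory really is the constraint, GNU sort also lets you steer that behaviour explicitly; for example (GNU coreutils options, shown purely as an illustration):
sort -S 1G -T /var/tmp ip_addresses | uniq -c
-S caps the in-memory buffer size and -T chooses where the temporary merge files go, so the external merge sort described above runs within whatever memory and disk space you give it.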
cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'
This command will give you the desired output.
Solution (group by, like MySQL)
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook
It seems that you either have to use a big amount of code to simulate hashes in bash to get linear behavior, or stick with the quadratically superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
I feel an awk associative array is also handy in this case:
$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt
You probably can use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
    let addr denote the ip address;

    if file "addr" does not exist; then
        create file "addr";
        write a number "0" in the file;
    else
        read the number from "addr";
        increase the number by 1 and write it back;
    fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
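As a rough, untested translation of that pseudo-code into real bash (one count file per address in a scratch directory; the filenames and paths are illustrative):
tmpdir=$(mktemp -d)
while IFS= read -r addr; do
    f="$tmpdir/$addr"
    [ -e "$f" ] || echo 0 > "$f"    # create the bucket on first sight
    count=$(< "$f")
    echo $((count + 1)) > "$f"      # increment and write back
done < ip_addresses
for f in "$tmpdir"/*; do            # traverse the buckets and report
    echo "$(basename "$f") $(< "$f")"
done
rm -rf "$tmpdir"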
Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
GROUP BY under bash
Regarding this SO thread, there are different answers addressing different needs.
1. Counting IPs as the SO question requests (GROUP BY IP address).
As IP addresses are easy to convert to a single integer, for a small bunch of addresses, if you need to repeat this kind of operation many times, using a pure bash function could be a lot more efficient!
Pure bash (no fork!)
There is a way, using a bash function. This way is very quick as there is no fork!...
countIp () {
    local -a _ips=(); local _a
    while IFS=. read -a _a ;do
        ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
    done
    for _a in ${!_ips[@]} ;do
        printf "%.16s %4d\n" \
            $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
    done
}
Note: IP addresses are converted to a 32-bit unsigned integer value, used as the index of the array. This uses simple bash arrays!
time countIp < ip_addresses
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
real 0m0.001s
user 0m0.004s
sys 0m0.000s
time sort ip_addresses | uniq -c
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
real 0m0.010s
user 0m0.000s
sys 0m0.000s
On my host, doing so is a lot quicker than using forks, up to approximately 1,000 addresses, but it takes about a full second when I try to sort and count 10,000 addresses.
2. GROUP BY duplicates (file content)
By using checksums, you can identify duplicate files:
find . -type f -exec sha1sum {} + |
sort |
sed '
:a;
$s/^[^ ]\+ \+//;
N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
ta;
s/^[^ ]\+ \+//;
P;
D;
ba
'
This will print all duplicates, one group per line, separated by tabs ($'\t', octal 011). You could change /\1 \2\o11\3/; to /\1 \2|\3/; to use | as the separator.
./b.txt ./e.txt
./a.txt ./c.txt ./d.txt
Could be written as (with | as separator):
find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'
Pure bash way
By using namerefs, you can build bash arrays holding all duplicates:
declare -iA sums='()'
while IFS=' ' read -r sum file ;do
    declare -n list=_LST_$sum
    list+=("$file")
    sums[$sum]+=1
done < <(
    find . -type f -exec sha1sum {} +
)
From there, you have a bunch of arrays, each holding the duplicate file names as separate elements:
for i in ${!sums[@]};do
    declare -n list=_LST_$i
    printf "%d %d %s\n" ${sums[$i]} ${#list[@]} "${list[*]}"
done
This may output something like:
2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt
Where the count of files per checksum (${sums[$shasum]}) matches the number of elements in the corresponding array ${_LST_ShAsUm[@]}.
for i in ${!sums[@]};do
    declare -n list=_LST_$i
    echo ${list[@]@A}
done
declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")
Note that this method could handle spaces and special characters in filenames!
3. GROUP BY columns in a table
As an efficient awk sample was provided by Anonymous, here is a pure bash solution.
So you want to summarize columns 3 through the last column, grouped by columns 1 and 2, with table.txt looking like:
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000
For tables that are not too big, you could:
myfunc() {
    local -iA restabl='()';
    local IFS=+
    while IFS=\| read -ra ar; do
        restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
    done
    for i in ${!restabl[@]} ;do
        printf '%s|%s\n' "$i" "${restabl[$i]}"
    done
}
This could output something like:
myfunc <table.txt
UK|1|19000
US|A|3000
US|C|3000
US|B|3000
And to have the table sorted:
myfunc() {
    local -iA restabl='()';
    local IFS=+ sorted=()
    while IFS=\| read -ra ar; do
        sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
        restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
    done
    for i in ${sorted[@]} ;do
        printf '%s|%s\n' "$i" "${restabl[$i]}"
    done
}
Must return:
myfunc <table
UK|1|19000
US|A|3000
US|B|3000
US|C|3000
I'd have done it like this:
perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses
but uniq might work for you.
Import the data into an sqlite db and use SQL syntax (just another idea).
I know it's too much for this example, but it would be useful for complex queries with multiple files (tables).
#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }
# add a header line naming the column (IP) to the input file, so .import creates the column
INPUT_FILE=ips.txt
# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"
# using sql statements on table 'mytable'
sqlite3 mydb$$ -separator " " "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:
mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)
As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.
Edit:
I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that would count occurrences.
mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1
The dictionary mydict now holds the unique IPs as keys and the number of times they occurred as values.
This does not answer the count element of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone as it relates to 'group by' functionality.
I wanted to order files based on groupings of them, where the presence of some string in the filename determined the group.
It uses a temporary grouping/ordering prefix which is removed after ordering; sed substitute expressions (s#pattern#replacement#g) match the target string and prepend an integer to the line corresponding to the desired sort order of that target string. Then, the grouping prefix is removed with cut.
Note that the sed expressions could be joined (e.g. sed -e '<expr>; <expr>; <expr>;') but here they're split for readability.
It's not pretty and probably not fast (I'm dealing with <50 items), but it's at least conceptually simple and doesn't require learning awk.
#!/usr/bin/env bash
for line in $(find /etc \
    | sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
    | sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
    | sed -E -e "s#^/(.*)#00:/\1#;" \
    | sort \
    | cut -c4-
)
do
    echo "${line}"
done
e.g. Input
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e
#!/usr/bin/env bash
for line in $(find /etc \
    | sed -E -e "s#^(.*special.*)#10:\1#;" \
    | sed -E -e "s#^(.*another.*)#05:\1#;" \
    | sed -E -e "s#^/(.*)#00:/\1#;" \
    | sort \
    | cut -c4-
)
do
    echo "${line}"
done
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d
A combination of awk + sort (with the version-sort flag) is probably fastest (if your environment has awk at all):
echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V
Only the post-GROUP-BY summary rows are sent to the sorting utility, which is far less system-intensive than pre-sorting the input rows for no reason.
For small inputs, e.g. fewer than 1000 rows or so, just directly sort|uniq -c it.
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
Sort may be omitted if order is not significant
uniq -c <source_file>
or
echo "$list" | uniq -c
if the source list is a variable

Resources