Bash to find count of multiple strings in a large file

I'm trying to get the count of various strings in a large txt file using bash commands.
For example: find the counts of the strings 'pig', 'horse', and 'cat' using bash, and get output such as 'pig: 7, horse: 3, cat: 5'. I would like a way to search through the txt file only once, because it is very large (so I do not want to search the whole file for 'pig', then go back and search it again for 'horse', and so on).
Any help with commands would be appreciated. Thanks!

grep -Eo 'pig|horse|cat' txt.file | sort | uniq -c | awk '{print $2": "$1}'
Breaking that into pieces:
grep -Eo 'pig|horse|cat'    Print all the occurrences (-o) of the extended (-E) regex.
sort                        Sort the resulting words.
uniq -c                     Output unique values (of sorted input) with the count (-c) of each value.
awk '{print $2": "$1}'      For each line, print the second field (the word), then a colon and a space, and then the first field (the count).
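For example, with a hypothetical txt.file that contains the counts from the question, the pipeline prints one word per line, in alphabetical order:
$ grep -Eo 'pig|horse|cat' txt.file | sort | uniq -c | awk '{print $2": "$1}'
cat: 5
horse: 3
pig: 7
Note that -o counts every substring match, so 'pigsty' would also count as 'pig'; if you only want whole words, GNU grep's -w option restricts matches to whole words.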

Related

How many times has the letter "N" or a repeated run of it (e.g. "NNNNN") been found in a text file?

I am given a file.txt (text file) with a string of data. Example contents:
abcabccabbabNababbababaaaNNcacbba
abacabababaaNNNbacabaaccabbacacab
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
aaababababababacacacacccbababNbNa
abababbacababaaacccc
I need to find the number of distinct runs of "N" (one or more consecutive N's) present in the file, using unix commands.
I am unsure which commands to use, even after trying a range of different ones. I tried:
$ grep -E -c "(N)+" file.txt
For the example above the output must be 6, but grep -c counts matching lines (4 here) rather than individual N runs.
One way:
$ sed 's/[^N]\{1,\}/\n/g' file.txt | grep -c N
6
How it works:
Replace all sequences of one or more non-N characters in the input with a newline.
This turns strings like abcabccabbabNababbababaaaNNcacbba into
N
NN
Count the number of lines that contain at least one N (ignoring the empty lines).
Regular-expression free alternative:
$ tr -sc N ' ' < file.txt | wc -w
6
Uses tr to replace all runs of non-N characters with a single space and counts the remaining words (which are the N runs). You might not even need the -s option, since wc -w already treats any amount of whitespace as a single separator.
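As a quick sanity check on the first sample line from the question (printf is used here just to feed a single line in):
$ printf 'abcabccabbabNababbababaaaNNcacbba\n' | tr -sc N ' ' | wc -w
2
The two words it counts are the N and NN runs in that line.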
Using GNU awk (well, just tested with gawk, mawk, busybox awk and awk version 20121220 and it seemed to work with all of them):
$ gawk -v RS="^$" -F"N+" '{print NF-1}' file
6
It reads in the whole file as a single record, uses regex N+ as field separator and outputs the field count minus one. For other awks:
$ awk -v RS="" -F"N+" '{c+=NF-1}END{print c}' file
It reads the input as blank-line-separated records (paragraph mode) and sums the per-record field counts minus one.
Here is an awk that should work on most systems.
awk -F'N+' '{a+=NF-1} END {print a}' file
6
It splits each line on runs of one or more N's and then sums the number of fields minus one per line.
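A line made up entirely of N's still counts as a single run, because the separator leaves an empty field on each side of it (so NF is 2):
$ printf 'NNNNN\n' | awk -F'N+' '{print NF-1}'
1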
If you have a text file and you want to count the number of times a sequence of N's appears, you can do:
awk '{a+=gsub(/N+/,"")}END{print a}' file
This, however, counts a sequence that is split over multiple lines as two separate sequences. Example:
abcNNN
NNefg
If you want this to be counted as a single sequence, you should do:
awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' file
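Using the two-line example above as a throwaway test file (demo.txt is just a made-up name for this sketch):
$ printf 'abcNNN\nNNefg\n' > demo.txt
$ awk '{a+=gsub(/N+/,"")}END{print a}' demo.txt
2
$ awk 'BEGIN{RS=OFS=""}{$1=$1}{a+=gsub(/N+/,"")}END{print a}' demo.txt
1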

Bash: sort rows within a file by timestamp

I am new to bash scripting and I have written a script that matches a regex and prints the matching lines to a file.
However, each line contains multiple columns, one of which is the timestamp column, in the form YYYYMMDDHHMMSSTTT (down to the millisecond), as shown below.
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
My code is written as follows:
awk -F "," -v OFS="," '{if($2=="ABC"){print}}' < $i >> "$filename"
How can I modify my code such that it can sort the rows by timestamp (YYYYMMDDHHMMSSTTT) in ascending order before printing to file?
You can use a very simple sort command, e.g.
sort yourfile
If you want to ensure sort looks only at the datestamp, you can tell sort to use only the first comma-separated field as the sort key, e.g.
sort -t, -k1,1 yourfile
Example Use/Output
With your data saved in a file named log, you could do:
$ sort -t, -k1,1 log
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
Let me know if you have any problems.
Just add a pipeline.
awk -F "," '$2=="ABC"' < "$i" |
sort -n >> "$filename"
In the general case, to sort on column 234, try sort -t, -k234,234n
Notice also the quoting around "$i", like you already have around "$filename", and the simplifications of the Awk script.
If you are using gawk you can do:
$ awk -F "," -v OFS="," '$2=="ABC"{a[$1]=$0} # Filter lines that have "ABC"
END{ # set the sort method
PROCINFO["sorted_in"] = "#ind_num_asc"
for (e in a) print a[e] # traverse the array of lines
}' file
An alternative is to use sed and sort:
sed -n '/^[0-9]*,ABC,/p' file | sort -t, -k1 -n
Keep in mind that both of these methods are unrelated to the shell used. Bash is just executing the tools (sed, awk, sort, etc.) that are otherwise part of the OS.
You could do the sort in pure Bash, but it would be long and slow.

bash sort a list starting at the end of each line

I have a file containing file paths and filenames that I want to sort starting at the end of the string.
My file contains a list such as below:
/Volumes/Location/Workers/Andrew/2015-08-12_Andrew_PC/DOCS/3177109.doc
/Volumes/Location/Workers/Andrew/2015-09-17_Andrew_PC/DOCS/2130419.doc
/Volumes/Location/Workers/Bill/2016-03-17_Bill_PC/DOCS/1998816.doc
/Volumes/Location/Workers/Charlie/2016-07-06_Charlie_PC/DOCS/4744123.doc
I want to sort this list so that the filenames are in sequence; this will help me find duplicates based on filename regardless of path.
The list should appear like this:
/Volumes/Location/Workers/Bill/2016-03-17_Bill_PC/DOCS/1998816.doc
/Volumes/Location/Workers/Andrew/2015-09-17_Andrew_PC/DOCS/2130419.doc
/Volumes/Location/Workers/Andrew/2015-08-12_Andrew_PC/DOCS/3177109.doc
/Volumes/Location/Workers/Charlie/2016-07-06_Charlie_PC/DOCS/4744123.doc
Here's a way to do this:
sed -e 's|^.*/\(.*\)$|\1\t&|' list.txt | sort | cut -f 2-
This uses sed to copy the filename (plus a tab) to the beginning of each line so that sort orders the list by filename; cut -f 2- then removes the prefix we added in the first step.
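To see the intermediate form, here is the decorated version of the first line from the question (the filename plus a tab get prepended, sorted on, and then stripped again by cut -f 2-):
$ head -1 list.txt | sed -e 's|^.*/\(.*\)$|\1\t&|'
3177109.doc	/Volumes/Location/Workers/Andrew/2015-08-12_Andrew_PC/DOCS/3177109.doc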
This should work:
sort -t/ -k7 input_file
This sorts from the seventh /-separated field onward (the DOCS/<filename> part here); it works because every path in the list has the same depth. With these paths, -k8,8 would key on the filename alone.
It first prepends the last /-separated field (the filename) to the start of each line and sorts; the prepended field is then removed by the second awk.
awk -F'/' '{ $0= $NF " " $0;print $0 |"sort -k1"}' fil |awk '{print $2}'
/Volumes/Location/Workers/Bill/2016-03-17_Bill_PC/DOCS/1998816.doc
/Volumes/Location/Workers/Andrew/2015-09-17_Andrew_PC/DOCS/2130419.doc
/Volumes/Location/Workers/Andrew/2015-08-12_Andrew_PC/DOCS/3177109.doc
/Volumes/Location/Workers/Charlie/2016-07-06_Charlie_PC/DOCS/4744123.doc

How to sort, uniq and display lines that appear more than X times

I have a file like this:
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.2
80.13.178.1
80.13.178.3
80.13.178.3
80.13.178.3
80.13.178.4
80.13.178.4
80.13.178.7
I need to display unique entries for repeated lines (similar to uniq -d), but only entries that occur more than twice (twice being an example; I want the flexibility to define the lower limit).
Output for this example should be like this when looking for entries with three or more occurrences:
80.13.178.2
80.13.178.3
Feed the output from uniq -cd to awk
sort test.file | uniq -cd | awk -v limit=2 '$1 > limit{print $2}'
With pure awk:
awk '{a[$0]++}END{for(i in a){if(a[i] > 2){print i}}}' a.txt
It iterates over the file and counts the occurrences of every IP. At the end of the file it outputs every IP that occurs more than 2 times.
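If you want the same flexibility to set the lower limit as in the uniq -cd version above, the threshold can be passed in with -v (a small sketch; limit is just an example variable name):
awk -v limit=2 '{a[$0]++} END{for(i in a) if(a[i] > limit) print i}' a.txt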

Bash Script: count unique lines in file

Situation:
I have a large file (millions of lines) containing IP addresses and ports from a several hour network capture, one ip/port per line. Lines are of this format:
ip.ad.dre.ss[:port]
Desired result:
There is an entry for each packet I received while logging, so there are a lot of duplicate addresses. I'd like to be able to run this through a shell script of some sort which will be able to reduce it to lines of the format
ip.ad.dre.ss[:port] count
where count is the number of occurrences of that specific address (and port). No special work has to be done; treat different ports as different addresses.
So far, I'm using this command to scrape all of the ip addresses from the log file:
grep -o -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+(:[0-9]+)?' ip_traffic-1.log > ips.txt
From that, I can use a fairly simple regex to filter out all of the ip addresses that were sent by my address (which I don't care about).
I can then use the following to extract the unique entries:
sort -u ips.txt > intermediate.txt
I don't know how I can aggregate the line counts with sort.
You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
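Note that uniq -c prints the count before the address; if you want the ip.ad.dre.ss[:port] count layout from the question, one option (just a sketch) is to swap the fields with a final awk step:
sort ips.txt | uniq -c | awk '{print $2, $1}'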
To count the total number of unique lines (i.e. not considering duplicate lines) we can use uniq or Awk with wc:
sort ips.txt | uniq | wc -l
awk '!seen[$0]++' ips.txt | wc -l
Awk's arrays are associative (hash-based), so it avoids the sort and may run a little faster.
Generating a test file:
$ for i in {1..100000}; do echo $RANDOM; done > random.txt
$ time sort random.txt | uniq | wc -l
31175
real 0m1.193s
user 0m0.701s
sys 0m0.388s
$ time awk '!seen[$0]++' random.txt | wc -l
31175
real 0m0.675s
user 0m0.108s
sys 0m0.171s
This is the fastest way to get the counts of the repeated lines and have them nicely printed, sorted from the least frequent to the most frequent:
awk '{seen[$0]++}END{for (i in seen) print seen[i], i}' ips.txt | sort -n
If you don't care about performance and you want something easier to remember, then simply run:
sort ips.txt | uniq -c | sort -n
PS:
sort -n parses the field as a number, which is correct since we're sorting by the counts.
