combining multiple grep searches and making my script more efficient - bash

I have a file called Type1.txt, that looks like this:
$ cat Type1.txt
ID.580.G3C0
TTTTTTTTTTT
ID.580.G3C8
ATTATATC-AAA
ID.580.GXC16
ATTATTTC-ACG-TTTTTCCTA
ID.694.G9C3
ATTATATC-ACG-AAATCCTA
ID.694.G9C3
etc...
I want to write a bash script to count the instances of each ID and export it into another file that provides a summary, something like this:
ID.580 = 3
ID.694 = 1
etc...
So far the script is messy and unusable.
For the above I have the following:
#!/bin/bash
for Count in `grep -c "ID.580" Type1.txt`; do
echo $Count=ID.580
done > Result.txt #Allows to count only for that single ID.
I have over a thousand ID.XXX, making this code unusable since it's not plausible to add individual ID.XXX for each search. Thank you for the help!

Shell
The code below uses standard UNIX utilities and does not assume that the second part of the ID is exactly 3 characters: it will also find ID.1.123123123 and ID.1234.123123, and correctly keep only the first two dot-delimited fields (ID.1 and ID.1234).
grep '^ID\.[0-9]' Type1.txt | cut -d . -f 1-2 | sort \
| uniq -c | awk '{ print $2" = "$1 }'
grep filters only lines beginning with ID. followed by 1 digit (at least)
cut uses . as the field delimiter, and only outputting fields 1 and 2, thus removing
everything after and including the second . on the line.
sort sorts the lines for uniq to work
uniq prints each line from its input prefixed with a count
awk part reverses these fields and prints them separated with =.
If the first part of the ID can contain letters too, change the end of the regular expression from [0-9] to [0-9A-Z], for example.
The pipeline outputs
ID.580 = 3
ID.694 = 2
Python
As Python is popular among biologists, you might want to hone your Python skills instead:
from collections import Counter
counter = Counter()
with open('Type1.txt') as f:
    for line in f:
        if line.startswith('ID.'):
            top_id = '.'.join(line.split('.', 2)[:2])
            counter[top_id] += 1
for top_id, count in sorted(counter.items()):
    print("%s = %d" % (top_id, count))
The results are exactly identical.

grep '^ID.[0-9][0-9][0-9]' input_file | cut -c1-6 | sort | uniq -c
works?

TL;DR
Given your particular corpus and grouping strategy, there's more than one way to get the results you need. Here are two alternative solutions, one in awk, and one in Ruby.
GNU awk
One way is to use GNU awk to perform the following steps:
match just the ID lines
split matching input lines into fields
select and print the fields you need
sort the lines in the filtered result
count the adjacent duplicates
perform any specialized formatting on the result
For example:
$ awk '/^ID/ {split($0, a, "."); print a[1] "." a[2]}' /tmp/foo |
sort | uniq --count | awk '{print $2 " = " $1}'
ID.580 = 3
ID.694 = 2
With the corpus you provided in your question, this takes an average of 8 ms on my system. A larger corpus will take longer, of course, but unless you have a really huge data set this should be fast enough for most purposes.
Ruby
Ruby offers what I consider a more elegant solution, but is in fact slower. The idea here is to store the relevant portion of your IDs as hash keys, and increment a counter each time you encounter a given ID. For example, consider this Ruby one-liner:
$ ruby -ne 'BEGIN { id = Hash.new(0) }
id[$&] += 1 if /\AID\.\d+/
END { id.each_pair do |k,v| puts "#{k} = #{v}" end }' /tmp/foo
ID.580 = 3
ID.694 = 2
This solution takes around 45 ms to process the same corpus, so I wouldn't recommend it over the awk pipeline just for transforming output. The main advantage to doing it this way is that you have an actual data structure (e.g. a Hash object) that you could manipulate in a more full-featured program.

Here is awk one liner:
$ awk -F. '$1=="ID"{a[$2,$3]++}END{for (i in a) {split(i,ind,SUBSEP); r[ind[1]]++}for (i in r) print "ID."i" = "r[i]}' file
ID.694 = 1
ID.580 = 3
And here is a pure bash solution:
#!/bin/bash
while IFS=. read -r pre id code rest
do
[[ $pre == ID ]] || continue
[[ ${a[$id]} =~ \."$code"\. ]] || {
a[$id]="${a[$id]}.$code."
((count[$id]++));
}
done < file
for i in "${!count[@]}"
do
echo "ID.$i = ${count[$i]}"
done
$ ./script.sh
ID.580 = 3
ID.694 = 1
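Note that both of these count distinct ID.XXX.YYY combinations per ID.XXX rather than lines, which is why ID.694 comes out as 1 here and as 2 in the other answers. A minimal Python sketch of that counting rule, using the sample data from the question; the function name count_distinct_codes is made up for illustration, and unlike the awk version it skips ID lines that have no third field:

```python
from collections import defaultdict


def count_distinct_codes(lines):
    """Count distinct third fields per 'ID.<second field>' prefix."""
    codes = defaultdict(set)
    for line in lines:
        parts = line.strip().split(".")
        # Only lines of the form ID.<id>.<code> are considered.
        if len(parts) >= 3 and parts[0] == "ID":
            codes[parts[1]].add(parts[2])
    return {"ID." + key: len(seen) for key, seen in codes.items()}


sample = [
    "ID.580.G3C0", "TTTTTTTTTTT",
    "ID.580.G3C8", "ATTATATC-AAA",
    "ID.580.GXC16", "ATTATTTC-ACG-TTTTTCCTA",
    "ID.694.G9C3", "ATTATATC-ACG-AAATCCTA",
    "ID.694.G9C3",
]

for key, n in sorted(count_distinct_codes(sample).items()):
    print(f"{key} = {n}")
# prints:
# ID.580 = 3
# ID.694 = 1
```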

awk might work too...
awk '/ID.580/{x++}END{print x}' test.txt
You can put this in a for loop
for i in ID.580 ID.694
do
awk '/'$i'/{x++}END{print x}' test.txt
done

Related

How to sort array of strings by function in shell script

I have the following list of strings in shell script:
something-7-5-2020.dump
another-7-5-2020.dump
anoter2-6-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
something-2-5-2020.dump
another-2-5-2020.dump
8-1-2021
26-1-2021
20-1-2021
19-1-2021
3-9-2020
29-9-2020
28-9-2020
24-9-2020
1-9-2020
6-8-2020
20-8-2020
18-8-2020
12-8-2020
10-8-2020
7-7-2020
5-7-2020
27-7-2020
7-6-2020
5-6-2020
23-6-2020
18-6-2020
28-5-2020
26-5-2020
9-12-2020
28-12-2020
15-12-2020
1-12-2020
27-11-2020
20-11-2020
19-11-2020
18-11-2020
1-11-2020
11-11-2020
31-10-2020
29-10-2020
27-10-2020
23-10-2020
21-10-2020
15-10-2020
23-09-2020
So my goal is to sort them by date, but it's in dd-mm-yyyy and d-m-yyyy format and sometimes there's a word before like word-dd-mm-yyyy. I would like to create a function to sort the values like any other language so it ignores the first word, casts the date to a common format and compares that format. In javascript it would be something like:
arrayOfStrings.sort((a, b) => functionToOrderStrings())
My code to obtain the array is the following:
dumps=$(gsutil ls gs://organization-dumps/ambient | sed "s:gs\://organization-dumps/ambient/::" | sed '/^$/d' | sed 's:/$::' | sort --reverse --key=3 --key=2 --key=1 --field-separator=-)
echo "$dumps"
I would like to say that I've already searched this in Stackoverflow and none of the answers did help me, because all of them are oriented to sort dates in correct format and that's not my case.
If you have the results in a pipeline, involving an array seems completely superfluous here.
You can apply a technique called a Schwartzian transform: add a prefix to each line with a normalized version of the data so it can be easily sorted, then sort, then discard the prefix.
I'm guessing something like the following:
gsutil ls gs://organization-dumps/ambient |
awk '{ sub("gs:\/\/organization-dumps/ambient/", "");
if (! $0) next;
sub("/$", "");
d = $0;
sub(/^[^0-9][^-]*-/, "", d);
sub(/[^0-9]*$/, "", d);
split(d, w, "-");
printf "%04i-%02i-%02i\t%s\n", w[3], w[2], w[1], $0 }' |
sort -n | cut -f2-
In so many words, we are adding a tab-delimited field in front of every line, then sorting on that, then discarding the first field with cut -f2-. The field extraction contains some assumptions which seem to be valid for your test data, but may need additional tweaking if you have real data with corner cases like if the label before the date could sometimes contain a number with dashes around it, too.
If you want to capture the result in a variable, like in your original code, that's easy to do; but usually, you should just run everything in a pipeline.
Notice that I factored your multiple sed scripts into the Awk script, too, some of that with a fair amount of guessing as to what the input looks like and what the sed scripts were supposed to accomplish. (Perhaps also note that sed, like Awk, is a scripting language; to run several sed commands on the same input, just put them after each other in the same sed script.)
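For comparison, here is a minimal Python sketch of the same decorate, sort, undecorate idea. It assumes the date is the first d-m-yyyy or dd-mm-yyyy run in each name, as in the sample list; the names date_key and sort_by_date are hypothetical, and names without a recognizable date are simply sorted first.

```python
import re

# Day and month may have one or two digits; the year has four.
DATE_RE = re.compile(r"(\d{1,2})-(\d{1,2})-(\d{4})")


def date_key(name):
    """Return (year, month, day) for the date embedded in name."""
    m = DATE_RE.search(name)
    if m is None:
        return (0, 0, 0)  # assumption: undated names sort first
    day, month, year = (int(x) for x in m.groups())
    return (year, month, day)


def sort_by_date(names):
    decorated = [(date_key(n), n) for n in names]   # decorate
    decorated.sort()                                # sort on the prefix
    return [n for _, n in decorated]                # undecorate


names = ["something-7-5-2020.dump", "8-1-2021", "23-09-2020",
         "another2-4-5-2020.dump", "3-9-2020"]
for n in sort_by_date(names):
    print(n)
```

In Python the decoration step is usually hidden behind sorted(names, key=date_key); it is spelled out here only to mirror the pipeline.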
1. Preprocess input to be in the format you want it to be for sorting.
2. Sort
3. Remove artifacts from step 1
The following:
sed -E '
# extract the date and put it in first column separated by tab
# this could be better, its just an example
s/(.*-)?([0-9]?[0-9]-[0-9]?[0-9]-[0-9]{4})/\2\t&/;
# If day is a single digit, add a zero in front
s/^([0-9]-)/0\1/;
# If month is a single digit, add a zero in front
s/^([0-9][0-9]-)([0-9]-)/\10\2/
# year in front? no idea - shuffle the way you want
s/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\2-\1/
' input.txt | sort | cut -f2-
outputs:
another-2-5-2020.dump
something-2-5-2020.dump
another-4-5-2020.dump
another2-4-5-2020.dump
anoter2-6-5-2020.dump
another-7-5-2020.dump
something-7-5-2020.dump
26-5-2020
28-5-2020
5-6-2020
7-6-2020
18-6-2020
23-6-2020
5-7-2020
7-7-2020
27-7-2020
6-8-2020
10-8-2020
12-8-2020
18-8-2020
20-8-2020
1-9-2020
3-9-2020
23-09-2020
24-9-2020
28-9-2020
29-9-2020
15-10-2020
21-10-2020
23-10-2020
27-10-2020
29-10-2020
31-10-2020
1-11-2020
11-11-2020
18-11-2020
19-11-2020
20-11-2020
27-11-2020
1-12-2020
9-12-2020
15-12-2020
28-12-2020
8-1-2021
19-1-2021
20-1-2021
26-1-2021
Using GNU awk:
gsutil ls gs://organization-dumps/ambient | awk '{ match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/);dayt=substr($0,RSTART,RLENGTH);split(dayt,map,"-");length(map[1])==1?map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2];map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 } END { PROCINFO["sorted_in"]="@ind_num_asc";for (i in map1) { print map1[i] } }'
Explanation:
gsutil ls gs://organization-dumps/ambient | awk '{
match($0,/[[:digit:]]{1,2}-[[:digit:]]{1,2}-[[:digit:]]{4}/); # Check that lines contain a date
dayt=substr($0,RSTART,RLENGTH); # Extract the date
split(dayt,map,"-"); # Split the date in the array map based on "-" as the delimiter
length(map[1])==1? map[1]="0"map[1]:map[1]=map[1];length(map[2])==1?map[2]="0"map[2]:map[2]=map[2]; # Pad the month and day with "0" if required
map1[mktime(map[3]" "map[2]" "map[1]" 00 00 00")]=$0 # Get the epoch format date based on the values in the map array and use this for the index of the array map1 with the line as the value
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"; # Set the ordering of the array
for (i in map1) {
print map1[i] # Loop through map1 and print the values (lines)
}
}'
Using GNU awk, you can do this fairly easily:
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}' file
Essentially, we are asking GNU awk to traverse an array by index in ascending numeric order. Per line read, we extract the date. The date is always located before the <dot>-character and thus always in field 1 if the dot is the field separator (FS="."). We split the first field by the hyphen and use the total number of fields to extract the date. We convert the date simplistically to some number (YYYY*10000+MM*100+DD; DD<100 && MM*100 < 10000) and ask awk to sort it by that number.
It is now possible to combine the full pipe-line in a single awk:
$ gsutil ls gs://organization-dumps/ambient \
| awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"; FS="."}
{sub("gs://organization-dumps/ambient/",""); sub("/$","")}
(NF==0){next}
{n=split($1,t,"-"); a[t[n]*10000 + t[n-1]*100 + t[n-2]]=$0}
END {for(i in a) print a[i]}'
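One thing to watch for with this approach: the numeric date is used as the array index, so two names carrying the same date (for example another-7-5-2020.dump and something-7-5-2020.dump) land on the same index and only the last one read is kept. A minimal Python sketch of the same YYYY*10000+MM*100+DD key that keeps every name, under the same assumption that the date is the last three hyphen-separated fields before the first dot (the function names are hypothetical):

```python
from collections import defaultdict


def numeric_date_key(name):
    """Turn '...-d-m-yyyy[.ext]' into the integer yyyymmdd."""
    stem = name.split(".", 1)[0]          # text before the first dot
    day, month, year = (int(p) for p in stem.split("-")[-3:])
    return year * 10000 + month * 100 + day


def sort_keep_duplicates(names):
    """Sort by date; names sharing a date keep their input order."""
    buckets = defaultdict(list)
    for name in names:
        buckets[numeric_date_key(name)].append(name)
    return [name for key in sorted(buckets) for name in buckets[key]]


names = ["something-7-5-2020.dump", "another-7-5-2020.dump",
         "1-9-2020", "another-2-5-2020.dump"]
for n in sort_keep_duplicates(names):
    print(n)
```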

Matching pairs using Linux terminal

I have a file named list.txt containing a (supplier,product) pair and I must show the number of products from every supplier and their names using Linux terminal
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space;
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the only awk code I was referring to:
< os awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}' os
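The same two associative arrays can be collapsed into one dictionary of lists. A minimal Python sketch, assuming supplier:product lines as in the sample input; the function names group_products and report are made up for illustration:

```python
def group_products(lines):
    """Map each supplier to the list of its products, in input order."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        supplier, product = line.split(":", 1)
        groups.setdefault(supplier, []).append(product)
    return groups


def report(groups):
    """Build the two output lines per supplier: count, then names."""
    out = []
    for supplier, products in groups.items():
        out.append(f"{supplier}: {len(products)}")
        out.append(f"{supplier}: {' '.join(products)}")
    return out


sample = ["stationery:paper", "grocery:apples", "grocery:pears",
          "dairy:milk", "stationery:pen", "dairy:cheese",
          "stationery:rubber"]
print("\n".join(report(group_products(sample))))
```

Because Python dictionaries keep insertion order, suppliers come out in the order they first appear, whereas the order of for (supp in items) in awk is unspecified.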

How do I do a for loop with 2 arrays in shell script?

I have to first declare two arrays which I also need help with.
Originally, it's two single variables.
day=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
date -d $(echo `awk '{print $6}'`) '+%b %-d' |
tr -d ' ')
time_stamp=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
awk '{ print $7 }')
Now instead of tail -1, I need tail -5. So first, how do I make these two arrays?
Second question, how do I make a for loop with each value from the paired values of $day and $time_stamp? I can't use array_combine because I need to perform actions on each array separately. Thanks
You are collecting the data into strings, not arrays. But additionally, your code should probably be refactored significantly -- as a general rule of thumb, if something happens in Awk, most of the rest should also happen in Awk.
You assign to an array with variable=(values of array) and to get the values from a subprocess, it's variable=($(command to produce values)).
Here's a first attempt at refactoring your code.
# Avoid repeated code -- break this out into a function
extract_field () {
hadoop fs -ls -R /user/hive/* |
# Get rid of the tail and the repeated Awk
# Notice backslashes in regex
# Pass in the field to extract as a parameter
awk -v field="$1" '/filename\.txt\.gz/ { d[++i]=$field }
END { for(j=(i>5 ? i-4 : 1); j<=i; ++j) print d[j] }'
}
day=($(extract_field 6 |
# Refactor accordingly
# And if you don't want a space in the format string, don't put a space in the format string in the first place
xargs -I {} date -d {} '+%b%-d'))
time_stamp=($(extract_field 7))
I'm highly skeptical of the arrangement to call the Hadoop command twice, though. Perhaps just extract fields 6 and 7 in a single go and then post-process the results to get them into two separate arrays. Something like this instead then?
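# mapfile (bash 4+) keeps each "date time" line as one array element;
# combined=($(...)) would also split on the space between date and time
mapfile -t combined < <(hadoop fs -ls -R /user/hive/* |
awk '/filename\.txt\.gz/ { d[++i]=$6 " " $7 }
END { for(j=(i>5 ? i-4 : 1); j<=i; ++j) print d[j] }')
for ((i=0; i<${#combined[@]}; ++i)); do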
day[$i]="$(date -d "${combined[i]% *}" +'%b%-d')"
time_stamp[$i]="${combined[i]#* }"
done
unset combined
The statement that you need to handle the dates and times independently from each other sounds suspicious; if you can find a way to avoid doing that, perhaps after all don't split combined into two separate arrays. The code above reveals how to extract the date and the time from a value in combined (the mechanism is called parameter expansion). It also obviously demonstrates how to loop over the indices in an array.

Bash script processing too slow

I have the following script, which parses 2 CSV files to find matches; the files have 10,000 lines each. But the processing is taking a long time! Is this normal?
My script:
#!/bin/bash
IFS=$'\n'
CSV_FILE1=$1;
CSV_FILE2=$2;
sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2
echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2" >> 'mapping.csv'
while read lineCSV1 #Parse 1st CSV file
do
PATH1=`echo $lineCSV1 | awk '{print $1}'`
NAME1=`echo $lineCSV1 | awk '{print $3}'`
SIZE1=`echo $lineCSV1 | awk '{print $7}'`
CKSUM1=`echo $lineCSV1 | awk '{print $9}'`
while read lineCSV2 #Parse 2nd CSV file
do
PATH2=`echo $lineCSV2 | awk '{print $1}'`
NAME2=`echo $lineCSV2 | awk '{print $3}'`
SIZE2=`echo $lineCSV2 | awk '{print $7}'`
CKSUM2=`echo $lineCSV2 | awk '{print $9}'`
# Test if NAM1 MATCHS NAME2
if [[ $NAME1 == $NAME2 ]]; then
#Test checksum OF THE MATCHING NAME
if [[ $CKSUM1 != $CKSUM2 ]]; then
#MAPPING OF THE MATCHING LINES
echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
fi
break #When its a match break the while loop and go the the next Row of the 1st CSV File
fi
done < Sorted_CSV2 #Done CSV2
done < Sorted_CSV1 #Done CSV1
This is quadratic. Also, see Tom Fenech's comment: you are calling awk several times inside a loop inside another loop. Instead of using awk to extract the fields of every line, try setting the IFS shell variable to ";" and reading the fields directly with read:
IFS=";"
while read FIELD11 FIELD12 FIELD13; do
while read FIELD21 FIELD22 FIELD23; do
...
done <Sorted_CSV2
done <Sorted_CSV1
This would still be O(N^2) and very inefficient, though. It seems you are matching two files on a common field. That task is easier and faster with the join command-line utility: once both files are sorted on that field, join needs only a single linear pass over them, so the total cost drops from O(N^2) to the O(N log N) of the sort.
Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).
An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.
In your loop of a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times. Your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 x 10,000 times you go through that inner loop. That's potentially looping 100 million times through that inner loop. Think you can see why your program might be a little slow?
In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on a MS-DOS system with 256K. It just wouldn't work). So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.
Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.
Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk would be faster, and its arrays are associative, but this is really a job for a higher level scripting language like Python or Perl.
Since I can't do a Bash answer, here's a Perl answer. Maybe this will help. Or maybe this will inspire someone who has Bash 4.x to give an answer in Bash.
I basically open the first file and create an associative array keyed by the checksum. If this is a sha1 checksum, it should be unique for all files (unless they're an exact match). If you don't have a sha1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.
Once I have the associative array figured out, I then open file #2 and simply see if the checksum already exists in the array. If it does, I know I have a matching line, and print out the two matches.
I have to loop 10,000 times in the first file, and 10,000 times in the second. That's only 20,000 loops instead of 100 million (10,000 x 10,000), which is 5,000 times less looping, so the program should run roughly 5,000 times faster. So, if it takes 2 full days for your program to run with a double loop, an associative array solution should finish in about half a minute.
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE1 => "file1.txt",
FILE2 => "file2.txt",
MATCHING => "csv_matches.txt",
};
#
# Open the first file and create the associative array
#
my %file_data;
open my $fh1, "<", FILE1;
while ( my $line = <$fh1> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# The main key is "check_sum" which **should** be unique, especially if it's a sha1
#
$file_data{$check_sum}->{PATH} = $path;
$file_data{$check_sum}->{NAME} = $name;
$file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;
#
# Now, we have the associative array keyed by the data we want to match, read file 2
#
open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# If there is a matching checksum in file1, we know we have a matching entry
#
if ( exists $file_data{$check_sum} ) {
printf {$csv_fh} "%s;%s;%s;%s;%s;%s\n",
$file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
$path, $name, $size;
}
}
close $fh2;
close $csv_fh;
BUGS
(A good manpage always lists issues!)
This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
This assumes a sha256 or equivalent checksum. In such a checksum, it is extremely unlikely that two files will have the same checksum unless they match. A 16bit checksum from the historic sum command may have collisions.
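The Perl above keys on the checksum, while the loop in the question matches on the file name and reports rows whose checksums differ. A minimal Python sketch of the associative-array idea with that rule, assuming each row has already been parsed into a (path, name, size, checksum) tuple; the function name find_checksum_mismatches and the sample rows are made up for illustration:

```python
def find_checksum_mismatches(rows1, rows2):
    """Pair rows by name; report pairs whose checksums differ.

    Like the inner loop with 'break' in the question, only the first
    row with a given name in the second file is considered.
    """
    first_by_name = {}
    for row in rows2:
        first_by_name.setdefault(row[1], tuple(row))  # keep first occurrence

    mismatches = []
    for row in rows1:
        match = first_by_name.get(row[1])             # one dictionary lookup
        if match is not None and match[3] != row[3]:
            mismatches.append(tuple(row) + match)
    return mismatches


rows1 = [("/a", "x.txt", "10", "aaa"),
         ("/a", "y.txt", "20", "bbb"),
         ("/a", "z.txt", "5", "ccc")]
rows2 = [("/b", "x.txt", "10", "aaa"),
         ("/b", "y.txt", "21", "zzz")]

for pair in find_checksum_mismatches(rows1, rows2):
    print(" ; ".join(pair))
```

Each file is read once, so the work grows with the sum of the two file lengths rather than their product.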
Although a proper database engine would make a much better tool for this, it is still very well possible to do it with awk.
The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.
In detail:
Insert two columns in both CSV files
Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.
awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2
Concatenate the files, then sort the lines
sort tmpfile1 tmpfile2 > tmpfile3
Scan the result, report the mismatches
awk -F';' -f scan.awk tmpfile3
Where scan.awk contains:
BEGIN {
origin = 3;
}
$1 == name && $2 > origin && $6 != checksum {
print record;
}
{
name = $1;
origin = $2;
checksum = $6;
sub(/^[^;]*;.;/, "");
record = $0;
}
Putting it all together
Crammed together into a Bash oneliner, without explicit temporary files:
(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'
Notes:
If the same name appears more than once in csvfile1, then all but the last one are ignored.
If the same name appears more than once in csvfile2, then all but the first one are ignored.
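The tag, sort and scan steps can also be sketched in Python. This is an illustration under assumptions, not a translation: rows are taken to be (path, name, size, checksum) tuples, the function name mismatches_by_sorting is hypothetical, and ties within one file keep input order rather than the line order produced by sort.

```python
from itertools import groupby
from operator import itemgetter


def mismatches_by_sorting(rows1, rows2):
    """Report file-1 rows whose name reappears in file 2 with another checksum."""
    # Tag every record with its name and the file it came from (1 or 2).
    tagged = [(row[1], 1, tuple(row)) for row in rows1]
    tagged += [(row[1], 2, tuple(row)) for row in rows2]
    # After sorting, records with the same name are adjacent, file 1 first.
    tagged.sort(key=itemgetter(0, 1))

    reported = []
    for _, group in groupby(tagged, key=itemgetter(0)):
        records = list(group)
        from1 = [rec for _, origin, rec in records if origin == 1]
        from2 = [rec for _, origin, rec in records if origin == 2]
        # Compare the last file-1 record with the first file-2 record.
        if from1 and from2 and from1[-1][3] != from2[0][3]:
            reported.append(from1[-1])
    return reported


rows1 = [("/a", "x.txt", "10", "aaa"), ("/a", "y.txt", "20", "bbb")]
rows2 = [("/b", "y.txt", "21", "zzz"), ("/b", "x.txt", "10", "aaa")]
print(mismatches_by_sorting(rows1, rows2))
```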

Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address in each line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses |uniq |while read ip
do
echo -n $ip" "
grep -c $ip ip_addresses
done
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB. So sort is not an efficient solution, neither is reading the file more than once.
I liked the hashtable-like solution - anybody can provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why would I bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine I had to do this perl wasn't available for me. It was a custom built linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)
sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
For summing up multiple fields, based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000
The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
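The chunk, spill and merge scheme described above can be sketched in a few lines of Python. This only illustrates the idea and is not how sort is actually implemented; the function names and the tiny chunk_size are assumptions chosen so the example exercises more than one temporary file.

```python
import heapq
import itertools
import tempfile


def sorted_chunks(lines, chunk_size):
    """Sort chunk_size lines at a time in memory; spill each run to a temp file."""
    files = []
    it = iter(lines)
    while True:
        chunk = sorted(line.rstrip("\n") + "\n"
                       for line in itertools.islice(it, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(chunk)
        f.seek(0)              # rewind so the run can be read back
        files.append(f)
    return files


def count_sorted(lines, chunk_size=2):
    """Equivalent of 'sort | uniq -c' with bounded memory per chunk."""
    files = sorted_chunks(lines, chunk_size)
    # heapq.merge reads the sorted runs lazily, one line at a time.
    merged = (line.rstrip("\n") for line in heapq.merge(*files))
    counts = [(key, sum(1 for _ in grp))
              for key, grp in itertools.groupby(merged)]
    for f in files:
        f.close()
    return counts


ips = ["10.0.10.1", "10.0.10.1", "10.0.10.3", "10.0.10.2", "10.0.10.1"]
for ip, n in count_sorted(ips):
    print(ip, n)
```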
cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'
this command would give you desired output
Solution ( group by like mysql)
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook
It seems that you have to either use a big amount of code to simulate hashes in bash to get linear behavior or stick to the superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
I feel awk associative array is also handy in this case
$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt
A group by post here
You probably can use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
let addr denote the ip address;
if file "addr" does not exist; then
create file "addr";
write a number "0" in the file;
else
read the number from "addr";
increase the number by 1 and write it back;
fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
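A minimal Python sketch of the file-size variant, assuming every key is a valid file name (true for IP addresses) and using a throwaway temporary directory; the function name count_with_files is made up for illustration. Opening a file per input line is slow, so this only demonstrates the idea.

```python
import os
import tempfile


def count_with_files(keys, directory):
    """Use one file per key as a counter: append one byte per occurrence."""
    for key in keys:
        with open(os.path.join(directory, key), "a") as f:
            f.write("x")
    # The size of each file in bytes is the number of occurrences.
    return {name: os.path.getsize(os.path.join(directory, name))
            for name in sorted(os.listdir(directory))}


ips = ["10.0.10.1", "10.0.10.1", "10.0.10.3", "10.0.10.2", "10.0.10.1"]
with tempfile.TemporaryDirectory() as tmp:
    result = count_with_files(ips, tmp)

for ip, n in result.items():
    print(ip, n)
```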
Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
GROUP BY under bash
In this SO thread there are several different answers addressing different needs.
1. Counting IP as SO request (GROUP BY IP address).
As IP addresses are easy to convert to a single integer, for a small bunch of addresses, if you need to repeat this kind of operation many times, a pure bash function could be a lot more efficient!
Pure bash (no fork!)
There is a way, using a bash function. This way is very quick as there is no fork!
countIp () {
local -a _ips=(); local _a
while IFS=. read -a _a ;do
((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
done
for _a in ${!_ips[@]} ;do
printf "%.16s %4d\n" \
$(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
done
}
Note: IP addresses are converted to 32-bit unsigned integer values, used as the array index. This uses simple bash arrays!
time countIp < ip_addresses
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
real 0m0.001s
user 0m0.004s
sys 0m0.000s
time sort ip_addresses | uniq -c
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
real 0m0.010s
user 0m0.000s
sys 0m0.000s
On my host, doing so is a lot quicker than using forks, up to approx 1'000 addresses, but it takes approx 1 entire second when I try to sort and count 10'000 addresses.
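The shift-and-mask arithmetic used for the array index may be easier to follow outside of bash. A minimal Python sketch of the same packing and unpacking, assuming well-formed dotted-quad IPv4 input; the function names ip_to_int and int_to_ip are hypothetical:

```python
def ip_to_int(ip):
    """Pack a dotted-quad IPv4 address into one 32-bit integer."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a << 24 | b << 16 | c << 8 | d


def int_to_ip(n):
    """Unpack the integer back into dotted-quad notation."""
    return ".".join(str(n >> shift & 255) for shift in (24, 16, 8, 0))


print(ip_to_int("10.0.10.1"))   # 167774721
print(int_to_ip(167774721))     # 10.0.10.1
```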
2. GROUP BY duplicates (files content)
By using a checksum you could identify duplicate files somewhere:
find . -type f -exec sha1sum {} + |
sort |
sed '
:a;
$s/^[^ ]\+ \+//;
N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
ta;
s/^[^ ]\+ \+//;
P;
D;
ba
'
This will print all duplicates, one group per line, separated by a tab ($'\t' or octal 011; you could change /\1 \2\o11\3/; to /\1 \2|\3/; to use | as separator).
./b.txt ./e.txt
./a.txt ./c.txt ./d.txt
Could be written as (with | as separator):
find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'
Pure bash way
By using nameref, you could build bash arrays holding all duplicates:
declare -iA sums='()'
while IFS=' ' read -r sum file ;do
declare -n list=_LST_$sum
list+=("$file")
sums[$sum]+=1
done < <(
find . -type f -exec sha1sum {} +
)
From there, you have a bunch of arrays, each holding the names of all duplicate files as separate elements:
for i in ${!sums[@]};do
declare -n list=_LST_$i
printf "%d %d %s\n" ${sums[$i]} ${#list[@]} "${list[*]}"
done
This may output something like:
2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt
Here the count of files per checksum (${sums[$shasum]}) matches the number of elements in the corresponding array ${_LST_ShAsUm[@]}.
for i in ${!sums[@]};do
declare -n list=_LST_$i
echo ${list[@]@A}
done
declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")
Note that this method could handle spaces and special characters in filenames!
3. GROUP BY columns in a table
As an efficient sample using awk was provided by Anonymous, here is a pure bash solution.
So you want to summarize columns 3 to the last column and group by columns 1 and 2, with table.txt looking like
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000
For tables that are not too big, you could use:
myfunc() {
local -iA restabl='()';
local IFS=+
while IFS=\| read -ra ar; do
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${!restabl[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
This could output something like:
myfunc <table.txt
UK|1|19000
US|A|3000
US|C|3000
US|B|3000
And to have table sorted:
myfunc() {
local -iA restabl='()';
local IFS=+ sorted=()
while IFS=\| read -ra ar; do
sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${sorted[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
This must return:
myfunc <table.txt
UK|1|19000
US|A|3000
US|B|3000
US|C|3000
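The functions above rely on two compact tricks: an integer-typed associative array, and IFS=+ to turn the remaining fields into an arithmetic expression. A more explicit variant can be sketched as follows. The name sum_by_key is hypothetical, and the sketch assumes bash 4 or later, plain decimal integers without leading zeros, and an external sort for ordering (so it is not strictly pure bash).

```bash
#!/usr/bin/env bash
# sum_by_key (hypothetical name): groups '|'-separated rows read from stdin
# by their first two fields and sums every remaining field of each group.
sum_by_key() {
    local -A totals=()
    local -a fields
    local key value
    while IFS='|' read -r -a fields; do
        key="${fields[0]}|${fields[1]}"
        # Add each value column (field 3 onward) to the running total.
        for value in "${fields[@]:2}"; do
            totals[$key]=$(( ${totals[$key]:-0} + value ))
        done
    done
    # Print "key|total" lines, ordered by key.
    for key in "${!totals[@]}"; do
        printf '%s|%s\n' "$key" "${totals[$key]}"
    done | sort
}
```

Called as sum_by_key <table.txt, this should give the same sorted result as shown above for the sample table. Rows with no value columns are silently skipped in this sketch.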
I'd have done it like this:
perl -e 'while (<>) {chomp; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses
but uniq might work for you.
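For comparison, the uniq route mentioned here can be sketched like this. The helper name count_sorted is made up, and the trailing awk step only rearranges the output of uniq -c into the same "line count" layout the Perl one-liner prints.

```bash
#!/usr/bin/env bash
# count_sorted (hypothetical name): prints "<line> <count>" for each distinct
# line read from stdin, ordered by line.
count_sorted() {
    sort | uniq -c |
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 " " n }'
}
```

Usage would be count_sorted <ip_addresses. Unlike the Perl hash, whose key order is unspecified, the result comes out sorted.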
Importing the data into an SQLite database and using SQL syntax (just another idea).
I know it's too much for this example, but it would be useful for complex queries with multiple files (tables).
#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }
# the input file needs a header line (IP) as its first row
INPUT_FILE=ips.txt
# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"
# using sql statements on table 'mytable'
sqlite3 mydb$$ -separator " " "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:
mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)
As values in a set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be buggy, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.
Edit:
I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that counts occurrences.
mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1
The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as values.
This does not answer the count element of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone as it relates to 'group by' functionality.
I wanted to order files based on groupings of them, where the presence of some string in the filename determined the group.
It uses a temporary grouping/ordering prefix which is removed after ordering; sed substitute expressions (s#pattern#replacement#g) match the target string and prepend to the line an integer corresponding to the desired sort order of that target string. Then the grouping prefix is removed with cut.
Note that the sed expressions could be joined (e.g. sed -e '<expr>; <expr>; <expr>;') but here they're split for readability.
It's not pretty and probably not fast (I'm dealing with <50 items), but it is at least conceptually simple and doesn't require learning awk.
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
| sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
For example, given this input:
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*special.*)#10:\1#;" \
| sed -E -e "s#^(.*another.*)#05:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
Output:
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d
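As noted, the sed expressions can be joined into a single invocation. Here is a minimal sketch of that, written as a filter that reads paths on stdin instead of looping over find output. The function name order_by_group is hypothetical, the two patterns are the ones from the example, and the sketch assumes absolute paths that each match at most one of the patterns.

```bash
#!/usr/bin/env bash
# order_by_group (hypothetical name): reads absolute paths on stdin and prints
# them grouped as: ordinary paths, then "another" paths, then "special" paths.
order_by_group() {
    # Each expression prepends a 3-character sort prefix; the last one tags
    # every still-unprefixed path (those still starting with "/") as "00:".
    sed -E \
        -e 's#^(.*special.*)#10:\1#' \
        -e 's#^(.*another.*)#05:\1#' \
        -e 's#^/(.*)#00:/\1#' |
        sort |
        cut -c4-    # strip the 3-character prefix again
}
```

For instance, find /etc | order_by_group. Piping through the filter also avoids the word splitting that for line in $(...) applies to paths containing spaces.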
A combination of awk + sort (with the version sort flag) is probably fastest (if your environment has awk at all):
echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V
Only the post-GROUP-BY summary rows are sent to the sorting utility, which is far less system-intensive than pre-sorting all the input rows for no reason.
For small inputs, e.g. fewer than 1000 rows or so, just pipe the input directly through sort | uniq -c.
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
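The count-first, sort-later idea can also be spelled out with ordinary variable names. This is only a sketch: count_then_sort is a hypothetical name, the output layout is simplified to "count, tab, line", and a plain POSIX sort on the second field stands in for GNU sort's -V version ordering, so multi-digit address parts are ordered as text rather than numerically.

```bash
#!/usr/bin/env bash
# count_then_sort (hypothetical name): counts identical lines from stdin in an
# awk array, then sorts only the resulting summary rows by the line text.
count_then_sort() {
    awk '{ count[$0]++ }
         END { for (line in count) printf "%d\t%s\n", count[line], line }' |
        sort -t "$(printf '\t')" -k 2,2
}
```

Because sort only sees one row per distinct value, its cost depends on the number of groups rather than on the number of input lines.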
Sort may be omitted only if identical lines are already adjacent in the input, because uniq -c counts runs of consecutive identical lines:
uniq -c <source_file>
or
echo "$list" | uniq -c
if the source list is in a variable.
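One caveat worth illustrating: uniq compares adjacent lines only, so on unsorted input it counts runs rather than totals. A small sketch (count_runs is a made-up helper that just strips the leading padding from uniq -c):

```bash
#!/usr/bin/env bash
# count_runs (hypothetical name): uniq -c with the leading padding removed.
count_runs() {
    uniq -c | sed 's/^ *//'
}

# Unsorted input: the two "a" lines are not adjacent, so each run is counted separately.
unsorted_result=$(printf '%s\n' a b a | count_runs)

# Sorted input: identical lines become adjacent and are counted together.
sorted_result=$(printf '%s\n' a b a | sort | count_runs)
```

Here unsorted_result holds three rows ("1 a", "1 b", "1 a"), while sorted_result holds "2 a" and "1 b".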
