Best way to simulate "group by" from bash? - bash

Suppose you have a file that contains IP addresses, one address in each line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses |uniq |while read ip
do
echo -n $ip" "
grep -c $ip ip_addresses
done
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB. So sort is not an efficient solution, neither is reading the file more than once.
I liked the hashtable-like solution - anybody can provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why would I bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine I had to do this perl wasn't available for me. It was a custom built linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)

sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.

The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
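A minimal sketch of that idea, assuming bash and a small sample file created on the spot (in practice `ip_addresses` would already exist). The variable names `counts` and `report` are made up for illustration:

```shell
#!/usr/bin/env bash
# Sample input; stands in for the real ip_addresses file.
input=$(mktemp)
printf '%s\n' 10.0.10.1 10.0.10.1 10.0.10.3 10.0.10.2 10.0.10.1 > "$input"

# Capture the whole "count value" table in one variable.
counts=$(sort "$input" | uniq -c)

# Loop over the captured lines; read strips the leading padding uniq adds.
report=""
while read -r count ip; do
    report+="$ip $count"$'\n'
done <<< "$counts"

printf '%s' "$report"
rm -f "$input"
```

Reading into two variables also swaps the columns, so the output has the address first, as the question asked.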
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.

for summing up multiple fields, based on a group of existing fields, use the example below : ( replace the $1, $2, $3, $4 according to your requirements )
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000

The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
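As a rough sketch of how to steer that spill-to-disk behaviour, assuming GNU sort (the -S buffer size and -T temporary directory options), with a hypothetical wrapper name `count_sorted` and a placeholder buffer size:

```shell
#!/usr/bin/env bash
# count_sorted FILE [TMPDIR]: sort | uniq -c with a bounded memory buffer.
# Assumption: GNU sort; 64M and the default /tmp are placeholders to tune.
count_sorted() {
    local file=$1 tmpdir=${2:-/tmp}
    # LC_ALL=C compares raw bytes, which is usually faster than locale collation.
    # -S caps the in-memory buffer; bigger inputs are written to $tmpdir
    # as sorted runs, which sort merges at the end.
    LC_ALL=C sort -S 64M -T "$tmpdir" "$file" | uniq -c
}
```

The temporary directory needs enough free space to hold roughly the size of the input.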

cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'
This command gives you the desired output.

Solution ( group by like mysql)
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook

It seems that you either have to write a large amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
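For what it's worth, newer shells make the hash approach much shorter. A sketch, assuming bash 4.0 or later (associative arrays via declare -A / local -A); the function name `count_lines` is made up here, and this is not the code from the link above:

```shell
#!/usr/bin/env bash
# Requires bash 4.0 or newer for associative arrays.
count_lines() {
    local -A seen=()
    local line key
    # Single pass over stdin; each distinct line becomes a key.
    while IFS= read -r line; do
        [[ -n $line ]] || continue        # skip blank lines
        seen[$line]=$(( ${seen[$line]:-0} + 1 ))
    done
    # The key order of an associative array is unspecified.
    for key in "${!seen[@]}"; do
        printf '%s %d\n' "$key" "${seen[$key]}"
    done
}
```

Usage would be `count_lines < ip_addresses`. Only the distinct keys are held in memory, but a bash read loop is slow, so on very large files this may still lose to sort | uniq -c.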

I feel awk associative array is also handy in this case
$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt
A group by post here

You probably can use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
let addr denote the ip address;
if file "addr" does not exist; then
create file "addr";
write a number "1" in the file;
else
read the number from "addr";
increase the number by 1 and write it back;
fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
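A sketch of the append-one-byte variant described above, assuming bash and that every key is safe to use as a file name (true for IP addresses, not for arbitrary text containing "/"). The function name `count_with_files` and its arguments are hypothetical:

```shell
#!/usr/bin/env bash
# count_with_files INPUT WORKDIR: keeps one small file per distinct key in WORKDIR.
count_with_files() {
    local input=$1 dir=$2 addr f
    while IFS= read -r addr; do
        [[ -n $addr ]] || continue
        # Append one byte per occurrence; the file size ends up being the count.
        printf 'x' >> "$dir/$addr"
    done < "$input"
    for f in "$dir"/*; do
        [[ -e $f ]] || continue           # empty directory: nothing to report
        # $(( ... )) also strips any padding some wc implementations print.
        printf '%s %d\n' "${f##*/}" "$(( $(wc -c < "$f") ))"
    done
}
```

This reads the input once and needs almost no memory, at the cost of one file per distinct address and one open/append per input line.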

Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt

GROUP BY under bash
Regarding this SO thread, there are several answers covering different needs.
1. Counting IPs as the SO question asks (GROUP BY IP address).
As an IP address is easy to convert to a single integer, for a small set of addresses, and if you need to repeat this kind of operation many times, a pure bash function can be a lot more efficient!
Pure bash (no fork!)
There is a way, using a bash function. It is very quick as there is no fork!
countIp () {
local -a _ips=(); local _a
while IFS=. read -a _a ;do
((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
done
for _a in ${!_ips[@]} ;do
printf "%.16s %4d\n" \
$(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
done
}
Note: IP addresses are converted to 32-bit unsigned integer values, which are used as array indexes. This uses simple (indexed) bash arrays!
time countIp < ip_addresses
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
real 0m0.001s
user 0m0.004s
sys 0m0.000s
time sort ip_addresses | uniq -c
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
real 0m0.010s
user 0m0.000s
sys 0m0.000s
On my host, doing so is a lot quicker than using forks, up to approximately 1'000 addresses, but it takes about one full second when I try to sort and count 10'000 addresses.
2. GROUP BY duplicates (files content)
By using checksums you can identify duplicate files:
find . -type f -exec sha1sum {} + |
sort |
sed '
:a;
$s/^[^ ]\+ \+//;
N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
ta;
s/^[^ ]\+ \+//;
P;
D;
ba
'
This will print all duplicates, one group per line, separated by tabs ($'\t' or octal 011; you could change /\1 \2\o11\3/; to /\1 \2|\3/; to use | as the separator).
./b.txt ./e.txt
./a.txt ./c.txt ./d.txt
Could be written as (with | as separator):
find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'
Pure bash way
By using nameref, you could build bash arrays holding all duplicates:
declare -iA sums='()'
while IFS=' ' read -r sum file ;do
declare -n list=_LST_$sum
list+=("$file")
sums[$sum]+=1
done < <(
find . -type f -exec sha1sum {} +
)
From there, you have a bunch of arrays holding all duplicate file names as separate elements:
for i in ${!sums[@]};do
declare -n list=_LST_$i
printf "%d %d %s\n" ${sums[$i]} ${#list[@]} "${list[*]}"
done
This may output something like:
2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt
Where the count of files per sha1sum (${sums[$shasum]}) matches the count of elements in the array ${_LST_ShAsUm[@]}.
for i in ${!sums[@]};do
declare -n list=_LST_$i
echo ${list[@]@A}
done
declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")
Note that this method could handle spaces and special characters in filenames!
3. GROUP BY columns in a table
As an efficient sample using awk was provided by Anonymous, here is a pure bash solution.
So you want to summarize columns 3 to the last column, grouped by columns 1 and 2, with table.txt looking like
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000
For not too big tables, you could:
myfunc() {
local -iA restabl='()';
local IFS=+
while IFS=\| read -ra ar; do
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${!restabl[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
Could output something like:
myfunc <table.txt
UK|1|19000
US|A|3000
US|C|3000
US|B|3000
And to have table sorted:
myfunc() {
local -iA restabl='()';
local IFS=+ sorted=()
while IFS=\| read -ra ar; do
sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${sorted[@]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
This must return:
myfunc <table.txt
UK|1|19000
US|A|3000
US|B|3000
US|C|3000

I'd have done it like this:
perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses
but uniq might work for you.

Importing data to sqlite db and using sql syntax (just an other idea).
I know it's too much for this example but would be useful for complex queries with multiple files (tables)
#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }
# add header to input_file (IP)
INPUT_FILE=ips.txt
# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"
# using sql statements on table 'mytable'
sqlite3 mydb$$ -separator " " "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:
mySet = set()
for line in open("ip_address_file.txt"):
line = line.rstrip()
mySet.add(line)
As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.
Edit:
I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that would count occurrences.
mydict = {}
for line in open("ip_address_file.txt"):
line = line.rstrip()
if line in mydict:
mydict[line] += 1
else:
mydict[line] = 1
The dictionary mydict now holds a list of unique IP's as keys and the amount of times they occurred as their values.

This does not answer the count element of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone as it relates to 'group by' functionality.
I wanted to order files based on groupings of them, where the presence of some string in the filename determined the group.
It uses a temporary grouping/ordering prefix which is removed after ordering; sed substitute expressions (s#pattern#replacement#g) match the target string and prepend an integer to the line corresponding to the desired sort order of that target string. Then, grouping prefix is removed with cut.
Note that the sed expressions could be joined (e.g. sed -e '<expr>; <expr>; <expr>;') but here they're split for readability.
It's not pretty and probably not fast (I'm dealing with <50 items), but it is at least conceptually simple and doesn't require learning awk.
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
| sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
e.g. Input
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*special.*)#10:\1#;" \
| sed -E -e "s#^(.*another.*)#05:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d

A combination of awk + sort (with the version sort flag) is probably fastest (if your environment has awk at all):
echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V
Only the post-GROUP-BY summary rows are sent to the sorting utility, which is a far less system-intensive sort than pre-sorting the input rows for no reason.
For small inputs, e.g. fewer than 1000 rows or so, just directly sort|uniq -c it.
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3

Sort may be omitted if the source is already grouped, i.e. identical lines are adjacent (uniq only compares successive lines):
uniq -c <source_file>
or
echo "$list" | uniq -c
if the source list is a variable

Related

How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contain a,b,c and each of the chars appears in the word 2 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${@: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' < file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$@"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
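If the original order matters, one common alternative is to drop repeats with awk instead of sort -u. A small sketch, assuming any POSIX awk; the function name `dedupe_keep_order` is made up for illustration:

```shell
# dedupe_keep_order: print each input line only the first time it is seen,
# keeping the input order (unlike sort -u).
dedupe_keep_order() {
    # seen[$0]++ is 0 (false) on first sight, so !seen[$0]++ is true exactly once.
    awk '!seen[$0]++'
}
```

It could be appended to either pipeline above, e.g. `exactly_n_times_char 3 a b < file.txt | dedupe_keep_order`.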
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[@]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[@]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' '\n' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=n)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

Reading recent entry from a file based on a key

Input file, fruits.txt:
JAN,APPLE
FEB,MANGO
JAN,ORANGE
MAR,APPLE
FEB,APPLE
Expected output file:
MAR,APPLE
FEB,APPLE
JAN,ORANGE
For getting the above output, below code is used:
#!/bin/sh
declare -A m_arr
cat fruits.txt > /tmp/ID.part
while read line
do
Month=$(echo $line | cut -d, -f1)
Fruits=$(echo $line | cut -d, -f2)
m_arr[${Month}]=${Fruits}
done < /tmp/ID.part
for i in ${!m_arr[@]}
do
echo "$i,${m_arr[$i]}"
done
This works fine for a small amount of data in the input file. I have 200 000 entries and observed that the cut command is very slow. I tried awk as well, but did not get a better result. My requirement is to read the file from row 1, with column 1 as the key. I need the most recent entry for each key.
I think this can be done pretty easily with Awk, you just need to hash the values of $1 in $2 once you delimit the file with a , separator
awk -v FS=, -v OFS=, '{key[$1]=$2; next}END{for (i in key) print i,key[i]}' file
Also if you want to speed up things while processing a million line file, you can change the localization settings to speed up the execution while parsing, you can pass LC_ALL=C locally to the command. See Stéphane Chazelas's answer on what "LC_ALL=C" does?
In bash version 4, you can declare an associative array and populate it with the result of read, splitting your lines with a custom IFS:
$ declare -A a
$ while IFS=, read key value; do a["$key"]="$value"; done < fruits.txt
$ declare -p a
declare -A a=([MAR]="APPLE" [FEB]="APPLE" [JAN]="ORANGE" )
If you want to generate that specific output from the array, you'll also require a loop:
$ for key in "${!a[@]}"; do printf '%s,%s\n' "$key" "${a[$key]}"; done
MAR,APPLE
FEB,APPLE
JAN,ORANGE
The shortest one using GNU datamash:
datamash -st, -g1 last 2 <file
g1 - group by the 1st column
last 2 - keep the last value of the group
The output:
FEB,APPLE
JAN,ORANGE
MAR,APPLE

UNIX - Replacing variables in sql with matching values from .profile file

I am trying to write a shell which will take an SQL file as input. Example SQL file:
SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'
Now the script should extract all variables, which in this case everything starting with %%. So the output file will be something as below:
%%DB
%%TBLEXT
%%CITY
Now I should be able to extract the matching values from the user's .profile file for these variables and create the SQL file with the proper values.
SELECT *
FROM tempdb.TBL_abc
WHERE CITY = 'Chicago'
As of now I am trying to generate the file1 which will contain all the variables. Below code sample -
sed "s/[(),']//g" "T:/work/shell/sqlfile1.sql" | awk '/%%/{print $NF}' | awk '/%%/{print $NF}' > sqltemp2.sql
takes me till
%%DB.TBL_%%TBLEXT
%%CITY
Can someone help me in getting to file1 listing the variables?
You can use grep and sort to get a list of unique variables, as per the following transcript:
$ echo "SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'" | grep -o '%%[A-Za-z0-9_]*' | sort -u
%%CITY
%%DB
%%TBLEXT
The -o flag to grep instructs it to only print the matching parts of lines rather than the entire line, and also outputs each matching part on a distinct line. Then sort -u just makes sure there are no duplicates.
In terms of the full process, here's a slight modification to a bash script I've used for similar purposes:
# Define all translations.
declare -A xlat
xlat['%%DB']='tempdb'
xlat['%%TBLEXT']='abc'
xlat['%%CITY']='Chicago'
# Check all variables in input file.
okay=1
for key in $(grep -o '%%[A-Za-z0-9_]*' input.sql | sort -u) ; do
if [[ "${xlat[$key]}" == "" ]] ; then
echo "Bad key ($key) in file:"
grep -n "${key}" input.sql | sed 's/^/ /'
okay=0
fi
done
if [[ ${okay} -eq 0 ]] ; then
exit 1
fi
# Process input file doing substitutions. Fairly
# primitive use of sed, must change to use sed -i
# at some point.
# Note we sort keys based on descending length so we
# correctly handle extensions like "NAME" and "NAMESPACE",
# doing the longer ones first makes it work properly.
cp input.sql output.sql
for key in $( (
for key in ${!xlat[@]} ; do
echo ${key}
done
) | awk '{print length($0)":"$0}' | sort -rnu | cut -d':' -f2) ; do
sed "s/${key}/${xlat[$key]}/g" output.sql >output2.sql
mv output2.sql output.sql
done
cat output.sql
It first checks that the input file doesn't contain any keys not found in the translation array. Then it applies sed substitutions to the input file, one per translation, to ensure all keys are substituted with their respective values.
This should be a good start, though there may be some edge cases such as if your keys or values contain characters sed would consider important (like / for example). If that is the case, you'll probably need to escape them such as changing:
xlat['%%UNDEFINED']='0/0'
into:
xlat['%%UNDEFINED']='0\/0'
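Rather than escaping each value by hand, the escaping could be done by a small helper. This is only a sketch covering the replacement side of s/.../.../ (backslash, slash and ampersand); the name `sed_escape_replacement` is hypothetical and not part of the script above, and keys containing regex metacharacters would need a similar treatment on the pattern side:

```shell
# sed_escape_replacement STRING: escape \, / and & so that STRING can be
# used literally as the replacement part of sed's s/pattern/replacement/.
sed_escape_replacement() {
    # "|" is used as the delimiter here so that "/" needs no special care.
    printf '%s\n' "$1" | sed -e 's|[\\/&]|\\&|g'
}

# Possible use inside the substitution loop (assumes single-line values):
#   sed "s/${key}/$(sed_escape_replacement "${xlat[$key]}")/g" output.sql
```

With this, xlat['%%UNDEFINED'] could stay as the plain '0/0'.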

If xargs is map, what is filter?

I think of xargs as the map function of the UNIX shell. What is the filter function?
EDIT: it looks like I'll have to be a bit more explicit.
Let's say I have to hand a program which accepts a single string as a parameter and returns with an exit code of 0 or 1. This program will act as a predicate over the strings that it accepts.
For example, I might decide to interpret the string parameter as a filepath, and define the predicate to be "does this file exist". In this case, the program could be test -f, which, given a string, exits with 0 if the file exists, and 1 otherwise.
I also have to hand a stream of strings. For example, I might have a file ~/paths containing
/etc/apache2/apache2.conf
/foo/bar/baz
/etc/hosts
Now, I want to create a new file, ~/existing_paths, containing only those paths that exist on my filesystem. In my case, that would be
/etc/apache2/apache2.conf
/etc/hosts
I want to do this by reading in the ~/paths file, filtering those lines by the predicate test -f, and writing the output to ~/existing_paths. By analogy with xargs, this would look like:
cat ~/paths | xfilter test -f > ~/existing_paths
It is the hypothesized program xfilter that I am looking for:
xfilter COMMAND [ARG]...
Which, for each line L of its standard input, will call COMMAND [ARG]... L, and if the exit code is 0, it prints L, else it prints nothing.
To be clear, I am not looking for:
a way to filter a list of filepaths by existence. That was a specific example.
how to write such a program. I can do that.
I am looking for either:
a pre-existing implementation, like xargs, or
a clear explanation of why this doesn't exist
If map is xargs, filter is... still xargs.
Example: list files in the current directory and filter out non-executable files:
ls | xargs -I{} sh -c "test -x '{}' && echo '{}'"
This could be made handy through a (non production-ready) function:
xfilter() {
xargs -I{} sh -c "$* '{}' && echo '{}'"
}
ls | xfilter test -x
Alternatively, you could use a parallel filter implementation via GNU Parallel:
ls | parallel "test -x '{}' && echo '{}'"
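The versions above paste each line into a quoted shell string, so lines containing quotes can break them. A sketch of an alternative that passes the line as a real argument instead, assuming bash and one item per line; it is a plain read loop, not a pre-existing tool:

```shell
# xfilter COMMAND [ARG]...: print each stdin line L for which
# COMMAND [ARG]... L exits with status 0.
xfilter() {
    local line
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # </dev/null keeps COMMAND from eating the remaining input lines;
        # its own stdout is discarded so that only L is printed.
        "$@" "$line" </dev/null >/dev/null && printf '%s\n' "$line"
    done
}
```

With that, the example from the question reads `xfilter test -f < ~/paths > ~/existing_paths`. It forks once per line, so it is no faster than the xargs version.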
So, you're looking for:
reduce( compare( filter( map(.. list()) ) ) )
which can be rewritten as
list | map | filter | compare | reduce
The main power of bash is pipelining, so there is no need for one special filter and/or reduce command. In fact nearly all unix commands can act in one (or more) of these roles:
list
map
filter
reduce
Imagine:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
Creating a test case:
mkdir ./testcase
cd ./testcase || exit 1
for i in {1..10}
do
strings -1 < /dev/random | head -1000 > file.$i.txt
done
mkdir emptydir
You will get a directory named testcase, and in this directory 10 files and one directory:
emptydir file.1.txt file.10.txt file.2.txt file.3.txt file.4.txt file.5.txt file.6.txt file.7.txt file.8.txt file.9.txt
Each file contains 1000 lines of random strings; some lines contain only numbers.
Now run the command
find testcase -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
and you will get the largest number-only line across all the files, like: 42. (Of course, this can be done more efficiently; this is only a demo.)
decomposed:
The find testcase -type f -print prints every plain file, so it is LIST (already reduced to files only). Output:
testcase/file.1.txt
testcase/file.10.txt
testcase/file.2.txt
testcase/file.3.txt
testcase/file.4.txt
testcase/file.5.txt
testcase/file.6.txt
testcase/file.7.txt
testcase/file.8.txt
testcase/file.9.txt
The xargs grep -H '^[0-9]*$' as MAP runs a grep command over the files from the list. grep is usually used as a filter, e.g. command | grep, but here (with xargs) it transforms the input (file names) into (lines containing only digits). Output: many lines like:
testcase/file.1.txt:1
testcase/file.1.txt:8
....
testcase/file.9.txt:4
testcase/file.9.txt:5
Structure of the lines: filename, colon, number. We want only the numbers, so we call a pure filter that strips the file names from each line: cut -d: -f2. It outputs many lines like:
1
8
...
4
5
Now the reduce (getting the largest number): sort -nr sorts all numbers numerically in reverse (descending) order, so its output is like:
42
18
9
9
...
0
0
and head -1 prints the first line (the largest number).
Of course, you can write your own list/filter/map/reduce functions directly with bash programming constructs (loops, conditions and such), or you can employ any full-blown scripting language like perl, special languages like awk, the sed "language", or dc (rpn) and such.
Having a special filter command such as:
list | filter_command cut -d: -f 2
is simply not needed, because you can directly use
list | cut
You can have awk do the filter and reduce function.
Filter:
awk 'NR % 2 == 0 { $0 = $0 " [EVEN]" } 1'
Reduce:
awk '{ p = p + $0 } END { print p }'
I totally understand your question here as a long time functional programmer and here is the answer: Bash/unix command pipelining isn't as clean as you'd hoped.
In the example above:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
a more pure form would look like:
find mydir | xargs -L 1 bash -c 'test -f $1 && echo $1' _ | grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^---list--^^-------filter---------------------------------^^------map----------^^--map-------^ ^reduce^
But, for example, grep also has a filtering capability: grep -q mypattern which simply return 0 if it matches the pattern.
To get something more like what you want, you would simply have to define a filter bash function and make sure to export it so that it is compatible with xargs.
But then you get into some problems. For example, test has binary and unary operators. How will your filter function handle this? And what would you decide to output on true for these cases? Not insurmountable, but weird. Assuming only unary operations:
filter(){
while read -r LINE || [[ -n "${LINE}" ]]; do
eval "[[ ${LINE} $1 ]]" 2> /dev/null && echo "$LINE"
done
}
so you could do something like
seq 1 10 | filter "> 4"
5
6
7
8
9
As I wrote this I kinda liked it

combining multiple grep searches and making my script more efficient

I have a file called Type1.txt, that looks like this:
$ cat Type1.txt
ID.580.G3C0
TTTTTTTTTTT
ID.580.G3C8
ATTATATC-AAA
ID.580.GXC16
ATTATTTC-ACG-TTTTTCCTA
ID.694.G9C3
ATTATATC-ACG-AAATCCTA
ID.694.G9C3
etc...
I want to write a bash script to count the instances of each ID and export it into another file that provides a summary, something like this:
ID.580 = 3
ID.694 = 1
etc...
So far the script is messy and unusable.
For the above I have the following:
#!/bin/bash
for Count in `grep -c "ID.580" Type1.txt; do
echo $Count=ID.580
done > Result.txt #Allows to count only for that single ID.
I have over a thousand ID.XXX, making this code unusable since it's not plausible to add individual ID.XXX for each search. Thank you for the help!
Shell
The code below uses the standard UNIX utilities, and does not assume that the second part of the ID is exactly 3 characters; it will find ID.1.123123123 and ID.1234.123123 and properly take only the first two dot-delimited parts.
grep '^ID\.[0-9]' Type1.txt | cut -d . -f 1-2 | sort \
| uniq -c | awk '{ print $2" = "$1 }'
grep filters only lines beginning with ID. followed by 1 digit (at least)
cut uses . as the field delimiter, and only outputting fields 1 and 2, thus removing
everything after and including the second . on the line.
sort sorts the lines for uniq to work
uniq prints each line from its input prefixed with a count
awk part reverses these fields and prints them separated with =.
If the first part of the ID can contain letters too, change the end of the regular expression from [0-9] to [0-9A-Z], for example.
The pipeline outputs
ID.580 = 3
ID.694 = 2
Python
As Python is popular among biologists, you might want to hone your Python skills instead:
from collections import Counter
counter = Counter()
with open('Type1.txt') as f:
for line in f:
if line.startswith('ID.'):
top_id = '.'.join(line.split('.', 2)[:2])
counter[top_id] += 1
for top_id, count in sorted(counter.items()):
print("%s = %d" % (top_id, count))
The results are exactly identical.
grep '^ID.[0-9][0-9][0-9]' input_file | cut -c1-6 | sort | uniq -c
works?
TL;DR
Given your particular corpus and grouping strategy, there's more than one way to get the results you need. Here are two alternative solutions, one in awk, and one in Ruby.
GNU awk
One way is to use GNU awk to perform the following steps:
match just the ID lines
split matching input lines into fields
select and print the fields you need
sort the lines in the filtered result
count the adjacent duplicates
perform any specialized formatting on the result
For example:
$ awk '/^ID/ {split($0, a, "."); print a[1] "." a[2]}' /tmp/foo |
sort | uniq --count | awk '{print $2 " = " $1}'
ID.580 = 3
ID.694 = 2
With the corpus you provided in your question, this takes an average of 8 ms on my system. A larger corpus will take longer, of course, but unless you have a really huge data set this should be fast enough for most purposes.
Ruby
Ruby offers what I consider a more elegant solution, but is in fact slower. The idea here is to store the relevant portion of your IDs as hash keys, and increment a counter each time you encounter a given ID. For example, consider this Ruby one-liner:
$ ruby -ne 'BEGIN { id = Hash.new(0) }
id[$&] += 1 if /\AID\.\d+/
END { id.each_pair do |k,v| puts "#{k} = #{v}" end }' /tmp/foo
ID.580 = 3
ID.694 = 2
This solution takes around 45 ms to process the same corpus, so I wouldn't recommend it over the awk pipeline just for transforming output. The main advantage to doing it this way is that you have an actual data structure (e.g. a Hash object) that you could manipulate in a more full-featured program.
Here is awk one liner:
$ awk -F. '$1=="ID"{a[$2,$3]++}END{for (i in a) {split(i,ind,SUBSEP); r[ind[1]]++}for (i in r) print "ID."i" = "r[i]}' file
ID.694 = 1
ID.580 = 3
And here is a pure bash solution:
#!/bin/bash
while IFS=. read -r pre id code rest
do
[[ $pre == ID ]] || continue
[[ ${a[$id]} =~ \."$code"\. ]] || {
a[$id]="${a[$id]}.$code."
((count[$id]++));
}
done < file
for i in "${!count[@]}"
do
echo "ID.$i = ${count[$i]}"
done
$ ./script.sh
ID.580 = 3
ID.694 = 1
awk might work too...
awk '/ID.580/{x++}END{print x}' test.txt
You can put this in a for loop
for i in ID.580 ID.694
do
awk '/'$i'/{x++}END{print x}' test.txt
done
