If xargs is map, what is filter? - shell

I think of xargs as the map function of the UNIX shell. What is the filter function?
EDIT: it looks like I'll have to be a bit more explicit.
Let's say I have to hand a program which accepts a single string as a parameter and returns with an exit code of 0 or 1. This program will act as a predicate over the strings that it accepts.
For example, I might decide to interpret the string parameter as a filepath, and define the predicate to be "does this file exist". In this case, the program could be test -f, which, given a string, exits with 0 if the file exists, and 1 otherwise.
I also have to hand a stream of strings. For example, I might have a file ~/paths containing
/etc/apache2/apache2.conf
/foo/bar/baz
/etc/hosts
Now, I want to create a new file, ~/existing_paths, containing only those paths that exist on my filesystem. In my case, that would be
/etc/apache2/apache2.conf
/etc/hosts
I want to do this by reading in the ~/paths file, filtering those lines by the predicate test -f, and writing the output to ~/existing_paths. By analogy with xargs, this would look like:
cat ~/paths | xfilter test -f > ~/existing_paths
It is the hypothesized program xfilter that I am looking for:
xfilter COMMAND [ARG]...
Which, for each line L of its standard input, will call COMMAND [ARG]... L, and if the exit code is 0, it prints L, else it prints nothing.
To be clear, I am not looking for:
a way to filter a list of filepaths by existence. That was a specific example.
how to write such a program. I can do that.
I am looking for either:
a pre-existing implementation, like xargs, or
a clear explanation of why this doesn't exist

If map is xargs, filter is... still xargs.
Example: list files in the current directory and filter out non-executable files:
ls | xargs -I{} sh -c "test -x '{}' && echo '{}'"
This could be made handy trough a (non production-ready) function:
xfilter() {
xargs -I{} sh -c "$* '{}' && echo '{}'"
}
ls | xfilter test -x
Alternatively, you could use a parallel filter implementation via GNU Parallel:
ls | parallel "test -x '{}' && echo '{}'"

So, youre looking for the:
reduce( compare( filter( map(.. list()) ) ) )
what can be rewiritten as
list | map | filter | compare | reduce
The main power of bash is a pipelining, therefore isn't need to have one special filter and/or reduce command. In fact nearly all unix commands could act in one (or more) functions as:
list
map
filter
reduce
Imagine:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
Creating a test case:
mkdir ./testcase
cd ./testcase || exit 1
for i in {1..10}
do
strings -1 < /dev/random | head -1000 > file.$i.txt
done
mkdir emptydir
You will get a directory named testcase and in this directory 10 files and one directory
emptydir file.1.txt file.10.txt file.2.txt file.3.txt file.4.txt file.5.txt file.6.txt file.7.txt file.8.txt file.9.txt
each file contains 1000 lines of random strings some lines are contains only numbers
now run the command
find testcase -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
and you will get the largest number-only line from each files like: 42. (of course, this can be done more effectively, this is only for demo)
decomposed:
The find testcase -type f -print will print every plain files so, LIST (and reduced only to files). ouput:
testcase/file.1.txt
testcase/file.10.txt
testcase/file.2.txt
testcase/file.3.txt
testcase/file.4.txt
testcase/file.5.txt
testcase/file.6.txt
testcase/file.7.txt
testcase/file.8.txt
testcase/file.9.txt
the xargs grep -H '^[0-9]*$' as MAP will run a grep command for each file from a list. The grep is usually using as filter, e.g: command | grep, but now (with xargs) changes the input (filenames) to (lines containing only digits). Output, many lines like:
testcase/file.1.txt:1
testcase/file.1.txt:8
....
testcase/file.9.txt:4
testcase/file.9.txt:5
structure of lines: filename colon number, want only numbers so calling a pure filter, what strips out the filenames from each line cut -d: -f2. It outputs many lines like:
1
8
...
4
5
Now the reduce (getting the largest number), the sort -nr sorts all number numerically and reverse order (desc), so its output is like:
42
18
9
9
...
0
0
and the head -1 print the first line (the largest number).
Of course, you can write your own list/filter/map/reduce functions directly with bash programming constructions (loops, conditions and such), or you can employ any fullblown scripting language like perl, special languages like awk, sed "language", or dc (rpn) and such.
Having an special filter command such:
list | filter_command cut -d: -f 2
is simple doesn't needed, because you can use directly the
list | cut

You can have awk do the filter and reduce function.
Filter:
awk 'NR % 2 { $0 = $0 " [EVEN]" } 1'
Reduce:
awk '{ p = p + $0 } END { print p }'

I totally understand your question here as a long time functional programmer and here is the answer: Bash/unix command pipelining isn't as clean as you'd hoped.
In the example above:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
a more pure form would look like:
find mydir | xargs -L 1 bash -c 'test -f $1 && echo $1' _ | grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^---list--^^-------filter---------------------------------^^------map----------^^--map-------^ ^reduce^
But, for example, grep also has a filtering capability: grep -q mypattern which simply return 0 if it matches the pattern.
To get a something more like what you want, you simply would have to define a filter bash function and make sure to export it so it was compatible with xargs
But then you get into some problems. Like, test has binary and unary operators. How will your filter function handle this? Hand, what would you decide to output on true for these cases? Not insurmountable, but weird. Assuming only unary operations:
filter(){
while read -r LINE || [[ -n "${LINE}" ]]; do
eval "[[ ${LINE} $1 ]]" 2> /dev/null && echo "$LINE"
done
}
so you could do something like
seq 1 10 | filter "> 4"
5
6
7
8
9
As I wrote this I kinda liked it

Related

How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contains a,b,c and the chars appear in the word 3 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${#: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$#"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[#]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[#]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' ' ' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=occur)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

How to append lots of variables to one variable with a simple command

I want to stick all the variables into one variable
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
#^^lets pretend theres 100 more of these ^^
#Variable composition
#after AAA, is AAB then AAC then AAD etc etc, does that 100 times
I want them all placed into this MASTER variable
#MASTER=${A}${AA}${AAA} (<-- insert AAB, AAC and 100 more variables here)
I obviously don't want to type 100 variables in this expression because there's probably an easier way to do this. Plus I'm gonna be doing more of these therefore I need it automated.
I'm relatively new to sed, awk, is there a way to append those 100 variables into the master variable?
For this specific purpose I DO NOT want an array.
You can use a simple one-liner, quite straightforward, though more expensive:
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
set lists all the variables in var=name format
grep filters out the variables we need
sort puts them in the right order (probably optional since set gives a sorted output)
cut extracts the values, removing the variable names
tr removes the newlines
Let's test it.
A=1
AA=2
AAA=3
AAB=4
AAC=5
AAD=6
AAAA=99 # just to make sure we don't pick this one up
master=$(set | grep -E '^(A|AA|A[A-D][A-D])=' | sort | cut -f2- -d= | tr -d '\n')
echo "$master"
Output:
123456
With my best guess, how about:
#!/bin/bash
A=('blah')
AA=('blah2')
AAA=('blah3')
AAB=('blah4')
AAC=('blah5')
# to be continued ..
for varname in A AA A{A..D}{A..Z}; do
value=${!varname}
if [ -n "$value" ]; then
MASTER+=$value
fi
done
echo $MASTER
which yields:
blahblah2blah3blah4blah5...
Although I'm not sure whether this is what the OP wants.
echo {a..z}{a..z}{a..z} | tr ' ' '\n' | head -n 100 | tail -n 3
adt
adu
adv
tells us, that it would go from AAA to ADV to reach 100, or for ADY for 103.
echo A{A..D}{A..Z} | sed 's/ /}${/g'
AAA}${AAB}${AAC}${AAD}${AAE}${AAF}${AAG}${AAH}${AAI}${AAJ}${AAK}${AAL}${AAM}${AAN}${AAO}${AAP}${AAQ}${AAR}${AAS}${AAT}${AAU}${AAV}${AAW}${AAX}${AAY}${AAZ}${ABA}${ABB}${ABC}${ABD}${ABE}${ABF}${ABG}${ABH}${ABI}${ABJ}${ABK}${ABL}${ABM}${ABN}${ABO}${ABP}${ABQ}${ABR}${ABS}${ABT}${ABU}${ABV}${ABW}${ABX}${ABY}${ABZ}${ACA}${ACB}${ACC}${ACD}${ACE}${ACF}${ACG}${ACH}${ACI}${ACJ}${ACK}${ACL}${ACM}${ACN}${ACO}${ACP}${ACQ}${ACR}${ACS}${ACT}${ACU}${ACV}${ACW}${ACX}${ACY}${ACZ}${ADA}${ADB}${ADC}${ADD}${ADE}${ADF}${ADG}${ADH}${ADI}${ADJ}${ADK}${ADL}${ADM}${ADN}${ADO}${ADP}${ADQ}${ADR}${ADS}${ADT}${ADU}${ADV}${ADW}${ADX}${ADY}${ADZ
The final cosmetics is easily made by hand.
One-liner using a for loop:
for n in A AA A{A..D}{A..Z}; do str+="${!n}"; done; echo ${str}
Output:
blahblah2blah3blah4blah5
Say you have the input file inputfile.txt with arbitrary variable names and values:
name="Joe"
last="Doe"
A="blah"
AA="blah2
then do:
master=$(eval echo $(grep -o "^[^=]\+" inputfile.txt | sed 's/^/\$/;:a;N;$!ba;s/\n/$/g'))
This will concatenate the values of all variables in inputfile.txt into master variable. So you will have:
>echo $master
JoeDoeblahblah2

How do I concatenate and rename split file pairs with a loop?

I have a directory that looks like this:
S-100-1-54359386-left.fastq.gz
S-100-1-54469454-left.fastq.gz
S-20-1-54356384-left.fastq.gz
S-20-1-54468477-left.fastq.gz
S-40-1-54343370-left.fastq.gz
S-40-1-54465479-left.fastq.gz
S-100-2-54359386-left.fastq.gz
S-100-2-54469454-left.fastq.gz
S-20-2-54356384-left.fastq.gz
S-20-2-54468477-left.fastq.gz
S-40-2-54343370-left.fastq.gz
S-40-2-54465479-left.fastq.gz
Each pair of consecutive files need to be concatenated and given a unique name. I can use the following for each pair:
zcat S-40-2-54343370-left.fastq.gz S-40-2-54465479-left.fastq.gz | \
gzip -c > S-40-2.left.fq.gz
...but I would like something more elegant. Notice that the number after the second dash also changes. The number after the third dash (54359386, etc.) makes each file unique, but doesn't need to be preserved after they are concatenated. Any advice? I'm not sure how to make a loop structured to identify the pairs.
Assuming that each group can contain only 2 files.
complex pipeline:
echo S-*.fastq.gz | tr ' ' '\n' | sort -t'-' -k1,3n | xargs -n2 bash \
-c 'fn=$(cut -d"-" -f1-3 <<<$0)".left.fq.gz"; zcat "$0" "$1" | gzip -c > "$fn"'
sort -t'-' -k1,3n - sort file list by the first 3 fields treating - as field separator
-n2 - xargs option, tells to use at most 2 arguments per command line

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directories contains a lots of file:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The purpose if to have the first file from each directories
The result what I expect is:
13R|20140828
613|20131114
AB1|20141114
Which is the name of the directories pipe the date from the filename.
I guess I need a find and head command + awk but I can't make it, I need your help.
Here what I have test it
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
When I mean the first file, is the first file returned in an alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
cd "$prefix"/"$fn"
arr=(*)
echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different that Bash's, just sort the array arr according to your definition before taking the first element.
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk) to sort as desired
sort has an unique option. Only the directory should be unique, so use the first field in sorting -k1,1. The solution works when the list of files is sorted already.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command when the date field may be followed by another number.
This works for me:
for dir in $(find "$FOLDER" -type d); do
FILE=$(ls -1 -p $dir | grep -v / | head -n1)
if [ ! -z "$FILE" ]; then
echo "$dir/$FILE"
fi
done

Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address in each line:
10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1
You need a shell script that counts for each IP address how many times it appears in the file. For the previous input you need the following output:
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
One way to do this is:
cat ip_addresses |uniq |while read ip
do
echo -n $ip" "
grep -c $ip ip_addresses
done
However it is really far from being efficient.
How would you solve this problem more efficiently using bash?
(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)
ADDITIONAL INFO:
Suppose that the source file is 5GB and the machine running the algorithm has 4GB. So sort is not an efficient solution, neither is reading the file more than once.
I liked the hashtable-like solution - anybody can provide improvements to that solution?
ADDITIONAL INFO #2:
Some people asked why would I bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine I had to do this perl wasn't available for me. It was a custom built linux machine without most of the tools I'm used to. And I think it was an interesting problem.
So please, don't blame the question, just ignore it if you don't like it. :-)
sort ip_addresses | uniq -c
This will print the count first, but other than that it should be exactly what you want.
The quick and dirty method is as follows:
cat ip_addresses | sort -n | uniq -c
If you need to use the values in bash you can assign the whole command to a bash variable and then loop through the results.
PS
If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.
for summing up multiple fields, based on a group of existing fields, use the example below : ( replace the $1, $2, $3, $4 according to your requirements )
cat file
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000
awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file
US|A|3000
US|B|3000
US|C|3000
UK|1|9000
The canonical solution is the one mentioned by another respondent:
sort | uniq -c
It is shorter and more concise than what can be written in Perl or awk.
You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'
this command would give you desired output
Solution ( group by like mysql)
grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n
Result
3249 googleplus
4211 linkedin
5212 xing
7928 facebook
It seems that you have to either use a big amount of code to simulate hashes in bash to get linear behavior or stick to the quadratic superlinear versions.
Among those versions, saua's solution is the best (and simplest):
sort -n ip_addresses.txt | uniq -c
I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
I feel awk associative array is also handy in this case
$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt
A group by post here
You probably can use the file system itself as a hash table. Pseudo-code as follows:
for every entry in the ip address file; do
let addr denote the ip address;
if file "addr" does not exist; then
create file "addr";
write a number "0" in the file;
else
read the number from "addr";
increase the number by 1 and write it back;
fi
done
In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
Most of the other solutions count duplicates. If you really need to group key value pairs, try this:
Here is my example data:
find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
This will print the key value pairs grouped by the md5 checksum.
cat table.txt | awk '{print $1}' | sort | uniq | xargs -i grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
GROUP BY under bash
Regarding this SO thread, there are some different answer regarding different needs.
1. Counting IP as SO request (GROUP BY IP address).
As IP are easy to convert to single integer, for small bunch of address, if you need to repeat this kind of operation many time, using a pure bash function could be a lot more efficient!
Pure bash (no fork!)
There is a way, using a bash function. This way is very quick as there is no fork!...
countIp () {
local -a _ips=(); local _a
while IFS=. read -a _a ;do
((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
done
for _a in ${!_ips[#]} ;do
printf "%.16s %4d\n" \
$(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
done
}
Note: IP addresses are converted to 32bits unsigned integer value, used as index for array. This use simple bash arrays!
time countIp < ip_addresses
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
real 0m0.001s
user 0m0.004s
sys 0m0.000s
time sort ip_addresses | uniq -c
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
real 0m0.010s
user 0m0.000s
sys 0m0.000s
On my host, doing so is a lot quicker than using forks, upto approx 1'000 addresses, but take approx 1 entire second when I'll try to sort'n count 10'000 addresses.
2. GROUP BY duplicates (files content)
By using checksum you could indentfy duplicate files somewhere:
find . -type f -exec sha1sum {} + |
sort |
sed '
:a;
$s/^[^ ]\+ \+//;
N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
ta;
s/^[^ ]\+ \+//;
P;
D;
ba
'
This will print all duplicates, by line, separated by Tabulation ($'\t' or octal 011 ou could change /\1 \2\o11\3/; by /\1 \2|\3/; for using | as separator).
./b.txt ./e.txt
./a.txt ./c.txt ./d.txt
Could be written as (with | as separator):
find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'
Pure bash way
By using nameref, you could build bash arrays holding all duplicates:
declare -iA sums='()'
while IFS=' ' read -r sum file ;do
declare -n list=_LST_$sum
list+=("$file")
sums[$sum]+=1
done < <(
find . -type f -exec sha1sum {} +
)
From there, you have a bunch of arrays holding all duplicates file name as separated element:
for i in ${!sums[#]};do
declare -n list=_LST_$i
printf "%d %d %s\n" ${sums[$i]} ${#list[#]} "${list[*]}"
done
This may output something like:
2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt
Where count of files by md5sum (${sums[$shasum]}) match count of element in arrays ${_LST_ShAsUm[#]}.
for i in ${!sums[#]};do
declare -n list=_LST_$i
echo ${list[#]#A}
done
declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")
Note that this method could handle spaces and special characters in filenames!
3. GROUP BY columns in a table
As efficient sample using awk was provided by Anonymous, here is a pure bash solution.
So you want to sumarize columns 3 to last column and group by columns 1 and 2, having table.txt looking like
US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000
For not too big tables, you could:
myfunc() {
local -iA restabl='()';
local IFS=+
while IFS=\| read -ra ar; do
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${!restabl[#]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
Could ouput something like:
myfunc <table.txt
UK|1|19000
US|A|3000
US|C|3000
US|B|3000
And to have table sorted:
myfunc() {
local -iA restabl='()';
local IFS=+ sorted=()
while IFS=\| read -ra ar; do
sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
done
for i in ${sorted[#]} ;do
printf '%s|%s\n' "$i" "${restabl[$i]}"
done
}
Must return:
myfunc <table
UK|1|19000
US|A|3000
US|B|3000
US|C|3000
I'd have done it like this:
perl -e 'while (<>) {chop; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses
but uniq might work for you.
Importing data to sqlite db and using sql syntax (just an other idea).
I know it's too much for this example but would be useful for complex queries with multiple files (tables)
#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }
# add header to input_file (IP)
INPUT_FILE=ips.txt
# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"
# using sql statements on table 'mytable'
sqlite3 mydb$$ -separator " " "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"
10.0.10.1 3
10.0.10.2 1
10.0.10.3 1
I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:
mySet = set()
for line in open("ip_address_file.txt"):
line = line.rstrip()
mySet.add(line)
As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.
Edit:
I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that would count occurences.
mydict = {}
for line in open("ip_address_file.txt"):
line = line.rstrip()
if line in mydict:
mydict[line] += 1
else:
mydict[line] = 1
The dictionary mydict now holds a list of unique IP's as keys and the amount of times they occurred as their values.
This does not answer the count element of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone as it relates to 'group by' functionality.
I wanted to order files based on groupings of them, where the presence of some string in the filename determined the group.
It uses a temporary grouping/ordering prefix which is removed after ordering; sed substitute expressions (s#pattern#replacement#g) match the target string and prepend an integer to the line corresponding to the desired sort order of that target string. Then, grouping prefix is removed with cut.
Note that the sed expressions could be joined (e.g. sed -e '<expr>; <expr>; <expr>;') but here they're split for readability.
It's not pretty and probably not fast (I'm dealing with <50 items) but it at-least conceptually simple and doesn't require learning awk.
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
| sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
e.g. Input
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e
#!/usr/bin/env bash
for line in $(find /etc \
| sed -E -e "s#^(.*special.*)#10:\1#;" \
| sed -E -e "s#^(.*another.*)#05:\1#;" \
| sed -E -e "s#^/(.*)#00:/\1#;" \
| sort \
| cut -c4-
)
do
echo "${line}"
done
/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d
A combination of awk + sort (with version sort flag) is probably fastest (if ur environment has awk at all):
echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V
Only the post GROUP-BY summary rows are being sent to the sorting utility - which is far less system intensive sort compared to pre-sorting the input rows for no reason.
For small inputs, e.g. fewer than 1000 rows or so, just directly sort|uniq -c it.
3 10.0.10.1
1 10.0.10.2
1 10.0.10.3
Sort may be omitted if order is not significant
uniq -c <source_file>
or
echo "$list" | uniq -c
if the source list is a variable

Resources