I have the file "test.txt" (arbitrary number of lines):
$ cat test.txt
A
B
C
I would like to find a bash code to generate all possible combinations with n elements, where n >= 2, starting with all elements (i.e. number of lines, X), so that n = X, n = X-1, n = X-2, n = X-3, ..., n = 2, which in the case above would be:
A,B,C
A,B
A,C
B,C
Any suggestions?
Many thanks!
Reusing the get_combs() function from https://stackoverflow.com/a/56916316/1745001:
$ cat tst.awk
###################
# Calculate all combinations of a set of strings, see
# https://rosettacode.org/wiki/Combinations#AWK
###################
function get_combs(A,B, i,n,comb) {
## Default value for r is to choose 2 from pool of all elements in A.
## Can alternatively be set on the command line:-
## awk -v r=<number of items being chosen> -f <scriptname>
n = length(A)
if (r=="") r = 2
comb = ""
for (i=1; i <= r; i++) { ## First combination of items:
indices[i] = i
comb = (i>1 ? comb OFS : "") A[indices[i]]
}
B[comb]
## While 1st item is less than its maximum permitted value...
while (indices[1] < n - r + 1) {
## loop backwards through all items in the previous
## combination of items until an item is found that is
## less than its maximum permitted value:
for (i = r; i >= 1; i--) {
## If the equivalently positioned item in the
## previous combination of items is less than its
## maximum permitted value...
if (indices[i] < n - r + i) {
## increment the current item by 1:
indices[i]++
## Save the current position-index for use
## outside this "for" loop:
p = i
break}}
## Put consecutive numbers in the remainder of the array,
## counting up from position-index p.
for (i = p + 1; i <= r; i++) indices[i] = indices[i - 1] + 1
## Print the current combination of items:
comb = ""
for (i=1; i <= r; i++) {
comb = (i>1 ? comb OFS : "") A[indices[i]]
}
B[comb]
}
}
# Input should be a list of strings
{ A[NR] = $0 }
END {
OFS = ","
for (r=NR; r>=2; r--) {
delete B
get_combs(A,B)
PROCINFO["sorted_in"] = "@ind_str_asc"
for (comb in B) {
print comb
}
}
}
$ awk -f tst.awk test.txt
A,B,C
A,B
A,C
B,C
with the binary counter trick to iterate all subsets...
$ awk '{a[NR]=$1}
END {for(i=0;i<2^NR;i++)
{printf "{";
for(j=0;j<NR;j++) printf "%s", and(i,2^j)?FS a[j+1]:"";
print " }"}}' file
{ }
{ A }
{ B }
{ A B }
{ C }
{ A C }
{ B C }
{ A B C }
{ D }
{ A D }
{ B D }
{ A B D }
{ C D }
{ A C D }
{ B C D }
{ A B C D }
To get the desired format, join each subset with commas and pipe to a second awk that drops subsets with fewer than 2 elements:
awk -v FS=, '{a[NR]=$1}
END {for(i=0;i<2^NR;i++)
{s="";
for(j=0;j<NR;j++)
{e=and(i,2^j);
printf "%s", e?s a[j+1]:""; if(e)s=FS}
print ""}}' file |
awk -F, 'NF>1'
A,B
A,C
B,C
A,B,C
A,D
B,D
A,B,D
C,D
A,C,D
B,C,D
A,B,C,D
How does it work?
All combinations are equivalent to all subsets of the given elements. This in turn can be enumerated (tagged) with 0..2^n-1.
If we represent the enumeration counter in binary, each position bit can be mapped to an element of the full set. So when running the enumeration on all subsets we can create a particular subset with the elements where the corresponding bit is set for a given tag.
For example for a 3 element initial set {A,B,C}. We have the enumeration
0 0 0 -> no elements, empty subset -> { }
0 0 1 -> A bit is set -> { A }
0 1 0 -> B bit is set -> { B }
0 1 1 -> Both A and B bits are set -> { A B }
... etc
the rest is just formatting.
This method is good for generating all combinations. With constraints that reduce the choices (e.g. exactly 3 elements) it is not very efficient, because every one of the 2^N subsets is still visited. There is also a practical upper bound on N, since the counter has to run up to 2^N.
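The bit-to-element mapping described above can be sketched without the gawk-only and() function by testing bit j with integer division. The function name subsets_min2 is hypothetical, and the n >= 2 filter is applied inside the same awk instead of in a second process; this is a sketch under those assumptions, not the answer's original code.

```shell
# subsets_min2: read one element per line on stdin and print every subset
# with at least 2 elements, comma-separated, one subset per line.
# Hypothetical helper name; uses only POSIX awk features.
subsets_min2() {
    awk '
    { a[NR] = $0 }
    END {
        for (i = 0; i < 2^NR; i++) {
            line = ""; cnt = 0
            for (j = 0; j < NR; j++) {
                if (int(i / 2^j) % 2) {          # is bit j set in counter i?
                    line = line (cnt++ ? "," : "") a[j+1]
                }
            }
            if (cnt >= 2) print line             # skip empty set and singletons
        }
    }'
}
```

Usage: subsets_min2 < test.txt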
We can use the following awk to print every ordered pair of lines (including each line paired with itself):
awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' test.txt test.txt
Then, we can use split to write each line into a separate file:
awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' test.txt test.txt > tmp.txt
split -l 1 -d tmp.txt "test-"
rm tmp.txt
Example on my local machine:
$
$ cat test.txt
A
B
C
$
$ awk 'NR==FNR { a[$0]; next } { for (i in a) print i, $0 }' test.txt test.txt > tmp.txt
$ split -l 1 -d tmp.txt "test-"
$ rm tmp.txt
$
$ tail -n +1 *
==> test-00 <==
A A
==> test-01 <==
B A
==> test-02 <==
C A
==> test-03 <==
A B
==> test-04 <==
B B
==> test-05 <==
C B
==> test-06 <==
A C
==> test-07 <==
B C
==> test-08 <==
C C
==> test.txt <==
A
B
C
$
A bash recursive function: this won't be the fastest solution
all_combinations() {
(($# == 0)) && return
(IFS=,; echo "$*")
local i x
for ((i=0; i<$#; i++)); do
x=("$@")
unset 'x[i]'
"${FUNCNAME[0]}" "${x[@]}"
done
}
combinations() {
all_combinations "$@" |
grep , | # at least 2 elements
sort -u | # remove duplicates
while IFS= read -r line; do # print number of elements
printf '%d\t%s\n' \
$(commas=${line//[^,]/}; echo ${#commas}) \
"$line"
done |
sort -k1,1nr -k2 | # sort by num + line
cut -f2- # remove num
}
mapfile -t lines < test.txt
combinations "${lines[@]}"
If test.txt contains 4 lines, this produces
A,B,C,D
A,B,C
A,B,D
A,C,D
B,C,D
A,B
A,C
A,D
B,C
B,D
C,D
Assumptions:
the string A,B,C is considered to be equivalent to A,C,B, B,A,C, B,C,A, C,A,B and C,B,A (ie, we only need to generate one of these 6 combinations)
One idea for generating a list of combinations ...
we'll load lines into an array
for each item in the array we'll start a new set of output strings
make recursive calls to append the next array item to our output string
as we're appending array items to the output we'll go ahead and print each string that consists of 2 or more array items
this is more of a tail recursion method which should eliminate the generation of duplicates (A,B,C vs the other 5 equivalent patterns) and/or the need to roll back "already seen" combos
One awk idea for implementing this logic:
awk '
# input params are current output string, current arr[] index, and current output length (ie, number of fields)
# parameter "i" will be treated as a local variable
function combo(output, j, combo_length, i) {
if ( combo_length >= 2) # print any combination with a length >= 2
print output
for (i=j+1; i<=n; i++) # loop through "rest" of array entries for next field in output
combo(output "," arr[i], i, combo_length+1 )
}
{ arr[NR]=$1 } # load fields into array "arr[]"
END { n=length(arr)
for (i=1; i<=n; i++) # for each arr[i] start a new set of combos starting with arr[i]
combo(arr[i],i,1)
}
' test.txt
This generates:
A,B
A,B,C
A,B,C,D
A,B,D
A,C
A,C,D
A,D
B,C
B,C,D
B,D
C,D
If we want to sort based on number of fields and then the output string we can make the following change:
change print output to print combo_length, output and then ...
pipe the awk output through sort | cut (we'll borrow glenn's code here)
This generates:
$ awk ' ... print combo_length,output ...' test.txt | sort -k1,1nr -k2 | cut -d" " -f2-
A,B,C,D
A,B,C
A,B,D
A,C,D
B,C,D
A,B
A,C
A,D
B,C
B,D
C,D
For a 20-line test.txt (letters A to T) with the output dumped to file test.out:
$ time awk '...' test.txt > test.out
real 0m1.420s
user 0m1.279s
sys 0m0.139s
$ wc test.out
1048555 2097110 23685256 test.out
$ time awk '...' test.txt | sort ... | cut ... > test.out
real 0m3.456s
user 0m3.493s
sys 0m0.185s
$ wc test.out
1048555 1048555 20971480 test.out
I'm trying to write an AWK command that allows me to perform matrix multiplication between two tab separated files.
example:
cat m1
1 2 3 4
5 6 7 8
cat m2
1 2
3 4
5 6
7 8
desired output:
50 60
114 140
Without any validation of the input files' sizes, it is easier to break this into two scripts: one that transposes the second matrix and one that computes the dot products of the vectors. To simplify the awk code, you can also resort to join.
$ awk '{m=NF/2; for(i=1;i<=m;i++) sum[NR] += $i*$(i+m)}
END {n=sqrt(NR) # assumes a square (n x n) result
for(i=1;i<=NR;i++)
printf "%s", sum[i] (i%n?OFS:ORS)}' <(join -j99 m1 <(transpose m2))
where transpose function is defined as
$ function transpose() { awk '{for(j=1;j<=NF;j++) a[NR,j]=$j}
END {for(i=1;i<=NF;i++)
for(j=1;j<=NR;j++)
printf "%s",a[j,i] (j==NR?ORS:OFS)}' "$1"; }
I would suggest going with GNU Octave:
octave --eval 'load("m1"); load("m2"); m1*m2'
Output:
ans =
50 60
114 140
However, assuming well-formatted files you can do it like this with GNU awk:
matrix-mult.awk
ARGIND == 1 {
for(i=1; i<=NF; i++)
m1[FNR][i] = $i
m1_width = NF
m1_height = FNR
}
ARGIND == 2 {
for(i=1; i<=NF; i++)
m2[FNR][i] = $i
m2_width = NF
m2_height = FNR
}
END {
if(m1_width != m2_height) {
print "Matrices are incompatible, unable to multiply!"
exit 1
}
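for(i=1; i<=m1_height; i++) {
for(j=1; j<=m2_width; j++) {
sum = 0
for(k=1; k<=m1_width; k++)
sum += m1[i][k] * m2[k][j]
# print the cell as data, not as a printf format, and avoid a trailing separator
printf "%s%s", sum, (j<m2_width ? OFS : ORS)
}
}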
}
Run it like this:
awk -f matrix-mult.awk m1 m2
Output:
50 60
114 140
If you process the second matrix before the first matrix, then you don't have to transpose the second matrix or to store both matrices in an array:
awk 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;w=NF;next}{for(i=1;i<=w;i++){s=0;for(j=1;j<=NF;j++)s+=$j*a[j,i];printf"%s"(i==w?RS:FS),s}}' m2 m1
When I replaced multidimensional arrays with arrays of arrays by replacing a[NR,i] with a[NR][i] and a[j,i] with a[j][i], it made the code about twice as fast in gawk. But arrays of arrays are not supported by nawk, which is /usr/bin/awk on macOS.
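An expanded, commented form of the same idea (read the second matrix first, then stream the first one) might look like the sketch below. The function name matmul and the row-length check are assumptions added for illustration; only POSIX awk features are used, so it should not need gawk.

```shell
# matmul LEFT RIGHT: print LEFT x RIGHT (hypothetical helper name).
# RIGHT is read first and stored; LEFT is then streamed row by row.
matmul() {
    awk '
    NR == FNR {                       # first file on the command line: RIGHT
        for (i = 1; i <= NF; i++) b[FNR, i] = $i
        w = NF; h = FNR               # columns and rows of RIGHT
        next
    }
    NF != h {                         # each LEFT row needs h entries
        print "row " FNR ": expected " h " fields, got " NF > "/dev/stderr"
        exit 1
    }
    {
        for (i = 1; i <= w; i++) {
            s = 0
            for (j = 1; j <= NF; j++) s += $j * b[j, i]
            printf "%s%s", s, (i < w ? OFS : ORS)
        }
    }' "$2" "$1"
}
```

Usage: matmul m1 m2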
Or another option is to use R:
Rscript -e 'as.matrix(read.table("m1"))%*%as.matrix(read.table("m2"))'
Or this gets the names of the input files as command line arguments and prints the result without column names or row names:
Rscript -e 'write.table(Reduce(`%*%`,lapply(commandArgs(T),function(x)as.matrix(read.table(x)))),col.names=F,row.names=F)' m1 m2
I made a Bash script that extracts words from a text file with grep and sed and then sorts them with sort and counts the repetitions with wc, then sort again by frequency. The example output looks like this:
12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy
Now I'd like to merge all words with the same frequency into one line, like this:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script or program in a more sophisticated scripting language?
With awk:
$ echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
You can do something similar with Bash 4 associative arrays. awk is easier and POSIX though. Use that.
Explanation:
awk splits the line apart by the separator in FS, in this case the default of horizontal whitespace;
$1 is the first field of the count - use that to collect items with the same count in an associative array keyed by the count with cnt[$1];
cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment - if cnt[$1] has no value, just assign the second field $2 to it (The RH of :). If it does have a previous value, concatenate $2 separated by the value of OFS (the LH of :);
At the end, print out the value of the associative array.
Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.
If you want the digits to be right justified (as you have in your example):
$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc" prior to traversing the array:
$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {PROCINFO["sorted_in"]="@ind_num_desc"
for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
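The Bash 4 associative-array alternative mentioned in this answer could be sketched roughly as follows. The function name group_by_count is hypothetical, and the code assumes bash 4 or later and input lines of the form "count word"; awk remains the simpler choice.

```shell
# group_by_count: bash 4+ sketch (hypothetical function name).
# Reads "count word" lines on stdin and prints one line per count.
group_by_count() {
    local -A words=()                 # associative array keyed by count
    local count word
    while read -r count word; do
        [ -n "$count" ] || continue   # skip blank lines
        # append with a space only when the entry already has a value
        words[$count]="${words[$count]:+${words[$count]} }$word"
    done
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -nr                   # array order is unspecified, so sort
}
```

Usage: group_by_count < file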
With single GNU awk expression (without sort pipeline):
awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
{ a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file
The output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Bonus alternative solution using GNU datamash tool:
datamash -W -g1 collapse 2 <file
The output (comma-separated collapsed fields):
12 the
7 code,with,add
5 quite
3 do,well
1 quick,can,pick,easy
awk:
awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file
sed:
sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'
You start with sorted data, so you only need a new line when the first field changes.
echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" |
awk '
{
if ($1==last) {
printf(" %s",$2)
} else {
last=$1;
printf("%s%s",(NR>1?"\n":""),$0)
}
} END {print ""}'
Next time you find yourself trying to manipulate text with a combination of grep and sed and shell and..., stop and just use awk instead: the end result will be clearer, simpler, more efficient, more portable, etc.
$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
# printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 was
4 the
4 of
4 it
2 times
2 age
1 worst
1 wisdom
1 foolishness
1 best
.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
# printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 it was of the
2 age times
1 best worst wisdom foolishness
Just uncomment whichever printf line you like in the above script to get whichever type of output you want. The above will work in any awk on any UNIX system.
Using miller's nest verb:
mlr -p nest --implode --values --across-records -f 2 --nested-fs ' ' file
Output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
This is related to my previous question (bash command for group by count).
What if I want to generalize this? For instance
The input file is
ABC|1|2
ABC|3|4
BCD|7|2
ABC|5|6
BCD|3|5
The output should be
ABC|9|12
BCD|10|7
The result is calculated by grouping on the first column and summing the values of the 2nd and 3rd columns, similar to GROUP BY in SQL.
I tried modifying the command provided in the link but failed. I don't know whether I'm making a conceptual error or a silly mistake, but none of the commands below works.
Command used
awk -F "|" '{arr[$1]+=$2} END arr2[$1]+=$5 END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2} END {arr2[$1]+=$5} END {for (i in arr) {print i"|"arr[i]"|"arr2[i]}}' sample
awk -F "|" '{arr[$1]+=$2 arr2[$1]+=$5} END {for (i in arr2) {print i"|"arr[i]"|"arr2[i]}}' sample
Additionally, what I'm trying here is limited to summing 2 columns only. What if there are n columns and we want to perform different operations, such as addition on one column and subtraction on another? How can the command be modified further?
Example
ABC|1|2|4|......... upto n columns
ABC|4|5|6|......... upto n columns
DEF|1|4|6|......... upto n columns
Let's say a sum is needed for the first column, an average for the second column, some other operation for the third column, etc. How can this be tackled?
For 3 fields (key and 2 data fields):
$ awk '
BEGIN { FS=OFS="|" } # set separators
{
a[$1]+=$2 # sum second field to a hash
b[$1]+=$3 # ... b hash
}
END { # in the end
for(i in a) # loop all
print i,a[i],b[i] # and output
}' file
BCD|10|7
ABC|9|12
More generic solution for n columns using GNU awk:
$ awk '
BEGIN { FS=OFS="|" }
{
for(i=2;i<=NF;i++) # loop all data fields
a[$1][i]+=$i # sum them up to related cells
a[$1][1]=i # set field count to first cell
}
END {
for(i in a) {
for((j=2)&&b="";j<a[i][1];j++) # buffer output
b=b (b==""?"":OFS)a[i][j]
print i,b # output
}
}' file
BCD|10|7
ABC|9|12
Latter only tested for 2 fields (busy at a meeting :).
gawk approach using multidimensional array:
awk 'BEGIN{ FS=OFS="|" }{ a[$1]["f2"]+=$2; a[$1]["f3"]+=$3 }
END{ for(i in a) print i,a[i]["f2"],a[i]["f3"] }' file
a[$1]["f2"]+=$2 - summing up values of the 2nd field (f2 - field 2)
a[$1]["f3"]+=$3 - summing up values of the 3rd field (f3 - field 3)
The output:
ABC|9|12
BCD|10|7
Additional short datamash solution (will give the same output):
datamash -st\| -g1 sum 2 sum 3 <file
-s - sort the input lines
-t\| - field separator
sum 2 sum 3 - sums up values of the 2nd and 3rd fields respectively
awk -F\| '{ array[$1]="";for (i=2;i<=NF;i++) { arr[$1,i]+=$i } } END { for (i in array) { printf "%s",i;for (p=2;p<=NF;p++) { printf "|%s",arr[i,p] } print "" } }' filename
We use two arrays, array and arr. array is a one-dimensional array tracking the distinct values of the first field, and arr is a multidimensional array keyed on the first field and then the field index, so after the first input line, for example, arr["ABC",2]=1 and arr["ABC",3]=2. At the end we loop through array and, for each data field, pull the accumulated sum out of arr.
This will work in any awk and will retain the input keys order in the output:
$ cat tst.awk
BEGIN { FS=OFS="|" }
!seen[$1]++ { keys[++numKeys] = $1 }
{
for (i=2;i<=NF;i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s%s", key, OFS
for (i=2;i<=NF;i++) {
printf "%s%s", sum[key,i], (i<NF?OFS:ORS)
}
}
}
$ awk -f tst.awk file
ABC|9|12
BCD|10|7
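For the follow-up in the question (a different operation per column), one possible sketch keeps a separate accumulator per column. The function name group_ops and the choice of operations (sum of field 2, average of field 3, maximum of field 4) are assumptions for illustration only; the question does not fix them.

```shell
# group_ops: hypothetical helper; per key (field 1) prints
#   sum of field 2 | average of field 3 | maximum of field 4
# The operation chosen for each column is an assumption for illustration.
group_ops() {
    awk '
    BEGIN { FS = OFS = "|" }
    !($1 in n) { keys[++numKeys] = $1; max[$1] = $4 }   # first time this key is seen
    {
        n[$1]++                      # rows per key, needed for the average
        sum[$1] += $2
        tot[$1] += $3
        if ($4 + 0 > max[$1] + 0) max[$1] = $4
    }
    END {
        for (k = 1; k <= numKeys; k++) {
            key = keys[k]
            print key, sum[key], tot[key] / n[key], max[key]
        }
    }'
}
```

Usage: group_ops < file. Keys are printed in the order they first appear in the input.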
My input file is
a|b|c|d
w|r|g|h
I want to store the values in an array like
a[1,1] = a
a[1,2] = b
a[2,1] = w
Kindly suggest a way to achieve this in awk or bash.
I have two input files and need to do field-level validation.
Like this
awk -F'|' '{for(i=1;i<=NF;i++)a[NR,i]=$i}
END {print a[1,1],a[2,2]}' file
Output
a r
This parses the file into an awk array:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }' filename
You'll have to add code that uses the array for this to be of any use, of course. Since you didn't say what you wanted to do with the array once it is complete (after the pass over the file), this is all the answer I can give you.
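Since the question mentions field-level validation between two input files, one way the array could be used is sketched below. The function name field_diff and the report format are hypothetical, and the sketch assumes the files are compared line by line; lines that exist only in the first file are not reported.

```shell
# field_diff FILE1 FILE2: hypothetical helper that reports every field
# that differs between two pipe-delimited files, compared line by line.
field_diff() {
    awk -F'|' '
    NR == FNR {                                   # first file: store every field
        for (i = 1; i <= NF; i++) a[FNR, i] = $i
        nf[FNR] = NF
        next
    }
    {                                             # second file: compare with stored fields
        max = (NF > nf[FNR] ? NF : nf[FNR])
        for (i = 1; i <= max; i++)
            if (($i "") != (a[FNR, i] ""))        # force a string comparison
                printf "line %d field %d: \"%s\" vs \"%s\"\n", FNR, i, a[FNR, i], $i
    }' "$1" "$2"
}
```

Usage: field_diff file1 file2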
You're REALLY going to want to get/use gawk 4.* if you're using multi-dimensional arrays as that's the only awk that supports them. When you write:
a[1,2]
in any awk you are actually creating a pseudo-multi-dimensional array which is a 1-dimensional array indexed by the string formed by the concatenation of
1 SUBSEP 2
where SUBSEP is a control char that's unlikely to appear in your input.
In GNU awk 4.* you can do:
a[1][2]
(note the different syntax) and that populates an actual multi-dimensional array.
Try this to see the difference:
$ cat tst.awk
BEGIN {
SUBSEP=":" # just to make it visible when printing
oneD[1,2] = "a"
oneD[1,3] = "b"
twoD[1][2] = "c"
twoD[1][3] = "d"
for (idx in oneD) {
print "oneD", idx, oneD[idx]
}
print ""
for (idx1 in twoD) {
print "twoD", idx1
for (idx2 in twoD[idx1]) { # you CANNOT do this with oneD
print "twoD", idx1, idx2, twoD[idx1][idx2]
}
}
}
$ awk -f tst.awk
oneD 1:2 a
oneD 1:3 b
twoD 1
twoD 1 2 c
twoD 1 3 d
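In an awk without true multi-dimensional arrays, the row and column can still be recovered from the SUBSEP-joined key with split(). A minimal sketch, assuming pipe-delimited input on stdin; the function name list_cells is hypothetical:

```shell
# list_cells: hypothetical helper; loads a pipe-delimited file from stdin
# into a[row,col] and prints each cell with its row and column recovered
# from the SUBSEP-joined key.
list_cells() {
    awk -F'|' '
    { for (i = 1; i <= NF; i++) a[NR, i] = $i }
    END {
        for (key in a) {
            split(key, idx, SUBSEP)        # idx[1] = row, idx[2] = column
            print idx[1], idx[2], a[key]
        }
    }' | sort -k1,1n -k2,2n                # "for (key in a)" order is unspecified
}
```

Usage: list_cells < file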