How to produce a Cartesian product in bash?

I want to produce a file like this (the Cartesian product of [1-3] x [1-5]):
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
3 1
3 2
3 3
3 4
3 5
I can do this using a nested loop like:
for i in $(seq 3)
do
    for j in $(seq 5)
    do
        echo $i $j
    done
done
Is there any solution without loops?

Combine two brace expansions!
$ printf "%s\n" {1..3}" "{1..5}
1 1
1 2
1 3
1 4
1 5
2 1
2 2
2 3
2 4
2 5
3 1
3 2
3 3
3 4
3 5
This works by using a single brace expansion:
$ echo {1..5}
1 2 3 4 5
and then combining with another one:
$ echo {1..5}+{a,b,c}
1+a 1+b 1+c 2+a 2+b 2+c 3+a 3+b 3+c 4+a 4+b 4+c 5+a 5+b 5+c
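To write the pairs straight to the file the question asks for, the printf expansion from above can simply be redirected (pairs.txt is just a placeholder name):
$ printf "%s\n" {1..3}" "{1..5} > pairs.txt
$ wc -l pairs.txt
15 pairs.txt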

A shorter (but hacky) version of Rubens's answer:
join -j 999999 -o 1.1,2.1 file1 file2
Since field 999999 most likely does not exist, it is considered equal for both sets, so join has to produce the Cartesian product. It uses O(N+M) memory and produces output at 100..200 Mb/sec on my machine.
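A minimal run of the trick, with hypothetical inputs (GNU join assumed, since it tolerates the nonexistent field):
$ printf '%s\n' a b c > file1
$ printf '%s\n' x y > file2
$ join -j 999999 -o 1.1,2.1 file1 file2
a x
a y
b x
b y
c x
c y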
I don't like the "shell brace expansion" method like echo {1..100}x{1..100} for large datasets, because it uses O(N*M) memory and, when used carelessly, can bring your machine to its knees. It is also hard to stop, because Ctrl+C does not interrupt brace expansion, which is done by the shell itself.

The best alternative for a Cartesian product in bash is surely, as pointed out by @fedorqui, to use brace expansion. However, in case your input is not easily producible (i.e., if {1..3} and {1..5} do not suffice), you could simply use join.
For example, if you want to perform the Cartesian product of two regular files, say "a.txt" and "b.txt", you could do the following. First, the two files:
$ echo -en {a..c}"\tx\n" | sed 's/^/1\t/' > a.txt
$ cat a.txt
1 a x
1 b x
1 c x
$ echo -en "foo\nbar\n" | sed 's/^/1\t/' > b.txt
$ cat b.txt
1 foo
1 bar
Notice the sed command is used to prepend each line with an identifier. The identifier must be the same for all lines and for all files, so that the join gives you the Cartesian product instead of discarding some of the resulting lines. The join then goes as follows:
$ join -j 1 -t $'\t' a.txt b.txt | cut -d $'\t' -f 2-
a x foo
a x bar
b x foo
b x bar
c x foo
c x bar
After both files are joined, cut is used to remove the column of "1"s that was prepended earlier.
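The same recipe also works without the intermediate files, prepending the key on the fly through process substitution (a sketch; a.raw and b.raw are hypothetical inputs with one record per line):
$ join -j 1 -t $'\t' <(sed 's/^/1\t/' a.raw) <(sed 's/^/1\t/' b.raw) | cut -d $'\t' -f 2-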

Related

Unix sort groups by their associated maximum value?

Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by the value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
<(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
| cut -f1 | nl | tr -d " " | sort -k2,2) \
<(cat 49142202.txt | sort -k1,1 -k2nr,2) \
| sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
This is then sorted by the nl line number in column 2, and then the original input columns 1 and 3 are kept with cut.
I know it can be done a lot more easily with, for example, pandas' groupby in Python, but is there a more elegant way of doing it while sticking to GNU coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory-inefficient awk solution, but please share those as well. Thanks!
As explained in the comments, my solution tries to reduce the number of pipes, unnecessary cat commands, and especially the number of sort operations in the pipeline, since sorting is a complex and time-consuming operation:
I reached the following solution where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
    grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
The awk command just prints, in the order first seen, each unique element of the first column of that sorted output.
Then the for elem in $(...); do grep $elem <(sort -k2nr f_grp_sort); done loop
greps for the lines containing B, then A, then C, which yields the required output.
Now, as an enhancement, you can use a temporary file to avoid running the sort -k2nr f_grp_sort operation twice:
sort -k2nr f_grp_sort > tmp_sorted_file &&
for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do
    grep $elem tmp_sorted_file
done &&
rm tmp_sorted_file
So, this won't work for all cases, but if the values in your first column can be turned into bash variable names, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads in the contents of the file, putting the first two space-separated strings into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B, where A and B are the values from column 1 (but only if $col1 contains nothing but characters that are legal in bash variable names). Each array holds the column 2 values associated with that column 1 value.
I use your fancy sort chain to get the order in which the column 1 values should print; we just loop through them, and for each column 1 array we sort the values and echo out column 1 and column 2.
The dynamic variable bits can be hard to follow, but for the right values in column 1 it will work. Again, if column 1 contains any character that can't be part of a bash variable name, this solution will not work.
file=./49142202.txt

# build one array per distinct column 1 value: ARR_A, ARR_B, ...
while read col1 col2 extra
do
    if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
    then
        eval 'ARR_'${col1}'+=("'${col2}'")'
    else
        echo "Bad character detected in Column 1: '$col1'"
        exit 1
    fi
done < "$file"

# print each group in order of its maximum, values sorted descending
sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
    for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[@]}"' | sort -r)
    do
        echo $col1 $col2
    done
done
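As a side note (my own variation, not part of the answer above): on bash 4.3+ the eval gymnastics can be replaced with a nameref, which sidesteps eval's quoting pitfalls:
# minimal sketch using declare -n (bash >= 4.3); same validation idea, no eval
while read -r col1 col2 extra
do
    [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]] || { echo "Bad character in Column 1: '$col1'"; exit 1; }
    unset -n arr                    # drop any previous nameref binding
    declare -n arr="ARR_${col1}"    # arr now aliases the dynamically named array
    arr+=("$col2")
done < "$file"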
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot @JeffBreadner and @Allan! I came up with yet another solution, which is very similar to my first one but gives a bit more control, because it allows for easier nesting of for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
    awk -v x=$x '$1==x' $file | sort -k2nr,2
done
Do you mind if I don't accept either of your answers until I have time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by @Allan.
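For reference, since @Allan's awk answer itself is not quoted in this thread, here is my own sketch of that style of solution: decorate each line with its group's maximum, sort, then strip the decoration (assumes whitespace-separated input and distinct group maxima; output is space-separated):
# pass 1 records each group's maximum; pass 2 prepends it to every line
awk 'NR==FNR { if ($2 > max[$1]) max[$1] = $2; next }
     { print max[$1], $1, $2 }' 49142202.txt 49142202.txt |
    sort -k1,1nr -k3,3nr | cut -d' ' -f2-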

Use shell to convert three line formats in one file to another format at one time

cat file1.txt
set A B 1
set C D E 2
set E F 3 3 3 3 3 3
cat file2.txt
A;B;1;
C;D.E;2;
E;F;3 3 3 3 3 3;
Please help convert the format in file1.txt to that of file2.txt; file2.txt is the desired output. I only put 3 lines in file1.txt as an example, but in fact there are many command lines in these same 3 formats, so the shell command should adapt to any content in file1.txt that follows these formats.
echo "set A B 1
set C D E 2
set E F 3 3 3 3 3 3 " | sed -r 's/set (.) /\1;/;s/([A-Z])*( ([A-Z]))/\1.\3/g;s/([A-Z]) ([0-9])/\1;\2/;s/ ?$/;/'
A;B;1;
C;D.E;2;
E;F;3 3 3 3 3 3;
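Spelled out over several lines with comments (my annotation of the same script; GNU sed assumed, whose scripts allow # comment lines):
sed -r '
    # turn the leading "set X " into "X;"
    s/set (.) /\1;/
    # join a pair of capitals separated by a space with a dot: "D E" -> "D.E"
    s/([A-Z])*( ([A-Z]))/\1.\3/g
    # put ";" between the last letter field and the first number
    s/([A-Z]) ([0-9])/\1;\2/
    # drop an optional trailing space and terminate the line with ";"
    s/ ?$/;/
' file1.txt > file2.txt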

Loop through a file and paste columns next to one another

Given I have a python script as follows:
#!/usr/bin/python
for i in range(1,4):
    print i
I want to run it in a bash loop 3 times, but I want to add the output as columns rather than concatenating it. Is there a way to achieve this?
Output:
1 1 1
2 2 2
3 3 3
Like this?:
$ for i in {1..3} ; do echo $i $i $i ; done
1 1 1
2 2 2
3 3 3
You are looking for the pr command:
for i in 1 2 3 ; do
    python a.py
done | pr -t -3
Output:
1 1 1
2 2 2
3 3 3
Btw, to get the numbers from 1 to 3 in Python you need to use:
range(1,4) # <-- 4, not 3!
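If pr is unavailable, a paste with process substitution gives the same layout for a fixed, small number of runs (a sketch, assuming the same a.py):
$ paste -d ' ' <(python a.py) <(python a.py) <(python a.py)
1 1 1
2 2 2
3 3 3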

Repeat an element n number of times in an array

Basically, I am trying to repeat each element of the array [1 2 3] 4 times, such that I get something like this:
[1 1 1 1 2 2 2 2 3 3 3 3]
I tried a very stupid line of code, i.e. abc=('1%.0s' {1..4}), but it failed miserably.
I am looking for an efficient one line solution to this problem and preferably, without using loops. If it is not possible to achieve this with just one line, then use loops.
Unless you're trying to avoid loops you can do:
arr=(1 2 3)
for i in ${arr[@]}; do for ((n=1; n<=4; n++)); do echo -n "$i "; done; done; echo
1 1 1 1 2 2 2 2 3 3 3 3
To store the results in an array:
aarr=($(for i in ${arr[@]}; do for ((n=1; n<=4; n++)); do echo -n "$i "; done; done))
declare -p aarr
declare -a aarr='([0]="1" [1]="1" [2]="1" [3]="1" [4]="2" [5]="2" [6]="2" [7]="2" [8]="3" [9]="3" [10]="3" [11]="3")'
This does what you need and stores it in an array:
declare -a res=($(for v in 1 2 3; do for i in {1..4}; do echo $v; done; done))
Taking your idea to the next step:
$ a=(1 2 3)
$ b=($(for x in "${a[@]}"; do printf "$x%.0s " {1..4}; done))
$ echo ${b[@]}
1 1 1 1 2 2 2 2 3 3 3 3
Alternatively, using sed:
$ echo ${a[*]} | sed -r 's/[[:alnum:]]+/& & & &/g'
1 1 1 1 2 2 2 2 3 3 3 3
Or, using awk:
$ echo ${a[*]} | awk -v RS='[ \n]' '{for (i=1;i<=4;i++)printf "%s ", $0;} END{print""}'
1 1 1 1 2 2 2 2 3 3 3 3
A simple one-liner:
for x in 1 2 3 ; do array+="$(printf "%1.0s$x" {1..4})" ; done
Similar to what you wanted, though note it appends everything to one string (${array[0]}) rather than creating separate array elements.

Paste side by side multiple files by numerical order

I have many files in a directory with similar file names like file1, file2, file3, file4, file5, ..... , file1000. They are of the same dimension, and each one of them has 5 columns and 2000 lines. I want to paste them all together side by side in a numerical order into one large file, so the final large file should have 5000 columns and 2000 lines.
I tried
for x in $(seq 1 1000); do
    paste `echo -n "file$x "` > largefile
done
Instead of writing all the file names on the command line, is there a way I can paste those files in numerical order (file1, file2, file3, file4, file5, ..., file10, file11, ..., file1000)?
for example:
file1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
...
file2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
....
file3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
....
paste file1 file2 file3 .... file1000 > largefile
largefile
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
....
Thanks.
If your current shell is bash:
paste -d " " file{1..1000}
You need to rename the files with leading zeroes, like:
paste <(ls -1 file* | sort -te -k2.1n) <(seq -f "file%04g" 1000) | xargs -n2 echo mv
The above is a "dry run"; remove the echo once you are satisfied.
Or you can use e.g. perl:
ls file* | perl -nlE 'm/file(\d+)/; rename $_, sprintf("file%04d", $1);'
and afterwards you can simply:
paste file*
With zsh:
setopt extendedglob
paste -d ' ' file<->(n)
<x-y> matches positive decimal integer numbers from x to y. x and/or y can be omitted, so <-> matches any positive decimal integer number. It could also be written [0-9]## (## being the zsh equivalent of the regex +).
The (n) is a globbing qualifier: n turns on numeric sorting, which sorts on all sequences of decimal digits appearing in the file names.
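To preview the expansion order before pasting (a quick check, assuming the file1 ... file1000 names from the question):
setopt extendedglob
print -rl -- file<->(n) | head -4
file1
file2
file3
file4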
