Get non-monotonically increasing fields in Bash

Let's say I have a file with multiple columns and I want to extract several fields, but they may not be in increasing order. The field indexes are in an array; they can be in any order (or no order at all), and the number of indexes is unknown. For example:
arr=(1 3 2) #indexes, unknown length
echo 'c1 c2 c3' | cut -d " " -f "${arr[*]}"
The output of that is
c1 c2 c3
but I want
c1 c3 c2
So it seems cut sorts the field list before extracting, and I don't want that. I am not restricted to cut; any other command can be used.
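A quick check shows the behavior (cut always emits selected fields in input order, no matter how the -f list is ordered):
$ echo 'c1 c2 c3' | cut -d' ' -f 3,1
c1 c3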
However, I am restricted to this, rather old, version of bash:
GNU bash, version 2.05b.0(1)-release (i586-suse-linux)
Copyright (C) 2002 Free Software Foundation, Inc.
EDIT: Solved, thanks to Benjamin W and Glenn Jackman:
echo "1 2 3" | awk -v fields="${arr[*]}" 'BEGIN{ n = split(fields,f) } { for (i=1; i<=n; ++i) printf "%s%s", $f[i], (i<n?OFS:ORS) }'
It is important to reference the array with '*' instead of '@'.
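A minimal illustration of the difference: "${arr[*]}" joins the elements into a single word, which is what awk -v needs, while "${arr[@]}" expands to one word per element and would scatter the indexes across awk's argument list:
$ arr=(1 3 2)
$ printf '<%s>\n' "${arr[*]}"   # one word
<1 3 2>
$ printf '<%s>\n' "${arr[@]}"   # three words
<1>
<3>
<2>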

This may or may not work with bash 2.05:
arr=(1 3 2)
set -f # disable filename generation
while read line; do
    set -- $line            # unquoted: taking advantage of word splitting,
                            # store the words as positional parameters
    for i in "${arr[@]}"; do
        printf "%s " "${!i}"    # indirect variable expansion
    done
    echo
done < file
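For example, assuming file holds the sample line from the question:
$ printf 'c1 c2 c3\n' > file
$ # ...loop above...
c1 c3 c2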
Or, in perl:
$ cat file
c1 c2 c3
$ perl -slane '
    BEGIN { @a = map { $_ - 1 } split " ", $arr }
    print join " ", @F[@a]
' -- -arr="${arr[*]}" file
c1 c3 c2

Using awk
$ arr=(1 3 2)
$ echo 'c1 c2 c3' | awk -v arr="${arr[*]}" '
BEGIN {
    split(arr, idx, " ")
}
{
    for (i = 1; i <= length(idx); ++i)
        printf("%s ", $idx[i])
}
END {
    printf("\n")
}
'
First, split arr on spaces and store the indexes in idx.
Then print the fields in the order those indexes give.
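Note that calling length() on an array is a GNU awk extension; a slightly more portable sketch (my variant, not the original answer) saves the return value of split() instead:
echo 'c1 c2 c3' | awk -v arr="${arr[*]}" '
BEGIN { n = split(arr, idx, " ") }
{
    for (i = 1; i <= n; ++i)
        printf("%s ", $idx[i])
    printf("\n")
}'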

Using awk:
$ cat file
a b c d
a b c d
a b c d
a b c d
$ awk -v ord="1 4 3 2" 'BEGIN { split(ord, order, " ") }
{
    split($0, line, FS)
    for (i = 1; i <= length(order); ++i)
        $i = line[order[i]]
    print
}' file
a d c b
a d c b
a d c b
a d c b
The order is given by the ord variable passed on the command line. This variable is assumed to hold as many space-separated values as there are fields in the input file.
In the BEGIN block, an array, order, is created from ord by splitting it on spaces.
In the default block, the current input line is split into the array line on FS (whitespace by default). The fields are then rearranged according to the order array and then the re-constructed line is printed out.
No test is made that the passed-in value of ord is sane. If the input has N columns, ord must contain all the integers from 1 to N in some order.
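If you do want to guard against a bad ord, a hypothetical sanity check (not part of the original answer) can verify in the BEGIN block that ord is a permutation of 1..N:
awk -v ord="1 4 3 2" '
BEGIN {
    n = split(ord, order, " ")
    for (i = 1; i <= n; i++) seen[order[i]]++
    for (i = 1; i <= n; i++)
        if (seen[i] != 1) {
            print "ord must contain each of 1.." n " exactly once" > "/dev/stderr"
            exit 1
        }
}
{
    split($0, line, FS)
    for (i = 1; i <= n; ++i)
        $i = line[order[i]]
    print
}' file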

Abstract your read from your print so you can name the parts and order them accordingly.
$: cat x
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
c1 c2 c3
$: cut -f 1,3,2 x |
> while read a b c
> do printf '%s %s %s\n' "$a" "$c" "$b"
> done
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
c1 c3 c2
This puts the read loop in the bash interpreter, which isn't as fast as a compiled tool, but it doesn't require anything beyond the shell you were already using.
I don't see much point in using awk if you have perl, so if the file is big enough that you need a faster solution, try this:
perl -a -n -e 'print join " ", @F[0,2,1], "\n"' x
Assumes a lot, and adds a space before the newline, but should give you a working place to start.

Related

Filter and sort column [closed]

I've learned awk and sed, but I'm stuck on this problem. Can anyone help me?
I have a table like this:
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
I want to split out the values at the odd and even columns into two tables, like this:
table 1:
a1
a3
b1
b3
c1
c3
and table 2:
a2
a4
b2
b4
c2
c4
How can I do this?
It's easy to do in awk:
awk '{ for (i = 1; i <= NF; i += 2) print $i > "table.1"
for (i = 2; i <= NF; i += 2) print $i > "table.2" }' data
For each line, the first loop writes the odd fields to table.1 and the second loop writes the even fields to table.2. It will even work with different numbers of columns in each line if the input data is not wholly consistent. A single pass through the input data generates both output files.
If you know the maximum number of fields (say, at most 100), just use cut:
$ echo 'a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4' | cut -d' ' -f $(seq -s, 2 2 100) | tr ' ' '\n'
a2
a4
b2
b4
c2
c4
and for the odd ones seq would just start at 1.
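For instance:
$ echo 'a1 a2 a3 a4' | cut -d' ' -f $(seq -s, 1 2 99) | tr ' ' '\n'
a1
a3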
Here's the same thing in awk (i=1 for the odd ones):
echo ... | awk '{for(i=2; i<=NF;i+=2){ print $i}}'
This might work for you (GNU sed):
sed 's/ \+/\n/g;s/^\n\|\n$//' file | sed -ne '1~2w table1' -e '2~2w table2'
Replace space(s) by newlines and remove leading or trailing newlines.
Pipe output into a second invocation of sed which directs odd lines to table1 and even lines to table2.
Or you may prefer to use:
paste -sd' ' file | tr -s ' ' '\n' | sed -ne '1~2w table1' -e '2~2w table2'
$ awk '{for (i=1; i<=NF; i++) print $i > ("table" (i+1)%2+1)}' file
$ head table*
==> table1 <==
a1
a3
b1
b3
c1
c3
==> table2 <==
a2
a4
b2
b4
c2
c4

Extract rows from table where values are less than and greater than in columns in shell

I have a very large tab-separated table (24 GB) with columns C1, C2, C3, and C4, as shown below. I would like to extract the rows where C1 < 0.6 and C2 < 0.4. How do I do this in unix/shell using logical operators?
C1 C2 C3 C4
0.8 0.1 A1 C.a
0.2 0.3 A2 C.b
0.5 0.8 A3 C.c
0.1 0.1 A4 C.c
Result I expect:
C1 C2 C3 C4
0.2 0.3 A2 C.b
0.1 0.1 A4 C.c
1st solution: This simple awk should do the job for you.
awk 'FNR==1 || ($1<.6 && $2<.4)' Input_file
OR for tab separated Input_file try following:
awk 'BEGIN{FS=OFS="\t"}FNR==1 || ($1<.6 && $2<.4)' Input_file
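With the sample data from the question, either command prints:
C1 C2 C3 C4
0.2 0.3 A2 C.b
0.1 0.1 A4 C.c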
2nd solution (generic): In case you don't want to hard-code the field numbers of columns C1 and C2 and want to find them programmatically, try the following. Add BEGIN{FS=OFS="\t"} in case your Input_file is TAB-delimited.
awk -v c1Thre="0.6" -v c2Thre="0.4" '
FNR==1{
    for(i=1;i<=NF;i++){
        if($i=="C1"){ C1Field=i }
        if($i=="C2"){ C2Field=i }
    }
    print
    next
}
$C1Field<c1Thre && $C2Field<c2Thre
' Input_file
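Run against the sample Input_file, this prints the same rows as the first solution:
C1 C2 C3 C4
0.2 0.3 A2 C.b
0.1 0.1 A4 C.c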
Try this: it squeezes the runs of spaces (there are 3-4 of them) into "," for processing:
tr -s " " "," < mydata.txt | awk -F"," '$1 < 0.6 && $2 < 0.4 { print $0 }'

Compare two delimited files field by field and find the missing and non matching records

Two input files, each having 3 fields. The first two fields in both files have to match, and then the third field has to be compared.
File1
A ; 1 ; a1
B ; 2 ; b2
C ; 3 ; c3
A ; 4 ; a4
File 2
B ; 2 ; b2
C ; 3 ; c5
E ; 5 ; e5
I want output like below.
Mismatching:
C ; 3 ; c3
Lines missing in file1:
E ; 5 ; e5
Lines missing in file2:
A ; 1 ; a1
A ; 4 ; a4
I also want the records missing in file1 and file2.
I tried
awk 'BEGIN {FS = ";"} NR==FNR{a[$1,$2] = $3; next} (a[$1,$2] != $3)' file1 file2
but this gives me only the rows in file2 that are not present in file1.
$ awk -F';' '
NR==FNR{a[$1","$2]=$0; next}
$1","$2 in a{if(a[$1","$2] != $0)mm=mm $0 RS; delete a[$1","$2]; next}
{nf=nf $0 RS}
END{print "Mismatching:\n" mm;
print "Lines missing in file1:"; for(i in a)print a[i];
print "\nLines missing in file2:\n" nf}
' file2 file1
Mismatching:
C ; 3 ; c3
Lines missing in file1:
E ; 5 ; e5
Lines missing in file2:
A ; 1 ; a1
A ; 4 ; a4
$1","$2 in a if first two fields are found in a
if value in a doesn't match current line, append the line to variable mm (mismatch lines)
delete the key from a so that at the end whichever keys were not called upon will give missing lines
nf=nf $0 RS if the key wasn't found in a then we get lines not found in first file argument passed to awk
END{...} print as required
It is better to save the code in a file and call it using -f:
$ cat cmp.awk
NR==FNR{a[$1","$2]=$0; next}
$1","$2 in a{if(a[$1","$2] != $0)mm=mm $0 RS; delete a[$1","$2]; next}
{nf=nf $0 RS}
END{print "Mismatching:\n" mm;
print "Lines missing in file1:"; for(i in a)print a[i];
print "\nLines missing in file2:\n" nf}
$ awk -F';' -f cmp.awk file2 file1

How to repeat lines in bash and paste with different columns?

Is there a short way in bash to repeat each line of one file as often as needed to paste it with another file, in a Kronecker-product fashion (for the mathematicians among you)?
What I mean is, I have a file A:
a
b
c
and a file B:
x
y
z
and I want to merge them as follows:
a x
a y
a z
b x
b y
b z
c x
c y
c z
I could probably write a script that reads the files line by line and loops over them, but I am wondering if there is a short one-line command that could do the same job. I can't think of one, and as you can see, I am also lacking the keywords to search for. :-D
Thanks in advance.
You can use this one-liner awk command:
awk 'FNR==NR{a[++n]=$0; next} {for(i=1; i<=n; i++) print $0, a[i]}' file2 file1
a x
a y
a z
b x
b y
b z
c x
c y
c z
Breakdown:
NR == FNR {                    # while processing the first file in the list
    a[++n] = $0                # store the row in array 'a' at an incrementing index
    next                       # move to the next record
}
{                              # while processing the second file
    for (i = 1; i <= n; i++)   # iterate over the array a
        print $0, a[i]         # print current row and array element
}
An alternative to awk:
join <(sed 's/^/_\t/' file1) <(sed 's/^/_\t/' file2) | cut -d' ' -f2-
Add a fake key so that join matches every record of file1 with every record of file2, then trim the key off afterwards.
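To see why it works, here is the intermediate data for the sample files (GNU sed turns \t into a literal tab):
$ sed 's/^/_\t/' file1
_	a
_	b
_	c
$ join <(sed 's/^/_\t/' file1) <(sed 's/^/_\t/' file2) | head -3
_ a x
_ a y
_ a z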

Cartesian product of two files (as sets of lines) in GNU/Linux

How can I use shell one-liners and common GNU tools to concatenate lines in two files as in Cartesian product? What is the most succinct, beautiful and "linuxy" way?
For example, if I have two files:
$ cat file1
a
b
$ cat file2
c
d
e
The result should be
a, c
a, d
a, e
b, c
b, d
b, e
Here's a shell script to do it:
while read a; do while read b; do echo "$a, $b"; done < file2; done < file1
Though that will be quite slow.
I can't think of any precompiled logic to accomplish this.
The next step for speed would be to do the above in awk/perl.
awk 'NR==FNR { a[$0]; next } { for (i in a) print i",", $0 }' file1 file2
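One caveat: for (i in a) visits keys in an unspecified order. A variant sketch that stores the lines under numeric indexes reproduces the expected output order exactly:
awk 'NR==FNR { b[++n] = $0; next } { for (i = 1; i <= n; i++) print $0",", b[i] }' file2 file1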
Hmm, how about this hacky solution to use precompiled logic?
paste -d, <(sed -n "$(yes 'p;' | head -n $(wc -l < file2))" file1) \
<(cat $(yes 'file2' | head -n $(wc -l < file1)))
There won't be a comma to separate the fields, but using only join (joining on the nonexistent field 2 makes every line match every line):
$ join -j 2 file1 file2
a c
a d
a e
b c
b d
b e
The mechanical way to do it in shell, not using Perl or Python, is:
while read line1
do
while read line2
do echo "$line1, $line2"
done < file2
done < file1
The join command can sometimes be used for these operations - however, I'm not clear that it can do cartesian product as a degenerate case.
One step up from the double loop would be:
while read line1
do
sed "s/^/$line1, /" file2
done < file1
I'm not going to pretend this is pretty, but...
join -t, -j 9999 -o 2.1,1.1 /tmp/file1 /tmp/file2
(updated thanks to Iwan Aucamp below)
-- join (GNU coreutils) 8.4
Edit:
DVK's attempt inspired me to do this with eval:
script='1{x;d};${H;x;s/\n/\,/g;p;q};H'
eval "echo {$(sed -n $script file1)}\,\ {$(sed -n $script file2)}$'\n'"|sed 's/^ //'
Or a simpler sed script, used without the -n switch:
script=':a;N;${s/\n/,/g;b};ba'
Either way, the result is:
a, c
a, d
a, e
b, c
b, d
b, e
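To see what eval receives: the sed script joins each file's lines with commas, so for the sample files
$ sed ':a;N;${s/\n/,/g;b};ba' file1
a,b
and the evaluated command is effectively the brace-expansion echo shown in the original answer below.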
Original answer:
In Bash, you can do this. It doesn't read from files, but it's a neat trick:
$ echo {a,b}\,\ {c,d,e}$'\n'
a, c
a, d
a, e
b, c
b, d
b, e
More simply:
$ echo {a,b}{c,d,e}
ac ad ae bc bd be
A generic recursive bash function could be something like this:
foreachline() {
    _foreachline() {
        if [ $# -lt 2 ]; then
            printf "$1\n"
            return
        fi
        local prefix=$1
        local file=$2
        shift 2
        while read line; do
            _foreachline "$prefix$line, " $*
        done <$file
    }
    _foreachline "" $*
}
foreachline file1 file2 file3
Regards.
Solution 1:
perl -e '{use File::Slurp; @f1 = read_file("file1"); @f2 = read_file("file2"); map { chomp; $v1 = $_; map { print "$v1,$_"; } @f2 } @f1;}'
Edit: Oops... Sorry, I thought this was tagged python...
If you have python 2.6:
from itertools import product
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))
a, c
a, d
a, e
b, c
b, d
b, e
If you have python pre-2.6:
def product(*args, **kwds):
    '''
    Source: http://docs.python.org/library/itertools.html#itertools.product
    '''
    # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
    # product(range(2), repeat=3) --> 000 001 010 011 100 101 110 111
    pools = map(tuple, args) * kwds.get('repeat', 1)
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))
A solution using join, awk and process substitution:
join <(xargs -I_ echo 1 _ < setA) <(xargs -I_ echo 1 _ < setB) |
    awk '{ printf("%s, %s\n", $2, $3) }'
awk 'FNR==NR{ a[++d]=$1; next }
{
    for (i=1; i<=d; i++){
        print $1","a[i]
    }
}' file2 file1
# ./shell.sh
a,c
a,d
a,e
b,c
b,d
b,e
OK, this is a derivation of Dennis Williamson's solution above, since he noted that his does not read from files:
$ echo {`cat a | tr "\012" ","`}\,\ {`cat b | tr "\012" ","`}$'\n'
a, c
a, d
a, e
b, c
b, d
b, e
GNU Parallel:
parallel echo "{1}, {2}" :::: file1 :::: file2
Output:
a, c
a, d
a, e
b, c
b, d
b, e
Of course perl has a module for that:
#!/usr/bin/perl
use File::Slurp;
use Math::Cartesian::Product;
use v5.10;
$, = ", ";
@file1 = read_file("file1", chomp => 1);
@file2 = read_file("file2", chomp => 1);
cartesian { say @_ } \@file1, \@file2;
Output:
a, c
a, d
a, e
b, c
b, d
b, e
In fish it's a one-liner, because adjacent command substitutions combine like a Cartesian product:
printf '%s\n' (cat file1)", "(cat file2)
