File Manipulation Loop - bash

OK, this question is two-fold: 1) the actual file manipulation bit, and 2) looping this manipulation in Unix.
Part 1)
I have two files:
File_1
a b
c d
e f
g h
and File_2
A B
C D
E F
G H
I J
I would like to get (in the first instance) the following result:
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
...and save this output to outfile1.
I gather I would have to use things like awk, cut and/or paste, but I can't manage to put it all together.
Part 2)
I then want to loop this manipulation for all rows in File_2 (note that the number of rows in File_1 is not the same as in File_2), such that I end up with 5 output files, where outfile2 would be:
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
and outfile3 would be:
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
etc.
At the moment I'm working in bash. Thank you in advance for any help!

This can be done with bash redirection:
i=1
while read f2; do
    while read f1; do
        echo "$f1"
        echo "$f2"
        echo ">"
    done < File_1 | head -n -1 > output$i
    (( i++ ))
done < File_2
head -n -1 avoids having a lone delimiter at the end of each output$i file.

To make outfile_3 (with E F), for example:
x=$(sed -n '3p' File_2)
awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$x\" }" File_1 > outfile_3
In the first line, sed -n '3p' prints the 3rd line of File_2.
Now let's do it in a loop:
(( i = 1 ))
while read line
do
    awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$line\" }" File_1 > "outfile_$i"
    (( i++ ))
done < File_2
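Interpolating the shell variable directly into the awk program works for this input, but it becomes fragile if a line of File_2 ever contains double quotes or backslashes. A slightly more robust sketch of the same loop (an alternative, not part of the original answer) passes the line in with awk -v instead; suffix is just a name chosen here:
(( i = 1 ))
while IFS= read -r line; do
    # -v assigns the shell variable to an awk variable instead of splicing it into the program text
    awk -v suffix="$line" '{ printf "%s\n%s\n>\n", $0, suffix }' File_1 > "outfile_$i"
    (( i++ ))
done < File_2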

sort -m -f file1 file2 | uniq -i --all-repeated=separate
That looked rather close. However, on second reading, I think you want something more like this Perl script:
use strict;
use warnings;
open(my $FILE1, '<file1') or die;
my $output = 0;
while (my $a = <$FILE1>)
{
    $output++;
    open(my $OUT, ">output$output");
    open(my $FILE2, '<file2') or die;
    print $OUT "$a$_---\n" foreach (<$FILE2>);
    close $FILE2;
    close $OUT;
}
close $FILE1;
This creates output files output1, output2, output3, and so on, as many as there are lines in file1.

An awk one-liner:
awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){
printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
test:
kent$ head f1 f2
==> f1 <==
a b
c d
e f
g h
==> f2 <==
A B
C D
E F
G H
I J
kent$ awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
kent$ head -30 out*
==> output1 <==
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
>
==> output2 <==
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
>
==> output3 <==
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
>
==> output4 <==
a b
G H
>
c d
G H
>
e f
G H
>
g h
G H
>
==> output5 <==
a b
I J
>
c d
I J
>
e f
I J
>
g h
I J
>

Related

Add a specific string at the end of each line

I have a mainfile with 4 columns, such as:
a b c d
e f g h
i j k l
In another file, I have one line of text corresponding to each line in the mainfile, which I want to add as a new column to the mainfile, such as:
a b c d x
e f g h y
i j k l z
Is this possible in bash? I can only add the same string to the end of each line.
There are two ways you can do this:
1) paste file1 file2
2) Iterate over both files, combining them line by line, and write the result to a new file (a sketch follows below).
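A minimal sketch of option 2, reading both files in lockstep on separate file descriptors (file1, file2 and combined are placeholder names):
# Read one line from each file per iteration and print them side by side.
while IFS= read -r line1 <&3 && IFS= read -r line2 <&4; do
    printf '%s %s\n' "$line1" "$line2"
done 3<file1 4<file2 > combined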
You could use GNU parallel for that:
fe-laptop-m:test fe$ cat first
a b c d
e f g h
i j k l
fe-laptop-m:test fe$ cat second
x
y
z
fe-laptop-m:test fe$ parallel echo ::::+ first second
a b c d x
e f g h y
i j k l z
Did I understand correctly what you are trying to achieve?
This might work for you (GNU sed):
sed -E 's#(^.*) .*#/^\1/s/$/ &/#' file2 | sed -f - file1
This creates a sed script from file2 that uses a regexp to match a line in file1 and, if it matches, appends the contents of the corresponding line in file2 to the matched line.
N.B. This is independent of the order and length of file1.
You can try using pr
pr -mts' ' file1 file2

converting four columns to two using linux commands

I am wondering how one could merge four columns into two in the following manner (using the awk command, or other possible commands).
For example,
Old:
A B C D
E F G H
I J K L
M N O P
.
.
.
New:
A B
C D
E F
G H
I J
K L
M N
O P
.
.
Thanks so much!
That's actually quite easy with awk, as per the following transcript:
pax> cat inputFile
A B C D
E F G H
pax> awk '{printf "%s %s\n%s %s\n", $1, $2, $3, $4}' <inputFile
A B
C D
E F
G H
How about using xargs here? Could you please try the following once:
xargs -n 2 < Input_file
Output will be as follows.
A B
C D
E F
G H
I J
K L
M N
O P
With GNU sed:
$ sed 's/ /\n/2' file
This replaces the 2nd space on each line with a newline.
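If the input may have more than four columns, the same idea generalises with a small awk loop that prints the fields two at a time (a sketch, assuming every row has an even number of fields):
awk '{ for (i = 1; i <= NF; i += 2) print $i, $(i+1) }' inputFile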

Linux Bash count and summarize by unique columns

I have a text file with lines like this (in Linux Bash):
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
I need to find the lines with unique values in the first 3 columns, print their count, and then print a summarized last column for each unique combination, so the result looks like this:
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
What I have tried:
cat FILE | sort -k1,3 | uniq -f3 -c | sort -k3,5nr
Does anyone have any advice?
Thanks in advance!
The easiest is to do the following:
awk '{key=$1 OFS $2 OFS $3; a[key]=a[key]","$4; c[key]++}
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
If you do not want any duplication in the summarized column, you can do:
awk '{ key=$1 OFS $2 OFS $3; c[key]++ }
!gsub(","$4,","$4,a[key]) {a[key]=a[key]","$4; }
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
Could you please try the following and let me know if it helps.
This will give you output in the same order in which the $1, $2, $3 combinations first occur in Input_file.
awk '
!a[$1,$2,$3]++{
    b[++count]=$1 FS $2 FS $3
}
{
    c[$1,$2,$3]=c[$1,$2,$3]?c[$1,$2,$3] "," $4:$0
    d[$1 FS $2 FS $3]++
}
END{
    for(i=1;i<=count;i++){
        print d[b[i]],c[b[i]]
    }
}
' SUBSEP=" " Input_file
Another using GNU awk and 2d arrays for removing duplicates in $4:
$ awk '{
i=$1 OFS $2 OFS $3 # key to hash
a[i][$4] # store each $4 to separate element
c[i]++ # count key references
}
END {
for(i in a) {
k=1 # comma counter for output
printf "%s %s ",c[i],i # output count and key
for(j in a[i]) # each a[i][j] element
printf "%s%s",((k++)==1?"":","),j # output commas and elements
print "" # line-ending
}
}' file
Output in default random order:
2 E F G N,P
3 A B C Q,D,J
1 G Y F Q
1 E B C P
2 H F S L
Since we are using GNU awk, the order of the output can easily be controlled by setting PROCINFO["sorted_in"]="@ind_str_asc":
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
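The setting has to be in place before the for (i in a) loop runs, for example at the top of the END block; a sketch of that change (the rest of the program stays as above):
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate arrays in ascending string order of the index
    for (i in a) {
        k = 1
        printf "%s %s ", c[i], i
        for (j in a[i])
            printf "%s%s", ((k++) == 1 ? "" : ","), j
        print ""
    }
}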
You could utilize GNU datamash:
$ cat input
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input
A B C D,J,Q 3
E B C P 1
E F G N,P 2
G Y F Q 1
H F S L 2
This unfortunately outputs the count as the last column. If it is absolutely necessary for it to be the first column, you will have to reformat it:
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input | awk '{$0=$NF FS $0; NF--}1'
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
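The awk filter at the end works by moving the count to the front and then truncating the record; a commented version of the same filter (it relies on GNU awk rebuilding $0 when NF is decremented):
awk '
{
    $0 = $NF FS $0   # prepend the last field (the count) to the whole record
    NF--             # drop the now-duplicated last field (gawk rebuilds $0 here)
}
1                    # a true pattern with no action: print the modified record
'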

How to merge rows in a file based on common fields using awk?

I have a large tab delimited two column file that has the coordinates of many biochemical pathways like this:
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
I want to combine the lines where column 1 in one line is equal to column 2 in another line, resulting in the following output:
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
I would like to use something simple such as an awk one-liner; does anyone have any idea how I would approach this without writing a shell script? Any help is appreciated. I am trying to get each step and each subsequent step in each pathway. As these pathways often intersect, some steps are shared by other pathways, but I want to analyse each separately.
I have tried a shell script where I try to grep out any column where $2 = $1 later in the file:
while [ -s test ]; do
grep -m1 "^" test > i
cut -f2 i | sed 's/^/"/' | sed 's/$/"/' | sed "s/^/awk \'\$1 == /" | sed "s/$/' test >> i/" > i.sh
sh i.sh
perl -p -e 's/\n/\t/g' i >> OUT
sed '1d' test > i ; mv i test
done
I know that my problem comes from (a) deleting the line and (b) the fact that there are duplicates. I am just not sure how to tackle this.
Input
$ cat f
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
Output
$ awk '{
for(j=1; j<=NF; j+=2)
{
for(i=j;i<=NF;i+=2)
{
printf("%s%s", i==j ? $i OFS : OFS,$(i+1));
if($(i+1)!=$(i+2)){ print ""; break }
}
}
}' RS= OFS="\t" f
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
One liner
awk '{ for(j=1; j<=NF; j+=2)for(i=j;i<=NF;i+=2){printf("%s%s", i==j ? $i OFS : OFS,$(i+1)); if($(i+1)!=$(i+2)){ print ""; break }}}' RS= OFS="\t" f
Well, you could put this on one line, but I wouldn't recommend it :)
#!/usr/bin/awk -f
{
    a[NR] = $0
    for(i = 1; i < NR; i++){
        if(a[i] ~ $1"$")
            a[i] = a[i] FS $2
        if(a[i] ~ "^"$1){
            for(j = i; j < NR; j++){
                print a[j]
                delete a[j]
            }
        }
    }
}
END{
    for(i = 1; i <= NR; i++)
        if(a[i] != "")
            print a[i]
}
$ <f.txt tac | awk 'BEGIN{OFS="\t"}{if($2==c1){$2=$2"\t"c2};print $1,$2;c1=$1;c2=$2}' | tac
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
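This works by reading the file bottom-up with tac, so the continuation of each line has already been accumulated by the time the line is seen, and a second tac restores the original order. An expanded, commented form of the same one-liner:
tac f.txt | awk '
BEGIN { OFS = "\t" }
{
    if ($2 == c1)           # this line feeds into the previously seen line...
        $2 = $2 "\t" c2     # ...so append the chain already built for that line
    print $1, $2
    c1 = $1; c2 = $2        # remember this line for the next (earlier) input line
}' | tac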

Cartesian product of two files (as sets of lines) in GNU/Linux

How can I use shell one-liners and common GNU tools to concatenate lines in two files as in Cartesian product? What is the most succinct, beautiful and "linuxy" way?
For example, if I have two files:
$ cat file1
a
b
$ cat file2
c
d
e
The result should be
a, c
a, d
a, e
b, c
b, d
b, e
Here's a shell script to do it:
while read a; do while read b; do echo "$a, $b"; done < file2; done < file1
Though that will be quite slow.
I can't think of any precompiled logic to accomplish this.
The next step for speed would be to do the above in awk/perl.
awk 'NR==FNR { a[$0]; next } { for (i in a) print i",", $0 }' file1 file2
Hmm, how about this hacky solution to use precompiled logic?
paste -d, <(sed -n "$(yes 'p;' | head -n $(wc -l < file2))" file1) \
<(cat $(yes 'file2' | head -n $(wc -l < file1)))
There won't be a comma separator, but using only join:
$ join -j 2 file1 file2
a c
a d
a e
b c
b d
b e
The mechanical way to do it in shell, not using Perl or Python, is:
while read line1
do
while read line2
do echo "$line1, $line2"
done < file2
done < file1
The join command can sometimes be used for these operations - however, I'm not clear that it can do cartesian product as a degenerate case.
One step up from the double loop would be:
while read line1
do
sed "s/^/$line1, /" file2
done < file1
I'm not going to pretend this is pretty, but...
join -t, -j 9999 -o 2.1,1.1 /tmp/file1 /tmp/file2
(updated thanks to Iwan Aucamp below)
-- join (GNU coreutils) 8.4
Edit:
DVK's attempt inspired me to do this with eval:
script='1{x;d};${H;x;s/\n/\,/g;p;q};H'
eval "echo {$(sed -n $script file1)}\,\ {$(sed -n $script file2)}$'\n'"|sed 's/^ //'
Or a simpler sed script:
script=':a;N;${s/\n/,/g;b};ba'
which you would use without the -n switch.
which gives:
a, c
a, d
a, e
b, c
b, d
b, e
Original answer:
In Bash, you can do this. It doesn't read from files, but it's a neat trick:
$ echo {a,b}\,\ {c,d,e}$'\n'
a, c
a, d
a, e
b, c
b, d
b, e
More simply:
$ echo {a,b}{c,d,e}
ac ad ae bc bd be
A generic recursive Bash function could be something like this:
foreachline() {
    _foreachline() {
        if [ $# -lt 2 ]; then
            printf "$1\n"
            return
        fi
        local prefix=$1
        local file=$2
        shift 2
        while read line; do
            _foreachline "$prefix$line, " $*
        done <$file
    }
    _foreachline "" $*
}
foreachline file1 file2 file3
Regards.
Solution 1:
perl -e '{use File::Slurp; @f1 = read_file("file1"); @f2 = read_file("file2"); map { chomp; $v1 = $_; map { print "$v1,$_"; } @f2 } @f1;}'
Edit: Oops... Sorry, I thought this was tagged python...
If you have python 2.6:
from itertools import product
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))
a, c
a, d
a, e
b, c
b, d
b, e
If you have python pre-2.6:
def product(*args, **kwds):
    '''
    Source: http://docs.python.org/library/itertools.html#itertools.product
    '''
    # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
    # product(range(2), repeat=3) --> 000 001 010 011 100 101 110 111
    pools = map(tuple, args) * kwds.get('repeat', 1)
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))
A solution using join, awk and process substitution:
join <(xargs -I_ echo 1 _ < setA) <(xargs -I_ echo 1 _ < setB) |
awk '{ printf("%s, %s\n", $2, $3) }'
awk 'FNR==NR{ a[++d]=$1; next}
{
for ( i=1;i<=d;i++){
print $1","a[i]
}
}' file2 file1
# ./shell.sh
a,c
a,d
a,e
b,c
b,d
b,e
OK, this is a derivation of Dennis Williamson's solution above, since he noted that his does not read from files:
$ echo {`cat a | tr "\012" ","`}\,\ {`cat b | tr "\012" ","`}$'\n'
a, c
a, d
a, e
b, c
b, d
b, e
GNU Parallel:
parallel echo "{1}, {2}" :::: file1 :::: file2
Output:
a, c
a, d
a, e
b, c
b, d
b, e
Of course perl has a module for that:
#!/usr/bin/perl
use File::Slurp;
use Math::Cartesian::Product;
use v5.10;
$, = ", ";
@file1 = read_file("file1", chomp => 1);
@file2 = read_file("file2", chomp => 1);
cartesian { say @_ } \@file1, \@file2;
Output:
a, c
a, d
a, e
b, c
b, d
b, e
In fish it's a one-liner
printf '%s\n' (cat file1)", "(cat file2)
