Converting four columns to two using Linux commands - bash

I am wondering how one could merge four columns into two in the following manner (using the awk command, or other possible commands).
For example,
Old:
A B C D
E F G H
I J K L
M N O P
.
.
.
New:
A B
C D
E F
G H
I J
K L
M N
O P
.
.
Thanks so much!

That's actually quite easy with awk, as per the following transcript:
pax> cat inputFile
A B C D
E F G H
pax> awk '{printf "%s %s\n%s %s\n", $1, $2, $3, $4}' <inputFile
A B
C D
E F
G H
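If the input ever has more than four (but still an even number of) columns per line, a more general variant (my own sketch, not part of the original answer) loops over the fields in pairs:
awk '{ for (i = 1; i < NF; i += 2) print $i, $(i+1) }' inputFile
For the two-line inputFile above, this prints the same four output lines.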

How about using xargs here? Could you please try the following.
xargs -n 2 < Input_file
Output will be as follows.
A B
C D
E F
G H
I J
K L
M N
O P
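A pipeline in the same spirit (my own sketch, assuming single-space-separated fields with no trailing whitespace): split every field onto its own line with tr, then let paste rejoin them two per line:
tr ' ' '\n' < Input_file | paste - -
Note that paste joins each pair with a tab rather than a space.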

With GNU sed:
$ sed 's/ /\n/2' file
This replaces the 2nd space on each line with a newline.
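The \n in the replacement is a GNU extension; with a strictly POSIX sed you would embed an escaped literal newline instead (a hedged sketch):
sed 's/ /\
/2' file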

Related

Add a specific string at the end of each line

I have a mainfile with 4 columns, such as:
a b c d
e f g h
i j k l
In another file, I have one line of text corresponding to the respective line in the mainfile, which I want to add as a new column to the mainfile, such as:
a b c d x
e f g h y
i j k l z
Is this possible in bash? I can only add the same string to the end of each line.
Two ways you can do this:
1) paste file1 file2
2) Iterate over both files, combining them line by line and writing to a new file (see the sketch after this list).
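A minimal sketch of option 2 (my own, assuming both files have the same number of lines and are named file1 and file2 as above):
while read -r line1 <&3 && read -r line2 <&4; do
    printf '%s %s\n' "$line1" "$line2"    # append the extra column
done 3<file1 4<file2 > newfile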
You could use GNU parallel for that:
fe-laptop-m:test fe$ cat first
a b c d
e f g h
i j k l
fe-laptop-m:test fe$ cat second
x
y
z
fe-laptop-m:test fe$ parallel echo ::::+ first second
a b c d x
e f g h y
i j k l z
Did I understand correctly what you are trying to achieve?
This might work for you (GNU sed):
sed -E 's#(^.*) .*#/^\1/s/$/ &/#' file2 | sed -f - file1
This creates a sed script from file2 that uses a regexp to match a line in file1 and, if it matches, appends the contents of that line of file2 to the matched line.
N.B. This is independent of the order and length of file1.
You can try using pr (-m merges the files in parallel columns, -t omits headers, and -s' ' makes a single space the separator):
pr -mts' ' file1 file2

Linux Bash count and summarize by unique columns

I have a text file with lines like this (in Linux Bash):
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
I need to find the lines with unique values for the first 3 columns, print their count and then print summarized last column for each unique line, so the result is like this:
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
What I have tried:
cat FILE | sort -k1,3 | uniq -f3 -c | sort -k3,5nr
Does anyone have any advice?
Thanks in advance!
The easiest is to do the following:
awk '{key=$1 OFS $2 OFS $3; a[key]=a[key]","$4; c[key]++}
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
Note that this first version appends every $4 unconditionally, so the repeated H F S L input lines come out as 2 H F S L,L. If you do not want any duplication, you can do:
awk '{ key=$1 OFS $2 OFS $3; c[key]++ }
!gsub(","$4,","$4,a[key]) {a[key]=a[key]","$4; }
END{for(key in a) { print c[key],key,substr(a[key],2) }}' <file>
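As a sanity check (my own trace of the script, not a captured run), this deduplicating version applied to the sample input should yield the requested groups, although the line order is not guaranteed because for (key in a) is unordered:
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L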
Could you please try the following and let me know if it helps.
It prints the output in the same order in which each $1, $2, $3 combination first occurs in Input_file.
awk '
!a[$1,$2,$3]++{                                  # first time this $1 $2 $3 key is seen
b[++count]=$1 FS $2 FS $3                        # remember the key in input order
}
{
c[$1,$2,$3]=c[$1,$2,$3]?c[$1,$2,$3] "," $4:$0    # first hit stores the whole line, later hits append ,$4
d[$1 FS $2 FS $3]++                              # count occurrences per key
}
END{
for(i=1;i<=count;i++){                           # walk keys in first-seen order
print d[b[i]],c[b[i]]                            # count, then the accumulated line
}
}
' SUBSEP=" " Input_file                          # SUBSEP=" " makes the ($1,$2,$3) keys match the FS-joined ones
Another one, using GNU awk and 2D arrays to remove duplicates in $4:
$ awk '{
i=$1 OFS $2 OFS $3 # key to hash
a[i][$4] # store each $4 to separate element
c[i]++ # count key references
}
END {
for(i in a) {
k=1 # comma counter for output
printf "%s %s ",c[i],i # output count and key
for(j in a[i]) # each a[]i[j] element
printf "%s%s",((k++)==1?"":","),j # output commas and elements
print "" # line-ending
}
}' file
Output in default random order:
2 E F G N,P
3 A B C Q,D,J
1 G Y F Q
1 E B C P
2 H F S L
Since we are using GNU awk, the output order can easily be controlled by adding BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"} to the script:
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L
You could utilize GNU datamash:
$ cat input
A B C D
A B C J
E B C P
E F G N
E F G P
A B C Q
H F S L
G Y F Q
H F S L
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input
A B C D,J,Q 3
E B C P 1
E F G N,P 2
G Y F Q 1
H F S L 2
This unfortunately outputs the count as the last column. If it is absolutely necessary for it to be the first column, you will have to reformat it; the awk below moves the last field to the front and then drops the trailing copy by decrementing NF (GNU awk behaviour):
$ datamash -t' ' --sort groupby 1,2,3 unique 4 count 4 < input | awk '{$0=$NF FS $0; NF--}1'
3 A B C D,J,Q
1 E B C P
2 E F G N,P
1 G Y F Q
2 H F S L

How to merge rows in a file based on common fields using awk?

I have a large tab delimited two column file that has the coordinates of many biochemical pathways like this:
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
I want to combine the lines where column 1 of one line is equal to column 2 of another line, resulting in the following output:
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
I would like to use something simple such as an awk one-liner; does anyone have any idea how I would approach this without writing a shell script? Any help is appreciated. I am trying to get each step and each subsequent step in each pathway. As these pathways often intersect, some steps are shared by other pathways, but I want to analyse each separately.
I have tried a shell script where, for each line, I try to grep out the lines later in the file whose $1 equals the current line's $2:
while [ -s test ]; do
grep -m1 "^" test > i
cut -f2 i | sed 's/^/"/' | sed 's/$/"/' | sed "s/^/awk \'\$1 == /" | sed "s/$/' test >> i/" > i.sh
sh i.sh
perl -p -e 's/\n/\t/g' i >> OUT
sed '1d' test > i ; mv i test
done
I know that my problem comes from (a) deleting the line and (b) the fact that there are duplicates. I am just not sure how to tackle this.
Input
$ cat f
A B
B D
D F
F G
G I
A C
C P
P R
A M
M L
L X
Output
$ awk '{
for(j=1; j<=NF; j+=2)
{
for(i=j;i<=NF;i+=2)
{
printf("%s%s", i==j ? $i OFS : OFS,$(i+1));
if($(i+1)!=$(i+2)){ print ""; break }
}
}
}' RS= OFS="\t" f
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X
One-liner
awk '{ for(j=1; j<=NF; j+=2)for(i=j;i<=NF;i+=2){printf("%s%s", i==j ? $i OFS : OFS,$(i+1)); if($(i+1)!=$(i+2)){ print ""; break }}}' RS= OFS="\t" f
Setting RS= puts awk in paragraph mode, so the whole blank-line-free file is read as one record; the nested loops then walk its fields two at a time, extending each chain while the next pair starts with the current pair's second element.
Well, you could put this on one line, but I wouldn't recommend it :)
#!/usr/bin/awk -f
{
a[NR] = $0
for(i = 1; i < NR; i++){
if(a[i] ~ $1"$")
a[i] = a[i] FS $2
if(a[i] ~ "^"$1){
for(j = i; j < NR; j++){
print a[j]
delete a[j]
}
}
}
}
END{
for(i = 1; i <= NR; i++)
if(a[i] != "")
print a[i]
}
Reversing the file with tac lets each line pick up the chain already built for the line that follows it in the original order; a second tac restores that order:
$ <f.txt tac | awk 'BEGIN{OFS="\t"}{if($2==c1){$2=$2"\t"c2};print $1,$2;c1=$1;c2=$2}' | tac
A B D F G I
B D F G I
D F G I
F G I
G I
A C P R
C P R
P R
A M L X
M L X
L X

Merging two outputs in shell script

I have the output of 2 commands, like this:
Output of the first command:
A B
C D
E F
G H
Output of the second command:
I J
K L
M B
I want to merge both outputs, and if a value in the second column is the same in both outputs, I'll take the entry from the 1st output.
So, my output should be:
A B
C D
E F
G H
I J
K L
// not taking (M B) since B is already there in the first entry (A B), so giving preference to the first output
Can I do this using a shell script? Is there a command for it?
You can use awk:
awk 'FNR==NR{a[$2];print;next} !($2 in a)' file1 file2
A B
C D
E F
G H
I J
K L
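For readability, here is the same one-liner written out with comments (identical logic, nothing new added):
awk 'FNR==NR {        # true only while reading file1
         a[$2]        # remember its 2nd-column values
         print        # and print every file1 line
         next
     }
     !($2 in a)       # file2: print only lines whose 2nd column was not seen in file1
' file1 file2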
If the order of entries is not important, you can sort on the 2nd column and uniquefy:
sort -u -k2 file1 file2
Both -u and -k are specified in the POSIX standard
This wouldn't work if there are repeated entries in the 2nd column of file1.

File Manipulation Loop

OK, this question is two-fold: 1) the actual file manipulation bit, and 2) looping this manipulation in Unix.
Part 1)
I have two files:
File_1
a b
c d
e f
g h
and File_2
A B
C D
E F
G H
I J
I would like to get (in the first instance) the following result:
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
...and save this output to outfile1.
I gather I would have to use things like awk, cut and/or paste but I can't manage to put it all together.
Part 2)
I then want to loop this manipulation for all rows in File_2 (note that the number of rows in File_1 is not the same as in File_2), such that I end up with 5 output files, where outfile2 would be:
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
and outfile3 would be:
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
etc.
At the moment I'm working in bash. Thank you in advance for any help!
This can be done with bash redirection:
i=1
while read f2; do
while read f1; do
echo "$f1"
echo "$f2"
echo ">"
done < File_1 | head -n -1 > output$i
(( i++ ))
done < File_2
head -n -1 avoids having a lone delimiter at the end of each output$i file.
To make outfile_3 (with E F), for example:
x=$(sed -n '3p' File_2)
awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$x\" }" File_1 > outfile_3
In the first line, sed -n '3p' prints only the 3rd line of File_2.
Now let's do it in a loop:
(( i = 1 ))
while read line
do
awk "{ printf \"%s\\n%s\\n>\\n\", \$0, \"$line\" }" File_1 > "outfile_$i"
(( i++ ))
done < File_2
My first thought was:
sort -m -f file1 file2 | uniq -i --all-repeated=separate
That looked rather close. However, on second reading I think you want something more like this Perl script:
use strict;
use warnings;
open(my $FILE1, '<file1') or die;
my $output = 0;
while (my $a = <$FILE1>)
{
$output++;
open(my $OUT, ">output$output");
open(my $FILE2, '<file2') or die;
print $OUT "$a$_---\n" foreach (<$FILE2>);
close $FILE2;
close $OUT;
}
close $FILE1;
This creates output files output1, output2, output3, ..., as many as there are lines in file1.
An awk one-liner:
awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){
printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
test:
kent$ head f1 f2
==> f1 <==
a b
c d
e f
g h
==> f2 <==
A B
C D
E F
G H
I J
kent$ awk 'NR==FNR{a[NR]=$0;l=NR;next;} {b[FNR]=$0;}
END{f=1; for(x=1;x<=FNR;x++){for(i=1;i<=length(a);i++){printf "%s\n%s\n%s\n", a[i],b[x],">" > "output"f }f++;}}' f1 f2
kent$ head -30 out*
==> output1 <==
a b
A B
>
c d
A B
>
e f
A B
>
g h
A B
>
==> output2 <==
a b
C D
>
c d
C D
>
e f
C D
>
g h
C D
>
==> output3 <==
a b
E F
>
c d
E F
>
e f
E F
>
g h
E F
>
==> output4 <==
a b
G H
>
c d
G H
>
e f
G H
>
g h
G H
>
==> output5 <==
a b
I J
>
c d
I J
>
e f
I J
>
g h
I J
>
