bash: all combinations of lines

I have the following file (this is semicolon delimited; the real file is tab-delimited)
abc;173959;172730
def;4186657;4187943
ghi;4703911;4702577
jkl;2243551;2242259
and I want to combine each line with each, so that my output would be:
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259
The order is not important.
I came up with the following awk-solution:
awk '{ a[$0] } END { for (i in a){ for (j in a){if (i != j) print (i "\t" j) } } }' file
But this prints me the combinations in both directions, so for example
abc;173959;172730;def;4186657;4187943
def;4186657;4187943;abc;173959;172730
Because I am pretty unfamiliar with python or perl, I kindly ask for a solution using awk/bash etc.

In awk:
$ awk '{ a[$0] }
END {
for(i in a) {
delete a[i] # new place for delete
for(j in a)
if(i!=j)
print i ";" j
# delete a[i] # previous and maybe wrong place
}
}' file
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;abc;173959;172730
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;abc;173959;172730
ghi;4703911;4702577;jkl;2243551;2242259
abc;173959;172730;jkl;2243551;2242259
Unfortunately the order is random.
Another way, which restores the input order and does not modify the array a while iterating over it (see comments), is:
$ awk '{ a[NR]=$0 } # index on NR
END {
for(i=1;i<=NR;i++)
for(j=i+1;j<=NR;j++) # j=i+1 is the magic
print a[i] ";" a[j]
}' file
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259
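Since the question mentions that the real file is tab-delimited, the NR-indexed loop above could be wrapped roughly like this. It is only a sketch: the function name pairs and the variable sep are made up for illustration, and it assumes the whole input fits in memory.

```shell
# Hypothetical helper: print every unordered pair of input lines once,
# joined by a tab (the question says the real file is tab-delimited).
pairs() {
  awk -v sep='\t' '
    { line[NR] = $0 }                   # remember each line under its line number
    END {
      for (i = 1; i < NR; i++)          # first member of the pair
        for (j = i + 1; j <= NR; j++)   # second member: later lines only
          print line[i] sep line[j]
    }' "$@"
}

printf 'a\t1\nb\t2\nc\t3\n' | pairs
```

Starting the inner loop at i+1 is what avoids both the self-pairs and the mirrored duplicates, so no i != j test is needed.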

This awk should work as well:
awk -F ';' 'NR==FNR{a[++k]=$0; next} {for (i=FNR+1; i<=k; i++) print $0 FS a[i]}' file{,}
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259

Could you please try the following? It keeps the pairs in the same order as the lines of Input_file and reads Input_file only once.
awk '{a[FNR]=$0} END{j=1;while(length(a)>=++k){for(q=j+1;q<=FNR;q++){print a[j]";"a[q]}j++};}' Input_file
OR
awk '
{
a[FNR]=$0
}
END{
j=1;
while(length(a)>=++k){
for(q=j+1;q<=FNR;q++){
print a[j]";"a[q]
}
j++
}
}
' Input_file
Output will be as follows.
abc;173959;172730;def;4186657;4187943
abc;173959;172730;ghi;4703911;4702577
abc;173959;172730;jkl;2243551;2242259
def;4186657;4187943;ghi;4703911;4702577
def;4186657;4187943;jkl;2243551;2242259
ghi;4703911;4702577;jkl;2243551;2242259

Related

awk to get value for a column of next line and add it to the current line in shellscript

I have a delimited file, let's say it is called lines:
cat lines
1:abc
6:def
17:ghi
21:tyu
I wanted to achieve something like this
1:6:abc
6:17:def
17:21:ghi
21::tyu
I tried the code below, but it didn't work:
awk 'BEGIN{FS=OFS=":"}NR>1{nln=$1;cl=$2}NR>0{print $1,nln,$2}' lines
1::abc
6:6:def
17:17:ghi
21:21:tyu
Can you please help ?

Here is a potential AWK solution:
cat lines
1:abc
6:def
17:ghi
21:tyu
awk -F":" '{num[NR]=$1; letters[NR]=$2}; END{for(i=1;i<=NR;i++) print num[i] ":" num[i + 1] ":" letters[i]}' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Formatted:
awk '
BEGIN {FS=OFS=":"}
{
num[NR] = $1;
letters[NR] = $2
}
END {for (i = 1; i <= NR; i++)
print num[i], num[i + 1], letters[i]
}
' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu

Basically this is your solution but I switched the order of the code blocks and added the END block to output the last record, you were close:
awk 'BEGIN{FS=OFS=":"}FNR>1{print p,$1,q}{p=$1;q=$2}END{print p,"",q}' file
Explained:
$ awk 'BEGIN {
FS=OFS=":" # delims
}
FNR>1 { # all but the first record
print p,$1,q # output $1 and $2 from the previous round
}
{
p=$1 # store for the next round
q=$2
}
END { # gotta output the last record in the END
print p,"",q # "" feels like cheating
}' file
Output:
1:6:abc
6:17:def
17:21:ghi
21::tyu

1st solution: Here is a tac + awk + tac solution. Written and tested with shown samples only.
tac Input_file |
awk '
BEGIN{
FS=OFS=":"
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2)
}
{
prev=$1
}
1
' | tac
Explanation: Adding detailed explanation for above code.
tac Input_file | ##Printing lines from bottom to top of Input_file.
awk ' ##Getting input from previous command as input to awk.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS as colon here.
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2) ##If prev is set, prepend prev and OFS to $2; otherwise prepend only OFS to $2.
}
{
prev=$1 ##Setting prev to $1 value here.
}
1 ##printing current line here.
' | tac ##Sending awk output to tac to make it in actual sequence.
2nd solution: An awk-only solution that reads Input_file twice.
awk '
BEGIN{
FS=OFS=":"
}
FNR==NR{
if(FNR>1){
arr[FNR-1]=$1
}
next
}
{
$2=(FNR in arr)?(arr[FNR] OFS $2):OFS $2
}
1
' Input_file Input_file

print columns based on column name

Let's say I have a file test.txt that contains
a,b,c,d,e
1,2,3,4,5
6,7,8,9,10
I want to print out columns based on matching column names, either from another text file or from an array. So for example if I was given
arr=(a b c)
I want my output to then be
a,b,c
1,2,3
6,7,8
How can I do this with bash utilities/awk/sed? My actual text file is 3GB (and the line I want to match column values on is actually line 3), so efficient solutions are appreciated. This is what I have so far:
for j in "${arr[@]}"; do awk -F ',' -v a=$j '{ for(i=1;i<=NF;i++) {if($i==a) {print $i}}}' test.txt; done
but the output I get is
a
b
c
which is not only missing the other rows, but also prints each column name on a line of its own.

With your shown samples, please try the following. The code reads two files: another_file.txt (which contains a b c, as in the samples) and the actual input file test.txt (which contains all the values).
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[$i]
}
next
}
FNR==1{
for(i=1;i<=NF;i++){
if($i in arr){
valArr[i]
header=(header?header OFS:"")$i
}
}
print header
next
}
{
val=""
for(i=1;i<=NF;i++){
if(i in valArr){
val=(val?val OFS:"")$i
}
}
print val
}
' another_file.txt FS="," OFS="," test.txt
Output will be as follows:
a,b,c
1,2,3
6,7,8
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE while reading another_file.txt here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
arr[$i] ##Creating arr with index of current field value.
}
next ##next will skip all statements from here.
}
FNR==1{ ##Checking if this is 1st line for test.txt file.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if($i in arr){ ##If current field values comes in arr then do following.
valArr[i] ##Creating valArr which has index of current field number.
header=(header?header OFS:"")$i ##Creating header which has each field value in it.
}
}
print header ##Printing header here.
next ##next will skip all statements from here.
}
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if(i in valArr){ ##Checking if i is present in valArr then do following.
val=(val?val OFS:"")$i ##Creating val which has current field value.
}
}
print val ##printing val here.
}
' another_file.txt FS="," OFS="," test.txt ##Mentioning Input_file names here.

Here is how you can do this in a single pass awk command:
arr=(a c e)
awk -v cols="${arr[*]}" 'BEGIN {FS=OFS=","; n=split(cols, tmp, / /); for (i=1; i<=n; ++i) hdr[tmp[i]]} NR==1 {for (i=1; i<=NF; ++i) if ($i in hdr) hnum[i]} {for (i=1; i<=NF; ++i) if (i in hnum) {printf "%s%s", (f ? OFS : ""), $i; f=1} f=0; print ""}' file
a,c,e
1,3,5
6,8,10
A more readable form:
awk -v cols="${arr[*]}" '
BEGIN {
FS = OFS = ","
n = split(cols, tmp, / /)
for (i=1; i<=n; ++i)
hdr[tmp[i]]
}
NR == 1 {
for (i=1; i<=NF; ++i)
if ($i in hdr)
hnum[i]
}
{
for (i=1; i<=NF; ++i)
if (i in hnum) {
printf "%s%s", (f ? OFS : ""), $i
f = 1
}
f = 0
print ""
}' file
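The question mentions that in the real 3GB file the column names sit on line 3. A possible adaptation is sketched below. The helper name select_cols and the variables hdrline, want and keep are made up for illustration, and skipping the lines before the header is an assumption about what is wanted. It stores only the matching column numbers, so each data row loops over the selected columns rather than over every field.

```shell
# Hypothetical helper: select_cols HEADER_LINE "name1 name2 ..." [file...]
select_cols() {
  hdrline=$1; cols=$2; shift 2
  awk -v hdrline="$hdrline" -v cols="$cols" '
    BEGIN {
      FS = OFS = ","
      n = split(cols, tmp, " ")          # wanted names, space separated
      for (i = 1; i <= n; i++) want[tmp[i]]
    }
    NR < hdrline { next }                # assumption: drop lines before the header
    NR == hdrline {
      for (i = 1; i <= NF; i++)
        if ($i in want) keep[++k] = i    # matching column numbers, in file order
    }
    {
      out = ""
      for (i = 1; i <= k; i++)
        out = out (i > 1 ? OFS : "") $(keep[i])
      print out
    }' "$@"
}

arr=(a c e)
printf 'x\ny\na,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n' | select_cols 3 "${arr[*]}"
```

As in the answer above, the columns come out in the order they appear in the file, not in the order of the array.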

Define field with FPAT

I am trying to split data into fields in awk, but I can't come up with the right regex using FPAT.
I have tried:
echo 'C002 2019-06-28;16:03;approved;content=L1-34,EE;not taken;;1024 ' | awk 'BEGIN {FPAT = "([^ ]+) +[^ ]+|;"} {print "f1:"$1;print "f2:"$2;print "f3:"$3;print "f6:"$6;print "f7:"$7}'
Expected result:
f1:C002
f2:2019-06-28
f3:16:03
f6:not taken
f7:

There is no simple way to tell a space that separates fields from a space inside a field.
You need to do as David writes: split on ; and then split the first field on whitespace.
awk -F";" '{split($1,a,"[ \t]+");print "a[1]---"a[1]"\na[2]---"a[2];for (i=1;i<=NF;i++) print i"---"$i}'
a[1]---C002
a[2]---2019-06-28
1---C002 2019-06-28
2---16:03
3---approved
4---content=L1-34,EE
5---not taken
6---
7---1024

A bit similar to the answer of Jotne, but you could write a function to split the record according to your wishes:
awk 'function split_record(string,f, t,n,m) {
n=split(string,t,";"); m=split(t[1],f,"[ \t]+")
for(i=2;i<=n;++i) f[m+i-1]=t[i]
return m+n-1
}
{ split_record($0,f) }
{print "f1:"f[1];print "f2:"f[2];print "f3:"f[3];print "f6:"f[6];print "f7:"f[7]}'
This returns:
f1:C002
f2:2019-06-28
f3:16:03
f6:not taken
f7:
You can update the split record in any way you like.

awk '
BEGIN { FS=OFS=";" }
{
split($1,a,/[[:space:]]+/)
$1 = ""
$0 = a[1] FS a[2] $0
for (i=1; i<=NF; i++) {
print "f" i ":" $i
}
}
' file
f1:C002
f2:2019-06-28
f3:16:03
f4:approved
f5:content=L1-34,EE
f6:not taken
f7:
f8:1024

Spreading cell values into columns using UNIX

Suppose we have this file:
head file
id,name,value
1,Je,1
2,Je,1
3,Ko,1
4,Ne,1
5,Ne,1
6,Je,1
7,Ko,1
8,Ne,1
9,Ne,1
And I'd like to get this out:
id,Je,Ko,Ne
1,1,0,0
2,1,0,0
3,0,1,0
4,0,0,1
5,0,0,1
6,1,0,0
7,0,1,0
8,0,0,1
9,0,0,1
Does someone know how to get this output, using awk or sed?

Assuming that the possible values of name are only Je or Ko or Ne, you can do:
awk -F, 'BEGIN{print "id,Je,Ko,Ne"}
NR==1{ next }
{je=$2=="Je"?"1":"0";
ko=$2=="Ko"?"1":"0";
ne=$2=="Ne"?"1":"0";
print $1","je","ko","ne}' file
If you want something that will print the values in the same order they are read and not limited to your example fields, you could do:
awk -F, 'BEGIN{OFS=FS; x=1;y=1}
NR==1 { next }
!($2 in oa){ oa[$2]=1; ar[x++]=$2}
{lines[y++]=$0;}
END{
s="";
for (i=1; i<x; i++)
s=s==""?ar[i]:s OFS ar[i];
print "id" OFS s;
for (j=1; j<y; j++){
split(lines[j], a)
s=""
for (i=1; i<x; i++) {
tt=ar[i]==a[2]?"1":"0"
s=s==""?tt:s OFS tt;
}
print a[1] OFS s;
}
}
' file

Here's a "two-pass solution" (along the lines suggested by @Drakosha) implemented using a single invocation of awk. The implementation would be a little simpler if there was no requirement regarding the ordering of names.
awk -F, '
# global: n, array a
function println(ix,name,value, i,line) {
line=ix;
for (i=0;i<n;i++) {
if (a[i]==name) {line=line OFS value} else {line=line OFS 0}
}
print line;
}
BEGIN {OFS=FS; n=0}
FNR==1 {next} # skip the header each time
NR==FNR {if (!mem[$2]) {mem[$2] = a[n++] = $2}; next}
!s { s="id"; for (i=0;i<n;i++) {s=s OFS a[i]}; print s}
{println($1, $2, $3)}
' file file

I suggest 2 passes.
1st will generate all the possible values of column 2 (Je, Ko, Ne, ...).
2nd will be able to trivially generate the output you are looking for.

awk -F, 'BEGIN{s="Je,Ko,Ne";print "id,"s}
NR>1 {m=s; sub($2,1,m); gsub("[^0-9,]+","0",m); print $1","m}' file

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (starting from 1) contains several comma-separated identifiers. Some of them, however, can occur more than once in the same row:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a white-space after the comma, some where it is not).
I would like to only get the unique identifiers and remove the multiples for each row in the 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over each row at a time, cut out the 12th column, replace all commas with newlines and then sort and take uniq to get rid of duplicates, paste it back and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
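For reference, the pipeline idea described above does work for a single cell once two details are changed: tr (or sed with the g flag) has to replace every comma, not only the first one, and the optional space after a comma has to be stripped before sorting. A minimal sketch, where uniq_cell is a hypothetical helper name and the cell is passed as an argument:

```shell
# Hypothetical helper: de-duplicate one comma-separated cell.
uniq_cell() {
  printf '%s\n' "$1" |
    tr ',' '\n' |      # one identifier per line (every comma, not just the first)
    sed 's/^ *//' |    # drop the optional space that followed a comma
    sort -u |          # sort and remove duplicates
    paste -sd, -       # join the lines back together with commas
}

uniq_cell 'GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281'
```

Running a pipeline like this once per row starts several processes per line, which is slow on a large file. That is one reason to do the whole job inside a single awk or perl process instead.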

Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=();@k=split/,/,$F[11];@u{@k}=@k;$F[11]=join",",sort keys%u;print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,/, $F[11]; # Split 12th field (1st is 0) on comma into array @k
@u{@k} = @k; # Copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line

If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.

Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
.
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
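A sketch of the same script with the portable split("", seen) idiom in place of delete seen. The wrapper name dedup_field and the variables n, cnt and out are made up here, and the field number is passed as a parameter so that it can be used for field 12 as well as field 2:

```shell
# Hypothetical helper: dedup_field N [file...]
# De-duplicates the comma-separated list in field N of tab-separated input.
dedup_field() {
  n=$1; shift
  awk -v n="$n" '
    BEGIN { FS = OFS = "\t" }
    {
      cnt = split($n, f, / *, */)        # tokens, with spaces around commas dropped
      split("", seen)                    # portable way to empty the array
      out = ""
      for (i = 1; i <= cnt; i++)
        if (!seen[f[i]]++)               # true only the first time a token shows up
          out = out (out == "" ? "" : ",") f[i]
      $n = out
      print
    }' "$@"
}

printf 'a,a,a\tGO:0042302, GO:0042302, GO:0042302\tb,b,b\n' | dedup_field 2
```

The identifiers keep the order of their first occurrence, and nothing is sorted.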

Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen;                      # forget the identifiers of the previous row
s="";                             # start a fresh result for this row
n=split($12, a, /[,; ]+/);
for (i=1; i<=n; i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s=s (s=="" ? "" : ",") a[i]       # add a comma only between identifiers
}
}
$12=s} 1' file
GO:0042302
GO:0004386,GO:0005524,GO:0006281

In your example data, the comma followed by a space is the delimiter within the 12th field. Every subfield after that is merely a repeat of the first subfield. The subfields appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" '{sub(/, .*/, "", $12)} 1' fileA
If instead you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
then the field has to be de-duplicated and sorted properly. If you are stuck with the default macOS awk, you could add sort/uniq functions in an executable awk script:
#!/usr/bin/awk -f
BEGIN {FS=OFS="\t"} # keep tabs when the record is rebuilt
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s # write the result back to the 12th field
}
1 # print out the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281
