I have a small bioinformatics problem, related to "genotype phasing", that I think should be easy to solve, but I'm not sure how to tackle it.
In the extract below, the first column contains identifiers and the subsequent columns are binary genotypes labelled "a" or "b". "-" means a missing value.
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
Si_gnF.scaffold10533.76297bp_tag414484 a b b a a b a b b b a b a a b a b b a - b b a a
Si_gnF.scaffold10533.98416bp_tag414526 a b b a a b a b b b a b a a b a b b a a b b a a
Si_gnF.scaffold10534.48805bp_tag414546 b a a b a b a b b b b b b a a a a b a b b b b a
Si_gnF.scaffold10535.1091787bp_tag414684 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1151107bp_tag414765 b b b a a b b b a b a - b b b a a a b b a a b b
Si_gnF.scaffold10535.1220879bp_tag414877 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1304464bp_tag414988 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1347462bp_tag415047 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1379804bp_tag415090 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1540335bp_tag415345 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1585442bp_tag415410 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1609908bp_tag415431 b b b a a b a b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1711158bp_tag415567 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1744394bp_tag415609 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1751886bp_tag415620 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1752774bp_tag415622 a a a b b a a a b a b a a a a b b b a a b b a a
Si_gnF.scaffold10535.1789478bp_tag415675 b b - a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1800135bp_tag415687 b b b a a b b b a b a b b b b a a a b b a a b b
Si_gnF.scaffold10535.1885424bp_tag415814 a a a b b a a a b a b a a a a b b b a a b b a a
Basically, I want to minimize the number of differences between lines. (I cannot edit individual columns, but can flip the labels on whole lines). The result for the first four lines would be this:
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
Si_gnF.scaffold10533.76297bp_tag414484 b a a b b a b a a a b a b b a b a a b - a a b b <-- this one flipped
Si_gnF.scaffold10533.98416bp_tag414526 b a a b b a b a a a b a b b a b a a b b a a b b <-- this one flipped
Si_gnF.scaffold10533.53688bp_tag414456 b a a b b a b a a a b a b b a b a a b b a a b b
As a first step I'll need to make pairwise comparisons. But what is a good way of quantifying the differences, so that I know for which lines labels must be flipped? (2 consecutive lines rarely match 100%; there can be multiple (even many) mismatches as well as missing values).
(ideally in ruby or R)
You can use the Levenshtein algorithm to quantify the difference between two strings. One way to do it:
require 'text' # See http://rubygems.org/gems/text
lines # => an array with each line
def compare(line1, line2)
  # Drop the identifier (first field) and compare only the genotype calls.
  Text::Levenshtein.distance(line1.split.drop(1).sort.join,
                             line2.split.drop(1).sort.join)
end
compare(lines[0], lines[1]) # => 1 (one value different)
(If "a b a a" is not equal to "a a a b", remove the sort from the method.)
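For the flipping decision itself, a position-wise mismatch count may be easier to reason about than an edit distance. The sketch below is not the Levenshtein approach above: it counts positions where two lines disagree, skips any position where either line has "-", and flips a line when the flipped version disagrees less with the previous (already processed) line. The names `mismatches`, `flip` and `phase` are made up for this illustration, and comparing only against the previous line is a simplifying assumption.

```ruby
# Swap table for the two labels; "-" (missing) is left untouched.
FLIP = { "a" => "b", "b" => "a" }.freeze

# Number of positions where both calls are present and differ.
def mismatches(calls1, calls2)
  calls1.zip(calls2).count { |x, y| x != "-" && y != "-" && x != y }
end

# Flip the labels of a whole line.
def flip(calls)
  calls.map { |c| FLIP.fetch(c, c) }
end

# rows: array of [identifier, array_of_calls] pairs, in file order.
# Each line is compared with the previous phased line and flipped
# if that reduces the number of mismatches.
def phase(rows)
  phased = []
  rows.each do |id, calls|
    prev = phased.last
    if prev && mismatches(prev[1], flip(calls)) < mismatches(prev[1], calls)
      calls = flip(calls)
    end
    phased << [id, calls]
  end
  phased
end

rows = [
  ["m1", %w[b a a b]],
  ["m2", %w[a b - a]],
  ["m3", %w[a b b a]]
]
phase(rows).each { |id, calls| puts "#{id} #{calls.join(' ')}" }
```

Lines from a file could be turned into the expected input with something like `line.split` and taking the first field as the identifier.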
I'm trying to output something that resembles ls output. The ls command outputs like this:
file1.txt file3.txt file5.txt
file2.txt file4.txt
But I want this sample list:
a b c d e f g h i j k l m n o p q r s t u v w x y z
to appear as:
a e i m q u y
b f j n r v z
c g k o s w
d h l p t x
In that case it gives 7 columns, which is fine, since I want at most 8 columns. Next, the following list:
a b c d e f g h i j k l m n o p q r s t u v w
will have to show as:
a d g j m p s v
b e h k n q t w
c f i l o r u
And "a b c d e f g h" will have to show as is because it is already 8 columns in 1 line, but:
a b c d e f g h i
will show as:
a c e g i
b d f h
And:
a b c d e f g h i j
will show as:
a c e g i
b d f h j
One way:
#!/usr/bin/env tclsh
proc columnize {lst {columns 8}} {
    set len [llength $lst]
    set nrows [expr {int(ceil($len / (0.0 + $columns)))}]
    set cols [list]
    for {set n 0} {$n < $len} {incr n $nrows} {
        lappend cols [lrange $lst $n [expr {$n + $nrows - 1}]]
    }
    for {set n 0} {$n < $nrows} {incr n} {
        set row [list]
        foreach col $cols {
            lappend row [lindex $col $n]
        }
        puts [join $row " "]
    }
}
columnize {a b c d e f g h i j k l m n o p q r s t u v w x y z}
puts ----
columnize {a b c d e f g h i j k l m n o p q r s t u v w}
puts ----
columnize {a b c d e f g h}
puts ----
columnize {a b c d e f g h i}
puts ----
columnize {a b c d e f g h i j}
The columnize function first figures out how many rows are needed with a simple division of the length of the list by the number of columns requested, then splits the list up into chunks of that length, one per column, and finally iterates through those sublists extracting the current row's element for each column, and prints the row out as a space-separated list.
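For comparison, the same three steps (row count, chunk per column, read across the chunks) can be sketched in Ruby. This is a loose translation, not the Tcl code itself; the method name `columnize` and the `max_columns` parameter simply mirror the proc above, and it returns the rows as strings instead of printing them.

```ruby
# Lay out items column by column, using at most max_columns columns.
# Returns an array of row strings.
def columnize(items, max_columns = 8)
  return [] if items.empty?

  # Rows needed: length divided by the column limit, rounded up.
  nrows = items.length.fdiv(max_columns).ceil

  # One chunk of nrows items per column (the last chunk may be shorter).
  cols = items.each_slice(nrows).to_a

  # Row r takes the r-th element of every column; short columns give nil,
  # which compact drops.
  (0...nrows).map do |r|
    cols.map { |col| col[r] }.compact.join(" ")
  end
end

puts columnize(("a".."z").to_a)
puts "----"
puts columnize(("a".."i").to_a)
```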
How do I perform set operations on two sets that contain structured data?
e.g.
set(set(<a b c>), set(<d e f>)) ⊆ set(set(<a b c>), set(<d e f>), set(<g h i>)) # True
set(set(<a b c>), set(<d e f>)) eq set(set(<a b c>), set(<d e f>), set(<g h i>)) # False
set(set(<a b c>), set(<d e f>)) ∩ set(set(<a b c>), set(<d e f>), set(<g h i>)) # set(set(<a b c>), set(<d e f>))
Regardless of values in a Set, you can use the eqv operator to find out if they are the same:
$ raku -e 'say <a b c>.Set eqv <c b a>.Set'
True
$ raku -e 'say <a b c>.Set eqv <d b a>.Set'
False
$ raku -e 'say set(<a b c>.Set,<a b d>.Set) eqv set(<d b a>.Set,<c b a>.Set)'
True
Algorithm for finding the First sets:
Given a grammar with the rules A1 → w1, ..., An → wn, we can compute the Fi(wi) and Fi(Ai) for every rule as follows:
1. initialize every Fi(Ai) with the empty set
2. compute Fi(wi) for every rule Ai → wi, where Fi on strings is defined as follows:
   - Fi(a w') = { a } for every terminal a
   - Fi(A w') = Fi(A) for every nonterminal A with ε not in Fi(A)
   - Fi(A w') = (Fi(A) \ { ε }) ∪ Fi(w') for every nonterminal A with ε in Fi(A)
   - Fi(ε) = { ε }
3. add Fi(wi) to Fi(Ai) for every rule Ai → wi
4. repeat steps 2 and 3 until all Fi sets stay the same.
Grammar:
A -> B C c
A -> g D B
B -> EPSILON | b C D E
C -> D a B | c a
D -> EPSILON | d D
E -> g A f | c
This website generates the first set as follows:
Non-Terminal Symbol    First Set
A                      g, ε, b, a, c, d
B                      ε, b
C                      a, c, ε, d
D                      ε, d
E                      g, c
But the algorithm says Fi(A w' ) = Fi(A) for every nonterminal A with ε not in Fi(A) so the First(A) according to this algorithm should not contain ε. First(A) = {g, b, a, c, d}.
Q: Is First(A) for the above grammar equal to (First(B) - ε) ∪ First(C) ∪ {g}?
This video also follows the above algorithm and does not include ε.
First(B) = {ε, b}
First(D) = {ε, d}
First(E) = {g, c}
First(C) = {c, d, a}
First(A) = {b, g, c, d, a}
Example:
X -> Y a | b
Y -> c | ε
First(X) = {c, a, b}
First(Y) = {c, ε}
First(X) doesn't have ε because if you replace Y by ε, then First(Y a) is equal to First(ε a) = {a}
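The fixed-point algorithm quoted in the question can be sketched in Ruby as follows. This is a minimal illustration, not the linked implementation: the grammar encoding (a hash from nonterminal to a list of right-hand sides, with an empty array standing for an ε-production), the `EPS` marker, and the method names are all assumptions made for this example. Run on the question's grammar it yields the sets listed above, with no ε in First(A).

```ruby
require 'set'

EPS = :eps # stands for ε

# First set of a string of symbols, given the current First sets.
# Any symbol that is not a key of the grammar is treated as a terminal.
def first_of_string(symbols, first, grammar)
  result = Set.new
  symbols.each do |sym|
    unless grammar.key?(sym)
      result << sym                      # Fi(a w') = { a }
      return result
    end
    result.merge(first[sym] - [EPS])     # Fi(A) without ε
    return result unless first[sym].include?(EPS)
    # A can derive ε, so continue with the rest of the string (w').
  end
  result << EPS                          # empty string, or every symbol nullable
end

# Repeat steps 2 and 3 until no First set grows any more.
def first_sets(grammar)
  first = grammar.keys.map { |nt| [nt, Set.new] }.to_h
  loop do
    changed = false
    grammar.each do |nt, rules|
      rules.each do |rhs|
        before = first[nt].size
        first[nt].merge(first_of_string(rhs, first, grammar))
        changed = true if first[nt].size > before
      end
    end
    break unless changed
  end
  first
end

# The grammar from the question; [] is an ε-production.
GRAMMAR = {
  A: [[:B, :C, :c], [:g, :D, :B]],
  B: [[], [:b, :C, :D, :E]],
  C: [[:D, :a, :B], [:c, :a]],
  D: [[], [:d, :D]],
  E: [[:g, :A, :f], [:c]]
}.freeze

first_sets(GRAMMAR).each do |nt, set|
  puts "First(#{nt}) = {#{set.map(&:to_s).sort.join(', ')}}"
end
```

ε ends up in First(A) only if every symbol of some right-hand side of A is nullable, which is not the case for either `B C c` or `g D B`.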
First set implementation on my github.
Edit: Updated link
https://github.com/amirbawab/EasyCC-CPP/blob/master/src/syntax/grammar/Grammar.cpp#L229
Code for computing both the first and follow sets is available at the link above.
I'm trying to understand how Ruby's stdout actually works, since I'm struggling with the output of some code.
Within my script I'm using a Unix sort, which works fine from the terminal, but this is what I get from Ruby. Suppose you have this in your file (TSV):
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
My ruby code is this:
@raw_file=File.open(ARGV[0],"r") unless File.open(ARGV[0],"r").nil?
tmp_raw=File.new("#{@pwd}/tmp_raw","w")
`cut -f1,6,3,4,2,5,9,12 #{@raw_file.path} | sort -k1,1 -k8,8 > #{tmp_raw.path}`
This is what i get (misplaced separators):
a b c d e f i
1a b c d e f g h i l m
1
What's happening here?
When running from the terminal I get no misplaced separators.
Instead of writing to a temporary file, passing the file via argument etc, you can use Ruby's open3 module to create the pipeline in a more Ruby-friendly manner (instead of relying on the underlying shell):
require 'open3'
raw_file = File.open(ARGV[0], "r")
commands = [
  ["cut", "-f1,6,3,4,2,5,9,12"],
  ["sort", "-k1,1", "-k8,8"],
]
result = Open3.pipeline_r(*commands, in: raw_file) do |out|
  break out.read
end
puts result
Shell escaping problems, for example, become a thing of the past, and no temporary files are necessary, since pipes are used.
I would, however, advise doing this kind of processing in Ruby itself, instead of calling external utilities; you're getting no benefit from using Ruby here, you're just doing shell stuff.
As Linuxios says, your code never uses STDOUT, so your question doesn't make a lot of sense.
Here's a simple example showing how to do this all in Ruby.
Starting with an input file called "test.txt":
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
a b c d e f g h i l m
This code:
File.open('test_out.txt', 'w') do |test_out|
  File.foreach('test.txt') do |line_in|
    chars = line_in.split
    test_out.puts chars.values_at(0, 5, 2, 3, 1, 4, 8, 10).sort_by{ |*c| [c[0], c[7]] }.join("\t")
  end
end
Creates this output in 'test_out.txt':
a b c d e f i m
a b c d e f i m
a b c d e f i m
a b c d e f i m
a b c d e f i m
a b c d e f i m
a b c d e f i m
a b c d e f i m
Read about values_at and sort_by.
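If the goal is also to reproduce the line ordering of `sort -k1,1 -k8,8`, one possible sketch is to pick the columns per line first and then sort the resulting rows by their first and eighth fields. `cut_and_sort` is a hypothetical helper name; the comparison is plain Ruby string ordering, so it will not necessarily match the locale-aware ordering of the Unix sort, and ties are not broken the way sort does it.

```ruby
# lines:   array of tab-separated strings
# indices: zero-based column indices to keep, in output order
def cut_and_sort(lines, indices)
  rows = lines.map do |line|
    # Columns that do not exist come back as nil and are dropped,
    # similar to cut ignoring a field number past the end of the line.
    line.chomp.split("\t").values_at(*indices).compact
  end
  # Sort by the 1st field, then by the 8th (an absent 8th field sorts as "").
  sorted = rows.sort_by { |row| [row[0].to_s, row[7].to_s] }
  sorted.map { |row| row.join("\t") }
end

input = [
  "b\t1\t2\t3\t4\t5\t6\tx",
  "a\t1\t2\t3\t4\t5\t6\tz",
  "a\t1\t2\t3\t4\t5\t6\ty"
]
puts cut_and_sort(input, (0..7).to_a)
```

With a real file, `File.readlines('test.txt')` could supply the `lines` argument.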
Suppose I have N elements. I want to create a list of all possible groupings of the elements, where several groups may exist at the same time.
For example, suppose we have 4 elements: a, b, c, d. Let [ ] denote that the elements within the brackets are in a grouping. I'm looking for an algorithm (in Matlab if possible) that can create a list of all the ways they can group together like so:
a b c d
[a b] c d
a [b c] d
a b [c d]
[a d] b c
[a c] b d
a c [b d]
[a b] [c d]
[b c] [a d]
[a c] [b d]
[a b c] d
a [b c d]
b [c d a]
c [d a b]
[a b c d]
This solution generates all possibilities:
% Generate potential solutions
p=dec2base(0:base2dec('123',4), 4);
% Convert to numeric
p=[zeros(size(p,1),1) p-'0'];
for col=2:4
    % Filter out rows in which column col uses a new group index although not all lower indices have been used.
    valid=max(p(:,1:col-1),[],2)+1>=p(:,col);
    p=p(valid,:);
end
The output is a matrix whose rows label the elements with group indices. For example, the row 0 1 2 2 (1233 with one-based labels) is a b [c d] in your notation.
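The same labelling idea, sketched in Ruby for any number of elements: each element receives a group label, and a label may exceed the largest label used so far by at most one, which is the condition the `valid` filter enforces. Instead of generating all label rows and filtering, this version only ever extends valid rows. The method name `set_partitions` and its extra parameters are made up for this illustration, and it treats groups of one element as groups too.

```ruby
# Enumerate all groupings (set partitions) of items.
# labels[i] is the group index of items[i]; the first element always gets 0,
# and each later label is at most one more than the largest label so far.
def set_partitions(items, labels = [], results = [])
  if labels.length == items.length
    # Convert the label row into explicit groups of elements.
    groups = items.each_index.group_by { |i| labels[i] }.values
    results << groups.map { |idxs| idxs.map { |i| items[i] } }
    return results
  end
  next_new = (labels.max || -1) + 1
  (0..next_new).each do |label|
    set_partitions(items, labels + [label], results)
  end
  results
end

parts = set_partitions(%w[a b c d])
puts parts.length
parts.each do |groups|
  puts groups.map { |g| g.length > 1 ? "[#{g.join(' ')}]" : g.first }.join(" ")
end
```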