Suppose there are N sets of words, and I would like to create a map from each word to the number of its occurrences across all these sets.
For example:
N = 3
S1 = {"a", "b", "c"}, S2 = {"a", "b", "d"}, S3 = {"a", "c", "e"}
M = { "a" -> 3, "b" -> 2, "c" -> 2, "d" -> 1, "e" -> 1}
Now I have M computers to use, so I can make each computer create a map from N/M sets. In the second (final) phase I can create a single map from the M maps. This looks like map/reduce. Does it make sense? How would you improve this approach?
This is the standard map/reduce example.
For example, here is Python code based on the mincemeat map/reduce library:
#!/usr/bin/env python
import mincemeat

S1 = {"a", "b", "c"}
S2 = {"a", "b", "d"}
S3 = {"a", "c", "e"}
datasource = dict(enumerate([S1, S2, S3]))

def mapfn(k, v):
    # Emit (word, 1) for every word in one input set.
    for w in v:
        yield w, 1

def reducefn(k, vs):
    # Sum the 1s emitted for a single word.
    result = sum(vs)
    return result

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
print(results)
Prints
{'a': 3, 'c': 2, 'b': 2, 'e': 1, 'd': 1}
Note that the way that map/reduce is structured means that the server gives new tasks to clients as they complete their tasks.
This means that there is not necessarily a fixed partitioning of N/M tasks to each client.
If one client is faster than the others then it will end up being given more tasks in order to make best use of the available resources.
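For comparison, the two-phase idea from the question can be sketched on a single machine without mincemeat. This is only an illustration under stated assumptions: it targets Python 3, the names count_chunk and word_counts are made up for this example, and threads stand in for the M computers. Unlike the server described above, it uses a fixed split of the sets per worker.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(sets):
    """Map phase: count word occurrences across one worker's share of the sets."""
    counts = Counter()
    for s in sets:
        counts.update(s)  # each word in a set counts once for that set
    return counts

def word_counts(all_sets, workers):
    """Split the sets into `workers` fixed chunks, count each chunk, then merge."""
    chunks = [all_sets[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Reduce phase: merge the per-worker maps into a single map.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

sets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]
result = word_counts(sets, workers=2)
print(result)
```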
I have a small piece of code to generate sequences, which is ok.
List = Reap[
For[i = 1, i <= 10000, i++,
Sow[RandomSample[Join[Table["a", {2}], Table["b", {2}]], 2]]];][[2, 1]];
Tally[List]
Giving the following output,
{{{"b", "b"}, 166302}, {{"b", "a"}, 333668}, {{"a", "b"}, 332964}, {{"a", "a"}, 167066}}
My problem is that I have yet to find a way to extract the frequencies from the output.
Thanks in advance for any help
Note: Generally do not start user-created Symbol names with a capital letter as these may conflict with internal functions.
It is not clear to me how you wish to transform the output. One interpretation is that you just want:
{166302, 333668, 332964, 167066}
In your code you use [[2, 1]] so I presume you know how to use Part, of which this is a short form. The documentation for Part includes:
If any of the list_i are All or ;;, all parts at that level are kept.
You could therefore use:
Tally[list][[All, 2]]
You could also use:
Last /@ Tally[list]
As george comments you can use Sort, which due to the structure of the Tally data will sort first by the item because it appears first in each list, and each list has the same length.
tally =
{{{"b","b"},166302},{{"b","a"},333668},{{"a","b"},332964},{{"a","a"},167066}};
Sort[tally][[All, 2]]
{167066, 332964, 333668, 166302}
You could also convert your data into a list of Rule objects and then pull values from a predetermined list:
rules = Rule @@@ tally
{{"b", "b"} -> 166302, {"b", "a"} -> 333668, {"a", "b"} -> 332964, {"a", "a"} -> 167066}
These could be in any order you choose:
{{"a", "a"}, {"a", "b"}, {"b", "a"}, {"b", "b"}} /. rules
{167066, 332964, 333668, 166302}
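For readers more at home in Python, a rough analogue of the rule-lookup idea is sketched below. It assumes the tallied counts are already available as a Counter keyed by tuples (the variable names are made up for illustration), and it pulls the frequencies out in a predetermined order.

```python
from collections import Counter

# Counts copied from the Tally output above, keyed by the tallied pair.
tally = Counter({
    ("b", "b"): 166302,
    ("b", "a"): 333668,
    ("a", "b"): 332964,
    ("a", "a"): 167066,
})

# Any order may be chosen here, just as with the replacement rules.
order = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]
frequencies = [tally[key] for key in order]
print(frequencies)
```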
Merely to illustrate another technique: if you have a specific list of items you wish to count, you may find value in this Sow and Reap construct. For example, with a random list of "a", "b", "c", "d":
SeedRandom[1];
dat = RandomChoice[{"a", "b", "c", "d"}, 50];
Counting the "a" and "c" elements:
Reap[Sow[1, dat], {"a", "c"}, Tr[#2] &][[2, All, 1]]
{19, 5}
This is not as fast as Tally but it is faster than doing a Count for each element, and sometimes the syntax is useful.
Related to this question, I am looking for algorithms (and actual code in Java/C/C++/Python/etc., if you have it!) to generate all combinations of r elements from a list with m elements in total. Some of these m elements may be repeated.
Thanks!
Recurse for each element type:
int recurseMe(list<list<item>> items, int r, list<item> container)
{
    if (container.length == r)
    {
        //print out your collection;
        return 1;
    }
    else if (items.isEmpty)
    {
        //no element types left and the collection is still too short
        return 0;
    }
    list<item> firstType = items[0];
    int score = 0;
    //take 0, 1, ..., up to all items of the first type, but never more than are still needed
    for(int i = 0; i <= firstType.length && container.length + i <= r; i++)
    {
        score += recurseMe(items without items[0], r, container + i items from firstType);
    }
    return score;
}
This takes as input a list containing lists of items, assuming each inner list represents a unique type of item. You may have to build a sorting function to feed as input to this.
//start with a list<item> original;
list<list<item>> grouped = new list<list<item>>();
list<item> sorted = original.sort();//use whichever method for this
list<item> temp = null;
item current = null;
for(int x = 0; x < sorted.length; x++)
{
    if (temp != null && sorted[x] == current)
    {
        temp.add(sorted[x]);
    }
    else
    {
        //a new run starts: store the previous run and remember the new item
        if (temp != null && temp.isNotEmpty)
            grouped.add(temp);
        temp = new list<item>();
        current = sorted[x];
        temp.add(current);
    }
}
if (temp != null && temp.isNotEmpty)
    grouped.add(temp);
//grouped is the result
This sorts the list, then creates sublists containing elements that are the same, inserting them into the list of lists, grouped.
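The grouping step can be sketched compactly in Python. This is an illustrative equivalent rather than a translation of the pseudocode above; the helper name group_equal is made up, and it assumes the items are sortable.

```python
from itertools import groupby

def group_equal(items):
    """Sort the items, then collect each run of equal items into its own sublist."""
    return [list(run) for _, run in groupby(sorted(items))]

grouped = group_equal(["b", "a", "d", "a", "c", "d", "d", "b", "d"])
sizes = [len(run) for run in grouped]  # how many items of each type are available
print(grouped)
print(sizes)
```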
Here is a recursion that I believe is closely related to Jean-Bernard Pellerin's algorithm, in Mathematica.
This takes input as the number of each type of element. The output is in similar form. For example:
{a,a,b,b,c,d,d,d,d} -> {2,2,1,4}
Function:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
f[k_, {x_, r___}, c___] := Join @@ (f[k, {r}, c, #] & /@ 0~Range~Min[x, k - +c])
Use:
f[4, {2, 2, 1, 4}]
{{0, 0, 0, 4}, {0, 0, 1, 3}, {0, 1, 0, 3}, {0, 1, 1, 2}, {0, 2, 0, 2},
{0, 2, 1, 1}, {1, 0, 0, 3}, {1, 0, 1, 2}, {1, 1, 0, 2}, {1, 1, 1, 1},
{1, 2, 0, 1}, {1, 2, 1, 0}, {2, 0, 0, 2}, {2, 0, 1, 1}, {2, 1, 0, 1},
{2, 1, 1, 0}, {2, 2, 0, 0}}
An explanation of this code was requested. It is a recursive function that takes a variable number of arguments. The first argument is k, length of subset. The second is a list of counts of each type to select from. The third argument and beyond is used internally by the function to hold the subset (combination) as it is constructed.
This definition therefore is used when there are no more items in the selection set:
f[k_, {}, c__] := If[+c == k, {{c}}, {}]
If the total of the values of the combination (its length) is equal to k, then return that combination, otherwise return an empty set. (+c is shorthand for Plus[c])
Otherwise:
f[k_, {x_, r___}, c___] := Join @@ (f[k, {r}, c, #] & /@ 0~Range~Min[x, k - +c])
Reading left to right:
Join is used to flatten out a level of nested lists, so that the result is not an increasingly deep tensor.
f[k, {r}, c, #] & calls the function, dropping the first position of the selection set (x), and adding a new element to the combination (#).
/@ 0 ~Range~ Min[x, k - +c] maps that function over each integer between zero and the lesser of the first element of the selection set and k less the total of the combination so far, which is the maximum that can be selected without exceeding combination size k.
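The same recursion can be sketched in Python for readers who do not use Mathematica. This is a port written for illustration, with made-up names (count_vectors, chosen); like f, it takes the subset size and the list of counts per element type, and returns how many elements of each type are selected.

```python
def count_vectors(k, counts, chosen=()):
    """Return every way to pick k items, expressed as a count per element type."""
    if not counts:
        # No element types left: keep the selection only if it has exactly k items.
        return [list(chosen)] if sum(chosen) == k else []
    first, rest = counts[0], counts[1:]
    results = []
    # Take 0..min(available, still needed) items of the first type.
    for take in range(min(first, k - sum(chosen)) + 1):
        results.extend(count_vectors(k, rest, chosen + (take,)))
    return results

combos = count_vectors(4, [2, 2, 1, 4])
print(len(combos))
```

The upper bound on take plays the role of Min[x, k - +c]: it prunes any branch that would exceed k items.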
I'm going to make this an answer rather than a bunch of comments.
My original comment was:
The CombinationGenerator Java class systematically generates all
combinations of n elements, taken r at a time. The algorithm is
described by Kenneth H. Rosen, Discrete Mathematics and Its
Applications, 2nd edition (NY: McGraw-Hill, 1991), pp. 284-286. See
merriampark.com/comb.htm. It has a link to source code.
As you pointed out in your comment, you want unique combinations. So, given the array ["a", "a", "b", "b"], you want it to generate aab, abb. The code I linked generates aab, aab, abb, abb.
With that array, removing duplicates is very easy. Depending on how you implement it, you either let it generate the duplicates and then filter them after the fact (i.e. selecting unique elements from an array), or you modify the code to include a hash table so that when it generates a combination, it checks the hash table before putting the item into the output array.
Looking something up in a hash table is an O(1) operation, so either of those is going to be efficient. Doing it after the fact will be a little bit more expensive, because you'll have to copy items. Still, you're talking O(n), where n is the number of combinations generated.
There is one complication: order is irrelevant. That is, given the array ["a", "b", "a", "b"], the code will generate aba, abb, aab, bab. In this case, aba and aab are duplicate combinations, as are abb and bab, and using a hash table isn't going to remove those duplicates for you. You could, though, create a bit mask for each combination, and use the hash table idea with the bit masks. This would be slightly more complicated, but not terribly so.
If you sort the initial array first, so that duplicate items are adjacent, then the problem goes away and you can use the hash table idea.
There's undoubtedly a way to modify the code to prevent it from generating duplicates. I can see a possible approach, but it would be messy and expensive. It would probably make the algorithm slower than if you just used the hash table idea. The approach I would take:
Sort the input array
Use the linked code to generate the combinations
Use a hash table or some other code to select unique items.
Although ... a thought that occurred to me.
Is it true that if you sort the input array, then any generated duplicates will be adjacent? That is, given the input array ["a", "a", "b", "b"], then the generated output will be aab, aab, abb, abb, in that order. This will be implementation dependent, of course. But if it's true in your implementation, then modifying the algorithm to remove duplicates is a simple matter of checking to see if the current combination is equal to the previous one.
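The three-step approach (sort, generate, select unique items) can be sketched in Python, with itertools.combinations standing in for the linked Java class and a set playing the role of the hash table. The function name unique_combinations is made up for this example. On the adjacency question, itertools.combinations does not keep duplicates adjacent in general: for the sorted input ["a", "a", "b", "c"] taken 2 at a time it yields aa, ab, ac, ab, ac, bc. That is why this sketch uses the set rather than comparing with the previous combination.

```python
from itertools import combinations

def unique_combinations(items, r):
    """Generate r-element combinations of items, keeping each distinct one once."""
    seen = set()      # plays the role of the hash table
    unique = []
    for combo in combinations(sorted(items), r):
        if combo not in seen:
            seen.add(combo)
            unique.append(combo)
    return unique

result = unique_combinations(["a", "b", "a", "b"], 3)
print(result)
```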
Given two strings of equal length such that
s1 = "ACCT"
s2 = "ATCT"
I would like to find the positions where these strings differ, so I have done the following. (Please suggest a better way of doing it; I bet there is one.)
z = s1.chars.zip(s2.chars).each_with_index.map{|(a,b),index| index+1 if a!=b}.compact
z is an array of (1-based) positions where the two strings differ. In this case z is [2].
Imagine that I add a new string
s3 = "AGCT"
and I wish to compare it with the others and see where the three strings differ. We could take the same approach as above, but this time
s1.chars.zip(s2.chars,s3.chars)
#=> [["A", "A", "A"], ["C", "T", "G"], ["C", "C", "C"], ["T", "T", "T"]]
returns an array of arrays. With two strings I was relying on just comparing two chars for equality, but this becomes overwhelming as I add more strings and as the strings become longer.
Running
s1.chars.zip(s2.chars,s3.chars).map{|item| item.uniq}
#=> [["A"], ["C", "T", "G"], ["C"], ["T"]]
can help reduce redundancy and identify positions where all strings agree (subarrays of size 1). I could then print out the indices and contents of the subarrays that are of size > 1.
s1.chars.zip(s2.chars,s3.chars).map{|item| item.uniq}.each_with_index.map{|a,index| [index+1,a] unless a.size== 1}.compact.map{|h| Hash[*h]}
#=> [{2=>["C", "T", "G"]}]
I feel that this will grind to a halt or get slow as I increase the number of strings and as the strings get longer. What are some better-performing alternatives?
Thank you.
Here's where I'd start. I'm purposely using different strings to make it easier to see the differences:
str1 = 'jackdaws love my giant sphinx of quartz'
str2 = 'jackdaws l0ve my gi4nt sphinx 0f qu4rtz'
To get the first string's characters:
str1.each_char.with_index.to_a - str2.each_char.with_index.to_a
=> [["o", 10], ["a", 19], ["o", 30], ["a", 35]]
To get the second string's characters:
str2.each_char.with_index.to_a - str1.each_char.with_index.to_a
=> [["0", 10], ["4", 19], ["0", 30], ["4", 35]]
There will be a little slow down as the strings get bigger, but it won't be bad.
EDIT: Added more info.
If you have an arbitrary number of strings, and need to compare them all, use Array#combination:
str1 = 'ACCT'
str2 = 'ATCT'
str3 = 'AGCT'
require 'pp'
pp [str1, str2, str3].combination(2).to_a
>> [["ACCT", "ATCT"], ["ACCT", "AGCT"], ["ATCT", "AGCT"]]
In the above output you can see that combination cycles through the array, returning the various n sized combinations of the array elements.
pp [str1, str2, str3].combination(2).map{ |a,b| a.each_char.with_index.to_a - b.each_char.with_index.to_a }
>> [[["C", 1]], [["C", 1]], [["T", 1]]]
Using combination's output you could cycle through the array, comparing all the elements against each other. So, in the above returned array, in the "ACCT" and "ATCT" pair, 'C' was the difference between the two, located at position 1 in the string. Similarly, in "ACCT" and "AGCT" the difference is "C" again, in position 1. Finally for 'ATCT' and 'AGCT' it's 'T' at position 1.
Because we already saw in the longer string samples that the code will return multiple changed characters, this should get you pretty close.
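The same pairwise idea can be sketched in Python, with itertools.combinations standing in for Array#combination. The helper name differences is made up for this example, and it assumes the strings have equal length.

```python
from itertools import combinations

def differences(a, b):
    """Return (char_from_a, index) for every position where a and b differ."""
    return [(x, i) for i, (x, y) in enumerate(zip(a, b)) if x != y]

strings = ["ACCT", "ATCT", "AGCT"]
# One result list per unordered pair, in the order combinations() yields them.
pairwise = [differences(a, b) for a, b in combinations(strings, 2)]
print(pairwise)
```

Note that the number of pairs grows quadratically with the number of strings, so this is best suited to a modest number of strings.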
Solution 1
strings = %w[ACCT ATCT AGCT]
First, join the strings, and make a hash of all the positions for each character.
joined = strings.join
positions = (0...joined.length).group_by{|i| joined[i]}
# => {"A"=>[0, 4, 8], "C"=>[1, 2, 6, 10], "T"=>[3, 5, 7, 11], "G"=>[9]}
Then, group the indices by their corresponding position within each string, remove those that are repeated as many times as the number of strings. This part is a variant of an algorithm that Jorg suggests.
length = strings.first.length
n = strings.length
diff = Hash[positions.map{|k, v|
  [k, v.group_by{|i| i % length}.reject{|i, is| is.length == n}.keys]
}]
This will give something like:
diff
# => {"A"=>[], "C"=>[1], "T"=>[1], "G"=>[1]}
which means that, "A" appears in the same positions in all strings, and "C", "T", and "G" differ at position 1 (count starts from 0) of the strings.
If you simply want to know the positions where the strings differ, take the union of the arrays:
diff["G"] | diff["A"] | diff["C"] | diff["T"]
# or diff["G"] | diff["A"] | diff["C"]
# => [1]
Solution 2
Note that, by maintaining an array of indices where a pairwise comparison fails and continuing to add indices to it, comparison of s1 against the rest (s2, s3, ...) will suffice.
length = s1.length
diff = []
[s2, s3, ...].each{|s| diff |= (0...length).reject{|i| s1[i] == s[i]}}
Explanation in a bit more detail
Suppose
s1 = 'GGGGGGGGG'
s2 = 'GGGCGGCGG'
s3 = 'GGGAGGCGG'
After s1 and s2 are compared, we have the set of indices [3, 6] that represents where they differ. Now, when we add s3 into consideration, it does not matter whether we compare it with s1 or with s2. If s1[i] and s2[i] are different, then i is already included in the set [3, 6], so it makes no difference whether either of them differs from s3[i]. On the other hand, if s1[i] and s2[i] are the same, it also makes no difference which one of them we compare with s3[i]. Therefore, pairwise comparison of s1 with s2, s3, ... is enough.
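The argument above can be sketched in Python: a position differs somewhere in the collection exactly when some string disagrees with the first string at that position. The function name differing_positions is made up for this example, and it assumes all strings have the same length.

```python
def differing_positions(strings):
    """Return the 0-based positions at which the strings are not all identical."""
    first, rest = strings[0], strings[1:]
    # Comparing every other string against the first one is sufficient.
    return [i for i in range(len(first)) if any(s[i] != first[i] for s in rest)]

positions = differing_positions(["GGGGGGGGG", "GGGCGGCGG", "GGGAGGCGG"])
print(positions)
```

This does one pass per string, so the work grows linearly with both the number of strings and their length.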
You almost certainly don't want to be doing this analysis with your own code. Rather, you want to be handing it off to an existing multiple sequence alignment tool, like Clustal.
I realise this is not an answer to your question, but I hope it's a solution to your problem!
How do I loop this?
p = Table[RandomChoice[{Heads, Tails}, 2 i + 1], {i, 10}];
v = Count[#, Heads] & /@ p;
c = Count[#, Tails] & /@ p;
f = Abs[v - c];
g = Take[f, LengthWhile[f, # != 3 &] + 1]
Thanks!
EDIT
In this coin flipping game the rules are as follows :
A single play consists of repeatedly
flipping a fair coin until the
difference between the number of
heads tossed and the number of tails
is three.
You must pay $1 each time the coin is
flipped, and you may not quit during
the play of the game.
You receive $8 at the end of each
play of the game.
Should you play this game?
How much might you expect to win or
lose after 500 plays?
You may use a spreadsheet simulation and/or reasoning about probabilities to answer these questions.
The class is using Excel, I'm trying to learn Mathematica.
A little bit more on the theoretical side
Your game is a random walk on R^1.
As such, the expected number of flips to reach a distance of 3 is 3^2 = 9, and that is also the expected value of your cost.
As your earning per game is $8, you'll lose at a mean rate of $1 per game.
Note that these figures are consistent with @Mr. Wizard's result of 135108 - 120000 = 15108 for 15000 games.
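The value 9 can be checked with a short first-step calculation. The notation here is introduced for this sketch only: E_k is the expected number of further flips when the absolute difference between heads and tails is currently k.

```latex
\begin{align*}
E_3 &= 0, \\
E_0 &= 1 + E_1, \\
E_1 &= 1 + \tfrac{1}{2}\,(E_0 + E_2), \\
E_2 &= 1 + \tfrac{1}{2}\,(E_1 + E_3).
\end{align*}
% Substituting E_0 = 1 + E_1 into the third line gives E_1 = 3 + E_2.
% The fourth line then gives E_2 = 1 + (3 + E_2)/2, so
\begin{align*}
E_2 = 5, \qquad E_1 = 8, \qquad E_0 = 9.
\end{align*}
```

So a play costs 9 flips on average against a payout of 8, which matches the mean loss of one dollar per game stated above.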
If I understand the rules of the coin flipping game, and if you must use a Monte Carlo method, consider this:
count =
  Table[
    i = x = 0;
    While[Abs[x] < 3, x += RandomChoice[{1, -1}]; i++];
    i,
    {15000}
  ];
The idea is to flip a coin until one person is winning by three, and then output the number of turns it took to get there. Do this 15,000 times, and create a list of the results (count).
The money you spent to play 15,000 games is simply the number of turns that were played, or:
Total @ count
(* Out= 135108 *)
Your winnings are $8 * 15,000 = $120,000, so this is not a good game to play.
If you need to count the number of times each number of turns comes up, then:
Sort @ Tally @ count
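For anyone following along outside Mathematica, the same Monte Carlo experiment can be sketched in Python. This is an illustrative port with made-up names (flips_until_lead, distribution); the seed is arbitrary, so the totals will not match the figure quoted above exactly.

```python
import random
from collections import Counter

def flips_until_lead(rng, lead=3):
    """Flip a fair coin until |heads - tails| reaches `lead`; return the flip count."""
    position = flips = 0
    while abs(position) < lead:
        position += rng.choice((1, -1))
        flips += 1
    return flips

rng = random.Random(1)                      # arbitrary seed, for repeatability only
games = 15000
counts = [flips_until_lead(rng) for _ in range(games)]

cost = sum(counts)                          # $1 per flip
winnings = 8 * games                        # $8 per completed play
distribution = sorted(Counter(counts).items())  # (number of flips, how often)
print(cost, winnings, cost / games)
```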
Not sure if this is the best way to accomplish what you want, but this should get you started. First, note that I changed the names Heads and Tails to lowercase (Heads is a built-in symbol...)---lowercase variable names are the best way to avoid this type of problem.
Remove[p, v, c, fun, f, g, head, tail];
fun[n_] :=
  Do[
    Block[
      {p, v, c, f, g},
      p = Table[RandomChoice[{head, tail}, 2 i + 1], {i, 10}];
      v = Count[#, head] & /@ p;
      c = Count[#, tail] & /@ p;
      f = Abs[v - c];
      g = Print[Take[f, LengthWhile[f, # != 3 &] + 1]]
    ],
    {n}]
Simply enter the number of times you want to run the loop... fun[5] gives:
{1,1,1,1,5,3}
{3}
{1,1,5,1,5,1,3}
{3}
{1,5,3}
Note: because you'll probably want to do something with the output, using Table[] is probably better than Do[]. This will return a list of lists.
Remove[p, v, c, fun, f, g, head, tail];
fun[n_] :=
  Table[
    Block[
      {p, v, c, f, g},
      p = Table[RandomChoice[{head, tail}, 2 i + 1], {i, 10}];
      v = Count[#, head] & /@ p;
      c = Count[#, tail] & /@ p;
      f = Abs[v - c];
      g = Take[f, LengthWhile[f, # != 3 &] + 1]
    ],
    {n}]
Nothing fancy!
A little more Mathematica-ish. No vars defined.
g[n_] := Table[(Abs /@ Total /@
      Array[RandomChoice[{-1, 1}, (2 # + 1)] &, 10]) /.
    {x___, 3, ___} :> {x, 3},
  {n}]
Credit to @Mr.Wizard for this answer.
g[2]
->{{1, 1, 1, 5, 5, 1, 5, 7, 3}, {1, 3}}
I don't like bitching about RTFM etc. but looping is pretty basic. If I type "loop" in the search box in the documentation center one of the first few hits contains a link to the page "guide/LoopingConstructs" and this contains a link to the tutorial "tutorial/LoopsAndControlStructures". Have you read these?