Efficient And Conditional Tuples or Subsets

Stepping back from the following question :
Selecting with Cases
I need to generate a random Set (1 000 000 items would be enough)
Subsets[Flatten[ParallelTable[{i, j}, {i, 1, 96}, {j, 1, 4}], 1], {4}]
Further, I need to reject any quadruples with non-unique first elements, such as {{1,1},{1,2},{2,3},{6,1}}.
But the above is impossible on a laptop. How could I draw one million sets uniformly at random without killing my machine?

Provided you have a base set you need to generate 4-element subsets of,
baseSet = Flatten[Table[{i, j}, {i, 1, 96}, {j, 1, 4}], 1];
you can use RandomSample as follows:
RandomSample[baseSet, 4]
This gives you a length-4 random subset of baseSet. Generating a million of them takes 2.5 seconds on my very old machine:
Timing[subsets = Table[RandomSample[baseSet, 4], {1000000}];]
Not all of what we get are going to be different subsets, so we need to remove duplicates using Union:
subsets = Union[subsets];
After this I'm still left with 999 971 items in a sample run, thanks to the much larger number of possible subsets (Binomial[Length[baseSet], 4] == 891 881 376).
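To apply the rejection step from the question (quadruples whose first elements repeat should be discarded), one possible filter simply counts the distinct first elements:
Select[subsets, Length[Union[#[[All, 1]]]] == 4 &]
This drops sets like {{1,1},{1,2},{2,3},{6,1}}, where the first element 1 appears twice.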

This should also do the trick, and it runs faster than Szabolcs' proposal.
(t=Table[{RandomInteger[{1, 96}], RandomInteger[{1, 4}]}, {10^6}, {4}]); //Timing
I saw no need to remove duplicate subsets since we're sampling, not trying to produce the entire population. (But you can easily remove duplicates if you so wish.)
BTW, for this case, Table runs faster than ParallelTable.

I believe a slight variation of David's method will produce the duplicate-free form requested in the original post.
set =
With[{r = Range@96},
 Transpose[{RandomSample[r, 4], RandomInteger[{1, 4}, 4]}] ~Table~ {1*^6}
];
This of course does not produce 10^6 unique samples, but Szabolcs showed how that may be done, and the cost is not great.

Related

Data structure for quick match in a card game

When playing trading card games, I frequently wonder what would be the most efficient data structure to deal with the following problem.
In such games, I face an opponent with a deck that contains N cards (N ~ 30..60..100), each chosen from M possible card types (M ~ typically 1000..10000s). Cards are generally not required to be unique, i.e. there can be repeated card types. The contents of the opponent's deck are unknown before the game.
As the game starts and progresses, I slowly learn, card by card, which cards the opponent uses. There is a dataset that includes the full contents of K (K ~ typically 100000..100000s) decks seen previously. I want to query this dataset with the progressively growing sample obtained in a given game, to make a ranked list of possible decks the opponent may be using.
What would be the most efficient data structure to do such querying, given mentioned limits on reasonably modern hardware (i.e. several gigabytes of RAM available)?
A very small example
possible card types = [1..10]
known K decks:
d1 = [1, 4, 6, 3, 4]
d2 = [5, 3, 3, 9, 5]
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
on turn 1, I reveal that an opponent uses card #5; thus, my list of candidates is reduced to:
d2 = [5, 3, 3, 9, 5] - score 2
d3 = [5, 10, 4, 10, 1] - score 1
d4 = [3, 7, 1, 8, 5] - score 1
d2 is ranked higher than the rest in the results, because there are two 5s in that deck, so it is probably the more likely match
on turn 2, I reveal that an opponent uses card #1; list of candidates is reduced to:
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
My ideas on solution
The trivial solution is, of course, to store the K decks as arrays of N integers. Getting the match score of one deck for a query of p revealed cards then takes O(N*p) checks; each time we see a match, we increase the score by 1. Checking all K known decks against a query of p cards would thus take O(K*N*p), roughly 100000 * 100 * 100 operations in the worst case => 1e9, which is a lot of work.
We can set up an index that holds, for every known card type, a list of pointers to the decks that card occurs in. However, that doesn't by itself solve the problem of intersecting all these lists (and they are going to be huge; there may be cards found in 90..95% of known decks). For a p-card lookup, it boils down to intersecting p lists of deck pointers, calculating intersection scores in the process. Roughly, that is O(K*p), but with a fairly large constant. It's still 1e7 operations in the late stages.
However, if we use the fact that every next turn only restricts our dataset further, we can reapply filtering to whatever came up in the previous query. This way, it is O(K) every turn => 1e5 operations.
Is there a way to perform better, ideally, not depending on value of K?
There are two things you can do to speed this up. First, create an inverted index that tells you which decks contain each card. So in your example decks above:
d1 = [1, 4, 6, 3, 4]
d2 = [5, 3, 3, 9, 5]
d3 = [5, 10, 4, 10, 1]
d4 = [3, 7, 1, 8, 5]
Your index is:
1: d1, d3, d4
3: d1, d2, d4
4: d1(2), d3
5: d2(2), d3, d4
6: d1
7: d4
8: d4
9: d2
10: d3(2)
It should be clear that this takes about the same amount of memory as the decks themselves. That is, rather than having K decks of N cards, you have up to M card types, each of which has up to K deck references.
When the user turns over the first card, 5, you quickly look up 5 in your index and you get the candidate lists [d2,d3,d4].
Here's the second optimization: you keep that list of candidates around. You're no longer interested in the rest of the decks; they have been eliminated from the list of candidates. When the next card, 1, is revealed, you look up 1 in your index and you get [d1,d3,d4]. You intersect that with the first list of candidates to produce [d3,d4].
In the worst possible case, you'd end up doing N intersections (one per card) of K items each (if the decks are all very similar). But in most cases the number of decks that a card is in will be much smaller than K, so your candidate list length will likely shrink very quickly.
Finally, if you store the deck references as hash maps then the intersection goes very quickly because you only have to look for items from the (usually small) existing candidate list in the large list of items for the next card turned over. Those lookups are O(1).
This is the basic idea of how a search engine works. You have a list of words, each of which contains references to the documents the word appears in. You can very quickly narrow a list of documents from hundreds of millions to just a handful in short order.
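To make the two ideas concrete, here is a small sketch in Mathematica (the language used elsewhere on this page); index and candidates are illustrative names, and the index is memoized on first use rather than prebuilt:
decks = {{1, 4, 6, 3, 4}, {5, 3, 3, 9, 5}, {5, 10, 4, 10, 1}, {3, 7, 1, 8, 5}};
(* inverted index: for a card, the list of decks it occurs in *)
index[card_] := index[card] = Flatten@Position[decks, _?(MemberQ[#, card] &), {1}, Heads -> False]
(* reveal card 5, then card 1, intersecting the candidate lists as we go *)
candidates = index[5]
(* ==> {2, 3, 4} *)
candidates = Intersection[candidates, index[1]]
(* ==> {3, 4} *)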
Your idea with intersecting p lists of deck pointers is good, but you're missing some optimizations.
Sort the deck lists by some criterion (e.g. deck index) and use binary search to advance through the lists (take the smallest deck id using a heap and advance it to match or exceed the current largest deck id). This way you get through them faster, especially if there aren't many decks in the intersection.
Also store the previous intersection so that for the next move you only need to intersect 2 lists (previous result and the new card).
Finally you can simply ignore cards that are too popular and just check for them in the final result.
I would suggest you implement a solution like this and run some benchmarks. It will be faster than O(K).

Sum of cells with several possibilities

I'm programming a Killer Sudoku Solver in Ruby and I try to take human strategies and put them into code. I have implemented about 10 strategies but I have a problem on this one.
In killer sudoku, we have "zones" of cells and we know the sum of these cells and we know possibilities for each cell.
Example :
Cell 1 can be 1, 3, 4 or 9
Cell 2 can be 2, 4 or 5
Cell 3 can be 3, 4 or 9
The sum of all cells must be 12
I want my program to try all possibilities to eliminate possibilities. For instance, here, cell 1 can't be 9 because you can't make 3 by adding two numbers possible in cells 2 and 3.
So I want it to work for any number of cells: remove the values that are impossible by trying them and seeing that they don't lead to a valid sum.
How can I get this working?
There's multiple ways to approach the general problem of game solving, and emulating human strategies is not always the best way. That said, here's how you can solve your question:
1st way, brute-forcy
Basically, we want to try all possibilities of the combinations of the cells, and pick the ones that have the correct sum.
cell_1 = [1,3,4,9]
cell_2 = [2,4,5]
cell_3 = [3,4,9]
all_valid_combinations = cell_1.product(cell_2, cell_3).select { |combo| combo.inject(:+) == 12 }
# => [[1, 2, 9], [3, 5, 4], [4, 4, 4], [4, 5, 3]]
# inject(:+) sums the combo; Array#sum is only built in as of Ruby 2.4
to pare this down to individual cells, you could do:
cell_1 = all_valid_combinations.map {|combo| combo[0]}.uniq
# => [1, 3, 4]
cell_2 = all_valid_combinations.map {|combo| combo[1]}.uniq
# => [2, 5, 4]
. . .
If you don't have a huge set of cells, this way is easier to code. It can get a bit inefficient, though. For small problems, this is the way I'd use.
2nd way, backtracking search
Another well known technique takes the problem from the other approach. Basically, for each cell, ask "Can this cell be this number, given the other cells?"
So, starting with cell 1: can the number be 1? To check, we see if cells 2 and 3 can sum to 11 (12 - 1).
* Can cell 2 have the value 2? To check, can cell 3 sum to 9 (11 - 2)?
and so on. In very large cases, where you could have many many valid combinations, this will be slightly faster, as you can return 'true' on the first time you find a valid number for a cell. Some people find recursive algorithms a bit harder to grok, though, so your mileage may vary.
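A minimal recursive sketch of that check, written in Mathematica (the language of the surrounding posts) rather than Ruby, with hypothetical names feasibleQ and prune:
feasibleQ[{last_}, rest_] := MemberQ[last, rest]
feasibleQ[{cell_, more__}, rest_] := Or @@ (feasibleQ[{more}, rest - #] & /@ cell)
(* keep only those values of each cell for which the other cells can still reach the sum *)
prune[cells_, total_] := Table[Select[cells[[i]], feasibleQ[Delete[cells, i], total - #] &], {i, Length[cells]}]
prune[{{1, 3, 4, 9}, {2, 4, 5}, {3, 4, 9}}, 12]
(* ==> {{1, 3, 4}, {2, 4, 5}, {3, 4, 9}}: the impossible 9 is removed from cell 1 *)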

Create expression trees from given sets of numbers and operations and find those that evaluate to a target number in Mathematica 8 or above

Given a set of numbers and a set of binary operations,
what is the fastest way to create random expression trees or exhaustively check every possible combination in Mathematica?
What I am trying to solve is given:
numbers={25,50,75,100,3,6} (* each can ONLY be used ONCE *)
operators={Plus,Subtract,Times,Divide} (* each can be used repeatedly *)
target=99
find expression trees that would evaluate to target.
I have two solutions whose performances I give for the case where expression trees contain exactly 4 of the numbers and 3 operators:
random sample & choice: ~25K trees / second
exhaustive scan: 806400 trees in ~2.15 seconds
(timed on a laptop with: Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz, 3 GB RAM, no parallelization used yet but it would be most welcome in answers)
My notebooks are a bit messy at the moment. So I would first love to pose the question and hope for original ideas and answers while I clean up my code for sharing.
Largest possible case is where every expression tree uses up all the (6) numbers and 'Length[numbers]-1' (5) operators.
Performance of methods in the largest case is:
random sample & choice: ~21K trees / second
exhaustive scan: 23224320 trees in ~100 seconds
Also, I am using Mathematica 8.0.1, so I am more than all ears if there are any ways to do it in OpenCL or using compiled functions with CompilationTarget -> "C", etc.
OK, this is not elegant or fast, and it's buggy, but it works (sometimes). It uses a Monte Carlo method, implementing the Metropolis algorithm for a weight function that I (arbitrarily) selected just to see if this would work. I wrote it some time ago for a similar problem; I suppose my Mathematica skills have improved, as it looks ugly now, but I have no time to fix it at the moment.
Execute this (it looks more reasonable when you paste it into a notebook):
ClearAll[swap];
swap[lst_, {p1_, p2_}] :=
 ReplacePart[lst, {p1 -> lst[[p2]], p2 -> lst[[p1]]}]
ClearAll[evalops];
(*first element of opslst is Identity*)
evalops[opslst_, ord_, nums_] :=
 Module[{curval}, curval = First@nums;
  Do[curval = opslst[[p]][curval, nums[[ord[[p]]]]], {p, 2, Length@nums}];
  curval]
ClearAll[randomizeOrder];
randomizeOrder[ordlst_] :=
 swap[ordlst, RandomInteger[{1, Length@ordlst}, 2]]
ClearAll[randomizeOps];
(*never touch the first element; RandomChoice draws from the allowedOps argument*)
randomizeOps[oplst_, allowedOps_] :=
 ReplacePart[oplst,
  {RandomInteger[{2, Length@oplst}] -> RandomChoice[allowedOps]}]
ClearAll[takeMCstep];
takeMCstep[goal_, opslst_, ord_, nums_, allowedops_] :=
 Module[{curres, newres, newops, neword, p},
  curres = evalops[opslst, ord, nums];
  newops = randomizeOps[opslst, allowedops];
  neword = randomizeOrder[ord];
  newres = evalops[newops, neword, nums];
  Switch[Abs[newres - goal],
   0, {newops, neword},
   _, (p = Abs[curres - goal]/Abs[newres - goal];
     If[RandomReal[] < p, {newops, neword}, {opslst, ord}])]]
then to solve your actual problem, do
ops = {Times, Plus, Subtract, Divide}
nums = {25, 50, 75, 100, 3, 6}
ord = Range[Length@nums]
(*the first element is Identity to simplify the logic later*)
oplist = {Identity}~Join~RandomChoice[ops, Length@nums - 1]
out = NestList[
  takeMCstep[99, #[[1]], #[[2]], nums, ops] &, {oplist, ord}, 10000]
and then to see that it worked,
ev = Map[evalops[#[[1]], #[[2]], nums] &, out];
ev // Last // N
ev // ListPlot[#, PlotMarkers -> None] &
giving a plot of the evaluated results over the run (image not reproduced here). Thus, it obtained the correct order of operators and numbers after around 2000 tries.
As I said, it's ugly, inefficient, and badly programmed as it was a quick-and-dirty adaptation of a quick-and-dirty hack. If you're interested I can clean up and explain the code.
This was a fun question. Here's my full solution:
ExprEval[nums_, ops_] := Fold[
  #2[[1]][#1, #2[[2]]] &,
  First@nums,
  Transpose[{ops, Rest@nums}]]
SymbolicEval[nums_, ops_] := ExprEval[nums, ToString /@ ops]
GetExpression[nums_, ops_, target_] := Select[
  Tuples[ops, Length@nums - 1],
  (target == ExprEval[nums, #]) &]
Usage example:
nums = {-1, 1, 2, 3};
ops = {Plus, Subtract, Times, Divide};
solutions = GetExpression[nums, ops, 3]
ExprEval[nums, #] & /@ solutions
SymbolicEval[nums, #] & /@ solutions
Outputs:
{{Plus, Times, Plus}, {Plus, Divide, Plus}, {Subtract, Plus,
Plus}, {Times, Plus, Times}, {Divide, Plus, Times}}
{3, 3, 3, 3, 3}
{"Plus"["Times"["Plus"[-1, 1], 2], 3],
"Plus"["Divide"["Plus"[-1, 1], 2], 3],
"Plus"["Plus"["Subtract"[-1, 1], 2], 3],
"Times"["Plus"["Times"[-1, 1], 2], 3],
"Times"["Plus"["Divide"[-1, 1], 2], 3]}
How it works
The ExprEval function takes in the numbers and operations, and applies them using (I think) RPN:
ExprEval[{1, 2, 3}, {Plus, Times}] == (1 + 2) * 3
It does this by continually folding pairs of numbers using the appropriate operation.
Now that I have a way to evaluate an expression tree, I just needed to generate them. Using Tuples, I'm able to generate all the different operators that I would intersperse between the numbers.
Once you get all possible operations, I used Select to pick out the ones that evaluate to the target number.
Drawbacks
The solution above is really slow. Generating all the possible tuples is exponential in time. If there are k operations and n numbers, it's on the order of O(k^n).
For n = 10, it took 6 seconds to complete on Win 7 x64, Core i7 860, 12 GB RAM. The timings of the runs match the theoretical time complexity almost exactly:
(Plot omitted: the red line is the theoretical complexity, blue is experimental; the x-axis is the size of the nums input and the y-axis is the time in seconds to enumerate all solutions.)
The above solution also solves the problem using a functional programming style. It looks pretty, but the thing also sucks up a butt ton of memory since it's storing the full results at nearly every step.
It doesn't even make use of parallelization, and I'm not entirely certain how you would even parallelize the solution I produced.
Some limitations
Mr. Wizard brought to my attention that this code finds only a particular set of solutions. Given some input such as {a, b, c, d, e, ...} it only permutes the operators in between the numbers; it doesn't permute the ordering of the numbers. If it were to permute the numbers as well, the time complexity would rise to O(k^n * n!), where k is the number of operators and n is the length of the input number array.
The following will produce the set of solutions for any permutation of the input numbers and operators:
(* generates a lists of the form
{
{number permutation, {{op order 1}, {op order 2}, ... }
}, ...
}*)
GetAllExpressions[nums_, ops_, target_] :=
 ParallelMap[{#, GetExpression[#, ops, target]} &,
  Permutations[nums]] (* Permutations, not Tuples: each number may be used only once *)

Algorithm for picking pattern free downvalues from a sparse definition list

I have the following problem.
I am developing a stochastic simulator which samples configurations of the system at random and stores the statistics of how many times each configuration has been visited at certain time instances. Roughly the code works like this
f[_Integer][{_Integer..}] :=0
...
someplace later in the code, e.g.,
index = get index;
c = get random configuration (i.e. a tuple of integers, say a pair {n1, n2});
f[index][c] = f[index][c] + 1;
which tags that configuration c has occurred once more in the simulation at time instance index.
Once the code has finished there is a list of definitions for f that looks something like this (I typed it by hand just to emphasize the most important parts)
?f
f[1][{1, 2}] = 112
f[1][{3, 4}] = 114
f[2][{1, 6}] = 216
f[2][{2, 7}] = 227
...
f[index][someconfiguration] = some value
...
f[_Integer][{_Integer..}] :=0
Please note that pattern free definitions that come first can be rather sparse. Also one cannot know which values and configurations will be picked.
The problem is to efficiently extract down values for a desired index, for example issue something like
result = ExtractConfigurationsAndOccurences[f, 2]
which should give a list with the structure
result = {list1, list2}
where
list1 = {{1, 6}, {2, 7}} (* the list of configurations that occurred during the simulation*)
list2 = {216, 227} (* how many times each of them occurred *)
The problem is that ExtractConfigurationsAndOccurences should be very fast. The only solution I could come up with was to use SubValues[f] (which gives the full list) and filter it with a Cases statement. I realize that this procedure should be avoided at all costs, since there will be exponentially many configurations (definitions) to test, which slows down the code considerably.
Is there a natural way in Mathematica to do this in a fast way?
I was hoping that Mathematica would see f[2] as a single head with many down values but using DownValues[f[2]] gives nothing. Also using SubValues[f[2]] results in an error.
This is a complete rewrite of my previous answer. It turns out that in my previous attempts I overlooked a much simpler method, based on a combination of packed arrays and sparse arrays, that is much faster and more memory-efficient than all previous methods (at least in the range of sample sizes where I tested it), while only minimally changing the original SubValues-based approach. Since the question asked for the most efficient method, I will remove the other ones from the answer, given that they are quite a bit more complex and take a lot of space. (Those who would like to see them can inspect past revisions of this answer.)
The original SubValues-based approach
We start by introducing a function to generate the test samples of configurations for us. Here it is:
Clear[generateConfigurations];
generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer,
nconfs_Integer] :=
Transpose[{
RandomInteger[{1, maxIndex}, nconfs],
Transpose[{
RandomInteger[{1, maxConfX}, nconfs],
RandomInteger[{1, maxConfY}, nconfs]
}]}];
We can generate a small sample to illustrate:
In[3]:= sample = generateConfigurations[2,2,2,10]
Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}},
{1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}}
We have here only 2 indices, and configurations where both "x" and "y" numbers vary from 1 to 2 only - 10 such configurations.
The following function will help us imitate the accumulation of frequencies for configurations, as we increment SubValues-based counters for repeatedly occurring ones:
Clear[testAccumulate];
testAccumulate[ff_Symbol, data_] :=
Module[{},
ClearAll[ff];
ff[_][_] = 0;
Do[
doSomeStuff;
ff[#1][#2]++ & @@ elem;
doSomeMoreStuff;
, {elem, data}]];
The doSomeStuff and doSomeMoreStuff symbols are here to represent some code that might precede or follow the counting code. The data parameter is supposed to be a list of the form produced by generateConfigurations. For example:
In[6]:=
testAccumulate[ff,sample];
SubValues[ff]
Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3,
HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1,
HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1,
HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0}
The following function will extract the resulting data (indices, configurations and their frequencies) from the list of SubValues:
Clear[getResultingData];
getResultingData[f_Symbol] :=
Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@
  Most@SubValues[f, Sort -> False];
For example:
In[10]:= result = getResultingData[ff]
Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1},
{2,{2,2},1},{1,{2,2},1}}
To finish with the data-processing cycle, here is a straightforward function to extract data for a fixed index, based on Select:
Clear[getResultsForFixedIndex];
getResultsForFixedIndex[data_, index_] :=
If[# === {}, {}, Transpose[#]] &[
Select[data, First@# == index &][[All, {2, 3}]]];
For our test example,
In[13]:= getResultsForFixedIndex[result,1]
Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}}
This is presumably close to what @zorank tried, in code.
A faster solution based on packed arrays and sparse arrays
As @zorank noted, this becomes slow for a larger sample with more indices and configurations. We will now generate a large sample to illustrate that (note! this requires about 4-5 GB of RAM, so you may want to reduce the number of configurations if that exceeds the available RAM):
In[14]:=
largeSample = generateConfigurations[20,500,500,5000000];
testAccumulate[ff,largeSample];//Timing
Out[15]= {31.89,Null}
We will now extract the full data from the SubValues of ff:
In[16]:= (largeres = getResultingData[ff]); // Timing
Out[16]= {10.844, Null}
This takes some time, but one has to do this only once. But when we start extracting data for a fixed index, we see that it is quite slow:
In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing
Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}
The main idea we will use here to speed it up is to pack the individual lists inside largeres: those for the indices, the combinations, and the frequencies. While the full list cannot be packed, those parts individually can:
In[18]:= Timing[
subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]];
subCombsPacked = Developer`ToPackedArray[largeres[[All,2]]];
subFreqsPacked = Developer`ToPackedArray[largeres[[All,3]]];
]
Out[18]= {1.672,Null}
This also takes some time, but it is a one-time operation again.
The following functions will then be used to extract the results for a fixed index much more efficiently:
Clear[extractPositionFromSparseArray];
extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]
Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_,
packedFreqs_, index_Integer] :=
With[{positions =
extractPositionFromSparseArray[
SparseArray[1 - Unitize[packedIndices - index]]]},
{Extract[packedCombs, positions],Extract[packedFreqs, positions]}];
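To see what this is doing, here is a tiny illustration with made-up values: packedIndices - index is zero exactly where the index matches, so 1 - Unitize[...] has 1s at the matching positions, and extractPositionFromSparseArray reads those positions out of the sparse array's internal representation:
extractPositionFromSparseArray[SparseArray[1 - Unitize[{3, 10, 10, 7, 10} - 10]]]
(* ==> {{2}, {3}, {5}} *)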
Now, we have:
In[25]:=
getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10]
//Short//Timing
Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}
We get a 30 times speed-up w.r.t. the naive Select approach.
Some notes on complexity
Note that the second solution is faster because it uses optimized data structures, but its complexity is the same as that of the Select-based one: linear in the length of the total list of unique combinations for all indices. Therefore, in theory, the previously discussed solutions based on a nested hash table etc. may be asymptotically better. The problem is that in practice we will probably hit the memory limitations long before that. For the 10 million configurations sample, the above code was still 2-3 times faster than the fastest solution I posted before.
EDIT
The following modification:
Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_,
packedFreqs_, index_Integer] :=
With[{positions =
extractPositionFromSparseArray[
SparseArray[Unitize[packedIndices - index], Automatic, 1]]},
{Extract[packedCombs, positions], Extract[packedFreqs, positions]}];
makes the code twice as fast again. Moreover, for more sparse indices (say, calling the sample-generation function with parameters like generateConfigurations[2000, 500, 500, 5000000]), the speed-up with respect to the Select-based function is about 100 times.
I'd probably use SparseArrays here (see the update below), but if you insist on using functions and *Values to store and retrieve values, one approach would be to replace the first part (f[2] etc.) by a symbol you create on the fly, like:
Table[Symbol["f" <> IntegerString[i, 10, 3]], {i, 11}]
(* ==> {f001, f002, f003, f004, f005, f006, f007, f008, f009, f010, f011} *)
Symbol["f" <> IntegerString[56, 10, 3]]
(* ==> f056 *)
Symbol["f" <> IntegerString[56, 10, 3]][{3, 4}] = 12;
Symbol["f" <> IntegerString[56, 10, 3]][{23, 18}] = 12;
Symbol["f" <> IntegerString[56, 10, 3]] // Evaluate // DownValues
(* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *)
f056 // DownValues
(* ==> {HoldPattern[f056[{3, 4}]] :> 12, HoldPattern[f056[{23, 18}]] :> 12} *)
Personally I prefer Leonid's solution, as it's much more elegant but YMMV.
Update
On OP's request, about using SparseArrays:
Large SparseArrays take up a fraction of the size of standard nested lists. We can make f a large (100,000 entries) sparse array of sparse arrays:
f = SparseArray[{_} -> 0, 100000];
f // ByteCount
(* ==> 672 *)
(* initialize f with sparse arrays, takes a few seconds with f this large *)
Do[ f[[i]] = SparseArray[{_} -> 0, {100, 110}], {i,100000}] // Timing//First
(* ==> 18.923 *)
(* this takes about 2.5% of the memory that a normal array would take: *)
f // ByteCount
(* ==> 108000040 *)
ConstantArray[0, {100000, 100, 100}] // ByteCount
(* ==> 4000000176 *)
(* counting phase *)
f[[1]][[1, 2]]++;
f[[1]][[1, 2]]++;
f[[1]][[42, 64]]++;
f[[2]][[100, 11]]++;
(* reporting phase *)
f[[1]] // ArrayRules
f[[2]] // ArrayRules
f // ArrayRules
(*
==>{{1, 2} -> 2, {42, 64} -> 1, {_, _} -> 0}
==>{{100, 11} -> 1, {_, _} -> 0}
==>{{1, 1, 2} -> 2, {1, 42, 64} -> 1, {2, 100, 11} -> 1, {_, _, _} -> 0}
*)
As you can see, ArrayRules makes a nice list with contributions and counts. This can be done for each f[i] separately or the whole bunch together (last line).
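If you want the {configurations, counts} pair in the shape requested in the question, it can be assembled from ArrayRules; a sketch (Most drops the trailing default rule {_, _} -> 0):
rules = Most[ArrayRules[f[[1]]]];
{rules[[All, 1]], rules[[All, 2]]}
(* ==> {{{1, 2}, {42, 64}}, {2, 1}} *)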
In some scenarios (depending upon the performance needed to generate the values), the following easy solution using an auxiliary list (f[i,0]) may be useful:
f[_Integer][{_Integer ..}] := 0;
f[_Integer, 0] := Sequence @@ {};
Table[
r = RandomInteger[1000, 2];
f[h = RandomInteger[100000]][r] = RandomInteger[10];
f[h, 0] = Union[f[h, 0], {r}];
, {i, 10^6}];
ExtractConfigurationsAndOccurences[f_, i_] := {f[i, 0], f[i][#] & /@ f[i, 0]};
Timing@ExtractConfigurationsAndOccurences[f, 10]
Out[252]= {4.05231*10^-15, {{{172, 244}, {206, 115}, {277, 861}, {299,
862}, {316, 194}, {361, 164}, {362, 830}, {451, 306}, {614,
769}, {882, 159}}, {5, 2, 1, 5, 4, 10, 4, 4, 1, 8}}}
Many thanks to everyone for the help provided. I've been thinking a lot about everybody's input, and I believe that in this simulation setup the following is the optimal solution:
SetAttributes[linkedList, HoldAllComplete];
temporarySymbols = linkedList[];
SetAttributes[bookmarkSymbol, Listable];
bookmarkSymbol[symbol_]:=
With[{old = temporarySymbols}, temporarySymbols= linkedList[old,symbol]];
registerConfiguration[index_]:=registerConfiguration[index]=
Module[
{
cs = linkedList[],
bookmarkConfiguration,
accumulator
},
(* remember the symbols we generate so we can remove them later *)
bookmarkSymbol[{cs,bookmarkConfiguration,accumulator}];
getCs[index] := List @@ Flatten[cs, Infinity, linkedList];
getCsAndFreqs[index] := {getCs[index], accumulator /@ getCs[index]};
accumulator[_]=0;
bookmarkConfiguration[c_]:=bookmarkConfiguration[c]=
With[{oldCs=cs}, cs = linkedList[oldCs, c]];
Function[c,
bookmarkConfiguration[c];
accumulator[c]++;
]
]
pattern = Verbatim[RuleDelayed][Verbatim[HoldPattern][HoldPattern[registerConfiguration[_Integer]]], _];
clearSimulationData :=
Block[{symbols},
DownValues[registerConfiguration]=DeleteCases[DownValues[registerConfiguration],pattern];
symbols = List @@ Flatten[temporarySymbols, Infinity, linkedList];
(*Print["symbols to purge: ", symbols];*)
ClearAll /@ symbols;
temporarySymbols = linkedList[];
]
It is based on Leonid's solution from one of the previous posts, extended with belisarius' suggestion to include extra indexing for configurations that have been processed. Previous approaches are adapted so that configurations can be naturally registered and extracted using more or less the same code. This hits two flies at once, since bookkeeping and retrieval are strongly interrelated.
This approach will work better in the situation when one wants to add simulation data incrementally (all curves are normally noisy so one has to add runs incrementally to obtain good plots). The sparse array approach will work better when data are generated in one go and then analyzed, but I do not remember being personally in such a situation where I had to do that.
Also, I was rather naive in thinking that data extraction and generation could be treated separately. In this particular case it seems one should keep both perspectives in mind. I profoundly apologise for bluntly dismissing any previous suggestions in this direction (there were a few implicit ones).
There are some open/minor problems that I do not know how to handle, e.g. when clearing the symbols I cannot clear headers like accumulator$164; I can only clear the subvalues associated with them. I have no clue why. Also, if With[{oldCs = cs}, cs = linkedList[oldCs, c]]; is changed into something like cs = linkedList[cs, c];, configurations are not stored. I have no clue why the second variant does not work either. But these minor problems are well-defined satellite issues that can be addressed in the future. By and large, the problem seems solved thanks to the generous help from all involved.
Many thanks again for all the help.
Regards
Zoran
p.s. There are some timings below, but to understand what is going on I will append the code used for benchmarking. In brief, the idea is to generate lists of configurations and just Map through them by invoking registerConfiguration. This essentially simulates the data generation process. Here is the code used for testing:
fillSimulationData[sampleArg_] :=MapIndexed[registerConfiguration[#2[[1]]][#1]&, sampleArg,{2}];
sampleForIndex[index_]:=
Block[{nsamples,min,max},
min = Max[1,Floor[(9/10)maxSamplesPerIndex]];
max = maxSamplesPerIndex;
nsamples = RandomInteger[{min, max}];
RandomInteger[{1,10},{nsamples,ntypes}]
];
generateSample :=
Table[sampleForIndex[index],{index, 1, nindexes}];
measureGetCsTime := ((First@Timing[getCs[#]]) & /@ Range[1, nindexes]) // Max
measureGetCsAndFreqsTime := ((First@Timing[getCsAndFreqs[#]]) & /@ Range[1, nindexes]) // Max
reportSampleLength[sampleArg_] := StringForm["Total number of confs = ``, smallest accumulator length ``, largest accumulator length = ``", Sequence @@ ({Total[#], Min[#], Max[#]} &[Length /@ sampleArg])]
The first example is relatively modest:
clearSimulationData;
nindexes=100;maxSamplesPerIndex = 1000; ntypes = 2;
largeSample1 = generateSample;
reportSampleLength[largeSample1];
Total number of confs = 94891, smallest accumulator length 900, largest accumulator length = 1000;
First@Timing@fillSimulationData[largeSample1]
gives 1.375 secs which is fast I think.
With[{times = Table[measureGetCsTime, {50}]},
ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]
gives times around 0.016 secs, and
With[{times = Table[measureGetCsAndFreqsTime, {50}]},
ListPlot[times, Joined -> True, PlotRange -> {0, Max[times]}]]
gives the same times. Now the real killer:
nindexes = 10; maxSamplesPerIndex = 100000; ntypes = 10;
largeSample3 = generateSample;
largeSample3 // Short
{{{2,2,1,5,1,3,7,9,8,2},<<92061>>,{3,8,6,4,9,9,7,8,7,2}},<<8>>,{{4,10,1,5,9,8,8,10,8,6},<<95498>>,{3,8,8}}}
reported as
Total number of confs = 933590, smallest accumulator length 90760, largest accumulator length = 96876
gives generation times of ca. 1.969-2.016 secs, which is unbelievably fast. I mean, this is like going through a gigantic list of ca. one million elements and applying a function to each element.
The extraction times for configs and {configs, freqs} are roughly 0.015 and 0.03 secs, respectively.
To me this is a mind blowing speed I would never expect from Mathematica!

Mathematica "AppendTo" function problem

I'm a newbie in Mathematica and I'm having a major malfunction with adding columns to a data table. I'm running Mathematica 7 on Vista. I have spent a lot of time RFD before asking here.
I have a data table (mydata) with three columns and five rows. I'm trying to add two lists of five elements to the table (effectively adding two columns to the data table).
This works perfectly:
Table[AppendTo[mydata[[i]],myfirstlist[[i]]],{i,4}]
Printing out the table with: mydata // TableForm shows the added column.
However, when I try to add my second list
Table[AppendTo[mydata[[i]],mysecondlist[[i]]],{i,5}]
either Mathematica crashes(!) or I get a slew of Part::partw and Part::spec errors saying Part 5 does not exist.
However, after all the error messages (if Mathematica does not crash), again printing out the data table with: mydata // TableForm shows the data table with five columns just like I intended. All TableForm formatting options on the revised data table work fine.
Could anyone tell me what I'm doing wrong? Thanks in advance!
Let's try to clarify what the double transpose method consists of. I make no claims about the originality of the approach. My focus is on the clarity of exposition.
Let's begin with 5 lists. First we'll place three in a table. Then we'll add the final two.
food = {"bagels", "lox", "cream cheese", "coffee", "blueberries"};
mammals = {"fisher cat", "weasel", "skunk", "raccon", "squirrel"};
painters = {"Picasso", "Rembrandt", "Klee", "Rousseau", "Warhol"};
countries = {"Brazil", "Portugal", "Azores", "Guinea Bissau",
"Cape Verde"};
sports = {"golf", "badminton", "football", "tennis", "rugby"};
The first three lists--food, mammals, painters--become the elements of lists3. They are just lists, but TableForm displays them in a table as rows.
(lists3 = {food, mammals, painters}) // TableForm
mydata will be the name for lists3 transposed. Now the three lists appear as columns. That's what transposition does: columns and rows are switched.
(mydata = Transpose@lists3) // TableForm
This is where the problem actually begins. How can we add two additional columns (that is, the lists for countries and sports)? So let's work with the remaining two lists.
(lists2 = {countries, sports}) // TableForm
So we can Join Transpose[mydata] and lists2....
(lists5 = Join[Transpose[mydata], lists2]) // TableForm
[Alternatively, we might have Joined lists3 and lists2, because the second transposition, the transposition of mydata, undoes the earlier one:
lists3 is just the transposition of mydata (and vice versa).]
In[]:= lists3 === Transpose[mydata]
Out[]:= True
Now we only need to Transpose the result to obtain the desired final table of five lists, each occupying its own column:
Transpose@lists5 // TableForm
I hope that helps shed some light on how to add two columns to a table. I find this way reasonably clear. You may find some other way clearer.
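For the record, the whole operation can also be compressed into a single line; this is just the composition of the steps shown above:
Transpose[Join[Transpose[mydata], {countries, sports}]] // TableForm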
There are several things to cover here. First, the following code does not give me any errors, so there may be something else going on here. Perhaps you should post a full code block that produces the error.
mydata = Array[Subscript[{##}] &, {5, 3}];
myfirstlist = Range[1, 5];
mysecondlist = Range[6, 10];
Table[AppendTo[mydata[[i]], myfirstlist[[i]]], {i, 4}];
mydata // TableForm
Table[AppendTo[mydata[[i]], mysecondlist[[i]]], {i, 5}];
mydata // TableForm
Second, there is no purpose in using Table here, as you are modifying mydata directly. Table will use up memory pointlessly.
Third, there are better ways to accomplish this task.
See How to prepend a column and Inserting into a 2d list
I must retract my definitive statement that there are better ways. After changing Table to Do and running a few quick tests, this appears to be a competitive method for some data.
I am using Mathematica 7, so that does not appear to be the problem.
As mentioned before, there are better alternatives to adding columns to a list, and like Gareth and Mr.Wizard, I do not seem to be able to reproduce the problem on v. 7. But, I want to focus on the error itself, and see if we can correct it that way. When Mathematica produces the message Part::partw it spits out part of the offending list like
Range[1000][[1001]]
Part::partw: Part 1001 of
{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,<<950>>}
does not exist.
So, the question I ask is: which list is giving me the problems? My best guess is that it is mysecondlist, and I'd check Length @ mysecondlist to see if it is actually 5 elements long.
Well, here's my two cents with what I believe is a very fast and IMHO most easily understandable construction.
First, some test arrays:
m = RandomInteger[100, {2000, 10000}];
l1 = RandomInteger[100, 2000];
l2 = RandomInteger[100, 2000];
{r, c} = Dimensions[m];
I increased the test array sizes somewhat to improve accuracy of the following timing measurements.
The method invokes the powers of Part ([[...]]), All, and Span (;;).
Basically, I set up a new working matrix with the dimensions the data array will have after the addition of the two columns, then copy in the original matrix using All and Span, and add the additional columns with All only. I then copy the scrap matrix back to our original matrix, so that, like the other methods, this returns the modified data matrix.
n = ConstantArray[0, {r, c} + {0, 2}];
n[[All, 1 ;; c]] = m;
n[[All, c + 1]] = l1;
n[[All, c + 2]] = l2;
m = n;
As for timing:
Mean[
Table[
First[
AbsoluteTiming[
n2 = ConstantArray[0, {r, c} + {0, 2}];
n2[[All, 1 ;; c]] = m;
n2[[All, c + 1]] = l1;
n2[[All, c + 2]] = l2;
m2 = n2;
]
],
{10}
]
]
0.1056061
(an average of 10 runs)
The other proposed method with Do (Mr.Wizard and the OP):
Mean[
Table[
n1 = m;
First[
AbsoluteTiming[
Do[AppendTo[n1[[i]], l1[[i]]], {i, 2000}];
Do[AppendTo[n1[[i]], l2[[i]]], {i, 2000}];
]
],
{10}
]
]
0.4898280
The result is the same:
In[9]:= n2 == n1
Out[9]= True
So, a conceptually easy and quick (5 times faster!) method.
I tried to reproduce this but failed. I'm running Mma 8 on Windows XP; it doesn't seem like the difference should matter, but who knows? I said, successively,
myData = {{1, 2, 3}, {2, 3, 4}, {8, 9, 10}, {1, 1, 1}, {2, 2, 2}}
myFirstList = {9, 9, 9, 9, 9}
mySecondList = {6, 6, 6, 6, 6}
Table[AppendTo[myData[[i]], myFirstList[[i]]], {i, 4}]
Table[AppendTo[myData[[i]], mySecondList[[i]]], {i, 5}]
myData // TableForm
and got (0) no crash, (1) no errors or warnings, and (2) the output I expected. (Note: I used 4 rather than 5 in the limit of the first set of appends, just like in your question, in case that was somehow provoking trouble.)
The Mma documentation claims that AppendTo[a,b] is always equivalent to a=Append[a,b], which suggests that it isn't modifying the list in-place. But I wonder whether maybe AppendTo sometimes does modify the list when it thinks it's safe to do so; then if it thinks it's safe and it isn't, there could be nasty consequences. Do the weird error messages and crashes still happen if you replace AppendTo with Append + ordinary assignment?
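For reference, the suggested experiment, using the test data from above (with myData, myFirstList and mySecondList freshly initialized), would read:
Do[myData[[i]] = Append[myData[[i]], myFirstList[[i]]], {i, 4}]
Do[myData[[i]] = Append[myData[[i]], mySecondList[[i]]], {i, 5}]
myData // TableForm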
