Conditional Data Manipulation in Mathematica - data-structures

I am trying to prepare the best tools for efficient Data Analysis in Mathematica.
I have a approximately 300 Columns & 100 000 Rows.
What would be the best tricks to :
"Remove", "Extract" or simply "Consider" parts of the data structure, for plotting for e.g.
One of the trickiest examples I could think of is :
Given a data structure,
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
I also welcome any general advice on data manipulation.

For a generic manipulation of data in a table with named columns, I refer you to this solution of mine, for a similar question. For any particular case, it might be easier to write a function for Select manually. However, for many columns, and many different queries, chances to mess up indexes are high. Here is the modified solution from the mentioned post, which provides a more friendly syntax:
Clear[getIds];
getIds[table : {colNames_List, rows__List}] := {rows}[[All, 1]];
ClearAll[select, where];
SetAttributes[where, HoldAll];
select[cnames_List, from[table : {colNames_List, rows__List}], where[condition_]] :=
With[{colRules = Dispatch[ Thread[colNames -> Thread[Slot[Range[Length[colNames]]]]]],
indexRules = Dispatch[Thread[colNames -> Range[Length[colNames]]]]},
With[{selF = Apply[Function, Hold[condition] /. colRules]},
Select[{rows}, selF ## # &][[All, cnames /. indexRules]]]];
What happens here is that the function used in Select gets generated automatically from your specifications. For example (using #Yoda's example):
rows = Array[#1 #2 &, {5, 15}];
We need to define the column names (must be strings or symbols without values):
In[425]:=
colnames = "c" <> ToString[#] & /# Range[15]
Out[425]= {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12",
"c13", "c14", "c15"}
(in practice, usually names are more descriptive, of course). Here is the table then:
table = Prepend[rows, colnames];
Here is the select statement you need (I picked x = 4 and y=2):
select[{"c1", "c2", "c3", "c6", "c7", "c8", "c9", "c15"}, from[table],
where["c2" == 4 && "c8" != 2]]
{{2, 4, 6, 12, 14, 16, 18, 30}}
Now, for a single query, this may look like a complicated way to do this. But you can do many different queries, such as
In[468]:= select[{"c1", "c2", "c3"}, from[table], where[EvenQ["c2"] && "c10" > 10]]
Out[468]= {{2, 4, 6}, {3, 6, 9}, {4, 8, 12}, {5, 10, 15}}
and similar.
Of course, if there are specific correlations in your data, you might find a particular special-purpose algorithm which will be faster. The function above can be extended in many ways, to simplify common queries (include "all", etc), or to auto-compile the generated pure function (if possible).
EDIT
On a philosophical note, I am sure that many Mathematica users (myself included) found themselves from time to time writing similar code again and again. The fact that Mathematica has a concise syntax makes it often very easy to write for any particular case. However, as long as one works in some specific domain (like, for example, data manipulations in a table), the cost of repeating yourself will be high for many operations. What my example illustrates in a very simple setting is a one possible way out - create a Domain-Specific Language (DSL). For that, one generally needs to define a syntax/grammar for it, and write a compiler from it to Mathematica (to generate Mathematica code automatically). Now, the example above is a very primitive realization of this idea, but my point is that Mathematica is generally very well suited for DSL creation, which I think is a very powerful technique.

data = RandomInteger[{1, 20}, {40, 20}]
x = 5;
y = 8;
Select[data, (#[[2]] == x && #[[8]] != y &)][[All, {1, 2, 3, 6, 7, 8, 9, -1}]]
==> {{5, 5, 1, 4, 18, 6, 3, 5}, {10, 5, 15, 3, 15, 14, 2, 5}, {18, 5, 6, 7, 7, 19, 14, 6}}
Some useful commands to get pieces of matrices and list are Span (;;), Drop, Take, Select, Cases and more. See tutorial/GettingAndSettingPiecesOfMatrices and guide/PartsOfMatrices,
Part ([[...]]) in combination with ;; can be quite powerful. a[[All, 1;;-1;;2]], for instance, means take all rows and all odd columns (-1 having the usual meaning of counting from the end).
Select can be used to pick elements from a list (and remember a matrix is a list of lists), based on a logical function. It's twin brother is Cases which does selection based on a pattern. The function I used here is a 'pure' function, where # refers to the argument on which this function is applied (the elements of the list in this case). Since the elements are lists themselves (the rows of the matrix) I can refer to the columns by using the Part ([[..]]) function.

To pull out columns (or rows) you can do it by part indexing
data = Array[#1 #2 &, {5, 15}];
data[[All, Flatten#{Range#3, Range ## {6, 9}, -1}]]
MatrixForm#%
The last line is just to view it pretty.
As Sjoerd mentioned in his comment (and in the explanation in his answer), indexing a single range can be easily done with the Span (;;) command. If you are joining multiple disjoint ranges, using Flatten to combine the separate ranges created with Range is easier than entering them by hand.

I read:
Extract Column 1 to 3, 6 to 9 as well as the last One for every lines where the value in Column 2 is equal to x and the value in column 8 is different than y
as meaning that we want:
elements 1-3 and 6-9 from each row
AND
the last element from rows wherein [[2]] == x && [[8]] != y.
This is what I hacked together:
a = RandomInteger[5, {20, 10}]; (*define the array*)
x = 4; y = 0; (*define the test values*)
Join ## Range ### {1 ;; 3, 6 ;; 9}; (*define the column ranges*)
#2 == x && #8 != y & ### a; (*test the rows*)
Append[%%, #] & /# % /. {True -> -1, False :> Sequence[]}; (*complete the ranges according to the test*)
MapThread[Part, {a, %}] // TableForm (*extract and display*)

Related

Wolfram Language, RandomChoice recursion

a = RandomChoice[{a,2}]&
a[]
There are other ways to achieve this example, but I wish to do more complicated things similar to this using this method.
Can I get this to continue until there are no as left to resolve, without producing a stack overflow by trying to resolve {a,2} before making the choice? Instead making the choice and resolving only the symbol chosen.
Here is a way to have RandomChoice evaluate a function only when selected:
g := (Print["evaluate g"]; 42);
f = ( If[TrueQ[#], g, #] &#RandomChoice[{True, 1, 2, 3, 4}]) &
Table[f[], {10}]
this prints "evaluate g" just when randomly selected and outputs eg.
(* {2, 42, 3, 1, 3, 2, 4, 42, 2, 4} *)
This is another way, maybe a bit cleaner:
f = Unevaluated[{g, 1, 2, 3, 4}][[RandomInteger[{1, 5}]]] &
this works fine recursively:
a = Unevaluated[{a[], 2}][[RandomInteger[{1, 2}]]] &
Though as i said in comment it simply returns 2 every time since it recurses until 2 is chosen.
a[] (* 2 *)
I do not understand the entirety of the question and my guess is there is a better way to accomplish what you want.
Your request seems slightly contradictory: when the random choice selects a you want it to recurse, but if it chooses a number you still want to continue until there are no a's left.
The following code does this, but the recursion is technically unnecessary. Perhaps in your application it will be applicable.
The first part shows how you can recurse selections of a by using Hold :-
Clear[a]
list = {1, 2, a, 1, a, 3, a, 1, 4, 5};
heldlist = Hold /# list;
a := Module[{z}, z = RandomChoice[heldlist];
Print["for information, choice was ", z];
If[MatchQ[z, Hold[_Symbol]],
heldlist = DeleteCases[heldlist, z, 1, 1];
If[MemberQ[heldlist, Hold[_Symbol]], ReleaseHold[z]]]]
a
On this occasion, calling a recurses once, then picks a 4 and stops, as expected.
for information, choice was Hold[a]
for information, choice was Hold[4]
To make the process continue until there are no a's While can be used. There is recursion going on here too but While keeps the process going when a number is picked.
While[MemberQ[heldlist, Hold[_Symbol]], a]
for information, choice was Hold[a]
for information, choice was Hold[1]
for information, choice was Hold[a]
These are the remaining items in the list :-
ReleaseHold /# heldlist
{1, 2, 1, 3, 1, 4, 5}

What does the <> symbol mean in InterpolatingFunction?

What does <> (less than followed by greater than) mean in Mathematica? For example:
InterpolatingFunction[{-6,6},{0,6}],<>[x,y]
I am very much confused in such kind of expressions. As I have received such kind of output in NDSolve.
Mathematica expressions come with a head and then several arguments. For example, the output of some operation might give you the output List[1,2,3,4,5]. However, Mathematica knows this is a list and the output is formatted as {1,2,3,4,5} instead.
A function like Interpolation will give you a special type of object (an interpolating function) that has many components. Unlike list, most of its components are irrelevant, so you can ignore them. Mathematica hides them using <> so that you don't have to look at them.
f = Interpolation[RandomInteger[10, 10]]
output: InterpolatingFunction[{{1, 10}}, "<>"]
All it shows you is the Head, which is InterpolatingFunction, and then the first argument which is the domain(s) of the function. There is only one variable, so there is only one domain {1,10} so the list of domains is {{1,10}}.
All the other arguments are there, so you can find them. You can evaluate f by:
f[2.3]
output: 0.7385
(Your output will vary!) But you can also look at the pieces of f:
f[[2]]
output: {4, 3, 0, {10}, {4}, 0, 0, 0, 0, Automatic}
The second piece, normally hidden, is a list of different properties of the interpolating function that we normally don't care about.
You can change the head on many things using ## which changes the header of one thing into another. For example:
mylist = {2,3,4,5};
Plus##mylist
output: 14
You can do this with our function:
List##f
output: {{{1, 10}}, {4, 3, 0, {10}, {4}, 0, 0, 0, 0, Automatic},
{{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}}, {{9}, {2}, {0}, {6},
{10}, {6}, {7}, {5}, {0}, {6}}, {Automatic}}
All of that is the "guts" of the interpolating function. That's what's missing in the <>, because this might describe the interpolating function but we don't really need to see it.
If you are looking for an explicit polynomial interpolation, you should be doing:
InterpolatingPolynomial[RandomInteger[10, 10], x]
which gives you a function of x (in a very non-simplified form) that is what you want.

Efficient data structure for a list of index sets

I am trying to explain by example:
Imagine a list of numbered elements E = [elem0, elem1, elem2, ...].
One index set could now be {42, 66, 128} refering to elements in E. The ordering in this set is not important, so {42, 66, 128} == {66, 128, 42}, but each element is at most once in any given index set (so it is an actual set).
What I want now is a space efficient data structure that gives me another ordered list M that contains index sets that refer to elements in E. Each index set in M will only occur once (so M is a set in this regard) but M must be indexable itself (so M is a List in this sense, whereby the precise index is not important). If necessary, index sets can be forced to all contain the same number of elements.
For example, M could look like:
0: {42, 66, 128}
1: {42, 66, 9999}
2: {1, 66, 9999}
I could now do the following:
for(i in M[2]) { element = E[i]; /* do something with E[1],E[66],and E[9999] */ }
You probably see where this is going: You may now have another map M2 that is an ordered list of sets pointing into M which ultimately point to elements in E.
As you can see in this example, index sets can be relatively similar (M[0] and M[1] share the first two entries, M[1] and M[2] share the last two) which makes me think that there must be something more efficient than the naive way of using an array-of-sets. However, I may not be able to come up with a good global ordering of index entries that guarantee good "sharing".
I could think of anything ranging from representing M as a tree (where M's index comes from the depth-first search ordering or something) to hash maps of union-find structures (no idea how that would work though:)
Pointers to any textbook datastructure for something like this are highly welcome (is there anything in the world of databases?) but I also appreciate if you propose a "self-made" solution or only random ideas.
Space efficiency is important for me because E may contain thousands or even few million elements, (some) index sets are potentially large, similarities between at least some index sets should be substantial, and there may be multiple layers of mappings.
Thanks a ton!
You may combine all numbers from M and remove duplicates and name it as UniqueM.
All M[X] collections convert to bit masks. For example int value may store 32 numbers (To support of unlimited count you should store array of ints, if array size is 10 totally we can store 320 different elements). long type may store 64 bits.
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {6, 8, 1}
M[1]: {2, 8, 1}
M[2]: {6, 8, 5}
Will be converted to:
UniqueM: {6, 8, 1, 2, 5}
M[0]: 11100 {this is 7}
M[1]: 01110 {this is 14}
M[2]: 11001 {this is 19}
Note:
Also you may combine my and ring0 approaches, instead of rearrange E make new UniqueM and use intervals inside it.
It will be pretty hard to beat an index. You could save some space by using the right data type (eg in gnu C, short if less than 64k elements in E, int if < 4G...).
Besides,
Since you say the order in E is not important, you could sort E a way it maximizes the consecutive elements to match as much as possible the Ms.
For instance,
E: { 1,2,3,4,5,6,7,8 }
0: {1,3,5,7}
1: {1,3,5,8}
2: {3,5,7,8}
By re-arranging E
E: { 1,3,5,7,8,2,4,6 }
and using E indexes, not values, you could define the Ms based on subsets of E, giving indexes
0: {0-3} // E[0]: 1, E[1]: 3, E[2]: 5, E[3]: 7 etc...
1: {0-2,4}
2: {1-3,4}
this way
you use indexes instead of the raw numbers (indexes are usually smaller, no negative..)
the Ms are made of sub-sets, 0-3 meaning 0,1,2,3,
The difficult part is to make the algorithm to re-arrange E so that you maximize the subsets sizes - minimize the Ms sizes.
E rearrangement algo suggestion
sort all Ms
process all Ms:
algo to build a map, which gives for an element 'x' its list of neighbors 'y', along with points, number of times 'y' is just after 'x'
Map map (x,y) -> z
for m in Ms
for e,f in m // e and f are consecutive elements
if ( ! map(e,f)) map(e,f) = 1
else map(e,f)++
rof
rof
Get E rearranged
ER = {} // E rearranged
Map mas = sort_map(map) // mas(x) -> list(y) where 'y' are sorted desc based on 'z'
e = get_min_elem(mas) // init with lowest element (regardless its 'z' scores)
while (mas has elements)
ER += e // add element e to ER
f = mas(e)[0] // get most likely neighbor of e (in f), ie first in the list
if (empty(mas(e))
e = get_min_elem(mas) // Get next lowest remaining value
else
delete mas(e)[0] // set next e neighbour in line
e = f
fi
elihw
The algo (map) should be O(n*m) space, with n elements in E, m elements in all Ms.
Bit arrays may be used. They're arrays of elements a[i] which are 1 if i is in set and 0 if i is not in set. So every set would occupy exactly size(E) bits even if it contain a few or no members. Not so space efficient, but if you compress this array with some compression algorithm it will be much less in size (possibly reaching ultimate entropy limit). So you can try dynamic Markov coder or RLE or group Huffman and choose one most efficient for you. Then, iteration process could include on-the-fly decompression followed by linear scanning for 1 bits. For looong 0 runs you could modify decompression algorithm to detect such cases (RLE is simplest case for it).
If you found sets having small defference, you may store sets A and A xor B anstead of A and B saving space for common parts. In this case to iterate over B you'll have to unpack both A and A xor B then xor them.
Another useful solution:
E: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}
M[0]: {1, 2, 3, 4, 5, 10, 14, 15}
M[1]: {1, 2, 3, 4, 5, 11, 14, 15}
M[2]: {1, 2, 3, 4, 5, 12, 13}
Cache frequently used items:
Cache[1] = {1, 2, 3, 4, 5}
Cache[2] = {14, 15}
Cache[3] = {-2, 7, 8, 9} //Not used just example.
M[0]: {-1, 10, -2}
M[1]: {-1, 11, -2}
M[2]: {-1, 12, 13}
Mark links to cached list as negative numbers.

Selecting Non-Sequential Elements From a List in Mathematica

I frequently import spreadsheets into Mathematica for analysis and am having trouble coding a simple way to select non-sequential elements for the analysis. For example, If I import a spreadsheet with 20 columns and 100 rows, I commonly will need to drop selected rows/columns.
In this example I need all rows and columns 2,4,7-17. It seems logical the following code should work but results in error:
v[[ All, {2,4,7;;17} ]]
Instead I have been using:
v[[ All, {2,4,7,8,9,10,11,12,13,14,15,16,17} ]]
Is it possible to use Span (;;) to select a block of columns (7-17) while also selecting rows 2 and 4 in my example?
the x ;; y syntax is a special argument to Part, not a general syntax that can be used to build lists. So you could say v[[ All, 7;;17 ]], but not v[[All, {7;;17}]] -- the latter is neither a list of integers nor a special syntax recognizable by Part.
But it is pretty easy to solve your problem. You can try:
v[[All, {2,4}~Join~Table[x,{x,7,17}] ]]
for example, or
Join[v[[All, {2, 4}]], v[[All, 7 ;; 17]], 2]
Good luck!
This is a known limitation of Part and Span. See my own closely related question:
Part and Span: is there a reason this *should* not work?
Your solution is the most common work-around. If you find it too inconvenient to build lists of indices you can try to make it easier with a custom Part function. For example:
SetAttributes[part, HoldFirst]
part[x_, parts__] := x[[##]] & ## Flatten /# ({parts} /. Span -> Range)
Use:
a = Range#24 ~Partition~ 4;
part[a, {1 ;; 3, 6}, {1, 3 ;; 4}]
{{1, 3, 4}, {5, 7, 8}, {9, 11, 12}, {21, 23, 24}}
This makes no attempt to handle negative index Spans which would be considerably more complicated, but perhaps it is useful at least to give you some ideas.
another approach..
pys[{all___}] :=
Flatten[(Switch[Head[#], Span,
Range ## (# /. Span -> List), __, #]) & /# {all}]
list = Range[100];
list[[pys[{1, 3 ;; 12 ;; 2, 19, -3 ;; -1}] ]]
{1, 3, 5, 7, 9, 11, 19, 98, 99, 100}
This notably does not handle open ends {1,3;;} or mixed +/- spans { 5;;-5 }

Store unevaluted function in list mathematica

Example:
list:={ Plus[1,1], Times[2,3] }
When looking at list, I get
{2,6}
I want to keep them unevaluated (as above) so that list returns
{ Plus[1,1], Times[2,3] }
Later I want to evaluate the functions in list sequence to get
{2,6}
The number of unevaluated functions in list is not known beforehand. Besides Plus, user defined functions like f[x_] may be stored in list
I hope the example is clear.
What is the best way to do this?
The best way is to store them in Hold, not List, like so:
In[255]:= f[x_] := x^2;
lh = Hold[Plus[1, 1], Times[2, 3], f[2]]
Out[256]= Hold[1 + 1, 2 3, f[2]]
In this way, you have full control over them. At some point, you may call ReleaseHold to evaluate them:
In[258]:= ReleaseHold#lh
Out[258]= Sequence[2, 6, 4]
If you want the results in a list rather than Sequence, you may use just List##lh instead. If you need to evaluate a specific one, simply use Part to extract it:
In[261]:= lh[[2]]
Out[261]= 6
If you insist on your construction, here is a way:
In[263]:= l:={Plus[1,1],Times[2,3],f[2]};
Hold[l]/.OwnValues[l]
Out[264]= Hold[{1+1,2 3,f[2]}]
EDIT
In case you have some functions/symbols with UpValues which can evaluate even inside Hold, you may want to use HoldComplete in place of Hold.
EDIT2
As pointed by #Mr.Wizard in another answer, sometimes you may find it more convenient to have Hold wrapped around individual items in your sequence. My comment here is that the usefulness of both forms is amplified once we realize that it is very easy to transform one into another and back. The following function will split the sequence inside Hold into a list of held items:
splitHeldSequence[Hold[seq___], f_: Hold] := List ## Map[f, Hold[seq]]
for example,
In[274]:= splitHeldSequence[Hold[1 + 1, 2 + 2]]
Out[274]= {Hold[1 + 1], Hold[2 + 2]}
grouping them back into a single Hold is even easier - just Apply Join:
In[275]:= Join ## {Hold[1 + 1], Hold[2 + 2]}
Out[275]= Hold[1 + 1, 2 + 2]
The two different forms are useful in diferrent circumstances. You can easily use things such as Union, Select, Cases on a list of held items without thinking much about evaluation. Once finished, you can combine them back into a single Hold, for example, to feed as unevaluated sequence of arguments to some function.
EDIT 3
Per request of #ndroock1, here is a specific example. The setup:
l = {1, 1, 1, 2, 4, 8, 3, 9, 27}
S[n_] := Module[{}, l[[n]] = l[[n]] + 1; l]
Z[n_] := Module[{}, l[[n]] = 0; l]
placing functions in Hold:
In[43]:= held = Hold[Z[1], S[1]]
Out[43]= Hold[Z[1], S[1]]
Here is how the exec function may look:
exec[n_] := MapAt[Evaluate, held, n]
Now,
In[46]:= {exec[1], exec[2]}
Out[46]= {Hold[{0, 1, 1, 2, 4, 8, 3, 9, 27}, S[1]], Hold[Z[1], {1, 1, 1, 2, 4, 8, 3, 9, 27}]}
Note that the original variable held remains unchanged, since we operate on the copy. Note also that the original setup contains mutable state (l), which is not very idiomatic in Mathematica. In particular, the order of evaluations matter:
In[61]:= Reverse[{exec[2], exec[1]}]
Out[61]= {Hold[{0, 1, 1, 2, 4, 8, 3, 9, 27}, S[1]], Hold[Z[1], {2, 1, 1, 2, 4, 8, 3, 9, 27}]}
Whether or not this is desired depends on the specific needs, I just wanted to point this out. Also, while the exec above is implemented according to the requested spec, it implicitly depends on a global variable l, which I consider a bad practice.
An alternative way to store functions suggested by #Mr.Wizard can be achieved e.g. like
In[63]:= listOfHeld = splitHeldSequence[held]
Out[63]= {Hold[Z1], Hold[S1]}
and here
In[64]:= execAlt[n_] := MapAt[ReleaseHold, listOfHeld, n]
In[70]:= l = {1, 1, 1, 2, 4, 8, 3, 9, 27} ;
{execAlt[1], execAlt[2]}
Out[71]= {{{0, 1, 1, 2, 4, 8, 3, 9, 27}, Hold[S[1]]}, {Hold[Z[1]], {1, 1, 1, 2, 4, 8, 3, 9, 27}}}
The same comments about mutability and dependence on a global variable go here as well. This last form is also more suited to query the function type:
getType[n_, lh_] := lh[[n]] /. {Hold[_Z] :> zType, Hold[_S] :> sType, _ :> unknownType}
for example:
In[172]:= getType[#, listOfHeld] & /# {1, 2}
Out[172]= {zType, sType}
The first thing that spings to mind is to not use List but rather use something like this:
SetAttributes[lst, HoldAll];
heldL=lst[Plus[1, 1], Times[2, 3]]
There will surely be lots of more erudite suggestions though!
You can also use Hold on every element that you want held:
a = {Hold[2 + 2], Hold[2*3]}
You can use HoldForm on either the elements or the list, if you want the appearance of the list without Hold visible:
b = {HoldForm[2 + 2], HoldForm[2*3]}
c = HoldForm#{2 + 2, 2*3}
{2 + 2, 2 * 3}
And you can recover the evaluated form with ReleaseHold:
a // ReleaseHold
b // ReleaseHold
c // ReleaseHold
Out[8]= {4, 6}
Out[9]= {4, 6}
Out[10]= {4, 6}
The form Hold[2+2, 2*3] or that of a and b above are good because you can easily add terms with e.g. Append. For b type is it logically:
Append[b, HoldForm[8/4]]
For Hold[2+2, 2*3]:
Hold[2+2, 2*3] ~Join~ Hold[8/4]
Another way:
lh = Function[u, Hold#u, {HoldAll, Listable}];
k = lh#{2 + 2, Sin[Pi]}
(*
->{Hold[2 + 2], Hold[Sin[\[Pi]]]}
*)
ReleaseHold#First#k
(*
-> 4
*)

Resources