In Mathematica as in other systems of computer math the numbers are internally stored in binary form. However when exporting them with such functions as Put and PutAppend they are converted into approximate decimals. When you import them back with such functions as Get they are restored from this approximate decimal representation to binary form.
The question is whether the recovered number is always identical to the original binary number and, if not always, in which cases it is not and how large can be the difference? I am particularly interested in the Put - Get cycle (on the same computer system).
The following two simple experiments show that probably the Put - Get cycle in Mathematica always restores original numbers exactly even for arbitrary precision numbers:
In[1]:= list=RandomReal[{-10^6,10^6},10000];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Order[Total#Abs[list-list2],0.]===0
Out[4]= True
Out[5]= True
In[6]:= list=SetPrecision[RandomReal[{-10^6,10^6},10000],50];
Put[list,"test.txt"];
list2=Get["test.txt"];
Order[list,list2]===0
Total#Abs[list-list2]//InputForm
Out[9]= True
Out[10]//InputForm=
0``39.999515496936205
But maybe I am missing something?
UPDATE
With more correct test code I have found that in reality these tests show only that restored numbers have identical binary RealDigits but their Precisions may differ even in Equal sense. Here are more correct tests:
test := (Put[list, "test.txt"];
list2 = Get["test.txt"];
{Order[list, list2] === 0,
Order[Total#Abs[list - list2], 0.] === 0,
Total[Order ### RealDigits[Transpose[{list, list2}], 2]],
Total[Order ### Map[Precision, Transpose[{list, list2}], {-1}]],
Total[1 - Boole[Equal ### Map[Precision, Transpose[{list, list2}], {-1}]]]})
In[8]:= list=RandomReal[NormalDistribution[],10000]^1001;
test
Out[9]= {False,True,0,1,3}
In[6]:= list=RandomReal[NormalDistribution[],10000,WorkingPrecision->50]^1001;
test
Out[7]= {False,False,0,-2174,1}
I'm afraid I can't give a definitive answer. If you look into the text file you see it's stored as something like the InputForm of the values, including the precision indication for non-machine precision numbers.
Assuming that Get uses the same conversion routines as ImportString and ExportString your test can be sped up a tiny bit.
Monitor[
Do[
i = RandomReal[{$MinMachineNumber, 10 $MinMachineNumber}, 100000];
If[i =!=
ToExpression[ImportString[ExportString[i, "Text"], "List"]],
Print[i]], {n, 100}
],
n]
I have tested this for several hundreds of millions of numbers in various ranges between $MinMachineNumber and $MaxMachineNumber and I always get back the original numbers. It's no proof, of course, but it seems unlikely that you're going to see numbers for which this is not true if there are any (and in that case the difference would be so tiny as to be negligible).
One important thing to know is that Put[] / Get[] doesn't keep packed arrays packed. You should check out DumpSave[]. It's much faster as it's a binary format and keeps arrays packed.
Related
I want mathematica to display the result in decimal form.Say 3 decimal places together with 10 to some power. What function shall I use?
r = 0;
Do[r += (i/100), {i, 1, 100}];
Print[r];
I tried ScientificForm[r,3]andNumberForm[r,3] and both do not work.
Thanks in advance!
The problem you have, though you don't quite state this, is that Mathematica can compute r accurately. Your code sets the value of 2 to the rational number 101/2 and Mathematica's default behaviour is to display accurate numbers accurately. That's what you (or whoever bought your licence) pay for.
The expression
N[r]
will produce a decimal representation of r, ie 50.5 and
ScientificForm[N[r]]
gives the result
5.05*10^(1)
(though formatted rather more nicely in the Mathematica front end).
I am using Mathematica 7 to process a large data set. The data set is a three-dimensional array of signed integers. The three levels may be thought of as corresponding to X points per shot, Y shots per scan, and Z scans per set.
I also have a "zeroing" shot (containing X points, which are signed fractions of integers), which I would like to subtract from every shot in the data set. Afterwards, I will never again need the original data set.
How can I perform this transformation without creating new copies of the data set, or parts of it, in the process? Conceptually, the data set is located in memory, and I would like to scan through each element, and change it at that location in memory, without permanently copying it to some other memory location.
The following self-contained code captures all the aspects of what I am trying to do:
(* Create some offsetted data, and a zero data set. *)
myData = Table[Table[Table[RandomInteger[{1, 100}], {k, 500}], {j, 400}], {i, 200}];
myZero = Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
(* Method 1 *)
myData = Table[
f1 = myData[[i]];
Table[
f2 = f1[[j]];
f2 - myZero, {j, 400}], {i, 200}];
(* Method 2 *)
Do[
Do[
myData[[i]][[j]] = myData[[i]][[j]] - myZero, {j, 400}], {i, 200}]
(* Method 3 *)
Attributes[Zeroing] = {HoldFirst};
Zeroing[x_] := Module[{},
Do[
Do[
x[[i]][[j]] = x[[i]][[j]] - myZero, {j, Length[x[[1]]]}
], {i, Length[x]}
]
];
(Note: Hat tip to Aaron Honecker for Method #3.)
On my machine (Intel Core2 Duo CPU 3.17 GHz, 4 GB RAM, 32-bit Windows 7), all three methods use roughly 1.25 GB of memory, with #2 and #3 fairing slightly better.
If I don't mind losing precision, wrapping N[ ] around the innards of myData and myZero when they're being created increases their size in memory by 150 MB initially but reduces the amount of memory required for zeroing (by methods #1-#3 above) from 1.25 GB down to just 300 MB! That's my working solution, but it would be great to know the best way of handling this problem.
Unfortunately I have little time now, so I must be concise ...
When working with large data, you need to be aware that Mathematica has a different storage format called packed arrays which is much more compact and much faster than the regular one, but only works for machine reals or integers.
Please evaluate ?Developer`*Packed* to see what functions are available for directly converting to/from them, if this doesn't happen automatically.
So the brief explanation behind why my solution is fast and memory efficient is that it uses packed arrays. I tested using Developer`PackedArrayQ that my arrays never get unpacked, and I used machine reals (I applied N[] to everything)
In[1]:= myData = N#RandomInteger[{1, 100}, {200, 400, 500}];
In[2]:= myZero =
Developer`ToPackedArray#
N#Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];
In[3]:= myData = Map[# - myZero &, myData, {2}]; // Timing
Out[3]= {1.516, Null}
Also, the operation you were asking for ("I would like to scan through each element, and change it at that location in memory") is called mapping (see Map[] or /#).
Let me start by noting that this answer must be viewed as complementary to the one by #Szabolcs, with the latter being, in my conclusion, the better option. While the solution of #Szabolcs is probably the fastest and best overall, it falls short of the original spec in that Map returns a (modified) copy of the original list, rather than "scan each element, and change it at that location in memory". Such behavior, AFAIK, is only provided by Part command. I will use his ideas (converting everything into packed arrays), to show the code that does in-memory changes to the original list:
In[5]:=
Do[myData[[All,All,i]]=myData[[All,All,i]]- myZero[[i]],
{i,Last#Dimensions#myData}];//Timing
Out[5]= {4.734,Null}
This is conceptually equivalent to method 3 listed in the question, but runs much faster because this is a partly vectorized solution and only a single loop is needed. This is however still at least order of magnitude slower than the solution of #Szabolcs.
In theory, this seems to be a classic speed/memory tradeoff: if you need speed and have some spare memory, #Szabolcs's solution is the way to go. If your memory requirements are tough, in theory this slower method would save on intermediate memory consumption (in the method of #Szabolcs, the original list is garbage-collected after the myData is assigned the result of Map, so the final memory usage is the same, but during the computation, one extra array of the size of myData is maintained by Map).
In practice, however, the memory consumption seems to not be smaller, since an extra copy of the list is for some reason maintained in the Out variable in both cases during (or right after) the computation, even when the output is suppressed (it may also be that this effect does not manifest itself in all cases). I don't quite understand this yet, but my current conclusion is that the method of #Szabolcs is just as good in terms of intermediate memory consumption as the present one based on the in-place list modifications. Therefore, his method seems the way to go in all cases, but I still decided to publish this answer as a complementary.
Can anybody advise an alternative to importing a couple of
GByte of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MByte?
The - too large to post here - research-problem involved simple statistical operations
with double as much GB of data (around 34) than RAM available (16).
To handle the data size problem I just split things up and used
a Get / Clear strategy to do the math.
It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering if it would be quicker to use BLOBs or whatever with PostgreSQL or MySQL or whatever database people use for GB of numeric data.
So my question really is:
What is the most efficient way to handle truly large data set imports in Mathematica?
I have not tried it yet, but I think that SQLImport from DataBaseLink will be slower than Get["bigfile.mx"].
Anyone has some experience to share?
(Sorry if this is not a very specific programming question, but it would really help me to move on with that time-consuming finding-out-what-is-the-best-of-the-137-possibilities-to-tackle-a-problem-in-Mathematica).
Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
data = RandomReal[1, 10000000];
indexes = Union#RandomInteger[{1, 10000000}, 10000];
ranges = #1 ;; (#2 - 1) & ### Partition[indexes, 2, 1];
data[[#]] & /# ranges; // Timing
{0.093, Null}
Alternatively store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.
Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need of breaking it into separate MX files. Then you could import relevant parts of the file like this:
First make a test file:
In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]
In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}
In[5]:= Close[f]
Open it:
In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]
In[7]:= StreamPosition[f]
Out[7]= 0
Skip the first 5 million entries:
In[8]:= SetStreamPosition[f, 5000000*8]
Out[8]= 40000000
Read 5 million entries:
In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing
Out[9]= {0.609, 5000000}
Read all the remaining entries:
In[10]:= BinaryReadList[f, "Real64"] // Length // Timing
Out[10]= {7.782, 70000000}
In[11]:= Close[f]
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)
EDIT If you are willing to spend time on this, and write some C code, another idea is to create a library function (using Library Link) that will memory-map the file (link for Windows), and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of Library Link).
I think the two best approaches are either:
1) use Get on the *.mx file,
2) or read in that data and save it in some binary format for which you write a LibraryLink code and then read the stuff via that. That, of course, has the disadvantage that you'd need to convert your MX stuff. But perhaps this is an option.
Generally speaking Get with MX files is pretty fast.
Are sure this is not a swapping problem?
Edit 1:
You could then use also write in an import converter: tutorial/DevelopingAnImportConverter
I've been looking at the ways to check arguments of functions. I noticed that
MatrixQ takes 2 arguments, the second is a test to apply to each element.
But ListQ only takes one argument. (also for some reason, ?ListQ does not have a help page, like ?MatrixQ does).
So, for example, to check that an argument to a function is a matrix of numbers, I write
ClearAll[foo]
foo[a_?(MatrixQ[#, NumberQ] &)] := Module[{}, a + 1]
What would be a good way to do the same for a List? This below only checks that the input is a List
ClearAll[foo]
foo[a_?(ListQ[#] &)] := Module[{}, a + 1]
I could do something like this:
ClearAll[foo]
foo[a_?(ListQ[#] && (And ## Map[NumberQ[#] &, # ]) &)] := Module[{}, a + 1]
so that foo[{1, 2, 3}] will work, but foo[{1, 2, x}] will not (assuming x is a symbol). But it seems to me to be someone complicated way to do this.
Question: Do you know a better way to check that an argument is a list and also check the list content to be Numbers (or of any other Head known to Mathematica?)
And a related question: Any major run-time performance issues with adding such checks to each argument? If so, do you recommend these checks be removed after testing and development is completed so that final program runs faster? (for example, have a version of the code with all the checks in, for the development/testing, and a version without for production).
You might use VectorQ in a way completely analogous to MatrixQ. For example,
f[vector_ /; VectorQ[vector, NumericQ]] := ...
Also note two differences between VectorQ and ListQ:
A plain VectorQ (with no second argument) only gives true if no elements of the list are lists themselves (i.e. only for 1D structures)
VectorQ will handle SparseArrays while ListQ will not
I am not sure about the performance impact of using these in practice, I am very curious about that myself.
Here's a naive benchmark. I am comparing two functions: one that only checks the arguments, but does nothing, and one that adds two vectors (this is a very fast built-in operation, i.e. anything faster than this could be considered negligible). I am using NumericQ which is a more complex (therefore potentially slower) check than NumberQ.
In[2]:= add[a_ /; VectorQ[a, NumericQ], b_ /; VectorQ[b, NumericQ]] :=
a + b
In[3]:= nothing[a_ /; VectorQ[a, NumericQ],
b_ /; VectorQ[b, NumericQ]] := Null
Packed array. It can be verified that the check is constant time (not shown here).
In[4]:= rr = RandomReal[1, 10000000];
In[5]:= Do[add[rr, rr], {10}]; // Timing
Out[5]= {1.906, Null}
In[6]:= Do[nothing[rr, rr], {10}]; // Timing
Out[6]= {0., Null}
Homogeneous non-packed array. The check is linear time, but very fast.
In[7]:= rr2 = Developer`FromPackedArray#RandomInteger[10000, 1000000];
In[8]:= Do[add[rr2, rr2], {10}]; // Timing
Out[8]= {1.75, Null}
In[9]:= Do[nothing[rr2, rr2], {10}]; // Timing
Out[9]= {0.204, Null}
Non-homogeneous non-packed array. The check takes the same time as in the previous example.
In[10]:= rr3 = Join[rr2, {Pi, 1.0}];
In[11]:= Do[add[rr3, rr3], {10}]; // Timing
Out[11]= {5.625, Null}
In[12]:= Do[nothing[rr3, rr3], {10}]; // Timing
Out[12]= {0.282, Null}
Conclusion based on this very simple example:
VectorQ is highly optimized, at least when using common second arguments. It's much faster than e.g. adding two vectors, which itself is a well optimized operation.
For packed arrays VectorQ is constant time.
#Leonid's answer is very relevant too, please see it.
Regarding the performance hit (since your first question has been answered already) - by all means, do the checks, but in your top-level functions (which receive data directly from the user of your functionality. The user can also be another independent module, written by you or someone else). Don't put these checks in all your intermediate functions, since such checks will be duplicate and indeed unjustified.
EDIT
To address the problem of errors in intermediate functions, raised by #Nasser in the comments: there is a very simple technique which allows one to switch pattern-checks on and off in "one click". You can store your patterns in variables inside your package, defined prior to your function definitions.
Here is an example, where f is a top-level function, while g and h are "inner functions". We define two patterns: for the main function and for the inner ones, like so:
Clear[nlPatt,innerNLPatt ];
nlPatt= _?(!VectorQ[#,NumericQ]&);
innerNLPatt = nlPatt;
Now, we define our functions:
ClearAll[f,g,h];
f[vector:nlPatt]:=g[vector]+h[vector];
g[nv:innerNLPatt ]:=nv^2;
h[nv:innerNLPatt ]:=nv^3;
Note that the patterns are substituted inside definitions at definition time, not run-time, so this is exactly equivalent to coding those patterns by hand. Once you test, you just have to change one line: from
innerNLPatt = nlPatt
to
innerNLPatt = _
and reload your package.
A final question is - how do you quickly find errors? I answered that here, in sections "Instead of returning $Failed, one can throw an exception, using Throw.", and "Meta-programming and automation".
END EDIT
I included a brief discussion of this issue in my book here. In that example, the performance hit was on the level of 10% increase of running time, which IMO is borderline acceptable. In the case at hand, the check is simpler and the performance penalty is much less. Generally, for a function which is any computationally-intensive, correctly-written type checks cost only a small fraction of the total run-time.
A few tricks which are good to know:
Pattern-matcher can be very fast, when used syntactically (no Condition or PatternTest present in the pattern).
For example:
randomString[]:=FromCharacterCode#RandomInteger[{97,122},5];
rstest = Table[randomString[],{1000000}];
In[102]:= MatchQ[rstest,{__String}]//Timing
Out[102]= {0.047,True}
In[103]:= MatchQ[rstest,{__?StringQ}]//Timing
Out[103]= {0.234,True}
Just because in the latter case the PatternTest was used, the check is much slower, because evaluator is invoked by the pattern-matcher for every element, while in the first case, everything is purely syntactic and all is done inside the pattern-matcher.
The same is true for unpacked numerical lists (the timing difference is similar). However, for packed numerical lists, MatchQ and other pattern-testing functions don't unpack for certain special patterns, moreover, for some of them the check is instantaneous.
Here is an example:
In[113]:=
test = RandomInteger[100000,1000000];
In[114]:= MatchQ[test,{__?IntegerQ}]//Timing
Out[114]= {0.203,True}
In[115]:= MatchQ[test,{__Integer}]//Timing
Out[115]= {0.,True}
In[116]:= Do[MatchQ[test,{__Integer}],{1000}]//Timing
Out[116]= {0.,Null}
The same, apparently, seems to be true for functions like VectorQ, MatrixQ and ArrayQ with certain predicates (NumericQ) - these tests are extremely efficient.
A lot depends on how you write your test, i.e. to what degree you reuse the efficient Mathematica structures.
For example, we want to test that we have a real numeric matrix:
In[143]:= rm = RandomInteger[10000,{1500,1500}];
Here is the most straight-forward and slow way:
In[144]:= MatrixQ[rm,NumericQ[#]&&Im[#]==0&]//Timing
Out[144]= {4.125,True}
This is better, since we reuse the pattern-matcher better:
In[145]:= MatrixQ[rm,NumericQ]&&FreeQ[rm,Complex]//Timing
Out[145]= {0.204,True}
We did not utilize the packed nature of the matrix however. This is still better:
In[146]:= MatrixQ[rm,NumericQ]&&Total[Abs[Flatten[Im[rm]]]]==0//Timing
Out[146]= {0.047,True}
However, this is not the end. The following one is near instantaneous:
In[147]:= MatrixQ[rm,NumericQ]&&Re[rm]==rm//Timing
Out[147]= {0.,True}
Since ListQ just checks that the head is List, the following is a simple solution:
foo[a:{___?NumberQ}] := Module[{}, a + 1]
Mathematica 8.0.1
Any one could explain what would be the logic behind this result
In[24]:= Round[10.75, .1]
Out[24]= 10.8
In[29]:= Round[2.75, .1]
Out[29]= 2.8000000000000003
I have expected the second result above to be 2.8?
EDIT 1:
I was trying to do the above for formatting purposes only to make the number fit in the space. I ended up doing the following to get the result I want:
In[41]:= NumberForm[2.75,2]
Out[41] 2.8
I wish Mathematica has printf() like formatting function. I find formatting numbers in Mathematica for exact field width and form a little awkward compared to using printf() formatting rules.
EDIT 2:
I tried $MaxExtraPrecision=1000 on some number I was trying for format/round, but it did not work, that is why I posted this question. Here it is
In[42]:= $MaxExtraPrecision=1000;
Round[2035.7520395261859,.1]
Out[43]= 2035.8000000000002
In[46]:= $MaxExtraPrecision=50;
Round[2.75,.1]
Out[47]= 2.8000000000000003
EDIT 3:
I found this way, to format a number to one decimal point only. Use Numberform, but first need to find what n-digit precision to use by counting the number of digits to the left of the decimal point, then adding 1.
In[56]:= x=2035.7520395261859;
NumberForm[x,IntegerLength[Round#x]+1]
Out[57]//NumberForm= 2035.8
EDIT 4:
The above (Edit 3) did not work for numbers such as
a=2.67301785 10^7
After some trials, I found Accounting Form to do what I want. AccountingForm gets rid of the 10^n form which NumberForm did not:
In[76]:= x=2035.7520395261859;
AccountingForm[x,IntegerLength[Round#x]+1]
Out[77]//AccountingForm= 2035.8
In[78]:= x=2.67301785 10^7;
AccountingForm[x,IntegerLength[Round#x]+1]
Out[79]//AccountingForm= 26730178.5
For formatting numerical values, the best language I found was Fortran, followed COBOL and also by those languages that use or support printf() standard formatting. With Mathematica, one can do such formatting I am sure, but it sure seems too complicated to me. I never understood why Mathematics does not have Printf[].
Not all decimal (base 10) numbers with a finite number of digits are representable in binary (base 2) with a finite number of digits. E.g. 0.1 is not representable in binary, just like 1/3 ~= 0.33333... is not representable in decimal. Mathematica (and other software) will only use a limited number of decimal digits when showing the number to hide this effect. However, occasionally it might happen that enough decimal digits are shown that the mismatch becomes visible.
http://en.wikipedia.org/wiki/Floating_point#Representable_numbers.2C_conversion_and_rounding
EDIT
This command will show you what happens when you find the closes binary representation of 0.1 using 20 binary digits, then convert it back to decimal:
RealDigits[FromDigits[RealDigits[1/10, 2, 20], 2], 10]
The number is stored in base 2, rather than base 10 (decimal). It's impossible to represent 2.8 in base 2, so it uses the closest value: 2.8000000000000003
Number/AccountingForm can take a list in the second argument, the second item of which is how many digits after the decimal place to show:
In[61]:= x=2035.7520395261859;
In[62]:= AccountingForm[x,{Infinity,3}]
Out[62]//AccountingForm= 2035.752
Perhaps this is useful.