Reading matrices into Mata

I have some large matrices I want to process in Mata, i.e., typical matrix operations such as inverting, multiplying, etc. These are Stata files with variable names in the first row. Some are quite large, >15 GB. So, the first problem is reading the data. I read something about setting up views, but my version of Stata does not show any help for st_view. The help for Mata talks about opening a file with fopen(), but it's pretty cryptic. I also read something about Mata adding changes to the original data. I'd prefer some strategy that doesn't alter my original data as it takes a long time to create the original matrices. Can someone point me in the right direction?

Some misinformation here!
If your matrix is already read in, fopen() sounds irrelevant to you.
If your matrix consists of variables already in Stata, consider using putmata. However, if the variable names really are in the first row (i.e., the first observation), you may need to take them out and destring.
st_view() is documented; presumably you are just looking in the wrong place. Start at help m4_stata.
Mata won't change your Stata data unless you ask it to.

How to plot variables with possibly wild variable values?

I want to build an application that does something equivalent to running lsof in a loop (perhaps with different output formatting, since the string processing may make it too slow for real time) and then associating each line (entry) with the iteration in which it appeared, which I will refer to as a frame from here on, as that makes the rest easier to follow. My intention is that showing the times at which files are open by applications can reveal something about their structure, while having little impact on their execution, which is often a problem. One problem I have is processing the output, which would be a table relating frames to entries; I already anticipate wildly variable entry lengths. That runs into the usual problem of representing very different scales geometrically: the smaller values become vanishingly small while the bigger ones become giant, and fragmentation makes it even worse. So my question is whether plotting libraries deal with this problem, and how they do it.
The easiest and most well-established technique for showing both small and large values in reasonable detail is a logarithmic scale. Instead of plotting raw values, plot their logarithms. This is notoriously problematic if you can have zero or even negative values, but as I understand your situation all your lengths would be strictly positive, so this should work.
Another statistical solution you could apply is to plot ranks instead of raw values. Take all the observed values and put them in a sorted list. When plotting any single data point, instead of plotting the value itself, look up that value in the sorted list (possibly using binary search, since it's sorted), then plot the index at which you found it.
This is a monotonic transformation, so small values map to small indices and big values to big indices. On the other hand it completely discards the actual magnitudes; only the relative comparisons matter.
If this is too radical, you could consider using it as an ingredient for something more tuneable. You could experiment with a linear combination, i.e. plot
a*x + b*log(x) + c*rank(x)
then tweak a, b and c till the result looks pleasing.
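For example, a rough Python sketch of the blended transform (the sample lengths and the weights a, b, c are made up; the rank here is just a double argsort):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical entry lengths spanning several orders of magnitude.
lengths = np.array([3, 12, 40, 95, 400, 2000, 150000, 4000000], dtype=float)

def rank(x):
    # Map each value to its position in the sorted list (a monotonic transform).
    return np.argsort(np.argsort(x)).astype(float)

# Tunable blend of raw value, log and rank; tweak a, b and c to taste.
a, b, c = 0.0, 1.0, 0.5
y = a * lengths + b * np.log(lengths) + c * rank(lengths)

plt.plot(y, marker="o")
plt.title("a*x + b*log(x) + c*rank(x)")
plt.show()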

Save/Restore Ruby's Random

I'm trying to create a game which I want to always run the same way given the same seed. That means that random events, whatever they may be, will always be the same for two players using the same seed.
However, given the user's ability to save and load the game, Ruby's Random would reset every time a save is loaded, making the whole principle void if two players save and load at different points.
The only solution I have imagined for this is, whenever a save file is loaded, to generate the same count of random numbers as before, thus getting Ruby's Random back to the state it was in before the load. However, to do that I'd need to extend it so a counter is updated every time a random number is generated.
Does anyone know how to do that or has a better way to restore the state of Ruby's Random?
PS: I cannot use an instance of Random (Random.new) and Marshal it. I have to use Ruby's default.
Sounds like Marshal.dump/Marshal.load may be exactly what you want. The Random class documentation explicitly states "Random objects can be marshaled, allowing sequences to be saved and resumed."
You may still have problems with synchronization across games, since different user-based decisions can take you through different logic paths and thus use the sequence of random numbers in entirely different ways.
I'd suggest maybe saving the 'current' data to a file when the user decides to save (or when the program closes) depending on what you prefer.
This can be done using the File class in Ruby.
This would mean you'd need to keep track of turns and pass that along with the save data. Or you could loop through the data in the file and find out how many turns have occurred that way, I suppose.
So you'd have something like:
def loadGame(loadFile)
  data = File.read(loadFile)
  # What you do with `data` here depends on how you decide to store it in saveGame.
  data
end

def saveGame(saveFile, data)
  File.open(saveFile, "w") { |file| file.puts data }
end
Haven't really tried the above code, so it could have bad syntax or such. It's mainly just the concept I'm trying to get across.
Hopefully that helps?
There are many generators that compute each random number in the sequence from the previous value alone, so if you used one of those you need only save the last random number as part of the state of the game. An example is a basic linear congruential generator, which has the form:
z(n+1) = (a * z(n) + b) mod c
where a, b and c are typically large (known) constants, and z(0) is the seed.
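For concreteness, a toy Python sketch of that idea, using Knuth's MMIX constants purely as an example: because the next value depends only on the previous one, saving the last value is enough to resume the stream.
class LCG:
    # Toy linear congruential generator: z(n+1) = (a * z(n) + b) mod c.
    A = 6364136223846793005   # multiplier (Knuth's MMIX constants, just as an example)
    B = 1442695040888963407   # increment
    C = 2 ** 64               # modulus

    def __init__(self, seed):
        self.z = seed % self.C

    def next(self):
        self.z = (self.A * self.z + self.B) % self.C
        return self.z

rng = LCG(12345)
values = [rng.next() for _ in range(3)]

saved = rng.z          # saving the game only needs the last value...

restored = LCG(0)
restored.z = saved     # ...and restoring it resumes the exact same sequence
assert restored.next() == rng.next()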
An arguably better one is the so-called "multiply-with-carry" method.

Matrix transposition on a magnetic tape

Programming Pearls, Problem 7, is about transposing a 4000 x 4000 matrix stored on a magnetic tape. My solution was to simply use a temporary variable and swap the contents of a[i][j] and a[j][i].
The solution given by the author confused me a little bit. He says we should:
prepend the row and column indices to each element,
sort the records in the matrix by row,
and remove the prepended indices.
Why do you have to go through so much trouble to get this done? Does it have something to do with magnetic tapes?
I think the meaning of this exercise is as follows.
For computers of that era, there was not enough RAM to hold a matrix of that size, so your proposed swapping method would not be feasible. In order to transpose such a large matrix, external storage, i.e. the magnetic tape, has to be used.
However, reading and writing a tape back and forth is rather slow, because tapes are serial storage devices, so reading and writing sequentially saves a lot of time.
Merge sort is very well suited to such sequential storage because of the way it accesses elements, as described on the Wikipedia page for merge sort. So I believe the "system tape sort" means a merge sort on tape.
With the three points above in mind, I think you can understand this exercise.
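Here is an in-memory Python sketch of the tag/sort/strip trick; on the real problem the in-memory sort would be a tape-based merge sort, but the idea is the same.
def transpose_by_sort(matrix):
    # Tag each element with its destination coordinates (original column, original row),
    # sort the records, then strip the tags and lay the values out row by row.
    n_rows, n_cols = len(matrix), len(matrix[0])

    # 1. Prepend the (transposed) indices to each element.
    tagged = [(j, i, matrix[i][j]) for i in range(n_rows) for j in range(n_cols)]

    # 2. Sort the records; on the original problem this is the tape sort.
    tagged.sort()

    # 3. Remove the indices and rebuild the transposed matrix.
    values = [v for _, _, v in tagged]
    return [values[r * n_rows:(r + 1) * n_rows] for r in range(n_cols)]

assert transpose_by_sort([[1, 2, 3], [4, 5, 6]]) == [[1, 4], [2, 5], [3, 6]]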
I think the point about magnetic tapes is: to find a certain element, you have to travel from the beginning of the tape to that element.
But I have difficulty understanding what a "system tape sort" is and why it works.

Best way to store 1 trillion lines of information

I'm doing calculations and the resultant text file right now has 288012413 lines, with 4 columns. Sample line:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable, and it's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4 GB.
MS Access has a limit of 4 GB.
So those options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may be smaller.
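For example, a minimal sketch of the database route using Python's built-in sqlite3 module (the file, table and column names are made up):
import sqlite3

conn = sqlite3.connect("results.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS results "
             "(line INTEGER, thing1 INTEGER, thing2 INTEGER, value REAL)")

# Batch the inserts inside a transaction and skip indexes, so loading stays fast;
# an index can always be added afterwards if lookups are needed.
rows = [(288012413, 4855, 18668, 5.568)]  # the sample row from the question, rounded
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()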
Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it, saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision, depending on the type of calculation.
You might also want to look into compressing the file if it is just used to store the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8 GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
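For example, a short sketch with the h5py Python bindings (the file and dataset names are placeholders), showing how the built-in gzip compression is switched on per dataset:
import numpy as np
import h5py

data = np.random.rand(1000000, 4)  # stand-in for the real results

with h5py.File("results.h5", "w") as f:
    # Chunked, gzip-compressed dataset; HDF5 handles the compression transparently.
    f.create_dataset("results", data=data, compression="gzip", chunks=True)

with h5py.File("results.h5", "r") as f:
    first_rows = f["results"][:10]  # read back just a slice without loading the whole file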
If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like so:
struct x {
    long lineno;
    short thing1;
    short thing2;
    double value;
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
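The same idea in Python's struct module; the format string below is one possible mapping of the struct above (little-endian, 8-byte line number, two 2-byte shorts, 8-byte double) and would need adjusting to your platform:
import struct

# '<qhhd' = little-endian: 8-byte line number, two 2-byte shorts, 8-byte double (20 bytes total).
RECORD = struct.Struct("<qhhd")

with open("results.bin", "wb") as f:
    f.write(RECORD.pack(288012413, 4855, 18668, 5.5677643628300215))

with open("results.bin", "rb") as f:
    lineno, thing1, thing2, value = RECORD.unpack(f.read(RECORD.size))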
Well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using UTF-16 or UTF-32 encodings, that will double or quadruple the size of the file compared with ASCII.
Like AShelly, but smaller.
Assuming line numbers are consecutive...
struct x {
    short thing1;
    short thing2;
    short value;  // only 3 decimal places needed, so store as fixed point n*1000; that leaves 2 digits left of the decimal point
};
Save in a binary file.
lseek(), read() and write() are your friends.
The file will be large(ish) at around 1.7 GB.
The most obvious answer is just "split the data". Put the data into different files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-size structure? Store the numbers in binary - this will reduce the space even more (text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will handle a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed number of bytes (16-20?). Then even if I keep the data in one file, I can easily determine at which position I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, then it could be possible to do a one-time sort by the lookup key after generation - this would be slow, but as a one-time procedure it would be acceptable.
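A sketch of that random access in Python, assuming the hypothetical 20-byte record layout from the struct example above: the offset of record i is simply i times the record size.
import struct

RECORD = struct.Struct("<qhhd")  # the same hypothetical 20-byte layout as in the earlier sketch

def read_record(path, index):
    # The position of record `index` is just index * record size, so one seek fetches it.
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))

# e.g. read_record("results.bin", 0) returns the first record in the file.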

Log combing algorithm

We get these ~50 GB data files consisting of 16-byte codes, and I want to find any code that occurs 0.5% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon's as the best answer, because I think the best algorithm is a modification of the majority-element algorithm he linked to. The majority algorithm should be modifiable to use only a tiny amount of memory - 201 counters to get 0.5%, I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two-pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least two passes (the first batch of elements you load could all be different, and one of those codes could end up being exactly 0.5%).
Thanks for the help!
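For reference, a compact Python sketch of that counter-dropping scheme (essentially the Misra-Gries frequent-elements algorithm) together with the verification pass; it assumes the codes fit in an in-memory list, which a real 50 GB run obviously would not.
from collections import Counter

def frequent_candidates(codes, k=201):
    # First pass: keep at most k-1 counters; any code occurring more than len(codes)/k
    # times is guaranteed to survive the decrements.
    counts = {}
    for code in codes:
        if code in counts:
            counts[code] += 1
        elif len(counts) < k - 1:
            counts[code] = 1
        else:
            # A k-th distinct code arrived: "drop one of each", forgetting counters that hit 0.
            for c in list(counts):
                counts[c] -= 1
                if counts[c] == 0:
                    del counts[c]
    return set(counts)

def codes_at_or_above(codes, fraction=0.005):
    candidates = frequent_candidates(codes, k=int(1 / fraction) + 1)
    # Second pass: tally the candidates exactly and apply the 0.5% cut-off.
    exact = Counter(code for code in codes if code in candidates)
    return {code for code, n in exact.items() if n >= fraction * len(codes)}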
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
This will depend on the distribution of the codes. If there is a small enough number of distinct codes, you can build a frequency distribution (http://en.wikipedia.org/wiki/Frequency_distribution) in core with a map. Otherwise you will probably have to build a histogram (http://en.wikipedia.org/wiki/Histogram) and then make multiple passes over the data, examining the frequencies of the codes in each bucket.
Sort chunks of the file in memory, as if you were performing an external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
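A simplified in-memory Python sketch of the chunk-summarize-merge idea; a real implementation would write each per-chunk summary to disk and merge those files, but the structure is the same.
from collections import Counter
from itertools import islice

def chunk_summaries(codes, chunk_size=1000000):
    # Summarize each chunk as code -> occurrences, instead of writing out the sorted codes.
    it = iter(codes)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield Counter(chunk)  # stands in for one sorted, summarized run on disk

def merge_summaries(summaries):
    # Merge the per-chunk summaries into total occurrence counts per code.
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# counts = merge_summaries(chunk_summaries(codes))
# frequent = [code for code, n in counts.items() if n >= 0.005 * sum(counts.values())]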
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code and counting the overall number of codes. Finally, filter all keys whose counters exceed 1/200 of the overall count.
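In Python this is essentially collections.Counter; a minimal sketch (the function name is mine):
from collections import Counter

def frequent_codes(codes, fraction=1 / 200):
    # Count every code, then keep the ones above the given fraction of the total.
    counts = Counter()
    total = 0
    for code in codes:
        counts[code] += 1
        total += 1
    return [code for code, n in counts.items() if n > fraction * total]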
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline them into the next process earlier.

Resources