Reading files into Ruby Numo::NArray

I am given a number of files, which all have the same size. What I'm trying to do is load those files into a Numo::NArray in such a way that every file ends up in a different row of the array. The number of files and their size are known before the NArray is created. What I'm using now is 8-bit unsigned int.
Example: For 5 files of size 512 I would need a multidimensional array of shape [5, 512]. Data should be stored in Galois fields. That's crucial, as this matrix is going to be used in mathematical operations. What I'm using now to store the data is binary data converted to an array of 8-bit unsigned ints. Sadly, the performance of Ruby's read and unpack('C*') methods is not high enough.
I have done this with an old version of NArray, but the performance was not good enough, since I had to first create an NMatrix of fixed size filled with zeros, load the data into a normal Ruby array, and replace the given row of the NMatrix. This new library is quite large and I can't find methods that would, e.g., insert a row or dynamically append data to a row. Do I have to declare a fixed-size NArray, or is there a way to do it dynamically by loading data directly from File.read into the NArray, so I don't have to create a helper Ruby array?
I would appreciate an optimal solution, as I'm interested in high performance.

Suppose each of the files data0 ~ data4 contains 512 integers stored as 8-bit unsigned binary data. Then you can store the 8-bit unsigned integer values from each file in a row of a Numo::UInt8 array as follows:
require "numo/narray"
n, m = 5, 512
string = ""
n.times do |k|
  string.concat File.binread("data#{k}", m)
end
na = Numo::UInt8.from_binary(string, [n, m])
This code combines the string data that should be converted to an NArray into a single string and reads it with Numo::UInt8.from_binary. However, when I tried it in my environment, the file-loading part was the bottleneck; I don't think the from_binary method itself is slow.
If you are currently using an HDD as storage, you may want to consider using an SSD or another faster storage device.

Related

Why is hash output fixed in length?

Hash functions always produce a fixed-length output regardless of the input (e.g. MD5 → 128 bits, SHA-256 → 256 bits), but why?
I know that this is how the designers made them, but why did they design the output to have the same length?
So that it can be stored in a consistent fashion? To be easier to compare? To be less complicated?
Because that is what the definition of a hash is. Refer to Wikipedia:
A hash function is any function that can be used to map digital data
of arbitrary size to digital data of fixed size.
If your question relates to why it is useful for a hash to be a fixed size there are multiple reasons (non-exhaustive list):
Hashes typically encode a larger (often arbitrary size) input into a smaller size, generally in a lossy way, i.e. unlike compression functions, you cannot reconstruct the input from the hash value by "reversing" the process.
Having a fixed size output is convenient, especially for hashes designed to be used as a lookup key.
You can predictably (pre)allocate storage for hash values and index them in a contiguous memory segment such as an array.
For hashes of "native word sizes", e.g. 16, 32 and 64 bit integer values, you can do very fast equality and ordering comparisons.
Any algorithm working with hash values can use a single set of fixed size operations for generating and handling them.
You can predictably combine hashes produced with different hash functions in e.g. a bloom filter.
You don't need to waste any space to encode how big the hash value is.
There do exist special hash functions that are capable of producing an output hash of whatever length you specify, such as so-called sponge functions.
As you can see, this is the standard. Also, what you want is specified in the standard:
Some application may require a hash function with a message digest
length different than those provided by the hash functions in this
Standard. In such cases, a truncated message digest may be used,
whereby a hash function with a larger message digest length is applied
to the data to be hashed, and the resulting message digest is
truncated by selecting an appropriate number of the leftmost bits.
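For illustration, a short Python sketch of both options: truncating a larger digest to its leftmost bytes as the quoted standard describes, and using a sponge-based function (SHAKE-128) whose output length you pick yourself.
import hashlib

data = b"arbitrary-size input"

# Truncate a larger digest by keeping the leftmost bytes
# (here: SHA-512 cut down to 160 bits).
truncated = hashlib.sha512(data).digest()[:20]

# A sponge-based extendable-output function lets you request the length directly.
xof = hashlib.shake_128(data).digest(20)

print(len(truncated), len(xof))  # both 20 bytes, regardless of the input size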
Often it's because you want to use the hash value, or some part of it, to quickly store and look up values in a fixed-size array. (This is how a non-resizable hashtable works, for example.)
And why use a fixed-size array instead of some other, growable data structure (like a linked list or binary tree)? Because accessing them tends to be both theoretically and practically fast: provided that the hash function is good and the fraction of occupied table entries isn't too high, you get O(1) lookups (vs. O(log n) lookups for tree-based data structures or O(n) for lists) on average. And these accesses are fast in practice: after calculating the hash, which usually takes linear time in the size of the key with a low hidden constant, there's often just a bit shift, a bit mask and one or two indirect memory accesses into a contiguous block of memory that (a) makes good use of cache and (b) pipelines well on modern CPUs because few pointer indirections are needed.
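As a tiny sketch of that last point (a hypothetical 16-slot table, no collision handling, so purely illustrative):
TABLE_SIZE = 16                      # power of two, so a bit mask replaces modulo
table = [None] * TABLE_SIZE

def slot(key: bytes) -> int:
    # Any fixed-size hash works; here Python's built-in hash() is folded into the table.
    return hash(key) & (TABLE_SIZE - 1)

table[slot(b"alice")] = 42
print(table[slot(b"alice")])         # lookup is: hash, mask, one array access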

J Language: Reading large 2D matrix

I'm a J newbie, and am trying to import one of my large datasets for further experimentation. It is a 2D matrix of doubles, approximately 80000x50000. So far, I have found two different methods to load data into J.
The first is to convert the data into J format (replacing negatives with underscores, putting exponential notation numbers into J format, etc) and then load with (adapted from J: Handy method to enter a matrix?):
(".;._2) fread 'path/to/file'
The second method is to use tables/dsv.
I am experiencing the same problem with both methods: namely, that these methods work with small matrices, but fail at approximately 10M values. It seems the input just gets truncated to some arbitrary limit. How can I load matrices of arbitrary size? If I have to convert to some binary format, that's OK, as long as there is a description of the format somewhere.
I should add that this is a 64-bit system and build of J, and I can successfully create a matrix of random numbers of the appropriate size, so it doesn't seem to be a limitation on matrix size per se but only during I/O.
Thanks!
EDIT: I did not find what exactly was causing this, but thanks to Dane, I did find a workaround by using JMF (the 'data/jmf' package). It turns out that JMF is just straight binary data with no header, and native (?) or little-endian data can be mapped directly with JFL map_jmf_ 'x';'whatever.bin'
You're running out of memory. A quick test to see how much space integers take up yields the following:
7!:2 'i. 80000 5000'
8589936256
That is, an 80,000 by 5,000 matrix of integers requires 8 GB of memory. Your 80,000 by 50,000 matrix, if it were of integers, would require approximately 80 GB of memory.
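As a back-of-the-envelope check of that scaling (assuming space grows linearly with the element count), in Python:
measured_bytes = 8589936256              # 80,000 x 5,000 integer matrix, per 7!:2 above
scale = 50000 / 5000                     # ten times as many columns
print(measured_bytes * scale / 2**30)    # roughly 80 GiB for the 80,000 x 50,000 case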
Your next question should be about performing array or matrix operations on a matrix too big to load into memory.

How to efficiently store and manipulate sparse binary matrices in Octave?

I'm trying to manipulate sparse binary matrices in GNU Octave, and it's using way more memory than I expect, and relevant sparse-matrix functions don't behave the way I want them to. I see this question about higher-than-expected sparse-matrix storage in MATLAB, which suggests that this matrix should consume even more memory, but helped explain (only) part of this situation.
For a sparse, binary matrix, I can't figure out any way to get Octave to NOT STORE the array of values (they're always implicitly 1, so need not be stored). Can this be done? Octave always seems to consume memory for a values array.
A trimmed-down example demonstrating the situation: create random sparse matrix, turn it into "binary":
mys=spones(sprandn(1024,1024,.03)); nnz(mys), whos mys
Shows the situation. The consumed size is consistent with the storage mechanism outlined in the aforementioned SO answer and expanded below, if spones() creates an array of storage-class double and if all indices are 32-bit (i.e., TotalStorageSize - rowIndices - columnIndices == NumNonZero*sizeof(double)). Unnecessarily storing these values (all 1s stored as doubles) accounts for over half of the total memory consumed by this 3%-sparse object.
After messing with this (for too long) while composing this question, I discovered some partial workarounds, so I'm going to "self-answer" (only) part of the question for continuity (hopefully), but I didn't figure out an adequate answer to the main question:
How do I create an efficiently-stored ("no-/implicit-values") binary matrix in Octave?
Additional background on storage format follows...
The Octave docs say the storage format for sparse matrices is Compressed Sparse Column (CSC). This seems to imply storing the following arrays (expanding on the aforementioned SO answer, with canonical Yale-format labels and tweaks for column-major order):
values (A), number-of-nonzeros (NNZ) entries of storage-class size;
row numbers (IA), NNZ entries of index size (hopefully int64 but maybe int32);
start of each column (JA), number-of-columns-plus-1 entries of index size.
In this case, for binary-only storage, I hope there's a way to completely avoid storing array (A), but I can't figure it out.
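For comparison only (SciPy rather than Octave, but the same CSC layout), the three arrays are directly visible there, which may make the format easier to picture; this is an illustration of CSC, not a fix for the Octave storage question:
import numpy as np
from scipy.sparse import csc_matrix

# A tiny 3x3 example in CSC form.
dense = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0]])
m = csc_matrix(dense)

print(m.data)     # A:  the stored values (all 1s here, but still stored)
print(m.indices)  # IA: row index of each stored value
print(m.indptr)   # JA: where each column starts within data/indices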
Full disclosure: As noted above, as I was composing this question, I discovered a workaround to reduce memory usage, so I'm "self-answering" part of this here, but it still isn't fully satisfying, so I'm still listening for a better actual answer to storage of a sparse binary matrix without a trivial, bloated, unnecessary values array...
To get a binary-like value out of a number-like value and reduce the memory usage in this case, use "logical" storage, created by logical(X). For example, building from above,
logicalmys = logical(mys);
creates a sparse bool matrix that takes up less memory (a 1-byte logical rather than an 8-byte double for each entry of the values array).
Adding more information to the whos information using whos_line_format helps illuminate the situation: The default string includes 5 of the 7 properties (see docs for more). I'm using the format string
whos_line_format(" %a:4; %ln:6; %cs:16:6:1; %rb:12; %lc:8; %e:10; %t:20;\n")
to add display of "elements", and "type" (which is distinct from "class").
With that, whos mys logicalmys shows something like
Attr  Name        Size       Bytes   Class    Elements  Type
====  ====        ====       =====   =====    ========  ====
      mys         1024x1024  391100  double   32250     sparse matrix
      logicalmys  1024x1024  165350  logical  32250     sparse bool matrix
So this shows a distinction between sparse matrix and sparse bool matrix. However, the total memory consumed by logicalmys is consistent with actually storing an array of NNZ booleans (1-byte) -- That is:
totalMemory minus rowIndices minus columnOffsets leaves NNZ bytes left;
in numbers,
165350 - 32250*4 - 1025*4 == 32250.
So we're still storing 32250 elements, all of which are 1. Further, if you set one of the 1-elements to zero, it reduces the reported storage! For a good time, try: pick a nonzero element, e.g., (42,1), then zero it: logicalmys(42,1) = 0; then whos it!
My hope is that this is correct, and that this clarifies some things for those who might be interested. Comments, corrections, or actual answers welcome!

Chicken/Egg problem: Hash of file (including hash) inside file! Possible?

The thing is, I have a file that has room for metadata. I want to store a hash in it for integrity verification. Problem is, once I store the hash, the file changes, and with it the hash.
I perfectly understand that this is by definition impossible with one-way cryptographic hash functions like MD5/SHA.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where it's possible to get the resulting hash from data in which the very result of the hash itself is included.
It doesn't need to be cryptographic or fulfill a lot of criteria. It can also be based on some heuristic that delivers the desired result after a realistic amount of time.
I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynomial cyclic back-reference division stuff that makes this possible?
And if not, what's the proof against it (if there is one)?
The reason I need this is that I want (ultimately) to store a hash along with MP4 files. It's complicated, but other solutions are not easy to implement, as the file walks through a badly designed production pipeline...
It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details are possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2^(-8*byte_offset) in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per #KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)
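A much simpler special case of the same idea (placeholder at the very end of the file rather than at an arbitrary offset, so no Galois-field inversion is needed): for the standard CRC-32, appending the CRC itself in little-endian order makes the CRC of the whole blob come out to a fixed constant, the CRC-32 residue. A Python sketch:
import struct
import zlib

def stamp(data: bytes) -> bytes:
    # Append the CRC-32 of the data, least-significant byte first.
    return data + struct.pack("<I", zlib.crc32(data))

# Whatever the payload, the CRC of "payload + its CRC" is the same constant
# (the CRC-32 residue, 0x2144DF1C), so a checker only needs that one value.
for payload in (b"", b"hello", b"some other file contents"):
    print(hex(zlib.crc32(stamp(payload))))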
Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X), and hash(X) can be recovered from the last 32 bits of Z (divide by 65521).
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less than 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. So we can encode the hash into a 32-bit integer that will not affect the hash. You could actually use any number less than 2^16; 65521 just happens to be the largest such prime number.
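A minimal sketch of that construction in Python (the big-endian block order and zero padding are my own choices; the example payload's length is a multiple of 4 so appending the trailer doesn't disturb the block boundaries):
import struct

def toy_hash(data: bytes) -> int:
    # Sum of all non-overlapping 32-bit blocks, modulo 65521.
    data = data + b"\x00" * (-len(data) % 4)        # pad to a multiple of 4 bytes
    blocks = struct.unpack(">%dI" % (len(data) // 4), data)
    return sum(blocks) % 65521

def embed_hash(x: bytes) -> bytes:
    # Append hash(x) * 65521 as a 32-bit integer; it is congruent to 0 mod 65521,
    # so it does not change the hash.
    return x + struct.pack(">I", toy_hash(x) * 65521)

x = b"example payload!"                             # 16 bytes, a multiple of 4
z = embed_hash(x)
assert toy_hash(z) == toy_hash(x)
assert struct.unpack(">I", z[-4:])[0] // 65521 == toy_hash(x)   # recover the hash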
I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Although in theory you could create such a file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a choice: use exactly your approach to obtain such a file, or just get rid of the check.
It means that now you have two problems instead of one, and both can be handled with the same complexity. It's up to you to decide if it's worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.
No, not possible. You either use a separate file for hashes à la md5sum, or the embedded hash only covers the "data" portion of the file.
The way the Nix package manager does this: when calculating the hash, you pretend the contents of the hash field in the file are some fixed value, like 20 x's, rather than the hash of the file. Then you write the real hash over those 20 x's. When you check the hash, you read the stored value, then ignore it and pretend the field still holds the fixed value of 20 x's while hashing.
They do this because the paths at which a package is installed depend on the hash of the whole package. Since the hash has a fixed length, they set the field to some fixed value, then replace it with the real hash; when verifying, they ignore the stored value and pretend it's that fixed value.
But if you don't use such a method, it is impossible.
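A sketch of that placeholder trick in Python (SHA-1, the 40-character 'x' filler, and the byte-8 field offset are illustrative choices, not what Nix actually uses):
import hashlib

PLACEHOLDER = b"x" * 40          # fixed filler, same length as a SHA-1 hex digest

def hash_with_placeholder(data: bytes, offset: int) -> str:
    # Hash the file as if the digest field still contained the fixed filler.
    masked = data[:offset] + PLACEHOLDER + data[offset + len(PLACEHOLDER):]
    return hashlib.sha1(masked).hexdigest()

def embed(data: bytes, offset: int) -> bytes:
    digest = hash_with_placeholder(data, offset).encode()
    return data[:offset] + digest + data[offset + len(digest):]

def verify(data: bytes, offset: int) -> bool:
    stored = data[offset:offset + len(PLACEHOLDER)]
    return stored == hash_with_placeholder(data, offset).encode()

# Usage: a file whose digest field starts at byte 8 (hypothetical layout).
original = b"HEADER: " + PLACEHOLDER + b" payload bytes..."
stamped = embed(original, 8)
assert verify(stamped, 8)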
It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

How would you sort 1 million 32-bit integers in 2MB of RAM?

Please, provide code examples in a language of your choice.
Update:
No constraints set on external storage.
Example: Integers are received/sent via network. There is a sufficient space on local disk for intermediate results.
Split the problem into pieces small enough to fit into available memory, then use merge sort to combine them.
Sorting a million 32-bit integers in 2MB of RAM using Python by Guido van Rossum
1 million 32-bit integers = 4 MB of memory.
You should sort them using some algorithm that uses external storage. Mergesort, for example.
You need to provide more information. What extra storage is available? Where are you supposed to store the result?
Otherwise, the most general answer:
1. Load the first half of the data into memory (2MB), sort it by any method, and output it to a file.
2. Load the second half of the data into memory (2MB), sort it by any method, and keep it in memory.
3. Use a merge algorithm to merge the two sorted halves and output the complete sorted data set to a file (a sketch of this recipe follows below).
This Wikipedia article on external sorting has some useful information.
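A compact Python sketch of that recipe, generalized to k chunks with heapq.merge (the raw little-endian int32 file layout and the chunk size are assumptions, and real memory use also includes interpreter overhead):
import heapq
import os
import struct
import tempfile

CHUNK = 250_000                          # ints per run; 250k * 4 bytes = 1 MB of payload

def read_run(path):
    # Stream one sorted run back from disk, 4 bytes at a time.
    with open(path, "rb") as f:
        while piece := f.read(4):
            yield struct.unpack("<i", piece)[0]

def external_sort(infile, outfile):
    runs = []
    with open(infile, "rb") as f:                    # steps 1-2: sort fixed-size pieces
        while block := f.read(CHUNK * 4):
            ints = sorted(struct.unpack("<%di" % (len(block) // 4), block))
            with tempfile.NamedTemporaryFile(delete=False) as tmp:
                tmp.write(struct.pack("<%di" % len(ints), *ints))
            runs.append(tmp.name)
    with open(outfile, "wb") as out:                 # step 3: k-way merge of the runs
        for value in heapq.merge(*(read_run(p) for p in runs)):
            out.write(struct.pack("<i", value))
    for p in runs:
        os.remove(p)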
Dual tournament sort with polyphase merge:
#!/usr/bin/env python
import random
from sort import Pickle, Polyphase

nrecords = 1000000
available_memory = 2000000  # number of bytes
# NOTE: it doesn't count memory required by the Python interpreter
record_size = 24  # (20 + 4) number of bytes per element in a Python list
heap_size = available_memory / record_size

p = Polyphase(compare=lambda x, y: cmp(y, x),  # descending order
              file_maker=Pickle,
              verbose=True,
              heap_size=heap_size,
              max_files=4 * (nrecords / heap_size + 1))

# put records
maxel = 1000000000
for _ in xrange(nrecords):
    p.put(random.randrange(maxel))

# get sorted records
last = maxel
for n, el in enumerate(p.get_all()):
    if el > last:  # elements must be in descending order
        print "not sorted %d: %d %d" % (n, el, last)
        break
    last = el

assert nrecords == (n + 1)  # check all records read
Um, store them all in a file.
Memory map the file (you said there was only 2M of RAM; let's assume the address space is large enough to memory map a file).
Sort them using the file backing store as if it were real memory now!
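A rough sketch of that idea in Python/NumPy (the file name and the raw little-endian int32 layout are assumptions; with only 2 MB of RAM the OS will page heavily, but the sort still completes):
import numpy as np

arr = np.memmap("ints.bin", dtype=np.int32, mode="r+")   # maps the whole file as one array
arr.sort()       # in-place sort; the OS pages the file in and out as needed
arr.flush()      # make sure dirty pages are written back to disk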
Here's a valid and fun solution.
Load half the numbers into memory. Heap sort them in place and write the output to a file. Repeat for the other half. Use external sort (basically a merge sort that takes file i/o into account) to merge the two files.
Aside:
Make heap sort faster in the face of slow external storage:
Start constructing the heap before all the integers are in memory.
Start putting the integers back into the output file while heap sort is still extracting elements
As people above mention, 1 million 32-bit ints take 4 MB.
To fit as many numbers as possible into as little space as possible, use the types int, short and char in C++. You could be slick (but have odd, dirty code) by doing several types of casting to stuff things everywhere.
Here it is, off the top of my head:
anything that is less than 2^8 (0 - 255) gets stored as a char (1-byte data type)
anything that is >= 2^8 and less than 2^16 (256 - 65535) gets stored as a short (2-byte data type)
The rest of the values would be put into an int (4-byte data type).
You would want to specify where the char section starts and ends, where the short section starts and ends, and where the int section starts and ends.
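A rough Python analogue of that bucketing idea, using the array module's fixed-width typecodes in place of C++ casts (illustrative only; you would also have to record the original positions to reconstruct the input order):
from array import array

# Typecodes 'B', 'H' and 'i' store 1-, 2- and 4-byte values respectively.
small, medium, large = array('B'), array('H'), array('i')

def store(value: int) -> None:
    if 0 <= value < 2**8:
        small.append(value)
    elif 0 <= value < 2**16:
        medium.append(value)
    else:
        large.append(value)           # everything else, including negatives

for v in (17, 300, 1_000_000):
    store(v)
print(len(small), len(medium), len(large))   # 1 1 1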
No example, but Bucket Sort has relatively low complexity and is easy enough to implement.
