Byte array (byte string) data structure with persistent, pure interface - data-structures

I'm looking for a data structure that stores bytes, allows constant-time pure indexing, and presents a pure, persistent interface for updating. Basically, I'm looking for Idris's equivalent of Haskell (strict) bytestrings.
Data.ByteArray provides a mutable byte array that is accessible from IO only; and it builds Data.Bytes on top of that which at first glance seems promising; however, it doesn't seem to have an indexing operation, and updating seems to only be possible by splitting and re-concatenating.

Related

Use big.Rat with Go to get Abs() value

I am a beginner with Go and a java developer.
I am currently working with big.Rat.
I need to get the Abs of a Rat n for which I have to write something like
n.Abs(n) or something like big.Rat{}.Abs(n)
Why didn't go provide something like just n.Abs()?
Or am I going wrong somewhere?
Go's big package is concerned with memory allocation when it comes to its function signatures. A big.Rat consists of two big.Ints which each contain an array of uints. Unlike an int (native 32 or 64 bit integer), a big.Int must thus be allocated dynamically, depending on its value. For large values this means more elements in the array.
Your proposed function signature n.Abs() would mean that a new array of the same size as n's would have to be allocated for this operation. In reality we often have the case that the original n is no longer needed, thus we can reuse its existing memory. To allow this, the Abs function takes a pointer to an existing big.Rat which might be n itself. The implementation can now reuse the memory. The caller is now in full control of what memory to use for these operations.
This might not make the nicest API for all use cases, in fact if you just want to do a quick calculation for a few large numbers, on a computer with Gigabytes of RAM, you might have preferred the n.Abs() version, but if you do numerically expensive computations with a lot of large numbers, you must be able to control your memory. Imagine doing some image manipulation on a Raspberry for example, where you are more constraint by the available memory. In this case the existing API allows you to be more efficient.

How does gc handle slice memory reclaim

var a = [...]int{1,2,3,4,5,6}
s1 := a[2:4:5]
Suppose s1 goes out of scope later than a. How does gc know to reclaim the memory of s1's underlying array a?
Consider the runtime representation of s1, spec
type SliceHeader struct {
Data uintptr
Len int
Cap int
}
The GC doesn't even know about the beginning of a.
Go uses mark-and-sweep collector as it's present implementation.
As per the algorithm, there will be one root object, and the rest is tree like structure, in case of multi-core machines gc runs along with the program on one core.
gc will traverse the tree and when something is not reachable it, considers it as free.
Go objects also have metadata for objects as stated in this post.
An excerpt:
We needed to have some information about the objects since we didn't have headers. Mark bits are kept on the side and used for marking as well as allocation. Each word has 2 bits associated with it to tell you if it was a scalar or a pointer inside that word. It also encoded whether there were more pointers in the object so we could stop scanning objects sooner than later.
The reason go's slices (slice header) were structures instead of pointer to structures is documented by russ cox in this page under slice section.
This is an excerpt:
Go originally represented a slice as a pointer to the structure(slice header) , but doing so meant that every slice operation allocated a new memory object. Even with a fast allocator, that creates a lot of unnecessary work for the garbage collector, and we found that, as was the case with strings, programs avoided slicing operations in favor of passing explicit indices. Removing the indirection and the allocation made slices cheap enough to avoid passing explicit indices in most cases.
The size(length) of an array is part of its type. The types [1]int and [2]int are distinct.
One thing to remember is go is value oriented language, instead of storing pointers, they store direct values.
[3]int, arrays are values in go, so if you pass an array, it copies the whole array.
[3]int this is a value (one as a whole).
When one does a[1] you are accessing part of the value.
SliceHeader Data field says consider this as base point of array, instead of a[0]
As far as my knowledge is considered:
When one requests for a[4],
a[0]+(sizeof(type)*4)
is calculated.
Now if you are accessing something in through slice s = a[2:4],
and if one requests for s[1], what one was requesting is,
a[2]+sizeof(type)*1

What abstract data type is this?

Is the following a common data type (i.e. does it have a name)?
Its unique characteristic is, unlike a regular Set, that it contains the "universe" on initialisation with O(C) memory overhead, and a max memory overhead of O(N/2) (which only occurs when you remove every-other element):
> s = new Structure(701)
s = Structure(0-700)
> s.remove(100)
s = Structure(0-99, 101-700)
> s.add(100)
s = Structure(0-700)
> s.remove(200)
s = Structure(0-199, 201-700)
> s.remove(202)
s = Structure(0-199, 201, 203-700)
> s.removeAll()
s = Structure()
Does something like this have a standard name?
I've used this many times in the past and seen it used in things like plane-sweep algorithms for polygon clipping.
Sometimes the abstract data type it represents is just a set, and the data structure is an optimization. I use this for representing the set of matching characters given by a regex expression like [^a-zA-z0-9.-], for example, and to perform intersection, union, and other operations on those sets.
This sort of data structure is implemented on top of some other ordered set or map structure, by simply storing the keys where membership in the set changes instead of the keys in the set itself. In all the other cases where I've seen this sort of thing done, the authors refer to that underlying structure instead of giving a name to the concept itself.
I like the idea of having a name for it, though, since as I said I've used it myself many times. Maybe I would call it an "in & out set" in honor of the hamburger chain I liked the best back when I ate hamburgers.
It's a Compressed Bit Set or Compressed Bitmap.
A Bit Set or Bitmap is a set specifically designed for storing Integers. Most languages offer standard implementations of these. They typically work by assigning a 1 to the Nth bit in an internal array of Integers where N is the number you're adding to the set. 0 indicates the value is not present. The memory usage for these types of Bit Sets is dictated by the largest number you store.
A Compressed Bit Set is one that compacts ranges of 0s and 1s.
In this case, the question demonstrates a type of compression called "run-length-encoding" (thank you #Ralf Kleberhoff), so it is specifically a Run-length Encoded Bitmap.
Common implementations of Compressed Bitmaps (from newest-to-oldest) are:
Roaring Bitmaps (only one to provide "good random access")
EWAH
WAH
Oracle BBC

How does go calculate a hash value for keys in a map?

How does Go calculate a hash for keys in a map? Is it truly unique and is it available for use in other structures?
I imagine it's easy for primitive keys like int or immutable string but it seems nontrivial for composite structures.
The language spec doesn't say, which means that it's free to change at any time, or differ between implementations.
The hash algorithm varies somewhat between types and platforms. As of now: On x86 (32 or 64 bit) if the CPU supports AES instructions, the runtime uses aeshash, a hash built on AES primitives, otherwise it uses a function "inspired by" xxHash and cityhash, but different from either. There are different variants for 32-bit and 64-bit systems. Most types use a simple hash of their memory contents, but floating-point types have code to ensure that 0 and -0 hash equally (since they compare equally) and NaNs hash randomly (since two NaNs are never equal). Since complex types are built from floats, their hashes are composed from the hashes of their two floating-point parts. And an interface's hash is the hash of the value stored in the interface, and not the interface header itself.
All of this stuff is in private functions, so no, you can't access Go's internal hash for a value in your own code.
The Go map implementation uses a hash called aeshash. It's not AES, but it uses the aesenc assembly instruction to compute hashes. This hash isn't exported for use in the standard library.
The hash itself is written in assembly, and can be found in the runtime package source.
Since Go 1.14, the go standard library provides the hash/maphash package. The hash functions in this package aren't guaranteed to be the same ones used by Go maps (but it appears that they are, which makes sense); they are guaranteed to be good functions for implementing hashmaps and the like.
hash/maphash only operates on strings or byte slices, so it's still up to you to figure out how to serialize a composite data structure into bytes for hashing purposes.

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistenVector must create a new path, which consists of 4 arrays of size 32. Which means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), which has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.vector is basically the same thing, right?
Clojure's PersistentVector's have special tail buffer to enable efficient operation at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding, "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?", yes! These are known as transients in Clojure, and are used for efficient batch changes.
Cannot tell about Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vectors) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But, Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to mutable vectors does not preserve the previous version of the vector, but mutates it in place by keeping the pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally, and than returning an immutable reference is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing, they work with reference objects to avoid any bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Resources