Represent of a collection of binary data - performance

I'm working with signals in which the samples consist of Floats. Some of the algorithms I've written only require to know when the signal crosses the x-axis (i.e. positive value to a negative value and vice versa). When I'm doing these kinds of operations, I realized that I don't need to know the actual Float value of each sample. I just need to know whether the sample's value is positive or not.
I originally represented the signal as a Vector of Floats. After my discovery, I started representing it as a Vector of Boolean values (i.e. False for a negative value and True for a positive value). This turned out to be a lot more efficient and I improved the program's performance both in terms of run-time and memory consumption.
I'm still wondering if there isn't a more efficient way of representing this "collection of binary data". Something like a Bit Vector or Bit Array. I've found a BitArray on Hackage but it doesn't seem to support the same functionality that a Vector does.
Is there a more efficient way of representing the data of my use case or should I stick to a Vector of Boolean values?

Both one-bool-per-byte and one-bool-per-bit options are available from the vector and array packages respectively.
First, a Vector Bool from Data.Vector.Unboxed uses a byte array with one byte per Bool. This can be verified from the source in module Data.Vector.Unboxed.Base where Vector Bool is defined as:
newtype instance Vector Bool = V_Bool (P.Vector Word8)
and getting and setting are mediated through the functions:
fromBool :: Bool -> Word8
toBool :: Word8 -> Bool
Alternatively, it can be verified directly by profiling the program:
import Data.Vector.Unboxed as V
main = let v = V.replicate 1000000000 True
in print (v ! 5)
and observing that it allocates just over 1,000,000,000 bytes.
Second, a UArray Int Bool from Data.Array.Unboxed is implemented as a bit vector, with one Bool per bit. The relevant source is in Data.Array.Base, where you can see the bit manipulation used in the instance:
instance IArray UArray Bool where
...
unsafeAt (UArray _ _ _ arr#) (I# i#) = isTrue#
((indexWordArray# arr# (bOOL_INDEX i#) `and#` bOOL_BIT i#)
`neWord#` int2Word# 0#)
Again, this can be verified directly by profiling:
import Data.Array.Unboxed as A
main = let v = A.listArray (1,1000000000) (repeat True) :: UArray Int Bool
in print (v ! 5)
and verifying that it allocates approximately 125,000,000 bytes.

Related

How go compare overflow int? [duplicate]

I've been reading this post on constants in Go, and I'm trying to understand how they are stored and used in memory. You can perform operations on very large constants in Go, and as long as the result fits in memory, you can coerce that result to a type. For example, this code prints 10, as you would expect:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
How does this work under the hood? At some point, Go has to store 1e1000 and 1e999 in memory, in order to perform operations on them. So how are constants stored, and how does Go perform arithmetic on them?
Short summary (TL;DR) is at the end of the answer.
Untyped arbitrary-precision constants don't live at runtime, constants live only at compile time (during the compilation). That being said, Go does not have to represent constants with arbitrary precision at runtime, only when compiling your application.
Why? Because constants do not get compiled into the executable binaries. They don't have to be. Let's take your example:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
There is a constant Huge in the source code (and will be in the package object), but it won't appear in your executable. Instead a function call to fmt.Println() will be recorded with a value passed to it, whose type will be float64. So in the executable only a float64 value being 10.0 will be recorded. There is no sign of any number being 1e1000 in the executable.
This float64 type is derived from the default type of the untyped constant Huge. 1e1000 is a floating-point literal. To verify it:
const Huge = 1e1000
x := Huge / 1e999
fmt.Printf("%T", x) // Prints float64
Back to the arbitrary precision:
Spec: Constants:
Numeric constants represent exact values of arbitrary precision and do not overflow.
So constants represent exact values of arbitrary precision. As we saw, there is no need to represent constants with arbitrary precision at runtime, but the compiler still has to do something at compile time. And it does!
Obviously "infinite" precision cannot be dealt with. But there is no need, as the source code itself is not "infinite" (size of the source is finite). Still, it's not practical to allow truly arbitrary precision. So the spec gives some freedom to compilers regarding to this:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed exponent of at least 32 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
However, also note that when all the above said, the standard package provides you the means to still represent and work with values (constants) with "arbitrary" precision, see package go/constant. You may look into its source to get an idea how it's implemented.
Implementation is in go/constant/value.go. Types representing such values:
// A Value represents the value of a Go constant.
type Value interface {
// Kind returns the value kind.
Kind() Kind
// String returns a short, human-readable form of the value.
// For numeric values, the result may be an approximation;
// for String values the result may be a shortened string.
// Use ExactString for a string representing a value exactly.
String() string
// ExactString returns an exact, printable form of the value.
ExactString() string
// Prevent external implementations.
implementsValue()
}
type (
unknownVal struct{}
boolVal bool
stringVal string
int64Val int64 // Int values representable as an int64
intVal struct{ val *big.Int } // Int values not representable as an int64
ratVal struct{ val *big.Rat } // Float values representable as a fraction
floatVal struct{ val *big.Float } // Float values not representable as a fraction
complexVal struct{ re, im Value }
)
As you can see, the math/big package is used to represent untyped arbitrary precision values. big.Int is for example (from math/big/int.go):
// An Int represents a signed multi-precision integer.
// The zero value for an Int represents the value 0.
type Int struct {
neg bool // sign
abs nat // absolute value of the integer
}
Where nat is (from math/big/nat.go):
// An unsigned integer x of the form
//
// x = x[n-1]*_B^(n-1) + x[n-2]*_B^(n-2) + ... + x[1]*_B + x[0]
//
// with 0 <= x[i] < _B and 0 <= i < n is stored in a slice of length n,
// with the digits x[i] as the slice elements.
//
// A number is normalized if the slice contains no leading 0 digits.
// During arithmetic operations, denormalized values may occur but are
// always normalized before returning the final result. The normalized
// representation of 0 is the empty or nil slice (length = 0).
//
type nat []Word
And finally Word is (from math/big/arith.go)
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uintptr
Summary
At runtime: predefined types provide limited precision, but you can "mimic" arbitrary precision with certain packages, such as math/big and go/constant. At compile time: constants seemingly provide arbitrary precision, but in reality a compiler may not live up to this (doesn't have to); but still the spec provides minimal precision for constants that all compiler must support, e.g. integer constants must be represented with at least 256 bits which is 32 bytes (compared to int64 which is "only" 8 bytes).
When an executable binary is created, results of constant expressions (with arbitrary precision) have to be converted and represented with values of finite precision types – which may not be possible and thus may result in compile-time errors. Note that only results –not intermediate operands– have to be converted to finite precision, constant operations are carried out with arbitrary precision.
How this arbitrary or enhanced precision is implemented is not defined by the spec, math/big for example stores "digits" of the number in a slice (where digits is not a digit of the base 10 representation, but "digit" is an uintptr which is like base 4294967295 representation on 32-bit architectures, and even bigger on 64-bit architectures).
Go constants are not allocated to memory. They are used in context by the compiler. The blog post you refer to gives the example of Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
If you assign Pi to a float32 it will lose precision to fit, but if you assign it to a float64, it will lose less precision, but the compiler will determine what type to use.

Max value on 64 bit

Below code compiles:
package main
import "fmt"
const (
// Max integer value on 64 bit architecture.
maxInt = 9223372036854775807
// Much larger value than int64.
bigger = 9223372036854775808543522345
// Will NOT compile
// biggerInt int64 = 9223372036854775808543522345
)
func main() {
fmt.Println("Will Compile")
//fmt.Println(bigger) // error
}
Type is size in memory + representation of bits in that memory
What is the implicit type assigned to bigger at compile time? Because error constant 9223372036854775808543522345 overflows int for line fmt.Println(bigger)
Those are untyped constants. They have larger limits than typed constants:
https://golang.org/ref/spec#Constants
In particular:
Represent integer constants with at least 256 bits.
None, it's an untyped constant. Because you haven't assigned it to any variable or used it in any expression, it doesn't "need" to be given a representation as any concrete type yet. Numeric constants in Go have effectively unlimited precision (required by the language spec to be at least 256 bits for integers, and at least 256 mantissa bits for floating-point numbers, but I believe that the golang/go compiler uses the Go arbitrary-precision types internally which are only limited by memory). See the section about Constants in the language spec.
What is the use of a constant if you can't assign it to a variable of any type? Well, it can be part of a constant expression. Constant expressions are evaluated at arbitrary precision, and their results may be able to be assigned to a variable. In other words, it's allowed to use values that are too big to represent to reach an answer that is representable, as long as all of that computation happens at compile time.
From this comment:
my goal is to convertf bigger = 9223372036854775808543522345 to binary form
we find that your question is an XY problem.
Since we do know that the constant exceeds 64 bits, we'll need to take it apart into multiple 64-bit words, or store it in some sort of bigger-integer storage.
Go provides math/big for general purpose large-number operations, or in this case we can take advantage of the fact that it's easy to store up to 127-bit signed values (or 128-bit unsigned values) in a struct holding two 64-bit integers (at least one of which is unsigned).
This rather trivial program prints the result of converting to binary:
500000000 x 2-sup-64 + 543522345 as binary:
111011100110101100101000000000000000000000000000000000000000000100000011001010111111000101001
package main
import "fmt"
const (
// Much larger value than int64.
bigger = 9223372036854775808543522345
d64 = 1 << 64
)
type i128 struct {
Upper int64
Lower uint64
}
func main() {
x := i128{Upper: bigger / d64, Lower: bigger % d64}
fmt.Printf("%d x 2-sup-64 + %d as binary:\n%b%.64b\n", x.Upper, x.Lower, x.Upper, x.Lower)
}

Why does len() returned a signed value?

Go's builtin len() function returns a signed int. Why wasn't a uint used instead?
Is it ever possible for len() to return something negative?
As far as I can tell, the answer is no:
Arrays: "The number of elements is called the length and is never negative."
Slices: "At any time the following relationship holds: 0 <= len(s) <= cap(s)"
Maps "The number of map elements is called its length". (I couldn't find anything in the spec that explicitly restricts this to a nonnegative value, but it's difficult for me to understand how there could be fewer than 0 elements in a map)
Strings "A string value is a (possibly empty) sequence of bytes.... The length of a string s (its size in bytes) can be discovered using the built-in function len()" (Again, hard to see how a sequence could have a negative number of bytes)
Channels "number of elements queued in channel buffer (ditto)
len() (and cap()) return int because that is what is used to index slices and arrays (not uint). So the question is more "Why does Go use signed integers to index slices/arrays when there are no negative indices?".
The answer is simple: It is common to compute an index and such computations tend to underflow much too easy if done in unsigned integers. Some innocent code like i := a-b+7 might yield i == 4294967291 for innocent values for aand b of 6 and 10. Such an index will probably overflow your slice. Lots of index calculations happen around 0 and are tricky to get right using unsigned integers and these bugs hide behind mathematically totally sensible and sound formulas. This is neither safe nor convenient.
This is a tradeoff based on experience: Underflow tends to happen often for index calculations done with unsigned ints while overflow is much less common if signed integers are used for index calculations.
Additionally: There is basically zero benefit from using unsigned integers in these cases.
There is a proposal in progress "issue 31795 Go 2: change len, cap to
return untyped int if result is constant"
It might be included for Go 1.14 (Q1 2010)
we should be able to do it for len and cap without problems - and indeed
there aren't any in the stdlib as a type-checking it via a modified type
checker shows
See CL 179184 as a PoC: this is still experimental.
As noted below by peterSO, this has been closed.
Robert Griesemer explains:
As you noted, the problem with making len always untyped is the size of the
result. For booleans (and also strings) the size is known, no matter what
kind of boolean (or string).
Russ Cox added:
I am not sure the costs here are worth the benefit. Today there is a simple
rule: len(x) has type int. Changing the type to depend on what x is
will interact in non-orthogonal ways with various code changes. For example,
under the proposed semantics, this code compiles:
const x string = "hello"
func f(uintptr)
...
f(len(x))
but suppose then someone comes along and wants to be able to modify x for
testing or something like that, so they s/const/var/. That's usually fairly
safe, but now the f(len(x)) call fails to type-check, and it will be
mysterious why it ever worked.
This change seems like it might add more rough edges than it removes.
Length and capacity
The built-in functions len and cap take arguments of various types and
return a result of type int. The implementation guarantees that the
result always fits into an int.
Golang is strongly typed language, so if len() was uint then instead of:
i := 0 // int
if len(a) == i {
}
you should write:
if len(a) == uint(i) {
}
or:
if int(len(a)) == i {
}
Also See:
uint either 32 or 64 bits
int same size as uint
uintptr an unsigned integer large enough to store the uninterpreted
bits of a pointer value
Also for compatibility with C: CGo the C.size_t and size of array in C is of type int.
From the spec:
The length is part of the array's type; it must evaluate to a non-negative constant representable by a value of type int. The length of array a can be discovered using the built-in function len. The elements can be addressed by integer indices 0 through len(a)-1. Array types are always one-dimensional but may be composed to form multi-dimensional types.
I realize it's maybe a little circular to say the spec dictates X because the spec dictates Y, but since the length can't exceed the maximum value of an int, it's equally as impossible for len to return a uint-exclusive value as for it to return a negative value.

How does Go perform arithmetic on constants?

I've been reading this post on constants in Go, and I'm trying to understand how they are stored and used in memory. You can perform operations on very large constants in Go, and as long as the result fits in memory, you can coerce that result to a type. For example, this code prints 10, as you would expect:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
How does this work under the hood? At some point, Go has to store 1e1000 and 1e999 in memory, in order to perform operations on them. So how are constants stored, and how does Go perform arithmetic on them?
Short summary (TL;DR) is at the end of the answer.
Untyped arbitrary-precision constants don't live at runtime, constants live only at compile time (during the compilation). That being said, Go does not have to represent constants with arbitrary precision at runtime, only when compiling your application.
Why? Because constants do not get compiled into the executable binaries. They don't have to be. Let's take your example:
const Huge = 1e1000
fmt.Println(Huge / 1e999)
There is a constant Huge in the source code (and will be in the package object), but it won't appear in your executable. Instead a function call to fmt.Println() will be recorded with a value passed to it, whose type will be float64. So in the executable only a float64 value being 10.0 will be recorded. There is no sign of any number being 1e1000 in the executable.
This float64 type is derived from the default type of the untyped constant Huge. 1e1000 is a floating-point literal. To verify it:
const Huge = 1e1000
x := Huge / 1e999
fmt.Printf("%T", x) // Prints float64
Back to the arbitrary precision:
Spec: Constants:
Numeric constants represent exact values of arbitrary precision and do not overflow.
So constants represent exact values of arbitrary precision. As we saw, there is no need to represent constants with arbitrary precision at runtime, but the compiler still has to do something at compile time. And it does!
Obviously "infinite" precision cannot be dealt with. But there is no need, as the source code itself is not "infinite" (size of the source is finite). Still, it's not practical to allow truly arbitrary precision. So the spec gives some freedom to compilers regarding to this:
Implementation restriction: Although numeric constants have arbitrary precision in the language, a compiler may implement them using an internal representation with limited precision. That said, every implementation must:
Represent integer constants with at least 256 bits.
Represent floating-point constants, including the parts of a complex constant, with a mantissa of at least 256 bits and a signed exponent of at least 32 bits.
Give an error if unable to represent an integer constant precisely.
Give an error if unable to represent a floating-point or complex constant due to overflow.
Round to the nearest representable constant if unable to represent a floating-point or complex constant due to limits on precision.
These requirements apply both to literal constants and to the result of evaluating constant expressions.
However, also note that when all the above said, the standard package provides you the means to still represent and work with values (constants) with "arbitrary" precision, see package go/constant. You may look into its source to get an idea how it's implemented.
Implementation is in go/constant/value.go. Types representing such values:
// A Value represents the value of a Go constant.
type Value interface {
// Kind returns the value kind.
Kind() Kind
// String returns a short, human-readable form of the value.
// For numeric values, the result may be an approximation;
// for String values the result may be a shortened string.
// Use ExactString for a string representing a value exactly.
String() string
// ExactString returns an exact, printable form of the value.
ExactString() string
// Prevent external implementations.
implementsValue()
}
type (
unknownVal struct{}
boolVal bool
stringVal string
int64Val int64 // Int values representable as an int64
intVal struct{ val *big.Int } // Int values not representable as an int64
ratVal struct{ val *big.Rat } // Float values representable as a fraction
floatVal struct{ val *big.Float } // Float values not representable as a fraction
complexVal struct{ re, im Value }
)
As you can see, the math/big package is used to represent untyped arbitrary precision values. big.Int is for example (from math/big/int.go):
// An Int represents a signed multi-precision integer.
// The zero value for an Int represents the value 0.
type Int struct {
neg bool // sign
abs nat // absolute value of the integer
}
Where nat is (from math/big/nat.go):
// An unsigned integer x of the form
//
// x = x[n-1]*_B^(n-1) + x[n-2]*_B^(n-2) + ... + x[1]*_B + x[0]
//
// with 0 <= x[i] < _B and 0 <= i < n is stored in a slice of length n,
// with the digits x[i] as the slice elements.
//
// A number is normalized if the slice contains no leading 0 digits.
// During arithmetic operations, denormalized values may occur but are
// always normalized before returning the final result. The normalized
// representation of 0 is the empty or nil slice (length = 0).
//
type nat []Word
And finally Word is (from math/big/arith.go)
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uintptr
Summary
At runtime: predefined types provide limited precision, but you can "mimic" arbitrary precision with certain packages, such as math/big and go/constant. At compile time: constants seemingly provide arbitrary precision, but in reality a compiler may not live up to this (doesn't have to); but still the spec provides minimal precision for constants that all compiler must support, e.g. integer constants must be represented with at least 256 bits which is 32 bytes (compared to int64 which is "only" 8 bytes).
When an executable binary is created, results of constant expressions (with arbitrary precision) have to be converted and represented with values of finite precision types – which may not be possible and thus may result in compile-time errors. Note that only results –not intermediate operands– have to be converted to finite precision, constant operations are carried out with arbitrary precision.
How this arbitrary or enhanced precision is implemented is not defined by the spec, math/big for example stores "digits" of the number in a slice (where digits is not a digit of the base 10 representation, but "digit" is an uintptr which is like base 4294967295 representation on 32-bit architectures, and even bigger on 64-bit architectures).
Go constants are not allocated to memory. They are used in context by the compiler. The blog post you refer to gives the example of Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
If you assign Pi to a float32 it will lose precision to fit, but if you assign it to a float64, it will lose less precision, but the compiler will determine what type to use.

Type signatures for a mutable Haskell Heap

I want to make in Haskell a mutable array based heap (the basic kind found everywhere else). There are some things I don't like about my initial approach, however:
I use tuples to represent my heaps instead of a proper data type
I don't know how to declare the types I am using (too many type variables around), relying on type inference (and copying examples from the Haskell Wiki)
My heap isn't polymorphic
When using my heap in the f example function, I have to thread the ns around.
What would be the best way to abstract my heap into a nice data type? I feel this would solve most of my problems.
buildHeap max_n =
do
arr <- newArray (1, max_n) 0 :: ST s (STArray s Int Int)
return (0, arr)
-- My heap needs to know the array where the elements are stored
-- and how many elements it has
f n = --example use
do
(n1, arr) <- buildHeap n
(n2, _) <- addHeap (n1, arr) 1
(n3, _) <- addHeap (n2, arr) 2
getElems arr
main = print $ runST (f 3)
You should definitely wrap the underlying representation in an abstract data type, and provide only smart constructors for building and taking apart your heap data structure.
Mutable structures tend to have simpler APIs than immutable ones (as they support fewer nice behaviors). To get a sense for what a reasonable API for a mutable container type looks like, including how to abstract the representation, perhaps look at the Judy package.
In particular,
The abstract data type, which is polymorphic in the elements
And the API:
new :: JE a => IO (JudyL a)
-- Allocate a new empty JudyL array.
null :: JudyL a -> IO Bool
-- O(1), null. Is the map empty?
size :: JudyL a -> IO Int
-- O(1), size. The number of elements in the map.
lookup :: JE a => Key -> JudyL a -> IO (Maybe a)
-- Lookup a value associated with a key in the JudyL array.
insert :: JE a => Key -> a -> JudyL a -> IO ()
-- Insert a key and value pair into the JudyL array. Any existing key will be overwritten.
delete :: Key -> JudyL a -> IO ()
-- Delete the Index/Value pair from the JudyL array.
You'll need to support many of the same operations, with similar type signatures.
The actual, underlying representation of a JudyL is given by:
newtype JudyL a =
JudyL { unJudyL :: MVar (ForeignPtr JudyL_) }
type JudyL_ = Ptr JudyLArray
data JudyLArray
Notice how we provide thread safety by locking the underlying resource (in this case, a C pointer to a data structure). Separately, the representation is abstract, and not visible to the user, keeping the API simple.
A general piece of advice for a beginner -- start with IOArrays rather than STArrays so you don't have to worry about higher-kinded types and the other tricks that come with ST. Then, after you've gotten something you're happy with, you can think about how to move it, if you desire, to ST.
Along with that, you should make your own ADT instead of tuples.
data MutHeap a = MutHeap Int (IOArray Int a)
buildHeap max_n = do
arr <- newArray_ (1, max_n)
return (MutHeap 0 arr)
etc.

Resources