Erlang pattern matching bitstrings - performance

I'm writing code to decode messages from a binary protocol. Each message type is assigned a 1 byte type identifier and each message carries this type id. Messages all start with a common header consisting of 5 fields. My API is simple:
decoder:decode(Bin :: binary()) -> my_message_type() | {error, binary()}
My first instinct is to lean heavily on pattern matching by writing one decode function clause for each message type and decoding that message type completely in the function head:
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_X:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeXField1:32, TypeXFld2:32, TypeXFld3:32>>) ->
    #message_x{hdr1=Hdr1, hdr3=Hdr3 ... fld4=TypeXFld3};
decode(<<Hdr1:8, ?MESSAGE_TYPE_ID_Y:8, Hdr3:8, Hdr4:8, Hdr5:32,
         TypeYField1:32, TypeYFld2:16, TypeYFld3:4, TypeYFld4:32,
         TypeYFld5:64>>) ->
    #message_y{hdr1=Hdr1, hdr3=Hdr3 ... fld5=TypeYFld5}.
Note that while the first 5 fields of the messages are structurally identical, the fields after that vary for each message type.
I have roughly 20 message types and thus 20 functions similar to the above. Am I decoding the full message multiple times with this structure? Is it idiomatic? Would I be better off just decoding the message type field in the function head and then decoding the full message in the body of the function?

Just to agree that your style is very idiomatic Erlang. Don't split the decoding into separate parts unless you feel it makes your code clearer. Sometimes it can be more logical to do that type of grouping.
The compiler is smart and compiles pattern matching in such a way that it will not decode the message more than once. It will first decode the first two fields (bytes) and then use the value of the second field, the message type, to determine how it is going to handle the rest of the message. This works irrespective of how long the common part of the binary is.
So there is no need to try to "help" the compiler by splitting the decoding into separate parts; it will not make it more efficient. Again, only do it if it makes your code clearer.
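The decode-once behaviour described above can be pictured with a conceptual sketch. It is written in Go rather than Erlang, and it is not the compiler's actual output: it only illustrates the idea of reading the common header a single time and then branching on the type byte. The type id value, the struct names and the error value are all assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical type id; the question does not give the real values.
const typeX byte = 0x01

// Header holds the five common fields (1+1+1+1+4 bytes).
type Header struct {
	Hdr1, Type, Hdr3, Hdr4 byte
	Hdr5                   uint32
}

// MessageX loosely mirrors the question's message_x:
// the common header plus three 32-bit fields.
type MessageX struct {
	Header
	Fld1, Fld2, Fld3 uint32
}

var errBadMessage = errors.New("unknown or malformed message")

// decode reads the common header once, then uses the type byte to
// decide how the rest of the message is laid out.
func decode(b []byte) (any, error) {
	if len(b) < 8 {
		return nil, errBadMessage
	}
	be := binary.BigEndian // Erlang binary integers are big-endian by default
	h := Header{b[0], b[1], b[2], b[3], be.Uint32(b[4:8])}
	switch h.Type {
	case typeX:
		if len(b) != 20 {
			return nil, errBadMessage
		}
		return MessageX{h, be.Uint32(b[8:12]), be.Uint32(b[12:16]), be.Uint32(b[16:20])}, nil
	}
	return nil, errBadMessage
}

func main() {
	msg, err := decode([]byte{1, typeX, 3, 4, 0, 0, 0, 5, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 8})
	fmt.Printf("%+v %v\n", msg, err)
}
```

In Erlang you get this branching for free from the clause heads, which is why writing it out by hand gains nothing.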

Your current approach is idiomatic Erlang, so keep going in this direction. Don't worry about performance; the Erlang compiler does good work here. If your messages really have exactly the same format, you can write a macro for it, but it should generate the same code under the hood, and using a macro usually leads to worse maintainability. Just out of curiosity, why are you generating different record types when all have exactly the same fields? An alternative approach is to translate the message type from a constant to an Erlang atom and store it in one record type.


Safety of using reflect.StringHeader in Go?

I have a small function which passes the pointer of Go string data to C (Lua library):
func (L *C.lua_State) pushLString(s string) {
gostr := (*reflect.StringHeader)(unsafe.Pointer(&s))
C.lua_pushlstring(L, (*C.char)(unsafe.Pointer(gostr.Data)), C.ulong(gostr.Len))
// lua_pushlstring copies the given string, not keeping the original pointer.
}
It works in simple tests, but from the documentation it's unclear whether this is safe at all.
According to the Go documentation, the memory of reflect.StringHeader should be pinned for gostr, but StringHeader.Data is already a uintptr, "an integer value with no pointer semantics", which is itself odd: if it has no pointer semantics, wouldn't the field be completely useless, as the memory may be moved right after the value is read? Or is the field treated specially like reflect.Value.Pointer? Or perhaps there is a different way of getting a C pointer from a string?
it's unclear whether this is safe at all.
Tapir Liu (https://twitter.com/TapirLiu/) of Go101 (https://github.com/go101/go101) gives a clue as to the "safety" of reflect.StringHeader in this tweet:
Since Go 1.20, the reflect.StringHeader and reflect.SliceHeader types will be deprecated and not recommended to be used.
Accordingly, two functions, unsafe.StringData and unsafe.SliceData, will be introduced in Go 1.20 to take over the use cases of two old reflect types.
That was initially discussed in CL 401434, then in issue 53003.
The reason for deprecation is that reflect.SliceHeader and reflect.StringHeader are commonly misused.
As well, the types have always been documented as unstable and not to be relied upon.
We can see in Github code search that usage of these types is ubiquitous.
The most common use cases I've seen are:
converting []byte to string:
equivalent to *(*string)(unsafe.Pointer(&mySlice)), which is never actually officially documented anywhere as something that can be relied upon.
Under the hood, a string header is a prefix of a slice header (Data and Len, without Cap), so this seems valid per the unsafe rules.
converting string to []byte:
commonly seen as *(*[]byte)(unsafe.Pointer(&string)), which is broken by default because the Cap field can be past the end of a page boundary (example here, in widely used code) -- this violates the unsafe rules.
grabbing the Data pointer field for ffi or some other niche use
converting a slice of one type to a slice of another type
Ian Lance Taylor adds:
One of the main use cases of unsafe.Slice is to create a slice whose backing array is a memory buffer returned from C code or from a call such as syscall.MMap.
I agree that it can be used to (unsafely) convert from a slice of one type to a slice of a different type.
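As a minimal sketch of the replacement API mentioned above, the following assumes Go 1.20 or later and uses unsafe.StringData together with unsafe.Slice to obtain a pointer to a string's bytes without going through reflect.StringHeader. The helper name stringBytes is made up for illustration. The cgo call from the question is not reproduced here; presumably the same pointer could be converted to *C.char for the duration of the call, but that part is an assumption and is not exercised by this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringBytes returns a view of s's bytes without copying.
// The returned slice must be treated as read-only: strings are
// immutable and writing through it is undefined behaviour.
// Requires Go 1.20 or later for unsafe.StringData.
func stringBytes(s string) []byte {
	if len(s) == 0 {
		return nil // StringData is unspecified for the empty string
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := stringBytes("hello")
	fmt.Println(len(b), string(b))
}
```

Unlike the uintptr in StringHeader.Data, unsafe.StringData returns a real *byte, so the garbage collector keeps tracking the memory while the pointer is live.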

Refactoring Business Rule, Function Naming, Width, Height, Position X & Y

I am refactoring some business rule functions to provide a more generic version of the function.
The functions I am refactoring are:
DetermineWindowWidth
DetermineWindowHeight
DetermineWindowPositionX
DetermineWindowPositionY
All of them do string parsing, as it is a string parsing business rules engine.
My question is what would be a good name for the newly refactored function?
Obviously I want to shy away from a function name like:
DetermineWindowWidthHeightPositionXPositionY
I mean that would work, but it seems unnecessarily long when it could be something like:
DetermineWindowMoniker or something to that effect.
Function objective: Parse an input string like 1280x1024 or 200,100 and return either the first or second number. The use case is for data-driving test automation of a web browser window, but this should be irrelevant to the answer.
Question objective: I have the code to do this, so my question is not about code, but just the function name. Any ideas?
There are too few details; you should have specified at least the parameters and return values of the functions.
Have I understood correctly that you use strings of the format NxN for sizes and N,N for positions?
And that this generic function will have to parse both (and nothing else), and will return either the first or second part depending on a parameter of the function?
And that you'll then keep the various DetermineWindow* functions but make them all call this generic function?
If so:
Without knowing what parameters the generic function has it's even harder to help, but it's most likely impossible to give it a simple name.
Not all batches of code can be described by a simple name.
You'll most likely need to use a different construction if you want to have clear names. Here's an idea, in pseudo code:
ParseSize(string, outWidth, outHeight) {
    ParsePair(string, "x", outWidth, outHeight)
}
ParsePosition(string, outX, outY) {
    ParsePair(string, ",", outX, outY)
}
ParsePair(string, separator, outFirstItem, outSecondItem) {
    ...
}
And the various DetermineWindow would call ParseSize or ParsePosition.
You could also use just ParsePair directly, but I think it's cleaner to have the two other functions in the middle.
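The pseudo code above could be fleshed out roughly as follows. This is a sketch in Go, not the asker's language or code: it returns values instead of using out parameters, and the choice of int, the whitespace trimming and the error messages are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParsePair splits s at the first occurrence of sep and parses both
// halves as integers.
func ParsePair(s, sep string) (first, second int, err error) {
	a, b, ok := strings.Cut(s, sep)
	if !ok {
		return 0, 0, fmt.Errorf("missing separator %q in %q", sep, s)
	}
	if first, err = strconv.Atoi(strings.TrimSpace(a)); err != nil {
		return 0, 0, err
	}
	if second, err = strconv.Atoi(strings.TrimSpace(b)); err != nil {
		return 0, 0, err
	}
	return first, second, nil
}

// ParseSize parses strings such as "1280x1024".
func ParseSize(s string) (width, height int, err error) {
	return ParsePair(s, "x")
}

// ParsePosition parses strings such as "200,100".
func ParsePosition(s string) (x, y int, err error) {
	return ParsePair(s, ",")
}

func main() {
	w, h, err := ParseSize("1280x1024")
	fmt.Println(w, h, err)
	x, y, err := ParsePosition("200,100")
	fmt.Println(x, y, err)
}
```

With this split, DetermineWindowWidth would just return the first result of ParseSize, and so on for the other three functions.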
Objects
Note that you'd probably get cleaner code by using objects rather than strings (a Size and a Position one, and probably a Pair one too).
The ParsePair code (adapted appropriately) would be included in a constructor or factory method that gives you a Pair out of a string.
---
Of course, you can give other names to the various functions, objects and parameters; here I used the first that came to mind.
It seems this question-answer provides a good starting point to answer this question:
Appropriate name for container of position, size, angle
A search on www.thesaurus.com for "Property" gives some interesting possible answers that provide enough meaningful context to the usage:
Aspect
Character
Characteristic
Trait
Virtue
Property
Quality
Attribute
Differentia
Frame
Constituent
I think ConstituentProperty is probably the most apt.

Changing behavior based on number of return arguments like type assertions

I've been learning Go and one thing that stood out as particularly interesting to me is the way that the behavior of type assertions changes based on how many return values are being captured:
var i interface{} = "hello"
val, ok := i.(int) // All good
fmt.Println(val, ok)
val = i.(int) // Panics
fmt.Println(val)
This feels like a pattern that can be very useful for user defined functions. The user either has to explicitly get the "ok" second return value or use an underscore to ignore it. In either case, they're making it clear that they're aware that the function can fail. Whereas if they just get one return value, it could silently fail. Hence, it seems reasonable to panic or similar if the user isn't checking for an error (which would be reasonable if the error should "never" happen). I assume that's the logic behind the language developers in making type assertions work this way.
But when I tried to find out how that could be done, I found nothing. I'm aware that type assertions aren't an actual function. And many languages with multiple return values can't check how many return values are actually being used (MATLAB is the only one I'm aware of), but then again, most of those don't use behavior like the type assertions demonstrate.
So, is it possible and if so, how? If not, is there a particular reason that this behavior was excluded despite it being possible with the built in type assertions?
Sadly, this cannot be done in normal functions. As far as I know, only type assertions, map value access, channel receives and range allow it.
Usually, when you want a function both in a variant that returns an error and in a variant that doesn't, you name them like this:
func DoSomething() (string, error) {...} // I will return an error
func MustDoSomething() string {...} // I will panic
An example would be https://golang.org/pkg/regexp/#MustCompile
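Here is a minimal sketch of that naming convention. ParsePort and MustParsePort are hypothetical functions invented for illustration; only the pairing of an error-returning function with a panicking Must wrapper is taken from the answer.

```go
package main

import (
	"fmt"
	"strconv"
)

// ParsePort reports failure through its second return value.
func ParsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, fmt.Errorf("port %d out of range", n)
	}
	return n, nil
}

// MustParsePort wraps ParsePort and panics instead of returning an error.
// It suits inputs that are known to be valid, such as constants.
func MustParsePort(s string) int {
	n, err := ParsePort(s)
	if err != nil {
		panic(err)
	}
	return n
}

func main() {
	fmt.Println(MustParsePort("8080"))
	_, err := ParsePort("http")
	fmt.Println(err != nil)
}
```

The caller picks the behaviour by picking the name, which is the closest ordinary Go code gets to the two forms of a type assertion.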
This answer: https://stackoverflow.com/a/41816171/10278 by @christian provides the best practical advice for how to emulate the "overloaded-on-result-count" pattern.
My aim is to address another part of the question—this part: "But when I tried to find out how that could be done, I found nothing".
The following explains how it is done for Go type assertions.
Invocations of type assertions in Go behave as though they are overloaded based on number of results.
Yet, Go does not support overloading of methods and operators.
Looking at Go's implementation, here is the reason type assertions appear to be overloaded based on number of results:
The Go compiler provides special handling that is peculiar to these built-in operations.
This special dispatching occurs for the built-in concept of type assertions because the compiler is carving out special logic that is not available to non-built-in code.
The Go compiler and runtime are written in Go. That made it (somewhat) easy for me to discover that the compiler is the key to explaining this behavior.
Take a look at this part of the compiler:
https://github.com/golang/go/blob/8d86ef2/src/cmd/compile/internal/gc/ssa.go#L4782
The code comment already reveals a lot:
// dottype generates SSA for a type assertion node.
// commaok indicates whether to panic or return a bool.
// If commaok is false, resok will be nil.
We can go further by using a debugger to step through some type assertion code.
Take this playground snippet for example. Specifically, these lines:
object_as_closer_hardstop := thing.(io.Closer) // will panic!!
object_as_closer, ok := thing.(io.Closer)
If you build Go from source and use a debugger to step into the first type assertion, you will end up at the following code in the Go runtime:
https://github.com/golang/go/blob/8d86ef2/src/runtime/iface.go#L438
If you step into the second one, you end up at:
https://github.com/golang/go/blob/8d86ef2/src/runtime/iface.go#L454
On line 438, you see func assertI2I (with a single return value). A bit lower, on line 454, you see assertI2I2. Note that these two functions have nearly identical names, but not quite!
The second function has a trailing 2 at the end of its name. That function also has two returned results.
As we expect:
assertI2I can panic, but
assertI2I2 cannot.
(Look at the function bodies in iface.go and note which contains panic.)
assertI2I and assertI2I2 abide by the overloading rules we expect. If they were to differ only by number of results, then those of us who compile Go from source would be unable to compile the Go runtime, due to a compiler error such as "assertI2I redeclared".
Users of the language are generally not aware of these builtin runtime functions, so on the surface, both lines of code seem to call the same function:
object_as_closer_hardstop := thing.(io.Closer) // will panic!!
object_as_closer, ok := thing.(io.Closer)
However, at compile time the compiler branches based on whether it found the case "commaok":
https://github.com/golang/go/blob/8d86ef2/src/cmd/compile/internal/gc/ssa.go#L4871
Our own end-user code does not get to modify Go's lexing/parsing/AST-walking in order to dispatch different flavors of our functions based on "commaok".
For better or for worse, that is why user-written code cannot leverage this pattern.
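To see the compiler-provided behaviour from the user's side, here is a small sketch that exercises both forms of a type assertion, plus two other built-in operations that accept an optional second result. The helper tryInt is made up for illustration; it uses recover to observe the panic of the single-value form.

```go
package main

import "fmt"

// tryInt uses the single-value form of the type assertion and reports
// whether it panicked.
func tryInt(i interface{}) (n int, panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	n = i.(int) // panics if i does not hold an int
	return n, false
}

func main() {
	var i interface{} = "hello"

	// Two-value form: never panics, ok reports success.
	n, ok := i.(int)
	fmt.Println(n, ok)

	// Single-value form: same syntax on the right, different behaviour.
	fmt.Println(tryInt(i))

	// Map index and channel receive also have comma-ok forms.
	m := map[string]int{"a": 1}
	v, found := m["b"]
	fmt.Println(v, found)

	ch := make(chan int)
	close(ch)
	r, open := <-ch
	fmt.Println(r, open)
}
```

None of these can be reproduced by a user-defined function, because a Go function's signature fixes its number of results.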

Unicode - the right thing to do

I'm working on something which processes UTF-8 encoding, and I found myself asking the question:
What should I do when I encounter a byte which never occurs inside a
UTF-8 encoded string?
i.e. 0b1111111X
For example, I'm writing a small snippet of code which looks at the current place in the stream of bytes, and tells you how many bytes are used to represent the code point at that place in the stream.
0b0XXXXXXX just 1
0b10XXXXXX oops, we are in a continuation byte, search back upstream to find the leading byte
0b11XXXXXX count the number of leading 1s, that's the answer
0b1111111X err, this is not possible in UTF-8!!! what to do!?!?
I'm thinking of returning an error value, but wondering if I should, as a side effect, replace it with some more predictable error glyph (I mean the code point representing said glyph). And later when I do something more complicated, like jumping through the string and find that the leading byte does not have the correct number of continuation bytes after it... I'm thinking I should "fix" that up too.
Is it standard practice to leave wrongly encoded strings broken, or to change them and make them be wrong but at least play nice?
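The classification described in the question could be sketched like this in Go. The function name seqLen and the two error values are assumptions. Note that this checks only the bit prefix of the byte: modern UTF-8 (RFC 3629) allows at most four bytes per code point, so five or more leading 1s are treated as invalid here, and bytes such as 0xC0, 0xC1 and 0xF5 to 0xF7 pass this prefix check even though a full validator would still reject them.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

var (
	// ErrContinuation means the byte is in the middle of a sequence.
	ErrContinuation = errors.New("continuation byte, not the start of a sequence")
	// ErrInvalidLead means the byte can never start a UTF-8 sequence.
	ErrInvalidLead = errors.New("byte cannot start a UTF-8 sequence")
)

// seqLen reports how many bytes the sequence starting with b should
// occupy, judging only by the number of leading 1 bits in b.
func seqLen(b byte) (int, error) {
	// Leading ones of b are the leading zeros of its complement.
	switch ones := bits.LeadingZeros8(^b); ones {
	case 0:
		return 1, nil // 0XXXXXXX: single-byte (ASCII)
	case 1:
		return 0, ErrContinuation // 10XXXXXX
	case 2, 3, 4:
		return ones, nil // 110XXXXX, 1110XXXX, 11110XXX
	default:
		return 0, ErrInvalidLead // 11111XXX and above
	}
}

func main() {
	for _, b := range []byte{'A', 0xE2, 0x80, 0xFF} {
		n, err := seqLen(b)
		fmt.Printf("%#02x: %d %v\n", b, n, err)
	}
}
```

Returning distinct errors leaves the decision about searching backwards or substituting a replacement character to the caller.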
The most common way is to just throw a meaningful error if the input is not correct and stop.
There are a lot of good reasons to do so:
speed: if you try to fix errors, this often causes your function to be slower even on correct inputs
simplicity: your code can become really complicated if you try to fix every error
maintainability and correctness: it's just easier to ensure the function works correctly when you stop whenever the input does not match the specification you are working with, since you only have to check the input against the specification
purpose: any time you get to a point like this, you have to think about what the purpose of your function is and why you came up with the idea to write it
Also, a function fixcode which fixes the UTF-8 could also be used elsewhere, so it makes total sense to separate out the fixing (the purpose, simplicity, maintainability and correctness arguments again).
Even if you expect an error, I would prefer to separate encode and fixcode, since you can reuse fixcode in other contexts.
If you are really thinking about fixing the UTF-8 while encoding, I would use a pattern like this:
try {
    q = encode(s);
} catch(encodingerror) {
    log(encodingerror);
    t = fixcode(s);
    q = encode(t);
}
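The same separation can be sketched in Go, which uses error values rather than exceptions. The names strictDecode, fixcode and lenientDecode are hypothetical; fixcode here relies on the standard library's strings.ToValidUTF8 to replace each run of invalid bytes with U+FFFD, which is one possible repair policy among several.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

var ErrInvalidUTF8 = errors.New("input is not valid UTF-8")

// strictDecode stands in for the answer's encode: it stops with an
// error as soon as the input violates the specification.
func strictDecode(s string) ([]rune, error) {
	if !utf8.ValidString(s) {
		return nil, ErrInvalidUTF8
	}
	return []rune(s), nil
}

// fixcode is the separate, reusable repair step: each run of invalid
// bytes becomes a single U+FFFD replacement character.
func fixcode(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

// lenientDecode tries the strict path first and repairs only on failure.
func lenientDecode(s string) []rune {
	q, err := strictDecode(s)
	if err != nil {
		fmt.Println("repairing input:", err) // stands in for log(encodingerror)
		q, _ = strictDecode(fixcode(s))      // cannot fail: fixcode output is valid
	}
	return q
}

func main() {
	fmt.Printf("%q\n", lenientDecode("a\xffb"))
}
```

Valid input never pays for the repair step, which matches the speed argument above.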

reading a "." in scheme R5RS

I need to be able to read user input in Scheme for a project. For example, I need to be able to read the string "4 5 * .". I was implementing it using the (read) function, but it gives an error when it reads a ".". I would use a different symbol, but it is specified by the project description. Is there a way to do this?
You cannot use read to input arbitrary text. The read procedure is only meant for inputting "S-expressions", a data format that can be used to represent a superset of Scheme source code expressions.
The reason you cannot read a . via the read procedure is that a period token has a special role in Scheme source: it is used for dotted pair notation. (C1 . C2) is the way that the pair of C1 and C2 is written as an S-expression. Note there is a crucial difference between the single pair (C1 . C2) and the list (C1 C2) (which is made from two pairs); and yet the only difference between the source text is the presence/absence of a single period.
The dotted pair notation is described in section 6.3.2 of the R5RS.
So, as suggested in the comments on your question by Dan D., you should consider using the read-char procedure to consume user input text. It is described in section 6.6.2 of the R5RS. It may seem counter-intuitive, since read-char only consumes a single character while read consumes many characters (and builds a potentially large tree of structured data), but the reality is that you can build your own parser on top of read-char, by invoking it repeatedly in a loop, as suggested by Dan D.
In fact, some scheme systems implement read itself by making it a Scheme procedure that invokes read-char. See for example Larceny's reader source code, where read will call get-datum, which calls get-datum-with-source-locations, which calls read-char in a number of places.
Alternatively, you might have other ways of reading input from the user. The read-line procedure is quite common (and it's also easy to write on top of read-char). Or you might look into a parser generator (like the one that generated the source code for Larceny's reader, linked above).
