Golang complex fold grüßen - go

I'm trying to get case folding to be consistent between three languages (C++, Python, and Go), because I need to be able to check whether a string matches a saved one regardless of the language.
An example problematic word is the German "grüßen", which in uppercase is "GRÜSSEN" (note that the 'ß' becomes two characters, 'SS').
C++ works well using boost::locale (see the text conversion docs).
Python 3 also works, through str.casefold() (see the casefold docs).
However, Go doesn't seem to have a way to do proper case folding (see this Go playground example).
Is there some way to do this that I'm missing, or does the bug noted at the end of the unicode documentation apply to all text conversion in Go? If so, what are my options for case folding other than writing it in cgo?

Advanced (Unicode-enabled) text processing is not part of the Go stdlib,¹
and exists in the form of a host of ("blessed") third-party packages
under the golang.org/x/text/ umbrella.
As Shawn figured out by himself, one can do
    import (
        "golang.org/x/text/cases"
    )

    c := cases.Fold()
    c.String("grüßen")
to get "grüssen" back.
¹ That's because whatever is shipped in the stdlib is subject to the
Go 1 compatibility promise,
and at the time Go 1 shipped, certain functionality wasn't available,
was incomplete, or had APIs still in flux, so such bits were kept out
of the core to let them mature.

Related

Why does the Tool Help Library offer 2 versions of same functions/structures?

I've noticed that the Tool Help Library offers some functions and structures in two versions: a plain one and one ending with W, for example Process32First and Process32FirstW. Since their documentation is identical, I wonder what the differences between the two are.
The W and A versions stand for "wide" and "ANSI". In the past, Windows provided separate functions, structures, and types for ANSI and Unicode strings. For the purposes of this answer, Unicode means wide char, which is 2 bytes per character, and ANSI is 1 byte per character (but it's actually more complicated than that). By supplying both, the developer can use whichever they want, but the standard today is to use the Unicode (W) versions.
If you look at the ToolHelp32 header file, it does include both the A and W versions of the structures and functions. If you're not finding them, do an explicit search for the identifiers and you will find them; if you're just doing "view definition" you will only see the #ifdef macros. If you still can't find them, change the character set in your Visual Studio project and check again.
Because wide char arrays are twice the size, structure alignment will be incorrect if you do not use the correct types. Let the macros resolve them for you by setting the correct character set and using PROCESSENTRY32 without the A or W suffix; this is the preferred method. For some APIs you are honestly better off using the ANSI version, but that is something you will learn with experience and will have to decide for yourself.
Here is an excellent article on the topic of character sets / encoding
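To make the "wide" representation concrete, here is a small sketch in Go (to match the other examples in this document) using only the standard library's unicode/utf16 package. The string literal is an arbitrary example; the point is that a single visible character can need more than one 2-byte UTF-16 code unit, which is why "2 bytes per character" is a simplification.

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        s := "Grüße 😀"                  // Go strings are UTF-8
        wide := utf16.Encode([]rune(s)) // UTF-16 code units, as a W API expects
        fmt.Println(len(s), "UTF-8 bytes ->", len(wide), "UTF-16 code units")

        back := string(utf16.Decode(wide)) // round-trip back to UTF-8
        fmt.Println(back == s)             // true
    }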

Using unexported functions/types from stdlib in Go

Disclaimer: yes I know that this is not "supposed to be done" and "use interface composition and delegation" and "the authors of the language know better". However, I am confronted with a choice of either copy-pasting from the standard library and creating my own packages, or doing what I am asking. So please do not reply with "What you want to do is wrong, you are a bad dev and you should feel bad."
So, in Go we have the http stdlib package. This package has a number of functions for dealing with HTTP Range headers and responses (parsers, a struct for "offset+size" and so forth). For various reasons I want to use something that is very similar to ServeContent but works a bit differently (long story short - the amount of plumbing needed to do the ReaderAt gymnastics is suboptimal for what I want to accomplish) so I want to parse the HTTP Range header myself, using the utility functions/structs from the http stdlib package and then deal with them manually. Basically, I want a changed version of ServeContent :-)
Is there a way for me to "reopen" the http stdlib package to use its unexported identifiers? ABI is not a concern for me, as the source is mine, the program gets compiled from scratch every time, etc., and it does not need binary compatibility with older/other Go versions. I.e. I am able to ensure that the build is done on a specific Go version, and there are tests to catch the case where an unexported identifier disappears. So...
If there is a package called foo in the Go standard library, but it only exposes a MagicMegamethod that does the thing I do not need, and uses usefulFunc and usefulStruct that I want to get access to, is there a way for me to get access to those identifiers? Either by reopening the package, or using some other way... that does not involve copy-pasting dozens of lines from stdlib without tests etc.
There exist (rather gruesome) ways of accessing unexported symbols, but they require nontrivial amounts of tricky code, so there's unlikely to be a net win.
Since you've ruled out the "don't do this" direction, it seems that the answer is either NO or to use the methods described in the post I linked to (and this repo).
FWIW I'd personally just copy the code I need from the standard library and tweak it to my needs. This would likely take less time than the time it took you to write this SO question :-)

Decoding Protobuf encoded data using non-supported platform

I am new to Protobufs; I haven't had much exposure to them. One of the API endpoints we require data from, uses Protobuf encoded data. This generally wouldn't be an issue if I was using a 'supported' language such as JavaScript, Java, Python or even R to decode the data...
Unfortunately, I am trying to automate the process using Alteryx. Rather than this being an Alteryx specific question, I have a few questions about Protobufs themselves so I understand this situation better. I've read through the implementation of Protobufs in Java and Python, and have a basic understanding of how to use them.
To summarize (please correct me if I am wrong): a Protobuf is a method of serializing structured data, where a .proto schema is used to encode/decode data to and from raw binary. My confusion lies with the compiler. Google's documentation and examples for Python/Java show how a Protobuf compiler (library) is required in order to run the encoding and decoding process. The Google website says that Protobufs are 'language neutral and platform neutral', but I can't see how that is possible if you need the compiler (and the .proto file!) to do the decoding. For example, how would anyone using a language for which Google hasn't created a compiler possibly decode Protobuf-encoded data? Am I missing something?
I figure I'm missing something, since it seems weird that a public API would force this constraint.
"language/platform neutral" here simply means that you can reliably get the same data back from any language/framework/platform. The serialization format is defined independently and does not rely on the nuances of any particular framework.
This might seem a low bar, but you'd be surprised how many serialization formats fail to clear it.
Because the format is specified, anyone can create a tool for some other platform. It is a little fiddly if you're not used to dealing in bits, but: totally doable. The protobuf landscape is not dependent on Google - here's a list of some of the known non-Google tools: https://github.com/protocolbuffers/protobuf/blob/master/docs/third_party.md
Also, note that technically you don't even need a .proto; you just need some mechanism for specifying which fields map to which field numbers (since protobuf doesn't include the names). Quite a few in that list can work either from a .proto, or from the field/number map being specified in some other way. The advantage of .proto is simply that it is easy to convey as the schema - and again: isn't tied to any particular language. You can write plugins for "protoc" to add your own tooling, so you don't need to write your own parser from scratch. Or you can write your own parser from scratch if you prefer.
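As a concrete illustration that the wire format can be walked without a .proto, here is a sketch in Go using the google.golang.org/protobuf/encoding/protowire package. The message bytes are hand-built for the example, and the field numbers and values are arbitrary choices, not from any real schema.

    package main

    import (
        "fmt"

        "google.golang.org/protobuf/encoding/protowire"
    )

    func main() {
        // Build a tiny message by hand: field 1 = varint 150, field 2 = bytes "hi".
        var msg []byte
        msg = protowire.AppendTag(msg, 1, protowire.VarintType)
        msg = protowire.AppendVarint(msg, 150)
        msg = protowire.AppendTag(msg, 2, protowire.BytesType)
        msg = protowire.AppendString(msg, "hi")

        // Walk it back. All the wire format gives us is field numbers and wire
        // types; mapping numbers to names is exactly what a .proto (or any
        // equivalent schema) supplies.
        for len(msg) > 0 {
            num, typ, n := protowire.ConsumeTag(msg)
            if n < 0 {
                panic(protowire.ParseError(n))
            }
            msg = msg[n:]

            switch typ {
            case protowire.VarintType:
                v, m := protowire.ConsumeVarint(msg)
                fmt.Printf("field %d: varint %d\n", num, v)
                msg = msg[m:]
            case protowire.BytesType:
                v, m := protowire.ConsumeBytes(msg)
                fmt.Printf("field %d: bytes %q\n", num, v)
                msg = msg[m:]
            default:
                m := protowire.ConsumeFieldValue(num, typ, msg)
                fmt.Printf("field %d: skipping wire type %d\n", num, typ)
                msg = msg[m:]
            }
        }
    }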
You can't really speak of a non-supported platform in this case: it is more about languages for which you can't find a protobuf implementation.
My two cents: if you can't find a protobuf implementation for your language, pick another language you're familiar with (and that is popular in the protobuf community) and handle the protobuf serialization/deserialization with it. Then call it via a REST API, an executable... whatever.

How to decode a single UTF-8 character and step onto the next using only the Rust standard library?

Does Rust provide a way to decode a single character (a Unicode scalar value, to be exact) from a &[u8], which may be multiple bytes long, returning a single USV?
Something like GLib's g_utf8_get_char & g_utf8_next_char:
    // Example of what GLib's functions might look like once ported to Rust.
    let mut i = 0;
    while i < slice.len() {
        let unicode_char = g_utf8_get_char(&slice[i..]);
        // Do something with the Unicode character.
        process(unicode_char);
        // Move onto the next character.
        i += g_utf8_next_char(&slice[i..]);
    }
Short of porting parts of the GLib API to Rust, does Rust provide a way to do this, besides some trial & error calls to from_utf8 which stop once the second character is reached?
See GLib's code.
No, there is no such functionality publicly exposed in the Rust standard library as of Rust 1.14.
And neither should there be. Rust doesn't believe in a gigantic standard library. Crates are trivial to use and prevent people from rewriting code. Many people have an incorrect opinion (yeah, that's right: an opinion is incorrect) that using dependencies makes their program weaker.
Anything put in the standard library has to be maintained forever. There are zero plans for a Rust 2.0 that would break backwards compatibility. Python is the normal example here, with a multitude of "get data from a URL" parts of the standard library that are all redundant and deprecated now. The Python maintainers have to waste time keeping those working, instead of advancing the language.
Third-party crates allow things to be created, evolve, and die without burdening the entire language.
You can convert a byte slice (&[u8]) into a string slice (&str) by using str::from_utf8 (note that this validates that the whole byte slice is valid UTF-8). You can then use the chars() iterator on the string slice to iterate over each character (char) in the string.

Naming convention for not-exported type names in Go

I like to name my types using Pascal case - starting with an upper case letter. In Go this implies the name is exported.
To avoid exporting them, I've started to prefix type names with an underscore instead of lower-casing the first letter.
E.g., instead of
type Column struct{}, I use type _Column struct{} to avoid export.
I haven't seen this naming scheme used, but neither found any reason not to use it.
Since golint accepts it without complaint, I guess this is OK?
Conclusion: Based on answers and comments I've decided to stay with lower-cased type names.
I'd suggest using column in preference to _Column, on the basis that the standard library follows that naming convention.
This is not explicit in the Names section of the style guide, but given that underscores are generally discouraged, I'd say that using _Column is, at best, not idiomatic.
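For what it's worth, a tiny sketch of the conventional spelling (the type names here are just placeholders):

    package main

    // column is unexported: the lower-case first letter keeps it
    // invisible outside this package.
    type column struct{}

    // Column is exported: the upper-case first letter makes it part
    // of the package's public API.
    type Column struct{}

    func main() {
        _ = column{}
        _ = Column{}
    }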
"I like to" and go don't super mix.
There are idiomatic bits and tooling enforced bits.
The community sticking to the standards makes for codebases that can be reasonably easy to read and comprehend by others.
I find this to be one of the best attributes of go.
Sure, channels and goroutines are nice.
Easily being able to read a codebase is often much more valuable.
