How to handle NaN values when writing to parquet in GO? - go

I am trying to write to a parquet file in GO. While writing to this file, I can get NaN values. Since NaN is neither defined in the primitive types nor in logical type then how do I handle this value in GO? Does any existing schema work for it?
I am using the parquet GO library from here. You can find an example of the code using JSON schema for writing to parquet here using this library.

The isse was discussed at lenght in xitongsys/parquet-go issue 281, with the recommandation being to
use OPTIONAL type.
Even you don't assign a value (like you code), the non-point value will be assigned a default value.
So parquet-go don't know it's null or default value.
However:
What is comes down to is that I cannot use the OPTIONAL type, in other words I cannot convert my structure to use pointers.
I have tried to use repetitiontype=OPTIONAL as a tag, but this leads to some weird behavior.
I would expect that tag to behave the same way that the omitempty tag in the Golang standard library, i.e. if the value is not present then it is not put into the JSON.
The reason this is important is that if the field is missing or not set, when it is encoded to parquet then there is no way of telling if the value was 0 or just not set in the case of int64.
This illustrates the issue:
package main
import (
"encoding/json"
"io/ioutil"
)
type Salary struct {
Basic, HRA, TA float64 `json:",omitempty"`
}
type Employee struct {
FirstName, LastName, Email string `json:",omitempty"`
Age int
MonthlySalary []Salary `json:",omitempty"`
}
func main() {
data := Employee{
Email: "mark#gmail.com",
MonthlySalary: []Salary{
{
Basic: 15000.00,
},
},
}
file, _ := json.MarshalIndent(data, "", " ")
_ = ioutil.WriteFile("test.json", file, 0o644)
}
with a JSON produced as:
{
"Email": "mark#gmail.com",
"Age": 0,
"MonthlySalary": [
{
"Basic": 15000
}
]
}
As you can see, the item in the struct that have the omit empty tag and that are not assigned do no appear in the JSON, i.e. HRA TA.
But on the other hand Age does not have this tag and hence it is still included in the JSON.
This is problematic as all fields in the struct are assigned memory when this golang library writes to parquet- so if you have a big struct that is only sparsely populated it will still take the full amount of memory.
It is a bigger problem when the file is read again as there is no way of know if the value that was put in the parquet file was the empty value or it is was just not assigned.
I am happy to help implement an omitempty tag for this library if I can convince you of the value of having it.
That echoes issue 403 "No option to omitempty when not using pointers".

Related

How to represent golang "any" type in protobuf language

am using a protobuf definition file to create golang classes
file.proto
message my_message {
<what type to use?> data = 1 [json_name = "data"]
}
generated.go
type MyMessage struct {
data any
}
I have checked out Struct , Value and Any, though none of them actually mark the type as “any” in generated classes.
The closes I get is Value type which can accomodate primitives as well as structs and lists.
As far as I'm aware the Go code generator doesn't have any circumstances where any/interface{} will be used for a generated field.
google.protobuf.Any is likely what you want here, but the field in the generated code will of type anypb.Any. If you absolutely require the any type be used, you'll unfortunately have to manually map to another struct.

Deriving a hashmap/Btree from a strcut of values in rust

I am trying to do the following:
Deserialise a struct of fields into another struct of different format.
I am trying to do the transformation via an Hashmap as this seems to be the best suited way.
I am required to do this as a part of transformation of a legacy system, and this being one of the intermediary phases.
I have raised another question which caters to a subset of the same use-case, but do not think it gives enough overview, hence raising this question with more details and code.
(Rust converting a struct o key value pair where the field is not none)
Will be merging both questions once I figure out how.
Scenario:
I am getting input via an IPC through a huge proto.
I am using prost to deserialise this data into a struct.
My deserialised struct has n no of fields and can increase as well.
I need to transform this deserialised struct into a key,value struct. (shown ahead).
The incoming data, will mostly have a majority of null keys .i.e out of n fields, most likely only 1,2 have values at a given time.
Input struct after prost deserialisation:
Using proto3 so unable to define optional fields.
Ideally I would prefer a prost struct of options on every field. .i.e Option instead of string.
struct Data{
int32 field1,
int64 field2,
string field3,
...
}
This needs to be transformed to a genric struct as below for further use:
struct TransformedData
{
string Key
string Value
}
So,
Data{
field1: 20
field2: null
field3: null
field4: null
....
}
becomes
TransformedData{
key:"field1"
Value: "20"
}
Methods I have tried:
Method1
Add serde to the prost struct definiton and deserialise it into a map.
Loop over each item in a map to get values which are non-null.
Use these to create a struct
https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=9645b1158de31fd54976926c9665d6b4
This has its challenges:
Each primitve data type has its own default values and needs to be checked for.
Nested structs will result in object data type which needs to be handled.
Need to iterate over every field of a struct to determine non null values.
1&2 can be mitigated by setting an Optional field(yet to figure out how in proto3 and prost)
But I am not sure how to get over 3.
Iterating over a large struct to find non null fields is not scalable and will be expensive.
Method 2:
I am using prost reflects dynamic reflection to deserialise and get only specified value.
Doing this by ensuring each proto message has two extra fields:
proto -> signifying the proto struct being used when serializing.
key -> signifying the filed which has value when serializing.
let fd = package::FILE_DESCRIPTOR_SET;
let pool = DescriptorPool::decode(fd).unwrap();
let proto_message: package::GenericProto = prost::Message::decode(message.as_ref()).expect("error when de-serialising protobuf message to struct");
let proto = proto_message.proto.as_str();
let key = proto_message.key.as_str();
Then using key , I derive the key from what looks to be a map implemented by prost:
let message_descriptor = pool.get_message_by_name(proto).unwrap();
let dynamic_message = DynamicMessage::decode(message_descriptor, message.as_ref()).unwrap();
let data = dynamic_message.get_field_by_name(key).unwrap().as_ref().clone();
Here :
1.This fails when someone sends a struct with multiple fields filled.
2.This does not work for nested objects, as nested objects in turn needs to be converted to a map.
I am looking for the least expensive way to do the above.
I understand the solution I am looking for is pretty out there and I am taking a one size fits all approach.
Any pointers appreciated.

Struct type first line: _ struct{} [duplicate]

I am working with go, specifically QT bindings. However, I do not understand the use of leading underscores in the struct below. I am aware of the use of underscores in general but not this specific example.
type CustomLabel struct {
core.QObject
_ func() `constructor:"init"`
_ string `property:"text"`
}
Does it relate to the struct tags?
Those are called blank-fields because the blank identifier is used as the field name.
They cannot be referred to (just like any variable that has the blank identifier as its name) but they take part in the struct's memory layout. Usually and practically they are used as padding, to align subsequent fields to byte-positions (or memory-positions) that match layout of the data coming from (or going to) another system. The gain is that so these struct values (or rather their memory space) can be dumped or read simply and efficiently in one step.
#mkopriva's answer details what the specific use case from the question is for.
A word of warning: these blank fields as "type-annotations" should be used sparingly, as they add unnecessary overhead to all (!) values of such struct. These fields cannot be referred to, but they still require memory. If you add a blank field whose size is 8 bytes (e.g. int64), if you create a million elements, those 8 bytes will count a million times. As such, this is a "flawed" use of blank fields: the intention is to add meta info to the type itself (not to its instances), yet the cost is that all elements will require increased memory.
You might say then to use a type whose size is 0, such as struct{}. It's better, as if used in the right position (e.g. being the first field, for reasoning see Struct has different size if the field order is different and also Why position of `[0]byte` in the struct matters?), they won't change the struct's size. Still, code that use reflection to iterate over the struct's fields will still have to loop over these too, so it makes such code less efficient (typically all marshaling / unmarshaling process). Also, since now we can't use an arbitrary type, we lose the advantage of carrying a type information.
This last statement (about when using struct{} we lose the carried type information) can be circumvented. struct{} is not the only type with 0 size, all arrays with 0 length also have zero size (regardless of the actual element type). So we can retain the type information by using a 0-sized array of the type we'd like to incorporate, such as:
type CustomLabel struct {
_ [0]func() `constructor:"init"`
_ [0]string `property:"text"`
}
Now this CustomLabel type looks much better performance-wise as the type in question: its size is still 0. And it is still possible to access the array's element type using Type.Elem() like in this example:
type CustomLabel struct {
_ [0]func() `constructor:"init"`
_ [0]string `property:"text"`
}
func main() {
f := reflect.ValueOf(CustomLabel{}).Type().Field(0)
fmt.Println(f.Tag)
fmt.Println(f.Type)
fmt.Println(f.Type.Elem())
}
Output (try it on the Go Playground):
constructor:"init"
[0]func()
func()
For an overview of struct tags, read related question: What are the use(s) for tags in Go?
You can think of it as meta info of the type, it's not accessible through an instance of that type but can be accessed using reflect or go/ast. This gives the interested package/program some directives as to what to do with that type. For example based on those tags it could generate code using go:generate.
Considering that one of the tags says constructor:"init" and the field's type is func() it's highly probable that this is used with go:generate to generate an constructor function or initializer method named init for the type CustomLabel.
Here's an example of using reflect to get the "meta" info (although as I've already mentioned, the specific qt example is probably meant to be handled by go:generate).
type CustomLabel struct {
_ func() `constructor:"init"`
_ string `property:"text"`
}
fmt.Println(reflect.ValueOf(CustomLabel{}).Type().Field(0).Tag)
// constructor:"init"
fmt.Println(reflect.ValueOf(CustomLabel{}).Type().Field(0).Type)
// func()
https://play.golang.org/p/47yWG4U0uit

Convert interface{} to struct in Golang

I am very new to Go and am trying to get my head around all the different types and how to use them. I have an interface with the following (which was originally in a json file):
[map[item:electricity transform:{fuelType}] map[transform:{fuelType} item:gas]]
and I have the following struct
type urlTransform struct {
item string
transform string
}
I have no idea how to get the interface data into the struct; I'm sure this is really stupid, but I have been trying all day. Any help would be greatly appreciated.
Decode the JSON directly to types you want instead of decoding to an interface{}.
Declare types that match the structure of your JSON data. Use structs for JSON objects and slices for JSON arrays:
type transform struct {
// not enough information in question to fill this in.
}
type urlTransform struct {
Item string
Transform transform
}
var transforms []urlTransform
The field names must be exported (start with uppercase letter).
Unmarshal the JSON to the declared value:
err := json.Unmarshal(data, &transforms)
or
err := json.NewDecoder(reader).Decode(&transforms)
From your response : [map[item:electricity transform:{fuelType}] map[transform:{fuelType} item:gas]].
As you can see here this is a an array that has map in it.
One way to get the value from this is :
values := yourResponse[0].(map[string]interface{}). // convert first index to map that has interface value.
transform := urlTransform{}
transform.Item = values["item"].(string) // convert the item value to string
transform.Transform = values["transform"].(string)
//and so on...
as you can see from the code above I'm getting the the value using map. And convert the value to the appropriate type in this case is string.
You can convert it to appropriate type like int or bool or other type. but this approach is painful as you need to get the value one bye one and assign it your your field struct.

What are the use(s) for struct tags in Go?

In the Go Language Specification, it mentions a brief overview of tags:
A field declaration may be followed by an optional string literal tag,
which becomes an attribute for all the fields in the corresponding
field declaration. The tags are made visible through a reflection
interface but are otherwise ignored.
// A struct corresponding to the TimeStamp protocol buffer.
// The tag strings define the protocol buffer field numbers.
struct {
microsec uint64 "field 1"
serverIP6 uint64 "field 2"
process string "field 3"
}
This is a very short explanation IMO, and I was wondering if anyone could provide me with what use these tags would be?
A tag for a field allows you to attach meta-information to the field which can be acquired using reflection. Usually it is used to provide transformation info on how a struct field is encoded to or decoded from another format (or stored/retrieved from a database), but you can use it to store whatever meta-info you want to, either intended for another package or for your own use.
As mentioned in the documentation of reflect.StructTag, by convention the value of a tag string is a space-separated list of key:"value" pairs, for example:
type User struct {
Name string `json:"name" xml:"name"`
}
The key usually denotes the package that the subsequent "value" is for, for example json keys are processed/used by the encoding/json package.
If multiple information is to be passed in the "value", usually it is specified by separating it with a comma (','), e.g.
Name string `json:"name,omitempty" xml:"name"`
Usually a dash value ('-') for the "value" means to exclude the field from the process (e.g. in case of json it means not to marshal or unmarshal that field).
Example of accessing your custom tags using reflection
We can use reflection (reflect package) to access the tag values of struct fields. Basically we need to acquire the Type of our struct, and then we can query fields e.g. with Type.Field(i int) or Type.FieldByName(name string). These methods return a value of StructField which describes / represents a struct field; and StructField.Tag is a value of type StructTag which describes / represents a tag value.
Previously we talked about "convention". This convention means that if you follow it, you may use the StructTag.Get(key string) method which parses the value of a tag and returns you the "value" of the key you specify. The convention is implemented / built into this Get() method. If you don't follow the convention, Get() will not be able to parse key:"value" pairs and find what you're looking for. That's also not a problem, but then you need to implement your own parsing logic.
Also there is StructTag.Lookup() (was added in Go 1.7) which is "like Get() but distinguishes the tag not containing the given key from the tag associating an empty string with the given key".
So let's see a simple example:
type User struct {
Name string `mytag:"MyName"`
Email string `mytag:"MyEmail"`
}
u := User{"Bob", "bob#mycompany.com"}
t := reflect.TypeOf(u)
for _, fieldName := range []string{"Name", "Email"} {
field, found := t.FieldByName(fieldName)
if !found {
continue
}
fmt.Printf("\nField: User.%s\n", fieldName)
fmt.Printf("\tWhole tag value : %q\n", field.Tag)
fmt.Printf("\tValue of 'mytag': %q\n", field.Tag.Get("mytag"))
}
Output (try it on the Go Playground):
Field: User.Name
Whole tag value : "mytag:\"MyName\""
Value of 'mytag': "MyName"
Field: User.Email
Whole tag value : "mytag:\"MyEmail\""
Value of 'mytag': "MyEmail"
GopherCon 2015 had a presentation about struct tags called:
The Many Faces of Struct Tags (slide) (and a video)
Here is a list of commonly used tag keys:
json      - used by the encoding/json package, detailed at json.Marshal()
xml       - used by the encoding/xml package, detailed at xml.Marshal()
bson      - used by gobson, detailed at bson.Marshal(); also by the mongo-go driver, detailed at bson package doc
protobuf  - used by github.com/golang/protobuf/proto, detailed in the package doc
yaml      - used by the gopkg.in/yaml.v2 package, detailed at yaml.Marshal()
db        - used by the github.com/jmoiron/sqlx package; also used by github.com/go-gorp/gorp package
orm       - used by the github.com/astaxie/beego/orm package, detailed at Models – Beego ORM
gorm      - used by gorm.io/gorm, examples can be found in their docs
valid     - used by the github.com/asaskevich/govalidator package, examples can be found in the project page
datastore - used by appengine/datastore (Google App Engine platform, Datastore service), detailed at Properties
schema    - used by github.com/gorilla/schema to fill a struct with HTML form values, detailed in the package doc
asn       - used by the encoding/asn1 package, detailed at asn1.Marshal() and asn1.Unmarshal()
csv       - used by the github.com/gocarina/gocsv package
env - used by the github.com/caarlos0/env package
Here is a really simple example of tags being used with the encoding/json package to control how fields are interpreted during encoding and decoding:
Try live: http://play.golang.org/p/BMeR8p1cKf
package main
import (
"fmt"
"encoding/json"
)
type Person struct {
FirstName string `json:"first_name"`
LastName string `json:"last_name"`
MiddleName string `json:"middle_name,omitempty"`
}
func main() {
json_string := `
{
"first_name": "John",
"last_name": "Smith"
}`
person := new(Person)
json.Unmarshal([]byte(json_string), person)
fmt.Println(person)
new_json, _ := json.Marshal(person)
fmt.Printf("%s\n", new_json)
}
// *Output*
// &{John Smith }
// {"first_name":"John","last_name":"Smith"}
The json package can look at the tags for the field and be told how to map json <=> struct field, and also extra options like whether it should ignore empty fields when serializing back to json.
Basically, any package can use reflection on the fields to look at tag values and act on those values. There is a little more info about them in the reflect package
http://golang.org/pkg/reflect/#StructTag :
By convention, tag strings are a concatenation of optionally
space-separated key:"value" pairs. Each key is a non-empty string
consisting of non-control characters other than space (U+0020 ' '),
quote (U+0022 '"'), and colon (U+003A ':'). Each value is quoted using
U+0022 '"' characters and Go string literal syntax.
It's some sort of specifications that specifies how packages treat with a field that is tagged.
for example:
type User struct {
FirstName string `json:"first_name"`
LastName string `json:"last_name"`
}
json tag informs json package that marshalled output of following user
u := User{
FirstName: "some first name",
LastName: "some last name",
}
would be like this:
{"first_name":"some first name","last_name":"some last name"}
other example is gorm package tags declares how database migrations must be done:
type User struct {
gorm.Model
Name string
Age sql.NullInt64
Birthday *time.Time
Email string `gorm:"type:varchar(100);unique_index"`
Role string `gorm:"size:255"` // set field size to 255
MemberNumber *string `gorm:"unique;not null"` // set member number to unique and not null
Num int `gorm:"AUTO_INCREMENT"` // set num to auto incrementable
Address string `gorm:"index:addr"` // create index with name `addr` for address
IgnoreMe int `gorm:"-"` // ignore this field
}
In this example for the field Email with gorm tag we declare that corresponding column in database for the field email must be of type varchar and 100 maximum length and it also must have unique index.
other example is binding tags that are used very mostly in gin package.
type Login struct {
User string `form:"user" json:"user" xml:"user" binding:"required"`
Password string `form:"password" json:"password" xml:"password" binding:"required"`
}
var json Login
if err := c.ShouldBindJSON(&json); err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
}
the binding tag in this example gives hint to gin package that the data sent to API must have user and password fields cause these fields are tagged as required.
So generraly tags are data that packages require to know how should they treat with data of type different structs and best way to get familiar with the tags a package needs is READING A PACKAGE DOCUMENTATION COMPLETELY.

Resources