I have something like a data pipeline.
API response (10k rows) as JSON.
=> Sanitize some of the data into a new structure
=> Create a CSV File
I can currently do that by getting the full response and then performing each step one after the other.
I was wondering if there's a simpler way to stream the response, converting it to CSV right away and writing to the file as the response is being read.
Current code:
I will have JSON like { "name": "Full Name", ... (20 columns) } and that data repeats about 10-20k times with different values.
For the request:
var res *http.Response
if res, err = client.Do(request); err != nil {
return errors.Wrap(err, "failed to perform request")
}
For unmarshalling:
var record []RecordStruct
if err = json.NewDecoder(res.Body).Decode(&record); err != nil {
return err
}
For the CSV:
var row []byte
if row, err = csvutil.Marshal(record); err != nil {
return err
}
To stream an array of JSON objects you have to decode the nested objects instead of the root object. To do this you need to read the data using tokens (check out the Token method). According to the documentation:
Token returns the next JSON token in the input stream. At the end of the input stream, Token returns nil, io.EOF.
Token guarantees that the delimiters [ ] { } it returns are properly nested and matched: if Token encounters an unexpected delimiter in the input, it will return an error.
The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters [ ] { } of type Delim to mark the start and end of arrays and objects. Commas and colons are elided.
That means you can decode the document part by part. You can find an official example of how to do it here.
I will post a code snippet that shows how you can combine the JSON streaming technique with writing the result to CSV:
package main
import (
"encoding/csv"
"encoding/json"
"log"
"os"
"strings"
)
type RecordStruct struct {
Name string `json:"name"`
Info string `json:"info"`
// ... any field you want
}
func (rs *RecordStruct) CSVRecord() []string {
// Here we form data for CSV writer
return []string{rs.Name, rs.Info}
}
const jsonData =
`[
{ "name": "Full Name", "info": "..."},
{ "name": "Full Name", "info": "..."},
{ "name": "Full Name", "info": "..."},
{ "name": "Full Name", "info": "..."},
{ "name": "Full Name", "info": "..."}
]`
func main() {
// Create file for storing our result
file, err := os.Create("result.csv")
if err != nil {
log.Fatalln(err)
}
defer file.Close()
// Create CSV writer using standard "encoding/csv" package
var w = csv.NewWriter(file)
// Put your reader here. In this case I use strings.Reader
// If you are getting data through http it will be resp.Body
var jsonReader = strings.NewReader(jsonData)
// Create JSON decoder using "encoding/json" package
decoder := json.NewDecoder(jsonReader)
// Token returns the next JSON token in the input stream.
// At the end of the input stream, Token returns nil, io.EOF.
// In this case our first token is '[', i.e. array start
_, err = decoder.Token()
if err != nil {
log.Fatalln(err)
}
// More reports whether there is another element in the
// current array or object being parsed.
for decoder.More() {
var record RecordStruct
// Decode just one item from our array at a time
if err := decoder.Decode(&record); err != nil {
log.Fatalln(err)
}
// Convert and write our record to the CSV file
if err := writeToCSV(w, record.CSVRecord()); err != nil {
log.Fatalln(err)
}
}
// Our last token is ']', i.e. array end
_, err = decoder.Token()
if err != nil {
log.Fatalln(err)
}
}
func writeToCSV(w *csv.Writer, record []string) error {
if err := w.Write(record); err != nil {
return err
}
w.Flush()
return nil
}
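If you would rather keep csvutil from the question instead of building each row by hand, the same streaming loop works with its Encoder type; this is only a rough sketch, assuming github.com/jszwec/csvutil's NewEncoder/Encode API and the RecordStruct from the question:

import (
	"encoding/csv"
	"encoding/json"
	"io"

	"github.com/jszwec/csvutil"
)

// streamToCSV decodes a JSON array from r element by element and writes each
// record through csvutil's streaming Encoder, so the whole response is never
// held in memory at once.
func streamToCSV(r io.Reader, out io.Writer) error {
	w := csv.NewWriter(out)
	enc := csvutil.NewEncoder(w) // reuses the csv struct tags on RecordStruct

	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		return err
	}
	for dec.More() {
		var rec RecordStruct
		if err := dec.Decode(&rec); err != nil {
			return err
		}
		// Encode writes the header row automatically before the first record.
		if err := enc.Encode(rec); err != nil {
			return err
		}
	}
	if _, err := dec.Token(); err != nil { // consume the closing ']'
		return err
	}
	w.Flush()
	return w.Error()
}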
You can also use third-party packages like github.com/bcicen/jstream.
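For example, jstream emits each element of a top-level array as soon as it has been parsed; this is a rough sketch based on that package's README (treat the exact API as an assumption and check its documentation):

import (
	"fmt"
	"io"

	"github.com/bcicen/jstream"
)

// printRecords prints every JSON value found at depth 1, i.e. each object
// inside the top-level array, as it is decoded.
func printRecords(r io.Reader) {
	decoder := jstream.NewDecoder(r, 1)
	for mv := range decoder.Stream() {
		// mv.Value is the decoded element (a map[string]interface{} for objects);
		// it could be turned into a CSV row as in the snippets above.
		fmt.Printf("%v\n", mv.Value)
	}
}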
Related
I'm currently using gRPC for my API communication. Now I need the value of my request to accept any data type, be it a struct object or just a primitive type like a single integer, string, or boolean. I tried using google.protobuf.Any as below, but it doesn't work when I want to marshal and unmarshal it.
message MyRequest {
google.protobuf.Any item = 1; // could be 9, "a string", true, false, {"key": value}, etc.
}
proto-compiled result:
type MyRequest struct {
Item *anypb.Any `protobuf:"bytes,1,opt,name=item,proto3" json:"item,omitempty"`
}
I will marshal the incoming request through gRPC and save it as a JSON string in the database:
bytes, err = json.Marshal(m.request.Item)
if err != nil {
return err
}
itemStr := string(bytes)
But when I want to unmarshal the string back into *anypb.Any with:
func getItem(itemStr string) (structpb.Value, error) {
var item structpb.Value
err := json.Unmarshal([]byte(itemStr), &item)
if err != nil {
return item, err
}
// also I've no idea on how to convert `structpb.Value` to `anypb.Any`
return item, nil
}
It returns an error which seems to be because the struct *anypb.Any doesn't contain any of the fields in my encoded item string. Is there a correct way to solve this problem?
Consider using anypb
The documentation shows an example of how to unmarshal it:
m := new(foopb.MyMessage)
if err := any.UnmarshalTo(m); err != nil {
... // handle error
}
... // make use of m
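For the two conversions the question asks about, one possible approach is to go through structpb.Value (which is itself a proto message) and to persist the Any with protojson rather than encoding/json, since protojson understands the Any JSON mapping. This is only a sketch under those assumptions, and the helper names are made up:

import (
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/anypb"
	"google.golang.org/protobuf/types/known/structpb"
)

// toAny wraps an arbitrary Go value (number, string, bool, map, slice, nil)
// in an anypb.Any by first converting it to a structpb.Value.
func toAny(v interface{}) (*anypb.Any, error) {
	val, err := structpb.NewValue(v)
	if err != nil {
		return nil, err
	}
	return anypb.New(val)
}

// anyToJSON and jsonToAny round-trip the Any through JSON for storage.
func anyToJSON(a *anypb.Any) (string, error) {
	b, err := protojson.Marshal(a)
	return string(b), err
}

func jsonToAny(s string) (*anypb.Any, error) {
	var a anypb.Any
	if err := protojson.Unmarshal([]byte(s), &a); err != nil {
		return nil, err
	}
	return &a, nil
}

// anyToValue reads the wrapped value back out of the Any.
func anyToValue(a *anypb.Any) (*structpb.Value, error) {
	var val structpb.Value
	if err := a.UnmarshalTo(&val); err != nil {
		return nil, err
	}
	return &val, nil
}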
I'm writing a REST API through which users can register/log in and do a number of other things that require them to send JSON in the request body. The JSON sent in the request body will look something like this for registering:
{
"name": "luke",
"email": "l#g.com",
"date_of_birth": "2012-04-23T18:25:43.511Z",
"location": "Galway",
"profile_image_id": 1,
"password": "password",
"is_subscribee_account": false,
"broker_id": 1
}
This data will then be decoded into a struct and the individual items inserted into the database. I'm not familiar with how I should sanitize the data to remove any JavaScript, HTML, SQL, etc. that may interfere with the correct running of the application and possibly result in a compromised system.
The code below is what I currently have to read in and validate the data, but it doesn't sanitize it for HTML, JavaScript, SQL, etc.
Any input would be appreciated on how I should go about sanitizing it. The REST API is written in Go.
// The code below decodes the JSON into the dst variable.
func (app *application) readJSON(w http.ResponseWriter, r *http.Request, dst interface{}) error {
maxBytes := 1_048_576
r.Body = http.MaxBytesReader(w, r.Body, int64(maxBytes))
dec := json.NewDecoder(r.Body)
dec.DisallowUnknownFields()
err := dec.Decode(dst)
if err != nil {
var syntaxError *json.SyntaxError
var unmarshalTypeError *json.UnmarshalTypeError
var invalidUnmarshalError *json.InvalidUnmarshalError
switch {
case errors.As(err, &syntaxError):
return fmt.Errorf("body contains badly-formed JSON (at character %d)", syntaxError.Offset)
case errors.Is(err, io.ErrUnexpectedEOF):
return errors.New("body contains badly-formed JSON")
case errors.As(err, &unmarshalTypeError):
if unmarshalTypeError.Field != "" {
return fmt.Errorf("body contains incorrect JSON type for field %q", unmarshalTypeError.Field)
}
return fmt.Errorf("body contains incorrect JSON type (at character %d)", unmarshalTypeError.Offset)
case errors.Is(err, io.EOF):
return errors.New("body must not be empty")
case strings.HasPrefix(err.Error(), "json: unknown field "):
fieldName := strings.TrimPrefix(err.Error(), "json: unknown field ")
return fmt.Errorf("body contains unknown key %s", fieldName)
case err.Error() == "http: request body too large":
return fmt.Errorf("body must not be larger than %d bytes", maxBytes)
case errors.As(err, &invalidUnmarshalError):
panic(err)
default:
return err
}
}
err = dec.Decode(&struct{}{})
if err != io.EOF {
return errors.New("body must only contain a single JSON value")
}
return nil
}
// The code below validates the email, password and user. If the first parameter to v.Check() is false, the following two parameters are returned in the response to the user.
func ValidateEmail(v *validator.Validator, email string) {
v.Check(email != "", "email", "must be provided")
v.Check(validator.Matches(email, validator.EmailRX), "email", "must be a valid email address")
}
func ValidatePasswordPlaintext(v *validator.Validator, password string) {
v.Check(password != "", "password", "must be provided")
v.Check(len(password) >= 8, "password", "must be at least 8 bytes long")
v.Check(len(password) <= 72, "password", "must not be more than 72 bytes long")
}
func ValidateUser(v *validator.Validator, user *User) {
v.Check(user.Name != "", "name", "must be provided")
v.Check(len(user.Name) <= 500, "name", "must not be more than 500 bytes long")
ValidateEmail(v, user.Email)
if user.Password.plaintext != nil {
ValidatePasswordPlaintext(v, *user.Password.plaintext)
}
if user.Password.hash == nil {
panic("missing password hash for user")
}
}
How do I serialize the struct data with this in mind, so it won't cause issues if it's later inserted into a webpage? Is there a package function that I can pass the string data to that will return safe data?
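One common approach is to store the validated input as-is, rely on parameterized queries for the SQL side, and escape only at render time; a minimal sketch using just the standard library (html and html/template), with made-up sample data:

package main

import (
	"fmt"
	"html"
	"html/template"
	"os"
)

func main() {
	name := `<script>alert("xss")</script>luke`

	// Escape a single string for safe inclusion in HTML output.
	fmt.Println(html.EscapeString(name))

	// html/template escapes automatically based on the output context,
	// so the stored value itself never needs to be altered.
	tmpl := template.Must(template.New("p").Parse(`<p>Hello, {{.}}!</p>`))
	_ = tmpl.Execute(os.Stdout, name)
}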
We want to unmarshal (in Go) a gRPC message and transform it into a map[string]interface{} for further processing. After using this code:
err := ptypes.UnmarshalAny(resource, config)
configMarshal, err := json.Marshal(config)
var configInterface map[string]interface{}
err = json.Unmarshal(configMarshal, &configInterface)
we get the following structure:
{
"name": "envoy.filters.network.tcp_proxy",
"ConfigType": {
"TypedConfig": {
"type_url": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
"value": "ChBCbGFja0hvbGVDbHVzdGVyEhBCbGFja0hvbGVDbHVzdGVy"
}
}
}
Where the TypedConfig field remains encoded. How can we decode the TypedConfig field? We know the type_url and we know the value, but to unmarshal the field it needs to be of the pbany.Any type. Because the TypedConfig structure is a map[string]interface{}, our program either fails to compile, or it crashes, complaining that it is expecting a pbany.Any type but is instead getting a map[string]interface{}.
We have the following questions:
Is there a way to turn the structure under TypedConfig into a pbany.Any type that can be subsequently unmarshalled?
Is there a way to recursively unmarshal the entire gRPC message?
Edit (more information about the code, schemas and packages used):
We are looking at the code of xds_proxy.go here: https://github.com/istio/istio/blob/master/pkg/istio-agent/xds_proxy.go
This code uses a *discovery.DiscoveryResponse structure in this function:
func forwardToEnvoy(con *ProxyConnection, resp *discovery.DiscoveryResponse) {
The protobuf schema for discovery.DiscoveryResponse (and every other structure used in the code) is in the https://github.com/envoyproxy/go-control-plane/ repository in this file: https://github.com/envoyproxy/go-control-plane/blob/main/envoy/service/discovery/v3/discovery.pb.go
We added code to the forwardToEnvoy function to see the entire unmarshalled contents of the *discovery.DiscoveryResponse structure:
var config proto.Message
switch resp.TypeUrl {
case "type.googleapis.com/envoy.config.route.v3.RouteConfiguration":
config = &route.RouteConfiguration{}
case "type.googleapis.com/envoy.config.listener.v3.Listener":
config = &listener.Listener{}
// Six more cases here, truncated to save space
}
for _, resource := range resp.Resources {
err := ptypes.UnmarshalAny(resource, config)
if err != nil {
proxyLog.Infof("UnmarshalAny err %v", err)
return false
}
configMarshal, err := json.Marshal(config)
if err != nil {
proxyLog.Infof("Marshal err %v", err)
return false
}
var configInterface map[string]interface{}
err = json.Unmarshal(configMarshal, &configInterface)
if err != nil {
proxyLog.Infof("Unmarshal err %v", err)
return false
}
}
And this works well, except that now we have these TypedConfig fields that are still encoded:
{
"name": "virtualOutbound",
"address": {
"Address": {
"SocketAddress": {
"address": "0.0.0.0",
"PortSpecifier": {
"PortValue": 15001
}
}
}
},
"filter_chains": [
{
"filter_chain_match": {
"destination_port": {
"value": 15001
}
},
"filters": [
{
"name": "istio.stats",
"ConfigType": {
"TypedConfig": {
"type_url": "type.googleapis.com/udpa.type.v1.TypedStruct",
"value": "CkF0eXBlLmdvb2dsZWFwaXMuY29tL2Vudm95LmV4dGVuc2lvbnMuZmlsdG"
}
}
},
One way to visualize the contents of the TypedConfig fields is to use this code:
for index1, filterChain := range listenerConfig.FilterChains {
for index2, filter := range filterChain.Filters {
proxyLog.Infof("Listener %d: Handling filter chain %d, filter %d", i, index1, index2)
switch filter.ConfigType.(type) {
case *listener.Filter_TypedConfig:
proxyLog.Infof("Found TypedConfig")
typedConfig := filter.GetTypedConfig()
proxyLog.Infof("typedConfig.TypeUrl = %s", typedConfig.TypeUrl)
switch typedConfig.TypeUrl {
case "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy":
tcpProxyConfig := &tcp_proxy.TcpProxy{}
err := typedConfig.UnmarshalTo(tcpProxyConfig)
if err != nil {
proxyLog.Errorf("Failed to unmarshal TCP proxy configuration")
} else {
proxyLog.Infof("TcpProxy Config for filter chain %d filter %d: %s", index1, index2, prettyPrint(tcpProxyConfig))
}
But then the code becomes very complex, as we have a large number of structures, and these structures can occur in a different order in the messages.
So we wanted a generic way of unmarshalling these TypedConfig messages by using pbany, and hence our questions.
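A generic decoding along these lines might look like the sketch below: Any.UnmarshalNew resolves the type_url against the global proto registry (so the concrete Envoy message types must be linked into the binary), and protojson then expands nested Any fields such as TypedConfig instead of leaving them base64-encoded. The helper name is made up for illustration:

import (
	"encoding/json"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/anypb"
)

// anyToMap renders a single resource without a switch over TypeUrl.
func anyToMap(resource *anypb.Any) (map[string]interface{}, error) {
	// UnmarshalNew looks up the type_url in the global type registry and
	// returns the concrete proto.Message.
	msg, err := resource.UnmarshalNew()
	if err != nil {
		return nil, err
	}
	// protojson renders nested google.protobuf.Any fields as JSON objects
	// with an "@type" key, so TypedConfig becomes readable.
	raw, err := protojson.Marshal(msg)
	if err != nil {
		return nil, err
	}
	var out map[string]interface{}
	err = json.Unmarshal(raw, &out)
	return out, err
}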
I have this JSON that I decode into:
var leerCHAT []interface{}
but I am jumping through crazy hoops to get to any point in that map-inside-map-inside-map craziness, especially because some results have different content.
This is the JSON:
[
null,
null,
"hub:zWXroom",
"presence_diff",
{
"joins":{
"f718a187-6e96-4d62-9c2d-67aedea00000":{
"metas":[
{
"context":{},
"permissions":{},
"phx_ref":"zNDwmfsome=",
"phx_ref_prev":"zDMbRTmsome=",
"presence":"lobby",
"profile":{},
"roles":{}
}
]
}
},
"leaves":{}
}
]
I need to get to profile; inside it there is a "DisplayName" field.
So I have been doing crazy hacks, and even like this I got stuck halfway.
First it is an array, so I can just do something[elementnumber];
then the tricky mapping starts.
Sorry about all the prints etc.; they are there to debug and see the number of elements I am getting back.
if leerCHAT[3] == "presence_diff" {
var id string
presence := leerCHAT[4].(map[string]interface{})
log.Printf("algo: %v", len(presence))
log.Printf("algo: %s", presence["joins"])
vamos := presence["joins"].(map[string]interface{})
for i := range vamos {
log.Println(i)
id = i
}
log.Println(len(vamos))
vamonos := vamos[id].(map[string]interface{})
log.Println(vamonos)
log.Println(len(vamonos))
metas := vamonos["profile"].(map[string]interface{}) // I get an error here
log.Println(len(metas))
}
So far I can see all the way to the metas: {...} part, but I can't continue with my hacky code to get what I need.
NOTICE: since the id after joins and before metas is dynamic, I have to get it somehow; since it is always just one element, I did the for range loop to grab it.
The array element at index 3 describes the type of the variant JSON at index 4.
Here's how to decode the JSON to Go values. First, declare Go types for each of the variant parts of the JSON:
type PresenceDiff struct {
Joins map[string]*Presence // declaration of Presence type to be supplied (see the sketch below)
Leaves map[string]*Presence
}
type Message struct {
Body string
}
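The Presence type above is left to be supplied; one possible declaration, inferred from the "metas" entries in the sample JSON (the exact field set is an assumption based on that payload), is:

type Meta struct {
	Context     map[string]interface{} `json:"context"`
	Permissions map[string]interface{} `json:"permissions"`
	PhxRef      string                 `json:"phx_ref"`
	PhxRefPrev  string                 `json:"phx_ref_prev"`
	Presence    string                 `json:"presence"`
	Profile     map[string]interface{} `json:"profile"` // contains "DisplayName"
	Roles       map[string]interface{} `json:"roles"`
}

type Presence struct {
	Metas []Meta `json:"metas"`
}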
Declare a map associating the type string to the Go type:
var messageTypes = map[string]reflect.Type{
"presence_diff": reflect.TypeOf(&PresenceDiff{}),
"message": reflect.TypeOf(&Message{}),
// add more types here as needed
}
Decode the variant part to a raw message. Then use the name in the element at index 3 to create a value of the appropriate Go type and decode into that value:
func decode(data []byte) (interface{}, error) {
var messageType string
var raw json.RawMessage
v := []interface{}{nil, nil, nil, &messageType, &raw}
err := json.Unmarshal(data, &v)
if err != nil {
return nil, err
}
if len(raw) == 0 {
return nil, errors.New("no message")
}
t := messageTypes[messageType]
if t == nil {
return nil, fmt.Errorf("unknown message type: %q", messageType)
}
result := reflect.New(t.Elem()).Interface()
err = json.Unmarshal(raw, result)
return result, err
}
Use type switches to access the variant part of the message:
defer ws.Close()
for {
_, data, err := ws.ReadMessage()
if err != nil {
log.Printf("Read error: %v", err)
break
}
v, err := decode(data)
if err != nil {
log.Printf("Decode error: %v", err)
continue
}
switch v := v.(type) {
case *PresenceDiff:
fmt.Println(v.Joins, v.Leaves)
case *Message:
fmt.Println(v.Body)
default:
fmt.Printf("type %T not handled\n", v)
}
}
Run it on the playground.
How to parse XML in such a silly format:
<key>KEY1</key><string>VALUE OF KEY1</string>
<key>KEY2</key><string>VALUE OF KEY2</string>
<key>KEY3</key><integer>42</integer>
<key>KEY3</key><array>
<integer>1</integer>
<integer>2</integer>
</array>
Parsing would be very simple if all values had the same type - for example, strings. But in my case each value could be a string, data, integer, boolean, array or dict.
This XML looks nearly like JSON, but unfortunately the format is fixed and I cannot change it. And I would prefer a solution without any external packages.
Use the lower-level parsing interface provided by encoding/xml, which allows you to iterate over the individual tokens in the XML stream (such as "start element", "end element", etc.).
See the Token() method of encoding/xml's Decoder type.
Since the data is not well structured and you can't modify the format, you can't use xml.Unmarshal. Instead, you can process the XML elements by creating a new Decoder, iterating over the tokens, and using DecodeElement to process them one by one. My sample code below puts everything in a map. The code is also on github here...
package main
import (
"encoding/xml"
"strings"
"fmt"
)
type PlistArray struct {
Integer []int `xml:"integer"`
}
const in = "<key>KEY1</key><string>VALUE OF KEY1</string><key>KEY2</key><string>VALUE OF KEY2</string><key>KEY3</key><integer>42</integer><key>KEY3</key><array><integer>1</integer><integer>2</integer></array>"
func main() {
result := map[string]interface{}{}
dec := xml.NewDecoder(strings.NewReader(in))
dec.Strict = false
var workingKey string
for {
token, _ := dec.Token()
if token == nil {
break
}
switch start := token.(type) {
case xml.StartElement:
fmt.Printf("startElement = %+v\n", start)
switch start.Name.Local {
case "key":
var k string
err := dec.DecodeElement(&k, &start)
if err != nil {
fmt.Println(err.Error())
}
workingKey = k
case "string":
var s string
err := dec.DecodeElement(&s, &start)
if err != nil {
fmt.Println(err.Error())
}
result[workingKey] = s
workingKey = ""
case "integer":
var i int
err := dec.DecodeElement(&i, &start)
if err != nil {
fmt.Println(err.Error())
}
result[workingKey] = i
workingKey = ""
case "array":
var ai PlistArray
err := dec.DecodeElement(&ai, &start)
if err != nil {
fmt.Println(err.Error())
}
result[workingKey] = ai
workingKey = ""
default:
fmt.Errorf("Unrecognized token")
}
}
}
fmt.Printf("%+v", result)
}
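The question also mentions boolean and dict values. In a plist-style document booleans appear as self-closing <true/> and <false/> elements, so one possible extension of the inner switch above (an assumption, not part of the original sample) is an extra case:

case "true", "false":
	// The element name itself carries the value; there is no character data.
	result[workingKey] = start.Name.Local == "true"
	// Skip consumes the matching end element so the decoder stays in sync.
	if err := dec.Skip(); err != nil {
		fmt.Println(err.Error())
	}
	workingKey = ""

A nested dict could be handled in the same spirit by recursing into another token loop (building another map) when its start element is seen.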