How to parse a big JSON file with Golang [duplicate] - go

I have a massive JSON array stored in a file ("file.json").
I need to iterate through the array and do some operation on each element.
err = json.Unmarshal(dat, &all_data)
causes an out-of-memory error - I'm guessing because it loads everything into memory first.
Is there a way to stream the JSON element by element?

There is an example of this sort of thing in the encoding/json documentation:
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

func main() {
	const jsonStream = `
	[
		{"Name": "Ed", "Text": "Knock knock."},
		{"Name": "Sam", "Text": "Who's there?"},
		{"Name": "Ed", "Text": "Go fmt."},
		{"Name": "Sam", "Text": "Go fmt who?"},
		{"Name": "Ed", "Text": "Go fmt yourself!"}
	]
	`
	type Message struct {
		Name, Text string
	}
	dec := json.NewDecoder(strings.NewReader(jsonStream))

	// read open bracket
	t, err := dec.Token()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%T: %v\n", t, t)

	// while the array contains values
	for dec.More() {
		var m Message
		// decode an array value (Message)
		err := dec.Decode(&m)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%v: %v\n", m.Name, m.Text)
	}

	// read closing bracket
	t, err = dec.Token()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%T: %v\n", t, t)
}

So, as commenters suggested, you could use the streaming API of "encoding/json" to read one string at a time:
r := ... // get some io.Reader (e.g. open the big array file)
d := json.NewDecoder(r)

// read "["
d.Token()

// read strings one by one
for d.More() {
	s, _ := d.Token()
	// do something with s which is the newly read string
	fmt.Printf("read %q\n", s)
}

// (optionally) read "]"
d.Token()
Note that for simplicity I've left out error handling, which needs to be implemented.
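For completeness, a minimal sketch of the same loop with the error handling filled in, assuming the file holds a JSON array of strings (imports: "encoding/json", "fmt", "log", "os"):
f, err := os.Open("file.json")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

d := json.NewDecoder(f)

// read "["
if _, err := d.Token(); err != nil {
	log.Fatal(err)
}

// read strings one by one
for d.More() {
	t, err := d.Token()
	if err != nil {
		log.Fatal(err)
	}
	s, ok := t.(string)
	if !ok {
		log.Fatalf("expected a string element, got %T", t)
	}
	fmt.Printf("read %q\n", s)
}

// read "]"
if _, err := d.Token(); err != nil {
	log.Fatal(err)
}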


Golang REST API request with token auth to JSON array response

EDIT
This is the working code in case someone finds it useful. The title of this question was originally
"How to parse a list of dicts in golang".
That title was incorrect because I was referencing terms I'm familiar with from Python.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

// Region models the API response: a JSON array of region objects
type Region []struct {
	Region      string `json:"region"`
	Description string `json:"Description"`
	ID          int    `json:"Id"`
	Name        string `json:"Name"`
	Status      int    `json:"Status"`
	Nodes       []struct {
		NodeID    int    `json:"NodeId"`
		Code      string `json:"Code"`
		Continent string `json:"Continent"`
		City      string `json:"City"`
	} `json:"Nodes"`
}

// working request and response
func main() {
	url := "https://api.geo.com"
	// Create a Bearer string by appending the access token
	var bearer = "TOK:" + "TOKEN"
	// Create a new request using http
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Fatalln("Error creating the request:", err)
	}
	// add authorization header to the req
	req.Header.Add("Authorization", bearer)
	// This is what the response from the API looks like
	//regionJson := `[{"region":"GEO:ABC","Description":"ABCLand","Id":1,"Name":"ABCLand [GEO-ABC]","Status":1,"Nodes":[{"NodeId":17,"Code":"LAX","Continent":"North America","City":"Los Angeles"},{"NodeId":18,"Code":"LBC","Continent":"North America","City":"Long Beach"}]},{"region":"GEO:DEF","Description":"DEFLand","Id":2,"Name":"DEFLand","Status":1,"Nodes":[{"NodeId":15,"Code":"NRT","Continent":"Asia","City":"Narita"},{"NodeId":31,"Code":"TYO","Continent":"Asia","City":"Tokyo"}]}]`

	// Send req using http Client
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalln("Error on response:", err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalln("Error while reading the response bytes:", err)
	}

	// Region is already a slice type, so unmarshal into a single Region
	var regions Region
	if err := json.Unmarshal(body, &regions); err != nil {
		log.Fatalln("Error unmarshaling the response:", err)
	}
	fmt.Printf("Regions: %+v", regions)
}
Have a look at this playground example for some pointers.
Here's the code:
package main

import (
	"encoding/json"
	"log"
)

func main() {
	b := []byte(`
	[
		{"key": "value", "key2": "value2"},
		{"key": "value", "key2": "value2"}
	]`)
	var mm []map[string]string
	if err := json.Unmarshal(b, &mm); err != nil {
		log.Fatal(err)
	}
	for _, m := range mm {
		for k, v := range m {
			log.Printf("%s [%s]", k, v)
		}
	}
}
I reformatted the API response you included because it is not valid JSON.
In Go you generally define types to match the JSON schema (for a simple case like this, a map works too).
I don't know why the API appends % to the end of the result, so I've ignored that. If it is included, you will need to trim it from the results before unmarshaling.
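For instance, a minimal sketch of that cleanup, assuming the raw response is held in a []byte named body (requires importing "bytes"):
// strip trailing whitespace and a stray "%" before unmarshaling
body = bytes.TrimSuffix(bytes.TrimSpace(body), []byte("%"))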
What you get from the unmarshaling is a slice of maps. Then, you can iterate over the slice to get each map and then iterate over each map to extract the keys and values.
Update
In your updated question, you include a different JSON schema, and this change must be reflected in the Go code by updating the types. There are some other errors in your code. Per my comment, I encourage you to spend some time learning the language.
package main

import (
	"bytes"
	"encoding/json"
	"io/ioutil"
	"log"
)

// Response is a type that represents the API response
type Response []Record

// Record is a type that represents the individual records
// The name Record is arbitrary as it is unnamed in the response
// Golang supports struct tags to map the JSON properties
// e.g. JSON "region" maps to a Golang field "Region"
type Record struct {
	Region      string `json:"region"`
	Description string `json:"description"`
	ID          int    `json:"id"`
	Nodes       []Node
}

type Node struct {
	NodeID int    `json:"NodeId"`
	Code   string `json:"Code"`
}

func main() {
	// A slice of byte representing your example response
	b := []byte(`[{
		"region": "GEO:ABC",
		"Description": "ABCLand",
		"Id": 1,
		"Name": "ABCLand [GEO-ABC]",
		"Status": 1,
		"Nodes": [{
			"NodeId": 17,
			"Code": "LAX",
			"Continent": "North America",
			"City": "Los Angeles"
		}, {
			"NodeId": 18,
			"Code": "LBC",
			"Continent": "North America",
			"City": "Long Beach"
		}]
	}, {
		"region": "GEO:DEF",
		"Description": "DEFLand",
		"Id": 2,
		"Name": "DEFLand",
		"Status": 1,
		"Nodes": [{
			"NodeId": 15,
			"Code": "NRT",
			"Continent": "Asia",
			"City": "Narita"
		}, {
			"NodeId": 31,
			"Code": "TYO",
			"Continent": "Asia",
			"City": "Tokyo"
		}]
	}]`)

	// To more closely match your code, create a Reader
	rdr := bytes.NewReader(b)

	// This matches your code, read from the Reader
	body, err := ioutil.ReadAll(rdr)
	if err != nil {
		// Use Printf to format strings
		log.Printf("Error while reading the response bytes\n%s", err)
	}

	// Initialize a variable of type Response
	resp := &Response{}

	// Try unmarshaling the body into it
	if err := json.Unmarshal(body, resp); err != nil {
		log.Fatal(err)
	}

	// Print the result
	log.Printf("%+v", resp)
}

govalidator ValidateMap validate map of array

I want to validate my input map containing an array with govalidator.ValidateMap.
Can someone please suggest a sample mapTemplate for a map containing an array?
Please find the code snippet below.
Thanks in advance.
package main

import (
	"fmt"

	"github.com/asaskevich/govalidator"
)

func main() {
	var mapTemplate = map[string]interface{}{
		"name":       "required,alpha",
		"categories": []interface{}{",alpha"}, // error: map validator has to be either map[string]interface{} or string; got []interface {}
	}
	var inputMap = map[string]interface{}{
		"name":       "Prabhu",
		"categories": []interface{}{"category1", "category2"},
	}
	result, err := govalidator.ValidateMap(inputMap, mapTemplate)
	if err != nil {
		fmt.Println("error: " + err.Error())
	}
	fmt.Printf("result : %v\n", result)
	for _, v := range inputMap["categories"].([]interface{}) {
		fmt.Printf("category : %v\n", v)
	}
}
It seems validation of slices has not yet been implemented: the item for slices/arrays in the library's "What to contribute" list is unchecked.
You can, however, use the function ValidateArray to iterate over a slice and validate its members:
govalidator.ValidateArray(inputMap["categories"].([]interface{}), func(val interface{}, i int) bool {
	valStr, ok := val.(string)
	if !ok {
		return false
	}
	return govalidator.IsAlpha(valStr)
})
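Putting it together, a minimal runnable sketch using the question's inputMap (note the type assertion on the slice, since the map values are interface{}):
package main

import (
	"fmt"

	"github.com/asaskevich/govalidator"
)

func main() {
	inputMap := map[string]interface{}{
		"name":       "Prabhu",
		"categories": []interface{}{"category1", "category2"},
	}

	// ValidateArray applies the callback to every element and
	// returns true only if all elements pass
	allAlpha := govalidator.ValidateArray(inputMap["categories"].([]interface{}), func(val interface{}, i int) bool {
		s, ok := val.(string)
		if !ok {
			return false
		}
		return govalidator.IsAlpha(s)
	})

	// "category1" and "category2" contain digits, so IsAlpha fails
	// for them and this prints false; use IsAlphanumeric instead if
	// digits should be allowed
	fmt.Printf("all categories alpha: %v\n", allAlpha)
}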

Stream data from a request response into a CSV in go

I have something like a data pipeline.
API response (10k rows) as JSON
=> sanitize some of the data into a new structure
=> create a CSV file
I can currently do that by getting the full response and doing those steps one by one.
I was wondering if there's a simpler way to stream the response, reading it into CSV rows right away and writing them to the file as the response comes in.
Current code:
I will have JSON like { "name": "Full Name", ... (20 columns) } and that data repeats about 10-20k times with different values.
For request
var res *http.Response
if res, err = client.Do(request); err != nil {
	return errors.Wrap(err, "failed to perform request")
}
For Unmarshal
var record []RecordStruct
if err = json.NewDecoder(res.Body).Decode(&record); err != nil {
	return err
}
For CSV
var row []byte
if row, err = csvutil.Marshal(record); err != nil {
	return err
}
To stream an array of JSON objects you have to decode the nested objects instead of the root object. To do this you need to read the data using tokens (check out the Token method). According to the documentation:
Token returns the next JSON token in the input stream. At the end of the input stream, Token returns nil, io.EOF.
Token guarantees that the delimiters [ ] { } it returns are properly nested and matched: if Token encounters an unexpected delimiter in the input, it will return an error.
The input stream consists of basic JSON values—bool, string, number, and null—along with delimiters [ ] { } of type Delim to mark the start and end of arrays and objects. Commas and colons are elided.
That means you can decode the document part by part. You can find an official example of how to do it here.
I will post a code snippet that shows how you can combine the JSON streaming technique with writing the result to a CSV file:
package main

import (
	"encoding/csv"
	"encoding/json"
	"log"
	"os"
	"strings"
)

type RecordStruct struct {
	Name string `json:"name"`
	Info string `json:"info"`
	// ... any field you want
}

func (rs *RecordStruct) CSVRecord() []string {
	// Here we form data for the CSV writer
	return []string{rs.Name, rs.Info}
}

const jsonData = `[
	{ "name": "Full Name", "info": "..."},
	{ "name": "Full Name", "info": "..."},
	{ "name": "Full Name", "info": "..."},
	{ "name": "Full Name", "info": "..."},
	{ "name": "Full Name", "info": "..."}
]`

func main() {
	// Create a file for storing our result
	file, err := os.Create("result.csv")
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()

	// Create a CSV writer using the standard "encoding/csv" package
	var w = csv.NewWriter(file)

	// Put your reader here. In this case I use strings.Reader.
	// If you are getting data through http it will be resp.Body
	var jsonReader = strings.NewReader(jsonData)

	// Create a JSON decoder using the "encoding/json" package
	decoder := json.NewDecoder(jsonReader)

	// Token returns the next JSON token in the input stream.
	// At the end of the input stream, Token returns nil, io.EOF.
	// In this case our first token is '[', i.e. the array start
	_, err = decoder.Token()
	if err != nil {
		log.Fatalln(err)
	}

	// More reports whether there is another element in the
	// current array or object being parsed.
	for decoder.More() {
		var record RecordStruct
		// Decode just one item from our array
		if err := decoder.Decode(&record); err != nil {
			log.Fatalln(err)
		}
		// Convert and write our record to the csv file
		if err := writeToCSV(w, record.CSVRecord()); err != nil {
			log.Fatalln(err)
		}
	}

	// Our last token is ']', i.e. the array end
	_, err = decoder.Token()
	if err != nil {
		log.Fatalln(err)
	}
}

func writeToCSV(w *csv.Writer, record []string) error {
	if err := w.Write(record); err != nil {
		return err
	}
	w.Flush()
	return nil
}
You can also use third-party packages like github.com/bcicen/jstream.
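For reference, a minimal sketch based on jstream's README (the file name is a placeholder); the second argument to NewDecoder is the depth at which values are emitted, so 1 yields each element of a top-level array:
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/bcicen/jstream"
)

func main() {
	f, err := os.Open("large-file.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// emit each element of the top-level array (depth 1) as it
	// is decoded, without loading the whole document into memory
	for mv := range jstream.NewDecoder(f, 1).Stream() {
		fmt.Printf("%v\n", mv.Value)
	}
}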

Decode very big json to array of structs

I have a web application which has a REST API, gets JSON as input and performs transformations on this JSON.
Here is my code:
func (a *API) getAssignments(w http.ResponseWriter, r *http.Request) {
	var document DataPacket
	err := json.NewDecoder(r.Body).Decode(&document)
	if err != nil {
		a.handleJSONParseError(err, w)
		return
	}
	// transformations
The JSON I get is a collection of structs. An external application uses my application and sends me very big JSON files (300-400MB). Decoding this JSON all at once takes a very long time and a large amount of memory.
Is there any way to work with this JSON as a stream and decode the structs from this collection one by one?
First, read the documentation.
Package json
import "encoding/json"
func (*Decoder) Decode
func (dec *Decoder) Decode(v interface{}) error
Decode reads the next JSON-encoded value from its input and stores it
in the value pointed to by v.
Example (Stream): This example uses a Decoder to decode a streaming array of JSON
objects.
Playground: https://play.golang.org/p/o6hD-UV85SZ
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

func main() {
	const jsonStream = `
	[
		{"Name": "Ed", "Text": "Knock knock."},
		{"Name": "Sam", "Text": "Who's there?"},
		{"Name": "Ed", "Text": "Go fmt."},
		{"Name": "Sam", "Text": "Go fmt who?"},
		{"Name": "Ed", "Text": "Go fmt yourself!"}
	]
	`
	type Message struct {
		Name, Text string
	}
	dec := json.NewDecoder(strings.NewReader(jsonStream))

	// read open bracket
	t, err := dec.Token()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%T: %v\n", t, t)

	// while the array contains values
	for dec.More() {
		var m Message
		// decode an array value (Message)
		err := dec.Decode(&m)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%v: %v\n", m.Name, m.Text)
	}

	// read closing bracket
	t, err = dec.Token()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%T: %v\n", t, t)
}
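Applied to the handler in the question, a minimal sketch, assuming the request body is a single JSON array whose elements each decode into a DataPacket:
func (a *API) getAssignments(w http.ResponseWriter, r *http.Request) {
	dec := json.NewDecoder(r.Body)

	// read the opening "[" of the array
	if _, err := dec.Token(); err != nil {
		a.handleJSONParseError(err, w)
		return
	}

	// decode and process one DataPacket at a time, so the
	// whole 300-400MB body is never held in memory at once
	for dec.More() {
		var document DataPacket
		if err := dec.Decode(&document); err != nil {
			a.handleJSONParseError(err, w)
			return
		}
		// transformations on document
	}

	// read the closing "]"
	if _, err := dec.Token(); err != nil {
		a.handleJSONParseError(err, w)
		return
	}
}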

Parsing plist xml

How do I parse XML in such a silly format:
<key>KEY1</key><string>VALUE OF KEY1</string>
<key>KEY2</key><string>VALUE OF KEY2</string>
<key>KEY3</key><integer>42</integer>
<key>KEY3</key><array>
	<integer>1</integer>
	<integer>2</integer>
</array>
Parsing would be very simple if all values had the same type - for example, strings. But in my case each value could be a string, data, integer, boolean, array or dict.
This XML looks nearly like JSON, but unfortunately the format is fixed and I cannot change it. I would prefer a solution without any external packages.
Use the lower-level parsing interface provided by encoding/xml, which allows you to iterate over the individual tokens in the XML stream (such as "start element", "end element", etc.).
See the Token() method of the encoding/xml Decoder type.
Since the data is not well structured and you can't modify the format, you can't use xml.Unmarshal. Instead, process the XML elements by creating a new Decoder, then iterate over the tokens and use DecodeElement to process them one by one. In my sample code below, it puts everything in a map. The code is also on github here...
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// PlistArray captures an <array> of <integer> elements
type PlistArray struct {
	Integer []int `xml:"integer"`
}

const in = "<key>KEY1</key><string>VALUE OF KEY1</string><key>KEY2</key><string>VALUE OF KEY2</string><key>KEY3</key><integer>42</integer><key>KEY3</key><array><integer>1</integer><integer>2</integer></array>"

func main() {
	result := map[string]interface{}{}
	dec := xml.NewDecoder(strings.NewReader(in))
	dec.Strict = false
	var workingKey string
	for {
		token, _ := dec.Token()
		if token == nil {
			break
		}
		switch start := token.(type) {
		case xml.StartElement:
			fmt.Printf("startElement = %+v\n", start)
			switch start.Name.Local {
			case "key":
				var k string
				err := dec.DecodeElement(&k, &start)
				if err != nil {
					fmt.Println(err.Error())
				}
				workingKey = k
			case "string":
				var s string
				err := dec.DecodeElement(&s, &start)
				if err != nil {
					fmt.Println(err.Error())
				}
				result[workingKey] = s
				workingKey = ""
			case "integer":
				var i int
				err := dec.DecodeElement(&i, &start)
				if err != nil {
					fmt.Println(err.Error())
				}
				result[workingKey] = i
				workingKey = ""
			case "array":
				var ai PlistArray
				err := dec.DecodeElement(&ai, &start)
				if err != nil {
					fmt.Println(err.Error())
				}
				result[workingKey] = ai
				workingKey = ""
			default:
				// fmt.Errorf only builds an error value; print instead
				fmt.Println("unrecognized token:", start.Name.Local)
			}
		}
	}
	fmt.Printf("%+v", result)
}
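The question also mentions booleans; in plist XML those are the empty elements <true/> and <false/>. As a sketch, the inner switch above could be extended with one more case:
case "true", "false":
	// plist booleans are empty elements such as <true/>;
	// record the value and consume the element's end tag
	result[workingKey] = start.Name.Local == "true"
	if err := dec.Skip(); err != nil {
		fmt.Println(err.Error())
	}
	workingKey = ""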
