C# - Writing JSON to Parquet file does not work with nested JSON

I am trying to convert JSON data to a Parquet file. Below is my input:
{"time": 1637045320491, "device": {"type_id": 1}, "message": "Test message", "metadata": {"product": {"name": "prodName", "vendor_name": "XYZ"}, "version": "1.0.0", "original_time": "2021-11-16T12:18:40.491893+05:30"}, "tuid": 900201, "cluid": 9002, "activity_id": 1, "severity_id": 4, "cuid": 9}
My output is not properly nested JSON. Instead, it comes out flattened, as shown below.
{"time":1637045320491,"device_type_id":1,"message":"Test message","metadata_product_name":"prodName","metadata_product_vendor_name":"XYZ","metadata_version":"1.0.0","metadata_original_time":24520491893000,"tuid":900201,"cluid":9002,"activity_id":1,"severity_id":4,"cuid":9}
Can someone help me get the output to match the structure of the input JSON?
I'm using the ChoETL package to convert to a Parquet file.
var pqFile = @"D:\Data\" + Guid.NewGuid() + ".parquet";
using (var r = new ChoJSONReader(@"D:\Data\json-dump.json"))
{
    using (var w = new ChoParquetWriter(pqFile))
    {
        w.Write(r);
    }
}

Related

Null pointer exception while consuming streams

{
  "rules": [
    {
      "rank": 1,
      "grades": [
        { "id": 100, "hierarchyCode": 32 },
        { "id": 200, "hierarchyCode": 33 }
      ]
    },
    {
      "rank": 2,
      "grades": []
    }
  ]
}
I have a JSON like the above, and I'm using streams to return "hierarchyCode" based on some condition. For example, if I pass "200", my result should print 33. So far I have something like this:
request.getRules().stream()
    .flatMap(ruleDTO -> ruleDTO.getGrades().stream())
    .map(gradeDTO -> gradeDTO.getHierarchyCode())
    .forEach(hierarchyCode -> {
        // I'm doing some business logic here
        Optional<SomePojo> dsf = someList.stream()
            .filter(pojo -> hierarchyCode.equals(pojo.getId())) // let's say pojo.getId() returns 200
            .findFirst();
        System.out.println(dsf.get().getCode());
    });
So in the first iteration it returns the expected 33, but the second iteration fails with a NullPointerException instead of just skipping the loop, even though the "grades" array is simply empty that time. How do I handle the null pointer exception here?
You can use the below code snippet, using Java 8:
int result;
int valueToFilter = 200;
List<Grade> gradeList = data.getRules().stream()
    .map(Rule::getGrades)
    .filter(x -> x != null && !x.isEmpty())
    .flatMap(Collection::stream)
    .collect(Collectors.toList());
Optional<Grade> optional = gradeList.stream()
    .filter(x -> x.getId() == valueToFilter)
    .findFirst();
if (optional.isPresent()) {
    result = optional.get().getHierarchyCode();
    System.out.println(result);
}
I have created POJOs according to my code; you can try this approach with your code structure.
In case you need the POJOs for this code, I will share them as well.
Thanks,
Girdhar

Concatenate declared variable with random string

Using Terratest, it is possible to declare a tfvars file with the following variable:
bar = {
  name    = "test"
  domain  = "test.com"
  regions = [
    { location = "France Central", alias = "france" }
  ]
}
But is it possible to include a random prefix in the bar.domain string from inside the Go code?
I'm using terraformOptions as follows:
terraformOptions := &terraform.Options{
    TerraformDir: sourcePath,
    VarFiles:     []string{variablesPath + "/integration.tfvars"},
}
It is not ideal to use the tfvars file directly for test input. More on this here.
To answer your question, you can use something similar to this:
options := terraform.Options{
    TerraformDir: "sourcePath",
    Vars: map[string]interface{}{
        "bar": map[string]interface{}{
            "name":   "test",
            "domain": addRandomprefix() + "test.com",
            "regions": []map[string]interface{}{
                {"location": "France Central", "alias": "france"},
            },
        },
    },
}
Just create your own custom addRandomprefix() method. I hope this helps :)
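For reference, a minimal sketch of such a helper, assuming Terratest's random package is available (the helper name matches the answer above; the lowercasing and trailing hyphen are just illustrative choices for a DNS-friendly prefix):

import (
    "strings"

    "github.com/gruntwork-io/terratest/modules/random"
)

// addRandomprefix returns a short random, DNS-friendly prefix such as
// "x4k2q9-", so each test run gets a unique domain name.
func addRandomprefix() string {
    // random.UniqueId returns a random 6-character base-62 string.
    return strings.ToLower(random.UniqueId()) + "-"
}

With that, "domain" resolves to something like "x4k2q9-test.com", so parallel test runs don't collide on the same name.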

Get top level parent node for any node

Given the format at the end of the question, what's the best way to get the top-level name for a given item?
Top-level names are the ones with parentId = 1.
def getTopLevel(name: String): String = {
  // Environment(150) -> Environment(150) - since its parentId is 1
  // Assassination -> Security - since Assassination(12) -> Terrorism(10) -> Security(2)
}
Here's my current approach, but is there something better?
unmapped = categories.size
Loop through the list while there are still unmapped items:
- build a Map(Int, String) for the top levels.
- build a Map(Int, Int) that maps an id to its top-level id.
- keep track of unmapped items.
Once the loop exits, I can use both Maps to get the job done.
[
  { "name": "Destination Overview", "id": 1,     "parentId": null },
  { "name": "Environment",          "id": 150,   "parentId": 1 },
  { "name": "Security",             "id": 2,     "parentId": 1 },
  { "name": "Armed Conflict",       "id": 10223, "parentId": 2 },
  { "name": "Civil Unrest",         "id": 21,    "parentId": 2 },
  { "name": "Terrorism",            "id": 10,    "parentId": 2 },
  { "name": "Assassination",        "id": 12,    "parentId": 10 }
]
This is actually two questions:
1. parsing JSON into a Scala collection, and
2. using that collection to trace items back to the top parent.
For the first question, you can use play-json. The second part can be handled with a tail-recursive function. Here is the full program that solves both problems:
import play.api.libs.json.{Json, Reads}

import scala.annotation.tailrec

case class Node(name: String, id: Int, parentId: Option[Int])

object JsonParentFinder {

  def main(args: Array[String]): Unit = {
    val s =
      """
        |[
        |  {
        |    "name": "Destination Overview",
        |    "id": 1,
        |    "parentId": null
        |  },
        |  {
        |    "name": "Environment",
        |    "id": 150,
        |    "parentId": 1
        |  },
        // rest of the json
        |]
        |""".stripMargin

    implicit val NodeReads: Reads[Node] = Json.reads[Node]

    val r = Json.parse(s).as[Seq[Node]].map(x => x.id -> x).toMap

    println(getTopLevelNode(150, r))
    println(getTopLevelNode(12, r))
  }

  @tailrec
  def getTopLevelNode(itemId: Int, nodes: Map[Int, Node], path: List[Node] = List.empty[Node]): List[Node] =
    if (nodes(itemId).id == 1)
      nodes(itemId) +: path
    else
      getTopLevelNode(nodes(nodes(itemId).parentId.get).id, nodes, nodes(itemId) +: path)
}
Output will be:
List(Node(Destination Overview,1,None), Node(Environment,150,Some(1)))
List(Node(Destination Overview,1,None), Node(Security,2,Some(1)), Node(Terrorism,10,Some(2)), Node(Assassination,12,Some(10)))
A few notes:
- I have not implemented comprehensive error-handling logic. The implicit assumption is that the only item with parentId == None is the root node; nodes(itemId).parentId.get could otherwise fail.
- In creating the map, the assumption is that all items have unique ids.
- Another assumption is that all nodes eventually have a path to the root node. If that is not the case, this will fail, but it should be straightforward to handle such cases by adding more stop conditions.
- I am prepending items to the accumulator list (named path here) because the prepend operation on Scala's List takes constant time. You can reverse the resulting list at the end, or use another data structure such as Vector, to build the path efficiently.

Unmarshal custom types with jsonpb

What's the best way to convert this json object to protobuf?
JSON:
{
  "name": "test",
  "_list": {
    "some1": { "value": 1 },
    "some2": [
      { "value": 2 },
      { "value": 3 }
    ]
  }
}
Proto:
message Something {
string name = 1;
message ListType {
repeated string = 1;
}
map<string, ListType> _list = 2;
}
Without the _list in the message I would use jsonpb.Unmarshal, but I can't think of a way to define the Unmarshaler interface on a type that is generated in a different package.
I also thought of making _list an Any (json.RawMessage) and handling it after the Unmarshal, but I can't make this work; the error message is: Any JSON doesn't have '@type'.
With _list being inconsistent (not just a list of strings, a map of values, etc.) and you mentioning that you looked into using Any, you could consider making your message:
message Something {
  string name = 1;
  google.protobuf.Struct _list = 2;
}
https://github.com/golang/protobuf/blob/master/ptypes/struct/struct.proto
With that you can marshal/unmarshal JSON to/from proto messages using github.com/golang/protobuf/jsonpb, which is actually designed for use with the gRPC gateway, but you can use it here too.
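For reference, a minimal sketch of the unmarshal side, assuming the proto above is compiled into a hypothetical pb package (the import path and payload literal are illustrative):

package main

import (
    "fmt"
    "strings"

    "github.com/golang/protobuf/jsonpb"

    pb "example.com/yourapp/gen" // hypothetical import path for the generated code
)

func main() {
    payload := `{"name": "test", "_list": {"some1": {"value": 1}, "some2": [{"value": 2}, {"value": 3}]}}`

    var msg pb.Something
    // jsonpb understands the well-known types, so the arbitrary JSON
    // under "_list" lands in the google.protobuf.Struct field as-is.
    if err := jsonpb.Unmarshal(strings.NewReader(payload), &msg); err != nil {
        panic(err)
    }
    fmt.Println(msg.GetName()) // test
}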

Go: Removing duplicate rows after SQL join result

I'm running a joined SQL query for locations and events (occurring at the locations). In the results, the location data is naturally replicated per row, as there's a one-to-many relationship: one location holds multiple events.
What's an optimal approach to cleaning up the multiplied location data?
Staying with a single SQL operation, what makes the most sense is performing a check while looping through the query results (rows).
However, I cannot seem to check the locations slice for a pre-existing location ID.
Edit:
This is the SQL output. As you see, location data naturally occurs multiple times because it's shared across events. Ultimately this will be sent out as JSON with nested structs: one for locations, one for events.
id  title        latlng                   id  title           locationid
1   Fox Thea...  43.6640673,-79.4213863   1   Bob's Event     1
1   Fox Thea...  43.6640673,-79.4213863   2   Jill's Event    1
2   Wrigley ...  43.6640673,-79.4213863   3   Mary's Event    2
3   Blues Bar    43.6640673,-79.4213863   4   John's Event    3
1   Fox Thea...  43.6640673,-79.4213863   5   Monthly G...    1
1   Fox Thea...  43.6640673,-79.4213863   6   A Special...    1
1   Fox Thea...  43.6640673,-79.4213863   7   The Final...    1
The JSON output. As you see, the location data is multiplied, making for a larger JSON file.
{
  "Locations": [
    { "ID": 1, "Title": "Fox Theatre", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 1, "Title": "Fox Theatre", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 2, "Title": "Wrigley Field", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 3, "Title": "Blues Bar", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 1, "Title": "Fox Theatre", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 1, "Title": "Fox Theatre", "Latlng": "43.6640673,-79.4213863" },
    { "ID": 1, "Title": "Fox Theatre", "Latlng": "43.6640673,-79.4213863" }
  ],
  "Events": [
    { "ID": 1, "Title": "Bob's Event", "Location": 1 },
    { "ID": 2, "Title": "Jill's Event", "Location": 1 },
    { "ID": 3, "Title": "Mary's Event", "Location": 2 },
    { "ID": 4, "Title": "John's Event", "Location": 3 },
    { "ID": 5, "Title": "Monthly Gathering", "Location": 1 },
    { "ID": 6, "Title": "A Special Event", "Location": 1 },
    { "ID": 7, "Title": "The Final Contest", "Location": 1 }
  ]
}
Structs:
// Event type
type Event struct {
    ID         int    `schema:"id"`
    Title      string `schema:"title"`
    LocationID int    `schema:"locationid"`
}

// Location type
type Location struct {
    ID     int    `schema:"id"`
    Title  string `schema:"title"`
    Latlng string `schema:"latlng"`
}

// LocationsEvents type
type LocationsEvents struct {
    Locations []Location `schema:"locations"`
    Events    []Event    `schema:"events"`
}
Function running the query and looping through rows:
func getLocationsEvents(db *sql.DB, start, count int) ([]Location, []Event, error) {
    var locations = []Location{}
    var events = []Event{}
    rows, err := db.Query("SELECT locations.id, locations.title, locations.latlng, events.id, events.title, events.locationid FROM locations LEFT JOIN events ON locations.id = events.locationid LIMIT ? OFFSET ?", count, start)
    if err != nil {
        return locations, events, err
    }
    defer rows.Close()
    for rows.Next() {
        var location Location
        var event Event
        err := rows.Scan(&location.ID, &location.Title, &location.Latlng, &event.ID, &event.Title, &event.LocationID)
        if err != nil {
            return locations, events, err
        }
        // Here I can print locations and see it getting longer with each loop iteration
        fmt.Println(locations)
        // How can I check if an ID exists in locations? (see the sketch after this function)
        // Ideally, if location.ID already exists in locations, then only append the event; otherwise, append both the location and the event
        locations = append(locations, location)
        events = append(events, event)
    }
    return locations, events, nil
}
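For what it's worth, the check asked about in the comments could look like this sketch of the loop body, tracking already-seen location IDs in a map (the seen set is an assumption, not part of the original code):

// Track which location IDs have already been appended.
seen := make(map[int]bool)
for rows.Next() {
    var location Location
    var event Event
    if err := rows.Scan(&location.ID, &location.Title, &location.Latlng, &event.ID, &event.Title, &event.LocationID); err != nil {
        return locations, events, err
    }
    // Append the location only the first time its ID shows up;
    // events are appended unconditionally.
    if !seen[location.ID] {
        seen[location.ID] = true
        locations = append(locations, location)
    }
    events = append(events, event)
}

Since map lookups take constant time, this avoids rescanning the locations slice on every row.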
Function called on by router:
func (a *App) getLocationsEventsJSON(w http.ResponseWriter, r *http.Request) {
    count := 99
    start := 0
    if count > 10 || count < 1 {
        count = 10
    }
    if start < 0 {
        start = 0
    }
    locations, events, err := getLocationsEvents(a.DB, start, count)
    if err != nil {
        respondWithError(w, http.StatusInternalServerError, err.Error())
        return
    }
    var locationsEvents LocationsEvents
    locationsEvents.Locations = locations
    locationsEvents.Events = events
    respondWithJSON(w, http.StatusOK, locationsEvents)
}
Function sending data out as JSON (part of REST API):
func respondWithJSON(w http.ResponseWriter, code int, payload interface{}) {
    response, _ := json.Marshal(payload)
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    w.Write(response)
}
UPDATE:
Reverting to doing this with the SQL query, what are the possibilities? Using GROUP BY? Here is an example query:
SELECT locations.id, locations.title, locations.latlng, events.id, events.title, events.locationid
FROM locations
LEFT JOIN events ON locations.id = events.locationid
GROUP BY locations.id, events.id
The result set still contains duplicated location data; however, it's nicely grouped and sorted.
Then there's the possibility of sub-queries: http://www.w3resource.com/sql/subqueries/understanding-sql-subqueries.php - but now I'm running multiple SQL queries, something I wanted to avoid.
In reality I don't think I can avoid the duplicated location data when using a single join query like I am. How else would I receive a result set of joined data without having the location data replicated? By having the SQL server send me pre-made JSON as I need it (locations and events separated)? From my understanding, it's better to do that work after receiving the results.
I think you can split your request in two: locations (SELECT * FROM locations) and events (SELECT * FROM events), and then pass both to the JSON marshaller.
These two queries will be very easy and fast for the database to perform, and it will be easier to cache their intermediate results.

but now I'm running multiple SQL queries, something I wanted to avoid.

Could you please clarify why you want to avoid multiple queries? What task do you want to solve, and what limitations do you have? Sometimes a set of small, easy queries is better than one overcomplicated one.
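For illustration, a minimal sketch of that two-query approach, reusing the Location and Event structs from the question (the column lists are assumptions based on those structs):

func getLocationsEventsSplit(db *sql.DB) ([]Location, []Event, error) {
    var locations []Location
    var events []Event

    // One query per table: no join, so no duplicated location rows.
    locRows, err := db.Query("SELECT id, title, latlng FROM locations")
    if err != nil {
        return nil, nil, err
    }
    defer locRows.Close()
    for locRows.Next() {
        var l Location
        if err := locRows.Scan(&l.ID, &l.Title, &l.Latlng); err != nil {
            return nil, nil, err
        }
        locations = append(locations, l)
    }

    evtRows, err := db.Query("SELECT id, title, locationid FROM events")
    if err != nil {
        return nil, nil, err
    }
    defer evtRows.Close()
    for evtRows.Next() {
        var e Event
        if err := evtRows.Scan(&e.ID, &e.Title, &e.LocationID); err != nil {
            return nil, nil, err
        }
        events = append(events, e)
    }

    return locations, events, nil
}

The two slices can then be assigned to LocationsEvents and sent through respondWithJSON exactly as in the existing code.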
If you are querying the database yourself, you should be able to avoid any duplicates in the first place.
At the end of your query, add "GROUP BY {unique field}".
An example that should give a unique list of the locations that appear in your event list:
SELECT location.id, location.title, location.latlng
FROM location
INNER JOIN event ON event.locationid = location.id
GROUP BY location.id
