I am developing an app that retrieves data from a Firebase Realtime Database.
On the one hand, I have my objects. There will be around 10,000 entries when it is finished. For each property, such as "Blütenfarbe" (flower color), a user can select one or more characteristics and then gets the plants that satisfy all of these constraints. Each property has 2-10 characteristics.
Is querying powerful enough here to get fast results? If not, my idea is to also set up a container for each characteristic and put the ID of every plant that has this characteristic into it.
This is my first project, so any tips for a better structure are welcome. I don't want to build this database and realize afterwards that it is not structured well enough.
Thanks for your help :)
{
"Pflanzen" : {
"Objekt" : {
"00001" : {
"Belaubung" : "Sommergrün",
"Blütenfarbe" : "Gelb",
"Blütezeit" : "Februar",
"Breite" : 20,
"Duftend" : "Ja",
"Frosthärte" : "Ja",
"Fruchtschmuck" : "Nein",
"Herbstfärbung" : "Gelb",
"Höhe" : 20,
"Pflanzengruppe" : "Laubgehölze",
"Standort" : "Sonnig",
"Umfang" : 10
},
"00002" : {
"Belaubung" : "Sommergrün",
"Blütenfarbe" : "Gelb",
"Blütezeit" : "März",
"Breite" : 25,
"Duftend" : "Nein",
"Frosthärte" : "Ja",
"Fruchtschmuck" : "Nein",
"Herbstfärbung" : "Ja",
"Höhe" : 10,
"Pflanzengruppe" : "Nadelgehölze",
"Standort" : "Schatten",
"Umfang" : 10
}
},
"Eigenschaften" : {
"Belaubung" : {
"Sommergrün" : [ "00001", "00002" ],
"Wintergrün" : ["..."]
},
"Blütenfarbe" : {
"Braun": ["00002"],
"Blau" : [ "00001" ]
}
}
}
}
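The "container per characteristic" idea from the question can be sketched in plain Python as an inverted index. This is only an illustration of the data structure, not Firebase code: the function names `build_index` and `find_plants` are hypothetical, and the sample data is a shortened copy of the two plants above. Selecting several characteristics of one property is treated as OR, and different properties are combined with AND.

```python
from collections import defaultdict

# Shortened copy of the two sample plants from the question.
PLANTS = {
    "00001": {"Belaubung": "Sommergrün", "Blütenfarbe": "Gelb", "Standort": "Sonnig"},
    "00002": {"Belaubung": "Sommergrün", "Blütenfarbe": "Gelb", "Standort": "Schatten"},
}


def build_index(plants):
    """Map property -> characteristic -> set of plant IDs (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(set))
    for plant_id, props in plants.items():
        for prop, value in props.items():
            index[prop][value].add(plant_id)
    return index


def find_plants(index, filters):
    """Return the IDs matching all filters.

    filters maps a property to the list of accepted characteristics.
    An empty filter dict returns an empty set.
    """
    result = None
    for prop, accepted in filters.items():
        matching = set()
        for value in accepted:
            # Union: any of the selected characteristics may match.
            matching |= index.get(prop, {}).get(value, set())
        # Intersection: every property constraint must hold.
        result = matching if result is None else result & matching
    return result if result is not None else set()


index = build_index(PLANTS)
print(find_plants(index, {"Blütenfarbe": ["Gelb"], "Standort": ["Sonnig"]}))
```

With a structure like this, each lookup touches only the ID lists of the selected characteristics instead of all plant objects. The price is that every plant write also has to update the ID lists.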
I have a simple job scheduled every 5 minutes. Basically, it listens for cloud files on a storage account and writes them into a Delta table; extremely simple. The code is something like this:
df = (spark
    .readStream
    .format("cloudFiles")
    .option('cloudFiles.format', 'json')
    .load(input_path, schema = my_schema)
    .select(cols)
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{output_path}/_checkpoint")
    .trigger(once = True)
    .start(output_path))
Sometimes there are new files, sometimes not. After 40-60 batches it gets stuck on one particular batchId, as if there were no new files in the folder. If I run the script manually, I get the same result: it points to the last batch that was actually processed.
{
"id" : "xxx",
"runId" : "xxx",
"name" : null,
"timestamp" : "2022-01-13T15:25:07.512Z",
"batchId" : 64,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 663,
"triggerExecution" : 1183
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "CloudFilesSource[/mnt/source/]",
"startOffset" : {
"seqNum" : 385,
"sourceVersion" : 1,
"lastBackfillStartTimeMs" : 1641982820801,
"lastBackfillFinishTimeMs" : 1641982823560
},
"endOffset" : {
"seqNum" : 385,
"sourceVersion" : 1,
"lastBackfillStartTimeMs" : 1641982820801,
"lastBackfillFinishTimeMs" : 1641982823560
},
"latestOffset" : null,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"metrics" : {
"numBytesOutstanding" : "0",
"numFilesOutstanding" : "0"
}
} ],
"sink" : {
"description" : "DeltaSink[/mnt/db/table_name]",
"numOutputRows" : -1
}
}
But if I run only the readStream part, it correctly reads the entire list of files (and starts a new batchId: 0). The strangest part is that I have absolutely no idea what causes it or why it takes around 40-60 batches for this error to appear. Can anyone help or give me a suggestion?
I was thinking about using foreachBatch() to append new data, or using the trigger .trigger(continuous='5 minutes').
I'm new to Auto Loader.
Thank you so much!
I resolved it by using
.option('cloudFiles.useIncrementalListing', 'false')
My filenames are composed of flowname + timestamp, like this:
flow_name_2022-01-18T14-19-50.018Z.json
So my guess is that some combination of dots makes RocksDB look in a non-existent directory, which is why it reports that it found no new files. Once I disabled incremental listing, RocksDB stopped making its mini checkpoints based on filenames, and the stream now reads the whole directory. This is the only explanation I have.
If anyone is having the same issue, try changing the filename.
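The explanation above is a guess. Another possibility, stated here only as an assumption, is that incremental listing expects new file names to sort lexically after the ones already seen. If several flows write into the same folder, a new file of one flow can sort before an older file of another flow. The sketch below is plain Python, not Auto Loader code; the function name and the flow names `flow_a` and `flow_b` are hypothetical. It only checks whether a list of file names in arrival order violates that ordering.

```python
def files_sorting_before_earlier_ones(arrival_order):
    """Return the names that sort lexically before a name that arrived earlier."""
    late = []
    highest = None  # lexically largest name seen so far
    for name in arrival_order:
        if highest is not None and name < highest:
            late.append(name)
        else:
            highest = name
    return late


# Hypothetical arrival order: two flows writing into one folder.
mixed = [
    "flow_b_2022-01-18T14-19-50.018Z.json",
    "flow_a_2022-01-18T14-24-50.018Z.json",  # newer, but sorts before flow_b
]
single = [
    "flow_name_2022-01-18T14-19-50.018Z.json",
    "flow_name_2022-01-18T14-24-50.018Z.json",
]

print(files_sorting_before_earlier_ones(mixed))
print(files_sorting_before_earlier_ones(single))
```

If a check like this flags files in a real folder, changing the naming scheme (for example, putting the timestamp first) is an alternative to disabling incremental listing.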
I have a CosmosDb - MongoDb collection that I'm using purely as a key/value store for arbitrary data where the _id is the key for my collection.
When I run the query below:
globaldb:PRIMARY> db.FieldData.find({_id : new BinData(3, "xIAPpVWVkEaspHxRbLjaRA==")}).explain(true)
I get this result:
{
"_t" : "ExplainResponse",
"ok" : 1,
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.FieldData",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 106,
"totalKeysExamined" : 0,
"totalDocsExamined" : 3571,
"executionStages" : {
},
"allPlansExecution" : [ ]
},
"serverInfo" : #REMOVED#
}
Notice that totalKeysExamined is 0, totalDocsExamined is 3571, and the query took 106 ms. If I run the query without .explain(), it does find the document.
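The symptom can be expressed as a small check on the explain output. This is a heuristic sketch in plain Python; the function name is hypothetical, and it assumes the executionStats fields carry the same meaning as in MongoDB, which the Cosmos DB emulation may not guarantee.

```python
def looks_like_collection_scan(explain):
    """Heuristic: no index keys examined, but more documents read than returned."""
    stats = explain.get("executionStats", {})
    keys = stats.get("totalKeysExamined", 0)
    docs = stats.get("totalDocsExamined", 0)
    returned = stats.get("nReturned", 0)
    return keys == 0 and docs > returned


# Values copied from the explain output above.
EXPLAIN = {
    "executionStats": {
        "nReturned": 1,
        "executionTimeMillis": 106,
        "totalKeysExamined": 0,
        "totalDocsExamined": 3571,
    }
}

print(looks_like_collection_scan(EXPLAIN))
```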
I would have expected this query to be lightning quick given that the _id field is automatically indexed as a unique primary key on the collection. As this collection grows in size, I only expect this problem to get worse.
I'm definitely not understanding something about the index and how it works here. Any help would be most appreciated.
Thanks!
I am facing an issue with using different Spring Controllers.
We are using a standard Spring PagingAndSortingRepository annotated with RepositoryRestResource to serve search responses. One of the methods is
Page<Custodian> findByShortNameContainingIgnoreCase(@Param("shortName") String shortName, Pageable p);
It returns all entities of Custodian that satisfy the conditions grouped in Pages.
The result looks like this:
{
"_embedded" : {
"custodians" : [ {
"shortName" : "fr-custodian",
"name" : "french custodian",
"contact" : "Francoir",
"_links" : {
"self" : {
"href" : "http://127.0.0.1:9000/api/custodians/10004"
},
"custodian" : {
"href" : "http://127.0.0.1:9000/api/custodians/10004"
}
}
} ]
},
"_links" : {
"self" : {
"href" : "http://127.0.0.1:9000/api/custodians/search/findByShortNameContainingIgnoreCase?shortName=fr&page=0&size=3&sort=shortName,asc"
}
},
"page" : {
"size" : 3,
"totalElements" : 1,
"totalPages" : 1,
"number" : 0
}
}
This is the format our frontend expects.
However, we need another query that results in a pretty long function (and thus URL) because it takes multiple parameters.
To be specific, it globally searches for a string in Custodian. So every parameter has the same value.
In order to shorten the URL we created a RepositoryRestController annotated with ResponseBody and implemented a function that takes only one parameter, calls the long URL internally and re-returns the result (a Page).
@RequestMapping(value = "/custodian", method = RequestMethod.GET)
public Page<Custodian> search(@RequestParam(value = "keyWord") String keyWord, Pageable p) {
    return repo.LONGURL(keyWord, keyWord, p);
}
Unfortunately, Spring doesn't apply the same format to the result of our function.
It looks like this:
{
"content" : [ {
"id" : 10004,
"shortName" : "fr-custodian",
"name" : "french custodian",
"contact" : "Francoir"
} ],
"pageable" : {
"sort" : {
"sorted" : true,
"unsorted" : false
},
"offset" : 0,
"pageSize" : 3,
"pageNumber" : 0,
"unpaged" : false,
"paged" : true
},
"totalElements" : 3,
"totalPages" : 1,
"last" : true,
"size" : 3,
"number" : 0,
"sort" : {
"sorted" : true,
"unsorted" : false
},
"numberOfElements" : 3,
"first" : true
}
How do you get Spring to deliver the same format in our custom method?
Let's say I have the following documents (containing logs) in Elasticsearch index:
PUT logs/_doc/1
{
"commonId" : "111111",
"comment" : "abc",
"phase" : "start"
}
PUT logs/_doc/2
{
"commonId" : "111111",
"comment" : "cde",
"customerNumber" : "234-333"
}
PUT logs/_doc/3
{
"commonId" : "222222",
"comment" : "efg",
"phase" : "stop"
}
PUT logs/_doc/4
{
"commonId" : "222222",
"comment" : "jkl",
"customerNumber" : "234-555"
}
The one thing all logs have in common is the commonId attribute.
The problem is:
I want to process the logs in the following way:
All logs with the same commonId should fill in each other's missing attributes. So log 1 should gain "customerNumber" : "234-333", and log 2 should gain "phase" : "start". The same applies to logs 3 and 4.
Is it possible to do this with an Elasticsearch query? In general, I'm not interested in any paid X-Pack option.
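To make the desired result concrete, here is a minimal sketch of the merge in plain Python, applied to the four documents above. It is not an Elasticsearch query and uses no Elasticsearch client; the function name `fill_missing_by_common_id` is hypothetical. Each document keeps its own values and only gains attributes it lacks from other documents with the same commonId.

```python
from collections import defaultdict

# The four log documents from the question.
LOGS = [
    {"commonId": "111111", "comment": "abc", "phase": "start"},
    {"commonId": "111111", "comment": "cde", "customerNumber": "234-333"},
    {"commonId": "222222", "comment": "efg", "phase": "stop"},
    {"commonId": "222222", "comment": "jkl", "customerNumber": "234-555"},
]


def fill_missing_by_common_id(docs, key="commonId"):
    """Return copies of docs with missing attributes filled from their group."""
    merged = defaultdict(dict)
    for doc in docs:
        for field, value in doc.items():
            # The first value seen for a field within a group wins.
            merged[doc[key]].setdefault(field, value)
    # The document's own values override the group values.
    return [{**merged[doc[key]], **doc} for doc in docs]


for doc in fill_missing_by_common_id(LOGS):
    print(doc)
```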
I have a problem with the RESTful interface of MongoDB.
I submitted this query: http://127.0.0.1:28017/db/collection/?limit=0 (I used limit=0 because I want to fetch all results with an AJAX request),
and the number of rows in the result is "total_rows" : 38185.
But if I execute db.collection.count() in my shell, the result is 496519.
Why is there this difference? Is it possible to get the same result with an AJAX request?
Thanks in advance for your help.
I'm fairly sure the result is limited neither by the number of rows nor by MongoDB itself, but by the built-in web server (which was created for admin tasks). Possibly the web server cuts off the response at a certain payload size, something like HTTP error 413 (Request Entity Too Large).
In my tests I see log entries such as "[websvr] killcursors: found 1 of 1". This kills the cursor opened between the client (in this case the web server) and MongoDB. Most drivers do not need to call OP_KILL_CURSORS because MongoDB defines a cursor timeout of 10 minutes by default.
Going back to my tests, I conclude that the response payload of the web server (built into MongoDB) is limited to about 38~40MB. Let me show my analysis.
I created a collection with 1,260,000 documents. A query in the REST web interface returns total_rows: 379,677 (avgObjSize * total_rows = 38MB).
db.manyrows.stats()
{
"ns" : "forum.manyrows",
"count" : 1260000,
"size" : 125101640,
"avgObjSize" : 99.28701587301587,
"storageSize" : 174735360,
"numExtents" : 12,
"nindexes" : 1,
"lastExtentSize" : 50798592,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 48753488,
"indexSizes" : {
"_id_" : 48753488
},
"ok" : 1
}
======
web output
{"total_rows" : 379677 , "query" : {} , "millis" : 6793}
Continuing: I removed documents from the collection until it fit within 38MB. A new query then returns all of the documents (379642 of 379642, or 38MB).
> db.manyrows.stats()
{
"ns" : "forum.manyrows",
"count" : 379678,
"size" : 38172128,
"avgObjSize" : 100.53816128403543,
"storageSize" : 174735360,
"numExtents" : 12,
"nindexes" : 1,
"lastExtentSize" : 50798592,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 12329408,
"indexSizes" : {
"_id_" : 12329408
},
"ok" : 1
}
===
web output
{"total_rows" : 379678 , "query" : {} , "millis" : 27325}
A new sample with another collection: the result is 39MB ("avgObjSize": 3440.35 * "total_rows": 11395 = 39MB).
> db.messages.stats()
{
"ns" : "enron.messages",
"count" : 120477,
"size" : 414484160,
"avgObjSize" : 3440.3592386928626,
"storageSize" : 518516736,
"numExtents" : 14,
"nindexes" : 2,
"lastExtentSize" : 140619776,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 1,
"totalIndexSize" : 436434880,
"indexSizes" : {
"_id_" : 3924480,
"body_text" : 432510400
},
"ok" : 1
}
=== web output:
{
"total_rows" : 11395 ,
"query" : {} ,
"millis" : 2956
}
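The payload estimates above can be recomputed with a few lines of Python. The numbers are copied from the stats and web outputs shown; the function name is hypothetical, and MB is taken here as 1,000,000 bytes.

```python
# (label, avgObjSize in bytes, total_rows reported by the web interface)
SAMPLES = [
    ("forum.manyrows, 1,260,000 documents", 99.28701587301587, 379677),
    ("forum.manyrows, after removing documents", 100.53816128403543, 379678),
    ("enron.messages", 3440.3592386928626, 11395),
]


def payload_megabytes(avg_obj_size, total_rows):
    """Estimate the response payload as avgObjSize * total_rows, in MB."""
    return avg_obj_size * total_rows / 1_000_000


for label, avg, rows in SAMPLES:
    print(f"{label}: {payload_megabytes(avg, rows):.1f} MB")
```

All three estimates land close together, which is what suggests a size limit rather than a row limit.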
You can try running the query through a microframework like Bottle instead.