Databricks Autoloader is getting stuck and does not pass to the next batch - spark-streaming

I have a simple job scheduled every 5 min. Basically it listens to cloudfiles on storage account and writes them into delta table, extremely simple. The code is something like this:
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(input_path, schema=my_schema)
      .select(cols)
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", f"{output_path}/_checkpoint")
      .trigger(once=True)
      .start(output_path))
Sometimes there are new files, sometimes not. After 40-60 batches it gets stuck on one particular batchId, as if there were no new files in the folder. If I run the script manually I get the same result: it points to the last actually processed batch.
{
"id" : "xxx",
"runId" : "xxx",
"name" : null,
"timestamp" : "2022-01-13T15:25:07.512Z",
"batchId" : 64,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"latestOffset" : 663,
"triggerExecution" : 1183
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "CloudFilesSource[/mnt/source/]",
"startOffset" : {
"seqNum" : 385,
"sourceVersion" : 1,
"lastBackfillStartTimeMs" : 1641982820801,
"lastBackfillFinishTimeMs" : 1641982823560
},
"endOffset" : {
"seqNum" : 385,
"sourceVersion" : 1,
"lastBackfillStartTimeMs" : 1641982820801,
"lastBackfillFinishTimeMs" : 1641982823560
},
"latestOffset" : null,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"metrics" : {
"numBytesOutstanding" : "0",
"numFilesOutstanding" : "0"
}
} ],
"sink" : {
"description" : "DeltaSink[/mnt/db/table_name]",
"numOutputRows" : -1
}
}
But if I run only the readStream part, it correctly reads the entire list of files (and starts a new batchId: 0). The strangest part is: I have absolutely no idea what causes it, or why it takes around 40-60 batches for this error to appear. Can anyone help? Or give me some suggestions?
I was thinking about using foreachBatch() to append new data, or using .trigger(continuous='5 minutes').
I'm new to AutoLoader
Thank you so much!

I resolved it by using
.option('cloudFiles.useIncrementalListing', 'false')
My filenames are composed of a flow name plus a timestamp, like this:
flow_name_2022-01-18T14-19-50.018Z.json
So my guess is: some combination of dots makes RocksDB look into a non-existing directory, which is why it reports that it "found no new files". Once I disabled incremental listing, RocksDB stopped making its mini-checkpoints based on filenames and now lists the whole directory. This is the only explanation I have.
If anyone is having the same issue, try changing the filename pattern.
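For completeness, the option slots into the stream definition from the question like this (same placeholder names: input_path, output_path, my_schema, cols; this is a sketch, not a full job):

```python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Workaround: disable incremental listing so Auto Loader lists
      # the whole input directory instead of relying on RocksDB
      # checkpoints derived from filenames
      .option("cloudFiles.useIncrementalListing", "false")
      .load(input_path, schema=my_schema)
      .select(cols)
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", f"{output_path}/_checkpoint")
      .trigger(once=True)
      .start(output_path))
```

Note that full listing can be slower on large directories, so this trades listing cost for correctness.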

Related

Cosmos DB Collection not using _id index when querying by _id?

I have a CosmosDB - MongoDB collection that I'm using purely as a key/value store for arbitrary data, where the _id is the key for my collection.
When I run the query below:
globaldb:PRIMARY> db.FieldData.find({_id : new BinData(3, "xIAPpVWVkEaspHxRbLjaRA==")}).explain(true)
I get this result:
{
"_t" : "ExplainResponse",
"ok" : 1,
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.FieldData",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 106,
"totalKeysExamined" : 0,
"totalDocsExamined" : 3571,
"executionStages" : {
},
"allPlansExecution" : [ ]
},
"serverInfo" : #REMOVED#
}
Notice that totalKeysExamined is 0, totalDocsExamined is 3571, and the query took over 106 ms. If I run it without .explain(), it does find the document.
I would have expected this query to be lightning quick, given that the _id field is automatically indexed as a unique primary key on the collection. As the collection grows in size, I only expect this problem to get worse.
I'm definitely misunderstanding something about how the index works here. Any help would be much appreciated.
Thanks!
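As a side note for anyone reproducing this: BinData subtype 3 is the legacy binary UUID representation, so the key in the query above should decode to exactly 16 bytes. A quick stdlib sanity check (the bytes_le interpretation assumes the C#-legacy GUID byte order; other drivers encode legacy UUIDs differently):

```python
import base64
import uuid

# The base64 payload from the BinData(3, ...) key in the query above
payload = base64.b64decode("xIAPpVWVkEaspHxRbLjaRA==")
print(len(payload))  # 16

# bytes_le matches the C#-legacy GUID byte order; this is an assumption
# about which driver wrote the key, not something the explain output tells us
print(uuid.UUID(bytes_le=payload))
```

If the decoded length were not 16 bytes, the key would not be a UUID at all and the mismatch could explain a failed index lookup.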

Differentiating _delete_by_query tasks in a multi-tenant index

Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and repopulated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is that, with a large number of tenants, it is not clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that a given _delete_by_query task is operating on? Or can I attach an additional param to the URL that is cached in the task, to differentiate them?
Side note: looking at the docs (https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html) I see there is a description field in the _tasks API response that has the query as a string; however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance
One way to identify queries is to add the X-Opaque-Id HTTP header to your queries:
For instance, when deleting all tenant data for (e.g.) User 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-type: application/json' http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false -d '{"query":{"term":{"user": 3}}}'
You then get a task ID, and when checking the related task document, you'll be able to identify which task is/was deleting which tenant's data thanks to the headers section, which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
}
}
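The same tagged request can be built from any client as long as the X-Opaque-Id header is set. A minimal stdlib sketch (cluster URL, index name, and the user field are the same assumptions as in the curl example above):

```python
import json
import urllib.request

# Delete all documents for tenant/user 3, tagging the task via X-Opaque-Id
body = json.dumps({"query": {"term": {"user": 3}}}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false",
    data=body,
    headers={
        "X-Opaque-Id": "3",               # shows up in the task's headers section
        "Content-Type": "application/json",
    },
    method="POST",
)
# Against a live cluster you would now call urllib.request.urlopen(req)
# and read the task ID from the JSON response.
```

Whatever value you put in X-Opaque-Id (tenant ID, request ID, ...) is echoed back verbatim in the task document, so pick something you can map back to a tenant.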

Firebase Realtime Database: Do I need extra container for my queries?

I am developing an app, where I retrieve data from a firebase realtime database.
On the one hand, I have my objects; there will be around 10,000 entries when it is finished. For each property, e.g. "Blütenfarbe" (flower color), a user can select one (or more) characteristics, and then gets as a result the plants for which all these constraints hold. Each property has 2-10 characteristics.
Is querying powerful enough here to get fast results? If not, my thought would be to also set up a container for each characteristic and put into it the ID of every plant that has this characteristic.
This is my first project, so any tip for a better structure is welcome. I don't want to create this database and realize afterwards that it is not well structured.
Thanks for your help :)
{
"Pflanzen" : {
"Objekt" : {
"00001" : {
"Belaubung" : "Sommergrün",
"Blütenfarbe" : "Gelb",
"Blütezeit" : "Februar",
"Breite" : 20,
"Duftend" : "Ja",
"Frosthärte" : "Ja",
"Fruchtschmuck" : "Nein",
"Herbstfärbung" : "Gelb",
"Höhe" : 20,
"Pflanzengruppe" : "Laubgehölze",
"Standort" : "Sonnig",
"Umfang" : 10
},
"00002" : {
"Belaubung" : "Sommergrün",
"Blütenfarbe" : "Gelb",
"Blütezeit" : "März",
"Breite" : 25,
"Duftend" : "Nein",
"Frosthärte" : "Ja",
"Fruchtschmuck" : "Nein",
"Herbstfärbung" : "Ja",
"Höhe" : 10,
"Pflanzengruppe" : "Nadelgehölze",
"Standort" : "Schatten",
"Umfang" : 10
},
"Eigenschaften" : {
"Belaubung" : {
"Sommergrün" : [ "00001", "00002" ],
"Wintergrün" : ["..."]
},
"Blütenfarbe" : {
"Braun": ["00002"],
"Blau" : [ "00001" ]
}
}
}
}
}
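One thing to keep in mind: a Realtime Database query can only order/filter on a single property, so with several selected characteristics the per-characteristic ID lists (the "Eigenschaften" idea above) would each be fetched separately and intersected on the client. A minimal sketch of that intersection, using a hypothetical in-memory mirror of the index node:

```python
# Hypothetical in-memory mirror of the "Eigenschaften" node from the question
eigenschaften = {
    "Belaubung": {"Sommergrün": ["00001", "00002"], "Wintergrün": []},
    "Blütenfarbe": {"Gelb": ["00001", "00002"], "Blau": ["00001"]},
}

def matching_plants(filters):
    """Intersect the ID lists of every selected (property, characteristic) pair."""
    id_sets = [set(eigenschaften[prop][value]) for prop, value in filters]
    return sorted(set.intersection(*id_sets)) if id_sets else []

print(matching_plants([("Belaubung", "Sommergrün"), ("Blütenfarbe", "Blau")]))
# ['00001']
```

With ~10,000 entries these ID lists stay small enough that client-side intersection is cheap; the index node mainly saves you from downloading full plant objects just to filter them.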

How to store nested document as String in elastic search

Context:
1) We are building a CDC pipeline (using Kafka & the Connect framework)
2) We are using Debezium for capturing MySQL transaction logs
3) We are using the Elasticsearch connector to add documents to an ES index
Sample change event generated by Debezium:
{
"before" : {
"Id" : 97,
"name" : "Northland",
"code" : "NTL",
"country_id" : 6,
"is_business_mapped" : 0
},
"after" : {
"Id" : 97,
"name" : "Northland",
"code" : "NTL",
"country_id" : 6,
"is_business_mapped" : 1
},
"source" : {
"version" : "0.7.5",
"name" : "__",
"server_id" : 252639387,
"ts_sec" : 1547805940,
"gtid" : null,
"file" : "mysql-bin-changelog.000570",
"pos" : 236,
"row" : 0,
"snapshot" : false,
"thread" : 614,
"db" : "bazaarify",
"table" : "state"
},
"op" : "u",
"ts_ms" : 1547805939683
}
What we want :
We want to visualize only 3 columns in Kibana:
1) before - containing the nested JSON as a string
2) after - containing the nested JSON as a string
3) source - containing the nested JSON as a string
I can think of the following possibilities here:
a) converting the nested JSON to a string, or
b) combining column data in Elasticsearch.
I am a newbie to Elasticsearch. Can someone please guide me on how to do that?
I tried defining a custom mapping as well, but it gives me an exception.
You can always view your document as raw JSON in Kibana.
You don't need to manipulate it before indexing into Elasticsearch.
As this is related to visualization, handle it in Kibana only.
Check this link for a screenshot.
Refer to this to add the columns which you want to see in the results.
I don't fully understand your use case, but if you would like to turn some JSON objects into their string representations, you can use Logstash for that, or even Elasticsearch's ingest capabilities to convert an object (JSON) to a string.
From the link above, an example:
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "converts the content of the source field to a string",
  "processors" : [
    {
      "convert" : {
        "field" : "source",
        "type" : "string"
      }
    }
  ]
}
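If you prefer option (a) from the question, i.e. flattening the event before it ever reaches the connector, the transformation itself is trivial. A hedged sketch in Python, with field names taken from the abbreviated event above (in a real pipeline this would live in a Kafka Streams app or an SMT, not a standalone script):

```python
import json

# A Debezium-style change event (abbreviated)
event = {
    "before": {"Id": 97, "is_business_mapped": 0},
    "after": {"Id": 97, "is_business_mapped": 1},
    "op": "u",
}

# Serialize each nested object to a JSON string; scalar fields stay as-is
doc = {k: (json.dumps(v) if isinstance(v, dict) else v) for k, v in event.items()}
print(doc["before"])  # '{"Id": 97, "is_business_mapped": 0}'
```

The resulting before/after/source fields are then plain keyword/text fields in Elasticsearch, which is exactly what Kibana needs to show them as three columns.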

Not getting incremental data with jdbc importer from sql to elastic search

As per jdbc importer :
It is recommended to use timestamps in UTC for synchronization. This example fetches all product rows which have been added since the last run, using a millisecond-resolution column mytimestamp:
{
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:mysql://localhost:3306/test",
"user" : "",
"password" : "",
"sql" : [
{
"statement" : "select * from \"products\" where \"mytimestamp\" > ?",
"parameter" : [ "$metrics.lastexecutionstart" ]
}
],
"index" : "my_jdbc_index",
"type" : "my_jdbc_type"
}
}
I want to load data incrementally based on a column modifiedDate whose format is 2015-08-20 14:52:09, and I use a scheduler which runs every minute. I tried the sql key with the value
"statement" : "select * from \"products\" where \"modifiedDate\" > ?",
but data was not loaded.
Am I missing something?
The format of lastexecutionstart looks like this: "2016-03-27T06:37:09.165Z".
It contains 'T' and 'Z', and that is why your data was not loaded.
If you want to know more, see:
https://github.com/jprante/elasticsearch-jdbc
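To make the mismatch concrete: the bound parameter arrives in ISO-8601 form while the modifiedDate column uses the plain DATETIME layout, so the string comparison never matches. One illustrative way to see (and normalize) the difference, outside the importer itself; inside the importer you could equally cast in SQL, e.g. with MySQL's STR_TO_DATE:

```python
from datetime import datetime

last_run = "2016-03-27T06:37:09.165Z"  # shape of $metrics.lastexecutionstart

# Python 3.7+: %z accepts a literal 'Z' as UTC
parsed = datetime.strptime(last_run, "%Y-%m-%dT%H:%M:%S.%f%z")

# Re-format to the same layout as the modifiedDate column
mysql_ts = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(mysql_ts)  # 2016-03-27 06:37:09
```

Once both sides use the same layout (and, ideally, the same UTC timezone), the `"modifiedDate" > ?` comparison behaves as expected.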