I'm trying to create a custom Kibana visualization using Vega-Lite. Here's my data pull:
data: {
  url: {
    %context%: true
    %timefield%: dateStamp
    index: index_1
    body: {
      size: 10000
    }
  }
}
I have a simple bind parameter in Vega-Lite, like so:
{
  name: displayBars
  value: 10
  bind: {
    input: range
    min: 3
    max: 40
    step: 1
  }
}
I'd like to change the max to a variable hits.total, like this:
{
  name: displayBars
  value: 10
  bind: {
    input: range
    min: 3
    max: hits.total
    step: 1
  }
}
Where hits.total represents the total documents returned by the kibana search. Is this possible?
Here's the logic I am trying to accomplish:
I am using Elasticsearch to display top-selling products, and I randomly insert newly created products into the results using the function_score query DSL.
The issue I am facing: I use the random_score function for newly created products, and the query does insert new products up to page 2 or 3, but all the remaining newly created products are pushed towards the end of the search results.
Here's the logic written for function_score:
function_score: {
query: query,
functions: [
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
],
random_score: {
seed: Time.current.to_i / (60 * 10), # new seed every 10 minutes
field: '_seq_no'
},
weight: 0.975
},
{
filter: { range: { live_at: { lt: 'CURRENT_DATE - 1.MONTH' } } },
linear: {
weighted_sales_rate: {
decay: 0.9,
origin: 0.5520974289580515,
scale: 0.5520974289580515
}
},
weight: 1
}
],
score_mode: 'sum',
boost_mode: 'replace'
}
And then I am sorting based on {"_score" => { "order" => "desc" } }
Let's say 100 sponsored products were created in the last month. The above Elasticsearch query displays 8-10 random products (3 to 4 per page) as I scroll through 2 or 3 pages, but the other 90-92 products end up in the last few pages of the results. This is because the score calculated by random_score for those 90-92 products is lower than the score calculated by the linear decay function.
How can I modify this query so that I keep seeing newly created products as I navigate through pages, instead of having them pushed towards the end of the results?
[UPDATE]
I tried adding a gauss decay function to this query (so that I can somehow modify the score of the products appearing towards the end of the results), like below:
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } },
{ range: { _score: { lt: 0.9 } } }
],
gauss: {
views_per_age_and_sales: {
origin: 1563.77,
scale: 1563.77,
decay: 0.95
}
},
weight: 0.95
}
But this too is not working.
Links I have referred to:
https://intellipaat.com/community/12391/how-to-get-3-random-search-results-in-elasticserch-query
Query to get random n items from top 100 items in Elastic Search
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-function-score-query.html
I am not sure if this is the best solution, but I was able to accomplish this by wrapping the original query in a script_score query, plus adding a new indexed field called sort_by_views_per_year. Here's how the solution looks:
Link I referred to: https://github.com/elastic/elasticsearch/issues/7783
attribute(:sort_by_views_per_year) do
object.live_age&.positive? ? object.views_per_year.to_f / object.live_age : 0.0
end
Then while querying ElasticSearch:
def search
#...preparation of query...#
query = original_query(query)
query = rearrange_low_scoring_docs(query)
sort = apply_sort
Product.search(query: query, sort: sort)
end
I have not changed anything in original_query (i.e. it still applies random_score to products created within the last month and the linear decay function to older ones).
def rearrange_low_scoring_docs query
{
function_score: {
query: query,
functions: [
{
script_score: {
script: "if (_score.doubleValue() < 0.9) {return 0.9;} else {return _score;}"
}
}
],
#score_mode: 'sum',
boost_mode: 'replace'
}
}
end
Then finally my sorting looks like this:
def apply_sort
[
{ '_score' => { 'order' => 'desc' } },
{ 'sort_by_views_per_year' => { 'order' => 'desc' } }
]
end
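Taken together, the clamp in rearrange_low_scoring_docs and the two-key sort in apply_sort behave like this sketch (plain Python with made-up scores, not an actual Elasticsearch call):

```python
# Hypothetical hits: (score from the original function_score query, views per year).
hits = [(1.2, 40.0), (0.3, 500.0), (0.95, 10.0), (0.1, 900.0)]

# script_score: lift every score below 0.9 up to 0.9, keep the rest as-is.
clamped = [(max(score, 0.9), views) for score, views in hits]

# apply_sort: _score descending, then sort_by_views_per_year descending as tie-breaker.
ranked = sorted(clamped, key=lambda h: (-h[0], -h[1]))
# Low scorers now tie at 0.9 and are ordered by views_per_year among themselves.
```

The effect is that documents the decay function had pushed to the bottom tie with the random-scored ones, and the secondary sort key decides their order.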
It would be very helpful if Elasticsearch's random_score query DSL supported something like max_doc_to_include and min_score attributes, so that I could use it like:
{
filter: [
{ terms: { product_type: 'sponsored' } },
{ range: { live_at: { gte: 'CURRENT_DATE - 1.MONTH' } } }
],
random_score: {
seed: 123456, # new seed every 10 minutes
field: '_seq_no',
max_doc_to_include: 10,
min_score: 0.9
},
weight: 0.975
},
I am creating a scatter plot and it's working fine until I choose a time range from the calendar that does not contain data. I have data until May 2021, so when I go for 1 year it's fine, but when I choose the last 4 months it gives me an error for the X axis, Infinite extent for field "time": [Infinity, -Infinity], and also an error for the Y axis, Infinite extent for field "Kilometers": [Infinity, -Infinity].
It's probably a problem in the date conversion, but why does it work for periods containing data, and why is it giving me the error for Kilometers as well? Any help here, please?
The Elastic query returns the time field in Unix time.
Thank you
(I have the same problem with a line chart in Vega.)
Screenshot of error
{
  $schema: https://vega.github.io/schema/vega-lite/v2.6.0.json
  data: {
    url: {
      %context%: true
      %timefield%: timefield
      index: indextrains
      body: {
        size: 10000
        _source: [
          timefield
          km
        ]
      }
    }
    format: {
      property: hits.hits
    }
  }
  transform: [
    {
      calculate: datetime(datum._source['timefield'])
      as: time
    }
    {
      calculate: datum._source['km']
      as: Kilometers
    }
  ]
  mark: {
    type: circle
  }
  encoding: {
    x: {
      field: time
      type: temporal
    }
    y: {
      field: Kilometers
      type: quantitative
    }
  }
}
I am trying to access a schema element in Go, with the following schema populated via a Terraform config file:
"vehicles": {
Type: schema.TypeSet,
Optional: true,
MaxItems: 5,
Elem: &schema.Resource{
Schema: map[string]*schema.Schema{
"car": {
Type: schema.TypeList,
Optional: true,
MaxItems: 2,
Elem: &schema.Resource{
Schema: map[string]*schema.Schema{
"make": {
Type: schema.TypeString,
Optional: true,
},
"model": {
Type: schema.TypeString,
Optional: true,
},
},
},
},
},
},
}
In the config file:
resource "type_test" "type_name" {
vehicles {
car {
make = "Toyota"
model = "Camry"
}
car {
make = "Nissan"
model = "Rogue"
}
}
}
I want to iterate over the set and access the vehicles maps in Go.
Terraform crashes with the code below:
vehicles_d, ok := d.GetOk("vehicles")
if ok {
    vehicleSet := vehicles_d.(*schema.Set).List()
    for _, vehicle := range vehicleSet {
        mdi, ok := vehicle.(map[string]interface{})
        if ok {
            log.Printf("%v", mdi["vehicles"].(map[string]interface{})["car"])
        }
    }
}
Crash Log:
2019-12-25T21 [DEBUG] plugin.terraform-provider: panic: interface conversion: interface {} is nil, not map[string]interface {}
for line "log.Printf("%v", mdi["vehicles"].(map[string]interface{})["car"])"
I want to print and access each vehicles element from the config file; any help would be appreciated.
d.GetOk("vehicles") already performs the indexing with the "vehicles" key, which results in a *schema.Set. Calling its List() method, you get a slice (of type []interface{}). Each element is a map[string]interface{} describing one vehicles block; its "car" key holds the list of cars, each again a map[string]interface{}. So inside the loop you type assert and index with "car", but not with "vehicles" again (that key is nil inside the element, which is what caused the panic).
Something like this:
for _, vehicle := range vehicleSet {
    m, ok := vehicle.(map[string]interface{})
    if !ok {
        continue
    }
    for _, c := range m["car"].([]interface{}) {
        car := c.(map[string]interface{})
        log.Printf("model: %v, make: %v\n", car["model"], car["make"])
    }
}
Suppose I have a record like this:
{
id: 1,
statistics: {
stat1: 1,
global: {
stat2: 3
},
stat111: 99
}
}
I want to update the record with this object:
{
statistics: {
stat1: 8,
global: {
stat2: 6
},
stat4: 3
}
}
The update should be applied to the current record as a delta, so the resulting record should look like this:
{
id: 1,
statistics: {
stat1: 9,
global: {
stat2: 9
},
stat4: 3,
stat111: 99
}
}
Is it possible to make this with one query?
Do you want something generic or something specific?
Specific is easy, this is the generic case:
const updateValExpr = r.expr(updateVal);
const updateStats = (stats, val) => val
.keys()
.map(key => r.branch(
stats.hasFields(key),
[key, stats(key).add(val(key))],
[key, val(key)]
))
.coerceTo('object')
r.table(...)
  .update(stats =>
    updateStats(stats.without('global'), updateValExpr.without('global'))
      .merge({ global: updateStats(stats('global'), updateValExpr('global')) })
  )
There might be some bugs here since it's untested, but the key point of the solution is the updateStats function, the fact that you can get all the keys with .keys(), and that coerceTo('object') transforms an array like [['a', 1], ['b', 2]] into the object { a: 1, b: 2 }.
Edit:
You can do it recursively, although only to a limited depth (you can't send truly recursive queries; the recursion is resolved when the query is actually built):
function updateStats(stats, val, stack = 10) {
return stack === 0
? {}
: val
.keys()
.map(key => r.branch(
stats.hasFields(key).not(),
[key, val(key)],
stats(key).typeOf().eq('OBJECT'),
[key, updateStats(stats(key), val(key), stack - 1)],
[key, stats(key).add(val(key))]
)).coerceTo('object')
}
r.table(...).update(row => updateStats(row, r(updateVal))).run(conn)
// test in admin panel
updateStats(r({
id: 1,
statistics: {
stat1: 1,
global: {
stat2: 3
},
stat111: 99
}
}), r({
statistics: {
stat1: 8,
global: {
stat2: 6
},
stat4: 3
}
}))
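For reference, the delta merge the query performs can be sketched in plain Python (outside RethinkDB), using the example record and update from above:

```python
def update_stats(stats, delta):
    """Recursively add delta values into stats; keys missing from stats are inserted."""
    result = dict(stats)
    for key, value in delta.items():
        if isinstance(value, dict):
            # Recurse into nested objects such as "global".
            result[key] = update_stats(stats.get(key, {}), value)
        elif key in stats:
            result[key] = stats[key] + value
        else:
            result[key] = value
    return result

record = {"id": 1, "statistics": {"stat1": 1, "global": {"stat2": 3}, "stat111": 99}}
delta = {"statistics": {"stat1": 8, "global": {"stat2": 6}, "stat4": 3}}
merged = update_stats(record, delta)
```

This mirrors the branch logic above: existing numeric keys are summed, nested objects recurse, and new keys are taken from the delta.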
Is there a way to monitor the input and output throughput of a Spark cluster, to make sure the cluster is not flooded and overflowed by incoming data?
In my case, I set up Spark cluster on AWS EC2, so I'm thinking of using AWS CloudWatch to monitor the NetworkIn and NetworkOut for each node in the cluster.
But this approach seems inaccurate: network traffic does not represent only Spark's incoming data; other traffic would be counted as well.
Is there a tool or a way to monitor the streaming data status of a Spark cluster specifically? Or is there already a built-in tool in Spark that I missed?
Update: Spark 1.4 has been released; monitoring at port 4040 is significantly enhanced with a graphical display.
Spark has a configurable metric subsystem.
By default it publishes a JSON version of the registered metrics at <driver>:<port>/metrics/json. Other metrics sinks, like Ganglia, CSV files, or JMX, can be configured.
You will need some external monitoring system that collects metrics on a regular basis and helps you make sense of them. (N.b. we use Ganglia, but there are other open-source and commercial options.)
Spark Streaming publishes several metrics that can be used to monitor the performance of your job. To calculate throughput, you would combine:
(lastReceivedBatch_processingEndTime - lastReceivedBatch_processingStartTime) / lastReceivedBatch_records
(this yields the processing time per record of the last received batch; its inverse is the throughput). For all supported metrics, have a look at StreamingSource.
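As a sketch, that combination can be computed from the gauges in the /metrics/json payload like this (hypothetical helper and prefix; the gauge names match the StreamingMetrics entries shown in the dump further down):

```python
# Fetch e.g. http://<driver>:4040/metrics/json and pass metrics["gauges"] here.
def ms_per_record(gauges, prefix):
    """(processingEndTime - processingStartTime) / records for the last received batch."""
    def g(name):
        return gauges[prefix + ".StreamingMetrics.streaming." + name]["value"]
    records = g("lastReceivedBatch_records")
    if records == 0:
        return None  # nothing received in the last batch
    return (g("lastReceivedBatch_processingEndTime")
            - g("lastReceivedBatch_processingStartTime")) / records

sample = {  # made-up prefix; gauge names as published by Spark Streaming
    "app.<driver>.shell.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime": {"value": 1430559950000},
    "app.<driver>.shell.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime": {"value": 1430559950044},
    "app.<driver>.shell.StreamingMetrics.streaming.lastReceivedBatch_records": {"value": 4},
}
rate = ms_per_record(sample, "app.<driver>.shell")  # 44 ms / 4 records = 11.0 ms per record
```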
Example: starting a local REPL with Spark 1.3.1 and executing a trivial streaming application:
import org.apache.spark.streaming._
val ssc = new StreamingContext(sc, Seconds(10))
val queue = scala.collection.mutable.Queue(1,2,3,45,6,6,7,18,9,10,11)
val q = queue.map(elem => sc.parallelize(Seq(elem)))
val dstream = ssc.queueStream(q)
dstream.print
ssc.start
one can GET localhost:4040/metrics/json and that returns:
{
version: "3.0.0",
gauges: {
local-1430558777965.<driver>.BlockManager.disk.diskSpaceUsed_MB: {
value: 0
},
local-1430558777965.<driver>.BlockManager.memory.maxMem_MB: {
value: 2120
},
local-1430558777965.<driver>.BlockManager.memory.memUsed_MB: {
value: 0
},
local-1430558777965.<driver>.BlockManager.memory.remainingMem_MB: {
value: 2120
},
local-1430558777965.<driver>.DAGScheduler.job.activeJobs: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.job.allJobs: {
value: 6
},
local-1430558777965.<driver>.DAGScheduler.stage.failedStages: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.stage.runningStages: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.stage.waitingStages: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingDelay: {
value: 44
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime: {
value: 1430559950044
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_submissionTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_totalDelay: {
value: 44
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime: {
value: 1430559950044
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_records: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_submissionTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.receivers: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.retainedCompletedBatches: {
value: 2
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.runningBatches: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalCompletedBatches: {
value: 2
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalProcessedRecords: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalReceivedRecords: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.unprocessedBatches: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.waitingBatches: {
value: 0
}
},
counters: { },
histograms: { },
meters: { },
timers: { }
}
I recommend using https://spark.apache.org/docs/latest/monitoring.html#metrics with Prometheus (https://prometheus.io/).
Metrics generated by Spark can be captured using Prometheus, which offers a UI as well. Prometheus is a free tool.
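For example, recent Spark versions (3.0+) ship a PrometheusServlet metrics sink that can be enabled in conf/metrics.properties. A sketch based on the Spark monitoring docs (verify the class name and paths against your Spark version):

```properties
# conf/metrics.properties
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```

Prometheus can then scrape the driver's UI port (4040 by default) at the configured path.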