Is there a way to monitor the input and output throughput of a Spark cluster, to make sure the cluster is not flooded and overwhelmed by incoming data?
In my case, I set up a Spark cluster on AWS EC2, so I'm thinking of using AWS CloudWatch to monitor the NetworkIn and NetworkOut for each node in the cluster.
But this idea doesn't seem accurate: network traffic is not only Spark's incoming data, so other traffic would be counted as well.
Is there a tool or a way to monitor the streaming data status of a Spark cluster specifically? Or is there already a built-in tool in Spark that I missed?
Update: Spark 1.4 has been released, and the monitoring UI on port 4040 is significantly enhanced with graphical displays.
Spark has a configurable metrics subsystem.
By default it publishes a JSON version of the registered metrics on <driver>:<port>/metrics/json. Other metrics sinks, such as Ganglia, CSV files or JMX, can be configured.
You will need some external monitoring system that collects metrics on a regular basis and helps you make sense of them. (N.B. we use Ganglia, but there are other open-source and commercial options.)
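For example, a minimal sketch of conf/metrics.properties (not from the original answer; the directory, host and port are placeholders) that adds a CSV sink and a Graphite sink alongside the default metrics servlet:
# conf/metrics.properties (or point spark.metrics.conf at a custom file)
# Dump all registered metrics to CSV files every 10 seconds
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
# Also push all metrics to a Graphite server (placeholder host/port)
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003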
Spark Streaming publishes several metrics that can be used to monitor the performance of your job. To calculate throughput, you would combine:
lastReceivedBatch_records / (lastReceivedBatch_processingEndTime - lastReceivedBatch_processingStartTime)
which gives records per millisecond for the last received batch.
For the full list of supported metrics, have a look at StreamingSource.
Example: after starting a local REPL with Spark 1.3.1 and executing a trivial streaming application:
import org.apache.spark.streaming._
// Create a streaming context with a 10-second batch interval
val ssc = new StreamingContext(sc, Seconds(10))
// Build a queue-backed DStream from a handful of single-element RDDs
val queue = scala.collection.mutable.Queue(1, 2, 3, 45, 6, 6, 7, 18, 9, 10, 11)
val q = queue.map(elem => sc.parallelize(Seq(elem)))
val dstream = ssc.queueStream(q)
dstream.print()
ssc.start()
one can GET http://localhost:4040/metrics/json, which returns:
{
version: "3.0.0",
gauges: {
local-1430558777965.<driver>.BlockManager.disk.diskSpaceUsed_MB: {
value: 0
},
local-1430558777965.<driver>.BlockManager.memory.maxMem_MB: {
value: 2120
},
local-1430558777965.<driver>.BlockManager.memory.memUsed_MB: {
value: 0
},
local-1430558777965.<driver>.BlockManager.memory.remainingMem_MB: {
value: 2120
},
local-1430558777965.<driver>.DAGScheduler.job.activeJobs: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.job.allJobs: {
value: 6
},
local-1430558777965.<driver>.DAGScheduler.stage.failedStages: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.stage.runningStages: {
value: 0
},
local-1430558777965.<driver>.DAGScheduler.stage.waitingStages: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingDelay: {
value: 44
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime: {
value: 1430559950044
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_submissionTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastCompletedBatch_totalDelay: {
value: 44
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime: {
value: 1430559950044
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_processingStartTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_records: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.lastReceivedBatch_submissionTime: {
value: 1430559950000
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.receivers: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.retainedCompletedBatches: {
value: 2
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.runningBatches: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalCompletedBatches: {
value: 2
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalProcessedRecords: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.totalReceivedRecords: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.unprocessedBatches: {
value: 0
},
local-1430558777965.<driver>.Spark shell.StreamingMetrics.streaming.waitingBatches: {
value: 0
}
},
counters: { },
histograms: { },
meters: { },
timers: { }
}
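As a rough illustration (not part of the original answer), a small Node.js script could poll that endpoint and derive a records-per-second figure from the gauges shown above. This sketch assumes Node 18+ (for the global fetch API), run as an ES module, with the driver UI on localhost:4040:
// throughput.mjs
const res = await fetch('http://localhost:4040/metrics/json');
const { gauges } = await res.json();
// Gauge keys are prefixed with the application id, so match them by suffix
const gauge = (suffix) => {
  const key = Object.keys(gauges).find((k) => k.endsWith(suffix));
  return key ? gauges[key].value : undefined;
};
const records = gauge('lastReceivedBatch_records');
const start = gauge('lastReceivedBatch_processingStartTime');
const end = gauge('lastReceivedBatch_processingEndTime');
if (records !== undefined && end > start) {
  console.log(`~${(records / ((end - start) / 1000)).toFixed(1)} records/s in the last received batch`);
} else {
  console.log('no batch with records processed yet');
}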
I recommend using the Spark metrics system (https://spark.apache.org/docs/latest/monitoring.html#metrics) together with Prometheus (https://prometheus.io/).
The metrics generated by Spark can be scraped by Prometheus, which offers a UI as well. Prometheus is a free tool.
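A minimal sketch of that setup (placeholder host names), assuming Spark 3.0 or later, which ships a PrometheusServlet metrics sink:
# conf/metrics.properties: expose all metrics in Prometheus format on the UI port
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
# prometheus.yml: scrape the driver UI
scrape_configs:
  - job_name: 'spark-driver'
    metrics_path: '/metrics/prometheus'
    static_configs:
      - targets: ['spark-driver.example.com:4040']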
Related
I'm trying to create a custom Kibana visualization using Vega-Lite. I have my data pull:
data: {
  url: {
    %context%: true
    %timefield%: dateStamp
    index: index_1
    body: {
      size: 10000
    }
  }
}
I have a simple bind parameter in Vega Lite like so:
{
  name: displayBars
  value: 10
  bind: {
    input: range
    min: 3
    max: 40
    step: 1
  }
}
I'd like to change the max to a variable hits.total, like this:
{
  name: displayBars
  value: 10
  bind: {
    input: range
    min: 3
    max: hits.total
    step: 1
  }
}
Here hits.total represents the total number of documents returned by the Kibana search. Is this possible?
After struggling with k6, and with the help of the answers to my other question, I was able (I think) to test subscriptions over WebSocket.
Now I'm trying to design a good test. Our app is a single-page app that mainly uses WebSocket subscriptions instead of queries. The problem is that I'm a bit lost on how to decide how many subscriptions I should test in order to tell whether my app will be able to support around 800 users working simultaneously in real time. (I started testing with a peak of 150 VUs.)
For now, I'm running the test with 3 commonly used subscriptions in our user path, but:
Should I try more subscriptions, covering the most used ones in our user path, or should I keep it simple, pick the 3 most used ones, and add a timeout between them?
Is this a correct approach for load/stress testing with GraphQL subscriptions/WebSocket? I'm not sure if I'm being too cautious about this, but I'm afraid of giving a false picture of our situation.
I am a bit confused about how to interpret the results screen, especially about how I can infer whether we will be able to give our users a good experience. Should I take the avg and p(95) of ws_ping as a reference for this?
k6 screen result
As a reference, here is part of the code I'm using to perform the test.
Thanks in advance!
main.js
import { Httpx } from 'https://jslib.k6.io/httpx/0.0.5/index.js';
import { todas } from './todas.js'; // the subscription scenario shown below
// `enviorment` and `keys` are defined elsewhere in the project
const session = new Httpx({
  baseURL: `https://${enviorment}`
});
const wsUri = `wss://${enviorment}/v1/graphql`;
const pauseMin = 2;
const pauseMax = 6;
export const options = {
  stages: [
    { duration: '30s', target: 50 },
    { duration: '30s', target: 100 },
    { duration: '30s', target: 150 },
    { duration: '120s', target: 150 },
    { duration: '60s', target: 50 },
    { duration: '30s', target: 0 },
  ]
};
export default function () {
  session.addHeader('Content-Type', 'application/json');
  todas(keys.profesorFirebaseKey, keys.desafioCursoKey, keys.nivelCurso, keys.profesorKey);
}
todas.js:
import ws from 'k6/ws';
import {fail,check} from 'k6'
import exec from 'k6/execution';
export function todas(id, desafioCursoKey, nivel, profesorKey) {
  const queryList = []; // the three GraphQL subscription query strings were redacted from this snippet
  let ArraySubscribePayload = []
  for (let i = 0; i < 3; i++) {
    let subscribePayload = {
      id: String(1 + 3 * i),
      payload: {
        extensions: {},
        query: queryList[i],
        variables: {},
      },
      type: "start",
    }
    ArraySubscribePayload.push(subscribePayload)
  }
  const initPayload = {
    payload: {
      headers: {
        xxxxxxxxxxxxx
      },
      lazy: true,
    },
    type: "connection_init",
  };
  const res = ws.connect(wsUri, initPayload, function (socket) {
    socket.on('open', function () {
      // Every 7 seconds: send the first subscription, then the other two after 3 more seconds
      socket.setInterval(function timeout() {
        socket.send(JSON.stringify(initPayload));
        socket.send(JSON.stringify(ArraySubscribePayload[0]));
        socket.setTimeout(function timeout() {
          socket.send(JSON.stringify(ArraySubscribePayload[1]));
          socket.send(JSON.stringify(ArraySubscribePayload[2]));
          ArraySubscribePayload[1].id = String(parseInt(ArraySubscribePayload[1].id) + 1)
          ArraySubscribePayload[2].id = String(parseInt(ArraySubscribePayload[2].id) + 1)
        }, 3000);
        ArraySubscribePayload[0].id = String(parseInt(ArraySubscribePayload[0].id) + 1)
        socket.ping();
      }, 7000);
    })
  });
}
I have MongoDB on Windows, so there is no logrotate or anything. The log consumes 175 GB per week! I need to cut this down a lot.
Currently db.getProfilingLevel() returns 0 and db.getLogComponents() returns -1 for all components, and still I get almost 2000 of these bad boys a minute:
2018-06-25T15:44:59.653+0200 I COMMAND [conn2355] command mydb.LimitStubs command: find { find: "LimitStubs", filter: { Limit: "asdl;" }, skip: 0, noCursorTimeout: false } planSummary: IXSCAN { Limit: 1, Holder: 1 } keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:129 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } protocol:op_query 0ms
Any suggestions?
I have a schema that looks like this:
var Schema = require('mongoose').Schema;
var user = new Schema({
  preference1: { // preference is a number between 1 - 10
    type: Number
  },
  preference2: { // preference is a number between 1 - 10
    type: Number
  }
})
How do I find the documents and sort them by a function of the preference fields? Say fn is something like this:
fn = Math.abs(preference1 - 3) + preference2 ^ 2
I kind of found a temporary solution. It works, but it isn't really what I was looking for... the code is really messy, and apparently you cannot use an arbitrary fn for sorting.
For example, say fn = (a + 3) * (b + 5):
Media.aggregate()
  .project({
    "type": 1,
    "status": 1,
    "newField1": { "$add": [ "$type", 3 ] },
    "newField2": { "$add": [ 5, "$status" ] },
  })
  .project({
    "newField3": { "$multiply": [ "$newField1", "$newField2" ] },
  })
  .sort("newField3")
  .exec()
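Not part of the original answer, but the same trick can be collapsed into a single $project using MongoDB's arithmetic operators (MongoDB 3.2+). Here is a sketch for the question's original fn, assuming the model compiled from the schema above is named User and reading ^ as exponentiation:
// Hypothetical model name; fn = Math.abs(preference1 - 3) + preference2 squared
User.aggregate()
  .project({
    preference1: 1,
    preference2: 1,
    score: {
      $add: [
        { $abs: { $subtract: ["$preference1", 3] } }, // |preference1 - 3|
        { $pow: ["$preference2", 2] },                // preference2 squared
      ],
    },
  })
  .sort("score")
  .exec()
The computed score field is then available to .sort(), just like newField3 above.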
I am trying to use the "no data" option from c3.js, but somehow it does not work for me.
My js fiddle:
http://jsfiddle.net/ymqef2ut/7/
I am trying to use the empty option according to the c3 documentation:
empty: { label: { text: "No Data Available" } }
There are two issues in your fiddle:
Problem 1:
data: {
  columns: [
    ['electricity plants', elec_plants],
    ['CHP plants', chp_planrs],
    ['Unallocated autoproducers / Other energy industry own use', auto_pro],
    ['Other', other_elec],
  ],
  type: 'pie'
},
empty: { label: { text: "No Data Available" } }, // this is wrong: it should be part of data
The empty option should be part of the data JSON, like below:
data: {
  columns: [
    ['electricity plants', elec_plants],
    ['CHP plants', chp_planrs],
    ['Unallocated autoproducers / Other energy industry own use', auto_pro],
    ['Other', other_elec],
  ],
  type: 'pie',
  empty: { label: { text: "No Data Available" } }, // this is correct
},
Problem 2:
When no data is present, the columns array should be an empty array:
var col5 = []; // start with an empty array
if (resi || com || agri || other_sec) {
  col5 = [
    ['Residential', resi],
    ['Commercial and public services', com],
    ['Agriculture/forestry', agri],
    ['Other', other_sec]
  ]
}
// if all values are 0 then col5 stays []
var chart = c3.generate({
  bindto: "#chart_5",
  data: {
    columns: col5,
    type: 'pie',
    empty: {
      label: {
        text: "No Data Available"
      }
    }
  },
Working code here
Test case: check for Iraq.
Hope this helps!