identify live spark master at the time of spark-submit - spark-streaming

I have a 5-node Spark cluster where 2 nodes run as masters. In the HA scenario (with ZooKeeper), either one can be elected as the master.
When I submit the application with the command
/bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:7077
I get the error
Can only accept driver submissions in ALIVE state. Current state: STANDBY.
spark-submit does not allow multiple master names in --master.
Question:
How can I identify the elected master at the time of spark-submit?
Thanks
Pankaj

The --master option can take multiple Spark masters, so if you have more than one, list them separated by commas, e.g.
/bin/spark-submit --class SparkAggregator.java --deploy-mode cluster --supervise --master spark://host1:7077,host2:7077,host3:7077
It will try to connect to all of them; the first one that responds is used. This allows you to use multiple masters in a cluster where only one is active and the rest are in standby.

Spark has a hidden API which tells you the status of the Spark cluster.
API request: http://SPARK_MASTER_IP:8080/json/
Output:
{
  "url" : "spark://10.204.216.233:7077",
  "workers" : [ {
    "id" : "worker-20170606104140-10.204.217.96-40047",
    "host" : "10.204.217.96",
    "port" : 40047,
    "webuiaddress" : "http://10.204.217.96:8081",
    "cores" : 4,
    "coresused" : 0,
    "coresfree" : 4,
    "memory" : 29713,
    "memoryused" : 0,
    "memoryfree" : 29713,
    "state" : "ALIVE",
    "lastheartbeat" : 1496760671542
  }, {
    "id" : "worker-20170606104144-10.204.219.15-42749",
    "host" : "10.204.219.15",
    "port" : 42749,
    "webuiaddress" : "http://10.204.219.15:8081",
    "cores" : 4,
    "coresused" : 0,
    "coresfree" : 4,
    "memory" : 29713,
    "memoryused" : 0,
    "memoryfree" : 29713,
    "state" : "ALIVE",
    "lastheartbeat" : 1496760675649
  }, {
    "id" : "worker-20170606104151-10.204.217.249-35869",
    "host" : "10.204.217.249",
    "port" : 35869,
    "webuiaddress" : "http://10.204.217.249:8081",
    "cores" : 4,
    "coresused" : 0,
    "coresfree" : 4,
    "memory" : 29713,
    "memoryused" : 0,
    "memoryfree" : 29713,
    "state" : "ALIVE",
    "lastheartbeat" : 1496760682270
  } ],
  "cores" : 12,
  "coresused" : 0,
  "memory" : 89139,
  "memoryused" : 0,
  "activeapps" : [ ],
  "completedapps" : [ ],
  "activedrivers" : [ ],
  "status" : "ALIVE"
}
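Building on that endpoint, here is a minimal sketch (Python with the requests library; the master host names are placeholders for your two masters) that probes each master's web UI and returns the one reporting status ALIVE, which you can then pass to --master:

import requests

# Hypothetical master hosts; replace with your actual masters
masters = ["host1", "host2"]

alive_url = None
for host in masters:
    try:
        info = requests.get("http://%s:8080/json/" % host, timeout=5).json()
        if info.get("status") == "ALIVE":
            alive_url = info["url"]        # e.g. spark://host1:7077
            break
    except requests.RequestException:
        continue                           # unreachable master, try the next one

print(alive_url)

In practice, simply listing both masters comma-separated in --master (as shown above) is usually enough, since spark-submit will pick the ALIVE one itself.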

Related

external access to ElasticSearch cluster

Using this link I can easily set up a 3-node cluster on a single host with docker-compose.
This is all fine if I just use ES via the included Kibana container.
However I need to access this cluster from external hosts. This becomes problematic because the nodes inside the cluster are exposed through their docker-internal IP address. The application uses this API call below to get the addresses, and then of course errors out.
$ curl 172.16.0.146:9200/_nodes/http?pretty
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "es-cluster-test",
  "nodes" : {
    "hYCGiuBLQMK4vn5I3C3pQQ" : {
      "name" : "es01",
      "transport_address" : "192.168.48.3:9300",
      "host" : "192.168.48.3",
      "ip" : "192.168.48.3",
      "version" : "8.2.2",
      .....
How can I overcome this? I have tried exposing the 9200/9300 ports for all 3 nodes to different ports on the docker-host, and then adding a network.publish_host=172.16.0.146 environment setting to each node, but this results in three 1-node clusters.
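For reference, the attempted setup looked roughly like this hypothetical compose fragment for one node (the image tag, host ports, and the 172.16.0.146 address are taken from the description above; es02/es03 would be analogous with different host ports):

  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.2
    environment:
      - network.publish_host=172.16.0.146   # docker-host address to advertise
    ports:
      - "9201:9200"   # HTTP, mapped to a distinct host port per node
      - "9301:9300"   # transport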
Someone must have faced this one in the past...

Differentiating _delete_by_query tasks in a multi-tenant index

Scenario:
I have an index with a bunch of multi-tenant data in Elasticsearch 6.x. This data is frequently deleted (via _delete_by_query) and populated by the tenants.
When issuing a _delete_by_query request with wait_for_completion=false, supplying a query JSON to delete a tenant's data, I am able to see generic task information via the _tasks API. The problem is that, with a large number of tenants, it is not immediately clear who is deleting data at any given time.
My question is this:
Is there a way I can view the query that the _delete_by_query task is operating on? Or can I attach an additional param to the URL that is cached in the task to differentiate them?
Side note: looking at the docs: https://www.elastic.co/guide/en/elasticsearch/reference/6.6/tasks.html I see there is a description field in the _tasks API response that has the query as a String, however, I do not see that level of detail in my description field:
"description" : "delete-by-query [myindex]"
Thanks in advance
One way to identify queries is to add the X-Opaque-Id HTTP header to your queries:
For instance, when deleting all tenant data for (e.g.) User 3, you can issue the following command:
curl -XPOST -H 'X-Opaque-Id: 3' -H 'Content-type: application/json' http://localhost:9200/my-index/_delete_by_query?wait_for_completion=false -d '{"query":{"term":{"user": 3}}}'
You then get a task ID, and when checking the related task document, you'll be able to identify which task is/was deleting which tenant data thanks to the headers section which contains your HTTP header:
"_source" : {
"completed" : true,
"task" : {
"node" : "DB0GKYZrTt6wuo7d8B8p_w",
"id" : 20314843,
"type" : "transport",
"action" : "indices:data/write/delete/byquery",
"status" : {
"total" : 3,
"updated" : 0,
"created" : 0,
"deleted" : 3,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0
},
"description" : "delete-by-query [deletes]",
"start_time_in_millis" : 1570075424296,
"running_time_in_nanos" : 4020566,
"cancellable" : true,
"headers" : {
"X-Opaque-Id" : "3" <--- user 3
}
},
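The same flow as a sketch in Python with the requests library (the index name my-index and user id 3 are just the example values from above):

import requests

ES = "http://localhost:9200"
user_id = "3"

# Start the delete-by-query asynchronously, tagged with X-Opaque-Id
resp = requests.post(
    ES + "/my-index/_delete_by_query",
    params={"wait_for_completion": "false"},
    headers={"X-Opaque-Id": user_id},
    json={"query": {"term": {"user": 3}}},
)
task_id = resp.json()["task"]

# Later, look up the task and read the header back to see whose data it deletes
task_info = requests.get(ES + "/_tasks/" + task_id).json()
print(task_info["task"]["headers"].get("X-Opaque-Id"))   # -> "3"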

How to get HBase IP address for Phoenix URL

I can ssh to the Hadoop Cluster and can run the hbase command. But I need to connect using the Phoenix JDBC driver which needs the IP address of the HBase server.
I tried the IP address I used for the cluster with no luck.
This is probably just a generic Hadoop question but where are the IP addresses configured?
If you know the Hadoop cluster namenodes, you can try pinging them or sending a curl request like the one below:
curl 'http://my-namenode-lv-101:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeStatus",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "SecurityEnabled" : false,
    "NNRole" : "NameNode",
    "HostAndPort" : "my-namenode-lv-101:8020",
    "LastHATransitionTime" : 1561605051455,
    "State" : "standby"
  } ]
}
If the state is standby, that is the currently inactive node; try the other nodes to find the one whose State says 'active'. Example below:
curl 'http://my-namenode-lv-102:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'
{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeStatus",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "State" : "active",
    "SecurityEnabled" : false,
    "NNRole" : "NameNode",
    "HostAndPort" : "my-namenode-lv-102:8020",
    "LastHATransitionTime" : 1561605054944
  } ]
}
To connect to Phoenix-HBase, use the ZooKeeper address, port, and the value of the zookeeper.znode.parent configuration for your cluster (it can be found in your hbase-site.xml file).
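With those three values, the thick-driver JDBC URL takes the form jdbc:phoenix:<zookeeper quorum>:<port>:<znode parent>, for example (hypothetical hostnames):

jdbc:phoenix:zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase

The znode parent is commonly /hbase, but some distributions use a different value, so check hbase-site.xml.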

Elasticsearch basics : transportclient or not?

I set up a Graylog stack (Graylog / ES / Mongo) and everything went smoothly (well, almost). Yesterday I tried to get some info using the following command:
curl 'http://127.0.0.1:9200/_nodes/process?pretty'
{
  "cluster_name" : "log_server_graylog",
  "nodes" : {
    "Znz_72SZSyikw6DEC4Wgzg" : {
      "name" : "graylog-27274b66-3bbd-4975-99ee-1ee3d692c522",
      "transport_address" : "127.0.0.1:9350",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1",
      "version" : "2.4.4",
      "build" : "fcbb46d",
      "attributes" : {
        "client" : "true",
        "data" : "false",
        "master" : "false"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 788,
        "mlockall" : false
      }
    },
    "XO77zz8MRu-OOSymZbefLw" : {
      "name" : "test",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1",
      "version" : "2.4.4",
      "build" : "fcbb46d",
      "http_address" : "127.0.0.1:9200",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 946,
        "mlockall" : false
      }
    }
  }
}
It does look like (to me at least) that there are 2 nodes running. Someone on the ES IRC told me that there might be a transport client running (which shows up as a second node)...
I really don't understand where this transport client comes from. Also, the guy from IRC told me it used to be a common setup (using a transport client) but that this is discouraged now. How can I revert the config to follow ES best practices? (which I couldn't find in the docs)
FYI, my config file:
cat /etc/elasticsearch/elasticsearch.yml
cluster.name: log_server_graylog
node.name: test
path.data: /tt/elasticsearch/data
path.logs: /tt/elasticsearch/log
network.host: 127.0.0.1
action.destructive_requires_name: true
# Following are useless as we are setting swappiness to 1; this should prevent ES memory space from being swapped, unless there is an emergency
#bootstrap.mlockall: true
#bootstrap.memory_lock: true
Thanks
I found the answer on the Graylog IRC: the second client is the Graylog client created by... the Graylog server :)
So everything is normal and as expected.
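If you want to double-check which entry is an embedded client, a small sketch (Python with requests, against the same _nodes endpoint used above) prints each node's role attributes:

import requests

nodes = requests.get("http://127.0.0.1:9200/_nodes/process").json()["nodes"]
for node_id, info in nodes.items():
    attrs = info.get("attributes", {})
    # A transport client (like the one Graylog embeds) reports
    # client=true, data=false, master=false and has no http_address
    print(info["name"], attrs.get("client"), attrs.get("data"), attrs.get("master"))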

Mongodb total rows limit

I have a problem with the RESTful interface of MongoDB.
I have submitted this query: http://127.0.0.1:28017/db/collection/?limit=0 (I used limit=0 because I want to get all my results with an AJAX request),
and the result in terms of number of rows is "total_rows" : 38185.
But if I execute db.collection.count() in my shell, the result is 496519.
Why do I have this difference? Is it possible to get the same result with an AJAX request?
Thanks in advance for your help.
I'm fairly sure the result is not limited by the number of rows or by MongoDB itself, but by the built-in web server (originally created for admin tasks). It is probably the response payload size being cut off by the web server, something like HTTP error 413 (request entity too large).
In my tests I see log entries like "[websvr] killcursors: found 1 of 1". This kills the open cursor between the client (in this case the web server) and MongoDB. Most drivers do not need to call OP_KILL_CURSORS because MongoDB defines a timeout of 10 minutes by default.
Going back to my tests, I conclude that the response payload size of the web server (built into MongoDB) is limited to roughly 38-40 MB. Let me show my analysis.
I created a collection with 1,260,000 documents. Querying it through the REST web interface returns total_rows: 379,677 (or avgObjSize * total_rows ≈ 38 MB).
> db.manyrows.stats()
{
  "ns" : "forum.manyrows",
  "count" : 1260000,
  "size" : 125101640,
  "avgObjSize" : 99.28701587301587,
  "storageSize" : 174735360,
  "numExtents" : 12,
  "nindexes" : 1,
  "lastExtentSize" : 50798592,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 0,
  "totalIndexSize" : 48753488,
  "indexSizes" : {
    "_id_" : 48753488
  },
  "ok" : 1
}
=== web output:
{"total_rows" : 379677 , "query" : {} , "millis" : 6793}
Continuing... I dropped/removed some documents from the collection to fit within 38 MB. A new query then returns all documents: 379,642 of 379,642, or 38 MB.
> db.manyrows.stats()
{
  "ns" : "forum.manyrows",
  "count" : 379678,
  "size" : 38172128,
  "avgObjSize" : 100.53816128403543,
  "storageSize" : 174735360,
  "numExtents" : 12,
  "nindexes" : 1,
  "lastExtentSize" : 50798592,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 0,
  "totalIndexSize" : 12329408,
  "indexSizes" : {
    "_id_" : 12329408
  },
  "ok" : 1
}
=== web output:
{"total_rows" : 379678 , "query" : {} , "millis" : 27325}
New sample with another collection: results in 39 MB ("avgObjSize": 3440.35 * "total_rows": 11395 ≈ 39 MB).
> db.messages.stats()
{
  "ns" : "enron.messages",
  "count" : 120477,
  "size" : 414484160,
  "avgObjSize" : 3440.3592386928626,
  "storageSize" : 518516736,
  "numExtents" : 14,
  "nindexes" : 2,
  "lastExtentSize" : 140619776,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 1,
  "totalIndexSize" : 436434880,
  "indexSizes" : {
    "_id_" : 3924480,
    "body_text" : 432510400
  },
  "ok" : 1
}
=== web output:
{
  "total_rows" : 11395 ,
  "query" : {} ,
  "millis" : 2956
}
You can try making the query with a microframework like Bottle instead.
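For example, a minimal sketch with Bottle and PyMongo (the database/collection names, port, and pagination defaults are placeholders) that serves the data in pages instead of relying on the built-in REST interface:

from bottle import Bottle, request, response
from pymongo import MongoClient
from bson import json_util

app = Bottle()
coll = MongoClient("mongodb://127.0.0.1:27017")["db"]["collection"]   # hypothetical names

@app.route("/db/collection/")
def list_docs():
    # Page through the collection instead of returning everything at once
    skip = int(request.query.get("skip", 0))
    limit = int(request.query.get("limit", 1000))
    docs = list(coll.find().skip(skip).limit(limit))
    response.content_type = "application/json"
    return json_util.dumps({"total_rows": coll.count_documents({}), "rows": docs})

app.run(host="127.0.0.1", port=8080)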
