I'm trying to create a visualization of the HDFS block distribution of a cluster.
I plan to build this in Tableau, but I was wondering what type of visualization would give you an idea of which nodes need rebalancing, and also what would be an efficient way to get the server log data into Tableau.
Before investing too much time in this, you might want to take a look at Twitter's open source HDFS-DU project. This provides a view of utilization based on paths within the file system rather than DataNodes within the cluster, but perhaps that's still helpful for your requirements.
If the goal is just to identify nodes in need of rebalancing, then this information is already accessible on the NameNode web UI "Datanodes" tab. You could also run hdfs dfsadmin -report to get utilization stats for each node in a script.
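For a quick scripted check, something along these lines might work (a rough sketch; the exact report labels can vary between Hadoop versions):
# Print each DataNode along with its DFS usage percentage from the report
hdfs dfsadmin -report | grep -E '^(Name|DFS Used%):'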
If none of the above meets your requirements, and you need to proceed with integrating the information into an external reporting tool like Tableau, then a helpful integration point might be the JMX metrics exposed via HTTP on the NameNode. See below for an example curl command that queries some of this information from the NameNode. Note in particular the LiveNodes section, which contains capacity information about each DataNode.
Some additional information about these metrics is available in the Apache Hadoop Metrics documentation.
> curl 'http://127.0.0.1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=NameNodeInfo",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
"Threads" : 46,
"Version" : "3.0.0-alpha2-SNAPSHOT, rdf497b3a739714c567c9c2322608f0659da20cc4",
"Used" : 5263360,
"Free" : 884636377088,
"Safemode" : "",
"NonDfsUsedSpace" : 114431086592,
"PercentUsed" : 5.266863E-4,
"BlockPoolUsedSpace" : 5263360,
"PercentBlockPoolUsed" : 5.266863E-4,
"PercentRemaining" : 88.52252,
"CacheCapacity" : 0,
"CacheUsed" : 0,
"TotalBlocks" : 50,
"NumberOfMissingBlocks" : 0,
"NumberOfMissingBlocksWithReplicationFactorOne" : 0,
"LiveNodes" : "{\"192.168.0.117:9866\":{\"infoAddr\":\"127.0.0.1:9864\",\"infoSecureAddr\":\"127.0.0.1:0\",\"xferaddr\":\"127.0.0.1:9866\",\"lastContact\":2,\"usedSpace\":5263360,\"adminState\":\"In Service\",\"nonDfsUsedSpace\":114431086592,\"capacity\":999334871040,\"numBlocks\":50,\"version\":\"3.0.0-alpha2-SNAPSHOT\",\"used\":5263360,\"remaining\":884636377088,\"blockScheduled\":0,\"blockPoolUsed\":5263360,\"blockPoolUsedPercent\":5.266863E-4,\"volfails\":0}}",
"DeadNodes" : "{}",
"DecomNodes" : "{}",
"BlockPoolId" : "BP-1429209999-10.195.15.240-1484933797029",
"NameDirStatuses" : "{\"active\":{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":\"IMAGE_AND_EDITS\"},\"failed\":{}}",
"NodeUsage" : "{\"nodeUsage\":{\"min\":\"0.00%\",\"median\":\"0.00%\",\"max\":\"0.00%\",\"stdDev\":\"0.00%\"}}",
"NameJournalStatus" : "[{\"manager\":\"FileJournalManager(root=/Users/naurc001/hadoop-deploy-trunk/data/dfs/name)\",\"stream\":\"EditLogFileOutputStream(/Users/naurc001/hadoop-deploy-trunk/data/dfs/name/current/edits_inprogress_0000000000000000862)\",\"disabled\":\"false\",\"required\":\"false\"}]",
"JournalTransactionInfo" : "{\"MostRecentCheckpointTxId\":\"861\",\"LastAppliedOrWrittenTxId\":\"862\"}",
"NNStartedTimeInMillis" : 1485715900031,
"CompileInfo" : "2017-01-03T21:06Z by naurc001 from trunk",
"CorruptFiles" : "[]",
"NumberOfSnapshottableDirs" : 0,
"DistinctVersionCount" : 1,
"DistinctVersions" : [ {
"key" : "3.0.0-alpha2-SNAPSHOT",
"value" : 1
} ],
"SoftwareVersion" : "3.0.0-alpha2-SNAPSHOT",
"NameDirSize" : "{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":2112351}",
"RollingUpgradeStatus" : null,
"ClusterId" : "CID-4526ea43-52e6-4b3f-9ddf-5fd4412e322e",
"UpgradeFinalized" : true,
"Total" : 999334871040
} ]
}
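If you want to flatten that LiveNodes information into something Tableau can ingest directly, here is a rough sketch (assuming jq is installed and the field names match the sample output above):
# Pull per-DataNode capacity figures from the NameNode JMX endpoint and write them as CSV
curl -s 'http://127.0.0.1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo' \
  | jq -r '.beans[0].LiveNodes | fromjson | to_entries[]
      | [.key, .value.capacity, .value.used, .value.remaining, .value.numBlocks] | @csv' \
  > datanode_usage.csv
Each row of the resulting CSV is one DataNode with its capacity, used, remaining, and block count, which Tableau can read as a plain text data source.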
I have a Titan database with Cassandra storage backend, and I am trying to create a mixed index based on two property keys.
I am able to register the index using the following commands:
graph=TitanFactory.open(config);
graph.tx().rollback()
m = graph.openManagement();
m.buildIndex("titleBodyMixed", Vertex.class).addKey(m.getPropertyKey("title")).addKey(m.getPropertyKey("body")).buildMixedIndex("search");
m.commit();
m.awaitGraphIndexStatus(graph, 'titleBodyMixed').status(SchemaStatus.REGISTERED).timeout(3, java.time.temporal.ChronoUnit.MINUTES).call();
When I check, the index is successfully registered after a few seconds. As the next step, I try to reindex the database using the following commands:
m = graph.openManagement();
m.updateIndex(m.getGraphIndex('titleBodyMixed'), SchemaAction.REINDEX).get();
However, the updateIndex command does not finish (even after 12 hours).
I have about 300k entries in the database, and each entry has one title and one body to index.
My question is: how can I speed up the indexing?
When I use the top command, I see that my CPU is not saturated by the indexing processes.
My Titan config file is as below:
config =new BaseConfiguration();
config.setProperty("storage.backend","cassandra");
config.setProperty("storage.hostname", "127.0.0.1");
config.setProperty("storage.cassandra.keyspace", "smartgraph");
config.setProperty("index.search.elasticsearch.interface", "NODE");
config.setProperty("index.search.backend", "elasticsearch");
The following shows the Elasticsearch service properties:
curl -X GET 'http://localhost:9200'
{
"status" : 200,
"name" : "Ms. Marvel",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.7.2",
"build_hash" : "e43676b1385b8125d647f593f7202acbd816e8ec",
"build_timestamp" : "2015-09-14T09:49:53Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}
The idea is that the reindexing process will not start unless all open sessions with the database are closed. You most probably have open sessions, so the reindex job is never triggered.
With a Gremlin script along the lines of the sketch below, you could close all of those sessions. You should see that the indexing takes place afterwards.
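A minimal sketch of what that script might look like, assuming Titan's ManagementSystem API (instance ids and the "(current)" marker may differ in your version):
mgmt = graph.openManagement();
// List every Titan instance that currently has the graph open
mgmt.getOpenInstances().each { println(it) };
// Force-close everything except the current instance, then commit
mgmt.getOpenInstances().findAll { !it.contains('(current)') }.each { mgmt.forceCloseInstance(it) };
mgmt.commit();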
Will that help?
I've got a simple DataPipeline job which has only a single EmrActivity with a single step attempting to execute a Hive script from my S3 bucket.
The config for the EmrActivity looks like this:
{
"name" : "Extract and Transform",
"id" : "HiveActivity",
"type" : "EmrActivity",
"runsOn" : { "ref" : "EmrCluster" },
"step" : ["command-runner.jar,/usr/share/aws/emr/scripts/hive-script --run-hive-script --args -f s3://[bucket-name-removed]/s1-tracer-hql.q -d INPUT=s3://[bucket-name-removed] -d OUTPUT=s3://[bucket-name-removed]"],
"runsOn" : { "ref": "EmrCluster" }
}
And the config for the corresponding EmrCluster resource it's running on:
{
"id" : "EmrCluster",
"type" : "EmrCluster",
"name" : "Hive Cluster",
"keyPair" : "[removed]",
"masterInstanceType" : "m3.xlarge",
"coreInstanceType" : "m3.xlarge",
"coreInstanceCount" : "2",
"coreInstanceBidPrice": "0.10",
"releaseLabel": "emr-4.1.0",
"applications": ["hive"],
"enableDebugging" : "true",
"terminateAfter": "45 Minutes"
}
The error message I'm getting is always the following:
java.io.IOException: Cannot run program "/usr/share/aws/emr/scripts/hive-script --run-hive-script --args -f s3://[bucket-name-removed]/s1-tracer-hql.q -d INPUT=s3://[bucket-name-removed] -d OUTPUT=s3://[bucket-name-removed]" (in directory "."): error=2, No such file or directory
at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:139)
at com.amazonaws.emr.command.runner.CommandRunner.main(CommandRunner.java:13)
...
The main error message is "... (in directory "."): error=2, No such file or directory".
I've logged into the master node and verified the existence of /usr/share/aws/emr/scripts/hive-script. I've also tried specifying an s3-based location for the hive-script, among a few other places; always the same error result.
I can manually create a cluster directly in EMR that looks exactly like what I'm specifying in this DataPipeline, with a Step that uses the identical "command-runner.jar,/usr/share/aws/emr/scripts/hive-script ..." command string, and it works without error.
Has anyone experienced this, and can you advise me on what I'm missing and/or doing wrong? I've been at this one for a while now.
I'm able to answer my own question, after some long research and trial and error.
There were 3 things, maybe 4, wrong with my Step script:
I needed the 'script-runner.jar', rather than the 'command-runner.jar', since we're running a script (which I ended up just pulling from EMR's libs dir on S3)
I needed to get the 'hive-script' from elsewhere, so I also went to the public EMR libs dir in S3 for this
a fun one (thanks, AWS): the Step args (everything after the 'hive-script' specification) need to be comma-separated in DataPipeline, as opposed to space-separated as when specifying args in a Step directly in EMR
And then the "maybe 4th":
I included the base folder in S3 and the specific Hive release we're working with for the hive-script (I added this as a result of seeing something similar in an AWS blog, but haven't yet tested whether it makes a difference in my case; too drained with everything else)
So, in the end, my working EmrActivity ended up looking like this:
{
"name" : "Extract and Transform",
"id" : "HiveActivity",
"type" : "EmrActivity",
"runsOn" : { "ref" : "EmrCluster" },
"step" : ["s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://[bucket-name-removed]/s1-tracer-hql.q,-d,INPUT=s3://[bucket-name-removed],-d,OUTPUT=s3://[bucket-name-removed],-d,LIBS=s3://[bucket-name-removed]"],
"runsOn" : { "ref": "EmrCluster" }
}
Hope this helps save someone else from the same time-sink I invested. Happy coding!
Why do I get these warnings after adding more data to my Elasticsearch?
And the warnings are different every time I browse the dashboard.
"Courier Fetch: 30 of 60 shards failed."
More details:
It's a single node on CentOS 7.1.
/etc/elasticsearch/elasticsearch.yml
index.number_of_shards: 3
index.number_of_replicas: 1
bootstrap.mlockall: true
threadpool.bulk.queue_size: 1000
indices.fielddata.cache.size: 50%
threadpool.index.queue_size: 400
index.refresh_interval: 30s
index.number_of_shards: 5
index.number_of_replicas: 1
/usr/share/elasticsearch/bin/elasticsearch.in.sh
ES_HEAP_SIZE=3G
#I use this Garbage Collector instead of the default one.
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
cluster status
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 61,
"active_shards" : 61,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 61
}
cluster details
{
"cluster_name" : "my_cluster",
"nodes" : {
"some weird number" : {
"name" : "ES 1",
"transport_address" : "inet[localhost/127.0.0.1:9300]",
"host" : "some host",
"ip" : "150.244.58.112",
"version" : "1.4.4",
"build" : "c88f77f",
"http_address" : "inet[localhost/127.0.0.1:9200]",
"process" : {
"refresh_interval_in_millis" : 1000,
"id" : 7854,
"max_file_descriptors" : 65535,
"mlockall" : false
}
}
}
}
I'm curious about "mlockall" : false, because in the yml I did write bootstrap.mlockall: true.
logs
lots of lines like:
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23#a9a34f5
For me, tuning the thread pool search queue_size solved the issue. I tried a number of other things, and this is the one that solved it.
I added this to my elasticsearch.yml
threadpool.search.queue_size: 10000
and then restarted elasticsearch.
Reasoning... (from the docs)
A node holds several thread pools in order to improve how threads
memory consumption are managed within a node. Many of these pools also
have queues associated with them, which allow pending requests to be
held instead of discarded.
and for search in particular...
For count/search operations. Defaults to fixed with a size of int((#
of available_processors * 3) / 2) + 1, queue_size of 1000.
For more information you can refer to the elasticsearch docs here...
I had trouble finding this information so I hope this helps others!
I got this error when my query was missing a closing quote:
field:"value
In my ElasticSearch logs I see these exceptions:
Caused by: org.elasticsearch.index.query.QueryShardException:
Failed to parse query [field:"value]
...
Caused by: org.apache.lucene.queryparser.classic.ParseException:
Cannot parse 'field:"value': Lexical error at line 1, column 13.
Encountered: <EOF> after : "\"value"
In Elasticsearch 5.4, thread_pool has an underscore in it:
thread_pool.search.queue_size: 10000
See the Elasticsearch Thread Pool module documentation.
This is likely an indication that there's a problem with your cluster's health. Without knowing more about your cluster, there's not much more that can be said.
I agree with @Philip's opinion, but it's not necessary to restart Elasticsearch, at least on Elasticsearch >= 1.5.2, because you can dynamically set threadpool.search.queue_size:
curl -XPUT http://your_es:9200/_cluster/settings -d '{
    "transient": {
        "threadpool.search.queue_size": 10000
    }
}'
From Elasticsearch version 5 onwards, it's not possible to update the cluster setting thread_pool.search.queue_size using the _cluster/settings API. In my case, updating the Elasticsearch node yml file is not an option either, since if a node fails, the auto-scaling code would bring up another ES node with default yml settings.
I have a cluster with 3 nodes and 400 active primary shards, with 7 active threads for a queue size of 1000. Increasing the number of nodes to 5 with a similar config resolved the issue, as queries get distributed horizontally to more available nodes.
This will not work on Elasticsearch 5.6:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[colmbmiscxx.xx][172.29.xx.xx:9300][cluster:admin/settings/update]"
}
],
"type": "illegal_argument_exception",
"reason": "transient setting [threadpool.search.queue_size], not dynamically updateable"
},
"status": 400
}
I created a MongoDB account on the MongoLab cloud server, and I have a DB created; it's empty (no collections, just 1 user) as of now.
This is the command given at MongoLab to connect to the DB:
mongo dbh13.mongolab.com:27137/myDB -u <username> -p <password>
Is the username here my MongoLab account credentials, or the user I created in myDB there? I tried both; it's not authorizing.
But if I try to connect directly without authorization (from Windows), it worked with this command:
mongo dbh13.mongolab.com:27137/myDB
But after that, if I try to do something like show dbs or show collections, it fails with the following message:
> show dbs
assert failed : listDatabases failed:{
"assertion" : "unauthorized db:admin lock type:-1 client:38.117.159.162"
,
"assertionCode" : 10057,
"errmsg" : "db assertion failure",
"ok" : 0
}
Fri Aug 12 16:30:50 uncaught exception: assert failed : listDatabases failed:{
"assertion" : "unauthorized db:admin lock type:-1 client:38.117.159.162"
,
"assertionCode" : 10057,
"errmsg" : "db assertion failure",
"ok" : 0
}
Any ideas?
I got the solution for authorization from the Windows shell:
> mongo "dbh13.mongolab.com:27137/myDB"
MongoDB shell version: 1.6.5
connecting to: dbh13.mongolab.com:27137/myDB
> db.auth("<username>","<password>")
http://support.mongolab.com/entries/20177338-i-m-using-the-windows-mongo-shell-and-can-t-connect-help