HDFS Visualization of block distribution - hadoop

I'm trying to create a visualization of the HDFS block distribution of a cluster.
I plan to build this in Tableau, but I was wondering what types of visualizations would give an idea of which nodes need re-balancing, and also what an efficient way would be to get the server log data into Tableau.

Before investing too much time in this, you might want to take a look at Twitter's open source HDFS-DU project. This provides a view of utilization based on paths within the file system rather than DataNodes within the cluster, but perhaps that's still helpful for your requirements.
If the goal is just to identify nodes in need of rebalancing, then this information is already accessible on the NameNode web UI "Datanodes" tab. You could also run hdfs dfsadmin -report in a script to get utilization stats for each node.
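If you want to collect those per-node stats in a script, a rough sketch along these lines may work (assuming the hdfs CLI is on the PATH; the exact report layout can vary between Hadoop versions, so the string matching here is illustrative only):
import subprocess

# Run the dfsadmin report; requires the hdfs CLI and HDFS client configuration.
report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                        capture_output=True, text=True, check=True).stdout

# Print the node name and DFS usage lines from each DataNode section of the report.
for line in report.splitlines():
    if line.startswith("Name:") or line.startswith("DFS Used%"):
        print(line)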
If none of the above meets your requirements, and you need to proceed with integrating the information into an external reporting tool like Tableau, then a helpful integration point might be the JMX metrics exposed via HTTP on the NameNode. See below for an example curl command that queries some of this information from the NameNode. Note in particular the LiveNodes section, which contains capacity information about each DataNode.
Some additional information about these metrics is available in the Apache Hadoop Metrics documentation.
> curl 'http://127.0.0.1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=NameNodeInfo",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
"Threads" : 46,
"Version" : "3.0.0-alpha2-SNAPSHOT, rdf497b3a739714c567c9c2322608f0659da20cc4",
"Used" : 5263360,
"Free" : 884636377088,
"Safemode" : "",
"NonDfsUsedSpace" : 114431086592,
"PercentUsed" : 5.266863E-4,
"BlockPoolUsedSpace" : 5263360,
"PercentBlockPoolUsed" : 5.266863E-4,
"PercentRemaining" : 88.52252,
"CacheCapacity" : 0,
"CacheUsed" : 0,
"TotalBlocks" : 50,
"NumberOfMissingBlocks" : 0,
"NumberOfMissingBlocksWithReplicationFactorOne" : 0,
"LiveNodes" : "{\"192.168.0.117:9866\":{\"infoAddr\":\"127.0.0.1:9864\",\"infoSecureAddr\":\"127.0.0.1:0\",\"xferaddr\":\"127.0.0.1:9866\",\"lastContact\":2,\"usedSpace\":5263360,\"adminState\":\"In Service\",\"nonDfsUsedSpace\":114431086592,\"capacity\":999334871040,\"numBlocks\":50,\"version\":\"3.0.0-alpha2-SNAPSHOT\",\"used\":5263360,\"remaining\":884636377088,\"blockScheduled\":0,\"blockPoolUsed\":5263360,\"blockPoolUsedPercent\":5.266863E-4,\"volfails\":0}}",
"DeadNodes" : "{}",
"DecomNodes" : "{}",
"BlockPoolId" : "BP-1429209999-10.195.15.240-1484933797029",
"NameDirStatuses" : "{\"active\":{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":\"IMAGE_AND_EDITS\"},\"failed\":{}}",
"NodeUsage" : "{\"nodeUsage\":{\"min\":\"0.00%\",\"median\":\"0.00%\",\"max\":\"0.00%\",\"stdDev\":\"0.00%\"}}",
"NameJournalStatus" : "[{\"manager\":\"FileJournalManager(root=/Users/naurc001/hadoop-deploy-trunk/data/dfs/name)\",\"stream\":\"EditLogFileOutputStream(/Users/naurc001/hadoop-deploy-trunk/data/dfs/name/current/edits_inprogress_0000000000000000862)\",\"disabled\":\"false\",\"required\":\"false\"}]",
"JournalTransactionInfo" : "{\"MostRecentCheckpointTxId\":\"861\",\"LastAppliedOrWrittenTxId\":\"862\"}",
"NNStartedTimeInMillis" : 1485715900031,
"CompileInfo" : "2017-01-03T21:06Z by naurc001 from trunk",
"CorruptFiles" : "[]",
"NumberOfSnapshottableDirs" : 0,
"DistinctVersionCount" : 1,
"DistinctVersions" : [ {
"key" : "3.0.0-alpha2-SNAPSHOT",
"value" : 1
} ],
"SoftwareVersion" : "3.0.0-alpha2-SNAPSHOT",
"NameDirSize" : "{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":2112351}",
"RollingUpgradeStatus" : null,
"ClusterId" : "CID-4526ea43-52e6-4b3f-9ddf-5fd4412e322e",
"UpgradeFinalized" : true,
"Total" : 999334871040
} ]
}
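If you do want to pull this into Tableau, one rough approach (just a sketch, not a supported tool) is a small script that polls the JMX endpoint above and flattens the LiveNodes field, which is itself a JSON document encoded as a string, into a CSV file that Tableau can import. The output file name datanode_usage.csv is only a placeholder.
import csv
import json
import urllib.request

JMX_URL = "http://127.0.0.1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"

# Fetch the NameNodeInfo bean shown in the example output above.
with urllib.request.urlopen(JMX_URL) as resp:
    bean = json.load(resp)["beans"][0]

# LiveNodes is a JSON document embedded as a string, so it needs a second parse.
live_nodes = json.loads(bean["LiveNodes"])

# Write one row per DataNode with the capacity fields Tableau can chart directly.
with open("datanode_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["datanode", "capacity", "used", "remaining", "numBlocks"])
    for node, stats in live_nodes.items():
        writer.writerow([node, stats["capacity"], stats["used"],
                         stats["remaining"], stats["numBlocks"]])
A bar chart of used versus remaining per DataNode, or a simple percent-used calculation on these columns, is usually enough to spot nodes that need rebalancing.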

Related

AWS Elasticsearch showing cluster health yellow, how should I fix it?

I am using AWS Elasticsearch. My cluster status has been yellow for the past 48 hours. Following the recommendation provided here:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html
I've updated my cluster to 15 data nodes and 3 master nodes.
Even though each node has around 60 GB of free space, the cluster is still in a yellow state.
When I executed the command GET /_cluster/allocation/explain, I got:
"index" : "***********************************",
"shard" : 4,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2020-10-09T16:19:41.803Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [f6hB7EYOSR-GiJLFXBn01w]: failed recovery, failure RecoveryFailedException[[******************************][4]: Recovery failed from {70c36ff18063566c3a6089f3d696440a}{*******************}{*************}{di}{di_number=39, zone=us-east-1d, distributed_snapshot_deletion_enabled=true} into {**********************}{****************}{*************}{*****}{*******}{di}{distributed_snapshot_deletion_enabled=true, zone=us-east-1d, di_number=39}]; nested: RemoteTransportException[[****************][*********][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1554462628/1.4gb], which is larger than the limit of [1513521152/1.4gb], real usage: [1554460888/1.4gb], new bytes reserved: [1740/1.6kb], usages [request=0/0b, fielddata=621718551/592.9mb, in_flight_requests=73378/71.6kb, accounting=35794764/34.1mb]]; ",
"last_allocation_status" : "no_attempt"
}
This is what it says. How can I resolve this?

Elasticsearch: what does "shard allocation" mean?

We encountered a production incident in which the Elasticsearch cluster health check returned a red status. The health check report shows that .marvel-2019.06.20 has 2 unassigned_shards, which seems to be the root cause.
curl -XGET 'localhost:9200/_cluster/health?level=indices&pretty'
{
"cluster_name" : "sap-jam-jam8",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 122,
"active_shards" : 239,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 7,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"indices" : {
...
...
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
}
We checked the Elasticsearch configuration and found that cluster.routing.allocation had been disabled.
curl -XGET 'localhost:9200/_cluster/settings?pretty'
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"enable" : "none"
}
}
}
}
}
As this Stack Overflow post suggested, we forced a shard to be assigned, and the issue went away.
curl -XPOST -d '{ "commands" : [ {
"allocate" : {
"index" : ".marvel-2014.05.21",
"shard" : 0,
"node" : "SOME_NODE_HERE",
"allow_primary":true
}
} ] }' http://localhost:9200/_cluster/reroute?pretty
After resolving this incident, I think it's necessary to understand the basic concept of shard allocation. I did some research, but the following questions are still confusing to me.
1. Why does Elasticsearch need to assign shards to other nodes?
In my case, we have two Elasticsearch nodes, A and B. Two shards have already been created on A and consume disk space.
When B is not available, why not just activate those two shards on server A?
At least that would return a yellow health status.
2. What is the procedure for assigning a shard?
In the first question, we suppose both the primary shard and the replica have been created on server A. When we say "assign a shard to B", what does that mean?
Does that mean copying the shard from server A to server B?
3. How do I explain zero active shards?
Both the primary shard and the replica have been created, but neither is active. How is that possible? Besides disk storage, is there other overhead to activating a shard, e.g. memory?
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0, // both shards are inactive.
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
4. Is the following assumption true?
To make a shard active, Elasticsearch needs to do the following steps:
Create a shard.
Find a server that has enough disk space and RAM to host it.
Copy this shard from the source server to the destination server.
Activate this shard.
Reference
Elasticsearch blog: Red Elasticsearch Cluster? Panic no longer
Stack overflow: elasticsearch - what to do with unassigned shards
I'm no expert, but here are some thoughts:
You have a 2-node cluster. ES will try to allocate shards on both nodes to provide HA in case one of the nodes fails (and also for better read/write performance).
Related to 1: ES tries to allocate shards across nodes to provide HA. If both shards are allocated on node A and you add node B to the cluster, ES will try to move the replica shards to B.
Only the index is created, not the shards. So it tells you "OK, I've managed to create the index, but I can't find anywhere to place its data". You can find out why by running cat shards (https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html); a small query sketch follows after these points. Shards do cost memory, as their metadata has to be cached.
That's roughly the procedure for adding replica shards, with possibly a few additional conditions (like trying to create the replica on a node that doesn't contain the primary shard, has the least load, and can hold the data...).
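Here is the cat shards sketch mentioned above, a minimal example (assuming the cluster is reachable on localhost:9200) that asks the _cat/shards API for JSON output and prints the shards that are still unassigned, along with the reason Elasticsearch recorded for them:
import json
import urllib.request

# Ask the cat shards API for a JSON listing restricted to a few useful columns.
URL = ("http://localhost:9200/_cat/shards"
       "?format=json&h=index,shard,prirep,state,unassigned.reason")

with urllib.request.urlopen(URL) as resp:
    shards = json.load(resp)

# Print only the shards that have not been assigned to any node.
for s in shards:
    if s["state"] == "UNASSIGNED":
        print(s["index"], s["shard"], s["prirep"], s.get("unassigned.reason"))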

Elasticsearch indexing is very slow

I have a Titan database with Cassandra storage backend, and I am trying to create a mixed index based on two property keys.
I am able to register the index using the following commands:
graph=TitanFactory.open(config);
graph.tx().rollback()
m = graph.openManagement();
m.buildIndex("titleBodyMixed", Vertex.class).addKey(m.getPropertyKey("title")).addKey(m.getPropertyKey("body")).buildMixedIndex("search");
m.commit();
m.awaitGraphIndexStatus(graph, 'titleBodyMixed').status(SchemaStatus.REGISTERED).timeout(3, java.time.temporal.ChronoUnit.MINUTES).call();
When I check, the index is successfully registered after a few seconds. As the next step, I try to reindex the database using the following commands:
m = graph.openManagement();
m.updateIndex(m.getGraphIndex('titleBodyMixed'), SchemaAction.REINDEX).get();
However, the updateIndex command does not finish (even after 12 hours).
I have about 300k entries in the database, and each entry has one title and one body to index.
My question is: how can I speed up the indexing?
When I use the top command, I see that my CPU is not saturated by the indexing processes.
My Titan configuration is as below:
config =new BaseConfiguration();
config.setProperty("storage.backend","cassandra");
config.setProperty("storage.hostname", "127.0.0.1");
config.setProperty("storage.cassandra.keyspace", "smartgraph");
config.setProperty("index.search.elasticsearch.interface", "NODE");
config.setProperty("index.search.backend", "elasticsearch");
The following shows the Elasticsearch service properties:
curl -X GET 'http://localhost:9200'
{
"status" : 200,
"name" : "Ms. Marvel",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.7.2",
"build_hash" : "e43676b1385b8125d647f593f7202acbd816e8ec",
"build_timestamp" : "2015-09-14T09:49:53Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}
The idea is that the reindexing process will not start unless all sessions are closed. You most probably have open sessions with the database, so the reindex job is never triggered.
With this Gremlin script, you could close all sessions. You should see that the indexing will take place afterwards.
Will that help?

Mongodb high CPU - many slow queries on special virtual collection db.$cmd

While diagnosing high CPU usage on MongoDB, we found many slow (6-7 second) queries. All of them are related to "ns" : "mydb.$cmd".
A slow query entry looks like this:
{
"_id" : ObjectId("5571b739f65f7e64bb806362"),
"op" : "command",
"ns" : "mydb.$cmd",
"command" : {
"aggregate" : "MyCollection",
"pipeline" : [
{
"$mergeCursors" : [
{
"host" : "abc:27005",
"id" : NumberLong(82775337156)
}
]
}
]
},
"keyUpdates" : 0,
"numYield" : 0,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(12),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(2),
"w" : NumberLong(2680)
}
},
"responseLength" : 12312,
"millis" : 6142,
"execStats" : {},
"ts" : ISODate("2015-06-05T12:35:40.801Z"),
"client" : "1.1.1.1",
"allUsers" : [],
"user" : ""
}
We are not sure what part of the code is causing these queries. How should we proceed to find and debug which application queries are causing these slow $cmd queries?
Those logs are actually the queries issued when running a command against the specified database (mydb in your case). This is therefore just some aggregation command being run against your MongoDB.
If your application is not doing this directly, it would appear (as documented in http://dbattish.tumblr.com/post/108652372056/joins-in-mongodb) that the $mergecursors variant is used from v2.6 to merge queries across shards.
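If the database profiler is enabled (for example with db.setProfilingLevel(1, 1000)), the entries shown above can be queried directly to see which client addresses the slow commands come from, which usually points back to the application host issuing them. A rough sketch using pymongo, with a placeholder connection string and database name:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]

# Fetch the 20 most recent profiled operations that took longer than one second.
slow_ops = db["system.profile"].find({"millis": {"$gt": 1000}}).sort("ts", -1).limit(20)
for op in slow_ops:
    print(op.get("ts"), op.get("client"), op.get("millis"), op.get("op"), op.get("ns"))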
My tests show that MongoDB always uses ~90-100% CPU when it deals with concurrent requests. Because of this I moved to MySQL. My app, with the same simple queries, works 3x faster with MySQL and uses much less CPU. I will publish an article soon with full testing. For now, just look at the CPU usage of MongoDB and MariaDB for queries with X = 5, 10, 25, 50, 100, 500, 1000 concurrent connections:
siege -b -cX -t1M url
As I realized, the high CPU usage isn't related to query complexity. I mean, even very simple queries under concurrent requests make MongoDB use 100% CPU.
All tests were run with 1 vCPU, 1 GB memory, and a connection pool size of 10.
MongoDB
MySQL
I did many tests with different configurations (4 vCPU, 6 GB memory), and MongoDB always used more CPU than MySQL. What you can try with MongoDB is:
Change the connection pool size. I hope you don't open a connection per query (see the sketch below).
Are you using Mongoose? Try the native Node.js MongoDB driver - it's much faster.
I'm very disappointed in MongoDB for reading data. Not only does MySQL use much less CPU, it was always at least 3x faster!
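On the connection pooling point above: the key is to create a single client per process and reuse it, rather than opening a connection for every query. A minimal sketch with pymongo (the database, collection, and helper names are made up for illustration):
from pymongo import MongoClient

# One client per process; it maintains an internal pool (sized to match the tests above).
client = MongoClient("mongodb://localhost:27017", maxPoolSize=10)
db = client["mydb"]

def find_user(user_id):
    # Each call borrows a pooled connection from the shared client instead of opening a new one.
    return db["users"].find_one({"_id": user_id})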

How do I find out what's wrong with Elasticsearch replication when the status is red

I'm seeing very high CPU on my two Elasticsearch nodes, and profiling shows that it's associated with Elasticsearch replication.
I've executed the health status command:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
which returns this:
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 2003,
"active_shards" : 4006,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 34
}
I can see the status is red, so there is some sort of problem, which presumably is resulting in the high CPU.
But how do I find out what is actually wrong so I can rectify it?
I'd start with the debugging approach outlined here:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_cluster_health.html
To recap, try running:
GET _cluster/health?level=indices
This will give you info about which indices are having issues - but in general, a red status means at least one primary shard is missing, which means you have missing data (not good).
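As a sketch of that first step (assuming the cluster is on localhost:9200), the per-index health can be pulled and filtered down to the indices that are not green:
import json
import urllib.request

URL = "http://localhost:9200/_cluster/health?level=indices"

with urllib.request.urlopen(URL) as resp:
    health = json.load(resp)

# Print only the problem indices along with their unassigned shard counts.
for index, info in health["indices"].items():
    if info["status"] != "green":
        print(index, info["status"], "unassigned:", info["unassigned_shards"])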
I'd take a look at individual node health:
GET _nodes/stats
From there I'd focus on heap (memory) usage and disk usage - in particular, look for a full disk. I'd also log into each node separately to check disk usage.
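Along the same lines, here is a rough sketch that pulls _nodes/stats and reports heap and disk usage per node; the field names follow recent Elasticsearch versions and may differ slightly in older releases.
import json
import urllib.request

# Restrict the stats call to the JVM and filesystem sections we care about.
URL = "http://localhost:9200/_nodes/stats/jvm,fs"

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    fs = node["fs"]["total"]
    free_gb = fs["free_in_bytes"] / 1024 ** 3
    total_gb = fs["total_in_bytes"] / 1024 ** 3
    print(f"{node['name']}: heap {heap_pct}%, disk {free_gb:.1f}/{total_gb:.1f} GB free")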
