AWS Data Pipeline: EmrActivity step to run a Hive script fails immediately with 'No such file or directory'

I've got a simple DataPipeline job with a single EmrActivity, whose only step attempts to execute a Hive script from my S3 bucket.
The config for the EmrActivity looks like this:
{
"name" : "Extract and Transform",
"id" : "HiveActivity",
"type" : "EmrActivity",
"runsOn" : { "ref" : "EmrCluster" },
"step" : ["command-runner.jar,/usr/share/aws/emr/scripts/hive-script --run-hive-script --args -f s3://[bucket-name-removed]/s1-tracer-hql.q -d INPUT=s3://[bucket-name-removed] -d OUTPUT=s3://[bucket-name-removed]"]
}
And the config for the corresponding EmrCluster resource it's running on:
{
"id" : "EmrCluster",
"type" : "EmrCluster",
"name" : "Hive Cluster",
"keyPair" : "[removed]",
"masterInstanceType" : "m3.xlarge",
"coreInstanceType" : "m3.xlarge",
"coreInstanceCount" : "2",
"coreInstanceBidPrice": "0.10",
"releaseLabel": "emr-4.1.0",
"applications": ["hive"],
"enableDebugging" : "true",
"terminateAfter": "45 Minutes"
}
The error message I'm getting is always the following:
java.io.IOException: Cannot run program "/usr/share/aws/emr/scripts/hive-script --run-hive-script --args -f s3://[bucket-name-removed]/s1-tracer-hql.q -d INPUT=s3://[bucket-name-removed] -d OUTPUT=s3://[bucket-name-removed]" (in directory "."): error=2, No such file or directory
at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:139)
at com.amazonaws.emr.command.runner.CommandRunner.main(CommandRunner.java:13)
...
The key part of the message is "... (in directory "."): error=2, No such file or directory".
I've logged into the master node and verified the existence of /usr/share/aws/emr/scripts/hive-script. I've also tried specifying an s3-based location for the hive-script, among a few other places; always the same error result.
I can manually create a cluster directly in EMR that looks exactly like what I'm specifying in this DataPipeline, with a Step that uses the identical "command-runner.jar,/usr/share/aws/emr/scripts/hive-script ..." command string, and it works without error.
Has anyone experienced this and can advise me on what I'm missing or doing wrong? I've been at this one for a while now.

I was able to answer my own question after a lot of research and trial and error.
There were three things, maybe four, wrong with my step definition:
1. I needed 'script-runner.jar' rather than 'command-runner.jar', as we're running a script. I ended up just pulling it from EMR's public libs dir on S3.
2. The 'hive-script' had to come from elsewhere, so I also went to the public EMR libs dir on S3 for this.
3. A fun one (yay, thanks AWS): in DataPipeline, every value in the step's args (everything after the 'hive-script' specification) must be comma-separated, as opposed to space-separated as when you specify args in a Step directly in EMR.
And then the "maybe fourth":
4. I included the base path in S3 and the Hive version we're working with for the hive-script. I added this after seeing something similar in an AWS blog, but haven't yet tested whether it makes a difference in my case.
So, in the end, my working EmrActivity looked like this:
{
"name" : "Extract and Transform",
"id" : "HiveActivity",
"type" : "EmrActivity",
"runsOn" : { "ref" : "EmrCluster" },
"step" : ["s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://[bucket-name-removed]/s1-tracer-hql.q,-d,INPUT=s3://[bucket-name-removed],-d,OUTPUT=s3://[bucket-name-removed],-d,LIBS=s3://[bucket-name-removed]"]
}
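The difference between space-separated EMR Step args and comma-separated DataPipeline step args is easy to get wrong by hand. Here is a minimal Python sketch of the conversion. The helper name to_pipeline_step and the bucket name my-bucket are made up for illustration and are not part of any AWS tooling. It splits an EMR-style argument string and re-joins it with commas:

```python
import shlex


def to_pipeline_step(jar, emr_args):
    """Build a DataPipeline "step" string (hypothetical helper).

    jar      -- location of the jar, the first element of the step string
    emr_args -- arguments as you would type them in an EMR Step (space-separated)
    """
    parts = [jar] + shlex.split(emr_args)
    for part in parts:
        # A literal comma inside a value would be read as a separator.
        # Escaping is deliberately not handled in this sketch.
        if "," in part:
            raise ValueError("comma inside argument: %r" % part)
    return ",".join(parts)


# "my-bucket" is a placeholder, not a real bucket.
step = to_pipeline_step(
    "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
    "s3://us-east-1.elasticmapreduce/libs/hive/hive-script "
    "--run-hive-script --args -f s3://my-bucket/s1-tracer-hql.q "
    "-d INPUT=s3://my-bucket",
)
print(step)
```

The resulting string is what would go inside the "step" array of the EmrActivity.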
Hope this helps save someone else from the same time-sink I invested. Happy coding!

Related

Indices with ! in their name can't be selected for restore

I have an ES cluster with index names like web.analytics.data.api!monthly!2018-07_v0, and I take regular snapshots/backups.
When I restore all of them, everything works well. If I want to restore just one specific index, however, ES won't do it. The command I use:
curl -X POST "localhost:9200/_snapshot/s3_backups/20191218_060001/_restore?pretty&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"indices": "web.analytics.data.api!monthly!2018-07_v0",
"index_settings": {
"index.number_of_replicas": 0
}
}
'
The result I get is:
{
"snapshot" : {
"snapshot" : "20191218_060001",
"indices" : [ ],
"shards" : {
"total" : 0,
"failed" : 0,
"successful" : 0
}
}
}
Please note that if I use an index without ! in its name (e.g. .kibana), it works well. Any ideas how I can solve this, preferably without telling developers to rename the indices? The ES in question is version 1.7.3. I am aware it is EOL, but it is what I have to work with right now.
So it was my bad in the end. The index name I was given did not exist (it had a typo in it), but since I had been told that ! is problematic, I did not double-check it. The test indices were picked by me, so of course those names were correct...
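To catch this kind of typo before restoring, you can compare the requested names against what the snapshot actually contains. The following is only a sketch: the function names are made up, it assumes the snapshot info endpoint (GET /_snapshot/<repo>/<snapshot>) returns a "snapshots" list whose first entry has an "indices" array, and it does not handle wildcard patterns.

```python
import difflib
import json
import urllib.request


def snapshot_indices(base_url, repo, snapshot):
    """Return the index names stored in one snapshot (needs a reachable cluster)."""
    url = "%s/_snapshot/%s/%s" % (base_url, repo, snapshot)
    with urllib.request.urlopen(url) as resp:
        info = json.load(resp)
    return info["snapshots"][0]["indices"]


def check_names(wanted, available):
    """Map every requested name that is missing to its closest available names."""
    missing = {}
    for name in wanted:
        if name not in available:
            missing[name] = difflib.get_close_matches(name, available, n=3)
    return missing


# Offline example with a deliberately misspelled name ("montly").
available = ["web.analytics.data.api!monthly!2018-07_v0", ".kibana"]
problems = check_names(["web.analytics.data.api!montly!2018-07_v0"], available)
print(problems)
```

Against a live cluster, available would come from something like snapshot_indices("http://localhost:9200", "s3_backups", "20191218_060001"). An empty result from check_names means every requested name exists in the snapshot.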

In Nightwatch, how do I specify additional string arguments after the selenium_port?

I came across a similar question here which wasn't truly addressed: https://github.com/nightwatchjs/nightwatch/issues/1911
You cannot do what @beatfactor suggested with the above example, because the port is in the middle, i.e. "selenium_host" : "us1.appium.testobject.com:443/wd/hub",
I'm facing a similar problem right now: how do I provide arguments so that it attempts to hit a host like the above? Currently, my failing options are providing no port, which defaults to 4444, or providing a port, which results in an attempt to hit us1.appium.testobject.com/wd/hub:443
The desired result is:
"selenium_host" : "us1.appium.testobject.com:443/wd/hub",
TL;DR: How do you provide a port in the middle of your selenium host argument, given that the port is always appended to the end and a default is used if you don't provide one?
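Just define your selenium_port upstream, in the declaration section, and use a template literal. Note that this only works in a JavaScript config file (e.g. nightwatch.conf.js), not in a plain JSON one:
// Declared once, reused for both the port setting and the host string.
const selenium_port = '443';
module.exports = {
"test_settings" : {
"default" : {
"launch_url" : "http://test.com",
"selenium_port" : selenium_port,
"selenium_host" : `us1.appium.testobject.com:${selenium_port}/wd/hub`,
"silent" : true,
"screenshots" : {
"enabled" : true,
"path" : "screenshots"
}
}
}
};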
Hope I understood correctly. Cheers!

HDFS Visualization of block distribution

I'm trying to create a visualization of the HDFS block distribution of a cluster.
I plan to create this using Tableau, but was wondering what type of visualization would give an idea of which nodes need rebalancing, and whether there is an efficient way to get the server log data into Tableau.
Before investing too much time in this, you might want to take a look at Twitter's open source HDFS-DU project. This provides a view of utilization based on paths within the file system rather than DataNodes within the cluster, but perhaps that's still helpful for your requirements.
If the goal is just to identify nodes in need of rebalancing, then this information is already accessible on the NameNode web UI "Datanodes" tab. You could also run hdfs dfsadmin -report to get utilization stats for each node in a script.
If none of the above meets your requirements, and you need to proceed with integrating the information into an external reporting tool like Tableau, then a helpful integration point might be the JMX metrics exposed via HTTP on the NameNode. See below for an example curl command that queries some of this information from the NameNode. Note in particular the LiveNodes section, which contains capacity information about each DataNode.
Some additional information about these metrics is available in the Apache Hadoop Metrics documentation.
> curl 'http://127.0.0.1:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=NameNodeInfo",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.FSNamesystem",
"Threads" : 46,
"Version" : "3.0.0-alpha2-SNAPSHOT, rdf497b3a739714c567c9c2322608f0659da20cc4",
"Used" : 5263360,
"Free" : 884636377088,
"Safemode" : "",
"NonDfsUsedSpace" : 114431086592,
"PercentUsed" : 5.266863E-4,
"BlockPoolUsedSpace" : 5263360,
"PercentBlockPoolUsed" : 5.266863E-4,
"PercentRemaining" : 88.52252,
"CacheCapacity" : 0,
"CacheUsed" : 0,
"TotalBlocks" : 50,
"NumberOfMissingBlocks" : 0,
"NumberOfMissingBlocksWithReplicationFactorOne" : 0,
"LiveNodes" : "{\"192.168.0.117:9866\":{\"infoAddr\":\"127.0.0.1:9864\",\"infoSecureAddr\":\"127.0.0.1:0\",\"xferaddr\":\"127.0.0.1:9866\",\"lastContact\":2,\"usedSpace\":5263360,\"adminState\":\"In Service\",\"nonDfsUsedSpace\":114431086592,\"capacity\":999334871040,\"numBlocks\":50,\"version\":\"3.0.0-alpha2-SNAPSHOT\",\"used\":5263360,\"remaining\":884636377088,\"blockScheduled\":0,\"blockPoolUsed\":5263360,\"blockPoolUsedPercent\":5.266863E-4,\"volfails\":0}}",
"DeadNodes" : "{}",
"DecomNodes" : "{}",
"BlockPoolId" : "BP-1429209999-10.195.15.240-1484933797029",
"NameDirStatuses" : "{\"active\":{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":\"IMAGE_AND_EDITS\"},\"failed\":{}}",
"NodeUsage" : "{\"nodeUsage\":{\"min\":\"0.00%\",\"median\":\"0.00%\",\"max\":\"0.00%\",\"stdDev\":\"0.00%\"}}",
"NameJournalStatus" : "[{\"manager\":\"FileJournalManager(root=/Users/naurc001/hadoop-deploy-trunk/data/dfs/name)\",\"stream\":\"EditLogFileOutputStream(/Users/naurc001/hadoop-deploy-trunk/data/dfs/name/current/edits_inprogress_0000000000000000862)\",\"disabled\":\"false\",\"required\":\"false\"}]",
"JournalTransactionInfo" : "{\"MostRecentCheckpointTxId\":\"861\",\"LastAppliedOrWrittenTxId\":\"862\"}",
"NNStartedTimeInMillis" : 1485715900031,
"CompileInfo" : "2017-01-03T21:06Z by naurc001 from trunk",
"CorruptFiles" : "[]",
"NumberOfSnapshottableDirs" : 0,
"DistinctVersionCount" : 1,
"DistinctVersions" : [ {
"key" : "3.0.0-alpha2-SNAPSHOT",
"value" : 1
} ],
"SoftwareVersion" : "3.0.0-alpha2-SNAPSHOT",
"NameDirSize" : "{\"/Users/naurc001/hadoop-deploy-trunk/data/dfs/name\":2112351}",
"RollingUpgradeStatus" : null,
"ClusterId" : "CID-4526ea43-52e6-4b3f-9ddf-5fd4412e322e",
"UpgradeFinalized" : true,
"Total" : 999334871040
} ]
}
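Note that LiveNodes in the response above is itself a JSON document encoded as a string, so it has to be decoded a second time. A rough Python sketch of turning it into per-DataNode rows (the function names are made up, and the field names are taken from the sample output above) might look like this:

```python
import json
import urllib.request

JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"


def fetch_namenode_info(base_url):
    """Fetch the NameNodeInfo bean, e.g. base_url='http://127.0.0.1:9870'."""
    with urllib.request.urlopen(base_url + JMX_QUERY) as resp:
        return json.load(resp)


def node_utilization(jmx):
    """Return one row per live DataNode, most heavily used first."""
    # LiveNodes is a JSON string nested inside the JSON response.
    live = json.loads(jmx["beans"][0]["LiveNodes"])
    rows = []
    for node, stats in live.items():
        capacity = stats["capacity"]
        used_pct = 100.0 * stats["usedSpace"] / capacity if capacity else 0.0
        rows.append({
            "node": node,
            "used_pct": used_pct,
            "num_blocks": stats["numBlocks"],
        })
    return sorted(rows, key=lambda r: r["used_pct"], reverse=True)


# Offline example using the values from the sample response above.
sample = {"beans": [{"LiveNodes": json.dumps({
    "192.168.0.117:9866": {
        "usedSpace": 5263360, "capacity": 999334871040, "numBlocks": 50,
    },
})}]}
for row in node_utilization(sample):
    print(row)
```

Rows in this shape can be written out as CSV (for example with csv.DictWriter) and loaded into Tableau. A wide spread in used_pct between nodes is the signal that rebalancing may be needed.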

Is there a way I can get historic performance data of various alerts in Nagios as json/xml?

I am looking to get performance data of various alerts setup in my Nagios Core/XI. I think it is stored in RRDs. Are there ways I can get access to it?
If you're using Nagios XI you can get this data a few different ways.
If you're using XI 5 or later, then the easiest way that springs to mind is the API. Log in to your XI server as an administrator, navigate to 'Help' menu, then select 'Objects Reference' on the left hand side navigation and find 'GET objects/rrdexport' from the Objects Reference navigation box (or just scroll down to near the bottom).
An example curl might look like this:
curl -XGET "http://nagiosxi/nagiosxi/api/v1/objects/rrdexport?apikey=YOURAPIKEY&pretty=1&host_name=localhost"
Your response should look something like:
{
"meta": {
"start": "1453838100",
"step": "300",
"end": "1453838400",
"rows": "2",
"columns": "4",
"legend": {
"entry": [
"rta",
"pl",
"rtmax",
"rtmin"
]
}
},
"data": {
"row": [
{
"t": "1453838100",
"v": [
"6.0373333333e-03",
"0.0000000000e+00",
"1.7536000000e-02",
"3.0000000000e-03"
]
},
{
"t": "1453838400",
"v": [
"6.0000000000e-03",
"0.0000000000e+00",
"1.7037333333e-02",
"3.0000000000e-03"
]
}
]
}
}
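If you want to post-process that response, the legend entries line up positionally with each row's "v" array. As a sketch (the function name is made up, and it assumes the response always has the shape shown above, with "legend.entry" and "v" as lists), the rows can be turned into labelled records like so:

```python
import json


def rrdexport_to_rows(payload):
    """Convert an rrdexport response into a list of labelled records."""
    legend = payload["meta"]["legend"]["entry"]
    rows = []
    for row in payload["data"]["row"]:
        record = {"t": int(row["t"])}  # Unix timestamp of the sample
        for name, value in zip(legend, row["v"]):
            record[name] = float(value)
        rows.append(record)
    return rows


# Trimmed copy of the example response above.
response_text = """
{"meta": {"legend": {"entry": ["rta", "pl", "rtmax", "rtmin"]}},
 "data": {"row": [
   {"t": "1453838100",
    "v": ["6.0373333333e-03", "0.0000000000e+00",
          "1.7536000000e-02", "3.0000000000e-03"]}
 ]}}
"""
records = rrdexport_to_rows(json.loads(response_text))
print(records)
```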
BUT WAIT, THERE IS ANOTHER WAY
This way will work no matter what version you're on, and would actually work if you were processing performance data with NPCD on a Core system as well.
Log in to your server via ssh or console and get your butt over to the /usr/local/nagios/share/perfdata directory. From here we're going to use the localhost object as an example.
$ cd /usr/local/nagios/share/perfdata/
$ ls
localhost
$ cd localhost/
$ ls
Current_Load.rrd Current_Users.xml HTTP.rrd PING.xml SSH.rrd Swap_Usage.xml
Current_Load.xml _HOST_.rrd HTTP.xml Root_Partition.rrd SSH.xml Total_Processes.rrd
Current_Users.rrd _HOST_.xml PING.rrd Root_Partition.xml Swap_Usage.rrd Total_Processes.xml
$ rrdtool dump _HOST_.rrd
Once you run the rrdtool dump command, there is going to be an awful lot of output, so I'll leave that as an exercise for you, the reader ;)
If you're trying to automate something of some kind, then you should note that the xml files contain meta data for the rrd files and could potentially be useful to parse first.
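For automation, a lighter-weight alternative to parsing the full dump is rrdtool fetch, which prints a header line of data source names followed by "timestamp: value value ..." lines. The sketch below is an assumption-laden example rather than anything Nagios ships: the function names are made up, the data source names "1" and "2" in the sample are illustrative, and it assumes that plain-text output layout.

```python
import math
import subprocess


def parse_fetch(text):
    """Parse `rrdtool fetch` text output into (timestamp, {ds_name: value}) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()  # header line holds the data source names
    rows = []
    for ln in lines[1:]:
        ts, _, values = ln.partition(":")
        # float() also accepts "nan" and "-nan" for unknown samples.
        rows.append((int(ts), dict(zip(names, map(float, values.split())))))
    return rows


def fetch(rrd_path, cf="AVERAGE"):
    """Run rrdtool fetch on a file; requires rrdtool on the PATH."""
    out = subprocess.run(
        ["rrdtool", "fetch", rrd_path, cf],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_fetch(out)


# Offline example with made-up output in the assumed layout.
sample_output = (
    "                          1                   2\n"
    "\n"
    "1453838100: 6.0373333333e-03 0.0000000000e+00\n"
    "1453838400: -nan -nan\n"
)
parsed = parse_fetch(sample_output)
print(parsed)
```

On the server itself this would be called as something like fetch("/usr/local/nagios/share/perfdata/localhost/_HOST_.rrd").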
Also, if you're anything like me, you love reading technical manuals. Here is a great one to read: RRDTool documentation
Hope this helped!

Elasticsearch indexing is very slow

I have a Titan database with Cassandra storage backend, and I am trying to create a mixed index based on two property keys.
I am able to register the index using the following commands:
graph=TitanFactory.open(config);
graph.tx().rollback()
m = graph.openManagement();
m.buildIndex("titleBodyMixed", Vertex.class).addKey(m.getPropertyKey("title")).addKey(m.getPropertyKey("body")).buildMixedIndex("search");
m.commit();
m.awaitGraphIndexStatus(graph, 'titleBodyMixed').status(SchemaStatus.REGISTERED).timeout(3, java.time.temporal.ChronoUnit.MINUTES).call();
When I check, the index is successfully registered after a few seconds. As the next step, I try to reindex the database using the following commands:
m = graph.openManagement();
m.updateIndex(m.getGraphIndex('titleBodyMixed'), SchemaAction.REINDEX).get();
However, the updateIndex command does not finish (even after 12 hours).
I have about 300k data entries in the database, and each entry has one title and one body to index.
My question is: how can I speed up the indexing?
When I use the top command, I see that my CPU is not saturated by the indexing processes.
My Titan config is as below:
config =new BaseConfiguration();
config.setProperty("storage.backend","cassandra");
config.setProperty("storage.hostname", "127.0.0.1");
config.setProperty("storage.cassandra.keyspace", "smartgraph");
config.setProperty("index.search.elasticsearch.interface", "NODE");
config.setProperty("index.search.backend", "elasticsearch");
The following shows the Elasticsearch service properties:
curl -X GET 'http://localhost:9200'
{
"status" : 200,
"name" : "Ms. Marvel",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.7.2",
"build_hash" : "e43676b1385b8125d647f593f7202acbd816e8ec",
"build_timestamp" : "2015-09-14T09:49:53Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}
The point is that the reindexing process will not start until all sessions are closed. You most probably have sessions open with the database, so the reindex job is never triggered.
With this Gremlin script, you could close all sessions. You should see that the indexing will take place afterwards.
Will that help?
