Unable to pass consul kv to nomad job - consul

I want to pass multiple Consul key-value pairs to my Nomad job, and this is the template from the documentation that I've tried to use in the job:
template {
  data = <<EOF
{{range ls "app/test"}}
{{.Key}}={{.Value}}
{{end}}
EOF
  destination = "env/file.env"
  env         = true
}
When I run consul kv get -recurse app/test, this is the result:
app/test/:
app/test/APP_NAME:consul
app/test/TEST_KEY:123
I'm not getting the Consul key-values into the job even though the Nomad job runs without any errors. Is there other configuration on the job, or perhaps on the setup, that I might have missed?
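For reference, a quick way to sanity-check both sides of this setup is to populate the keys with consul kv put and then look at what the template actually rendered inside the allocation. This is only a verification sketch, not a fix; the allocation ID is a placeholder you would take from nomad job status:
# Populate the keys the template reads (same paths as shown above)
consul kv put app/test/APP_NAME consul
consul kv put app/test/TEST_KEY 123
# Confirm Consul returns them
consul kv get -recurse app/test
# Browse the allocation's filesystem to see whether env/file.env was rendered
# (<alloc-id> is a placeholder taken from `nomad job status <job-name>`)
nomad alloc fs <alloc-id>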

Related

Running spark jobs on emr using airflow

I have an EC2 instance and an EMR cluster. I want to run Spark jobs on EMR using Airflow. Where would Airflow need to be installed for this?
On the EC2 instance.
On the EMR master node.
I am considering using the SparkSubmitOperator for this. What arguments should I provide while creating the Airflow task?
You will be installing Airflow on EC2, and I would suggest installing a containerized version of it. See this answer.
For submitting Spark jobs, you will need the EmrAddStepsOperator from Airflow, and you will need to provide the step for spark-submit.
(Note: if you are starting the cluster from the script, you will need to use EmrCreateJobFlowOperator as well; see details here.)
A typical submit step will look something like this:
spark_submit_step = [
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit',
                     '--jars',
                     '/emr/instance-controller/lib/bootstrap-actions/1/spark-iforest-2.4.0.jar,/home/hadoop/mysql-connector-java-5.1.47.jar',
                     '--py-files',
                     '/home/hadoop/mysqlConnect.py',
                     '/home/hadoop/main.py',
                     'custom_argument',        # literal argument passed to main.py
                     another_custom_argument,  # variables holding further arguments
                     another_custom_argument]
        }
    }
]
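For comparison, the equivalent step can be submitted outside Airflow with the AWS CLI, which is a handy way to verify the spark-submit arguments before wiring them into EmrAddStepsOperator. This is a minimal sketch; the cluster ID and the --py-files/script paths are placeholders:
# Submit the same spark-submit step directly (j-XXXXXXXXXXXXX is a placeholder cluster ID)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="Run Spark",ActionOnFailure=TERMINATE_CLUSTER,\
Args=[--py-files,/home/hadoop/mysqlConnect.py,/home/hadoop/main.py,custom_argument]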

YARN log aggregation on aws emr - expecting one single container file but still find multiple

I am trying to aggregate all the container logs for an application into a single file, to make debugging the Spark application easier.
I configured YARN log aggregation as suggested by https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html. From the docs, "Log aggregation (Hadoop 2.x) compiles logs from all containers for an individual application into a single file." This sounds like exactly what I need.
I created an EMR cluster with:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.log-aggregation-enable": "true",
      "yarn.log-aggregation.retain-seconds": "-1",
      "yarn.nodemanager.remote-app-log-dir": "s3://mybucket/emr_log"
    }
  }
]
I verified that the above configuration is present in yarn-site.xml on the new cluster. But after submitting jobs to the cluster, I still see multiple container logs when looking at the S3 log directory.
I expected to see one file under s3://mybucket/emr_log/cluster_id/containers/application_id/stderr.gz
But I still found several container directories under the application_id directories:
Ex:
s3://mybucket/emr_log/cluster_id/containers/application_id/container_XXXXXXXX_0001_01_000001/stderr.gz
s3://mybucket/emr_log/cluster_id/containers/application_id/container_XXXXXXXX_0001_01_000002/stderr.gz
s3://mybucket/emr_log/cluster_id/containers/application_id/container_XXXXXXXX_0001_01_000003/stderr.gz
Am I missing anything? Is it possible to have one single container log for one application?
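One thing that does give a single consolidated view per application, once aggregation has completed, is the yarn logs command, which stitches the aggregated per-container logs together. A minimal sketch, assuming the application ID is known:
# Concatenate all aggregated container logs for one application into a single file
# (application_XXXXXXXX_0001 is a placeholder application ID)
yarn logs -applicationId application_XXXXXXXX_0001 > application_XXXXXXXX_0001.log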

YARN: get containers by applicationId

I'd like to list the nodes on which the containers are running for a particular MR job.
I only have the application_id.
Is it possible to do it with Hadoop REST API and/or through command line?
This can be done using the yarn command.
Run yarn applicationattempt -list <Application Id> to get an app attempt id
Run yarn container -list <Application Attempt Id> to get the container ids
Run yarn container -status <Container Id> to get the host for any particular container.
If you want this in a bash script, or want to get every host for an application with a large number of containers, you will probably want to parse out the attempt/container IDs and hosts (see the sketch below), but this is at least a start.
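Along those lines, here is a rough bash sketch that chains the three commands and prints the host for each container. The grep/awk parsing is an assumption; the exact field names in the yarn container -status output may differ between Hadoop versions:
#!/usr/bin/env bash
# Print the assigned host for every container of an application (application ID passed as $1)
APP_ID="$1"
# Take the latest application attempt ID
ATTEMPT_ID=$(yarn applicationattempt -list "$APP_ID" 2>/dev/null \
  | grep -oE 'appattempt_[0-9_]+' | tail -n 1)
# For every container in that attempt, look up its status and pull out the host field
for CONTAINER_ID in $(yarn container -list "$ATTEMPT_ID" 2>/dev/null \
  | grep -oE 'container_[0-9e_]+'); do
  HOST=$(yarn container -status "$CONTAINER_ID" 2>/dev/null \
    | grep -iE '^[[:space:]]*(host|assigned-node)' | head -n 1 | awk -F': ' '{print $2}')
  echo "$CONTAINER_ID -> $HOST"
done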
You can find them using the Resource Manager UI. Find your application by ID among the existing applications and click on the link with the ID you have. You will see your application's stats. Find the tracking URL and click on the 'History' link. There you'll be able to find the tasks in your map and reduce operations. You can open each task and see its information: which node it was assigned to, the number of attempts, the logs for each task and attempt, and lots of other useful information.
To get information about container status from the command line, you can use the yarn container -status command from bash.

Kafka console producer Error in Hortonworks HDP 2.3 Sandbox

I have searched all over and couldn't find the cause of the error.
I have checked this Stack Overflow issue, but it is not the problem in my case.
I started a ZooKeeper server. The command to start the server was:
bin/zookeeper-server-start.sh config/zookeeper.properties
Then I SSHed into the VM using PuTTY and started the Kafka server using:
$ bin/kafka-server-start.sh config/server.properties
Then I created a Kafka topic, and when I list the topics, it appears.
Then I opened another PuTTY session, started kafka-console-producer.sh, typed a message (even just Enter), and got this long repetitive exception.
The configuration files zookeeper.properties, server.properties, and kafka-producer.properties are as follows (respectively).
The version of Kafka I am running is 8.2.2-something, as I saw in the kafka/libs folder.
P.S. I get no messages in the consumer.
Can anybody figure out the problem?
The tutorial I was following was http://www.bogotobogo.com/Hadoop/BigData_hadoop_Zookeeper_Kafka_single_node_single_broker_cluster.php
On the Hortonworks sandbox, have a look at the server configuration:
$ less /etc/kafka/conf/server.properties
In my case it said
...
listeners=PLAINTEXT://sandbox.hortonworks.com:6667
...
This means you have to use the following commands to successfully connect with the console producer:
$ cd /usr/hdp/current/kafka-broker
$ bin/kafka-console-producer.sh --topic test --broker-list sandbox.hortonworks.com:6667
It won't work if you use --broker-list 127.0.0.1:6667 or --broker-list localhost:6667. See also http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/configure_kafka.html
To consume the messages, use:
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
As you mentioned in your question, you are using HDP 2.3, so when you are running the console producer you need to provide sandbox.hortonworks.com:6667 as the broker list.
Please use the same while running the console consumer.
Please let me know in case you still face any issue.
Within Kafka there is an internal conversation that goes on between clients (producers and consumers) and the broker (server). During those conversations, clients often ask the server for the address of the broker that is managing a particular partition. The answer is always a fully-qualified host name. Without going into specifics, if you ever refer to a broker by an address that is not that broker's fully-qualified host name, there are situations where the Kafka implementation runs into trouble.
Another mistake that's easy to make, especially with the Sandbox, is referring to a broker by an address that is not defined in DNS. That's why every node in the cluster has to be able to address every other node by fully-qualified host name. It's also why, when accessing the sandbox from another virtual image running on the same machine, you have to add sandbox.hortonworks.com to that image's hosts file.
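To tie both points together, the checks below show how one might confirm the advertised listener and make the sandbox host name resolvable from the client machine. This is a sketch only; 192.168.56.101 is a placeholder for whatever IP your sandbox VM actually has:
# See which host:port the broker advertises to clients
grep listeners /etc/kafka/conf/server.properties
# Make the host name resolvable from the client machine
# (192.168.56.101 is a placeholder for your sandbox VM's IP address)
echo '192.168.56.101  sandbox.hortonworks.com' | sudo tee -a /etc/hosts
ping -c 1 sandbox.hortonworks.com
# Then point the console producer at the fully-qualified name
/usr/hdp/current/kafka-broker/bin/kafka-console-producer.sh \
  --topic test --broker-list sandbox.hortonworks.com:6667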

How to sync Hadoop configuration files to multiple nodes?

I used to manage a cluster of only 3 CentOS machines running Hadoop, so scp was enough for me to copy the configuration files to the other 2 machines.
However, I now have to set up a Hadoop cluster of more than 10 machines. It is really frustrating to sync the files so many times using scp.
I want a tool with which I can easily sync the files to all machines. The machine names are defined in a config file, such as:
node1
node2
...
node10
Thanks.
If you do not want to use ZooKeeper, you can modify your hadoop script in $HADOOP_HOME/bin/hadoop and add something like:
if [ "$COMMAND" == "deployConf" ]; then
for HOST in `cat $HADOOP_HOME/conf/slaves`
do
scp $HADOOP_HOME/conf/mapred-site.xml $HOST:$HADOOP_HOME/conf
scp $HADOOP_HOME/conf/core-site.xml $HOST:$HADOOP_HOME/conf
scp $HADOOP_HOME/conf/hdfs-site.xml $HOST:$HADOOP_HOME/conf
done
exit 0
fi
That's what I'm using now and it does the job.
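For reference, with that branch in place the configuration push is triggered like this (assuming conf/slaves lists all target hosts):
$HADOOP_HOME/bin/hadoop deployConf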
Use Zookeeper with Hadoop.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Reference: http://wiki.apache.org/hadoop/ZooKeeper
You have several options to do that. One way is to use tools like rsync. The Hadoop control scripts can distribute configuration files to all nodes of the cluster using rsync. Alternatively, you can make use of tools like Cloudera Manager or Ambari if you need a more sophisticated way to achieve that.
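If you prefer not to touch the Hadoop scripts at all, a small rsync loop over the node list from the question does the same job. A sketch only, with nodes.txt standing in for whatever file lists node1 through node10:
#!/usr/bin/env bash
# Sync the Hadoop configuration directory to every node listed in nodes.txt
# (nodes.txt is a placeholder: one host name per line, e.g. node1 ... node10)
while read -r HOST; do
  rsync -az "$HADOOP_HOME/conf/" "$HOST:$HADOOP_HOME/conf/"
done < nodes.txt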
If you use InfoSphere BigInsights, then there is the script syncconf.sh.
