Is it possible to execute CMD at the middle of docker file? - hadoop

I am installing hadoop-0.20.2 using Docker. I have two Dockerfiles: one for the Java installation and another for the Hadoop installation. I start the services with the CMD instruction:
CMD ["path/to/start-all.sh"]
Now I want to write a third Dockerfile that executes an example MapReduce job. The problem is that
the third Dockerfile depends on the second (Hadoop) Dockerfile. For example:
FROM sec_doc_file
RUN /bin/hadoop fs -mkdir input
It requires the Hadoop services, but those are started only when a container is run from the second image. I want to start them as part of the third Dockerfile, before the MR job starts. Is that possible? If so, please provide an example. If not, what other possibilities are there?
#something like
FROM sec_doc_file
#Start_Service
RUN /bin/hadoop fs -mkdir input
#continue_map_reduce_job

The Docker image you use as the base for a new image provides files, not running processes. To do what you want, you need to start the required process(es) during docker build and run the setup commands while they are up. Each RUN creates a new image layer, but services started by a previous RUN are not kept running. So if a service must be up to perform some setup during docker build, start it and use it in the same RUN instruction (by chaining commands or with a custom script). Example:
FROM Gops/sec_doc_file
RUN path/to/start-all.sh && /bin/hadoop fs -mkdir input
So, to set up HDFS folders and files during docker build, you need to start the HDFS daemons and perform the action you want in the same RUN command:
RUN . /etc/hadoop/hadoop-env.sh && \
    /opt/hadoop/sbin/start-dfs.sh && \
    /opt/hadoop/bin/hdfs dfs -mkdir input
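The "custom script" route mentioned above could look like the following sketch. It assumes the /opt/hadoop layout from the example and a Hadoop release that ships the hdfs command; the function name setup_hdfs and the HADOOP_HOME default are illustrative and not part of the original setup. It also waits for the NameNode to leave safe mode before writing, and stops the daemons afterwards.

```shell
#!/bin/sh
# Hypothetical helper for a single RUN step: start HDFS, do the setup, stop HDFS.
# Assumes the /opt/hadoop layout from the example above unless HADOOP_HOME is set.
setup_hdfs() {
    hadoop_home="${HADOOP_HOME:-/opt/hadoop}"

    # Start the HDFS daemons inside this build step.
    "$hadoop_home/sbin/start-dfs.sh" || return 1

    # Block until the NameNode has left safe mode and accepts writes.
    "$hadoop_home/bin/hdfs" dfsadmin -safemode wait || return 1

    # The actual setup work; -p also creates missing parent directories.
    "$hadoop_home/bin/hdfs" dfs -mkdir -p input
    status=$?

    # Stop the daemons whether or not the mkdir succeeded.
    "$hadoop_home/sbin/stop-dfs.sh"
    return "$status"
}
```

Saved as a script that ends by calling setup_hdfs, this could be copied into the image and invoked from one RUN instruction, so the daemons and the setup commands share the same build step.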

Related

Problem in executing a shell script present on host using docker exec

I'm trying to execute a script on the master node of AWS EMR cluster. The intention is to create a new conda env and link it to jupyter. I'm following this doc from AWS. Problem is, whatever be the content of the script, I'm getting the same error: bash: /home/hadoop/scripts/bootstrap.sh: No such file or directory while executing sudo docker exec jupyterhub bash /home/hadoop/scripts/bootstrap.sh. I've made sure the sh file is in the correct location.
But if I copy the bootstrap.sh file inside the container, and then run the same docker exec cmd, it's working fine. What am I missing here? I've tried with a simple script with the following entries, but it throws the same error:
#!/bin/bash
echo "Hello"
The doc clearly says:
Kernels are installed within the Docker container. The easiest way to
accomplish this is to create a bash script with installation commands,
save it to the master node, and then use the sudo docker exec
jupyterhub script_name command to run the script within the jupyterhub
container.
The docker exec command runs a command within the container's namespaces. One of those namespaces is the filesystem. So unless the command is part of the image, written into the container directly, or you have mounted a host volume to map a host directory into the container, you won't be able to execute it. A host volume could look like:
docker run -v /host/scripts:/container/scripts --name your_container $your_image
docker exec -it your_container /container/scripts/test.sh
That host volume could be the same path on both the host and the container.
If it is a shell script, you could use I/O redirection, e.g.:
docker exec -i $container_id /bin/bash <local_script.sh
but be aware that you cannot do interactive stuff this way since the script content has replaced your terminal as stdin. This works because the shell inside the container is just processing commands from stdin.
Other than those scenarios, I don't know what to tell you other than the documentation from AWS appears to be wrong.
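The stdin redirection above can be wrapped in a small helper that also forwards arguments to the script. This is only a sketch: the function name exec_host_script and the CONTAINER_SHELL variable are made up for illustration. It relies on the shell's -s option, which reads commands from stdin and treats the remaining words as positional parameters.

```shell
#!/bin/sh
# Hypothetical helper: run a script stored on the host inside a running container.
# Usage: exec_host_script CONTAINER SCRIPT [ARGS...]
exec_host_script() {
    container="$1"
    script="$2"
    shift 2

    # Fail early if the script is missing on the host side.
    [ -r "$script" ] || { echo "cannot read $script" >&2; return 1; }

    # -i keeps stdin open so the container's shell can read the script from it.
    # -s makes the shell read commands from stdin; words after -- become $1, $2, ...
    docker exec -i "$container" "${CONTAINER_SHELL:-/bin/bash}" -s -- "$@" < "$script"
}
```

For the case in the question, the call would be something like exec_host_script jupyterhub /home/hadoop/scripts/bootstrap.sh, run on the master node where the script lives.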

How to run cron in Docker container from Ruby image

I've tried setting up cron to run in my Docker container, but without success thus far.
This is the cron-related parts of the Dockerfile:
FROM ruby:2.2.2
# Add crontab file in the cron directory
RUN apt-get install -y rsyslog
ADD crontab /etc/cron.d/hello-cron
# Give execution rights on the cron job
RUN chmod +x /etc/cron.d/hello-cron
# Create the log file to be able to run tail
RUN touch /var/log/cron.log
# Run the command on container startup
RUN service cron start
When I log on to the container instance, cron appears to be running:
$ service cron status
cron is running.
And /etc/cron.d has my job:
$ cat /etc/cron.d/hello-cron
* * * * * root echo "Hello world" >> /var/log/cron.log 2>&1
But nothing is appended to /var/log/cron.log, so it doesn't appear to run.
If I then run $ cron from within the container, it registers my hello-cron file and "Hello world" is appended to the log file every minute.
Your analysis is correct, the cron jobs are not running. This happens because normally, and by best practices, the container only runs a single process, such as Apache, NGINX, etc. - it does not run any of the normal operating system daemons such as crond.
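No crond means there is nothing that would read or execute your crontab. The RUN service cron start line does not change this: it starts cron only for the duration of that build step, and the process no longer exists when a container is later started from the image.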
There are several possibilities to solve this, but no perfect solution that I know of.
The worst one is to actually install crond, along with something like supervisord. It makes your container dramatically more complex.
You can create a separate container that runs nothing but cron. Mount whatever you need from the other containers as volumes. This is generally the recommended option, but it has limitations. The cron container needs to know a lot about the internals of your other containers, and the cron jobs don't execute in the same context as the rest of the containers.
You can create a cron job on the host, and have it execute scripts in the containers with docker exec. That works well, but creates a dependency between host and container. It may also not work at all if you don't have access to the host's operating system (for instance, in a hosted situation, or if a different team manages the host).
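For the host-cron option, a minimal wrapper might look like the sketch below. The function name run_if_running, and the container name and job path in the note after it, are placeholders. The wrapper skips the job when the container is down, and it calls docker exec without -t because cron provides no terminal.

```shell
#!/bin/sh
# Hypothetical wrapper for a host-side cron job.
# Usage: run_if_running CONTAINER COMMAND [ARGS...]
run_if_running() {
    container="$1"
    shift

    # Ask Docker whether the container is currently running ("true" or "false").
    state=$(docker inspect -f '{{.State.Running}}' "$container" 2>/dev/null) || state=false

    if [ "$state" != "true" ]; then
        echo "container $container is not running, skipping job" >&2
        return 0
    fi

    # No -t here: cron has no TTY to attach.
    docker exec "$container" "$@"
}
```

A host crontab entry could then invoke a script that ends with a call such as run_if_running my_app /app/bin/nightly-job, where both names are hypothetical.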

Dockerize ruby script that takes directories as input/output

I am very new to Docker, and I need help dockerizing a Ruby script that takes an input directory and an output directory, e.g.:
generate_rr_pair.rb BuildRR -n /data/ -o /output
The script takes the -n option (input) and checks whether the directory exists; if it does, it uses the files inside as input. The script then writes its data to the -o option (output). If the output directory doesn't exist, the script creates it and writes the files there.
How can I create a Dockerfile to handle this? Should I pass these in as environment variables, or should I use mounted volumes? Since the script handles file I/O, I am not sure whether I want volumes. The input directory should already exist on the host, and the output directory will get created. Both directories should remain after the Docker container stops.
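Use the official Ruby image in your Dockerfile. ENTRYPOINT is used rather than CMD so that the arguments given to docker run are appended to the script invocation instead of replacing it:
FROM ruby:2.1-onbuild
ENTRYPOINT ["ruby", "generate_rr_pair.rb"]
Build the image as normal: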
docker build -t myruby .
Which can then be run as follows:
docker run --rm -it -v /data:/data -v /output:/output myruby BuildRR -n /data -o /output
Note that volume mappings are required if you want the ruby script within the container to operate on directories mounted on the host machine.
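One practical detail with bind mounts: Docker expects absolute host paths for -v, and a host directory that does not exist yet is typically created by the daemon and owned by root. A small wrapper can resolve the paths and create the output directory up front. This sketch assumes the myruby image built above and mirrors the docker run line shown there; the function name run_rr_pair is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around the docker run line above.
# Usage: run_rr_pair INPUT_DIR OUTPUT_DIR
run_rr_pair() {
    # The input directory must already exist; resolve it to an absolute path.
    in_dir=$(cd "$1" 2>/dev/null && pwd) || { echo "no such input directory: $1" >&2; return 1; }

    # Create the output directory as the current user, then resolve it too.
    mkdir -p "$2" || return 1
    out_dir=$(cd "$2" && pwd) || return 1

    docker run --rm \
        -v "$in_dir:/data" \
        -v "$out_dir:/output" \
        myruby BuildRR -n /data -o /output
}
```

Because both directories live on the host, they remain after the container exits, which is what the question asks for.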

File not found exception while starting Flume agent

I have installed Flume for the first time. I am using hadoop-1.2.1 and flume 1.6.0
I tried setting up a flume agent by following this guide.
I executed this command : $ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
It says log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: ./logs/flume.log (No such file or directory)
Isn't the flume.log file generated automatically? If not, how can I rectify this error ?
Try this:
mkdir -p ./logs
sudo chown `whoami` ./logs
bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
The first line creates the logs directory in the current directory if it does not already exist. The second one sets the owner of that directory to the current user (you) so that flume-ng running as your user can write to it.
Finally, please note that this is not the recommended way to run Flume, just a quick hack to try it.
You are probably getting this error because you are running the command directly from the console. First change into the bin directory of your Flume installation, then try running the command from there.
As @Botond says, you need to set the right permissions.
However, if you run Flume from another program, such as supervisor, or from a custom script, you might want to change the default path, because it is relative to the directory Flume is launched from.
This path is defined in your /path/to/apache-flume-1.6.0-bin/conf/log4j.properties. There you can change the line
flume.log.dir=./logs
to an absolute path of your choice. You still need the right permissions on that directory, though.
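The edit described above can also be scripted. The sketch below rewrites the flume.log.dir line in a log4j.properties file and creates the target directory. The function name set_flume_log_dir is illustrative, and it assumes the property sits on its own line, as in the line quoted above, and that the directory path contains no | or & characters.

```shell
#!/bin/sh
# Hypothetical helper: point flume.log.dir at an absolute directory.
# Usage: set_flume_log_dir /path/to/conf/log4j.properties /absolute/log/dir
set_flume_log_dir() {
    props="$1"
    log_dir="$2"

    # Refuse relative paths, since those are what caused the problem.
    case "$log_dir" in
        /*) ;;
        *) echo "log directory must be absolute: $log_dir" >&2; return 1 ;;
    esac

    # Create the directory as the current user so the agent can write to it.
    mkdir -p "$log_dir" || return 1

    # Rewrite the property in place, keeping a .bak copy of the original file.
    sed -i.bak "s|^flume\.log\.dir=.*|flume.log.dir=$log_dir|" "$props"
}
```

Run it as the same user that starts the Flume agent, so that the created directory has the right owner.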

How to make mahout interact with hadoop HDFS

I am using HDP mahout version 0.8. I have set MAHOUT_LOCAL="". When I run mahout, I see the message HADOOP LOCAL NOT SET RUNNING ON HADOOP but my program is not writing output to HDFS directory.
Can anyone tell me how to make my mahout program take input from HDFS and write output to HDFS?
Did you add $MAHOUT_HOME/bin and $HADOOP_HOME/bin to the PATH?
For example on Linux:
export PATH=$PATH:$MAHOUT_HOME/bin/:$HADOOP_HOME/bin/
export HADOOP_CONF_DIR=$HADOOP_HOME/conf/
After that, almost all Mahout commands take the options -i (input) and -o (output).
For example:
mahout seqdirectory -i <input_path> -o <output_path> -chunk 64
Assuming you have built your Mahout jar, and that it takes its input from HDFS and writes its output to HDFS, do the following
from the Hadoop bin directory:
./hadoop jar /home/kuntal/Kuntal/BIG_DATA/mahout-recommender.jar mia.recommender.RecommenderIntro --tempDir /home/kuntal/Kuntal/BIG_DATA --recommenderClassName org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
# Specify the input and output arguments if required:
-Dmapred.input.dir=./ratingsLess.txt -Dmapred.output.dir=/input/output
Please check this:
http://chimpler.wordpress.com/2013/02/20/playing-with-the-mahout-recommendation-engine-on-a-hadoop-cluster/
