I have a docker image for hadoop. (in my case it is https://github.com/kiwenlau/hadoop-cluster-docker, but the question applies to any hadoop docker image)
I am running the docker container as below:
sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 --name hadoop-master --hostname hadoop-master \
kiwenlau/hadoop
I am writing data to the HDFS file system from Java running on the host Ubuntu machine.
FileSystem hdfs = FileSystem.get(new URI("hdfs://0.0.0.0:9000"), configuration);
hdfs.create(new Path("hdfs://0.0.0.0:9000/user/root/input/NewFile.txt"));
How should I mount the volume when starting docker such that the "NewFile.txt" is persisted?
Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt" ?
You should inspect the dfs.datanode.data.dir property in the hdfs-site.xml file to know where data is stored in the container filesystem.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
<description>DataNode directory</description>
</property>
Without this file/property, the default location would be file:///tmp/hadoop-${user.name}/dfs/data.
For Docker, mind that the default user that runs the processes is the root user.
You will also need to persist the namenode files, again specified in that XML file.
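For example, a minimal sketch of the docker run command with both directories mounted from the host. The namenode path /root/hdfs/namenode is an assumption following the same convention as the datanode dir above, and the host paths under $PWD are illustrative:
sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 \
-v $PWD/hdfs/namenode:/root/hdfs/namenode \
-v $PWD/hdfs/datanode:/root/hdfs/datanode \
--name hadoop-master --hostname hadoop-master \
kiwenlau/hadoop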
Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt"
The container path holds the blocks of the HDFS file, not the whole file itself
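For illustration, a hedged way to see those block files from the host, using the container name and data dir from above:
docker exec hadoop-master find /root/hdfs/datanode -name 'blk_*'
Each HDFS file is split into such blk_* files plus metadata, so you cannot read NewFile.txt directly from that path.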
Related
I am running a linux docker container on windows 10. I need my host to have access to the data that my container generates. I also need the data to persist if I update the container's image.
I created a folder on the host (on an NTFS-formatted drive). In the Docker settings, I share that drive with Docker. I then create the container with the host directory mounted (using the -v option on the docker run command).
The problem is that docker creates a cifs mount to my shared drive on the host. It seems like the CIFS protocol is not case sensitive. I create two files:
/data/Test
/data/test
But only one file will be generated. I set up the kernel to support case-sensitive files. For example, if I mount the same folder inside Cygwin bash, I can create those two files without any problem. The problem is with the CIFS implementation, I think.
My current thoughts on solving this issue:
Use Cygwin to create an NFS server on the host, and mount the NFS volume from within the Linux container (a sketch follows below). I am not sure how I can automate this process though.
Create another Linux container with a Samba server, and create a volume on that container:
docker run -d -v /data --name dbstore a-samba-server-image
Then use that volume in my container:
docker run -d --volumes-from dbstore --name my-container my-container-image
Then I need to share /data in the samba server and create a map to that share on my host.
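For the first idea, a minimal sketch, assuming the Cygwin NFS server already exports /data and the host is reachable at <host-ip> (both are assumptions); Docker's local volume driver can mount NFS directly:
docker volume create --driver local --opt type=nfs --opt o=addr=<host-ip>,rw --opt device=:/data nfs-data
docker run -d -v nfs-data:/data --name my-container my-container-image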
Both solutions seem quite cumbersome and I would like to know if there is any way I can solve this directly with the CIFS share that Docker natively creates.
I run a Hadoop cluster in Docker by mounting a local folder with -v.
Then I log into the hadoop cluster, 'cd' to the mounted folder, and execute hdfs dfs -put ./data/* input/. It works.
But my problem is that I cannot delete the data that I copied to HDFS. I delete the containers with docker rm, but the data still exists. Now I can only reset Docker so that the data is deleted.
Is there any other solution?
This is my docker info
➜ hadoop docker info
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 1
Server Version: 1.12.3
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 22
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 4.4.27-moby
Operating System: Alpine Linux v3.4
OSType: linux
Architecture: x86_64
CPUs: 5
Total Memory: 11.71 GiB
Name: moby
ID: NPR6:2ZTU:CREI:BHWE:4TQI:KFAC:TZ4P:S5GM:5XUZ:OKBH:NR5C:NI4T
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 56
Goroutines: 81
System Time: 2016-11-22T08:10:37.120826598Z
EventsListeners: 2
Username: chaaaa
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support
Insecure Registries:
127.0.0.0/8
This is a known issue: https://github.com/docker/for-mac/issues/371
If you can remove all images/containers then:
Stop Docker.
run
docker rm $(docker ps -a -q)
docker rmi $(docker images -q)
docker volume rm $(docker volume ls -q)
rm -rf ~/Library/Containers/com.docker.docker/Data/*
Start Docker, and you have your GB back.
To delete data in HDFS you need to make a call similar to the one you used to put the file, in this case:
hdfs dfs -rm ./data/*
If there are directories, you should add -R:
hdfs dfs -rm -R ./data/*
And finally, by default Hadoop moves deleted files/directories to a trash directory, which would be in the home of the hadoop user you're using for these requests, something like /user/<you>/.Trash/
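If you want the space freed immediately instead of waiting for the trash interval, a hedged variant is to skip the trash (the input/* path is illustrative):
hdfs dfs -rm -R -skipTrash input/*
You can also empty the trash explicitly with hdfs dfs -expunge.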
About HDFS
Usually the namenode keeps metadata about the structure of the HDFS, such as the directories and files in it and which datanodes store the blocks that form them. The datanodes keep the blocks of HDFS data, but the data stored there is usually not directly usable, since each datanode typically holds only some of the blocks of a file.
Because of this, all operations on the HDFS are done through the namenode using hdfs calls, like put, get, rm, mkdir..., instead of regular operating system command-line tools.
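For example, a few typical hdfs calls (paths are illustrative):
hdfs dfs -mkdir -p /user/root/input
hdfs dfs -put ./data/file.txt /user/root/input/
hdfs dfs -ls /user/root/input
hdfs dfs -rm /user/root/input/file.txt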
I'm using an EC2 instance to run Docker. From my local machine running OSX, I'm using docker-machine to create containers and volumes. However, when I try to mount a local folder into any container, it is not possible.
docker create -v /data --name data-only-container ubuntu /bin/true
docker run -it --volumes-from data-only-container -v $(pwd)/data:/backup ubuntu bash
With the first command I create a data-only container, and I execute the second command to get into a container that should have the data-only container's volumes and the one I'm trying to mount; however, when I access it, the folder /backup is empty.
What am I doing wrong?
EDIT:
I'm trying to mount a host folder in order to restore backed-up data from my PC to the container. In that case, what would be a different approach?
Shall I try to use Flocker?
A host volume mounted with -v /path/to/dir:/container/mnt mounts a directory from the docker host inside the container. When you run this command on your OSX system, the $(pwd)/data will reference a directory on your local machine that doesn't exist on the docker host, the EC2 instance. If you log into your EC2 instance, you'll likely find the $(pwd)/data directory created there and empty.
If you want to mount folders from your OSX system into a docker container, you'll need to run Docker on the OSX system itself.
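Alternatively, a hedged sketch using docker-machine to copy the data to the EC2 host first and mount it from there (the machine name default and the remote path are assumptions):
docker-machine scp -r ./data default:/home/ubuntu/data
docker run -it --volumes-from data-only-container -v /home/ubuntu/data:/backup ubuntu bash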
Edit: To answer the added question of how to move data up to your container in the cloud, there are often ways to move your data to the cloud provider itself, outside of docker, and then include it directly inside the container. For a docker-only approach, you can do something like:
tar -cC /source . | \
docker run --rm -i -v app-data:/target busybox \
/bin/sh -c "tar -xC /target"
This will upload your data with tar over a pipe into a named volume on your docker host. You can then include the named "app-data" volume in any other containers. If you need to do this multiple times with larger data sets, creating an rsync container would be more efficient.
I have a multinode setup on separate machines. The namenode can't start the datanode and the tasktracker; the namenode, secondary namenode, and jobtracker work fine.
the namenode machine is named namenode@namenode, IP 192.168.1.1
the datanode machine is named datanode2@datanode2, IP 192.168.1.2
the ssh server is set up and the id_rsa.pub is copied to the datanode
but when running the start-all.sh command,
when starting the datanode it asks for a password for namenode@datanode2,
and when the password is provided it says permission denied.
You need to have core-site.xml with your namenode address. This needs to be the same across the cluster.
<property>
<name>fs.default.name</name>
<value>hdfs://$namenode.full.hostname:8020</value>
<description>Enter your NameNode hostname</description>
</property>
You can use a script to start individual daemons. Follow this SO post.
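For example, a hedged sketch assuming a Hadoop 1.x layout with $HADOOP_HOME set:
# on the namenode machine
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
# on each datanode machine
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker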
Change permissions for .ssh folder and authorized_keys file as follows:
sudo chmod 700 ~/.ssh
sudo chmod 640 ~/.ssh/authorized_keys
or
sudo chmod 700 /home/hadoop/.ssh
sudo chmod 640 /home/hadoop/.ssh/authorized_keys
Refer to this for more details.
UPDATE I:
Try 600 instead of 640 like this:
sudo chmod 600 $HOME/.ssh/authorized_keys
sudo chown 'hadoop' $HOME/.ssh/authorized_keys
If this did not work, try this one:
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoopusrname@HOSTNAME.local
Replace HOSTNAME with your local hostname and hadoopusrname with your Hadoop username.
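After that, a quick hedged check that passwordless login works from the namenode (hostname taken from the question):
ssh hadoopusrname@datanode2 'echo ok'
If this prints ok without asking for a password, start-all.sh should no longer prompt.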
I have a container running hadoop. I have another Dockerfile which contains Map-Reduce job commands like creating an input directory, processing a default example, and displaying the output. The base image for the second file is hadoop_image, created from the first Dockerfile.
EDIT
Dockerfile - for hadoop
#base image is ubuntu:precise
#cdh installation
#hadoop-0.20-conf-pseudo installation
#CMD to start-all.sh
start-all.sh
#start all the services under /etc/init.d/hadoop-*
The hadoop base image is created from this.
Dockerfile2
#base image is hadoop
#flume-ng and flume-ng agent installation
#conf change
#flume-start.sh
flume-start.sh
#start flume services
I am running both containers separately. It works fine. But if I run
docker run -it flume_service
it starts flume and shows me a bash prompt [/bin/bash is the last line of flume-start.sh]. Then I execute
hadoop fs -ls /
in the second running container, I am getting the following error:
ls: Call From 514fa776649a/172.17.5.188 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I understand I am getting this error because the hadoop services are not started yet. But my doubt is: my first container is running, and I am using it as the base image for the second container, so why am I getting this error? Do I need to change anything in the hdfs-site.xml file on the flume container?
Pseudo-Distributed mode installation.
Any suggestions?
Or do I need to expose any ports or something like that? If so, please provide me an example.
EDIT 2
iptables -t nat -L -n
I see
sudo iptables -t nat -L -n
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE tcp -- 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-6
MASQUERADE udp -- 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-6
MASQUERADE all -- 192.168.122.0/24 !192.168.122.0/24
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-
Chain DOCKER (2 references)
target prot opt source destination
It is on docker@domain, not inside a container.
EDIT
See the last comment under surazj's answer.
Have you tried linking the container?
For example, your container named hadoop is running in pseudo-distributed mode. You want to bring up another container that contains flume. You could link the containers like this:
docker run -it --link hadoop:hadoop --name flume ubuntu:14.04 bash
When you get inside the flume container, type the env command to see the IP and ports exposed by the hadoop container.
From the flume container you should be able to do something like the following (ports on the hadoop container should be exposed):
$ hadoop fs -ls hdfs://<hadoop containers IP>:8020/
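A hedged sketch of exposing those ports when starting the hadoop container (the image name hadoop_image is an assumption; alternatively add EXPOSE 8020 50070 to its Dockerfile and rebuild):
docker run -itd --expose 8020 --expose 50070 --name hadoop hadoop_image
When the containers are linked, Docker then injects environment variables such as HADOOP_PORT_8020_TCP into the flume container, as shown further below.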
The error you are getting might be related to some hadoop services not running on flume. Do jps to check the running services. But I think if you have the hadoop classpath set up correctly on the flume container, then you can run the above hdfs command (-ls hdfs://:8020/) without starting anything. But if you want
hadoop fs -ls /
to work on the flume container, then you need to start the hadoop services on the flume container also.
In your core-site.xml, add dfs.namenode.rpc-address like this so the namenode listens for connections from all IPs:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>0.0.0.0:8020</value>
</property>
Make sure to restart the namenode and datanode
sudo /etc/init.d/hadoop-hdfs-namenode restart && sudo /etc/init.d/hadoop-hdfs-datanode restart
Then you should be able to do this from your hadoop container without a connection error, e.g.
hadoop fs -ls hdfs://localhost:8020/
hadoop fs -ls hdfs://172.17.0.11:8020/
On the linked container, type env to see the ports exposed by your hadoop container:
env
You should see something like
HADOOP_PORT_8020_TCP=tcp://172.17.0.11:8020
Then you can verify the connection from your linked container.
telnet 172.17.0.11 8020
I think I met the same problem. I also can't start the hadoop namenode and datanode with the hadoop command "start-all.sh" in docker1.
That is because it launches the namenode and datanode through "hadoop-daemons.sh", but that fails. The real problem is that "ssh" does not work in docker.
So, you can do either
(solution 1) :
Replace all occurrences of "daemons.sh" with "daemon.sh" in start-dfs.sh,
then run start-dfs.sh
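For example, a hedged one-liner for that replacement (assumes $HADOOP_PREFIX is set):
sed -i 's/daemons\.sh/daemon.sh/g' $HADOOP_PREFIX/sbin/start-dfs.sh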
(solution 2) : do
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
You can see that the datanode and namenode are working fine with the command "jps".
Regards.