Tachyon configuration for s3 under filesystem - alluxio

I am trying to set up Tachyon on the S3 filesystem. For HDFS, Tachyon has a parameter called TACHYON_UNDERFS_HDFS_IMPL which is set to "org.apache.hadoop.hdfs.DistributedFileSystem". Does anyone know if such a parameter exists for S3? If so, what is its value?
Thanks in advance for any help!

The Hadoop FS class you mentioned (org.apache.hadoop.hdfs.DistributedFileSystem) is the HDFS-specific implementation, so there is no direct S3 equivalent of that parameter you need to set. Instead, Tachyon creates the s3n FileSystem implementation based on the scheme specified in the URI of the remote DFS, which is configured with TACHYON_UNDERFS_ADDRESS.
For Amazon, you will need to specify something like this:
export TACHYON_UNDERFS_ADDRESS=s3n://your_bucket
Note "s3n", not "s3" here.
Additional setup is needed to work with S3 (see also
Error in setting up Tachyon on S3 under filesystem and http://tachyon-project.org/Setup-UFS.html):
In ${TACHYON}/bin/tachyon-env.sh, add the access key ID and the secret key to TACHYON_JAVA_OPTS:
-Dfs.s3n.awsAccessKeyId=123
-Dfs.s3n.awsSecretAccessKey=456
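Putting the underfs address and the keys together, a minimal sketch of the relevant tachyon-env.sh lines might look like this (the bucket name and key values are placeholders):
export TACHYON_UNDERFS_ADDRESS=s3n://your_bucket
# append the S3 credentials to whatever TACHYON_JAVA_OPTS already contains
export TACHYON_JAVA_OPTS="$TACHYON_JAVA_OPTS -Dfs.s3n.awsAccessKeyId=123 -Dfs.s3n.awsSecretAccessKey=456"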
Publish the extra dependencies required by the s3n Hadoop FileSystem implementation; the versions depend on the version of Hadoop installed. These are: commons-httpclient-* and jets3t-*.
For that, set TACHYON_CLASSPATH as mentioned in one of the links above. This can be done by exporting TACHYON_CLASSPATH in ${TACHYON}/libexec/tachyon-config.sh before CLASSPATH is exported:
export TACHYON_CLASSPATH=~/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:~/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar
export CLASSPATH="$TACHYON_CONF_DIR/:$TACHYON_JAR:$TACHYON_CLASSPATH"
Start Tachyon cluster:
./bin/tachyon format
./bin/tachyon-start.sh local
Check its availability via the web interface:
http://localhost:19999/
and in the logs:
${TACHYON}/logs
Your core-site.xml should contain the following sections to make sure you are integrated with Tachyon (see the Spark reference http://tachyon-project.org/Running-Spark-on-Tachyon.html for configuring this directly from Scala):
fs.defaultFS - the Tachyon master host and port (defaults shown below)
fs.default.name - the deprecated name for the default filesystem; same value as above
fs.tachyon.impl - Tachyon's hadoop.FileSystem implementation class
fs.s3n.awsAccessKeyId - Amazon access key ID
fs.s3n.awsSecretAccessKey - Amazon secret key
<configuration>
<property>
<name>fs.defaultFS</name>
<value>tachyon://localhost:19998</value>
</property>
<property>
<name>fs.default.name</name>
<value>tachyon://localhost:19998</value>
<description>The name of the default file system. A URI
whose scheme and authority determine the
FileSystem implementation.
</description>
</property>
<property>
<name>fs.tachyon.impl</name>
<value>tachyon.hadoop.TFS</value>
</property>
...
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>123</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>345</value>
</property>
...
</configuration>
Refer to any path using the tachyon scheme and the master host and port:
tachyon://master_host:master_port/path
Example with default Tachyon master host-port:
tachyon://localhost:19998/remote_dir/remote_file.csv
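As a quick smoke test (assuming the core-site.xml above is in place and the Tachyon client jar is on the Hadoop classpath), you can list and read the same path through the Hadoop CLI:
# list the Tachyon root and read a file through the tachyon:// scheme
hadoop fs -ls tachyon://localhost:19998/
hadoop fs -cat tachyon://localhost:19998/remote_dir/remote_file.csv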

Related

How to configure HBase in a HA mode?

I don't understand one parameter from hbase-site.xml:
<property>
<name>hbase.rootdir</name>
<value>hdfs://hdfsHost:8020/hbase</value>
</property>
What do we have to put in that parameter if we configured the HDFS cluster in HA mode? I mean, we have 2 namenodes (nn1, nn2) and 2 datanodes (dn1, dn2), so which node do we have to use in the "hbase.rootdir" parameter?
The most logical answer is the namenode which is currently active. But if we use the active namenode and it fails, the HBase cluster becomes unavailable even after nn2 changes its status to active, because HBase will not understand that the active NN has changed.
Moreover, I have configured the HBase cluster with the following parameter:
<property>
<name>hbase.rootdir</name>
<value>hdfs://nn1:8020/hbase</value>
</property>
It doesn't work:
1. HMaster starts
2. I put "http://nn1:16010" into the browser
3. HMaster disappears
Here is my logs/hbase-hadoop-master-nn1.log:
http://paste.openstack.org/show/549232/
I couldn't find answers in the documentation. Please help me find out how to configure this.
You should insert the whole nameservice there instead of a concrete namenode. I'm assuming that you have only one nameservice configured. Look at the dfs.nameservices property in hdfs-site.xml; there should be something like "nameservice1" in there. Then change hbase.rootdir like so (note that a logical nameservice URI takes no port):
<property>
<name>hbase.rootdir</name>
<value>hdfs://nameservice1/hbase</value>
</property>
(fs.defaultFS property in core-site.xml also uses the same notation)
One thing to watch for is that HBase should have access to the latest HDFS configuration with HA; otherwise it will complain about the nameservice name.
Copy hdfs-site.xml and core-site.xml to the hbase/conf folder; this way you won't see the error about the unknown name of the HA nameservice that you created, as sketched below.
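A minimal sketch of that copy step, assuming the Hadoop client configs live in /etc/hadoop/conf and HBASE_HOME points at your HBase install (adjust both paths to your layout):
# make the HA nameservice definition visible to HBase
cp /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/core-site.xml $HBASE_HOME/conf/
# restart HBase afterwards so it picks up the new configuration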

How to make Hbase resilient to name node failures in Hadoop 2

There is a solution for an HA Hadoop + HBase stack for Hadoop 1, but I can't find any mention of such a solution for Hadoop 2.
Hadoop 2 has namenode high availability, but you still need to set a hostname in the setup, so if the master namenode goes down, HBase remains blind.
What solutions can you suggest for making HBase resilient to namenode failures?
You need to configure a nameservice and use the nameservice instead of specifying a specific IP.
For example, here "mycluster" is the nameservice name:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
And then configure it for HA:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
In hbase-site.xml you can also use the "mycluster" nameservice to refer to the cluster.
For more details, please refer here.
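To double-check which nameservice name to put into hbase.rootdir, a quick sketch using the HDFS client (assuming it is configured against the HA cluster):
# print the logical nameservice and the namenodes behind it
hdfs getconf -confKey dfs.nameservices            # e.g. mycluster
hdfs getconf -confKey dfs.ha.namenodes.mycluster  # e.g. nn1,nn2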

Error on starting Hbase 1.0.0

I have just installed HBase through brew install hbase and edited hbase-site.xml:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/Cellar/hbase/databases/hbase-${user.name}/hbase</value>
<description>The directory shared by region servers and into
which HBase persists. The URL should be 'fully-qualified'
to include the filesystem scheme. For example, to specify the
HDFS directory '/hbase' where the HDFS instance's namenode is
running at namenode.example.org on port 9000, set this value to:
hdfs://namenode.example.org:9000/hbase. By default HBase writes
into /tmp. Change this configuration else all data will be lost
on machine restart.
</description>
</property>
</configuration>
Exported JAVA_HOME and HBASE_HOME.
When I try to start, I get the following errors:
Abhisheks-MacBook-Pro:bin abhishek$ start-hbase.sh
Error: Could not find or load main class org.apache.hadoop.hbase.util.HBaseConfTool
Error: Could not find or load main class org.apache.hadoop.hbase.zookeeper.ZKServerTool
starting master, logging to /usr/local/Cellar/hbase/1.0.0/logs/hbase-abhishek-master-Abhisheks-MacBook-Pro.local.out
Error: Could not find or load main class org.apache.hadoop.hbase.master.HMaster
cat: /usr/local/Cellar/hbase/1.0.0/conf/regionservers: No such file or directory
cat: /usr/local/Cellar/hbase/1.0.0/conf/regionservers: No such file or directory
I have Hadoop 2.6.0 and HBase 1.0.0. Though I see many people have already faced this problem, I cannot find the solution. What else needs to be done to start HBase without any issue?
Solution:
HBASE_HOME=/usr/local/Cellar/hbase/1.0.0/libexec
HBASE_HOME should be configured such that the conf folder lies in the HBASE_HOME directory.
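A sketch of that setup for a Homebrew install of HBase 1.0.0 (adjust the version in the path to match yours):
# point HBASE_HOME at the libexec directory, which contains conf/
export HBASE_HOME=/usr/local/Cellar/hbase/1.0.0/libexec
ls $HBASE_HOME/conf/regionservers   # should exist now, so the "No such file or directory" errors go away
start-hbase.sh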
Check the master status at:
localhost:60010
Then edit hbase-site.xml:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/Cellar/hbase/databases/hbase-${user.name}/hbase</value>
<description>The directory shared by region servers and into
which HBase persists. The URL should be 'fully-qualified'
to include the filesystem scheme. For example, to specify the
HDFS directory '/hbase' where the HDFS instance's namenode is
running at namenode.example.org on port 9000, set this value to:
hdfs://namenode.example.org:9000/hbase. By default HBase writes
into /tmp. Change this configuration else all data will be lost
on machine restart.
</description>
</property>
<property >
<name>hbase.master.port</name>
<value>60000</value>
<description>The port the HBase Master should bind to.</description>
</property>
<property>
<name>hbase.master.info.port</name>
<value>60010</value>
<description>The port for the HBase Master web UI.
Set to -1 if you do not want a UI instance run.</description>
</property>
</configuration>

get "ERROR: Can't get master address from ZooKeeper; znode data == null" when using Hbase shell

I installed Hadoop 2.2.0 and HBase 0.98.0, and here is what I do:
$ ./bin/start-hbase.sh
$ ./bin/hbase shell
2.0.0-p353 :001 > list
Then I got this:
ERROR: Can't get master address from ZooKeeper; znode data == null
Why am I getting this error? Another question:
Do I need to run ./sbin/start-dfs.sh and ./sbin/start-yarn.sh before I run HBase?
Also, what are ./sbin/start-dfs.sh and ./sbin/start-yarn.sh used for?
Here are some of my config files:
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://127.0.0.1:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>/Users/apple/Documents/tools/hbase-tmpdir/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/apple/Documents/tools/hbase-zookeeper/zookeeper</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/micmiu/tmp/hadoop</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>io.native.lib.available</name>
<value>false</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
If you just want to run HBase without going into ZooKeeper management for standalone HBase, then remove all the property blocks from hbase-site.xml except the property block named hbase.rootdir.
Now run ./bin/start-hbase.sh. HBase comes with its own ZooKeeper, which gets started when you run ./bin/start-hbase.sh; that will suffice if you are trying things out for the first time. Later you can add distributed-mode configuration for ZooKeeper.
You only need to run ./sbin/start-dfs.sh for HBase, since the value of hbase.rootdir is set to hdfs://127.0.0.1:9000/hbase in your hbase-site.xml. If you change it to some location on the local filesystem using file:///some_location_on_local_filesystem, then you don't even need to run ./sbin/start-dfs.sh.
hdfs://127.0.0.1:9000/hbase says it's a place on HDFS, and ./sbin/start-dfs.sh starts the namenode and datanode, which provide the underlying API to access the HDFS filesystem. To learn about YARN, please look at http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
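A sketch of the resulting startup order for this configuration (paths assume HADOOP_HOME and HBASE_HOME are set):
$HADOOP_HOME/sbin/start-dfs.sh   # needed only because hbase.rootdir points at hdfs://
$HBASE_HOME/bin/start-hbase.sh
jps                              # expect NameNode, DataNode and HMaster (plus HQuorumPeer if HBase manages ZooKeeper)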
This could also happen if the VM or the host machine is put to sleep; ZooKeeper will not stay live.
Restarting the VM should solve the problem.
You need to start ZooKeeper and then run the HBase shell:
{HBASE_HOME}/bin/hbase-daemons.sh {start,stop} zookeeper
and you may want to check this property in hbase-env.sh:
# Tell HBase whether it should manage its own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
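Concretely, following this answer the startup order might look like this sketch (assuming HBASE_HOME is set and hbase-env.sh is as above):
${HBASE_HOME}/bin/hbase-daemons.sh start zookeeper
${HBASE_HOME}/bin/start-hbase.sh
${HBASE_HOME}/bin/hbase shell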
Refer to Source - Zookeeper
One quick solution could be to restart HBase:
1) stop-hbase.sh
2) start-hbase.sh
I had the exact same error. The Linux firewall was blocking connectivity. One can test ports via telnet. A quick fix is to turn off the firewall and see if it fixes it:
Completely disable the firewall on all of your nodes. Note: this command will not survive a reboot of your machines.
systemctl stop firewalld
The long-term fix is to configure the firewall to allow the HBase ports; a sketch follows the link below.
Note that your version of HBase may use different ports:
https://issues.apache.org/jira/browse/HBASE-10123
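A sketch of the long-term firewalld fix, assuming the post-HBASE-10123 defaults (master 16000, master UI 16010, regionserver 16020, regionserver UI 16030); substitute your actual ports:
sudo firewall-cmd --permanent --add-port=16000/tcp --add-port=16010/tcp
sudo firewall-cmd --permanent --add-port=16020/tcp --add-port=16030/tcp
sudo firewall-cmd --reload
telnet master_host 16000   # quick connectivity check from another node; master_host is a placeholder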
The output from the HBase shell is so high-level that many different misconfigurations can cause this message. To help yourself debug, it is much better to look into the HBase log in
/var/log/hbase
to figure out the root cause of the issue.
I had the same problem too. For me, the root cause was hadoop-kms having a port number that conflicted with my hbase-master. Both of them use port 16000, so my HMaster didn't even start when I invoked the HBase shell. After I fixed that, my HBase worked.
Again, a KMS port conflict might not be your root cause. I strongly suggest looking into /var/log/hbase to find the root cause.
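One way to check whether another process (such as hadoop-kms) is already holding the master port, assuming the default HBase 1.x master port 16000:
sudo lsof -i :16000          # shows the process currently bound to the port, if any
# or: netstat -tlnp | grep 16000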
In my case, with the same error running HBase: I did not include the ZooKeeper properties in hbase-site.xml and still got the above error messages (per the Apache HBase guide, only the two properties rootdir and distributed are essential).
I could also trace back through the output of the jps command and find that my HRegionServer and HMaster were not properly up and running.
After a stop and start (like a reset), I had these two up and running and could run HBase properly.
If it's happening in VMware or VirtualBox, please restart Cloudera with the command init 1. Please check that you have root privileges and retry; I hope it helps :)
hbase shell

HBase binding to an incorrect address

I'm attempting to run HBase in pseudo-distributed mode. I have followed all of the steps in the tutorial.
My hbase-site.xml looks like this:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
My regionservers file looks like this (the default):
localhost
In the logs, ZooKeeper starts OK, MiniZK starts OK, then I get a BindException with this being the culprit:
Caused by: java.net.BindException: Problem binding to /192.168.0.1:0 : Cannot assign requested address
Where in the world did it get the address 192.168.0.1? And why is it trying to bind to port 0? That IP is my NAT gateway. The IP address of the machine it's on is 192.168.0.200.
I have looked in all of the config files but don't see anywhere that I would specify that address.
** UPDATE **
It looks like the problem was that HBase was trying to resolve my hostname to an IP address, which (because I'm using my router as a DNS server) resolved to ... my router.
When I add an "alias" in the /etc/hosts file pointing to 127.0.0.1, it resolves just fine.
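A sketch of that /etc/hosts workaround; "myhost" stands in for the machine's actual hostname:
# map the local hostname to the loopback address so HBase binds locally
127.0.0.1   localhost myhost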
@arnon-rotem-gal-oz, I just installed whatever came in the HBase tarball. I'm assuming MiniZK is a scaled-down version of ZooKeeper? I'm not running a separate instance of it.
The code you posted did the trick to resolve the next problem that came up.
Check the ZooKeeper configuration file (zoo.cfg in the zookeeper/conf directory).
Also, why do you have both ZooKeeper and MiniZK?
Also (not directly related to your question), you need to tell HBase where to find ZooKeeper, e.g. by adding the following to your hbase-site.xml:
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
