Spark with custom Hadoop FileSystem - hadoop

I already have a YARN cluster, configured to use a custom Hadoop FileSystem in core-site.xml:
<property>
<name>fs.custom.impl</name>
<value>package.of.custom.class.CustomFileSystem</value>
</property>
I want to run a Spark job on this YARN cluster that reads an input RDD from this CustomFileSystem:
final JavaPairRDD<String, String> files =
sparkContext.wholeTextFiles("custom://path/to/directory");
Is there some way to do this without reconfiguring Spark? That is, can I point Spark to the existing core-site.xml, and what would be the best way to do that?

Set HADOOP_CONF_DIR to the directory that contains core-site.xml. (This is documented in Running Spark on YARN.)
You will still need to make sure package.of.custom.class.CustomFileSystem is on the classpath.
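As a rough illustration of the two points above, here is a minimal Python sketch that assembles a spark-submit invocation with HADOOP_CONF_DIR set and the custom FileSystem jar passed through --jars. All paths, the jar name and the main class are hypothetical placeholders, not values from the question; only the HADOOP_CONF_DIR variable and the spark-submit options --master, --jars and --class are standard.

```python
import os


def spark_submit_invocation(hadoop_conf_dir, custom_fs_jar, app_class, app_jar):
    """Return (env, argv) for launching a Spark job on YARN."""
    env = dict(os.environ)
    # Spark picks up core-site.xml (and its fs.custom.impl entry) from here.
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    argv = [
        "spark-submit",
        "--master", "yarn",
        # --jars adds the jar to the driver and executor classpaths.
        "--jars", custom_fs_jar,
        "--class", app_class,
        app_jar,
    ]
    return env, argv


env, argv = spark_submit_invocation(
    "/etc/hadoop/conf",              # assumed directory containing core-site.xml
    "/opt/libs/custom-fs.jar",       # hypothetical jar holding CustomFileSystem
    "com.example.ReadCustomFs",      # hypothetical main class
    "/opt/apps/read-custom-fs.jar",  # hypothetical application jar
)
print(" ".join(argv))
# To actually launch: subprocess.run(argv, env=env, check=True)
```

The function only builds the command, so it can be inspected before anything is submitted to the cluster.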

Related

How to change java.io.tmpdir for spark job running on yarn

How can I change the java.io.tmpdir folder for my Hadoop 3 cluster running on YARN?
By default it gets something like /tmp/***, but my /tmp filesystem is too small for everything a YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of "What should be hadoop.tmp.dir?". Also, go through all the conf files in /etc/hadoop/conf and search for tmp to see if anything is hardcoded. Also specify:
Whether you see (any) files getting created at what you specified as hadoop.tmp.dir.
What pattern of files is being created at /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also have a look at hive-site.xml. The same goes for any other ecosystem product you are using.
I configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp filesystem and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for Spark executors was also set to the directories defined in the yarn.nodemanager.local-dirs property.
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/somepath1,/anotherpath2</value>
</property>
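Since the original problem was a /tmp filesystem that is too small, it may help to check candidate directories for free space before listing them. The following Python sketch does that and renders the property shown above; the function names and the free-space threshold are illustrative assumptions, not part of YARN.

```python
import shutil
from xml.sax.saxutils import escape


def dirs_with_free_space(candidates, min_free_bytes):
    """Keep only existing directories whose filesystem has enough free space."""
    selected = []
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            continue  # path is missing or unreadable, so skip it
        if free >= min_free_bytes:
            selected.append(path)
    return selected


def local_dirs_property(dirs):
    """Render the yarn-site.xml property with a comma-separated value."""
    value = escape(",".join(dirs))
    return (
        "<property>\n"
        "<name>yarn.nodemanager.local-dirs</name>\n"
        f"<value>{value}</value>\n"
        "</property>"
    )


# On a real node, filter first, for example:
#   dirs = dirs_with_free_space(["/somepath1", "/anotherpath2"], 50 * 1024**3)
print(local_dirs_property(["/somepath1", "/anotherpath2"]))
```

The node managers still need a restart after the property changes, as described above.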

Hadoop : HDFS Cluster running out of space even though space is available

We have an HDFS cluster with 4 datanodes. Each datanode has a large amount of space available, about 98 GB, but when I look at the datanode information, HDFS is only using about 10 GB and is running out of space.
How can we make it use all 98 GB and not run out of space, as indicated in the image?
This is the disk space configuration:
This is the hdfs-site.xml on the name node:
<property>
<name>dfs.name.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
This is the hdfs-site.xml on the data node:
<property>
<name>dfs.data.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
Even though /test has 98 GB and HDFS is configured to use it, it's not using it.
Am I missing anything in the configuration changes? And how can we make sure the 98 GB is used?
According to this Hortonworks Community Portal link, the steps to amend your Data Node directory are as follows:
1. Stop the cluster.
2. Go to the Ambari HDFS configuration and edit the datanode directory configuration: Remove /hadoop/hdfs/data and /hadoop/hdfs/data1. Add [new directory location].
3. Log in to each datanode (via SSH) and copy the contents of /data and /data1 into the new directory.
4. Change the ownership of the new directory and everything under it to "hdfs".
5. Start the cluster.
I'm assuming that you're technically already up to Step 2, since you've displayed your configured hdfs-site.xml files in the original question. Make sure you've done the other steps and that all Hadoop services have been stopped. From there, change the ownership to the user running Hadoop (typically hdfs, but I've worked in a place where root was running the Hadoop processes) and you should be good to go :)
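Before moving data around, it can be worth confirming which filesystem the configured DataNode directory actually lives on. The Python sketch below reads the directory from an hdfs-site.xml document and prints the capacity of its filesystem. It is only a diagnostic aid under simple assumptions: it ignores XInclude and variable substitution, and the helper names are made up for this example.

```python
import shutil
import xml.etree.ElementTree as ET

# Current property name first, then the older name used in the question.
DATA_DIR_KEYS = ("dfs.datanode.data.dir", "dfs.data.dir")


def datanode_dirs(hdfs_site_xml):
    """Return the DataNode directories listed in an hdfs-site.xml document."""
    root = ET.fromstring(hdfs_site_xml)
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    for key in DATA_DIR_KEYS:
        if props.get(key):
            dirs = []
            # Values are comma-separated and may carry a file:// prefix.
            for entry in props[key].split(","):
                entry = entry.strip()
                if entry.startswith("file://"):
                    entry = entry[len("file://"):]
                dirs.append(entry)
            return dirs
    return []


def report_capacity(dirs):
    """Print total and free space of the filesystem behind each directory."""
    for d in dirs:
        try:
            usage = shutil.disk_usage(d)
        except OSError:
            print(f"{d}: not found on this host")
            continue
        print(f"{d}: {usage.total / 1e9:.1f} GB total, {usage.free / 1e9:.1f} GB free")


sample = """<configuration>
<property>
<name>dfs.data.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>"""
report_capacity(datanode_dirs(sample))
```

If the reported total is far below the expected 98 GB, the directory is probably sitting on a different mount than intended.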

Does MapReduce work without YARN?

I'm studying Hadoop MapReduce on CentOS 6.5 with Hadoop 2.7.2. I learned that HDFS is just a distributed file system and that YARN administers MapReduce work, so I thought that if I don't turn on YARN (resource manager, node manager), MapReduce doesn't work.
Therefore, I think wordcount should not run as a MapReduce job on a system that runs only HDFS and not YARN
(in pseudo-distributed mode).
But when I turn on HDFS but not YARN, as you can see below, and execute the wordcount example, it shows 'map-reduce framework'. What does that mean? Is it possible for MapReduce to run on HDFS alone, without YARN? Since YARN manages resources and jobs, is it right that MapReduce doesn't work without YARN?
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/input /user/output
With Hadoop 2.0, YARN takes responsibility for resource management; this is true. But even without YARN, MapReduce applications can run using the older flavor.
mapred-site.xml has a configuration property, mapreduce.framework.name:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
</configuration>
The above can be configured to choose whether to use YARN or not. The possible values for this property are local, classic, or yarn.
The default value is "local". Set this to yarn if you want to use YARN.
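To see which mode a given mapred-site.xml selects, a small Python sketch like the one below can help. It mirrors the rule described above (fall back to local when the property is absent), but it is a simplified reading of the file, not Hadoop's own configuration loader, and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET

VALID_FRAMEWORKS = ("local", "classic", "yarn")


def framework_name(mapred_site_xml):
    """Return mapreduce.framework.name, falling back to the default 'local'."""
    root = ET.fromstring(mapred_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "mapreduce.framework.name":
            value = (prop.findtext("value") or "").strip()
            if value not in VALID_FRAMEWORKS:
                raise ValueError(f"unexpected framework: {value!r}")
            return value
    return "local"  # property not set, so the default applies


sample = """<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>"""
print(framework_name(sample))              # yarn
print(framework_name("<configuration/>"))  # local
```

With the value local, the job runs inside a single local JVM, which is why wordcount still reports MapReduce counters when YARN is not running.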

Get a yarn configuration from commandline

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command?
For example I would like to do something like this
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
A benefit of using this is that you'll see the actual, final results of any configuration properties as they are actually used by Hadoop. This would account for some of the more advanced configuration patterns, such as use of XInclude in the XML files or property substitutions, like this:
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to accurately match the implementation as it's done inside Hadoop, so it's better to ask Hadoop itself.
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation needing to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.
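For scripts that need these values, the advice to ask Hadoop itself can be wrapped in a few lines. The Python sketch below shells out to hdfs getconf -confKey, the command shown above; the function name and the injectable run parameter are conveniences invented for this example, so that the wrapper can be tried without a cluster.

```python
import subprocess


def hadoop_conf_value(key, run=subprocess.run):
    """Ask Hadoop for the effective value of a configuration key."""
    result = run(
        ["hdfs", "getconf", "-confKey", key],
        capture_output=True,
        text=True,
        check=True,  # raise if the hdfs command reports a failure
    )
    return result.stdout.strip()


def fake_run(argv, **kwargs):
    # Stand-in for a real cluster, echoing the sample output shown earlier.
    return subprocess.CompletedProcess(argv, 0, stdout="0.0.0.0:8032\n", stderr="")


print(hadoop_conf_value("yarn.resourcemanager.address", run=fake_run))
# On a node with Hadoop installed, drop the run argument:
#   hadoop_conf_value("yarn.scheduler.maximum-allocation-mb")
```

Because the value comes from Hadoop's own loader, substitutions such as ${yarn.resourcemanager.hostname} are already resolved in the returned string.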

Why do we need to format HDFS every time we restart the machine?

I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the paths where Hadoop stores its data (by default, Hadoop stores data in the /tmp folder).
The hdfs-site.xml file looks as below:
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now, whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node had not started by checking the logs and by using the jps command.
Then I:
Stopped the cluster using the stop-all.sh script.
Formatted HDFS using the hadoop namenode -format command.
Started the cluster using the start-all.sh script.
Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and try to start the cluster.
Has anyone encountered a similar problem?
Why is this happening, and
how can we solve this problem?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive across a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell which other dirs you have to change, since it depends on your config, but the namenode dir is mandatory and could also be sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
For those using Hadoop 2.0 or later, the config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your Hadoop installation.
Open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which is loaded by the Java classloader beforehand.
Add the dfs.namenode.name.dir property and set a new namenode dir (default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for the dfs.datanode.data.dir property (default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can also add this property:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
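To double-check that none of these directories still ends up under /tmp, a quick Python sketch such as the following may be useful. The three defaults are the ones quoted above; the function name and the assumed hadoop.tmp.dir value (/tmp/hadoop-user) are illustrative, and the check is a simplification that does not reproduce Hadoop's full configuration loading.

```python
import xml.etree.ElementTree as ET

# Defaults as quoted in the answer above.
DEFAULTS = {
    "dfs.namenode.name.dir": "file://${hadoop.tmp.dir}/dfs/name",
    "dfs.datanode.data.dir": "file://${hadoop.tmp.dir}/dfs/data",
    "dfs.namenode.checkpoint.dir": "file://${hadoop.tmp.dir}/dfs/namesecondary",
}


def _under_tmp(path, hadoop_tmp_dir):
    """True if a single directory entry resolves to /tmp or below."""
    path = path.strip().replace("${hadoop.tmp.dir}", hadoop_tmp_dir)
    if path.startswith("file://"):
        path = path[len("file://"):]
    return path == "/tmp" or path.startswith("/tmp/")


def keys_under_tmp(hdfs_site_xml, hadoop_tmp_dir="/tmp/hadoop-user"):
    """List the properties whose effective value still points into /tmp."""
    root = ET.fromstring(hdfs_site_xml)
    configured = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    risky = []
    for key, default in DEFAULTS.items():
        value = configured.get(key) or default  # unset means the default applies
        if any(_under_tmp(entry, hadoop_tmp_dir) for entry in value.split(",")):
            risky.append(key)
    return risky


sample = """<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
</configuration>"""
print(keys_under_tmp(sample))  # ['dfs.namenode.checkpoint.dir']
```

In this sample only the checkpoint directory is still on its default, which matches the optional property shown above.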

Resources