How can I access S3/S3n from a local Hadoop 2.6 installation? - hadoop

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop at the time of writing, 2.6.0.
Now I would like to access an S3 bucket, as I do inside the EMR cluster.
I have added the AWS credentials to core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>some key</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>some key</value>
</property>
Note: Since there are some slashes in the key, I have escaped them with %2F.
If I try to list the contents of the bucket:
hadoop fs -ls s3://some-url/bucket/
I get this error:
ls: No FileSystem for scheme: s3
I edited core-site.xml again, and added information related to the fs:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
This time I get a different error:
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.

The jar hadoop-aws-[version].jar, which contains the NativeS3FileSystem implementation, is not on Hadoop's classpath by default in versions 2.6 and 2.7. Try adding it to the classpath by putting the following line in hadoop-env.sh, which is located at $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
This assumes you are using Apache Hadoop 2.6 or 2.7.
By the way, you can check Hadoop's classpath using:
bin/hadoop classpath
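As a rough way to confirm that the change took effect, the output of the classpath command can be scanned for the hadoop-aws jar. The following is only a sketch: the helper name find_jars is made up for illustration, and it assumes the classpath is a list of entries separated by the platform path separator in which a trailing * stands for every jar in that directory.

```python
import glob
import os


def find_jars(classpath, jar_prefix="hadoop-aws"):
    """Return jars on a Hadoop-style classpath whose file name starts with jar_prefix.

    Hypothetical helper, not part of Hadoop.
    """
    matches = []
    for entry in classpath.strip().split(os.pathsep):
        if entry.endswith("*"):
            # A trailing * is treated as "every jar in this directory".
            candidates = glob.glob(os.path.join(entry[:-1], "*.jar"))
        else:
            candidates = [entry]
        for path in candidates:
            name = os.path.basename(path)
            if name.startswith(jar_prefix) and name.endswith(".jar"):
                matches.append(path)
    return sorted(matches)


# Possible usage, assuming the hadoop command is on PATH:
#   import subprocess
#   cp = subprocess.run(["hadoop", "classpath"], capture_output=True,
#                       text=True, check=True).stdout
#   print(find_jars(cp) or "hadoop-aws jar not found on the classpath")
```

An empty result would suggest that the tools/lib directory is still not being picked up.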

import os
# Set this before the SparkContext is created so the S3 packages are fetched at startup.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext("local[*]")
sqlContext = SQLContext(sc)
# Read the credentials interactively instead of hard-coding them.
myAccessKey = input()
mySecretKey = input()
# Map the s3:// scheme to the native S3 filesystem and hand it the credentials.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
# Read a Parquet dataset straight from the bucket.
df = sqlContext.read.parquet("s3://myBucket/myKey")

@Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX rather than $HADOOP_HOME when running v2.6 on Ubuntu. Perhaps this is because $HADOOP_HOME is being deprecated?
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*
Having said that, neither worked for me on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely kludgy export:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*

To resolve this issue I tried all of the above, which failed (in my environment anyway).
However, I was able to get it working by copying the two jars mentioned above from the tools dir into common/lib.
It worked fine after that.

If you are using HDP 2.x or greater you can try modifying the following property in the MapReduce2 configuration settings in Ambari.
mapreduce.application.classpath
Append the following value to the end of the existing string:
/usr/hdp/${hdp.version}/hadoop-mapreduce/*

Related

Access hdfs from outside the cluster

I have a Hadoop cluster on AWS and I am trying to access it from outside the cluster through a Hadoop client. I can successfully run hdfs dfs -ls and see all contents, but when I try to put or get a file I get this error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.fs.FsShell.displayError(FsShell.java:304)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:289)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
I have Hadoop 2.6.0 installed on both my cluster and my local machine. I have copied the conf files of the cluster to the local machine and have these options in hdfs-site.xml (along with some other options).
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.enable</name>
<value>false</value>
</property>
My core-site.xml contains a single property in both the cluster and the client:
<property>
<name>fs.defaultFS</name>
<value>hdfs://public-dns:9000</value>
<description>NameNode URI</description>
</property>
I found similar questions but wasn't able to find a solution to this.
How about SSHing into that machine?
I know this is a very bad idea, but to get the work done you can first copy the file to the machine using scp, then SSH into the cluster/master and run hdfs dfs -put on the copied local file.
You can also automate this with a script, but again, this is just to get the work done for now.
Wait for someone else to answer to learn the proper way!
I had a similar issue with my cluster when running hadoop fs -get, and I was able to resolve it. Check that all your data nodes are resolvable by FQDN (fully qualified domain name) from your local host. In my case, the nc command succeeded using the IP addresses of the data nodes but not using their host names.
Run the command below:
for i in $(cat /<host list file>); do nc -vz "$i" 50010; done
50010 is the default datanode port.
When you run any hadoop command, it tries to connect to the data nodes using their FQDNs, and that is where it gives this weird NPE.
Run the export below and then your hadoop command:
export HADOOP_ROOT_LOGGER=DEBUG,console
You will see that this NPE occurs when it is trying to connect to a datanode for data transfer.
I also had Java code doing the equivalent of hadoop fs -get through the APIs, and there the exception was clearer:
java.lang.Exception: java.nio.channels.UnresolvedAddressException
Let me know if this helps you.
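The nc loop above can also be written as a small script that separates the two failure modes, a host name that does not resolve and a port that cannot be reached. This is a minimal sketch using only the standard socket module; the function name check_datanode and the three status strings are made up for illustration, and 50010 is the datanode port quoted in the answer.

```python
import socket


def check_datanode(host, port=50010, timeout=3.0):
    """Return 'ok', 'unresolvable' or 'unreachable' for one host:port pair.

    Hypothetical helper, not part of Hadoop.
    """
    try:
        # Step 1: can the client resolve the name at all?
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "unresolvable"
    try:
        # Step 2: can the client open a TCP connection to the port?
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError:
        return "unreachable"


# Possible usage with a file containing one datanode host name per line:
#   with open("hosts.txt") as fh:
#       for line in fh:
#           host = line.strip()
#           if host:
#               print(host, check_datanode(host))
```

A host reported as unresolvable here would match the UnresolvedAddressException case described above.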

How do I run sqlline with Phoenix?

When I try to run Phoenix's sqlline.py localhost command, I get
WARN util.DynamicClassLoader: Failed to identify the fs of dir hdfs://localhost:54310/hbase/lib, ignored
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass...
and nothing else happens. I also could not get Squirrel to work (it freezes when I click 'list drivers').
As per these instructions, I have copied phoenix-4.2.1-server.jar to my hbase/lib folder and restarted hbase. I have also copied core-site.xml and hbase-site.xml to my phoenix/bin directory.
I have not added 'the phoenix-[version]-client.jar to the classpath of any Phoenix client'
since I do not know what this refers to.
I am using HBase 0.98.6.1-hadoop2, Phoenix 4.2.1 and hadoop 2.2.0.
I fixed the same issue by adding this setting to ${PHOENIX_HOME}/bin/hbase-site.xml:
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
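If the file is edited by a script rather than by hand, the property can be added or updated with the standard XML library. This is only a sketch: set_property is a hypothetical helper, it assumes the usual layout of property elements directly under a configuration root, and ElementTree drops XML comments and processing instructions when it rewrites the file.

```python
import xml.etree.ElementTree as ET


def set_property(xml_text, name, value):
    """Add or update one <property> in a Hadoop-style configuration document.

    Hypothetical helper; returns the modified XML as a string.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # The property already exists: overwrite its value.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # The property is missing: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Possible usage:
#   updated = set_property(open("hbase-site.xml").read(), "fs.hdfs.impl",
#                          "org.apache.hadoop.hdfs.DistributedFileSystem")
```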

How to use Hive without hadoop

I am new to NoSQL solutions and want to play with Hive. But installing HDFS/Hadoop takes a lot of resources and time (maybe because I lack experience, but I have no time for this).
Are there ways to install and use Hive on a local machine without HDFS/Hadoop?
Yes, you can run Hive without Hadoop:
1. Create your warehouse on your local system.
2. Set the default fs to file:///
Then you can run Hive in local mode without a Hadoop installation.
In hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<!-- this should eventually be deprecated since the metastore should supply this -->
<name>hive.metastore.warehouse.dir</name>
<value>file:///tmp</value>
<description></description>
</property>
<property>
<name>fs.default.name</name>
<value>file:///tmp</value>
</property>
</configuration>
If you just want to try Hive before making a decision, you can use a preconfigured VM as @Maltram suggested (Hortonworks, Cloudera, IBM and others all offer such VMs).
Keep in mind that you will not be able to use Hive in production without Hadoop and HDFS, so if that is a problem for you, you should consider alternatives to Hive.
You can't. Just download Hive and run:
./bin/hiveserver2
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
Hadoop is like a core, and Hive needs some libraries from it.
Update: This answer is out of date. With Hive on Spark it is no longer necessary to have HDFS support.
Hive requires HDFS and map/reduce, so you will need them. The other answer has some merit in recommending a simple, preconfigured means of getting all of the components set up for you.
But the gist of it is: Hive needs Hadoop and M/R, so to some degree you will need to deal with them.
The top answer works for me, but a few more setup steps are needed. I spent quite some time searching around to fix multiple problems before I finally got it set up. Here I summarize the steps from scratch:
Download Hive and decompress it
Download Hadoop, decompress it, and put it in the same parent folder as Hive
Set up hive-env.sh
$ cd hive/conf
$ cp hive-env.sh.template hive-env.sh
Add the following environment variables to hive-env.sh (adjust the paths to your actual Java/Hadoop versions):
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=${bin}/../../hadoop-3.3.1
Set up hive-site.xml
$ cd hive/conf
$ cp hive-default.xml.template hive-site.xml
Replace all the ${system:***} variables with constant paths (not sure why these are not recognized on my system).
Set the database path to local with the following properties (copied from the top answer)
<configuration>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<!-- this should eventually be deprecated since the metastore should supply this -->
<name>hive.metastore.warehouse.dir</name>
<value>file:///tmp</value>
<description></description>
</property>
<property>
<name>fs.default.name</name>
<value>file:///tmp</value>
</property>
</configuration>
Set up hive-log4j2.properties (optional, good for troubleshooting)
cp hive-log4j2.properties.template hive-log4j2.properties
Replace all the ${sys:***} variables with constant paths
Set up metastore_db
If you run hive directly, any DDL statement fails with this error:
FAILED: HiveException org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ? createDatabaseIfNotExist=true for mysql))
In that case, recreate metastore_db with the following commands:
$ cd hive/bin
$ rm -rf metastore_db
$ ./schematool -initSchema -dbType derby
Start hive
$ cd hive/bin
$ ./hive
Now you should be able to run hive on your local file system. One thing to note: metastore_db is always created in your current directory. If you start hive from a different directory, you need to create it again.
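The manual replacement of the ${system:***} and ${sys:***} placeholders in the steps above can be scripted. The following is a minimal sketch, not something shipped with Hive: replace_vars is a hypothetical helper, and the mapping of variable names to paths is an assumption that has to be filled in for the machine at hand. Placeholders that are missing from the mapping are left untouched.

```python
import re


def replace_vars(text, values, prefix="system"):
    """Replace ${prefix:name} placeholders with constant values.

    Hypothetical helper; use prefix="sys" for hive-log4j2.properties.
    """
    pattern = re.compile(r"\$\{" + re.escape(prefix) + r":([^}]+)\}")

    def substitute(match):
        # Keep the placeholder as-is when no replacement is known.
        return values.get(match.group(1), match.group(0))

    return pattern.sub(substitute, text)


# Possible usage (the paths are placeholders chosen for illustration):
#   with open("hive-site.xml") as fh:
#       text = fh.read()
#   text = replace_vars(text, {"java.io.tmpdir": "/tmp/hive", "user.name": "me"})
#   with open("hive-site.xml", "w") as fh:
#       fh.write(text)
```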
It is completely normal to use Hive without HDFS, but there are a few details to keep in mind:
As a few commenters mentioned above, you'll still need some .jar files from Hadoop common.
As of today (December 2020), it's difficult to run the Hive/Hadoop 3 pair. Use stable Hadoop 2 with Hive 2.
Make sure POSIX permissions are set correctly, so your local Hive can access the warehouse and the Derby database location.
Initialize your database with a manual call to schematool.
You can use a site.xml file pointing to the local POSIX filesystem, but you can also set those options in the HIVE_OPTS environment variable.
I covered that, with examples of the errors I've seen, in my blog post.
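For the HIVE_OPTS route, the value is a string of command-line options. The sketch below builds such a string from a dictionary, on the assumption that HIVE_OPTS is passed through to the Hive command line and therefore accepts the same --hiveconf key=value pairs. The helper name build_hive_opts is made up, and the two settings shown are the ones from the hive-site.xml earlier in this thread.

```python
def build_hive_opts(settings):
    """Turn a dict of Hive settings into a --hiveconf option string.

    Hypothetical helper. Values containing whitespace are rejected because
    the string is expected to be split on whitespace by the shell.
    """
    parts = []
    for key, value in settings.items():
        if any(ch.isspace() for ch in key + value):
            raise ValueError(f"whitespace is not supported in {key!r}")
        parts.append(f"--hiveconf {key}={value}")
    return " ".join(parts)


# Possible usage before launching hive from Python:
#   import os
#   os.environ["HIVE_OPTS"] = build_hive_opts({
#       "hive.metastore.warehouse.dir": "file:///tmp",
#       "fs.default.name": "file:///tmp",
#   })
```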

Hadoop/MR temporary directory

I've been struggling with getting Hadoop and Map/Reduce to start using a separate temporary directory instead of the /tmp on my root directory.
I've added the following to my core-site.xml config file:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
I've added the following to my mapreduce-site.xml config file:
<property>
<name>mapreduce.cluster.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
<name>mapreduce.jobtracker.system.dir</name>
<value>${hadoop.tmp.dir}/mapred/system</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>${hadoop.tmp.dir}/mapred/staging</value>
</property>
<property>
<name>mapreduce.cluster.temp.dir</name>
<value>${hadoop.tmp.dir}/mapred/temp</value>
</property>
No matter what job I run though, it's still doing all of the intermediate work out in the /tmp directory. I've been watching it do it via df -h and when I go in there, there are all of the temporary files it creates.
Am I missing something from the config?
This is on a 10 node Linux CentOS cluster running 2.1.0.2.0.6.0 of Hadoop/Yarn Mapreduce.
EDIT:
After some further research, the settings seem to be working on my management and namenode/secondary namenode boxes. It is only on the data nodes that this is not working, and only the MapReduce temporary output files are still going to /tmp on my root drive, not to the data mount that I set in the configuration files.
If you are running Hadoop 2.0, then the proper name of the config file you need to change is mapred-site.xml, not mapreduce-site.xml.
An example can be found on the Apache site: http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
and it uses the mapreduce.cluster.local.dir property name, with a default value of ${hadoop.tmp.dir}/mapred/local
Try renaming your mapreduce-site.xml file to mapred-site.xml in your /etc/hadoop/conf/ directories and see if that fixes it.
If you are using Ambari, you should be able to use the "Add Property" button in the MapReduce2 / Custom mapred-site.xml section, enter 'mapreduce.cluster.local.dir' as the property name, and provide a comma-separated list of the directories you want to use.
I think you need to specify this property in hdfs-site.xml rather than core-site.xml. Try setting this property in hdfs-site.xml. I hope this will solve your problem.
The mapreduce properties should be in mapred-site.xml.
I was facing a similar issue where some nodes would not honor the hadoop.tmp.dir set in the config.
A reboot of the misbehaving nodes fixed it for me.
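To see why a setting that is never loaded (for example because it sits in a file Hadoop does not read) sends data back to /tmp, it helps to expand the ${...} references by hand. This is a simplified sketch of that substitution, not Hadoop's actual Configuration class: the function names are made up, the defaults are the documented default values of hadoop.tmp.dir and mapreduce.cluster.local.dir, and user.name stands in for the Java system property of the same name.

```python
import re

_VAR = re.compile(r"\$\{([^}]+)\}")

# Documented defaults from core-default.xml and mapred-default.xml.
DEFAULTS = {
    "hadoop.tmp.dir": "/tmp/hadoop-${user.name}",
    "mapreduce.cluster.local.dir": "${hadoop.tmp.dir}/mapred/local",
}


def expand(value, props, max_depth=20):
    """Repeatedly substitute ${name} references using props (simplified)."""
    for _ in range(max_depth):
        expanded = _VAR.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


def effective_local_dir(site_props):
    """Resolve mapreduce.cluster.local.dir from site settings over the defaults."""
    merged = {**DEFAULTS, **site_props}
    return expand(merged["mapreduce.cluster.local.dir"], merged)


# With hadoop.tmp.dir picked up from core-site.xml:
#   effective_local_dir({"hadoop.tmp.dir": "/data/tmp", "user.name": "hdfs"})
# With the setting not loaded, the default under /tmp applies:
#   effective_local_dir({"user.name": "hdfs"})
```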

Why do we need to format HDFS after every time we restart machine?

I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the path where Hadoop stores its data (by default, Hadoop stores data in the /tmp folder).
The hdfs-site.xml file looks like this:
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node has not started by checking the logs and by using the jps command.
Then I:
Stopped the cluster using the stop-all.sh script.
Formatted HDFS using the hadoop namenode -format command.
Started the cluster using the start-all.sh script.
Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and try to start the cluster.
Has anyone encountered a similar problem?
Why is this happening, and
how can we solve this problem?
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive across a reboot. However there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change, it depends on your config, but the namenode dir is mandatory, could be also sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
For those who use Hadoop 2.0 or above, the config file names may be different.
As this answer points out, go to the /etc/hadoop directory of your Hadoop installation.
Open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which is loaded first by the Java classloader.
Add the dfs.namenode.name.dir property and set a new namenode dir (the default is file://${hadoop.tmp.dir}/dfs/name).
Do the same for the dfs.datanode.data.dir property (the default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can easily also add this property:
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
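A quick way to check which of these directories would still end up under the temporary directory is to scan hdfs-site.xml for the three properties named above. This is a rough sketch with a made-up helper name. It treats a missing property as risky because its default lives under ${hadoop.tmp.dir}, which is only a problem if hadoop.tmp.dir itself still points at /tmp, and it does not look at older property names such as dfs.name.dir or dfs.data.dir.

```python
import xml.etree.ElementTree as ET

# The properties discussed above, all of which default to ${hadoop.tmp.dir}/dfs/...
DIR_PROPERTIES = (
    "dfs.namenode.name.dir",
    "dfs.datanode.data.dir",
    "dfs.namenode.checkpoint.dir",
)


def tmp_backed_dirs(hdfs_site_xml):
    """Return the directory properties that are unset or point at a tmp location.

    Hypothetical helper, not part of Hadoop.
    """
    root = ET.fromstring(hdfs_site_xml)
    configured = {
        prop.findtext("name"): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }
    risky = []
    for name in DIR_PROPERTIES:
        value = configured.get(name)
        if (
            value is None  # falls back to the default under hadoop.tmp.dir
            or "${hadoop.tmp.dir}" in value
            or value.startswith(("/tmp", "file:/tmp", "file:///tmp"))
        ):
            risky.append(name)
    return risky


# Possible usage:
#   print(tmp_backed_dirs(open("hdfs-site.xml").read()))
```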

Resources