Unable to start Hive CLI on Hadoop (MapR)

I am trying to access the Hive CLI. However, it is failing to start with the following AccessControl issue.
Strangely enough, I am able to query Hive data from Hue without the AccessControl issue, yet the Hive CLI does not work.
I am on a MapR cluster.
Any help is much appreciated.
[<user_name>@<edge_node> ~]$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/mapr/hive/hive-2.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mapr/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/opt/mapr/hive/hive-2.1/conf/hive-log4j2.properties Async: true
2017-09-23 23:52:08,988 WARN [main] DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/spark/spark-2.1.0/jars/datanucleus-api-jdo-4.2.4.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/hive/hive-2.1/lib/datanucleus-api-jdo-4.2.1.jar."
2017-09-23 23:52:08,993 WARN [main] DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/spark/spark-2.1.0/jars/datanucleus-core-4.1.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/hive/hive-2.1/lib/datanucleus-core-4.1.6.jar."
2017-09-23 23:52:09,004 WARN [main] DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/spark/spark-2.1.0/jars/datanucleus-rdbms-4.1.19.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/hive/hive-2.1/lib/datanucleus-rdbms-4.1.7.jar."
2017-09-23 23:52:09,038 INFO [main] DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
2017-09-23 23:52:09,039 INFO [main] DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
2017-09-23 23:52:14,2251 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:2172 Thread: 20235 mkdirs failed for /user/<user_name>, error 13
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: User <user_name>(user id 50005586) has been denied access to create <user_name>
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:617)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:531)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:646)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.hadoop.security.AccessControlException: User <user_name>(user id 50005586) has been denied access to create <user_name>
at com.mapr.fs.MapRFileSystem.makeDir(MapRFileSystem.java:1256)
at com.mapr.fs.MapRFileSystem.mkdirs(MapRFileSystem.java:1276)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1913)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.getDefaultDestDir(DagUtils.java:823)
at org.apache.hadoop.hive.ql.exec.tez.DagUtils.getHiveJarDirectory(DagUtils.java:917)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.createJarLocalResource(TezSessionState.java:616)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:256)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.beginOpen(TezSessionState.java:220)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:614)
... 10 more

The error is saying you're denied access to create a directory in the file system. This is likely /user/<user_name>, which will need to be created by the HDFS / MapR FS superuser.
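A sketch of the fix, run as the cluster superuser (the user name and group below are placeholders for your environment):
# as the MapR FS / HDFS superuser (e.g. mapr or hdfs)
hadoop fs -mkdir -p /user/<user_name>
hadoop fs -chown <user_name>:<group> /user/<user_name>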
As for "I am able to query hive data from Hue without the AccessControl": Hue communicates via Thrift and HiveServer2, whereas the Hive CLI bypasses HiveServer2 and is deprecated.
You should use Beeline instead.
beeline -n $(whoami) -u jdbc:hive2://hiveserver:10000/default
And if you're in a kerberized cluster, then you'll need some extra options there.
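For example, on a Kerberos-secured cluster the JDBC URL typically carries the HiveServer2 service principal; a sketch, where the realm is a placeholder:
kinit    # obtain a Kerberos ticket first
beeline -u "jdbc:hive2://hiveserver:10000/default;principal=hive/_HOST@EXAMPLE.COM"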

Related

Install Hive on Windows

I am trying to install Hive 3.1.2 on Windows 10 with Hadoop 3.2.2.
I can start the Hadoop server and start the Hive shell by running "hive".
The first problem is that it shows a lot of warnings:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-09-09T21:01:22,001 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/C:/my_programs/hive_3.1.2/conf/hive-site.xml
2021-09-09T21:01:22,303 WARN [main] org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.enable.impersonation does not exist
2021-09-09T21:01:23,557 WARN [main] org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.enable.impersonation does not exist
Hive Session ID = f879881f-c49b-449b-b8cf-81302c585358
Logging initialized using configuration in jar:file:/C:/my_programs/hive_3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
2021-09-09T21:01:25,309 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Created HDFS directory: /tmp/hive/admin/f879881f-c49b-449b-b8cf-81302c585358
2021-09-09T21:01:25,317 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Created local directory: C:/Users/admin/AppData/Local/Temp/admin/f879881f-c49b-449b-b8cf-81302c585358
2021-09-09T21:01:25,325 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Created HDFS directory: /tmp/hive/admin/f879881f-c49b-449b-b8cf-81302c585358/_tmp_space.db
2021-09-09T21:01:25,345 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: f879881f-c49b-449b-b8cf-81302c585358
2021-09-09T21:01:25,345 INFO [main] org.apache.hadoop.hive.ql.session.SessionState - Updating thread name to f879881f-c49b-449b-b8cf-81302c585358 main
2021-09-09T21:01:25,383 WARN [f879881f-c49b-449b-b8cf-81302c585358 main] org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.enable.impersonation does not exist
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
2021-09-09T21:01:53,237 INFO [f879881f-c49b-449b-b8cf-81302c585358 main] CliDriver - Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
The Hive shell still starts, but when I run
hive> show databases;
it fails with an error like this:
hive> show databases;
2021-09-09T21:05:01,341 INFO [f879881f-c49b-449b-b8cf-81302c585358 main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: f879881f-c49b-449b-b8cf-81302c585358
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
2021-09-09T21:05:43,092 INFO [f879881f-c49b-449b-b8cf-81302c585358 main] org.apache.hadoop.hive.conf.HiveConf - Using the default value passed in for log id: f879881f-c49b-449b-b8cf-81302c585358
2021-09-09T21:05:43,093 INFO [f879881f-c49b-449b-b8cf-81302c585358 main] org.apache.hadoop.hive.ql.session.SessionState - Resetting thread name to main
hive>
I have read some solutions and I think the problem comes from the Hive metastore.
I followed a tutorial that connects a Derby metastore with Hive.
But when I try to run
schematool -dbType derby -initSchema
Windows cannot run schematool as a command line.
So I am really confused: how can I initialize the metastore database for Hive, or can it be done another way?
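One commonly suggested workaround, given that the Hive 3.x binary distribution does not ship .cmd launchers for its tools, is to invoke schematool through the hive driver script from a Unix-like shell such as Cygwin or Git Bash. A sketch, assuming HIVE_HOME is set:
$HIVE_HOME/bin/hive --service schemaTool -dbType derby -initSchema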
Update 2021/9/20:
I have fixed all the variables in my paths, and right now I am stuck at a new problem. The error is quite clear, but my research turned up no solution:
PS C:\my_programs\hive_3.1.2> .\bin\hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/my_programs/hive_3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/my_programs/hadoop-3.2.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-09-20T20:22:34,274 INFO [main] org.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/C:/my_programs/hive_3.1.2/conf/hive-site.xml
2021-09-20T20:22:34,694 WARN [main] org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.enable.impersonation does not exist
2021-09-20T20:22:37,728 WARN [main] org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.enable.impersonation does not exist
Hive Session ID = 89fc5e06-2a55-496c-aea0-ab5512839ac3
Logging initialized using configuration in jar:file:/C:/my_programs/hive_3.1.2/lib/hive-exec-3.1.2.jar!/hive-log4j2.properties Async: true
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.getTimeDuration(Ljava/lang/String;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/TimeUnit;)J
at org.apache.hadoop.hdfs.client.impl.DfsClientConf.<init>(DfsClientConf.java:248)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:307)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:291)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:173)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3354)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:624)
at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:591)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:747)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
PS C:\my_programs\hive_3.1.2>
It seems like an error in the HDFS configuration, but I don't know what I need to change on the Hadoop side: core-site.xml, hdfs-site.xml, or something else. Does Hadoop need some configuration to connect with Hive?
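For what it's worth, a NoSuchMethodError on org.apache.hadoop.conf.Configuration usually points at two different Hadoop versions colliding on the classpath rather than at core-site.xml or hdfs-site.xml. A quick check (a sketch, assuming Cygwin or Git Bash and the paths shown above) is to list the Hadoop jars Hive can see and confirm the versions all match the installed Hadoop:
# every hadoop-*.jar found here should carry the same version (3.2.2)
find /c/my_programs/hive_3.1.2/lib /c/my_programs/hadoop-3.2.2/share/hadoop -name 'hadoop-*.jar' | sort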

Hive remote postgres metastore

I was doing a multi-node setup using the Apache distribution. I was able to complete the Hadoop installation successfully (Hadoop 2.7.3).
When I tried Hive (Hive 2.3), it worked without issues with the default metastore (Derby). Then I changed hive-site.xml to point to my external Postgres DB.
I gave the host, username, and password as per the tutorial. But when I ran the schema init it failed as below, still showing Derby details, and initialization failed. Has anybody faced the same issue?
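For reference, the relevant hive-site.xml entries for an external Postgres metastore look roughly like this (host, database name, and credentials are placeholders):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://<host>:5432/<metastore_db></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value><username></value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value><password></value>
</property>
Note that schematool only picks these up if the hive-site.xml it reads is the one under $HIVE_HOME/conf (or HIVE_CONF_DIR); if it still reports Derby, that is a strong hint this file is not the one being loaded. The failing run: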
bash-4.2$ /data/hive/bin/schematool -initSchema -dbType postgres --verbose
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.postgres.sql
Connecting to jdbc:derby:;databaseName=metastore_db;create=true
Connected to: Apache Derby (version 10.10.2.0 - (1582446))
Driver: Apache Derby Embedded JDBC Driver (version 10.10.2.0 - (1582446))
Transaction isolation: TRANSACTION_READ_COMMITTED
0: jdbc:derby:> !autocommit on
Autocommit status: true
0: jdbc:derby:> SET statement_timeout = 0
Error: Syntax error: Encountered "statement_timeout" at line 1, column 5. (state=42X01,code=30000)
Closing: 0: jdbc:derby:;databaseName=metastore_db;create=true
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:590)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:563)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.IOException: Schema script failed, errorcode 2
at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:980)
at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:959)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:586)
... 8 more
*** schemaTool failed ***

Unable to read HiveServer2 configs from ZooKeeper

I use HDP 3.1, and I used Ambari to deploy the Hadoop cluster and Hive. After deployment, I can run hive in the shell successfully. I then deployed Apache Kylin 2.6, and it can sync Hive tables. But when I build the cube, I got the following error:
java.io.IOException: OS command error exit with return code: 1, error message: SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://datacenter1:2181,datacenter2:2181,datacenter3:2181/default;password=hdfs;serviceDiscoveryMode=zooKeeper;user=hdfs;zooKeeperNamespace=hiveserver2
19/02/15 10:04:53 [main]: INFO jdbc.HiveConnection: Connected to datacenter3:10000
19/02/15 10:04:53 [main]: WARN jdbc.HiveConnection: Failed to connect to datacenter3:10000
19/02/15 10:04:53 [main]: ERROR jdbc.Utils: Unable to read HiveServer2 configs from ZooKeeper
Error: Could not open client transport for any of the Server URI's in ZooKeeper: Failed to open new session: java.lang.IllegalArgumentException: Cannot modify dfs.replication at runtime. It is not in list of params that are allowed to be modified at runtime (state=08S01,code=0)
Cannot run commands specified using -e. No current connection
The command is:
hive -e "USE default;
I ran the hive command in a shell, and it succeeded. The connection string is the same as the one used when building the cube in Kylin. I'm confused why it succeeds in the shell but fails when building the cube.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.0.0-78/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://datacenter1:2181,datacenter2:2181,datacenter3:2181/default;password=hdfs;serviceDiscoveryMode=zooKeeper;user=hdfs;zooKeeperNamespace=hiveserver2
19/02/15 12:10:19 [main]: INFO jdbc.HiveConnection: Connected to datacenter3:10000
Connected to: Apache Hive (version 3.1.0.3.1.0.0-78)
Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
0: jdbc:hive2://datacenter1:2181,datacenter2:>
You can try to add these two properties to hive-site.xml.
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>
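These whitelist patterns control which configuration names a client may override at runtime (for example via --hiveconf). Note that the values above do not match dfs.replication, so to cover the error in question the pattern would presumably need a |dfs.* alternative appended as well.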
Finally, I found the root cause. There is a 'Cannot modify dfs.replication at runtime.' message in the error log. Kylin sets this property in $KYLIN_HOME/conf/kylin_hive_conf.xml, and when it runs the hive command it automatically appends the properties in that file. The final command looks like: hive --hiveconf dfs.replication=2 ..........
It looks like the dfs.replication property can't be appended to the hive command, so I removed this property from kylin_hive_conf.xml, and it works now.
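In other words, the block to delete (or comment out) in $KYLIN_HOME/conf/kylin_hive_conf.xml should look roughly like this (the value 2 is taken from the command above):
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>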

Failed to initialize schema for HiveServer2 in Apache Hive 3.0.0 on Cygwin (Windows 10)

I already had a Hadoop 3.0.0 cluster consisting of two machines: one namenode + RM and one datanode. I tried to install Apache Hive 3.0.0 by following this document.
When I ran schematool -dbType derby -initSchema --verbose on Cygwin, an exception was thrown:
$ schematool -dbType derby -initSchema --verbose
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/BigSol/apache-hive-3.0.0-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/BigSol/hadoop-3.0.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 3.0.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 3.0.0
org.apache.hadoop.hive.metastore.HiveMetaException: Unknown version specified for initialization: 3.0.0
at org.apache.hadoop.hive.metastore.MetaStoreSchemaInfo.generateInitFileName(MetaStoreSchemaInfo.java:137)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:580)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:562)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1445)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
*** schemaTool failed ***
When viewing the line of code that threw the exception, I found that Hive was trying to find a SQL schema located at %HIVE_HOME%\scripts\metastore\upgrade\derby\hive-schema-3.0.0.derby.sql.
I suspect that Cygwin messed up the path so that Hive didn't find that schema.
My questions:
How can I correct the path (or fix the problem)?
Are there batch files equivalent to the *.sh files in the %HIVE_HOME%\bin directory, as Hive 2.1.1 had?
I found the solution. After running schematool on a Linux machine and copying the metastore_db directory to the Windows machine, I managed to start HiveServer2, but the Beeline CLI said that the jar at C:\cygdrive\c\BigSol\apache-hive-3.0.0-bin\lib\hive-beeline-3.1.0.jar was not found.
It turned out that Java under Cygwin parsed the wrong path. I made a symbolic link from C:\cygdrive\c to C:\ and it worked.
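For the record, such a link can be created from an elevated Windows Command Prompt; a sketch, assuming C:\cygdrive does not already exist:
:: create the symbolic link described above
mkdir C:\cygdrive
mklink /D C:\cygdrive\c C:\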

Spark shell throwing exception after trying to integrate s3 / hadoop

I'm working on a Windows machine trying to set up a Spark test stack; the aim is to read and write files to an S3 bucket.
I'm running Spark 1.6.1. When I run spark-shell I now receive an error:
16/03/22 15:19:48 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
16/03/22 15:19:48 INFO HiveMetaStore.audit: ugi=Administrator ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/03/22 15:19:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
Doing some reading led me to believe that I need to add the AWS jars as an argument; the jars are included in the Hadoop distribution.
I then run C:\Spark\hadoop\share\hadoop\tools\lib>spark-shell --jars aws-java-sdk-1.7.4.jar, hadoop-aws-2.7.1.jar
thinking that I'm now including the jars and so it must be OK... how foolish of me: I get the exact same error.
I then tried to include just the hadoop-aws jar, and all kinds of exceptions were thrown, including not being able to instantiate Hive, s3a couldn't be instantiated, AWSCredentials wasn't happy, and so on.
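One thing worth noting: spark-shell's --jars option takes a single comma-separated list with no spaces, so the command above most likely passed only the first jar. The corrected form would be:
spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.1.jar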
I'm at a bit of a loss, if anyone can shed some light on what I might be doing wrong I'll happily buy them a pint :)
EDIT:
I've since updated the core-site.xml file by removing the fs.defaultFS property with a value of s3n://mybucketname, and Spark will now load.
In its stead I have hdfs://0.0.0.0:19000, which is working fine.
So I guess my question changes from 'gaaaaah' to 'gaaaaah, how does one include s3 correctly as a filesystem?'
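For reference, a sketch of the usual way to wire s3n in (not a verified fix for this particular setup): keep fs.defaultFS pointing at HDFS, keep aws-java-sdk and hadoop-aws on the classpath, supply credentials in core-site.xml, and then address the bucket explicitly as s3n://mybucketname/... in your paths:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>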
