I have a Cloudera VM and was able to set up the AWS CLI and keys. But I am not able to read or list S3 files using hadoop fs -ls s3://gft-ri or any other Hadoop command, even though I can see the directory/files using the AWS CLI.
Snapshot of the commands:
(base) [cloudera@quickstart conf]$ aws s3 ls s3://gft-risk-aml-market-dev/
PRE test/
2019-11-27 04:11:26 458 required
(base) [cloudera@quickstart conf]$ hdfs dfs -ls s3://gft-risk-aml-market-dev/
19/11/27 05:30:45 WARN fs.FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
ls: `s3://gft-risk-aml-market-dev/': No such file or directory
I have put the following details in core-site.xml:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>ANHS</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>EOo</value>
</property>
<property>
<name>fs.s3.path.style.access</name>
<value>true</value>
</property>
<property>
<name>fs.s3.endpoint</name>
<value>s3.us-east-1.amazonaws.com</value>
</property>
<property>
<name>fs.s3.connection.ssl.enabled</name>
<value>false</value>
</property>
Finally: on Cloudera Quickstart V13, the core-site.xml below worked.
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3a.awsAccessKeyId</name>
<value>AKIAxxxx</value>
</property>
<property>
<name>fs.s3a.awsSecretAccessKey</name>
<value>Xxxxxx</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
</property>
<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
<description>The implementation class of the S3A AbstractFileSystem.</description>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>s3.us-east-1.amazonaws.com</value>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
</property>
<property>
<name>fs.s3a.readahead.range</name>
<value>64K</value>
<description>Bytes to read ahead during a seek() before closing and
re-opening the S3 HTTP connection. This option will be overridden if
any call to setReadahead() is made to an open stream.</description>
</property>
<property>
<name>fs.s3a.list.version</name>
<value>2</value>
<description>Select which version of the S3 SDK's List Objects API to use.
Currently support 2 (default) and 1 (older API).</description>
</property>
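With that in place, the listing goes through the S3A connector, so the scheme in the command changes from s3:// to s3a://, e.g.:
hdfs dfs -ls s3a://gft-risk-aml-market-dev/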
I would use the Linux console to mount the S3 bucket and then move files from there into HDFS. You will probably need to install s3fs-fuse on the Cloudera Quickstart VM first by sudo'ing into root, e.g., sudo yum install s3fs-fuse.
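A rough sketch of that approach (the mount point and HDFS target directory are placeholders, and it assumes the same access key/secret pair stored in ~/.passwd-s3fs as key:secret):
# install s3fs-fuse (from EPEL) as root
sudo yum install -y s3fs-fuse
# store the key:secret pair for s3fs (placeholder values from the question)
echo 'AKIAxxxx:Xxxxxx' > ~/.passwd-s3fs && chmod 600 ~/.passwd-s3fs
# mount the bucket, then copy a file into HDFS
mkdir -p /mnt/gft-risk-aml-market-dev
s3fs gft-risk-aml-market-dev /mnt/gft-risk-aml-market-dev -o passwd_file=${HOME}/.passwd-s3fs
hdfs dfs -put /mnt/gft-risk-aml-market-dev/required /user/cloudera/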
I put the README.txt file into HDFS and run the jar command, but the job doesn't proceed any further. This is the last terminal screen.
I think "SASL encryption trust check" or "Unable to find 'resource-types.xml'" are the problem so I tried to insert
export HADOOP_SECURE_DN_USER=
into hadoop-env.sh, and inserting
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
into mapred-site.xml, but it still didn't work.
Hadoop version is 3.1.3
Java version is oracle java 1.8.0_212
Config files: hdfs-site.xml, core-site.xml, mapred-site.xml, yarn-site.xml.
Please help me. This is the page at port 8088. Is it the YARN UI?
When I enable the YARN timeline server by adding the following configurations to yarn-site.xml in Cloudera Manager Advanced Configuration Options:
<property>
<name>yarn.timeline-service.hostname</name>
<value>yarn-hostname</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.generic-application-history.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.ttl-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
<value>true</value>
</property>
and then restart the cluster, the timeline server does not start. How can I solve this problem? What is the mechanism by which Cloudera Manager manages Hadoop? I cannot find any timeline server log in the YARN logs.
The CDH version is CDH-5.3.6-1.cdh5.3.6.p0.11 and the Hadoop version is 2.5.0.
If you want to start the timeline server, connect to your YARN node and run the command below:
yarn timelineserver
That will start it. If you want to start it with your own config, you can edit yarn-site.xml in CM.
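If you would rather run it as a background daemon instead of in the foreground, stock Hadoop 2.x also ships a daemon script; under a CM/parcel install the script path may differ, so treat this as a sketch:
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh start timelineserver
# and to stop it again
$HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop timelineserver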
Hi everyone, I want to use GridGain with Hadoop 2.4.0.
My Hadoop config is below.
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop-data</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>ggfs://ggfs@R</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/hadoop-data/journal</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>r,host002,host004</value>
</property>
<property>
<name>fs.AbstractFileSystem.ggfs.impl</name>
<value>org.gridgain.grid.ggfs.hadoop.v2.GridGgfsHadoopFileSystem</value>
</property>
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>NEVER</value>
</property>
</configuration>
I finished the configuration and started HDFS. When I run
hadoop fs -ls /
ls: No FileSystem for scheme: ggfs
What should I do? Thanks.
Add the following to core-site.xml:
<property>
<name>fs.ggfs.impl</name>
<value>org.gridgain.grid.ggfs.hadoop.v1.GridGgfsHadoopFileSystem</value>
</property>
The second version of the Hadoop FileSystem API is rarely used; most of the Hadoop ecosystem works through the first version of the API.
And if you want to use only GGFS, you don't need to start the HDFS services.
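With fs.ggfs.impl registered (and assuming the GridGain Hadoop client jars are on the Hadoop classpath), the listing from the question should now resolve the ggfs scheme, e.g.:
hadoop fs -ls /
# or spelling out the default FS from core-site.xml explicitly
hadoop fs -ls ggfs://ggfs@R/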
I am trying to run HBase in pseudo-distributed mode. I followed this link.
I am using Ubuntu 12.04, HBase 0.94.8, and Hadoop 2.4.0.
In hbase/conf/hbase-env.sh, I added the following:
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_25
export HBASE_REGIONSERVERS=/usr/lib/hbase/hbase-0.94.8/conf/regionservers
export HBASE_MANAGES_ZK=true
Then I set the HBASE_HOME path in the .bashrc file.
In hbase/conf/hbase-site.xml I added the following:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/prashasti/Installed/hbase-0.94.8/HBASE/zookeeper</value>
</property>
</configuration>
To prevent a version mismatch between Hadoop and HBase, I added hadoop-common-2.4.0.jar and hadoop-mapreduce-client-core-2.4.0.jar to the hbase/lib folder.
When I start HBase using
$ ./bin/start-hbase.sh
no error turns up, but the HMaster doesn't start.
Can you please try removing all the configuration parameters from hbase-site.xml except hbase.rootdir, and then try starting HBase again?
Also comment out export HBASE_REGIONSERVERS and export HBASE_MANAGES_ZK in hbase-env.sh.
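As a concrete starting point, a stripped-down hbase-site.xml along those lines would keep only the hbase.rootdir value from the question and let everything else fall back to defaults:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
</configuration>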
I have tried putting the following in my core-site.xml file by following this wiki http://wiki.apache.org/hadoop/AmazonS3, but my Hadoop cannot make a connection with S3n. I have Hadoop 1.2.1 deployed on an EC2 cluster. What is the correct way of configuring Hadoop with S3 storage? Thanks a lot!
<property>
<name>fs.default.name</name>
<value>s3://BUCKET</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
Did you try to replace s3 with s3n in your configuration?
Try this:
<property>
<name>fs.default.name</name>
<value>s3n://BUCKET</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
By the way, there are great explanations of the difference between s3 and s3n here.
Additionally, you may need to restart the HDFS cluster (not sure there is a need to start it at all, since we don't use it anymore).
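Once those s3n properties are in place, a quick way to check the connection (BUCKET is the same placeholder as above, and the file name is just an example) is:
hadoop fs -ls s3n://BUCKET/
hadoop fs -put somefile.txt s3n://BUCKET/somefile.txt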