File is not Loaded Into HDFS from Local Using Streamsets (validated Successfully!) - hadoop

I have just started using StreamSets, and I'm trying to load a text file from the local filesystem into HDFS.
Please note that I'm using Cloudera Manager. Here is a view of "core-site.xml":
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.proxyuser.sdc.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.sdc.groups</name>
<value>*</value>
</property>
</configuration>
The local file is a text file stored in "/home/cloudera/Desktop".
Here is a view of the source (Local) configuration in Streamsets:
Here is a view of Hadoop fs configuration in Streamsets:
It was validated successfully!
After I ran the pipeline, I expected to find the file in the HDFS directory that I specified, specifically under "/user/cloudera".
But when I run it, the file is not loaded.
I'm sure I missed something, and I couldn't find an answer for this.
Could you please help?
Thanks,

You need to run (play) the pipeline, not only validate it. Validation only checks the configuration and connectivity; it does not move any data.
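Once the pipeline is actually running, you can confirm the write from the command line. A minimal check, assuming the destination directory from the question ("/user/cloudera"):
hdfs dfs -ls /user/cloudera
# or, with the older client syntax:
hadoop fs -ls /user/cloudera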

Related

Hadoop localhost:9870 browser interface is not working

I need to do data analysis using Hadoop, so I have installed and configured Hadoop as below. But localhost:9870 is not working, even though I format the NameNode every time I work with it. Some articles and answers on this forum mention that 9870 replaced 50070. I am on Windows 10. I also tried the answers on this forum, but none of them worked. JAVA_HOME and HADOOP_HOME are set, and the paths to Hadoop's bin and sbin folders are set as well. Can anyone please tell me what I am doing wrong here?
I followed this site for the installation and configuration:
https://medium.com/#pedro.a.hdez.a/hadoop-3-2-2-installation-guide-for-windows-10-454f5b5c22d3
core-site.xml
I have set up the Java path in this xml as well.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9870</value>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.2.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.2.2\data\datanode</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
If you look at the NameNode logs, there is very likely an error saying that a port is already in use.
The default fs.defaultFS port should be 9000 (see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html); you shouldn't change it without good reason.
The NameNode web UI is not the value in fs.defaultFS. Its default port is 9870, and it is defined by dfs.namenode.http-address in hdfs-site.xml.
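For example, a corrected pair of settings might look like this (a sketch; keep the web UI on its default port instead of reusing it for the filesystem URI):
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
hdfs-site.xml (optional, since 9870 is already the default)
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:9870</value>
</property>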
Regarding "need to do data analysis": you can do analysis on Windows without Hadoop, using Spark, Hive, MapReduce, etc. directly, and it will have direct access to your machine without being limited by YARN container sizes.

set up Hadoop multi cluster on 2 windows 10

I am trying to set up a multi-node Hadoop cluster between two Windows 10 devices. I am using Hadoop 2.9.2.
How can I achieve that, please?
After a lot of trial and error, the following did the job for me.
Do the same configuration as in the previous answer by @AbsoluteBeginner.
Disable the Windows firewall on all machines (I think you could keep it on and just adjust the rules, but that's for you to find out).
Run hdfs namenode -format on all nodes (master and slaves).
Make sure that the datanode folder is empty on all 3 nodes (just Shift+Del).
On the master node, run start-all.cmd. All of the following should appear:
50436 NameNode
54696 NodeManager
54744 DataNode
60028 Jps
7340 ResourceManager
On the slave nodes, run start-all.cmd. All of the following should appear:
6116 DataNode
2408 Jps
3208 NodeManager
Note: the reason the NameNode and ResourceManager don't appear is that they are running on the master node and already occupy those ports; you only need the master's ResourceManager and NameNode running.
Note: if you have seen a Linux multi-cluster tutorial, the master node also shows SecondaryNameNode when executing jps; I'm not really sure why it doesn't appear on Windows.
Go to master:50070 and navigate to the datanodes; you should see something like this.
Go to master:8088 and navigate to Nodes; you should see something like this.
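If you prefer a command-line check over the web UIs, a quick sketch run on the master (hdfs dfsadmin -report lists the live DataNodes; yarn node -list shows the NodeManagers registered with the ResourceManager):
hdfs dfsadmin -report
yarn node -list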
Install an OpenSSH server on both of your systems using this guide. Generating a new SSH public/private key pair on your local computer is the first step towards authenticating with a remote server without a password. Add the public key to authorized_keys and add your hostname to the list of known hosts. You can find guides on how to do this by searching the internet; a rough sketch follows.
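A hedged sketch of the usual key setup (the remote user name "your-user" is a placeholder, and the hostnames are the ones defined in the next step):
ssh-keygen -t rsa
REM copy the contents of %USERPROFILE%\.ssh\id_rsa.pub into .ssh\authorized_keys for your-user on the other machine
ssh your-user@hadoopSlave
REM this should now log in without asking for a password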
2. Add your Hadoop master and slave IPs to your hosts file. Open "C:\Windows\System32\drivers\etc\hosts"
and add:
your-master-ip hadoopMaster
your-slave-ip hadoopSlave
You can use these names in your configuration files.
Much like on Linux systems, these are the steps you have to follow in order to run a Hadoop cluster on Windows:
3. First, you need to have Java installed on your system and JAVA_HOME must be added to your environment variables. You can download Java from the Oracle website and install it.
Download the Hadoop binary files from the Apache website and extract them.
Note that you shouldn't have spaces in your folder names, or you might encounter problems.
Next, you have to add the Java and Hadoop home and bin folders to your environment variables. Just open the Start menu, type "environment variable", and open the "Edit environment variables" window from Control Panel.
Add
HADOOP_HOME="root of your hadoop extracted folder\hadoop-2.9.2"
HADOOP_BIN="root of your hadoop extracted folder\hadoop-2.9.2\bin"
JAVA_HOME=<root of your JDK installation>
Edit your "Path" environment variable and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%/sbin to your PATH one by one.
You can validate your additions by opening cmd and typing:
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
CONFIGURING HADOOP:
10. Open "your hadoop root\hadoop-2.9.2\etc\hadoop\hadoop-env.cmd" and add the following lines to the bottom of the file:
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
11. Open "your-hadoop-root\hadoop-2.9.2\etc\hadoop\hdfs-site.xml" and add the content below:
<property>
<name>dfs.name.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoopMaster:50070</value>
<description>Your NameNode hostname for http access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http access.</description>
</property>
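For illustration, the two "your desired address" placeholders are local directories on each node; on a Windows box they might look something like this (the paths are assumptions, pick your own):
<property>
<name>dfs.name.dir</name>
<value>file:///C:/hadoop-2.9.2/data/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///C:/hadoop-2.9.2/data/datanode</value>
</property>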
12. Edit your core-site.xml and add:
<property>
<name>fs.default.name</name>
<value>hdfs://hadoopMaster:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>your-temp-directory</value>
<description>A base for other temporary directories.</description>
</property>
Open "root to hadoop\hadoop-2.9.2\etc\hadoop\mapred-site.xml" and add below content within tags. If you don’t see mapred-site.xml then open mapred-site.xml.template file and rename it to mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>hadoopMaster:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
14. Edit your yarn-site.xml and add:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Long running service which executes on Node Manager(s) and provides MapReduce Sort and Shuffle functionality.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are moved onto hdfs and are viewable via web ui after the application completed. The default location on hdfs is '/log' and can be changed via yarn.nodemanager.remote-app-log-dir property</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopMaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopMaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopMaster:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoopMaster:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoopMaster:8088</value>
</property>
15. In your slaves file, in "root-hadoop-directory\hadoop-2.9.2\etc\hadoop", add:
hadoopSlave
16. Do these steps on your slave nodes too.
17. Open cmd and cd to the sbin folder in your Hadoop directory.
18. Format your NameNode:
hadoop namenode -format
19. Run the following command:
start-dfs.cmd
then run:
start-yarn.cmd
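To confirm that everything came up, a quick sanity check (a sketch; the test directory name is just an example):
jps
REM NameNode, DataNode, ResourceManager and NodeManager should be listed on the master
hdfs dfsadmin -report
REM the slave's DataNode should show up as a live node
hadoop fs -mkdir /smoke-test
hadoop fs -ls /
REM a simple HDFS write and listing to verify the filesystem is usable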

Hadoop-Apache Ranger: StackOverflowError on namenode restart

I am getting this error after enabling the HDFS plugin in Apache Ranger.
When I run enable-hdfs-plugin.sh, Ranger adds the following configuration to hdfs-site.xml:
<property>
<name>dfs.permissions.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.inode.attributes.provider.class</name>
<value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
</property>
But if I remove the above properties and restart my NameNode, it starts with no error. Also, when I try to format the NameNode, it gives me the same error.
This is the install.properties of Ranger's hdfs-plugin.
Link ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-hdfs-plugin-impl to /var/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/lib/ranger-hdfs-plugin-impl
Link ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-hdfs-plugin-shim-1.0.0-SNAPSHOT.jar to /var/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/lib/ranger-hdfs-plugin-shim-1.0.0-SNAPSHOT.jar
Link ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-plugin-classloader-1.0.0-SNAPSHOT.jar to /var/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/lib/ranger-plugin-classloader-1.0.0-SNAPSHOT.jar
Follow these instructions, adjusted to your file paths. The problem is that the Ranger plugin classloader is not found on your Hadoop classpath.
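A sketch of the symlinks those three "Link ... to ..." lines describe, assuming the Ranger plugin was extracted under /opt (adjust both prefixes to your own layout):
cd /var/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/lib
ln -s /opt/ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-hdfs-plugin-impl ranger-hdfs-plugin-impl
ln -s /opt/ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-hdfs-plugin-shim-1.0.0-SNAPSHOT.jar ranger-hdfs-plugin-shim-1.0.0-SNAPSHOT.jar
ln -s /opt/ranger-1.0.0-SNAPSHOT-hdfs-plugin/lib/ranger-plugin-classloader-1.0.0-SNAPSHOT.jar ranger-plugin-classloader-1.0.0-SNAPSHOT.jar
Then restart the NameNode.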

Duplicate YARN conf settings

I am using Hadoop 2.6.
In yarn-site.xml I have the following defined:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
But if I look inside the yarn UI /conf URL I get the following 2 definitions:
<property>
<name>yarn.nodemanamger.vmem-check-enabled</name>
<value>false</value>
<source>java.io.BufferedInputStream#2893de87</source>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
<source>yarn-default.xml</source>
</property>
Which one of the two properties is actually being used?
Am I correct in saying that the <source>java.io.BufferedInputStream#2893de87</source> entry overrides <source>yarn-default.xml</source>?
The <source> element is not the property value.
The value was read from your configuration through some external source (an InputStream) and then placed into the output with <value>false</value>.
How that file gets read is up to the Configuration object in the Hadoop API.
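If you want to see exactly what the running ResourceManager loaded for the real property name, one hedged way is to check your own file and then query the /conf endpoint you already used (requires curl and xmllint; the hostname and config directory are placeholders):
grep -n "vmem-check-enabled" /etc/hadoop/conf/yarn-site.xml
curl -s "http://<resourcemanager-host>:8088/conf" | xmllint --xpath '//property[name="yarn.nodemanager.vmem-check-enabled"]' -
The first command shows what your own file contains (watch the spelling of the property name); the second prints the <value> and <source> the daemon is actually serving for that key.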

hadoop wordcount and upload file into hdfs

Hello everyone, I am very new to Hadoop and I have installed Hadoop in pseudo-distributed mode.
The configuration files are here:
core-site.xml
<configuration>
<property>
<name>fs.default.name </name>
<value> hdfs://localhost:9000 </value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop_usr/hadoopinfra/hdfs/namenode </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop_usr/hadoopinfra/hdfs/datanode </value>
</property>
</configuration>
I successfully started the DataNode and NameNode.
Now I want to put my file into HDFS in the following way:
What's going wrong, and why do I get an error message? Please help me resolve this problem.
If I put the file into HDFS the following way instead, the command works fine; here I append the full HDFS URL.
Please help me understand why I get an error the first way,
because when I run my wordcount.jar I also get an error message when I specify data.txt as the input file on which the operation should be performed.
Thanks in advance.
The reason the first put operation to data/data.txt is not working is most likely that you do not have a data folder in your HDFS yet.
You can just create it using hadoop fs -mkdir /data.
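A hedged sketch of the full sequence (the local path is an assumption; note that an HDFS path without a leading slash is relative to your HDFS home directory, e.g. /user/hadoop_usr):
hadoop fs -mkdir -p /data
hadoop fs -put /home/hadoop_usr/data.txt /data/data.txt
hadoop fs -ls /data
The last command should list data.txt once the put succeeds.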
