Flume agent with syslog source and HBase sink - hadoop

I am trying to use Flume with a syslog source and an HBase sink.
When I run the Flume agent I get this error: Failed to start agent because dependencies were not found in classpath. Error follows. java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration, which means (from that question) that some HBase libraries are missing. To solve it, I need to set the path to these libraries in the flume-env.sh file, which is what I did, but when I ran Flume again the error persisted. Here is the command I used to run the Flume agent:
bin/flume-ng agent --conf ./conf --conf-file ./conf/flume.properties --name agent -Dflume.root.logger=INFO,console
So my question is: if the solution I used is correct (adding the libraries to Flume), why do I still get the same error? And if it is not correct, how do I solve the problem?
EDIT
From the docs I read: "The flume-ng executable looks for and sources a file named flume-env.sh in the conf directory specified by the --conf/-c command-line option."
I haven't tested it yet, but I think that is the solution (I just need confirmation).

I would recommend downloading the full HBase tarball and setting environment variables such as HBASE_HOME to the right locations. Flume can then automatically pick up the libraries from the HBase installation.
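A minimal sketch of what that might look like, assuming HBase is unpacked under /opt/hbase and Hadoop under /opt/hadoop (both paths are just placeholders); the exports can go in your shell profile or in conf/flume-env.sh:
export HBASE_HOME=/opt/hbase
export HADOOP_HOME=/opt/hadoop
# with these set, the flume-ng launcher can locate the HBase and Hadoop jars as described above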

Related

Add path with aux jars for Hive client

I have HDP 2.6.1.0-129.
I have an external jar, example.jar, for serialized Flume data files.
I added a new parameter in the Custom hive-site section:
name = hive.aux.jars.path
value = hdfs:///user/libs/
I saved the new configuration, restarted the Hadoop components, and later restarted the whole Hadoop cluster.
Then, in the Hive client, I tried to run a select:
select * from example_serealized_table
and Hive returned this error:
FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException java.lang.ClassNotFoundException: Class com.my.bigtable.example.model.gen.TSerializedRecord not found)
How do I solve this problem?
P.S. I also tried adding the jar in the current session:
add jar hdfs:///user/libs/example-spark-SerializedRecord.jar;
and I tried putting the *.jar in a local folder.
The problem stays the same.
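A quick way to double-check that the jar is actually reachable and really contains the class from the error (using the paths and class name from above) would be something like:
hadoop fs -ls hdfs:///user/libs/
hadoop fs -get hdfs:///user/libs/example-spark-SerializedRecord.jar /tmp/
jar tf /tmp/example-spark-SerializedRecord.jar | grep TSerializedRecord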
I did not mention that I was not the one who wrote the library; a colleague of mine wrote it.
It turned out that it redefines variables that affect the field's logging level.
After excluding the overridden variables from the library, the problem stopped reproducing.

Why is the MR2 map task running under the 'yarn' user and not under the user I ran the Hadoop job as?

I'm trying to run a MapReduce job on MR2, Hadoop version 2.6.0-cdh5.8.0. The job uses a relative path to a directory containing a lot of files to be compressed based on some criteria (not really relevant to this question). I'm running my job as follows:
sudo -u my_user hadoop jar my_jar.jar com.example.Main
There is a folder with the files on HDFS under the path /user/my_user/. But when I run my job I get the following exception:
java.io.FileNotFoundException: File /user/yarn/<path_from_job> does not exist.
I'm migrating this job from MR1, where it works correctly. My guess is that this is happening because of YARN, since each container is started under the yarn user. In my job configuration I've tried to set mapreduce.job.user.name="my_user", but this didn't help.
I've found ${user.home} being used in my job configuration, but I'm not aware of where it is set or whether it is possible to change it.
The only solution I've found so far is to provide an absolute path to the folder. Is there any other way around this? I feel like this is not the correct approach.
Thank you
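For reference, the behaviour described above can be reproduced outside the job, since a relative HDFS path is resolved against the home directory of the effective user; the directory name input_dir below is just a placeholder:
sudo -u my_user hadoop fs -ls input_dir   # resolves to /user/my_user/input_dir
sudo -u yarn hadoop fs -ls input_dir      # resolves to /user/yarn/input_dir, which is what the container sees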

Which configuration file is the client using to connect to the Hadoop cluster

When the edge node has multiple Hadoop distributions, there can be multiple configuration files scattered across directories.
In those cases, how do you know which configuration file the client is referring to when it connects to the cluster (say, for YARN)? One option is to look at the .bashrc file to find out whether the HADOOP_HOME variable is set.
Are there any other options to find this out? (Obviously, using the find command to search for a file will not serve the purpose.)
Hadoop provides a classpath command. Here is the description of the command:
classpath prints the class path needed to get the
Hadoop jar and the required libraries
You can execute this command as:
hadoop classpath
or
yarn classpath
Both of these commands should give you almost identical results.
For example, I got the following output for hadoop classpath:
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\bin>hadoop classpath
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\etc\hadoop;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\common\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\hdfs\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\yarn\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\lib\*;
e:\hdp\hadoop-2.7.1.2.3.0.0-2557\share\hadoop\mapreduce\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\conf\;
e:\hdp\tez-0.7.0.2.3.0.0-2557\*;
e:\hdp\tez-0.7.0.2.3.0.0-2557\lib\*;
All these paths contain HADOOP_HOME as the parent path. In my case, it is "e:\hdp\hadoop-2.7.1.2.3.0.0-2557". From this path, you can easily figure out which distribution of Hadoop your client is referring to.
In my case, the client is using the Hadoop configuration and jars from the "e:\hdp\hadoop-2.7.1.2.3.0.0-2557" directory.
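The output above is from a Windows HDP installation, where entries are separated by ';'. On Linux the separator is ':', so something like the following would pull out just the configuration directory:
hadoop classpath | tr ':' '\n' | grep 'etc/hadoop'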
You can run the env command to get HADOOP_HOME for the session. Even if you have overridden HADOOP_HOME, env will give the current value for the session.
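For example:
env | grep -i hadoop      # shows HADOOP_HOME (and HADOOP_CONF_DIR, if it is set) for the current session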

Flume to HBase dependency failure

I have installed HBase and Flume using Cloudera. I have a Flume agent running on a Linux server, which is also where the current HBase master is running.
I'm trying to write from a spooldir to HBase but I get the following error:
...
ERROR org.apache.flume.node.PollingPropertiesFileConfigurationProvider: Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
at org.apache.flume.sink.hbase.HBaseSink.<init>(HbaseSink.java:116)
...
Flume configuration:
...
#Sinks
tier1.sinks.hbase-sink.channel = memory-channel
tier1.sinks.hbase-sink.type = org.apache.flume.sink.hbase.HBaseSink
tier1.sinks.hbase-sink.table = FlumeTable
tier1.sinks.hbase-sink.columnFamily = FlumeColumn
I tried to modify flume-env.sh and set HBASE_HOME and HADOOP_HOME, but it changed nothing.
I have succeeded in writing to HDFS, but HBase is causing problems.
I was able to resolve this problem by adding the path of the HBase libraries to FLUME_CLASSPATH in conf/flume-env.sh; in my case the file looked like:
FLUME_CLASSPATH="/home/USERNAME/hbase-1.0.1.1/lib/*"
Hope it helps.
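For completeness, the agent then has to be started with --conf pointing at the directory that holds this flume-env.sh, since that is the file flume-ng sources; using the agent name tier1 from the configuration above, the command would look something like:
bin/flume-ng agent --conf ./conf --conf-file ./conf/flume.properties --name tier1 -Dflume.root.logger=INFO,console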

Hadoop on Mesos fails with "Could not find or load main class org.apache.hadoop.mapred.MesosExecutor"

I have a Mesos cluster setup -- I have verified that the master can see the slaves -- but when I attempt to run a Hadoop job, all tasks wind up with a status of LOST. The same error is present in all the slave stderr logs:
Error: Could not find or load main class org.apache.hadoop.mapred.MesosExecutor
and that is the only line in the stderr logs.
Following the instructions on http://mesosphere.io/learn/run-hadoop-on-mesos/, I have put a modified Hadoop distribution on HDFS which each slave can access.
In the lib directory of the Hadoop distribution, I have added hadoop-mesos-0.0.4.jar and mesos-0.14.2.jar.
I have verified that each slave does in fact download this Hadoop distribution, and that hadoop-mesos-0.0.4.jar contains the class org.apache.hadoop.mapred.MesosExecutor, so I cannot figure out why the class cannot be found.
I am using Hadoop from CDH4.4.0 and mesos-0.15.0-rc4.
Does anyone have any suggestions as to what might be the problem? I know I would normally start by suspecting a CLASSPATH problem, but in this case the mesos-slave is downloading, unpacking, and attempting to run a Hadoop TaskTracker, so I would imagine any CLASSPATH would be set up by the mesos-slave.
In the stdout of the slave logs, the environment is printed. There is a MESOS_HADOOP_HOME which is empty. Should this be set to something? If it is supposed to be set to the downloaded Hadoop distribution, I cannot set it in advance because the Hadoop distribution is downloaded to a new location every time.
In case it is related (some permissions issue, maybe): when attempting to browse the slave logs via the master UI, I get the error "Error browsing path: ...".
The user running mesos-slave can browse to the correct directory when I do so manually.
I found the problem. bin/hadoop of the downloaded Hadoop distribution attempts to find its own location by running which $0. However, that will find an existing Hadoop installation if one is present (e.g. /usr/lib/hadoop), and will load the jars under that installation's lib directory instead of the downloaded one's lib directory.
I had to modify bin/hadoop of the downloaded distribution to find its own location with dirname $0 instead of which $0.
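Roughly, the change was of this shape (a sketch rather than the literal diff from that distribution):
# before: "which $0" resolves to whatever hadoop is first on $PATH (e.g. /usr/lib/hadoop/bin/hadoop)
this=`which $0`
bin=`dirname "$this"`
# after: resolve relative to where this copy of the script actually lives
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`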
