The "Spring XD" xd-shell can't run the hadoop fs ls command, the command returns a java exception - hadoop

I compiled the latest spring-xd because I needed CDH support. I am able to start the server; however, when I connect to the server via the xd-shell and try to change the configuration, the commands fail. Also, this is a kerberized cluster, and I am not sure how XD will/can handle that.
1st scenario:
admin config server --uri http://testdomain:10111
hadoop config fs --namenode hdfs://nameservice1:8020
hadoop config props set hadoop.security.group.mapping=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
hadoop config props load hadoop.security.group.mapping
hadoop fs ls
Error message:
xd:>hadoop fs ls
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:128)
at org.apache.hadoop.security.Groups.<init>(Groups.java:55)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:182)
at org.apache.hadoop.security.UserGroupInformation.initUGI(UserGroupInformation.java:252)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:223)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:277)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:668)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:573)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:2420)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2288)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:270)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:254)
at org.springframework.xd.shell.hadoop.FsShellCommands.run(FsShellCommands.java:412)
at org.springframework.xd.shell.hadoop.FsShellCommands.runCommand(FsShellCommands.java:407)
at org.springframework.xd.shell.hadoop.FsShellCommands.ls(FsShellCommands.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:191)
at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:64)
at org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:48)
at org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:127)
at org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:483)
at org.springframework.shell.core.JLineShell.run(JLineShell.java:157)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:126)
... 35 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.JniBasedUnixGroupsMapping
at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.<init>(JniBasedUnixGroupsMappingWithFallback.java:38)
... 40 more
2nd scenario:
Alternatively, I remove some Java opts, run steps 1 and 2 from the previous scenario, and then:
hadoop config props set hadoop.security.authorization=true
hadoop config props set hadoop.security.authentication=kerberos
Error below:
16:50:29,682 WARN Spring Shell util.NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: Authorization (hadoop.security.authorization) is enabled but authentication (hadoop.security.authentication) is configured as simple. Please configure another method like kerberos or digest.
Thanks for your assistance - can't wait to get this working!

Thanks for raising this - we haven't tested with authorization/authentication in the shell for a while - though it is tested as part of the project https://github.com/vmware-serengeti/serengeti-ws
Are you able to perform operations using the standard Hadoop file system shell? For example:
hdfs dfs -ls /user/hadoop/file1
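If not, a quick way to confirm that the Kerberos side works outside of XD is to obtain a ticket and run a plain HDFS listing first. A minimal sketch, assuming the principal and keytab path below are placeholders for your own:
kinit -kt /etc/security/keytabs/hdfs-user.keytab hdfs-user@EXAMPLE.COM   # placeholder principal/keytab
klist                                                                    # confirm a valid TGT was obtained
hdfs dfs -ls /user/hadoop                                                # should list without the group-mapping error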

There is currently no specific support in XD for running against a secured Hadoop cluster.
Feel free to open a JIRA ticket at https://jira.springsource.org/browse/XD -- this is something we know we will have to address soon.

Related

Apache Phoenix - How can I start the query server and thin client on a Kerberos cluster

I have recently spent several days trying to run the Phoenix thin client (queryserver.py and sqlline-thin.py) and the thick client via ZooKeeper against a secure cluster, but I am not able to start or connect to the Phoenix service on the secure cluster.
I faced the issues below with both the Phoenix thin and thick clients:
17/09/27 08:41:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: java.lang.RuntimeException: java.lang.NullPointerException (state=,code=0)
java.sql.SQLException: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2465)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2382)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2382)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:255)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:149)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
at sqlline.DatabaseConnection.connect(DatabaseConnection.java:157)
at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:203)
at sqlline.Commands.connect(Commands.java:1064)
at sqlline.Commands.connect(Commands.java:996)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHandler.java:38)
at sqlline.SqlLine.dispatch(SqlLine.java:809)
at sqlline.SqlLine.initArgs(SqlLine.java:588)
at sqlline.SqlLine.begin(SqlLine.java:661)
at sqlline.SqlLine.start(SqlLine.java:398)
at sqlline.SqlLine.main(SqlLine.java:291)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:218)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:326)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:301)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:166)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:161)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:797)
at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:366)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:406)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2410)
... 20 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:489)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:558)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1211)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1178)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
... 29 more
sqlline version 1.2.0
0: jdbc:phoenix:Namenode1>
I followed the steps below:
1. Removed the hbase-site.xml file on the Phoenix side and set the HBase package environment.
2. Provided the Kerberos principal and keytab for the Phoenix Query Server in the $HBASE_CONF_DIR/hbase-site.xml file (a sketch of the corresponding entries is shown after this list):
   phoenix.queryserver.kerberos.principal = HTTP/webuser#Domain.COM
   phoenix.queryserver.kerberos.keytab = C:\KeyTab\webuser.keytab
3. Executed the thin server and client:
   a. Start Query Server - /bin>python queryserver.py - which started properly.
   b. Start thin client - /bin>python sqlline.py http://namenode1:8765;authentication=SPENGO;principal=HTTP/webuser#Domain.COM;keytab=C:\\KeyTab\\webuser.keytab
4. Executed the thick client with the command below:
   /bin> python sqlline.py Namenode1:2181:/hbase-secure:HTTP/webuser#Domain.COM:C:\\KeyTab\\webuser.keytab
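For reference, a minimal sketch of how those two Query Server properties would normally look inside $HBASE_CONF_DIR/hbase-site.xml (the values are taken from the question; note that a Kerberos principal is usually written with '@' before the realm rather than '#'):
<property>
  <name>phoenix.queryserver.kerberos.principal</name>
  <value>HTTP/webuser@Domain.COM</value>
</property>
<property>
  <name>phoenix.queryserver.kerberos.keytab</name>
  <value>C:\KeyTab\webuser.keytab</value>
</property>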
Let me know if any other configuration is required for Apache Phoenix with a secure cluster.

Google Cloud connector for Hadoop doesn't work with Pig

I'm using Hadoop with HDFS 2.7.1.2.4 and Pig 0.15.0.2.4 (Hortonworks HDP 2.4) and trying to use Google Cloud Storage Connector for Spark and Hadoop (bigdata-interop on GitHub). It works correctly when I try, say,
hadoop fs -ls gs://bucket-name
But when I try the following in Pig (in mapreduce mode):
data = LOAD 'gs://softline/o365.avro' USING AvroStorage();
data = STORE data INTO 'gs://softline/o366.avro' USING AvroStorage();
Pig fails with the following errors:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
at java.lang.Thread.run(Thread.java:745)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.IllegalArgumentException: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkPath(GoogleHadoopFileSystemBase.java:741)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:90)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:466)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:701)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:163)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.setWorkingDirectory(GoogleHadoopFileSystemBase.java:1094)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:235)
... 18 more
If needed I could post the logs of the GCS connector. Has somebody used Pig with this connector? Any help would be appreciated.
TL;DR: explicitly set mapreduce.job.working.dir=/user/root/ when starting the Pig job.
If a working directory has not been explicitly set during job submission then Hadoop will set the working directory to be the working directory of the default filesystem. When using HDFS as your default FS the working directory will generally be something like 'hdfs://namenode:port/user/<your username>'.
When PigInputFormat#getSplits is called, it fetches the FileSystem associated with the path of the input that it is operating on. In this case the filesystem is an instance of GoogleHadoopFileSystem. Pig then inspects the path of its input and if the path is non-local calls FileSystem#setWorkingDirectory(job.getWorkingDirectory()). The problem here is that the job's working directory is 'hdfs://namenode:port/user/<your username>' which GoogleHadoopFileSystem will reject as a path to set as its own working directory (as it only supports 'gs://' paths).
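A minimal way to apply that when launching the job, assuming Pig forwards -D properties into the job configuration as usual and that the script name below is a placeholder:
pig -Dmapreduce.job.working.dir=/user/root/ -x mapreduce your_script.pig
# alternatively, the same property can be set inside the script with Pig's "set" command:
#   set mapreduce.job.working.dir '/user/root/';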

Why do the Spark examples fail to spark-submit on EC2 with spark-ec2 scripts?

I downloaded spark-1.5.2 and set up a cluster on EC2 using the spark-ec2 doc here.
After that I went to examples/, ran mvn package, and packaged the examples into a jar.
In the end I ran the submit with:
bin/spark-submit --class org.apache.spark.examples.JavaTC --master spark://url_here.eu-west-1.compute.amazonaws.com:7077 --deploy-mode cluster /home/aki/Projects/spark-1.5.2/examples/target/spark-examples_2.10-1.5.2.jar
Instead of it running, I get the error:
WARN RestSubmissionClient: Unable to connect to server spark://url_here.eu-west-1.compute.amazonaws.com:7077.
Warning: Master endpoint spark://url_here.eu-west-1.compute.amazonaws.com:7077 was not a REST server. Falling back to legacy submission gateway instead.
15/12/22 17:36:07 WARN Utils: Your hostname, aki-linux resolves to a loopback address: 127.0.1.1; using 192.168.10.63 instead (on interface wlp4s0)
15/12/22 17:36:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/22 17:36:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
at org.apache.spark.deploy.Client$$anonfun$7.apply(Client.scala:233)
at org.apache.spark.deploy.Client$$anonfun$7.apply(Client.scala:233)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.Client$.main(Client.scala:233)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
... 21 more
Are you sure the URL to master contains "url-here"?
spark://url_here.eu-west-1.compute.amazonaws.com:7077
Or maybe you are trying to obfuscate it for this post.
If you can connect to the Spark UI at
http://url_here.eu-west-1.compute.amazonaws.com:4040 or, depending on your Spark version, http://url_here.eu-west-1.compute.amazonaws.com:8080, make sure you are using the URL shown on the Spark UI for your spark://...:7077 command-line argument.
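A quick way to recover the exact master URL, assuming the cluster was launched with the spark-ec2 script, AWS credentials are in the environment, and "my-spark-cluster" is a placeholder for your cluster name:
./ec2/spark-ec2 get-master my-spark-cluster    # prints the master's public DNS name
# then open http://<that-dns-name>:8080 - the standalone master UI shows the exact spark://host:7077 URL to pass to spark-submit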

Job via Oozie HDP 2.1 not creating job.splitmetainfo

When trying to execute a Sqoop job which has my Hadoop program passed as a jar file in the -jarFiles parameter, the execution blows up with the error below. No resolution seems to be available. Other jobs with the same Hadoop user are getting executed successfully.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.FileNotFoundException: File does not exist: hdfs://sandbox.hortonworks.com:8020/user/root/.staging/job_1423050964699_0003/job.splitmetainfo
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1541)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1396)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1363)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:976)
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:135)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1241)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1041)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1452)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1448)
at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1381)
So here is how I solved it. We are using CDH5 to run Camus to pull data from Kafka. We run CamusJob, which is responsible for getting data from Kafka, using the command line:
hadoop jar...
The problem is that the new hosts didn't get the so-called "YARN gateway". Cloudera calls the pack of configs related to a service, copied to /etc/hadoop/conf, a "gateway". So I just clicked "Deploy Client Configuration" in the CM UI. The YARN client conf was copied to each YARN NodeManager node and that solved the problem.

Error writing event data into HDFS through flume

I am using the CDH3 update 4 tarball for development purposes. I have Hadoop up and running. Now I also downloaded the equivalent Flume tarball from Cloudera, viz. 1.1.0, and tried writing the tail of a log file into HDFS using the hdfs sink. When I run the Flume agent, it starts okay but ends up in an error when it attempts to write the new event data into HDFS. I couldn't find a better group to post this question to than Stack Overflow.
Here is the Flume configuration I am using:
agent.sources=exec-source
agent.sinks=hdfs-sink
agent.channels=ch1
agent.sources.exec-source.type=exec
agent.sources.exec-source.command=tail -F /locationoffile
agent.sinks.hdfs-sink.type=hdfs
agent.sinks.hdfs-sink.hdfs.path=hdfs://localhost:8020/flume
agent.sinks.hdfs-sink.hdfs.filePrefix=apacheaccess
agent.channels.ch1.type=memory
agent.channels.ch1.capacity=1000
agent.sources.exec-source.channels=ch1
agent.sinks.hdfs-sink.channel=ch1
Also, this is a small snippet of the error that gets displayed in the console when it receives new event data and tries writing it into HDFS:
13/03/16 17:59:21 INFO hdfs.BucketWriter: Creating hdfs://localhost:8020/user/hdfs-user/flume/apacheaccess.1363436060424.tmp
13/03/16 17:59:22 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Failed on local exception: java.io.IOException: Broken pipe; Host Details : local host is: "sumit-HP-Pavilion-dv3-Notebook-PC/127.0.0.1"; destination host is: "localhost":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
at org.apache.hadoop.ipc.Client.call(Client.java:1164)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
at $Proxy9.create(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at $Proxy9.create(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:192)
at org.apache.hadoop.hdfs.DFSOutputStream.<init>(DFSOutputStream.java:1298)
at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1317)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1215)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1173)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:272)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:261)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:78)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:805)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1060)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:270)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:369)
at org.apache.flume.sink.hdfs.HDFSSequenceFile.open(HDFSSequenceFile.java:65)
at org.apache.flume.sink.hdfs.HDFSSequenceFile.open(HDFSSequenceFile.java:49)
at org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:190)
at org.apache.flume.sink.hdfs.BucketWriter.access$000(BucketWriter.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:157)
at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:154)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:127)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:154)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:316)
at org.apache.flume.sink.hdfs.HDFSEventSink$1.call(HDFSEventSink.java:718)
at org.apache.flume.sink.hdfs.HDFSEventSink$1.call(HDFSEventSink.java:715)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
at sun.nio.ch.IOUtil.write(IOUtil.java:71)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:62)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:143)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:861)
at org.apache.hadoop.ipc.Client.call(Client.java:1141)
... 37 more
13/03/16 17:59:27 INFO hdfs.BucketWriter: Creating hdfs://localhost:8020/user/hdfs-user/flume/apacheaccess.1363436060425.tmp
13/03/16 17:59:27 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Failed on local exception: java.io.IOException: Broken pipe; Host Details : local host is: "sumit-HP-Pavilion-dv3-Notebook-PC/127.0.0.1"; destination host is: "localhost":8020;
As people on the Cloudera mailing list suggest, there are two probable reasons for this error:
The HDFS safemode is turned on. Try running hadoop dfsadmin -safemode leave and see if the error goes away.
The Flume and Hadoop versions are mismatched. To check this, replace the hadoop-core.jar in the flume/lib directory with the one found in Hadoop's installation folder.
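A minimal sketch of how to verify both points, assuming default tarball locations:
hadoop dfsadmin -safemode get     # should report "Safe mode is OFF"
hadoop version                    # note the version the cluster is running
ls flume/lib/hadoop-core*.jar     # compare this jar's version against the one in your Hadoop installation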
