Error in streaming Twitter data to Hadoop using Flume

I am using Hadoop 1.2.1 on Ubuntu 14.04.
I am trying to stream data from Twitter to HDFS using Flume 1.6.0. I have downloaded flume-sources-1.0-SNAPSHOT.jar, placed it in the flume/lib folder, and set its path as FLUME_CLASSPATH in conf/flume-env.sh. This is my Flume agent conf file:
#setting properties of agent
Twitter-agent.sources=source1
Twitter-agent.channels=channel1
Twitter-agent.sinks=sink1
#configuring sources
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<access-token-secret>
Twitter-agent.sources.source1.keywords= morning, night, hadoop, bigdata
#configuring channels
Twitter-agent.channels.channel1.type=memory
Twitter-agent.channels.channel1.capacity=10000
Twitter-agent.channels.channel1.transactionCapacity=100
#configuring sinks
Twitter-agent.sinks.sink1.channel=channel1
Twitter-agent.sinks.sink1.type=hdfs
Twitter-agent.sinks.sink1.hdfs.path=flume/twitter/logs
Twitter-agent.sinks.sink1.rollSize=0
Twitter-agent.sinks.sink1.rollCount=1000
Twitter-agent.sinks.sink1.batchSize=100
Twitter-agent.sinks.sink1.fileType=DataStream
Twitter-agent.sinks.sink1.writeFormat=Text
When I run this agent, I get the following error:
15/06/22 14:14:49 INFO source.DefaultSourceFactory: Creating instance of source source1, type com.cloudera.flume.source.TwitterSource
15/06/22 14:14:49 ERROR node.PollingPropertiesFileConfigurationProvider: Unhandled error
java.lang.NoSuchMethodError: twitter4j.conf.Configuration.isStallWarningsEnabled()Z
at twitter4j.TwitterStreamImpl.<init>(TwitterStreamImpl.java:60)
at twitter4j.TwitterStreamFactory.<clinit>(TwitterStreamFactory.java:40)
at com.cloudera.flume.source.TwitterSource.<init>(TwitterSource.java:64)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at org.apache.flume.source.DefaultSourceFactory.create(DefaultSourceFactory.java:44)
at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:322)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:97)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My flume/lib folder already has twitter4j-core-3.0.3.jar.
How do I rectify this error?

I found a solution to this issue. Because flume-sources-1.0-SNAPSHOT.jar and twitter4j-stream-3.0.3.jar both contain the same FilterQuery class, a jar conflict arises. All twitter4j 3.x.x jars use this class, so it is better to download the twitter4j jars of version 2.2.6 (twitter4j-core, twitter4j-stream, twitter4j-media-support) and replace the 3.x.x jars with them under the flume/lib directory.
Run the agent again and the Twitter data will be streamed to HDFS.
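For example, a rough sketch of the jar swap in the shell, assuming Flume lives under /usr/local/flume and the 2.2.6 jars have already been downloaded to ~/Downloads (both paths are only illustrative):
cd /usr/local/flume/lib
# remove the conflicting 3.x jars
rm -f twitter4j-core-3.0.3.jar twitter4j-stream-3.0.3.jar twitter4j-media-support-3.0.3.jar
# drop in the 2.2.6 versions
cp ~/Downloads/twitter4j-core-2.2.6.jar .
cp ~/Downloads/twitter4j-stream-2.2.6.jar .
cp ~/Downloads/twitter4j-media-support-2.2.6.jar .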

Change
Twitter-agent.sources.source1.type=com.cloudera.flume.source.TwitterSource
to
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
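If you keep the agent and source names from the configuration in the question, only the type line needs to change. A sketch under that assumption (the built-in org.apache.flume.source.twitter.TwitterSource that ships with Flume 1.6.0 uses the same credential property names, but as far as I know it streams the 1% sample and does not support the keywords property, which is specific to the Cloudera source):
Twitter-agent.sources.source1.type=org.apache.flume.source.twitter.TwitterSource
Twitter-agent.sources.source1.channels=channel1
Twitter-agent.sources.source1.consumerKey=<consumer-key>
Twitter-agent.sources.source1.consumerSecret=<consumer-secret>
Twitter-agent.sources.source1.accessToken=<access-token>
Twitter-agent.sources.source1.accessTokenSecret=<access-token-secret>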

Related

How to stream data from kafka avro console to HDFS using kafka-connect-hdfs?

I am trying to run kafka-connect-hdfs without any success.
I have added the following line to my .bash_profile and ran 'source ~/.bash_profile':
export LOG_DIR=~/logs
The quickstart-hdfs.properties configuration file is
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
hdfs.url=xxx.xxx.xxx.xxx:xxxx # placeholder
flush.size=3
hadoop.conf.dir = /etc/hadoop/conf/
logs.dir = ~/logs
topics.dir = ~/topics
topics=test_hdfs
I am following the quickstart instructions outlined in
https://docs.confluent.io/current/connect/connect-hdfs/docs/hdfs_connector.html
The contents of the connector-avro-stanalone.properties file are:
bootstrap.servers=yyy.yyy.yyy.yyy:yyyy # This is the placeholder for the Kafka broker url with the appropriate port
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
I have quickstart-hdfs.properties and connector-avro-stanalone.properties in my home directory, and I run:
confluent load hdfs-sink -d quickstart-hdfs.properties
I am not sure how the information in the connector-avro-stanalone.properties file in my home directory is being picked up.
When I run 'confluent log connect', I get the following error:
[2018-04-26 17:36:00,217] INFO Couldn't start HdfsSinkConnector: (io.confluent.connect.hdfs.HdfsSinkTask:90)
org.apache.kafka.connect.errors.ConnectException: java.lang.reflect.InvocationTargetException
at io.confluent.connect.storage.StorageFactory.createStorage(StorageFactory.java:56)
at io.confluent.connect.hdfs.DataWriter.<init>(DataWriter.java:213)
at io.confluent.connect.hdfs.DataWriter.<init>(DataWriter.java:101)
at io.confluent.connect.hdfs.HdfsSinkTask.start(HdfsSinkTask.java:82)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:267)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:163)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at io.confluent.connect.storage.StorageFactory.createStorage(StorageFactory.java:51)
... 12 more
Caused by: java.io.IOException: No FileSystem for scheme: xxx.xxx.xxx.xxx -> This is the hdfs_url in quickstart-hdfs.properties file without the port
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2691)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:420)
at io.confluent.connect.hdfs.storage.HdfsStorage.<init>(HdfsStorage.java:56)
... 17 more
[2018-04-26 17:36:00,217] INFO Shutting down HdfsSinkConnector. (io.confluent.connect.hdfs.HdfsSinkTask:91)
Any help to fix this issue will be greatly appreciated.
Caused by: java.io.IOException: No FileSystem for scheme: xxx.xxx.xxx.xxx
You need hdfs.url=hdfs://xxx.yyy.zzz.abc along with the namenode port
Also, you'll want to remove the spaces around the equals signs in the properties files
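Putting both fixes together, a corrected quickstart-hdfs.properties would look roughly like this (the hdfs:// scheme is required; the 8020 port is only the common namenode default and is an assumption, as are the HDFS paths for logs.dir and topics.dir):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# full HDFS URL with scheme and namenode port, no spaces around '='
hdfs.url=hdfs://xxx.xxx.xxx.xxx:8020
flush.size=3
hadoop.conf.dir=/etc/hadoop/conf/
# these should be HDFS directories, not home-relative local paths
logs.dir=/logs
topics.dir=/topics
topics=test_hdfs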

How can Apache Spark history-server refer to Amazon S3?

[version]
Apache Spark 2.2.0
Hadoop 2.7
I want to set up the Apache Spark history server.
The Spark event logs are located in Amazon S3.
I can save the log files to S3, but I cannot read them from the history server.
Apache Spark is installed at /usr/local/spark,
so $SPARK_HOME is /usr/local/spark.
$ cd /usr/local/spark/sbin
$ sh start-history-server.sh
I got the following error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
....
My spark-defaults.conf is below:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.history.provider org.apache.hadoop.fs.s3a.S3AFileSystem
spark.history.fs.logDirectory s3a://xxxxxxxxxxxxx
spark.eventLog.enabled true
spark.eventLog.dir s3a://xxxxxxxxxxxxxxx
I installed these 2 jar files in /usr/local/spark/jars/:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.3.jar
but the error is the same.
What is wrong?
Please add the following to the spark-defaults.conf file and retry:
spark.driver.extraClassPath /usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*
spark.executor.extraClassPath /usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*
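Adapted to the jar locations mentioned in the question (a sketch, assuming the two jars really are in /usr/local/spark/jars/), the equivalent entries would be:
spark.driver.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.3.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath /usr/local/spark/jars/hadoop-aws-2.7.3.jar:/usr/local/spark/jars/aws-java-sdk-1.7.4.jar
Note that spark-defaults.conf separates the key and the value with whitespace, as in the rest of the file.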

HORTONWORKS - Hbase/Phoenix - WALEditCodec - missing

I am receiving the following error while trying to run Phoenix on top of HBase:
EXCEPTION #1:
2017-11-07 12:40:12,620 WARN [RS_LOG_REPLAY_OPS-XXX:16020-0]
regionserver.SplitLogWorker: log splitting of
WALs/XXX.XXX.XXX.XXX,16020,1507179047656-
splitting/XXX.XXX.XXX.XXX%2C16020%2C1507179047656.default.1507179049782 failed, returning error
java.io.IOException: Cannot get log reader
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:355)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:267)
at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:839)
at org.apache.hadoop.hbase.wal.WALSplitter.getReader(WALSplitter.java:763)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.create(WALCellCodec.java:103)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.getCodec(ProtobufLogReader.java:297)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:307)
at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:82)
at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:164)
at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
... 11 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
... 17 more
APPLIED PATCHES #1:
I have applied the following settings through the Ambari web UI for the advanced HBase configs, as specified by the Hortonworks document:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_command-line-upgrade/content/configure-phoenix-25.html
EXCEPTION #2
FATAL [RS_LOG_REPLAY_OPS-XXX:16020-1] conf.Configuration: error parsing conf core-site.xml
java.io.FileNotFoundException: /etc/hadoop/2.6.1.0-129/0/core-site.xml (Too many open files)
APPLIED PATCHES #2
I checked each core-site.xml file on every server that hosts an HBase region server and made sure it ended with </configuration>, as well as the core-site.xml in the directory given in the error, /etc/hadoop/2.6.1.0-129/0/core-site.xml.
I haven't been able to find any other information regarding this issue.
I went into HDFS and deleted all WAL split logs using the following command:
hdfs dfs -rm -r /apps/hbase/data/WALs/*splitting*
This resolved exception #1. Keep in mind that, from what I've read, this will incur data loss.
For exception #2, I went back and checked the open file limits for each server (ulimit -n) and updated them where applicable with respect to the Hortonworks doc:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_security/content/kerb-config-limits.html
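As an illustration, the check and the fix look roughly like this (a sketch; the 32768 value and the user names are assumptions, use the limits the Hortonworks doc recommends for your cluster):
# check the current open-file limit for the current user
ulimit -n
# raise it persistently on every node running an HBase region server (as root)
echo "hbase  -  nofile  32768" >> /etc/security/limits.conf
echo "hdfs   -  nofile  32768" >> /etc/security/limits.conf
# log back in (or restart the services) for the new limits to take effect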

HDFS IO error org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

I am using Flume 1.6.0 in one virtual machine and Hadoop 2.7.1 in another virtual machine.
When I send Avro events to Flume 1.6.0 and it tries to write to the Hadoop 2.7.1 HDFS system, the following exception occurs:
(SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:455)] HDFS IO error
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.checkVersion(RPC.java:422)
at org.apache.hadoop.hdfs.DFSClient.createNamenode(DFSClient.java:183)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:281)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:245)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:100)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:243)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:235)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:679)
at org.apache.flume.auth.SimpleAuthenticator.execute(SimpleAuthenticator.java:50)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:676)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Thread
I tried adding these .jars to the flume lib folder:
hadoop-common-2.7.1.jar
avro-1.7.7.jar instead of avro-1.7.4.jar
avro-ipc-1.7.7.jar instead of avro-ipc-1.7.4.jar
guava-18.0.jar instead of guava-11.0.2.jar
But the problem is still unsolved.
The Flume NG HDFS sink depends on the following .jar files:
hadoop-auth-2.4.0.jar
hadoop-common-2.4.0.jar
hadoop-hdfs-2.4.0.jar
commons-configuration-1.10.jar
which are not included in the Flume lib folder.
By adding these .jars, the exception has been successfully overcome.
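For a Hadoop 2.7.1 cluster it is usually safer to copy the client jars that match the cluster version rather than the 2.4.0 ones. A sketch, assuming the Apache binary distribution of Hadoop is installed under /usr/local/hadoop and Flume under /usr/local/flume (adjust paths to your installation):
cp /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.1.jar /usr/local/flume/lib/
cp /usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.1.jar /usr/local/flume/lib/
cp /usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.7.1.jar /usr/local/flume/lib/
cp /usr/local/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar /usr/local/flume/lib/
# if an old hadoop-core 1.x jar is still in flume/lib it will keep reporting itself
# as IPC client version 4, so remove it as well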

After upgrading Hadoop clusters to Cloudera 4 b1, an Invalid "mapreduce.jobtracker.address" configuration error appears

I recently upgraded to Cloudera 4 b1. Before the upgrade the MapReduce jobs were running fine, but now when I execute any MapReduce program, the following error comes up:
Command run:
hadoop jar /usr/lib/hadoop/hadoop-mapreduce-examples-0.23.0-cdh4b1.jar grep *.xml /user/out/ 'dfs'
12/04/10 19:23:15 INFO mapreduce.Cluster: Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid "mapreduce.jobtracker.address" configuration value for LocalJobRunner : "dev-xxx:yyyy"
12/04/10 19:23:15 ERROR security.UserGroupInformation: PriviledgedActionException as:anchauhan (auth:SIMPLE) cause:java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:123)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:85)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:78)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1185)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1181)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1167)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1180)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1209)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1233)
at com.nextag.mapred.MerchantImport.doTask(MerchantImport.java:221)
at com.nextag.mapred.Main.main(Main.java:9)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:200)
For my custom mapred jobs, I changed the library files to the new ones while compiling the jar files, but with no success.
We had the same exception, and the reason was that we had the wrong dependencies in the pom files. We were pointing to hadoop-client version 2.0.0-cdh4.0.0, which is for YARN, but we are using MRv1, so we should have been pointing to 2.0.0-mr1-cdh4.0.0. You should not have any YARN jar files in your dependencies.
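For reference, the MRv1 dependency would look roughly like this in the pom (the version string is taken from this answer; the Cloudera Maven repository is assumed to be configured):
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-mr1-cdh4.0.0</version>
</dependency>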
