Configuring Flume to Collect Data into HDFS from Twitter - hadoop

Iam getting this log info continously for a whole day.
2016-10-12 21:32:05,696 (conf-file-poller-0) [DEBUG -
org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)]
Checking file:conf/flume.conf for changes when executing the command
FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/flume.conf
-Dflume.root.logger=DEBUG,console -n TwitterAgent
Iam getting this error now after modifying the conf file
2016-10-12 22:09:19,592 (lifecycleSupervisor-1-0) [DEBUG -
com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:124)]
Setting up Twitter sample stream using consumer key and
access token 2016-10-12 22:09:19,592
(lifecycleSupervisor-1-0) [ERROR -
org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)]
Unable to start EventDrivenSourceRunner: {
source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE}
} - Exception follows. java.lang.IllegalStateException: consumer
key/secret pair already set.

As far as I can understand from what you have provide here, I think you need to add a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource into your conf file to define your Twitter sources, also make sure you are using your credentials for accessing twitter API.

Related

logstash not runs config

I'm using filebeat on client side > logstash on serverside > elasticsearch on server side
filebeat on clientside works properly by sending file, but the configuration i've made on logstash returning
Fail
[WARN ] 2019-12-18 14:53:30.987 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[FATAL] 2019-12-18 14:53:31.341 [LogStash::Runner] runner - Logstash could not be started because there is already another instance using the configured data directory. If you wish to run multiple instances, you must change the "path.data" setting.
[ERROR] 2019-12-18 14:53:31.364 [LogStash::Runner] Logstash - java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit
Here is my configfile
input {
beats {
port =>5044
}
}
filter {
grok {
match => { "message" =>"%{TIMESTAMP_ISO8601:timestamp}] %{WORD:test}\[%{NUMBER:nom}]\[%{DATA:tes}\] %{DATA:module_name}\: %{WORD:method}%{GREEDYDATA:log_message}" }
}
}
output {
elasticsearch
{
hosts => "127.0.0.1:9200"
index=>"test_log_pbx"
}
}
code to run my logstash config
/usr/share/logstash/bin/logstash -f logstash.conf
when i run configtest it returns
Thread.exclusive is deprecated, use Thread::Mutex
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2019-12-18 14:59:53.300 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2019-12-18 14:59:56.566 [LogStash::Runner] Reflections - Reflections took 139 ms to scan 1 urls, producing 20 keys and 40 values
Configuration OK
[INFO ] 2019-12-18 14:59:57.923 [LogStash::Runner] runner - Using config.test_and_exit mode. Config Validation Result: OK. Exiting Logstash
please help me i dont know whats wrong
A logstash instance already running, so you can not run another instance.If you made your logstash as service, you should stop the service. If you want to run multiple instances, you should modify pipelines.yml
If you want to learn more about pipelines.yml, I put link the below.
https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

kafka.common.KafkaException: Failed to parse the broker info from zookeeper from EC2 to elastic search

I have aws MSK set up and i am trying to sink records from MSK to elastic search.
I am able to push data into MSK into json format .
I want to sink to elastic search .
I am able to do all set up correctly .
This is what i have done on EC2 instance
wget /usr/local http://packages.confluent.io/archive/3.1/confluent-oss-3.1.2-2.11.tar.gz -P ~/Downloads/
tar -zxvf ~/Downloads/confluent-oss-3.1.2-2.11.tar.gz -C ~/Downloads/
sudo mv ~/Downloads/confluent-3.1.2 /usr/local/confluent
/usr/local/confluent/etc/kafka-connect-elasticsearch
After that i have modified kafka-connect-elasticsearch and set my elastic search url
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=AWSKafkaTutorialTopic
key.ignore=true
connection.url=https://search-abcdefg-risdfgdfgk-es-ex675zav7k6mmmqodfgdxxipg5cfsi.us-east-1.es.amazonaws.com
type.name=kafka-connect
The producer sends message like below fomrat
{
"data": {
"RequestID": 517082653,
"ContentTypeID": 9,
"OrgID": 16145,
"UserID": 4,
"PromotionStartDateTime": "2019-12-14T16:06:21Z",
"PromotionEndDateTime": "2019-12-14T16:16:04Z",
"SystemStartDatetime": "2019-12-14T16:17:45.507000000Z"
},
"metadata": {
"timestamp": "2019-12-29T10:37:31.502042Z",
"record-type": "data",
"operation": "insert",
"partition-key-type": "schema-table",
"schema-name": "dbo",
"table-name": "TRFSDIQueue"
}
}
I am little confused in how will the kafka connect start here ?
if yes how can i start that ?
I also have started schema registry like below which gave me error.
/usr/local/confluent/bin/schema-registry-start /usr/local/confluent/etc/schema-registry/schema-registry.properties
When i do that i get below error
[2019-12-29 13:49:17,861] ERROR Server died unexpectedly: (io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain:51)
kafka.common.KafkaException: Failed to parse the broker info from zookeeper: {"listener_security_protocol_map":{"CLIENT":"PLAINTEXT","CLIENT_SECURE":"SSL","REPLICATION":"PLAINTEXT","REPLICATION_SECURE":"SSL"},"endpoints":["CLIENT:/
Please help .
As suggested in answer i upgraded the kafka connect version but then i started getting below error
ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:63)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:210)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:61)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:72)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:39)
at io.confluent.rest.Application.createServer(Application.java:201)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:41)
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: Timed out trying to create or validate schema topic configuration
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:168)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:111)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:208)
... 5 more
Caused by: java.util.concurrent.TimeoutException
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:274)
at io.confluent.kafka.schemaregistry.storage.KafkaStore.createOrVerifySchemaTopic(KafkaStore.java:161)
... 7 more
First, Confluent Platform 3.1.2 is fairly old. I suggest you get the version that aligns with the Kafka version
You start Kafka Connect using the appropriate connect-* scripts and properties located under bin and etc/kafka folders
For example,
/usr/local/confluent/bin/connect-standalone \
/usr/local/confluent/etc/kafka/kafka-connect-standalone.properties \
/usr/local/confluent/etc/kafka-connect-elasticsearch/quickstart.properties
If that works, you can move onto using connect-distributed command instead
Regarding Schema Registry, you can search its Github issues for multiple people trying to get MSK to work, but the root issue is related to MSK not exposing a PLAINTEXT listener and the Schema Registry not supporting named listeners. (This may have changed since versions 5.x)
You could also try using Connect and Schema Registry containers in ECS / EKS rather than extracting in an EC2 machine

EOFException from Kafka in Flume

I am trying to set up a simple data pipeline from a console Kafka producer to the Hadoop file system (HDFS). I am working on a 64bit Ubuntu Virtual Machine and have created separate users for both Hadoop and Kafka as was suggested by the guides that I have followed. Consuming the produced input in Kafka with a console consumer works and the HDFS seems to be up and running.
Now I want to use Flume to pipe the input into the HDFS. I am using the following configuration file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 127.0.0.1:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 2000
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = hdfs://flume/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
Now when I run Flume with the following command
bin/flume-ng agent --conf ./conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n tier1
I get the same exception in the console output over and over again:
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.clients.NetworkClient.handleConnections(NetworkClient.java:467)] Completed connection to node 2147483647
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.common.network.Selector.poll(Selector.java:307)] Connection with Ubuntu-Sandbox/127.0.1.1 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:311)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:890)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at org.apache.flume.source.kafka.KafkaSource.doStart(KafkaSource.java:529)
at org.apache.flume.source.BasicSourceSemantics.start(BasicSourceSemantics.java:83)
at org.apache.flume.source.PollableSourceRunner.start(PollableSourceRunner.java:71)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:249)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The only way to stop Flume is to kill the Java process.
I thought that it might have something to do with the separate users for Hadoop and Kafka, but even when running everything with the Kafka user I get the same result. I haven't found anything concerning the EOFException method online either, which is strange considering that I have just followed the "Getting Started" guides and used pretty standard configurations for everything.
Maybe it has something to do with the preceding line ("Ubuntu-Sandbox/127.0.1.1 disconnected") and hence the configuration of my VM?
Any help is highly appreciated!
Have you considered using Kafka Connect (part of Apache Kafka) and the HDFS connector instead? This is generally seen to have superseded Flume. It is easy to use, with a similar file-based configuration as Flume.

unable to download data from twitter through flume

bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console
When I run the above command it generate the following errors:
2016-05-06 13:33:31,357 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 404:The URI requested is invalid or the resource requested, such as a user, does not exist. Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api
This is my flume-twitter.conf file located in flume/conf folder:
TwitterAgent.sources= Twitter TwitterAgent.channels= MemChannel TwitterAgent.sinks=HDFS TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource TwitterAgent.sources.Twitter.channels=MemChannel TwitterAgent.sources.Twitter.consumerKey=jtlmThaz307pCCQtlw9lvrrOq TwitterAgent.sources.Twitter.consumerSecret=oaGCt6OaUas13Ji5NTnPN6TFjdSKtsAUQdq4ZhAq0BFn9jgHPU TwitterAgent.sources.Twitter.accessToken=921523328-xxY9nrWijDSVC77iK40eRNVmRIopvLXovpoxBnDs TwitterAgent.sources.Twitter.accessTokenSecret=fbtuDENfBNxTooPD0EEgEo15Pg51cxNQa1CochI56gqSO TwitterAgent.sources.Twitter.keywords= WT20,hadoop,election,sports, cricket,Big data,IPL2016,Panamaleaks,Pollingday TwitterAgent.sinks.HDFS.channel=MemChannel TwitterAgent.sinks.HDFS.type=hdfs TwitterAgent.sinks.HDFS.hdfs.path=hdfs://HadoopMaster:9000/user/flume/tweets TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream TwitterAgent.sinks.HDFS.hdfs.writeformat=Text TwitterAgent.sinks.HDFS.hdfs.batchSize=1000 TwitterAgent.sinks.HDFS.hdfs.rollSize=0 TwitterAgent.sinks.HDFS.hdfs.rollCount=10000 TwitterAgent.sinks.HDFS.hdfs.rollInterval=600 TwitterAgent.channels.MemChannel.type=memory TwitterAgent.channels.MemChannel.capacity=10000 TwitterAgent.channels.MemChannel.transactionCapacity=100*
Try replacing your flume-sources-1.x-SNAPSHOT.jar with the jar file downloaded from this link.
As Twitter broke their old APIs few days ago. The old jar file will not work. You can Download the modified jar from the link I have given above.
P.S. I am getting results through this method.

flume agent throws debug,what could be the issue?

When i try to run the flume agent, i am getting following statement repeatedly.unless i am stopping the task forcefully it displaying continuously, what could be the issue
please help me out
2013-05-27 03:47:12,517 (conf-file-poller-0) [DEBUG - org.apache.flume.conf.file.AbstractFileConfigurationProvider$FileWatcherRunnable.run(AbstractFileConfigurationProvider.java:188)] Checking file:/etc/flume-ng/conf![enter image description here][1]/loclog.conf for changes
2013-05-27 03:47:12,517 (conf-file-poller-0) [DEBUG - org.apache.flume.conf.file.AbstractFileConfigurationProvider$FileWatcherRunnable.run(AbstractFileConfigurationProvider.java:188)] Checking file:/etc/flume-ng/conf![enter image description here][1]/loclog.conf for changes
2013-05-27 03:47:12,517 (conf-file-poller-0) [DEBUG - org.apache.flume.conf.file.AbstractFileConfigurationProvider$FileWatcherRunnable.run(AbstractFileConfigurationProvider.java:188)] Checking file:/etc/flume-ng/conf![enter image description here][1]/loclog.conf for changes
This is normal behaviour and should be ignored.
Flume automatically checks its config files for changes. If it detects a change it will reconfigure itself with those changes. The DEBUG entries you see above are Flume checking its config file.
Note that the reconfiguration process will not pick up all changes. I've noticed that new sources and sinks will often require a process restart.

Resources