Writing Tweets to the HDFS using Flume doesn't work - hadoop

I used the Cloudera CDH5 QuickStart VM with VMware, and all the services are installed via Cloudera Manager.
I created /user/flume/tweets plus a flume user and group. I restarted all the services, but no matter how long I wait, no tweets get written to HDFS. The /user/flume/tweets/ directory is still empty!
Why?
This is my flume.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = **
TwitterAgent.sources.Twitter.consumerSecret = **
TwitterAgent.sources.Twitter.accessToken = **
TwitterAgent.sources.Twitter.accessTokenSecret = ***
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost.localdomain:804/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
This is what I get in the Flume log:
[cloudera@localhost ~]$ tail -f /var/log/flume-ng/flume.log
27 May 2014 21:40:28,536 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016) - Processing:HDFS
27 May 2014 21:40:28,536 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016) - Processing:HDFS
27 May 2014 21:40:28,536 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016) - Processing:HDFS
27 May 2014 21:40:28,537 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016) - Processing:HDFS
27 May 2014 21:40:28,537 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.addProperty:1016) - Processing:HDFS
27 May 2014 21:40:28,562 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.isValid:319) - Agent configuration for 'agent' does not contain any channels. Marking it as invalid.
27 May 2014 21:40:28,564 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:127) - Agent configuration invalid for agent 'agent'. It will be removed.
27 May 2014 21:40:28,564 INFO [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration.validateConfiguration:140) - Post-validation flume configuration contains configuration for agents: [TwitterAgent]
27 May 2014 21:40:28,564 WARN [conf-file-poller-0] (org.apache.flume.node.AbstractConfigurationProvider.getConfiguration:138) - No configuration found for this host:agent
27 May 2014 21:40:28,592 INFO [conf-file-poller-0] (org.apache.flume.node.Application.startAllComponents:138) - Starting new configuration:{ sourceRunners:{} sinkRunners:{} channels:{} }
How can I fix that?
Thanks in advance.

Did you set up the Flume configuration using Cloudera Manager? Please follow this link http://javet.org/?p=279 for implementing the Twitter firehose in CDH5.
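Separately, note the warning in the log above: "No configuration found for this host:agent". The agent process is running under the name agent while the file configures TwitterAgent. If you launch Flume by hand, the name passed with -n must match the agent name used in the config file; a sketch, assuming stock CDH paths:
flume-ng agent -n TwitterAgent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/flume.conf
In Cloudera Manager, the equivalent knob is the agent name configured for the Flume service.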

Related

Kafka stream app failing to fetch offsets for partition

I created a Kafka cluster with 3 brokers, with the following details:
Created 3 topics, each one with replication factor=3 and partitions=2.
Created 2 producers each one writing to one of the topics.
Created a Streams application to process messages from 2 topics and write to the 3rd topic.
It was all running fine till now but I suddenly started getting the following warning when starting the Streams application:
[WARN ] 2018-06-08 21:16:49.188 [Stream3-4f7403ad-aba6-4d34-885d-60114fc9fcff-StreamThread-1] org.apache.kafka.clients.consumer.internals.Fetcher [Consumer clientId=Stream3-4f7403ad-aba6-4d34-885d-60114fc9fcff-StreamThread-1-restore-consumer, groupId=] Attempt to fetch offsets for partition Stream3-KSTREAM-OUTEROTHER-0000000005-store-changelog-0 failed due to: Disk error when trying to access log file on the disk.
Due to this warning, the Streams application is not processing anything from the 2 topics.
I tried the following things:
Stopped all brokers, deleted kafka-logs directory for each broker and restarted the brokers. It didn't solve the issue.
Stopped zookeeper and all brokers, deleted zookeeper logs as well as kafka-logs for each broker, restarted zookeeper and brokers and created the topics again. This too didn't solve the issue.
I am not able to find anything related to this error in the official docs or on the web. Does anyone have an idea of why I am suddenly getting this error?
EDIT:
Out of the 3 brokers, 2 (broker-0 and broker-2) continuously emit these logs:
Broker-0 logs:
[2018-06-09 02:03:08,750] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition initial11_topic-1 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)
[2018-06-09 02:03:08,750] INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition initial12_topic-0 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)
Broker-2 logs:
[2018-06-09 02:04:46,889] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition initial11_topic-1 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)
[2018-06-09 02:04:46,889] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition initial12_topic-0 as the leader reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)
Broker-1 shows the following logs:
[2018-06-09 01:21:26,689] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-06-09 01:31:26,689] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-06-09 01:39:44,667] ERROR [KafkaApi-1] Number of alive brokers '0' does not meet the required replication factor '1' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
[2018-06-09 01:41:26,689] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
I again stopped zookeeper and brokers, deleted their logs and restarted. As soon as I create the topics again, I start getting the above logs.
Topic details:
[zk: localhost:2181(CONNECTED) 3] get /brokers/topics/initial11_topic
{"version":1,"partitions":{"1":[1,0,2],"0":[0,2,1]}}
cZxid = 0x53
ctime = Sat Jun 09 01:25:42 EDT 2018
mZxid = 0x53
mtime = Sat Jun 09 01:25:42 EDT 2018
pZxid = 0x54
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 52
numChildren = 1
[zk: localhost:2181(CONNECTED) 4] get /brokers/topics/initial12_topic
{"version":1,"partitions":{"1":[2,1,0],"0":[1,0,2]}}
cZxid = 0x61
ctime = Sat Jun 09 01:25:47 EDT 2018
mZxid = 0x61
mtime = Sat Jun 09 01:25:47 EDT 2018
pZxid = 0x62
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 52
numChildren = 1
[zk: localhost:2181(CONNECTED) 5] get /brokers/topics/final11_topic
{"version":1,"partitions":{"1":[0,1,2],"0":[2,0,1]}}
cZxid = 0x48
ctime = Sat Jun 09 01:25:32 EDT 2018
mZxid = 0x48
mtime = Sat Jun 09 01:25:32 EDT 2018
pZxid = 0x4a
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 52
numChildren = 1
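(For reference, the same assignments plus the live leader/ISR state can be read more directly with the stock CLI; a hedged sketch, assuming ZooKeeper on localhost:2181 and Kafka's bin scripts on the PATH:)
kafka-topics.sh --describe --zookeeper localhost:2181 --topic initial11_topic
# Prints one line per partition with Leader/Replicas/Isr; a leader of -1
# or a shrunken Isr shows which broker the replica fetchers are blaming.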
Any clue?
I found out the issue. It was due to the following incorrect config in server.properties of broker-1:
advertised.listeners=PLAINTEXT://10.23.152.109:9094
The port in broker-1's advertised.listeners had mistakenly been changed to the same port as broker-2's advertised.listeners.
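For reference, the corrected layout looks something like this (a hedged sketch; the 9093 port for broker-1 is a hypothetical stand-in, the point being that no two brokers may advertise the same host:port pair):
# server.properties on broker-1 (its own, unique port)
advertised.listeners=PLAINTEXT://10.23.152.109:9093
# server.properties on broker-2
advertised.listeners=PLAINTEXT://10.23.152.109:9094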

Kafka - ERROR Stopping after connector error java.lang.IllegalArgumentException: Number of groups must be positive

Working on setting up Kafka running from our RDS Postgres 9.6 to Redshift, using the guidelines at https://blog.insightdatascience.com/from-postgresql-to-redshift-with-kafka-connect-111c44954a6a. We have all of the infrastructure set up, and I am working on fully setting up Confluent. I'm getting the error java.lang.IllegalArgumentException: Number of groups must be positive. when trying to start the connector. Here's my config file:
name=source-postgres
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=16
connection.url= ((correct url and information here))
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
topic.prefix=postgres_
Full error:
/usr/local/confluent$ /usr/local/confluent/bin/connect-standalone /usr/local/confluent/etc/schema-registry/connect-avro-standalone.properties /usr/local/confluent/etc/kafka-connect-jdbc/source-postgres.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/confluent/share/java/kafka-serde-tools/slf4j-log4j12-1.7.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/confluent/share/java/kafka-connect-elasticsearch/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/confluent/share/java/kafka-connect-hdfs/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/confluent/share/java/kafka/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[2018-01-29 16:49:49,820] INFO StandaloneConfig values:
    access.control.allow.methods =
    access.control.allow.origin =
    bootstrap.servers = [localhost:9092]
    internal.key.converter = class org.apache.kafka.connect.json.JsonConverter
    internal.value.converter = class org.apache.kafka.connect.json.JsonConverter
    key.converter = class io.confluent.connect.avro.AvroConverter
    offset.flush.interval.ms = 60000
    offset.flush.timeout.ms = 5000
    offset.storage.file.filename = /tmp/connect.offsets
    rest.advertised.host.name = null
    rest.advertised.port = null
    rest.host.name = null
    rest.port = 8083
    task.shutdown.graceful.timeout.ms = 5000
    value.converter = class io.confluent.connect.avro.AvroConverter
 (org.apache.kafka.connect.runtime.standalone.StandaloneConfig:180)
[2018-01-29 16:49:49,942] INFO Logging initialized @549ms (org.eclipse.jetty.util.log:186)
[2018-01-29 16:49:50,301] INFO Kafka Connect starting (org.apache.kafka.connect.runtime.Connect:52)
[2018-01-29 16:49:50,302] INFO Herder starting (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:70)
[2018-01-29 16:49:50,302] INFO Worker starting (org.apache.kafka.connect.runtime.Worker:113)
[2018-01-29 16:49:50,302] INFO Starting FileOffsetBackingStore with file /tmp/connect.offsets (org.apache.kafka.connect.storage.FileOffsetBackingStore:60)
[2018-01-29 16:49:50,304] INFO Worker started (org.apache.kafka.connect.runtime.Worker:118)
[2018-01-29 16:49:50,305] INFO Herder started (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:72)
[2018-01-29 16:49:50,305] INFO Starting REST server (org.apache.kafka.connect.runtime.rest.RestServer:98)
[2018-01-29 16:49:50,434] INFO jetty-9.2.15.v20160210 (org.eclipse.jetty.server.Server:327)
Jan 29, 2018 4:49:51 PM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected:
WARNING: The (sub)resource method listConnectors in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method createConnector in org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains empty path annotation.
WARNING: The (sub)resource method listConnectorPlugins in org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource contains empty path annotation.
WARNING: The (sub)resource method serverInfo in org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty path annotation.
[2018-01-29 16:49:51,385] INFO Started o.e.j.s.ServletContextHandler@5aabbb29{/,null,AVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:744)
[2018-01-29 16:49:51,409] INFO Started ServerConnector@54dab9ac{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:266)
[2018-01-29 16:49:51,409] INFO Started @2019ms (org.eclipse.jetty.server.Server:379)
[2018-01-29 16:49:51,410] INFO REST server listening at http://127.0.0.1:8083/, advertising URL http://127.0.0.1:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:150)
[2018-01-29 16:49:51,410] INFO Kafka Connect started (org.apache.kafka.connect.runtime.Connect:58)
[2018-01-29 16:49:51,412] INFO ConnectorConfig values:
    connector.class = io.confluent.connect.jdbc.JdbcSourceConnector
    key.converter = null
    name = source-postgres
    tasks.max = 16
    value.converter = null
 (org.apache.kafka.connect.runtime.ConnectorConfig:180)
[2018-01-29 16:49:51,413] INFO Creating connector source-postgres of type io.confluent.connect.jdbc.JdbcSourceConnector (org.apache.kafka.connect.runtime.Worker:159)
[2018-01-29 16:49:51,416] INFO Instantiated connector source-postgres with version 3.1.2 of type class io.confluent.connect.jdbc.JdbcSourceConnector (org.apache.kafka.connect.runtime.Worker:162)
[2018-01-29 16:49:51,419] INFO JdbcSourceConnectorConfig values:
    batch.max.rows = 100
    connection.url =
    incrementing.column.name = id
    mode = timestamp+incrementing
    poll.interval.ms = 5000
    query =
    schema.pattern = null
    table.blacklist = []
    table.poll.interval.ms = 60000
    table.types = [TABLE]
    table.whitelist = []
    timestamp.column.name = updated_at
    timestamp.delay.interval.ms = 0
    topic.prefix = postgres_
    validate.non.null = true
 (io.confluent.connect.jdbc.source.JdbcSourceConnectorConfig:180)
[2018-01-29 16:49:52,129] INFO Finished creating connector source-postgres (org.apache.kafka.connect.runtime.Worker:173)
[2018-01-29 16:49:52,130] INFO SourceConnectorConfig values:
    connector.class = io.confluent.connect.jdbc.JdbcSourceConnector
    key.converter = null
    name = source-postgres
    tasks.max = 16
    value.converter = null
 (org.apache.kafka.connect.runtime.SourceConnectorConfig:180)
[2018-01-29 16:49:52,209] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:102)
java.lang.IllegalArgumentException: Number of groups must be positive.
    at org.apache.kafka.connect.util.ConnectorUtils.groupPartitions(ConnectorUtils.java:45)
    at io.confluent.connect.jdbc.JdbcSourceConnector.taskConfigs(JdbcSourceConnector.java:123)
    at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:193)
    at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.recomputeTaskConfigs(StandaloneHerder.java:251)
    at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:281)
    at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.putConnectorConfig(StandaloneHerder.java:163)
    at org.apache.kafka.connect.cli.ConnectStandalone.main(ConnectStandalone.java:96)
[2018-01-29 16:49:52,210] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:68)
[2018-01-29 16:49:52,210] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2018-01-29 16:49:52,213] INFO Stopped ServerConnector@54dab9ac{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2018-01-29 16:49:52,218] INFO Stopped o.e.j.s.ServletContextHandler@5aabbb29{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
[2018-01-29 16:49:52,224] INFO REST server stopped (org.apache.kafka.connect.runtime.rest.RestServer:165)
[2018-01-29 16:49:52,224] INFO Herder stopping (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:76)
[2018-01-29 16:49:52,224] INFO Stopping connector source-postgres (org.apache.kafka.connect.runtime.Worker:218)
[2018-01-29 16:49:52,225] INFO Stopping table monitoring thread (io.confluent.connect.jdbc.JdbcSourceConnector:137)
[2018-01-29 16:49:52,225] INFO Stopped connector source-postgres (org.apache.kafka.connect.runtime.Worker:229)
[2018-01-29 16:49:52,225] INFO Worker stopping (org.apache.kafka.connect.runtime.Worker:122)
[2018-01-29 16:49:52,225] INFO Stopped FileOffsetBackingStore (org.apache.kafka.connect.storage.FileOffsetBackingStore:68)
[2018-01-29 16:49:52,225] INFO Worker stopped (org.apache.kafka.connect.runtime.Worker:142)
[2018-01-29 16:49:57,334] INFO Reflections took 6952 ms to scan 263 urls, producing 12036 keys and 80097 values (org.reflections.Reflections:229)
[2018-01-29 16:49:57,346] INFO Herder stopped (org.apache.kafka.connect.runtime.standalone.StandaloneHerder:86)
[2018-01-29 16:49:57,346] INFO Kafka Connect stopped (org.apache.kafka.connect.runtime.Connect:73)
We were using DMS between our RDS Postgres (9.6) and Redshift. It has been failing, is simply miserable to operate, and at this point is almost unwieldily expensive, so we are moving to this as a possible solution. I am kind of at a wall here and would really like some help on this.
I'm working on a very similar issue to this, and what I found is that if the connector doesn't have configuration telling it what to pull, it will simply error out. Try adding the following to your connector configuration:
table.whitelist=
Then specifying a list of tables to grab.
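For example (hypothetical table names; once the whitelist is non-empty the connector has at least one table to split into task groups, which is exactly what the "Number of groups must be positive" check is complaining about):
table.whitelist=users,orders,events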
I had this error with a JDBC Source Connector job. The issue was that the table.whitelist setting was case-sensitive, even though the underlying DB wasn't (the RDBMS was MS SQL Server).
So my table was tableName, and I had "table.whitelist": "tablename". This failed, and I got the above error. Changing it to "table.whitelist": "tableName" fixed the error.
This despite the fact that SELECT * FROM tablename and SELECT * FROM tableName both work in MS SQL Manager.

java.lang.IllegalArgumentException: Can't find HmacSHA1 algorithm

After installing Hadoop through brew install hadoop, I wanted to start it up. When I ran Hadoop 2.7.2's start-all.sh on my Mac, it went wrong. The logs:
16/08/19 19:50:25 INFO namenode.FSNamesystem: fsOwner = swinghu (auth:SIMPLE)
16/08/19 19:50:25 INFO namenode.FSNamesystem: supergroup = supergroup
16/08/19 19:50:25 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/08/19 19:50:25 INFO namenode.FSNamesystem: HA Enabled: false
16/08/19 19:50:25 INFO namenode.FSNamesystem: Append Enabled: true
16/08/19 19:50:25 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.lang.IllegalArgumentException: Can't find HmacSHA1 algorithm.
at org.apache.hadoop.security.token.SecretManager.<init>(SecretManager.java:146)
at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.<init>(AbstractDelegationTokenSecretManager.java:104)
at org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenSecretManager.<init>(DelegationTokenSecretManager.java:95)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.createDelegationTokenSecretManager(FSNamesystem.java:6600)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:829)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:697)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:984)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1429)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
16/08/19 19:50:25 INFO namenode.FSNamesystem: Stopping services started for acti
Two days later it works fine; it may be because I rebooted my MacBook Pro. Strange.
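For anyone who hits this later: the lookup that fails inside SecretManager is the standard JCE HmacSHA1 algorithm, so a quick sanity check is to ask the JVM for it directly (a hedged diagnostic sketch, assuming JAVA_HOME points at the JVM Hadoop actually launches):
echo "$JAVA_HOME"
"$JAVA_HOME/bin/jrunscript" -e 'print(javax.crypto.Mac.getInstance("HmacSHA1"))'
# If this print also fails, the JVM/JCE install is broken rather than Hadoop.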

Make Cygnus use WebHDFS to write to local HDFS

I'm trying to make a local Orion+Cygnus persist Orion's data on a local HDFS through WebHDFS.
In Cygnus' instructions on GitHub, very little is mentioned about WebHDFS, as the configuration is more about HttpFS.
The OrionHDFSSink .md says that hdfs_port=50070 is for WebHDFS, which is indeed what my HDFS exposes. So I would expect that by setting the port this way Cygnus would automatically use WebHDFS, but in my case it doesn't seem to be working this way.
So, here's my agent_1.conf:
cygnusagent.sources = http-source
cygnusagent.sinks = hdfs-sink
cygnusagent.channels = hdfs-channel
# source configuration
cygnusagent.sources.http-source.channels = hdfs-channel
cygnusagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnusagent.sources.http-source.port = 5050
cygnusagent.sources.http-source.handler = com.telefonica.iot.cygnus.handlers.OrionRestHandler
cygnusagent.sources.http-source.handler.notification_target = /notify
cygnusagent.sources.http-source.handler.default_service = def_serv
cygnusagent.sources.http-source.handler.default_service_path = def_servpath
cygnusagent.sources.http-source.handler.events_ttl = 4
cygnusagent.sources.http-source.interceptors = ts gi
cygnusagent.sources.http-source.interceptors.ts.type = timestamp
cygnusagent.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.GroupingInterceptor$Builder
cygnusagent.sources.http-source.interceptors.gi.grouping_rules_conf_file = /usr/cygnus/conf/grouping_rules.conf
# OrionHDFSSink configuration
cygnusagent.sinks.hdfs-sink.channel = hdfs-channel
cygnusagent.sinks.hdfs-sink.type = com.telefonica.iot.cygnus.sinks.OrionHDFSSink
cygnusagent.sinks.hdfs-sink.hdfs_host = localHDFS.ip
cygnusagent.sinks.hdfs-sink.hdfs_port = 50070
cygnusagent.sinks.hdfs-sink.hdfs_username = HDFSrootUser
cygnusagent.sinks.hdfs-sink.attr_persistence = column
# hdfs-channel configuration
cygnusagent.channels.hdfs-channel.type = memory
cygnusagent.channels.hdfs-channel.capacity = 1000
cygnusagent.channels.hdfs-channel.transactionCapacity = 100
When I update an entity on Orion, to which Cygnus is subscribed, Cygnus logs the following:
02 Sep 2015 20:09:12,353 INFO [2055470757@qtp-1523539038-0] (com.telefonica.iot.cygnus.handlers.OrionRestHandler.getEvents:150) - Starting transaction (1441217314-956-0000000000)
02 Sep 2015 20:09:12,362 INFO [2055470757@qtp-1523539038-0] (com.telefonica.iot.cygnus.handlers.OrionRestHandler.getEvents:236) - Received data ({ "subscriptionId" : "55e735c9b89e8535f8ca5ef2", "originator" : "localhost", "contextResponses" : [ { "contextElement" : { "type" : "Reading", "isPattern" : "false", "id" : "Reading1.1", "attributes" : [ { "name" : "Cost", "type" : "double", "value" : "32" }, { "name" : "Reading_ID", "type" : "integer", "value" : "14" }, { "name" : "Threshold", "type" : "double", "value" : "30" }, { "name" : "email", "type" : "string", "value" : "arthurmvieira@hotmail.com" } ] }, "statusCode" : { "code" : "200", "reasonPhrase" : "OK" } } ]})
02 Sep 2015 20:09:12,366 INFO [2055470757@qtp-1523539038-0] (com.telefonica.iot.cygnus.handlers.OrionRestHandler.getEvents:258) - Event put in the channel (id=2020008711, ttl=4)
02 Sep 2015 20:09:12,432 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:128) - Event got from the channel (id=2020008711, headers={fiware-servicepath=def_servpath, destination=reading1.1_reading, content-type=application/json, fiware-service=def_serv, ttl=4, transactionId=1441217314-956-0000000000, timestamp=1441217352368}, bodyLength=812)
02 Sep 2015 20:09:12,549 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionHDFSSink.persist:356) - [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (def_serv/def_servpath/reading1.1_reading/reading1.1_reading.txt), Data ({"recvTime":"2015-09-02T18:09:12.368Z","Cost":"32", "Cost_md":[],"Reading_ID":"14", "Reading_ID_md":[],"Threshold":"30", "Threshold_md":[],"email":"arthurmvieira@hotmail.com", "email_md":[]})
02 Sep 2015 20:09:12,557 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:143) - Persistence error (The /user/root/def_serv/def_servpath/reading1.1_reading directory could not be created in HDFS. HttpFS response: 503 Service unavailable)
02 Sep 2015 20:09:12,558 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:173) - An event was put again in the channel (id=2020008711, ttl=3)
02 Sep 2015 20:09:12,558 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1441217314-956-0000000000)
02 Sep 2015 20:09:13,560 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:128) - Event got from the channel (id=2020008711, headers={fiware-servicepath=def_servpath, destination=reading1.1_reading, content-type=application/json, fiware-service=def_serv, ttl=3, transactionId=1441217314-956-0000000000, timestamp=1441217352368}, bodyLength=812)
02 Sep 2015 20:09:13,574 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionHDFSSink.persist:356) - [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (def_serv/def_servpath/reading1.1_reading/reading1.1_reading.txt), Data ({"recvTime":"2015-09-02T18:09:12.368Z","Cost":"32", "Cost_md":[],"Reading_ID":"14", "Reading_ID_md":[],"Threshold":"30", "Threshold_md":[],"email":"arthurmvieira@hotmail.com", "email_md":[]})
02 Sep 2015 20:09:13,574 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:143) - Persistence error (The /user/root/def_serv/def_servpath/reading1.1_reading directory could not be created in HDFS. HttpFS response: 503 Service unavailable)
02 Sep 2015 20:09:13,575 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:173) - An event was put again in the channel (id=2020008711, ttl=2)
02 Sep 2015 20:09:13,575 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1441217314-956-0000000000)
02 Sep 2015 20:09:15,576 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:128) - Event got from the channel (id=2020008711, headers={fiware-servicepath=def_servpath, destination=reading1.1_reading, content-type=application/json, fiware-service=def_serv, ttl=2, transactionId=1441217314-956-0000000000, timestamp=1441217352368}, bodyLength=812)
02 Sep 2015 20:09:15,590 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionHDFSSink.persist:356) - [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (def_serv/def_servpath/reading1.1_reading/reading1.1_reading.txt), Data ({"recvTime":"2015-09-02T18:09:12.368Z","Cost":"32", "Cost_md":[],"Reading_ID":"14", "Reading_ID_md":[],"Threshold":"30", "Threshold_md":[],"email":"arthurmvieira@hotmail.com", "email_md":[]})
02 Sep 2015 20:09:15,599 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:143) - Persistence error (The /user/root/def_serv/def_servpath/reading1.1_reading directory could not be created in HDFS. HttpFS response: 503 Service unavailable)
02 Sep 2015 20:09:15,600 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:173) - An event was put again in the channel (id=2020008711, ttl=1)
02 Sep 2015 20:09:15,600 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1441217314-956-0000000000)
02 Sep 2015 20:09:18,601 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:128) - Event got from the channel (id=2020008711, headers={fiware-servicepath=def_servpath, destination=reading1.1_reading, content-type=application/json, fiware-service=def_serv, ttl=1, transactionId=1441217314-956-0000000000, timestamp=1441217352368}, bodyLength=812)
02 Sep 2015 20:09:18,615 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionHDFSSink.persist:356) - [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (def_serv/def_servpath/reading1.1_reading/reading1.1_reading.txt), Data ({"recvTime":"2015-09-02T18:09:12.368Z","Cost":"32", "Cost_md":[],"Reading_ID":"14", "Reading_ID_md":[],"Threshold":"30", "Threshold_md":[],"email":"arthurmvieira@hotmail.com", "email_md":[]})
02 Sep 2015 20:09:18,618 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:143) - Persistence error (The /user/root/def_serv/def_servpath/reading1.1_reading directory could not be created in HDFS. HttpFS response: 503 Service unavailable)
02 Sep 2015 20:09:18,621 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:173) - An event was put again in the channel (id=2020008711, ttl=0)
02 Sep 2015 20:09:18,621 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1441217314-956-0000000000)
02 Sep 2015 20:09:22,622 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:128) - Event got from the channel (id=2020008711, headers={fiware-servicepath=def_servpath, destination=reading1.1_reading, content-type=application/json, fiware-service=def_serv, ttl=0, transactionId=1441217314-956-0000000000, timestamp=1441217352368}, bodyLength=812)
02 Sep 2015 20:09:22,635 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionHDFSSink.persist:356) - [hdfs-sink] Persisting data at OrionHDFSSink. HDFS file (def_serv/def_servpath/reading1.1_reading/reading1.1_reading.txt), Data ({"recvTime":"2015-09-02T18:09:12.368Z","Cost":"32", "Cost_md":[],"Reading_ID":"14", "Reading_ID_md":[],"Threshold":"30", "Threshold_md":[],"email":"arthurmvieira@hotmail.com", "email_md":[]})
02 Sep 2015 20:09:22,635 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:143) - Persistence error (The /user/root/def_serv/def_servpath/reading1.1_reading directory could not be created in HDFS. HttpFS response: 503 Service unavailable)
02 Sep 2015 20:09:22,635 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:163) - The event TTL has expired, it is no more re-injected in the channel (id=2020008711, ttl=0)
02 Sep 2015 20:09:22,635 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (com.telefonica.iot.cygnus.sinks.OrionSink.process:193) - Finishing transaction (1441217314-956-0000000000)
So you can see it's trying to use HttpFS, as it logs the response:
HttpFS response: 503 Service unavailable
...on each write attempt.
How should I configure the agent to use WebHDFS?
Thank you
I don't know what was happening, but the configuration mentioned is correct and is working now.
After several rounds of rebooting the instance, rewriting the config files, and chasing log errors other than the one mentioned, it worked.
At some point Cygnus was trying to write to localhost:50075 instead of {localHDFS.ip}:50070, but that went away after rebooting Cygnus.
All instances are at their latest version (important).
Cygnus configuration for WebHDFS is just about setting the port to 50070, nothing else is required.
Regarding the connections you mention to 50075, they are correct as well, since that's the behaviour of WebHDFS: when you want to upload data to HDFS, first the client (in this case, Cygnus) accesses the Namenode through TCP/50070 port, then the namenode responds with a redirection location pointing to the datanode where the data will be effectively uploaded; such a redirection uses the TCP/50075 port, and thus that datanode:50075 must be accessible by the client (Cygnus). That's why we are using HttpFS in the global instance of Cosmos at FIWARE Lab: HttpFS works as a gateway hiding the details of the datanodes, and a single entry point and port (14000) is required.
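To make that exchange concrete, this is roughly what a WebHDFS client does (a hedged sketch with placeholder hostnames; MKDIRS and CREATE are standard WebHDFS REST operations):
# Step 1: talk to the namenode on TCP/50070. MKDIRS completes there;
# CREATE answers with a 307 redirect to a datanode on TCP/50075.
curl -i -X PUT "http://namenode.example:50070/webhdfs/v1/user/root/def_serv?op=MKDIRS&user.name=root"
curl -i -X PUT "http://namenode.example:50070/webhdfs/v1/user/root/def_serv/data.txt?op=CREATE&user.name=root"
# Step 2: send the bytes to the Location URL returned by the redirect,
# which is why datanode:50075 must also be reachable from the client.
curl -i -X PUT -T data.txt "<Location URL from the previous response>"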

Unable to load file to Hadoop using flume

I'm using Flume to move files to HDFS. While moving a file it shows this error. Please help me solve this issue.
15/05/20 15:49:26 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: r1 started
15/05/20 15:49:26 INFO avro.ReliableSpoolingFileEventReader: Preparing to move file /home/crayondata.com/shanmugapriya/apache-flume-1.5.2-bin/staging/HypeVisitorTest.java to /home/crayondata.com/shanmugapriya/apache-flume-1.5.2-bin/staging/HypeVisitorTest.java.COMPLETED
15/05/20 15:49:26 INFO source.SpoolDirectorySource: Spooling Directory Source runner has shutdown.
15/05/20 15:49:26 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
15/05/20 15:49:26 INFO hdfs.BucketWriter: Creating hdfs://localhost:9000/sha/HypeVisitorTest.java.1432117166377.tmp
15/05/20 15:49:26 ERROR hdfs.HDFSEventSink: process failed
java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2564)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:270)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:262)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:718)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:183)
at org.apache.flume.sink.hdfs.BucketWriter.access$1700(BucketWriter.java:59)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:715)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/05/20 15:49:26 ERROR flume.SinkRunner: Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:471)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2564)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:270)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:262)
at org.apache.flume.sink.hdfs.BucketWriter$9$1.run(BucketWriter.java:718)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:183)
at org.apache.flume.sink.hdfs.BucketWriter.access$1700(BucketWriter.java:59)
at org.apache.flume.sink.hdfs.BucketWriter$9.call(BucketWriter.java:715)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
15/05/20 15:49:26 INFO source.SpoolDirectorySource: Spooling Directory Source runner has shutdown.
Here is my flumeconf.conf file
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/shanmugapriya/apache-flume-1.5.2-bin/staging
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBackoff = 10000
a1.sources.r1.basenameHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/sha
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.idleTimeout = 100
a1.sinks.k1.hdfs.filePrefix = %{basename}
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.byteCapacity = 0
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Please help me solve this. TIA.
@Shan Please confirm you have the relevant Hadoop HDFS JARs in your classpath for Apache Flume.
Also, from your sink to HDFS I see that you have port 9000; however, the default port is normally 8020. Is this correct?
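If the JARs turn out to be the culprit: the "Not implemented by the DistributedFileSystem" message is the classic symptom of mixed Hadoop JAR versions on Flume's classpath (e.g. an old hadoop-core next to Hadoop 2.x hadoop-hdfs). A hedged sketch of pointing Flume at one consistent Hadoop install (paths are assumptions, adjust to your layout):
# conf/flume-env.sh inside the Flume install
export HADOOP_HOME=/usr/local/hadoop   # hypothetical location
FLUME_CLASSPATH="$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*"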
