Need help debugging kafka source to hdfs sink with flume - hadoop

I'm trying to send data from Kafka (eventually we'll use Kafka running on a different instance) to HDFS. I think Flume or some other ingestion tool is necessary to get data into HDFS, so we're using Cloudera's Flume service along with HDFS.
This is my flume-conf file; the other conf file is empty:
tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect=localhost:2181
tier1.sources.source1.topic=test
tier1.sources.source1.groupId=flume
tier1.sources.source1.channels=channel1
tier1.sources.source1.interceptors=i1
tier1.sources.source1.interceptors.i1.type=timestamp
tier1.sources.source1.kafka.consumer.timeout.ms=100
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.hdfs.path=/tmp/kafka/test/data
tier1.sinks.sink1.hdfs.rollInterval=5
tier1.sinks.sink1.hdfs.rollSize=0
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.fileType=DataStream
When I start a Kafka consumer, it receives messages from a Kafka producer just fine on localhost:2181. But I don't see any errors from the Flume agent, nothing gets put into HDFS, and I also can't find any log files.
This is how I start the agent.
flume-ng agent --conf /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/flume-ng/conf --conf-file flume-conf --name agent1 -Dflume.root.logger=DEBUG,INFO,console
Help please?

Fixed it.
You have to change
--name agent1
to --name tier1
so that the agent name passed on the command line matches the tier1 prefix used in the configuration file.
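With that change, the working invocation looks like this (same paths as above; note that flume.root.logger normally takes a single level plus an appender, e.g. DEBUG,console):
flume-ng agent --conf /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/flume-ng/conf --conf-file flume-conf --name tier1 -Dflume.root.logger=DEBUG,console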

Related

How to specify the address of ResourceManager to bin/yarn-session.sh?

I am a newbie in Flink.
I'm confused about how to specify the address of the ResourceManager when running bin/yarn-session.sh.
When you start a Flink YARN session via bin/yarn-session.sh, it creates a .yarn-properties-USER file in your tmp directory. This file contains the connection information for the Flink cluster. When you then submit a job via bin/flink run <JOB_JAR>, the client uses the connection information from this file.
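As for the ResourceManager address itself, it is not passed to bin/yarn-session.sh directly; the script picks it up from the Hadoop/YARN configuration on the classpath. A minimal sketch, assuming your YARN config lives in /etc/hadoop/conf (rm-host:8032 is a placeholder):
In yarn-site.xml:
<property>
  <name>yarn.resourcemanager.address</name>
  <value>rm-host:8032</value>
</property>
Then:
export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/yarn-session.sh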

Confluent HDFS Connector

I want to move Kafka log data into HDFS, so I made the following HDFS Connector configuration.
/quickstart-hdfs.properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=kafka_log_test
hdfs.url=hdfs://10.100.216.60:9000
flush.size=100000
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
/connect-avro-standalone.properties
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
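For reference, the connector is started in standalone mode by passing the worker file first and the connector file second (connect-standalone here is the Confluent script; plain Apache Kafka ships it as connect-standalone.sh):
connect-standalone connect-avro-standalone.properties quickstart-hdfs.properties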
When I run the HDFS Connector, it just writes the Avro schema into the .avro file, not the data.
/kafka_log_test+0+0000000018+0000000020.avro
avro.schema {"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}],"connect.version":1,"connect.name":"myrecord"}
The topic has lots of data, but the Confluent HDFS connector doesn't seem to move the data to HDFS.
How can I resolve this problem?
By definition, unless the messages between offsets 18 and 20 were compacted or expired, the file whose name contains 0+0000000018+0000000020 will hold the records for those offsets from partition 0.
You should use the tojson command of avro-tools rather than getmeta.
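A minimal sketch of that check, with an illustrative avro-tools version:
java -jar avro-tools-1.8.2.jar tojson kafka_log_test+0+0000000018+0000000020.avro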
Or you can use Spark or Pig to read that file.
You might also want to verify the connector keeps running after it starts, because setting hive.metastore.uris=thrift://localhost:9083 on a machine that is not the Hive Metastore server will cause the Connect task to fail. The URI should point at the actual Hive host, just as you've done for the NameNode.
Also, it shouldn't be possible to get a .avro file extension with format.class=io.confluent.connect.hdfs.parquet.ParquetFormat anyway, so you might want to verify you are looking in the correct HDFS path. Note: Connect writes to a +tmp location temporarily before writing the final output files.
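One way to double-check where the connector is actually writing, assuming the default topics.dir of topics under the configured hdfs.url:
hadoop fs -ls -R /topics/kafka_log_test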

flume agent with syslog source and hbase sink

I am trying to use Flume with a syslog source and an HBase sink.
When I run the Flume agent I get this error: Failed to start agent because dependencies were not found in classpath. Error follows. java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration. This means (according to that question) that some HBase libraries are missing. To solve it, I set the path to these libraries in the flume-env.sh file, but when I ran Flume again the error persisted. Here is the command I used to run the Flume agent:
bin/flume-ng agent --conf ./conf --conf-file ./conf/flume.properties --name agent -Dflume.root.logger=INFO,console
So my question is: if the solution I used is correct (adding the libraries to Flume), why do I still get the same error? And if it isn't, how can I solve the problem?
EDIT
From the docs I read: The flume-ng executable looks for and sources a file named "flume-env.sh" in the conf directory specified by the --conf/-c command-line option.
I haven't tested it yet, but I think that is the solution (I just need confirmation).
I would recommend downloading the full HBase tarball and setting environment variables such as HBASE_HOME to the right locations. Then Flume can automatically pick up the libraries from the HBase installation.
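A minimal sketch of what that could look like in conf/flume-env.sh, assuming HBase is unpacked under /usr/lib/hbase (adjust the path to your install):
export HBASE_HOME=/usr/lib/hbase
export FLUME_CLASSPATH="$FLUME_CLASSPATH:$HBASE_HOME/lib/*"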

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode even though I can read it in standalone mode. In cluster mode, it's trying to read from HDFS, so I put the file into HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all on the node's local disk. I believe you misunderstand what the hadoop commands are doing.
Instead, you should try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/), and then loading it using the hdfs://... prefix (see the sketch after these points).
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things; I hope they serve you well.
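A minimal sketch of the two steps, assuming the persistent HDFS NameNode on a spark-ec2 cluster listens on port 9010 (check fs.default.name in /root/persistent-hdfs/conf/core-site.xml for the actual value):
/root/persistent-hdfs/bin/hadoop fs -mkdir -p /wordcount/input
/root/persistent-hdfs/bin/hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
Then in the job, load it with an explicit hdfs:// URI:
JavaRDD<String> logData = sc.textFile("hdfs://<master-ip>:9010/wordcount/input/input.txt").cache();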

Using the storm hdfs connector

I am running cloudera cdh 4.4 on a VM.
I am trying to use the storm hdfs connector to write data into hdfs.
the link: https://github.com/ptgoetz/storm-hdfs
There is a test topology that writes data into HDFS. I tried to run the topology using Storm (0.9.2), but it doesn't seem to work.
the link for topology: https://github.com/ptgoetz/storm-hdfs/tree/master/src/test/java/org/apache/storm/hdfs/bolt
The topology itself requires two arguments:
1. the HDFS URL
2. a YAML config file, which I don't know where to find or whether I have to create it myself.
Does anybody know the steps to follow to run this topology?
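A hypothetical invocation, assuming HdfsFileTopology from the linked test directory is the entry point and that the two arguments are exactly the ones listed above (jar name, NameNode address, and YAML file name are illustrative):
storm jar storm-hdfs-0.1.2-SNAPSHOT.jar org.apache.storm.hdfs.bolt.HdfsFileTopology hdfs://localhost:8020 hdfs_bolt.yaml
The YAML file is one you create yourself; the keys it expects are defined in the topology's own source, so check HdfsFileTopology.java in the linked repo before writing it.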
