I have installed Kafka Connect using Confluent 4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them to the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMT are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Streams API, which is a Java library that gives you the power to transform your data as much as you'd like. Here's an example.
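To make the Kafka Streams option concrete, here is a minimal sketch: it reads from one topic, applies a small per-record change, and writes to a second topic that the HDFS connector would then be pointed at. The topic names, the String serdes and the toUpperCase()/trim() manipulation are all made up for the example; with real Avro records you would plug in an Avro serde instead.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import java.util.Properties;

    public class EnrichBeforeHdfs {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-before-hdfs"); // made-up id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("source-topic"); // hypothetical topic

            // The small per-record changes go here (arithmetic, string manipulation, ...)
            source.mapValues(value -> value.toUpperCase().trim())
                  .to("modified-topic"); // point the HDFS connector at this topic instead

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

The connector configuration stays as you have it today; only its topics setting changes to the new topic.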
Take a look at Kafka Connect transforms [1] & [2]. You can build a custom transform library and use it in your connector; there's a rough sketch after the links.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
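To give a feel for what such a custom transform might look like, here is a rough sketch against the Transformation API. The class name and the integer "amount" field it doubles are invented for the example, and a real transform would also need null and schema handling:

    import java.util.Map;
    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.ConnectRecord;
    import org.apache.kafka.connect.data.Field;
    import org.apache.kafka.connect.data.Struct;
    import org.apache.kafka.connect.transforms.Transformation;

    // Hypothetical transform: doubles the "amount" field of a Struct value.
    public class DoubleAmount<R extends ConnectRecord<R>> implements Transformation<R> {

        @Override
        public R apply(R record) {
            if (!(record.value() instanceof Struct)) {
                return record; // pass through anything we don't understand
            }
            Struct value = (Struct) record.value();
            Struct updated = new Struct(value.schema());
            for (Field field : value.schema().fields()) {
                updated.put(field, value.get(field)); // copy everything as-is
            }
            updated.put("amount", value.getInt32("amount") * 2); // the actual modification

            return record.newRecord(record.topic(), record.kafkaPartition(),
                    record.keySchema(), record.key(),
                    value.schema(), updated, record.timestamp());
        }

        @Override
        public ConfigDef config() {
            return new ConfigDef(); // no configuration options in this sketch
        }

        @Override
        public void configure(Map<String, ?> configs) { }

        @Override
        public void close() { }
    }

You would package this as a JAR on the Connect plugin path and reference it from the connector's transforms configuration.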
Related
If on one topic I receive messages in some format which represent a list of identical structs (e.g. a JSON list or a repeated field in protobuf) could I configure Kafka Connect to write each entry in the list as a separate row (say in a parquet file in HDFS, or in a SQL database)? Is this possible using only the bundled converters/connectors?
I.e. can I use each Kafka message to represent thousands of records, rather than sending thousands of individual messages?
What would be a straightforward way to achieve this with Kafka Connect?
The bundled message transforms are only capable of one-to-one message manipulations. Therefore, you would have to explicitly produce those flattened lists in some way (directly, or via a stream processing application) if you wanted Connect to write them out as separate records; see the sketch below.
Or, if applicable, you can use Hive or Spark to expand that list as well for later processing.
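If you do go the stream-processing route, a small Kafka Streams job can do the flattening before Connect picks the data up. A minimal sketch, assuming the value is a JSON array serialized as a string, Jackson on the classpath, and made-up topic names:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    public class ExplodeList {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "explode-list");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            ObjectMapper mapper = new ObjectMapper();
            StreamsBuilder builder = new StreamsBuilder();

            // One incoming message holds a JSON array; emit one outgoing message per element.
            builder.<String, String>stream("batched-topic")
                   .flatMapValues(value -> {
                       List<String> rows = new ArrayList<>();
                       try {
                           for (JsonNode element : mapper.readTree(value)) {
                               rows.add(element.toString());
                           }
                       } catch (Exception e) {
                           // skip malformed input in this sketch
                       }
                       return rows;
                   })
                   .to("flattened-topic"); // the Connect sink reads this topic instead

            new KafkaStreams(builder.build(), props).start();
        }
    }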
Example:
{"id":"1","firstName":"abc","lastName":"xyz","dob":"12/09/1995","age":"23"}
This message structure is in a Kafka topic, but I want to index it in Elasticsearch as below:
{"id":"1","name":{"firstName":"abc","lastName":"xyz"},"dob":"12/09/1995","age":"23"}
How can I achieve this?
Two options:
Stream processing against the data in the Kafka topic. Using Kafka Streams you could wrangle the data model as required (a sketch of this follows at the end of this answer). KSQL could in principle work for this too, but it doesn't support creating STRUCTs yet. Other stream processing options would be Flink, Spark Streaming, etc.
Modify the data as it passes through Kafka Connect, using a Single Message Transform. There's no pre-built transform that does this, but you could write one using the API.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project, which also contributes to Kafka Streams, Kafka Connect, etc.
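To make option 1 concrete, here is a rough Kafka Streams sketch that nests firstName/lastName under a name object. The topic names are placeholders, and it assumes the messages are JSON strings with Jackson available:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.Properties;

    public class NestName {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "nest-name");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            ObjectMapper mapper = new ObjectMapper();
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("source-topic")
                   .mapValues(value -> {
                       try {
                           ObjectNode record = (ObjectNode) mapper.readTree(value);
                           ObjectNode name = record.putObject("name"); // add the nested object
                           name.set("firstName", record.remove("firstName"));
                           name.set("lastName", record.remove("lastName"));
                           return mapper.writeValueAsString(record);
                       } catch (Exception e) {
                           return value; // leave malformed records untouched in this sketch
                       }
                   })
                   .to("restructured-topic"); // point the Elasticsearch connector here

            new KafkaStreams(builder.build(), props).start();
        }
    }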
I am using Flume to handle data from multiple sources and store it in HDFS, but I could not understand how to filter the data before storing it in HDFS.
You have two options:
Use a Flume interceptor; check the answer here. A minimal sketch follows after this list.
Use a streaming-based solution (Apache Spark, Apache Heron/Storm) to filter records and then store them in HDFS.
The second option gives you more flexibility to implement different types of streaming patterns. Add a comment if you have more questions.
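A minimal sketch of what a filtering interceptor could look like; the class name and the "keep only bodies containing ERROR" rule are placeholders for whatever filtering you actually need:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    // Hypothetical filter: drop events whose body does not contain the word "ERROR".
    public class ErrorOnlyInterceptor implements Interceptor {

        @Override
        public void initialize() { }

        @Override
        public Event intercept(Event event) {
            String body = new String(event.getBody(), StandardCharsets.UTF_8);
            return body.contains("ERROR") ? event : null; // returning null drops the event
        }

        @Override
        public List<Event> intercept(List<Event> events) {
            List<Event> kept = new ArrayList<>();
            for (Event event : events) {
                Event result = intercept(event);
                if (result != null) {
                    kept.add(result);
                }
            }
            return kept;
        }

        @Override
        public void close() { }

        public static class Builder implements Interceptor.Builder {
            @Override
            public Interceptor build() {
                return new ErrorOnlyInterceptor();
            }

            @Override
            public void configure(Context context) { }
        }
    }

You would then reference the builder class (ErrorOnlyInterceptor$Builder) as the interceptor type in the agent configuration.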
I'm using Spark Streaming with a Flume receiver.
The streamed events consist of many fields that I do not require, so I want to filter these out.
I just want to check which is the better place to filter the data:
Applying a Flume interceptor to alter the data and then giving it to Spark Streaming, or
Applying filtering on the DStream in Spark Streaming.
Thanks in Advance.
Both options will work. You can decide based on two things:
A Flume interceptor is the more decoupled way of doing it.
Spark Streaming will be faster.
If you are receiving numerous events per second, I would say go for Spark Streaming; if that's not the case, go for Flume interceptors. A rough sketch of the Spark Streaming side is below.
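For reference, the Spark Streaming side could look roughly like this. It assumes the spark-streaming-flume integration (push-based receiver); the host, port and filter predicate are placeholders:

    import java.nio.charset.StandardCharsets;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    public class FilterFlumeStream {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("filter-flume-stream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Flume push-based receiver (host and port are placeholders)
            JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
                    FlumeUtils.createStream(jssc, "localhost", 41414);

            // Keep only the events of interest; everything else is discarded here
            JavaDStream<String> wanted = flumeStream
                    .map(e -> new String(e.event().getBody().array(), StandardCharsets.UTF_8))
                    .filter(body -> body.contains("ERROR")); // hypothetical predicate

            wanted.print(); // replace with your real processing / output

            jssc.start();
            jssc.awaitTermination();
        }
    }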
We are receiving huge amounts of XML data via an API. In order to handle this large data set, we were planning to process it in Hadoop.
Needed your help in understanding how to efficiently bring the data into Hadoop. What tools are available? Is there a possibility of bringing this data in real time?
Please provide your inputs.
Thanks for your help.
Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use some aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started. It presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
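As a very rough idea of the shape of a custom source (not a working client for your API), a pollable source might look like the sketch below. The class name, the apiUrl property and the fetch() placeholder are all invented, and the two backoff methods are only required on Flume 1.7 and later:

    import java.nio.charset.StandardCharsets;
    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    // Hypothetical source that polls an HTTP API and hands each XML response to the channel.
    public class XmlApiSource extends AbstractSource implements Configurable, PollableSource {

        private String apiUrl;

        @Override
        public void configure(Context context) {
            // "apiUrl" is a made-up property name for this sketch
            apiUrl = context.getString("apiUrl", "http://example.com/feed");
        }

        @Override
        public Status process() throws EventDeliveryException {
            try {
                String xml = fetch(apiUrl); // your HTTP call goes here
                getChannelProcessor().processEvent(
                        EventBuilder.withBody(xml, StandardCharsets.UTF_8));
                return Status.READY;
            } catch (Exception e) {
                return Status.BACKOFF; // back off and retry later
            }
        }

        private String fetch(String url) throws Exception {
            // placeholder: read the API response however suits you (HttpURLConnection, etc.)
            return "<response/>";
        }

        @Override
        public long getBackOffSleepIncrement() { return 1000L; } // required from Flume 1.7 on

        @Override
        public long getMaxBackOffSleepInterval() { return 5000L; }
    }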
You might also want to have a look at Apache Chukwa, which does the same thing.
HTH
Flume, Scribe and Chukwa are tools that can accomplish the above task. However, Flume is the most popularly used of the three. Flume has strong reliability and failover mechanisms available. Flume also has commercial support available from Cloudera, while the other two do not.
If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS. Running Flume will be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop: do you need to run MR jobs?
You want to put the data into Avro or your choice of protocol buffer for processing. Once you have a buffer that matches the format of the text, the Hadoop ecosystem is of much better help in processing the structured data.
Hadoop was originally found most useful for taking one-line entries of log files and structuring/processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network-bound by the third-party API, a streaming solution might be more flexible and just as fast in the end for your needs.
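If you do go the Avro route, converting a parsed XML document into an Avro container file is fairly mechanical. Here is a small sketch with a made-up schema and payload, using the standard Avro and JAXP APIs:

    import java.io.File;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;

    public class XmlToAvro {
        public static void main(String[] args) throws Exception {
            // Made-up XML payload and Avro schema, purely for illustration
            String xml = "<order><id>42</id><customer>abc</customer></order>";
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                  + "{\"name\":\"id\",\"type\":\"int\"},"
                  + "{\"name\":\"customer\",\"type\":\"string\"}]}");

            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            GenericRecord record = new GenericData.Record(schema);
            record.put("id", Integer.parseInt(
                    doc.getElementsByTagName("id").item(0).getTextContent()));
            record.put("customer", doc.getElementsByTagName("customer").item(0).getTextContent());

            // Avro container files like this can land in HDFS and be read by MapReduce, Hive or Spark
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("orders.avro"));
                writer.append(record);
            }
        }
    }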