I'm using Spark Streaming with a Flume receiver.
The streamed events contain many fields that I do not require, so I want to filter them out.
I just want to know which is the better place to filter the data:
Applying a Flume interceptor to alter the data before handing it to Spark Streaming, or
Applying a filter on the DStream in Spark Streaming.
Thanks in advance.
Both options will work. You can decide based on two things:
A Flume interceptor is a more decoupled way of doing it.
Filtering in Spark Streaming will be faster.
If you are receiving numerous events per second, I would say go for Spark Streaming; if that's not the case, go for Flume interceptors.
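If you go with the second option, here is a minimal sketch of filtering on the DStream, assuming a receiver-based Flume stream; the host, port, and the keepEvent predicate are placeholders for your actual setup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FilterFlumeEvents {

  // Hypothetical predicate: keep only events that carry the field you care about
  def keepEvent(body: String): Boolean = body.contains("required_field")

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("FilterFlumeEvents"), Seconds(10))

    // Receiver-based Flume stream; host and port are placeholders
    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141)

    val filtered = flumeStream
      .map(event => new String(event.event.getBody.array()))  // decode the event body
      .filter(keepEvent)                                       // drop events you don't need

    filtered.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```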
Related
I have installed Kafka Connect using confluent-4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them into the HDFS sink.
My requirement is to make small modifications to the values of the records, for example performing arithmetic operations on integers or manipulating strings.
Please suggest if there is any way to achieve this.
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMT are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
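If you go the Kafka Streams route, here is a minimal Scala sketch against the Java Streams API. The application id, broker address, and topic names ("orders" / "orders-modified") are placeholders, and it assumes plain string values for brevity; your Avro records would need the corresponding Avro serdes. The HDFS connector would then consume the output topic instead of the original one.

```scala
import java.util.Properties

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object EnrichBeforeHdfs {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrich-before-hdfs")   // placeholder app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")    // placeholder broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()

    // Read from the source topic, tweak each value, write to a new topic
    // that the HDFS connector is then pointed at
    builder
      .stream[String, String]("orders")
      .mapValues(new ValueMapper[String, String] {
        override def apply(value: String): String = value.toUpperCase  // stand-in for your real logic
      })
      .to("orders-modified")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
  }
}
```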
Take a look at Kafka Connect transforms [1] & [2]. You can build a custom transform library and use it in the connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
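If you go down the custom-transform road, a rough Scala sketch of the Transformation interface might look like the following; the class name and the upper-casing logic are placeholders for your real modification.

```scala
import java.util

import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.ConnectRecord
import org.apache.kafka.connect.transforms.Transformation

// Hypothetical custom SMT that upper-cases String record values before they reach the sink
class UpperCaseValue[R <: ConnectRecord[R]] extends Transformation[R] {

  override def apply(record: R): R = record.value() match {
    case s: String =>
      record.newRecord(
        record.topic(), record.kafkaPartition(),
        record.keySchema(), record.key(),
        record.valueSchema(), s.toUpperCase,
        record.timestamp())
    case _ => record   // leave non-String values untouched
  }

  override def config(): ConfigDef = new ConfigDef()
  override def configure(configs: util.Map[String, _]): Unit = ()
  override def close(): Unit = ()
}
```

You would then package the class into a jar on the Connect worker's plugin path and register it in the connector configuration through the transforms properties.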
I am using Flume to handle data from multiple sources and store it in HDFS, but I could not understand how to filter the data before storing it in HDFS.
You have two options:
Use a Flume interceptor; check the answer here.
Use a streaming-based solution (Apache Spark, Apache Heron/Storm) to filter the records and then store them in HDFS (see the sketch below).
The second option gives you more flexibility to write different types of streaming patterns. Add a comment if you have more queries.
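A minimal sketch of the second option, assuming events arrive from a Flume Avro sink; the host, port, filter rule, and output path are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FilterToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("FilterToHdfs"), Seconds(30))

    // Point your Flume Avro sink at this host/port instead of writing straight to HDFS
    val events = FlumeUtils.createStream(ssc, "collector-host", 9999)

    events
      .map(e => new String(e.event.getBody.array()))
      .filter(record => !record.contains("DEBUG"))       // placeholder filter rule
      .saveAsTextFiles("hdfs:///data/filtered/records")  // one output directory per batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```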
I am writing a Spark Streaming app that compares online streaming data with basic data that I broadcast to each computing node. However, since the basic data is updated daily, I need to update the broadcast variable daily too. The basic data resides on HDFS.
Is there a way to do this? The update is not related to any online streaming results; it just happens at, say, 12:00 am every day. Moreover, if there is such a way, will the updating process block the Spark Streaming computing jobs?
Refer to the last answer in the thread you referenced. Summary: instead of broadcasting the data itself, broadcast the caching code that updates the data at the needed interval.
Create a CacheLookup object that refreshes its data daily at 12 am
Wrap that in a broadcast variable
Use the CacheLookup as part of the streaming logic (a minimal sketch follows below)
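A minimal Scala sketch of that pattern, assuming the basic data is a simple "key,value" text file on HDFS and that the path is a placeholder:

```scala
import java.time.LocalDate

import scala.io.Source

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Broadcast the lookup *logic*, not the data: each executor lazily loads the
// reference file from HDFS and reloads it once the calendar day changes.
class CacheLookup(hdfsPath: String) extends Serializable {

  @transient private var loadedFor: LocalDate = _
  @transient private var table: Map[String, String] = _

  private def refreshIfStale(): Unit = synchronized {
    val today = LocalDate.now()
    if (loadedFor == null || loadedFor.isBefore(today)) {
      val fs = FileSystem.get(new Configuration())
      val in = fs.open(new Path(hdfsPath))
      try {
        table = Source.fromInputStream(in)
          .getLines()
          .map(_.split(",", 2))
          .collect { case Array(k, v) => k -> v }
          .toMap
      } finally {
        in.close()
      }
      loadedFor = today
    }
  }

  def lookup(key: String): Option[String] = {
    refreshIfStale()
    table.get(key)
  }
}

// Driver side: broadcast the lookup object once; executors refresh lazily, e.g.
//   val cacheLookup = ssc.sparkContext.broadcast(new CacheLookup("hdfs:///reference/basic-data.csv"))
//   stream.map(key => (key, cacheLookup.value.lookup(key)))
```

Because the reload happens lazily inside the executors when the day rolls over, it does not block the streaming jobs as a whole; the first batch after midnight simply pays the reload cost on each executor.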
I am new to Spark; looks awesome!
I have gobs of hourly logfiles from different sources, and wanted to create DStreams from them with a sliding window of ~5 minutes to explore correlations.
I'm just wondering what the best approach to accomplish this might be. Should I chop them up into 5-minute chunks in different directories? How would that naming structure be associated with a particular timeslice across different HDFS directories? Do I implement a filter() method that knows the log record's embedded timestamp?
Suggestions and RTFMs welcome.
Thanks!
Chris
You can use Apache Kafka as the DStream source and then try the reduceByKeyAndWindow DStream function. It will create a window according to your required time.
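A minimal sketch of that idea, using the older receiver-based Kafka API, a 30-second batch interval, and a hypothetical extractSource parser for your log format; the ZooKeeper address, consumer group, topic name, and checkpoint directory are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object WindowedLogCounts {

  // Hypothetical parser: derive the source name from the log line
  def extractSource(line: String): String = line.split(" ").headOption.getOrElse("unknown")

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("WindowedLogCounts"), Seconds(30))
    ssc.checkpoint("hdfs:///checkpoints/windowed-log-counts")  // required for windowed state

    // Receiver-based Kafka stream: (ZooKeeper quorum, consumer group, topic -> receiver threads)
    val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "log-explorer", Map("logs" -> 1))
      .map(_._2)  // drop the Kafka key, keep the log line

    // Count occurrences per source over a 5-minute window, sliding every 30 seconds
    val counts = lines
      .map(line => (extractSource(line), 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(5), Seconds(30))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```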
We are receiving huge amounts of XML data via an API. In order to handle this large data set, we are planning to process it in Hadoop.
I need your help in understanding how to efficiently bring the data into Hadoop. What tools are available? Is there a possibility of bringing in this data in real time?
Please provide your inputs.
Thanks for your help.
Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use some aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started. It presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
You might also want to have a look at Apache Chukwa, which does the same thing.
HTH
Flume, Scribe, and Chukwa are tools that can accomplish the above task. However, Flume is the most popular of the three. Flume has strong reliability and failover mechanisms. It also has commercial support available from Cloudera, while the other two do not.
If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS (a sketch of such an import follows below). Running Flume will be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop: do you need to run MR jobs?
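A minimal sketch of such an import, written in Scala against the Hadoop FileSystem API rather than as a shell one-liner; the spool directory, file-name pattern, and HDFS target directory are assumptions:

```scala
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Run from cron: copy the accumulated XML dumps from a local spool directory into HDFS
object ImportXmlToHdfs {
  def main(args: Array[String]): Unit = {
    val fs       = FileSystem.get(new Configuration())
    val localDir = new File("/var/spool/api-dumps")   // placeholder spool directory

    Option(localDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.getName.matches("""data-\d{4}-\d{2}-\d{2}-\d{2}\.xml"""))
      .foreach { f =>
        fs.copyFromLocalFile(new Path(f.getAbsolutePath), new Path("/raw/xml/" + f.getName))
      }
  }
}
```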
You want to put the data into Avro, or your choice of protocol buffer, for processing. Once you have a buffer that matches the format of the text, the Hadoop ecosystem is of much better help in processing the structured data.
Hadoop was originally found most useful for taking single-line log file entries and structuring/processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network-bound by the third-party API, a streaming solution might be more flexible and just as fast in the end for your needs.