My application is configured to read a topic from a Kafka cluster and write the transformed result to Hadoop HDFS. To do so, it needs to be launched on a YARN cluster node.
We'd like to use Spring Data Flow for this. But since this application doesn't need any input from another flow (it already knows where to pull its source from) and outputs nothing, how can I create a valid Data Flow stream from it?
In other words, this would be a stream composed of only one app, which should run indefinitely on a YARN node.
In this case you need a stream definition that connects to a named destination in Kafka and writes to HDFS.
For instance, the stream would look like this:
stream create a1 --definition ":myKafkaTopic > hdfs"
You can read here for more info on this.
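For completeness, a minimal sketch of the shell session might look like this (the --hdfs.directory option is illustrative; exact option names depend on the version of the hdfs sink app you have registered):
stream create a1 --definition ":myKafkaTopic > hdfs --hdfs.directory=/data/out"
stream deploy a1
When the stream is deployed through the Data Flow server for YARN, the hdfs sink app is launched on a YARN node and runs until you undeploy the stream.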
Related
I need to get data from CSV files (daily extracts from different business databases) into HDFS, then move it to HBase, and finally load aggregations of this data into a datamart (SQL Server).
I would like to know the best way to automate this process (using Java or Hadoop tools).
I'd echo the comment above regarding Kafka Connect, which is part of Apache Kafka. With it you just use configuration files to stream from your sources; you can use KSQL to create derived/enriched/aggregated streams, and then stream these to HDFS/Elastic/HBase/JDBC/etc.
There's a list of Kafka Connect connectors here.
This blog series walks through the basics:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/the-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
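Since your source data ultimately comes from business databases, one illustrative option is the Kafka Connect JDBC source connector (it ships with the Confluent platform rather than plain Apache Kafka). A sketch of its configuration might look roughly like this, where the connection details, column name, and topic prefix are placeholders:
name=jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlserver://dbhost:1433;databaseName=sales
mode=incrementing
incrementing.column.name=id
topic.prefix=sales-
The HDFS sink connector is configured in the same declarative way, so the whole pipeline can run without writing code.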
Little to no coding required? In no particular order:
Talend Open Studio
StreamSets Data Collector
Apache NiFi
Assuming you can set up a Kafka cluster, you can try Kafka Connect.
If you want to program something, probably Spark (see the sketch after this list); otherwise, pick your favorite language. Schedule the job via Oozie.
If you don't need the raw HDFS data, you can load directly into HBase
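As referenced in the Spark point above, a minimal Scala sketch for landing the daily CSV extracts on HDFS might look like this (paths are placeholders, and the HBase/datamart loading steps are not shown):
import org.apache.spark.sql.SparkSession

// Minimal sketch: read the daily CSV extracts and land them on HDFS as Parquet
val spark = SparkSession.builder().appName("csv-to-hdfs").getOrCreate()

spark.read
  .option("header", "true")
  .csv("file:///data/extracts/*.csv")   // local CSV drop folder (placeholder)
  .write
  .mode("append")
  .parquet("hdfs:///raw/extracts")      // HDFS landing area (placeholder)

spark.stop()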
I'm downloading a file from a source and creating a stream to process it line by line and finally sink it into HDFS.
For that purpose I'm using Spring Cloud Dataflow + Kafka.
Question: is there any way to know when the complete file has been sunk into HDFS to trigger an event?
is there any way to know when the complete file has been sunk into HDFS to trigger an event?
This type of use case typically falls under task/batch rather than a streaming pipeline. If you build a file-to-HDFS task (batch-job) application, you could then have a stream listening to the various task events in order to make further downstream decisions or do further data processing.
Please refer to "Subscribing to Task/Batch Events" from the reference guide for more details.
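As a sketch, assuming the default task-events destination name used by Spring Cloud Task, the listening stream could be as simple as:
stream create task-events --definition ":task-events > log" --deploy
You would replace the log sink with whatever downstream processing should fire once the task reports completion.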
How can I insert streaming data into HAWQ and execute queries on the live data?
I tested JDBC inserts and the performance was very bad.
After that I tested writing data to HDFS with Flume and created an external table in HAWQ, but HAWQ can't read the data until Flume closes the file. The problem is that if I set Flume's file rolling interval very low (1 min), after some days the number of files grows large, which is not good for HDFS.
A third solution is HBase, but because most of my queries are aggregations over a lot of data, HBase is not a good fit (HBase is good for fetching individual records).
So with these constraints, what is a good solution to query streaming data online with HAWQ?
If your source data is not on HDFS, you can try gpfdist/named pipes as a buffer with a gpfdist external table, or a web external table using other Linux scripts. Another solution would be the Spring XD gpfdist module: http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#gpfdist
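To illustrate the gpfdist option, a sketch could look like this, with gpfdist already serving the buffer directory (e.g. gpfdist -d /data/buffer -p 8081); the host, port, file pattern and columns are placeholders:
CREATE EXTERNAL TABLE ext_stream_events (
    event_time timestamp,
    payload    text
)
LOCATION ('gpfdist://etlhost:8081/events*.csv')
FORMAT 'CSV';
You can then periodically run an INSERT INTO a regular HAWQ table selecting from the external table, or query the external table directly.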
Another option for an external table is to use the TRANSFORM option. This is where the external table references a gpfdist URL and gpfdist executes a program for you to get data. It is a pull technique rather than push.
Here are the details:
External Table "TRANSFORM" Option
And since you mentioned JDBC, I wrote a program that leverages gpfdist which executes a Java program to get data via JDBC. It works with both Greenplum and HAWQ and any JDBC source.
gplink
Since you mentioned Flume, I will suggest an alternative approach using a similar tool, Spring XD.
You can have a Kafka topic where you drop the streaming messages and a Spring XD sink job which writes to HAWQ.
For example, suppose you have a stream loading files from an FTP server to Kafka and a Spring Java job taking messages from Kafka to HAWQ:
job deploy hawqbinjob --properties "module.hawqjob.count=2"
stream create --name ftpdocs --definition "ftp --host=me.local --remoteDir=/Users/me/DEV/FTP --username=me --password=********** --localDir=/Users/me/DEV/data/binary --fixedDelay=0 | log" --deploy
stream create --name file2kafka --definition "file --dir=/Users/me/DEV/data/binary --pattern=* --mode=ref --preventDuplicates=true --fixedDelay=1 | transform --expression=payload.getAbsolutePath() --outputType=text/plain | kafka --topic=mybin1 --brokerList=kafka1.localdomain:6667" --deploy
stream create --name kafka2hawq --definition "kafka --zkconnect=kafka1.localdomain:2181 --topic=mybin1 | byte2string > queue:job:hawqbinjob" --deploy
This is one way of getting parallelism, and it does not run into the HDFS open-file issue. You can extend this pattern in many ways, since most streaming data arrives in small sets. Hope this helps.
We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and the data read by Kafka are the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive uses internally, with a single folder for a single table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format, which is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This is configured individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers to not only use local disks but also to offload "cold data" to a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated, and when you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
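For example, you can inspect a topic's partitions, leaders, and replicas from the command line (on recent Kafka versions; older ones use --zookeeper instead of --bootstrap-server):
bin/kafka-topics.sh --describe --topic mytopic --bootstrap-server localhost:9092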
I think you can, but you have to do it manually. You can let Kafka sink whatever it outputs to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly, one can do the following:
Assuming you have all the servers running (check the Confluent website).
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: this approach assumes that you are using their platform (the Confluent Platform), which I haven't used.
Start the Kafka-to-HDFS connector, as sketched below.
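With the plain Apache Kafka scripts, and assuming the connector config above is saved as hdfs-sink.properties, launching it in standalone mode would look roughly like this (the HdfsSinkConnector class itself ships with the Confluent platform, so it must be on the worker's plugin path):
bin/connect-standalone.sh config/connect-standalone.properties hdfs-sink.properties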
Also you might find more useful details in this Stack Overflow discussion.
This happens with most beginners. Let's first understand that a component you see in big-data processing may not be related to Hadoop at all.
YARN, MapReduce, and HDFS are the three core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode and HDFS for storage.
In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
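For example, with the default setting each topic-partition gets its own directory under that path (the topic name here is just an example):
ls /tmp/kafka-logs/mytopic-0
which typically contains the offset-named segment files (*.log, *.index, *.timeindex) that Kafka appends messages to.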
Hope this helps.
I have created a real-time application in which I am writing data streams from weblogs to HDFS using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. In contrast, as you have experienced, if you put a batch of data (that's your file: a batch, not a stream), once it is uploaded it is ready to be read.
Anyway, not being an expert on Spark Streaming, it seems from the Spark Streaming Programming Guide (Overview section) that you are not using the right deployment. From the picture shown there, it seems the stream (in this case generated by Flume) must be sent directly to the Spark Streaming engine; the results are then put into HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, my suggestion is to create mini-batches of data in temporary HDFS folders; once a mini-batch is ready, store new data in a second mini-batch and pass the first one to Spark for analysis.
HTH
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an Avro sink in Flume which points to your Spark Streaming instance.
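A sketch of the relevant part of the Flume agent configuration (agent, channel and sink names as well as the host/port are placeholders; the port must match what you pass to FlumeUtils.createStream):
agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.hostname = spark-receiver-host
agent.sinks.sparkSink.port = 41414
agent.sinks.sparkSink.channel = memoryChannel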
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
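Putting it together, a self-contained Scala sketch of this push-based Flume approach, ending with a write to HDFS, might look like this (host, port and paths are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume._

val sparkConf = new SparkConf().setAppName("weblogs-flume-to-hdfs")
val ssc = new StreamingContext(sparkConf, Seconds(30))

// Receive events pushed by the Flume Avro sink
val flumeStream = FlumeUtils.createStream(ssc, "spark-receiver-host", 41414)

// Decode the event bodies and write each micro-batch to HDFS
flumeStream
  .map(event => new String(event.event.getBody.array))
  .saveAsTextFiles("hdfs:///data/weblogs/batch")

ssc.start()
ssc.awaitTermination()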