How to send, store and analyze sensor data using Hadoop

My Raspberry Pi 2 is running well with Windows 10, and I'm able to control an LED over the internet using .NET MF. Now I want to send the LED's ON/OFF signal (I'm going to use a temperature sensor instead of an LED) to a big data system for storage, analysis and retrieval.
I checked on the net but was not able to find a simple and easy way to do that. Could anyone please suggest a tutorial for "How can I send real-time data to Hadoop"? I want to understand the whole architecture before I proceed.
Which technologies/components should I concentrate on to build such a POC?
Note: I think I need some combination like an MQTT broker, Spark or Storm, etc., but I'm not sure how to put all the pieces together to make it practically possible. Please correct me if I'm wrong and help.

You could send the signals as a stream of events to Hadoop in real time, using one of several components which make up the Hadoop "ecosystem". Systems such as Spark or Storm, which process data in real time, are only necessary if you want to apply logic to the stream as it arrives. If you just want to batch up the events and store them in HDFS for later retrieval by a batch process, you could use:
Apache Flume. A Flume agent runs on one or more of the Hadoop nodes and listens on a port. Your Raspberry Pi sends each event one by one to that port. Flume buffers the events and then writes them to HDFS: https://flume.apache.org/FlumeUserGuide.html
Kafka. Your Raspberry Pi sends the events one by one to a Kafka instance which stores them as a message queue. A further distributed batch process runs periodically on Hadoop in order to move the events from Kafka to HDFS. This solution is more robust but has more moving parts.
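To make the Kafka option concrete, here is a minimal sketch of a producer running on the Pi that publishes temperature readings as JSON events. It assumes the kafka-python package; the broker address, topic name and sensor-read function are placeholders, not anything prescribed above.

```python
# Minimal sketch: publish temperature readings to a Kafka topic from the Pi.
# Assumes kafka-python and a broker reachable at broker-host:9092; the topic
# name "sensor-events" and read_temperature() are illustrative placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker-host:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def read_temperature():
    # Placeholder for the real sensor read (e.g. via GPIO or a sensor library).
    return 21.5

while True:
    event = {
        "device": "raspberry-pi-2",
        "temperature_c": read_temperature(),
        "ts": time.time(),
    }
    producer.send("sensor-events", event)  # buffered and sent asynchronously
    time.sleep(5)
```

A periodic batch job (or a Kafka-to-HDFS connector) can then move the accumulated events from the topic into HDFS.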

Related

Pause/Resume Flink job during migrations

I'm using Apache Flink to propagate updates from a given set of Kafka topics into an Elasticsearch cluster.
The problem I'm facing is that sometimes the Elasticsearch cluster evolves and I have to (1) modify the mappings, (2) copy over the data... and by the time I point the Flink jobs to the new alias/index, plenty of updates have already made it to the old index.
So I wonder what's the best way to approach this. I can tolerate downtime, but I would like to avoid it if possible. I was trying to make the Flink jobs slow down or pause the (Kafka) input sources until the migration finishes, but I didn't find any endpoint for this.
The Flink jobs run in application mode.
If anyone can shed some light on how to accomplish this (pause/resume the jobs via an API or something similar), I would really appreciate the input. The only constraint I have is around stopping the applications (as in stopping/killing pods): it's possible, but too troublesome due to access constraints on the Kubernetes clusters.
I'd probably look into stopping the job with a savepoint using the Flink REST API: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-stop
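As a rough illustration, a script along these lines could trigger the stop-with-savepoint and poll for completion. This is a sketch against the documented REST endpoints; the JobManager address, job id and savepoint directory are placeholders you'd replace with your own.

```python
# Minimal sketch: stop a Flink job with a savepoint via the REST API, then
# poll the savepoint status. Host, job id and target directory are placeholders.
import requests

FLINK = "http://jobmanager-host:8081"
JOB_ID = "<your-job-id>"

resp = requests.post(
    f"{FLINK}/jobs/{JOB_ID}/stop",
    json={"targetDirectory": "s3://my-bucket/savepoints", "drain": False},
)
resp.raise_for_status()
trigger_id = resp.json()["request-id"]

# Poll the asynchronous savepoint operation until it reports COMPLETED.
status = requests.get(f"{FLINK}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
print(status)
```

Once the migration is done, you can resubmit the job from the savepoint it produced.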
If that Flink app is pretty big and has lots of state, and you don't want to stop it, you can also try to just stop sending data to the input Kafka topic (assuming that it can properly write with the new mappings and indices after you've made the required ES cluster changes, without any change in the Flink job). It adds a bit of overhead, but you could have different topics for your producers and your Flink sources, and have another simple Flink job mirror data from one topic (where producers produce to) to the other (where Flink consumes from). When you want to stop writing to ES, just stop that mirroring job using the REST API. To avoid writing a new Flink job you could use MirrorMaker or similar, but to stop it you may have to kill its pod.
Or another option is architecting the Elasticsearch indices so they can support your cluster evolution without having to stop the Flink app. It is hard to know exactly what you'd need to change, but by writing into aliases and playing with the write-index flag you may be able to achieve what you want. I've done this in the past, but it is true that if your mappings change a lot it may be hard to do.
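For reference, the alias idea could look roughly like this: the Flink sink always writes to an alias, and the migration atomically flips which index is the write index behind it. This is a sketch against the standard _aliases endpoint; the cluster URL and index/alias names are made up for illustration.

```python
# Minimal sketch of the alias approach: writers target the alias "events-write",
# and the migration atomically moves the write-index flag from the old index to
# the new one. Cluster URL and index names are illustrative.
import requests

ES = "http://localhost:9200"

actions = {
    "actions": [
        {"add": {"index": "events-v1", "alias": "events-write", "is_write_index": False}},
        {"add": {"index": "events-v2", "alias": "events-write", "is_write_index": True}},
    ]
}
requests.post(f"{ES}/_aliases", json=actions).raise_for_status()
```

Because the alias swap is atomic, the Flink job keeps writing to "events-write" throughout and never needs to be redeployed for the switch itself.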

Exchange files (up to many GB)

For my project, I have to create a file manager which aims at storing many files (from many locations) and exposing URLs to download them.
In a microservice ecosystem (I am used to using Spring Boot), I wonder what the best way is to exchange such files, i.e. to send files to the file manager.
On the one hand, I have always thought it is better to exchange them asynchronously, so HTTP does not seem like a good choice. But maybe I am wrong.
Is it a good choice to split files into fragments (in order to reduce the number of bytes in each part) and send each of them through something like RabbitMQ or Kafka? Or should I rather transfer entire files to a NAS or through FTP and let the file manager handle them? Or something else, like storing the bytes in a temp database (maybe not a good choice)?
The problem with fragmentation is that I have to implement logic to keep track of and order the fragments, which complicates processing the queues or topics.
IMO, never send actual files through a message broker.
First, set up some object storage system, for example S3 (with AWS, or locally with Ceph). Then have the producer send the path to the file as a string, and have the consumer read that path and download the file.
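A minimal sketch of that pattern, assuming boto3 for S3 and kafka-python for the broker; the bucket, topic and broker names are placeholders:

```python
# Minimal sketch: upload the file to object storage, then publish only its
# location to the broker. Bucket, topic and broker names are illustrative.
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="broker-host:9092")

def publish_file(local_path: str, key: str) -> None:
    # The bytes go to S3; the message carries only the object's location.
    s3.upload_file(local_path, "file-manager-bucket", key)
    producer.send("incoming-files", f"s3://file-manager-bucket/{key}".encode("utf-8"))
    producer.flush()

# Consumer side: read the path from the topic, then download the object, e.g.
# s3.download_file("file-manager-bucket", key, "/tmp/local-copy")
```

This keeps messages tiny regardless of file size, so the broker never has to move gigabytes.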
If you want to collect files off of NAS or FTP, then Apache NiFi is one tool that has connectors to systems like that.
Based on my professional experience working with distributed systems (JMS-based), to transfer huge content between participants:
a fragmentation approach should be used for the request-reply model, plus control signals (has-next, fragment counter)
a delta approach for updates.
To avoid corrupt data, a hash function result can also be transmitted and checked in both scenarios.
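For illustration, a fragmentation scheme along those lines might look like the sketch below: numbered chunks with a has-next flag, plus a checksum the receiver can verify after reassembly. The 1 MiB chunk size and the message shape are assumptions, not a fixed protocol.

```python
# Minimal sketch of the fragmentation approach: split a file into numbered
# chunks with a has-next flag, and emit a SHA-256 digest at the end so the
# receiver can verify the reassembled file. Chunk size is illustrative.
import hashlib
import os

CHUNK_SIZE = 1024 * 1024  # 1 MiB per fragment

def fragments(path):
    total = os.path.getsize(path)
    digest = hashlib.sha256()
    sent = 0
    counter = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
            sent += len(chunk)
            yield {
                "fragment_counter": counter,
                "has_next": sent < total,
                "payload": chunk,
            }
            counter += 1
    # Final control message: the hash the receiver checks after reassembly.
    yield {"sha256": digest.hexdigest(), "has_next": False}
```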
But as mentioned in this e-mail thread, a better approach is to use FTP for this kind of scenario:
RabbitMQ should actually not be used for big file transfers or only
with great care and fragmenting the files into smaller separate
messages.
When running a single broker instance, you'd still be safe, but in a
clustered setup, very big messages will break the cluster.
Clustered nodes are connected via 1 tcp connection, which must also
transport a (erlang) heartbeat. If your big message takes more time to
transfer between nodes than the heartbeat timeout (anywhere between
~20-45 seconds if I'm correct), the cluster will break and your
message is lost.
The preferred architecture for file transfer over AMQP is to just send
a message with a link to a downloadable resource and let the file
transfer be handled by a specialized protocol like FTP :-)
Hope it helps.

Concurrent batch jobs writing logs to database

My production system has nearly 170 Ctrl-M jobs (essentially cron jobs) running every day. These jobs are woven together (by creating dependencies) to perform ETL operations. E.g. a Ctrl-M job (Ctrl-M is a scheduler like cron) almost always starts with a shell script, which then executes a bunch of Python scripts, Hive scripts or MapReduce jobs in a specific order.
I am trying to implement logging in each of these processes to be able to better monitor the tasks and the pipelines as a whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which is called by each of the processes to log information. This wrapper in turn opens up a Teradata connection EACH time and calls a Teradata stored procedure to write into a Teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time, and I have started experiencing dropped connections while doing some load testing. Below are the approaches I have been thinking about:
Make processes write to some kind of message queue (e.g. AWS SQS). A listener would pick data from these message queues asynchronously and then batch-write to Teradata (a minimal sketch of this option follows below).
Using files or some structure to perform batch writing to the Teradata DB.
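To illustrate the first option, a sketch of the producer and listener sides using boto3; the queue URL and message shape are placeholders, and the actual batch insert into the warehouse is left out.

```python
# Minimal sketch of the SQS option: each job drops a small log event on the
# queue, and a separate listener drains the queue in batches before doing one
# bulk write to the warehouse. Queue URL and message fields are illustrative.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-logs"

def log_event(job_name, status, message):
    # Called from the wrapper instead of opening a Teradata connection each time.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"job": job_name, "status": status, "msg": message}),
    )

def drain_batch(max_messages=10):
    # Listener side: pull up to 10 messages, batch-insert them, then delete them.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=max_messages)
    return [json.loads(m["Body"]) for m in resp.get("Messages", [])]
```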
I would definitely like to hear your thoughts on that, or any other better approaches. Eventually the endpoint of the logging will be shifted to Redshift, hence I am thinking along the lines of AWS SQS queues.
Thanks in advance.
I think Kinesis Firehose is the perfect solution for this. Setting up a Firehose stream is incredibly quick and easy to configure, very inexpensive, and it will stream your data to an S3 bucket of your choice and optionally stream your logs directly to Redshift.
If Redshift is your end goal (or even just S3), Kinesis Firehose couldn't make it easier.
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into
AWS. It can capture and automatically load streaming data into Amazon
S3 and Amazon Redshift, enabling near real-time analytics with
existing business intelligence tools and dashboards you’re already
using today. It is a fully managed service that automatically scales
to match the throughput of your data and requires no ongoing
administration. It can also batch, compress, and encrypt the data
before loading it, minimizing the amount of storage used at the
destination and increasing security. You can easily create a Firehose
delivery stream from the AWS Management Console, configure it with a
few clicks, and start sending data to the stream from hundreds of
thousands of data sources to be loaded continuously to AWS – all in
just a few minutes.
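For a sense of how little code the producer side needs, here is a sketch using boto3; the delivery stream name is a placeholder and the stream itself is assumed to already exist (configured in the console to deliver to S3/Redshift).

```python
# Minimal sketch: push each log record to a Firehose delivery stream, which
# batches and delivers it to S3 and/or Redshift. The stream name is
# illustrative and assumed to be configured already.
import json
import boto3

firehose = boto3.client("firehose")

def ship_log(record: dict) -> None:
    firehose.put_record(
        DeliveryStreamName="etl-job-logs",
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

ship_log({"job": "daily_load", "status": "OK", "duration_s": 42})
```

Firehose then handles the batching, compression and delivery, so there is no listener process for you to operate.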

UDP data stream handling with MapReduce

I have a problem with real-time UDP stream processing in a MapReduce system. I am doing a university project, and I want to use MapReduce to process this data. The UDP stream carries ship data from several AIS devices.
As far as I am aware, Apache Storm would be the solution for that. But I don't know whether I can incorporate MapReduce into Storm. I want to incorporate MapReduce concepts, and ultimately I want to learn them.
I would also like some advice about the system architecture. The normal procedure is this:
UDP stream received by the system
decode the stream
real-time analytics should be shown
data stored for future retrieval purposes.
So can anyone suggest the best way to do this? Can Apache Storm do this?
I'll answer the easy question first: Yes, Apache Storm can do what you want it to do.
That said, any of the other 'big data' streaming tools can do this data processing as well. These tools include Storm, but also Spark and Samza.
If I were building this myself, I'd push the streaming data into a messaging queue, probably Kafka, then use Storm to pull individual messages out and process them. You can then store the result however you want. That could be onto disk, back into Kafka, or whatever makes sense in your case.
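To sketch the front of that pipeline: a small bridge process can receive the raw AIS sentences over UDP and push each one onto a Kafka topic for Storm (or any stream processor) to consume. The port, broker address and topic name below are assumptions, not anything fixed by your setup.

```python
# Minimal sketch: receive AIS sentences over UDP and forward each datagram to
# a Kafka topic. Port 10110 (a common NMEA-over-UDP port), the broker address
# and the topic name are illustrative; decoding happens downstream.
import socket

from kafka import KafkaProducer

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 10110))

producer = KafkaProducer(bootstrap_servers="broker-host:9092")

while True:
    data, _addr = sock.recvfrom(4096)   # assumes one AIS sentence per datagram
    producer.send("ais-raw", data)      # Storm consumes and decodes from here
```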
Finally, it doesn't seem that MapReduce is a good fit for your problem. MapReduce is for batch processing, which isn't what you are describing.

Apache Kafka vs Apache Storm

Apache Kafka: Distributed messaging system
Apache Storm: Real Time Message Processing
How can we use both technologies in a real-time data pipeline for processing event data?
In terms of a real-time data pipeline, both seem to me to do the same job. How can we use both technologies in a data pipeline?
You use Apache Kafka as a distributed and robust queue that can handle high volume data and enables you to pass messages from one end-point to another.
Storm is not a queue. It is a system with distributed real-time processing abilities, meaning you can execute all kinds of manipulations on real-time data in parallel.
The common flow of these tools (as I know it) goes as follows:
real-time-system --> Kafka --> Storm --> NoSql --> BI(optional)
So you have your real-time app handling high-volume data, which sends it to the Kafka queue. Storm pulls the data from Kafka and applies the required manipulation. At this point you usually want to get some benefit from this data, so you either send it to some NoSQL DB for additional BI calculations, or you simply query that NoSQL store from any other system.
I know that this is an older thread, and the comparisons of Apache Kafka and Storm were valid and correct when they were written, but it is worth noting that Apache Kafka has evolved a lot over the years. Since version 0.10 (April 2016) Kafka has included a Kafka Streams API which provides stream processing capabilities without the need for any additional software such as Storm. Kafka also includes the Connect API for connecting to various sources and sinks (destinations) of data.
Announcement blog - https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Current Apache documentation - https://kafka.apache.org/documentation/streams/
In Kafka 0.11 the stream processing functionality was further expanded to provide exactly-once semantics and transactions.
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
Kafka and Storm have a slightly different purpose:
Kafka is a distributed message broker which can handle a large volume of messages per second. It uses the publish-subscribe paradigm and relies on topics and partitions. Kafka uses ZooKeeper to share and save state between brokers. So Kafka is basically responsible for transferring messages from one machine to another.
Storm is a scalable, fault-tolerant, real-time analytics system (think of it like Hadoop in real time). It consumes data from sources (spouts) and passes it through a pipeline (bolts). You can combine them in a topology. So Storm is basically a computation unit (aggregation, machine learning).
But you can use them together: for example, your application uses Kafka to send data to other servers, which use Storm to perform some computation on it.
This is how it works
Kafka - To provide a realtime stream
Storm - To perform some operations on that stream
You might take a look at the GitHub project https://github.com/abhishekgoel137/kafka-nodejs-d3js.
(D3js is a graph-representation library)
Ideal case:
Realtime application -> Kafka -> Storm -> NoSQL -> d3js
This repository is based on:
Realtime application -> Kafka -> <plain Node.js> -> NoSQL -> d3js
As everyone has explained:
Apache Kafka: a continuous messaging queue
Apache Storm: a continuous processing tool
In this setup, Kafka gets the data from any website like Facebook or Twitter using APIs, that data is processed using Apache Storm, and you can store the processed data in any database you like.
https://github.com/miguno/kafka-storm-starter
Just follow it and you will get some idea.
When I have a use case that requires me to visualize or alert on patterns (think of Twitter trends) while continuing to process the events, I have several patterns.
NiFi would allow me to process an event and update a persistent data store with low(er) batch aggregation with very, very little custom coding.
Storm (lots of custom coding) allows me nearly real time access to the trending events.
If I can wait for many seconds, then I can batch out of Kafka into HDFS (Parquet) and process.
If I need to know in seconds, I need NiFi, and probably even Storm. (Think of monitoring thousands of earth stations, where I need to see small region weather conditions for tornado warnings).
Simply put, Kafka sends the messages from one node to another, and Storm processes the messages. Check this example of how you can integrate Apache Kafka with Storm.