Custom patterns for Stream Analytics blob storage - azure-blob-storage

My question is about saving data from Stream Analytics to Blob storage. In our system we are collecting clickstream data from many websites via Event Hubs. Then we do some light grouping and aggregating. After that we send the results to our Blob storage.
The problem is we want to separate our results into many blob containers by the ID of each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
Is there any way of achieving this?

This is a duplicate of these questions:
Stream Analytics: Dynamic output path based on message payload
Azure Stream Analytics -> how much control over path prefix do I really have?
The short version of the above is that you can't do this in Stream Analytics. If you have too many target paths for multiple sinks to be feasible, your best bet is to stream to a single Blob storage sink and process the results with something other than ASA. Azure Functions, WebJobs, or ADF tasks are a few possible solutions.
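As a rough illustration of that post-processing step, here is a minimal sketch, assuming the single sink holds newline-delimited JSON and that the records carry websiteID, eventDate, and eventTime fields (all field names are assumptions, not part of any fixed ASA contract):

```python
import json
from collections import defaultdict

def partition_by_website(lines):
    """Group newline-delimited JSON records from the single ASA sink
    into per-website blob paths of the form {websiteID}/{date}/{time}."""
    batches = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        # Field names here (websiteID, eventDate, eventTime) are
        # assumptions about the ASA output schema.
        path = "{}/{}/{}".format(
            record["websiteID"], record["eventDate"], record["eventTime"])
        batches[path].append(line)
    return dict(batches)

# In a real Azure Function you would then upload each batch with the
# azure-storage-blob SDK, e.g. container_client.upload_blob(path, data).
```

The grouping logic is the interesting part; the actual download/upload calls are left as comments because they depend on which runtime (Functions, WebJobs, ADF) you pick.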

The problem is we want to separate our results into many blob containers by the ID of each website. Right now we can only do it with a date and time pattern like /logs/{date}/{time}, but we want /{websiteID}/{date}/{time}.
As the official document stream-analytics-define-outputs mentions about the Path Prefix Pattern of the Blob storage output:
The file path used to write your blobs within the specified container.
Within the path, you may choose to use one or more instances of the following 2 variables to specify the frequency that blobs are written:
{date}, {time}
Example 1: cluster1/logs/{date}/{time}
Example 2: cluster1/logs/{date}
Based on my understanding, you could create multiple blob outputs from a single Stream Analytics job, one for each of your websites, and in your SQL-like query you could filter the event data and send it to the specific output. For more details, you could refer to Common query patterns.
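For example, a hedged sketch of such a multi-output query, assuming one configured blob output per website (the input and output names, field names, and window size are all placeholders):

```sql
-- Hypothetical job with one input (EventHubInput) and one blob output
-- per website (BlobOutputSite1, BlobOutputSite2, ...), each configured
-- with its own container or path prefix.
SELECT websiteID, System.Timestamp AS windowEnd, COUNT(*) AS clicks
INTO BlobOutputSite1
FROM EventHubInput TIMESTAMP BY eventTime
WHERE websiteID = 'site1'
GROUP BY websiteID, TumblingWindow(minute, 5)

SELECT websiteID, System.Timestamp AS windowEnd, COUNT(*) AS clicks
INTO BlobOutputSite2
FROM EventHubInput TIMESTAMP BY eventTime
WHERE websiteID = 'site2'
GROUP BY websiteID, TumblingWindow(minute, 5)
```

Note the obvious limitation: each website needs its own WHERE branch and its own configured output, which is why this only scales to a modest number of sites.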

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wi-Fi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight on sensor #1 increases, record it as "sensor #1 increased by x lbs @ XX:XX PM"; if there is no change, do nothing).
I need that change-event data (interpreted with the library from the raw data streams) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't tell which is best for my use case. Cost is not a key consideration. Thanks in advance!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database that stores the raw data in a way that can be queried with SQL.
For the data cleaning step, I think you will need a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database or back to Kinesis so that it can be consumed in a dashboard.
DynamoDB Streams works with DynamoDB. It streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database. DynamoDB is a NoSQL database, so you can exclude that service.
I am not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series DB like Amazon Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to Kinesis Data Streams (or SQS). Then you can have a Lambda triggered on messages received in Kinesis. This Lambda can process the data and store it in the DB you want.
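As a rough sketch of that Lambda idea, here is the core change-detection logic, assuming each Kinesis record carries a base64-encoded JSON payload with sensor_id, weight, and ts fields (all of these names are assumptions about the payload, not a fixed schema):

```python
import base64
import json

def detect_weight_events(event, last_weights):
    """Sketch of a Lambda handler body for a Kinesis trigger: decodes
    each record and emits a change event whenever a sensor's weight
    moves relative to its last known value."""
    events = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        sensor, weight = payload["sensor_id"], payload["weight"]
        previous = last_weights.get(sensor)
        if previous is not None and weight != previous:
            events.append("sensor {} changed by {:+} lbs at {}".format(
                sensor, weight - previous, payload["ts"]))
        last_weights[sensor] = weight
    return events
```

One caveat: `last_weights` must live somewhere durable between invocations (e.g. a DynamoDB item per sensor), since Lambda containers are not guaranteed to keep state in memory.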

How to use AWS Kinesis streams for multiple different data sources

We have a traditional batch application where we ingest data from multiple sources (Oracle, Salesforce, FTP Files, Web Logs etc.). We store the incoming data in S3 bucket and run Spark on EMR to process data and load on S3 and Redshift.
Now we are thinking of making this application near real time by bringing in AWS Kinesis and then using Spark Structured Streaming from EMR to process the streaming data and load it to S3 and Redshift. Given the wide variety of data we have, e.g. 100+ tables from Oracle, 100+ Salesforce objects, 20+ files coming from an FTP location, web logs, etc., what is the best way to use AWS Kinesis here?
1) Using a separate stream for each source (Salesforce, Oracle, FTP) and then a separate shard (within a stream) for each table/object - each consumer reads from its own shard, which holds a particular table/file.
2) Using a separate stream for each table/object - we will end up having 500+ streams in this scenario.
3) Using a single stream for everything - not sure how the consumer app would read data in this scenario.
Kinesis does not care what data you put into a stream; data is just a blob to Kinesis. It will be up to you to write (code) the producers and consumers for a stream. You could intermix different types of data in one stream, but the consumer will then need to figure out what each blob is and what to do with it.
I would break this into multiple streams based upon data type and priority of the data. This will make implementation and debugging a lot easier.
I think you are misunderstanding what shards are. They are for performance and not for data separation.
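To make the "consumers need to figure out what each blob is" point concrete, here is a minimal sketch of option 3, assuming every producer tags its records with a source field so a single consumer can dispatch them (all names here are illustrative, not a real schema):

```python
import json

# Hypothetical registry of per-source handlers for a single shared stream.
HANDLERS = {
    "oracle": lambda rec: ("oracle_table", rec),
    "salesforce": lambda rec: ("salesforce_object", rec),
    "weblog": lambda rec: ("weblog", rec),
}

def wrap(source, payload):
    """Producer side: tag every blob with its source before putting it
    on the stream (e.g. via put_record in boto3)."""
    return json.dumps({"source": source, "payload": payload})

def dispatch(raw):
    """Consumer side: inspect the tag and route to the matching handler."""
    rec = json.loads(raw)
    handler = HANDLERS[rec["source"]]
    return handler(rec["payload"])
```

With multiple streams per data type (the answer's recommendation), this dispatch layer disappears, which is part of why that option is easier to implement and debug.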

How best to implement a Dashboard from data in HDFS/Hadoop

I have a bunch of data in .csv format in Hadoop HDFS, several GBs in size. I have flight data for one airport. There are different delays like carrier delay, weather delay, NAS delay, etc.
I want to create a dashboard that reports on the contents in there, e.g. maximum delay on a particular route, maximum delay per flight, etc.
I am new to the Hadoop world.
Thank you.
You can try Hive. It is similar to SQL.
You can load the data from HDFS into tables using simple create table statements.
Hive also provides built-in functions which you can use to get the necessary results.
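A hedged sketch of what those create table statements and a delay query might look like (the column names, types, and HDFS path are assumptions based on the delays described in the question):

```sql
-- Hypothetical schema over the CSV files already sitting in HDFS.
CREATE EXTERNAL TABLE flights (
  flight_no      STRING,
  origin         STRING,
  dest           STRING,
  carrier_delay  INT,
  weather_delay  INT,
  nas_delay      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/flights';

-- Maximum total delay per route, one of the dashboard metrics asked for:
SELECT origin, dest,
       MAX(carrier_delay + weather_delay + nas_delay) AS max_delay
FROM flights
GROUP BY origin, dest;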
Many data visualization tools are available; some commonly used ones are:
Tableau
Qlik
Splunk
These tools provide the capability to create your own dashboards.

Large scale static file ( csv txt etc ) archiving solution

I am new to large-scale data analytics and archiving, so I thought I would ask this question to see whether I am looking at things the right way.
Current requirements:
I have a large number of static files in the filesystem: CSV, EML, TXT, JSON.
I need to warehouse this data for archiving / legal reasons.
I need to provide a unified search facility (the MAIN functionality).
Future requirements:
I need to enrich the data files with additional metadata.
I need to do analytics on the data.
I might need to ingest data from other sources via APIs, etc.
I would like to come up with a relatively simple solution with the possibility that I can expand it later with additional parts without having to rewrite bits. Ideally I would like to keep each part as a simple service.
As search is currently the KEY requirement and I am experienced with Elasticsearch, I thought I would use ES for distributed search.
I have the following questions:
Should I copy the files from static storage to Hadoop?
Is there any virtue in keeping the data in HBase instead of individual files?
Is there a way that, once a file is added to Hadoop, I can trigger an event to index the file into Elasticsearch?
Is there perhaps a simpler way to monitor hundreds of folders for new files and push them to Elasticsearch?
I am sure I am overcomplicating this as I am new to this field. Hence I would appreciate some ideas / directions I should explore to do something simple but future proof.
Thanks for looking!
Regards,

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that send logging data to my server. It's a Node server that essentially accepts a payload of log entries and saves them to their respective log files (as write-stream buffers, so it is fast), creating a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is good for something like Hadoop. The way I think this works is that I will store these logs on a server, then run a cron job that periodically moves the files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query them with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dump that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, part files will be produced. So you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them, but the joins are going to be slow. So you may want to define fillers/placeholders in your schema.
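A hedged sketch of what a schema with placeholder columns might look like in Hive, given the space-separated player event log described in the question (the table name, filler columns, and S3 location are illustrative):

```sql
-- Space-delimited log: timestamp user mediatype event, with spare
-- columns reserved for metrics that may be added later.
CREATE EXTERNAL TABLE player_events (
  ts        BIGINT,
  user_name STRING,   -- "user" is a reserved word in some Hive versions
  mediatype STRING,
  event     STRING,
  filler1   STRING,
  filler2   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3://your-bucket/logs/player/';
```

Old log lines simply leave the filler columns NULL, so adding a new metric later means repurposing a filler rather than rewriting historical data.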
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(Also test-drive Impala; it is faster.)
.. my 2c.
