Azure Technology Choice for Project - events

There is a lot of information out there about the various Azure data storage flavors; however, I'd like to ask for some advice on my particular scenario.
I'm putting together a pet project to become more familiar with Azure technology, in particular, Service Bus/Event Hubs and data storage platforms. The system I want to create is fairly simple: accept a moderate load of events (not IoT scale), persist them, and make aggregated data available such as 'User A had N events of type X in the past day/week/month/etc.' as reports.
Given that the data will be quite structured (e.g. users, user groups, events, etc.) and I will need aggregation capabilities, relational storage seems like the best fit, although it is more expensive.
Another alternative I've considered is maintaining the aggregated data in near real time using something like Stream Analytics, but I'm not sure whether that is overkill compared to a more data-warehouse-style solution.
Any suggestions/help would be greatly appreciated.
John

John,
Azure SQL would be a decent choice, or, if that proves to be too expensive, regular SQL Server hosted on a VM. You can create an Azure Service Bus queue to hold the incoming requests, and then create competing consumers on one or more worker roles to monitor and process the messages. Each consumer can run the SQL and persist the data in a new table that is created and "pre-aggregated" for the caller, or you could persist the information to Azure Blob storage in a structured format that matches your reporting tool (e.g. JSON). Blob storage of the aggregated information will be the most cost effective, and it relieves strain on SQL.
An alternative would be HDInsight, which can aggregate the information in batch-processing mode as well. I guess the choice between SQL and HDInsight depends on the native format of the base (non-aggregated) information.
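As a rough sketch of the competing-consumer idea above (assuming Python with the azure-servicebus and azure-storage-blob packages; the queue name, container, and event JSON shape are all hypothetical):

    import json
    from collections import Counter
    from azure.servicebus import ServiceBusClient
    from azure.storage.blob import BlobServiceClient

    SB_CONN = "..."    # Service Bus connection string
    BLOB_CONN = "..."  # storage account connection string

    def run_consumer():
        # Run this on each worker role; a Service Bus queue gives you the
        # competing-consumer behaviour for free.
        counts = Counter()  # (user_id, event_type) -> count
        client = ServiceBusClient.from_connection_string(SB_CONN)
        with client, client.get_queue_receiver(queue_name="events") as receiver:
            for msg in receiver.receive_messages(max_message_count=100, max_wait_time=5):
                event = json.loads(str(msg))
                counts[(event["user_id"], event["event_type"])] += 1
                receiver.complete_message(msg)  # remove the message once processed

        # Persist the "pre-aggregated" result to Blob storage as JSON,
        # in a shape the reporting tool can read directly.
        report = [{"user_id": u, "event_type": t, "count": n}
                  for (u, t), n in counts.items()]
        blob = BlobServiceClient.from_connection_string(BLOB_CONN) \
            .get_blob_client(container="reports", blob="daily-counts.json")
        blob.upload_blob(json.dumps(report), overwrite=True)

    if __name__ == "__main__":
        run_consumer()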

I agree with Daniel. SQL Azure may be the way to go for your relational data needs. Another option to investigate for larger streaming and analytics workloads is Azure Data Lake (https://azure.microsoft.com/en-us/solutions/data-lake/)

Related

Should event-driven architecture be targeted for all data & analytics platforms?

For example:
You have an IT estate with a mix of batch and real-time data sources from multiple systems, e.g. ERP, project management, asset management, website, monitoring, etc.
The aim is to integrate the data sources into a (cloud-agnostic) cloud environment.
There is a need for reporting and analytics on combinations of all the data sources.
Inevitably, some source systems are not capable of streaming, so batch loading is required.
There are potential use cases for performing functionality/changes/updates based on the ingested data.
As a steer for creating a future-proofed platform, how would you look to design it architecturally?
It's a very open-ended question, but there are some good principles you can adopt to point you in the right direction:
Avoid point-to-point integration, and get everything going through a few common points - ideally one. Using an API Gateway can be a good place to start; the big players (Azure, AWS, GCP) all have their own options, plus there are plenty of decent independent ones like Tyk or Kong.
Batches and event-streams are totally different, but even then you can still potentially route them all through the gateway so that you get the centralised observability (reporting, analytics, alerting, etc).
Use standards-based API specifications where possible. A good REST-based API, built on a proper resource model, is a non-trivial undertaking, and I'm not sure it fits if you are dealing with lots of disparate legacy integration. If you do adopt REST, use OpenAPI to specify the APIs. Using this standard not only makes it easier for consumers, but also gives you better tooling, as many design, build, and test tools support OpenAPI. There's also AsyncAPI for event/async APIs.
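For illustration only, here is a minimal OpenAPI 3.0 description sketched as a Python dict and printed as JSON; the /events resource and its schema are hypothetical:

    import json

    # A minimal, hypothetical OpenAPI 3.0 document for one resource.
    spec = {
        "openapi": "3.0.3",
        "info": {"title": "Events API", "version": "1.0.0"},
        "paths": {
            "/events": {
                "get": {
                    "summary": "List ingested events",
                    "responses": {
                        "200": {
                            "description": "A JSON array of events",
                            "content": {"application/json": {"schema": {
                                "type": "array",
                                "items": {"$ref": "#/components/schemas/Event"},
                            }}},
                        }
                    },
                }
            }
        },
        "components": {"schemas": {"Event": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "type": {"type": "string"},
                "occurredAt": {"type": "string", "format": "date-time"},
            },
        }}},
    }

    print(json.dumps(spec, indent=2))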
Do some architecture. Moving sh*t to cloud doesn't remove the sh*t - it just moves it to the cloud. Don't recreate old problems in a new place.
Work out the logical components in your new solution: what does each of them do (what's its reason to exist)? Don't forget ancillary components like API catalogues, etc.
Think about layering the integrations (usually depending on how they will be consumed and what role they need to play, e.g. system interfaces, orchestration, experience APIs, etc.).
Want to handle data in a consistent way regardless of source (your 'agnostic' comment)? You'll need to think through how data is ingested and processed. This might lead you into more data / ETL centric considerations rather than integration ones.
Co-design. Is the integration mainly data coming in or going out? Is the integration with 3rd parties or strictly internal?
If you are designing for external / 3rd party consumers then a co-design process is advised, since you're essentially designing the API for them.
If the API's are for internal use, consider designing them for external use so that when/if you decide to do that later it's not so hard.
Take a step back:
Continually ask yourselves "what problem are we trying to solve?". Usually, a technology initiative is successful if there's a well-understood reason for doing it, one with solid buy-in from the business (non-IT).
Who wants the reporting, and why - what problem are they trying to solve?
As you mentioned, it's an IT estate, i.e. an enterprise-level solution with a mix of batch and real-time sources, so first you have to identify the end goal of this migration. You can think about refactoring applications. If you are trying to make the estate event-driven, then assess the refactoring effort and cost. Separation of responsibility is the key factor for refactoring and migration.
If you are thinking about future-proofing your solution, then consider the cloud for storing and processing your data. It won't necessarily be cheap, but a mix of cloud and on-prem could be a way forward. Cloud providers offer services to move your data at minimal cost, and cloud-native solutions are there for performing analysis on your data. The database migration services in AWS or Azure can move the data and then capture ongoing changes, so you can keep using your on-prem DB and apps while performing analysis for reporting in the cloud. This will ease the load on your transactional DB, and most on-prem-to-cloud data sync is near real time.

How do I get from "Big Data" to a webpage?

I've spent a lot of time reading and watching videos of people talking about how they use tools designed for handling huge datasets and real-time processing in their architectures. And while I understand what tools like Hadoop/Cassandra/Kafka etc. do, no one seems to explain how the data gets from these large processing tools to rendering something on a client/webpage.
From what I understand of big data tools, you can't build your application the same way you would a standard web app querying MySQL, which makes sense given the size of the data that flows through these tools. However, for all this talk of "real-time data analytics", I cannot find any explanation of how the actual analytics get put in front of someone as a chart/table/etc.
"...explain how the data gets from these large processing tools to rendering something on a client/webpage."
With respect to this, one way would be to process the big data using Spark or Hadoop and store the results in an RDBMS. Then have your web app pull data from the RDBMS to render charts, tables, etc. I can share examples of what I have done myself if you need more information.
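If it helps, here is a rough sketch of that pattern, assuming PySpark for the aggregation and a MySQL reporting database reached over JDBC; the paths, table, and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("events-rollup").getOrCreate()

    # Raw event logs, one JSON object per line.
    events = spark.read.json("hdfs:///data/events/")

    # Aggregate down to something small enough for a web app to query.
    daily = (events
             .groupBy("user_id", "event_type", F.to_date("timestamp").alias("day"))
             .count())

    # Write the pre-aggregated result into the RDBMS the web app reads.
    daily.write.jdbc(
        url="jdbc:mysql://dbhost:3306/reports",
        table="daily_event_counts",
        mode="overwrite",
        properties={"user": "reporter", "password": "..."},
    )

The web app then renders its charts with ordinary SQL against daily_event_counts, exactly as it would against any MySQL table.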
Impala supports ODBC/JDBC interfaces, so you can actually hook up a web app to it the same way you do with MySQL.
Other things you might want to check out are HBase, Kudu, and Solr. In some real-time architectures the data ends up in one of those, and all of them have some sort of API that you can use in your web app to access the data.
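To make the "same way you do with MySQL" point concrete, here is a sketch using the impyla package, which exposes Impala through Python's standard DB-API; the host, port, and table are hypothetical:

    from impala.dbapi import connect

    # Connect to the Impala daemon just as you would to MySQL.
    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()
    cur.execute(
        "SELECT event_type, COUNT(*) AS n "
        "FROM events GROUP BY event_type"
    )
    for event_type, n in cur.fetchall():
        print(event_type, n)  # hand these rows to your chart/table renderer
    conn.close()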
If you want a simple solution for realtime data processing and analytics, check out the new Stride API, which enables developers to collect, process, and analyze streaming data and then either visualize summary data in Stride or push processed data out to applications in realtime. This is a very easy way to build the kind of realtime reporting dashboards and monitoring / alerting systems you described above.
Take a look at the Stride API technical docs for examples and more info on how to implement this.

Big data implementation on cloud

Could someone please explain what is meant by 'Big Data implementation over Cloud'?
I have been using Amazon S3 to store data and querying it using Hive, which I've read is one such cloud implementation. I would like to know what exactly this means and all the possible ways to implement it.
Thanks,
Sree
The following are the levels of service that a cloud provider can offer for a Big Data analytics solution:
Data platform infrastructure service, such as Hadoop as a Service, that provides pre-installed and managed infrastructure. With this level of service, you are responsible for loading, governing, and managing the data and analytics for the analytics solution.
Data management service, such as a Data Lake Service, that provides data management, catalog services, analytics development, security, and information governance services on top of one or more data platforms. With this level of service, you are responsible for defining the policies for how data is managed and for connecting data sources to the cloud solution. The data owners have direct control of how their data is loaded, secured, and used. Consumers of data are able to use the catalog to locate the data they want, request access, and make use of the data through self-service interfaces.
Insight and Data Service, such as a Customer Analytics Service, that gives you the responsibility for connecting data sources to the cloud solution. The cloud solution then provides APIs to access combinations of your data and additional data sources, both proprietary to the solution and public open data, along with analytical insight generated from this data.
For more information regarding this, read the detailed article published by IBM here: http://www.ibm.com/developerworks/cloud/library/cl-ibm-leads-building-big-data-analytics-solutions-cloud-trs/index.html
Also take a look at the services provided by Qubole, which greatly simplifies, speeds and scales big data analytics workloads against data stored on AWS, Google, or Azure clouds - https://www.qubole.com/features.
Storing and processing big volumes of data requires scalability plus availability. Cloud computing delivers both through hardware virtualization, so it is only logical that big data and cloud computing are two compatible concepts: the cloud enables big data to be available, scalable, and fault tolerant.
And the implementation does not stop there - many companies now offer Big Data as a Service (BDaaS), such as Stratoscale and Cloudera, and of course Azure and others.

In a Windows Azure application using a CQRS architecture, what is the optimal choice for the read and write databases?

Building an application in Windows Azure with a CQRS architecture, what would be the optimal storage solution for the read and write databases if you are looking for maximum performance?
Read: denormalized table
Write: event storage
Currently Windows Azure offers several options, such as:
Azure SQL database
Azure Table Storage
Hadoop
This depends on a few factors. Especially the write side depends on whether you use Event Sourcing or not.
(Edit: just re-read your question. Must have overlooked the event store bit.)
Have a look at Lokad.CQRS for an Azure-based example: http://lokad.github.com/lokad-cqrs/
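To give a feel for the write side on Azure Table Storage (a minimal sketch, not how Lokad.CQRS does it), here is an event-store append using the azure-data-tables package: one partition per aggregate/stream, with a zero-padded version as the row key so a partition scan returns the events in order. The table and property names are hypothetical.

    import json
    from azure.data.tables import TableServiceClient

    CONN = "..."  # storage account connection string

    def append_event(stream_id, version, event_type, payload):
        service = TableServiceClient.from_connection_string(CONN)
        table = service.create_table_if_not_exists("Events")
        table.create_entity({
            "PartitionKey": stream_id,        # one partition per aggregate
            "RowKey": f"{version:010d}",      # insert fails if this version exists,
            "EventType": event_type,          # which gives optimistic concurrency
            "Payload": json.dumps(payload),
        })

    append_event("order-42", 1, "OrderPlaced", {"total": 99.95})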

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services, or plain database stored procedures. Obviously, performance and stability are major issues that I am dealing with. Even when the endpoints are performing at their best, for large sets of data I easily see calls that take tens of seconds.
So I am trying to improve the performance of my application by prefetching the data and storing it locally, so that at least the read operations are fast.
While my application is the major consumer and producer of the data, some of it can also change from outside my application, which I have no control over. If I used caching, I would never know when to invalidate the cache when such data changes outside my application.
So I think my only option is to have a job scheduler running that continually updates the local database. I could prioritize users based on how often they log in and use the application.
I am talking about 50 thousand users and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the scheduler becoming a single point of failure?
I am just looking for something that doesn't require high maintenance and speeds up at least some of the less complicated subsystems, if not most of them. Any suggestions?
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need real-time access to the most up-to-date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
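If you did end up rolling the refresh yourself rather than using an ETL tool, the scheduled-refresh idea looks roughly like this; the sketch uses Python's APScheduler as a stand-in for the Quartz scheduler mentioned in the question, and the sources, intervals, and both helper functions are hypothetical:

    from apscheduler.schedulers.blocking import BlockingScheduler

    def fetch_from_source(source_name):
        # Placeholder for the slow SOAP/REST/stored-procedure call.
        return []

    def load_warehouse(source_name, rows):
        # Placeholder: upsert the fetched rows into the local warehouse.
        pass

    def refresh(source_name):
        load_warehouse(source_name, fetch_from_source(source_name))

    sched = BlockingScheduler()
    # Each source refreshes on its own schedule, matched to how stale
    # its data is allowed to be.
    sched.add_job(refresh, "interval", hours=1, args=["billing"])
    sched.add_job(refresh, "interval", hours=24, args=["crm"])
    sched.start()  # blocks; run it under a supervisor so a crash restarts it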
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008
