Big data implementation on cloud - hortonworks-data-platform

Could someone please explain what 'Big Data implementation over Cloud' means?
I have been using Amazon S3 to store data and Hive to query it, which I have read is one kind of cloud implementation. I would like to know what exactly the term means and what the possible ways to implement it are.
Thanks,
Sree

A cloud provider can offer the following levels of service for a big data analytics solution:
Data platform infrastructure service, such as Hadoop as a Service, that provides pre-installed and managed infrastructures. With this level of service, you are responsible for loading, governing, and managing the data and analytics for the analytics solution.
Data management service, such as a Data Lake Service, that provides data management, catalog services, analytics development, security, and information governance services on top of one or more data platforms. With this level of service, you are responsible for defining the policies for how data is managed and for connecting data sources to the cloud solution. The data owners have direct control of how their data is loaded, secured, and used. Consumers of data are able to use the catalog to locate the data they want, request access, and make use of the data through self-service interfaces.
Insight and Data Service, such as a Customer Analytics Service, that gives you the responsibility for connecting data sources to the cloud solution. The cloud solution then provides APIs to access combinations of your data and additional data sources, both proprietary to the solution and public open data, along with analytical insight generated from this data.
For more information regarding this, read the detailed article published by IBM here: http://www.ibm.com/developerworks/cloud/library/cl-ibm-leads-building-big-data-analytics-solutions-cloud-trs/index.html
Also take a look at the services provided by Qubole, which simplify, speed up, and scale big data analytics workloads against data stored on AWS, Google, or Azure clouds: https://www.qubole.com/features

Storing and processing big volumes of data requires scalability plus availability. Cloud computing delivers both through hardware virtualization. For that reason, big data and cloud computing are compatible concepts: the cloud enables big data to be available, scalable, and fault tolerant.
Beyond that, many companies now offer Big Data as a Service (BDaaS), such as Stratoscale, Cloudera, and of course Azure.

Related

Should event driven architecture be targeted for all data & analytics platforms?

For example,
You have an IT estate where a mix of batch and real-time data sources exists from multiple systems, e.g. ERP, Project management, asset, website, monitoring etc.
The aim is to integrate the data sources into a cloud environment (agnostic).
There is a need for reporting and analytics on combinations of all data sources.
Inevitably, some source systems are not capable of streaming, hence batch loading is required.
There are potential use cases for triggering functionality, changes, or updates based on the ingested data.
Given a steer to create a future-proofed platform, how would you design it architecturally?

It's a very open-ended question, but there are some good principles you can adopt to point you in the right direction:
Avoid point-to-point integration, and get everything going through a few common points, ideally one. An API gateway can be a good place to start: the big players (Azure, AWS, GCP) all have their own options, and there are plenty of decent independent ones such as Tyk or Kong.
Batches and event streams are totally different, but you can still potentially route them all through the gateway so that you get centralised observability (reporting, analytics, alerting, etc.).
Use standards-based API specifications where possible. A good REST API based on a proper resource model is a non-trivial undertaking, and I'm not sure it fits what you are doing if you are dealing with lots of disparate legacy integration. If you are going to adopt REST, use OpenAPI to specify the APIs. Using this standard not only makes things easier for consumers but also gives you better tooling, as many design, build, and test tools support OpenAPI. There's also AsyncAPI for event/async APIs.
Do some architecture. Moving sh*t to cloud doesn't remove the sh*t - it just moves it to the cloud. Don't recreate old problems in a new place.
Work out the logical components in your new solution: what does each of them do (what's its reason to exist)? Don't forget ancillary components like API catalogues, etc.
Think about layering the integration (usually depending on how they will be consumed and what role they need to play, e.g. system interface, orchestration, experience APIs, etc).
Want to handle data in a consistent way regardless of source (your 'agnostic' comment)? You'll need to think through how data is ingested and processed. This might lead you into more data / ETL centric considerations rather than integration ones.
Co-design. Is the integration mainly data coming in or going out? Is the integration with 3rd parties or strictly internal?
If you are designing for external / 3rd party consumers then a co-design process is advised, since you're essentially designing the API for them.
If the APIs are for internal use, consider designing them for external use so that if you decide to open them up later it's not so hard.
Take a step back:
Continually ask yourselves "what problem are we trying to solve?". Usually, a technology initiative is successful if there's a well-understood reason for doing it, which has solid buy-in from the business (non-IT).
Who wants the reporting, and why - what problem are they trying to solve?
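
To make the point about handling data consistently regardless of source more concrete, here is a minimal sketch in Python. It assumes a CSV export as the batch source and JSON messages as the streaming source; the Event envelope, the function names, and the sample data are all hypothetical and not taken from any particular platform.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """Common envelope used by every downstream consumer (hypothetical)."""
    source: str    # originating system, e.g. "erp"
    mode: str      # "batch" or "stream"
    payload: dict  # the record itself


def from_batch_csv(source: str, text: str) -> Iterator[Event]:
    """Turn a CSV export (batch load) into a sequence of events."""
    for row in csv.DictReader(io.StringIO(text)):
        yield Event(source=source, mode="batch", payload=dict(row))


def from_stream_json(source: str, messages: Iterable[str]) -> Iterator[Event]:
    """Turn JSON messages (streaming source) into the same envelope."""
    for message in messages:
        yield Event(source=source, mode="stream", payload=json.loads(message))


def ingest(events: Iterable[Event], sink: list) -> None:
    """Single entry point: the sink does not need to know how data arrived."""
    for event in events:
        sink.append(event)


sink: list = []
ingest(from_batch_csv("erp", "order_id,amount\n1,10.50\n2,7.25\n"), sink)
ingest(from_stream_json("website", ['{"page": "/home", "user": "a"}']), sink)
```

Reporting and analytics code then only ever sees Event objects, so adding a new source means writing one more adapter rather than touching the consumers.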

As you mentioned, this is an IT estate, i.e. an enterprise-level solution with a mix of batch and real-time sources, so first you have to identify the end goal of this migration. You can think about refactoring applications. If you are trying to make it event driven, assess the refactoring effort and cost. Separation of responsibility is the key factor for refactoring and migration.
If you are thinking about future-proofing your solution, consider the cloud for storing and processing your data. It will not necessarily be cheap, but a mix of cloud and on-prem could be a way forward. Cloud providers offer services that move your data at minimal cost, and there are cloud-native solutions for analysing it. A database migration service in AWS or Azure can move the data and then capture ongoing changes, so you can keep using your on-prem DB and apps and perform the analysis for reporting in the cloud. This eases the load on your transactional DB. Most data sync from on-prem to cloud is near real time.
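
The "move the data, then capture ongoing changes" pattern can be sketched as follows. This is only an illustration of the idea in plain Python, not the API of any AWS or Azure migration service; the change-event shape (operation, key, row) and the function names are assumptions made for the example.

```python
from typing import Dict, Iterable, Optional, Tuple

# A change event: (operation, primary key, new row or None for deletes).
Change = Tuple[str, int, Optional[dict]]


def initial_load(source_rows: Dict[int, dict]) -> Dict[int, dict]:
    """Full copy of the on-prem table into the cloud-side replica."""
    return {key: dict(row) for key, row in source_rows.items()}


def apply_changes(replica: Dict[int, dict], changes: Iterable[Change]) -> None:
    """Replay captured changes, in order, on top of the initial load."""
    for operation, key, row in changes:
        if operation in ("insert", "update"):
            replica[key] = dict(row)
        elif operation == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")


on_prem = {1: {"customer": "acme", "total": 100}}
replica = initial_load(on_prem)

# Changes that happened on-prem after the snapshot was taken.
captured = [
    ("insert", 2, {"customer": "globex", "total": 40}),
    ("update", 1, {"customer": "acme", "total": 120}),
    ("delete", 2, None),
]
apply_changes(replica, captured)
```

Reporting queries run against the replica, which is why the transactional database sees less load.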

Spring Cloud Netflix & Spring Cloud Data Flow microservice architecture

I'm developing an application that must both handle events coming from other systems and provide a REST API. I want to split the application into microservices and I'm trying to figure out which approach to use. I have looked at Spring Cloud Netflix and the Spring Cloud Data Flow toolkit, but it's not clear to me whether they can be integrated and how.
As an example, suppose we have the following functionality in the system:
1. information about users
2. card of orders
3. product catalog
4. sending various notifications
5. obtaining information about the orders from third-party systems
6. processing, filtering, and transformation of order events
7. processing of various rules based on orders and sending notifications
8. sending information about user orders from third-party systems to other users using websockets (with pre-filtering)
Points 1-4: here I see the classical microservice architecture. Framework: the Spring Cloud Netflix stack.
Points 5-9: an event-driven approach seems best. Toolkit: Spring Cloud Data Flow.
The question is how to build communication between these platforms.
In particular, POPULATE ORDER DETAILS SERVICE must transform the incoming orders and save additional information (if needed) in the database. ORDER RULE EXECUTOR SERVICE should obtain the currently saved rules, execute them, and send notifications. WEB SOCKET SERVICE should send order information only if a particular user has set filters, and ORDER SAVER SERVICE should store the transformed orders in the database.
1. Communication between the microservices of the two platforms could go through the API gateway, but in that case I have the following questions:
Does the Spring Cloud platform allow working with microservices that way?
Performance: the number of events is very large, and routing them through the gateway could significantly slow down processing. Is it possible to use other approaches, for example communicating through an in-memory cache instead of the API gateway?
2. Since some functionality intersects between these services, I have a question about what a "microservice" is in the Spring Cloud Stream framework. In particular, does it make sense to have separate services? Can a microservice in Spring Cloud Stream have a REST API, work with the database, and process events at the same time? Does such a diagram make sense, and is it possible to build such a stack at the moment?
Which of these approaches is more correct? What does Spring Cloud Stream mean by "microservice"?

Given the limited information in the post, it is hard to comment on every aspect of this type of architecture, but I'll attempt to share some specifics and point to samples. For the same reason, it is hard to solve for your needs end-to-end. On the surface, it appears you're attempting to build event-driven applications and wondering whether Spring Cloud Stream (SCSt) and Spring Cloud Data Flow (SCDF) could help.
They can, yes.
Order, User, and Catalog seem like domain objects that would all come together to solve a use case, for instance querying the number of orders for a particular product, grouped by user. There are a few samples that articulate the data flow between the entities to solve similar problems. Here's a live code-walkthrough of event-driven systems in action. There's another example of a social-graph application, too.
Though these event-driven applications can run standalone as individual services with the help of a message broker (e.g. Kafka or RabbitMQ), you could of course also register them in SCDF and use them in the SCDF DSL to build a coherent data pipeline. We are expanding on more direct capabilities in SCDF for these types of use cases, but there are ways to orchestrate them today with current abilities, too. Follow spring-cloud/spring-cloud-#2331#issuecomment-406444350 for more details.
I hope this gives an idea. Try to build something small using SCSt/SCDF, prove it out, and expand to more complex use-cases.
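
To show the shape of "something small", here is a language-neutral sketch of the event flow from the question, written in Python with an in-memory stand-in for the message broker. It does not use Spring Cloud Stream or SCDF APIs (those are Java); the InMemoryBroker class, the topic names, and the "total over 100" rule are all made up for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for Kafka/RabbitMQ topics: synchronous fan-out to subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in list(self._subscribers[topic]):
            handler(event)


broker = InMemoryBroker()
saved_orders: List[dict] = []   # stands in for the orders database
notifications: List[str] = []   # stands in for the notification channel


def populate_order_details(order: dict) -> None:
    """Processor: enrich the raw order and forward it."""
    enriched = {**order, "total": order["quantity"] * order["unit_price"]}
    broker.publish("orders.enriched", enriched)


def order_saver(order: dict) -> None:
    """Sink: persist the transformed order."""
    saved_orders.append(order)


def order_rule_executor(order: dict) -> None:
    """Sink: apply a (hypothetical) rule and emit a notification."""
    if order["total"] > 100:
        notifications.append(f"Large order {order['id']} from {order['user']}")


broker.subscribe("orders.raw", populate_order_details)
broker.subscribe("orders.enriched", order_saver)
broker.subscribe("orders.enriched", order_rule_executor)

broker.publish("orders.raw", {"id": 1, "user": "alice", "quantity": 3, "unit_price": 50})
broker.publish("orders.raw", {"id": 2, "user": "bob", "quantity": 1, "unit_price": 20})
```

Each handler only knows the topics it reads from and writes to, which is what lets the individual services be deployed and scaled separately once a real broker sits in between.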

How can we get high availability in prometheus data store?

I am new to Prometheus, so I am not sure whether high availability is part of the Prometheus data store (TSDB). I am not looking for something like two Prometheus server instances scraping the same exporter, as that is likely to leave two TSDB data stores that are out of sync.

It really depends on your requirements.
Do you need highly available alerting on your metrics? Prometheus can do that.
Do you need a highly available monitoring system that contains the last few hours of data for operational triage? Two prometheus instances are pretty good for that too.
Do you need long-term storage of time-series data? Prometheus is not designed to accomplish this on its own. Either use the remote write functionality of Prometheus to ship data to another TSDB that supports redundant storage (InfluxDB and ClickHouse are pretty promising here), in which case you are on the hook for de-duping data, or consider Cortex.
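
As a rough illustration of what "de-duping" means when two Prometheus replicas ship the same series to one store, here is a minimal Python sketch. It assumes samples from both replicas carry identical timestamps, which is a simplification: real replicas scrape at slightly different moments, so production systems deduplicate differently. The sample tuple layout and replica names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (series identifier, timestamp in ms, value, replica name)
Sample = Tuple[str, int, float, str]


def deduplicate(samples: Iterable[Sample]) -> List[Tuple[str, int, float]]:
    """Keep the first value seen for each (series, timestamp) pair."""
    seen: Dict[Tuple[str, int], float] = {}
    for series, timestamp, value, _replica in samples:
        seen.setdefault((series, timestamp), value)
    return [(series, ts, value) for (series, ts), value in sorted(seen.items())]


incoming = [
    ("up{job='api'}", 1000, 1.0, "prom-a"),
    ("up{job='api'}", 1000, 1.0, "prom-b"),  # duplicate of the sample above
    ("up{job='api'}", 2000, 1.0, "prom-b"),  # prom-a missed this scrape
]
merged = deduplicate(incoming)
```

The merged series has no duplicates and also fills the gap left by the replica that missed a scrape, which is the main benefit of running two instances.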

For a Kubernetes setup using kube-prometheus (prometheus-operator), you can configure this through its values.
Including Thanos would also help in this situation.

There is prometheus-postgresql-adapter, which allows you to use PostgreSQL / TimescaleDB as remote storage. The adapter enables multiple Prometheus instances (an HA setup) to write to a single remote storage, so you have one source of truth. I recently published a blog post about it: How to manage Prometheus high-availability with PostgreSQL + TimescaleDB (https://blog.timescale.com/blog/prometheus-ha-postgresql-8de68d19b6f5/).
Disclaimer: I am one of the engineers behind the adapter

Hazelcast data isolation ("Memory Regions")

We are building a multi tenant application which has restrictions on the regions/countries where the data is persisted.
The application is based on the Microsoft .NET microservice architecture, but we have shared domains, although we have separate DBs at the lowest levels, say a separate DB for each city. We cannot persist the data of one country in another country's data center. Hazelcast will be used as the distributed cache. I could not find any direct way to configure data isolation, e.g. something like "Memory Regions" in Apache Ignite. Does Hazelcast have "Memory Regions"?
I need to write the data behind from the cache to the respective database. Can I segregate a part/partition of the cache for a specific database instance?
Any help would be greatly appreciated. Thanks in advance.

I am not directly answering your question. As I understand it, when data is stored across different clusters / nodes there will still be a network call, even if you use key formats that keep related data within the same cluster / node.
Based on my experience, you could easily set up a MemoryCache, which comes as part of System.Runtime.Caching, to store the data in every node, and then use Redis Pub/Sub or Azure Service Bus as the backbone for the pub-sub.
In that case,
any data that is updated in a cache is announced to all the other instances of the application via a Service Bus / Redis message, which typically carries the key.
Upon receipt of the key, each application clears out its internal cache and then caches the data again on the next DB access.
This method is common in multi-tenant applications and is also fail-safe and lightweight. The payloads / network transfers are small, and each AppDomain uses its internal memory as a cache, which does support different regions via different instances of MemoryCache.
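
The invalidation flow described above can be sketched in a few lines. This is a plain Python illustration with an in-process bus standing in for Redis Pub/Sub or Azure Service Bus, and a dict standing in for the database; the class names are hypothetical, and the sketch evicts only the affected key rather than clearing the whole local cache.

```python
from typing import Dict, List


class InvalidationBus:
    """In-process stand-in for the pub-sub backbone."""

    def __init__(self) -> None:
        self._nodes: List["AppNode"] = []

    def subscribe(self, node: "AppNode") -> None:
        self._nodes.append(node)

    def publish(self, key: str, sender: "AppNode") -> None:
        # The message carries only the key, so payloads stay small.
        for node in self._nodes:
            if node is not sender:
                node.on_invalidate(key)


class AppNode:
    """One application instance with its own local cache."""

    def __init__(self, bus: InvalidationBus, database: Dict[str, str]) -> None:
        self.cache: Dict[str, str] = {}
        self._bus = bus
        self._database = database
        bus.subscribe(self)

    def get(self, key: str) -> str:
        if key not in self.cache:               # cache miss: read through to the DB
            self.cache[key] = self._database[key]
        return self.cache[key]

    def update(self, key: str, value: str) -> None:
        self._database[key] = value
        self.cache[key] = value
        self._bus.publish(key, sender=self)     # tell the other nodes

    def on_invalidate(self, key: str) -> None:
        self.cache.pop(key, None)               # reloaded on next access


database = {"tenant-1:theme": "light"}
bus = InvalidationBus()
node_a = AppNode(bus, database)
node_b = AppNode(bus, database)

first_read = node_a.get("tenant-1:theme")       # node_a caches "light"
node_b.update("tenant-1:theme", "dark")         # node_a's entry is evicted
second_read = node_a.get("tenant-1:theme")      # node_a reloads from the DB
```

For the regional-isolation requirement, one such bus and set of nodes per region would keep both the cached data and the invalidation traffic inside that region.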
Hope this helps if no direct response is available regarding Hazelcast.
Also, you may refer to this link for some details regarding Hazelcast.

Azure Technology Choice for Project

There is a lot of information out there about the various Azure data storage flavors; however, I'd like to ask for some advice for my particular scenario.
I'm putting together a pet project to become more familiar with Azure technology, in particular, Service Bus/Event Hubs and data storage platforms. The system I want to create is fairly simple: accept a moderate load of events (not IoT scale), persist them, and make aggregated data available such as 'User A had N events of type X in the past day/week/month/etc.' as reports.
Given that the data will be quite structured (e.g. users, user groups, events, etc.), and I will need aggregation capabilities, it suggests that relational storage may be the best fit, although more expensive.
Another alternative I've considered is to maintain aggregated data at near real-time using something like stream analytics but not sure if this is overkill compared to a more data warehouse-esque solution.
Any suggestions/help would be greatly appreciated.
John

John,
Azure SQL would be a decent choice or, if that proves too expensive, regular SQL Server hosted on a VM. You can create an Azure Service Bus queue to hold the incoming requests, and then create competing consumers on one or more worker roles to monitor and process the messages. Each consumer can run the SQL and persist the data in a new table that is created and "pre-aggregated" for the caller, or you could persist the information to Azure Blob storage in a structured format that matches your reporting tool (e.g. JSON). Blob storage of the aggregated information will be the most cost effective and will relieve strain on SQL.
An alternative would be HDInsight which can aggregate the information in batch processing mode as well. I guess the choice between SQL/HDInsight depends on the native format of the base (non-aggregated) information.
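
The competing-consumers idea with pre-aggregation can be sketched locally before committing to any Azure service. The following Python example uses the standard library queue and threads in place of Service Bus and worker roles; the event tuple layout (user, event type, day) and the function name are assumptions for illustration only.

```python
import queue
import threading
from collections import Counter
from typing import Iterable, Tuple

# (user, event type, day) - a hypothetical event shape
Event = Tuple[str, str, str]


def run_consumers(events: Iterable[Event], worker_count: int = 2) -> Counter:
    """Drain a shared queue with several workers and pre-aggregate counts."""
    work: "queue.Queue[Event]" = queue.Queue()
    for event in events:
        work.put(event)

    totals: Counter = Counter()
    lock = threading.Lock()

    def consume() -> None:
        while True:
            try:
                user, event_type, day = work.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            with lock:  # protect the shared aggregate
                totals[(user, event_type, day)] += 1

    workers = [threading.Thread(target=consume) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return totals


totals = run_consumers([
    ("user-a", "login", "2024-01-01"),
    ("user-a", "login", "2024-01-01"),
    ("user-a", "purchase", "2024-01-01"),
    ("user-b", "login", "2024-01-02"),
])
```

A report such as "User A had N events of type X in the past day" then becomes a lookup in the aggregate rather than a scan over raw events; weekly or monthly figures can be summed from the daily counts.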

I agree with Daniel. SQL Azure may be the way to go for your relational data needs. Another option to investigate for larger workloads for streaming and analytics is Azure Data Lake (https://azure.microsoft.com/en-us/solutions/data-lake/)
