Enterprise Data Warehouse with NoSQL/Hadoop - "No RDBMS"

Are there any EDW (enterprise data warehouse) systems designed using a NoSQL/Hadoop solution?
I know there are PDW systems (MS PDW PolyBase, Greenplum HAWQ, etc.) which connect to HDFS subsystems, but these are proprietary hardware and software solutions that are expensive at scale. I am looking for a NoSQL or Hadoop based, preferably open-source, solution for an enterprise data warehouse. I would like to hear about your experiences if you have implemented one. Just to mention again: I am not looking for any type of proprietary RDBMS as a player in this EDW solution.
I did some research on the internet; it seems possible (Impala is one option), but I did not see that anyone has really implemented one entirely with NoSQL or Hadoop.
If you have done something of this type, I would like to hear how you designed it, which tools your business analysts use, and so on. If you can share your experience along the journey, that would be really appreciated.
Update:
How about VoltDB and NEOdb? They are not true RDBMSs, but they claim to support ANSI SQL to a large extent.

The first problem you will face in building an EDW on top of Hadoop is the fact that its storage is not updatable, so you should forget about the SQL UPDATE and DELETE commands; changes are usually applied by rewriting data at the partition level.
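For example, the usual workaround in Hive is to rewrite an entire partition rather than update rows in place. A minimal HiveQL sketch, where the `sales` and `sales_staging` tables and their columns are hypothetical:

```sql
-- Hypothetical tables: instead of UPDATE/DELETE, the whole day's partition
-- is recomputed and overwritten, since HDFS files cannot be changed in place.
INSERT OVERWRITE TABLE sales PARTITION (ds = '2015-06-01')
SELECT order_id, customer_id, amount
FROM sales_staging
WHERE ds = '2015-06-01';
```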
Second, a solution built on top of Hadoop is usually several times more expensive to maintain: the specialists cost more, and debugging is more complex (compare debugging a problem in a Hive query with debugging a SQL query problem in Oracle; which would be easier?).
Third, Hadoop usually gives you much less concurrency and much higher latency for any type of workload you put on top of it.
Given all of this, why do you think a DWH built on top of Hadoop is found only at really big enterprises like Facebook, Yahoo, eBay, LinkedIn and so on? Because it is not that simple to do, but once implemented it can be more scalable and more customizable than any proprietary solution.
So if you have firmly decided to go with Hadoop or another NoSQL solution to build your DWH, I would recommend the following:
- Use Hadoop HDFS as the base for data storage.
- Use Flume for loading data into HDFS.
- Use Hive with Tez for heavy ETL jobs.
- Provide Impala as a SQL query interface for analysts (see the sketch after this list).
- Provide Spark as an advanced instrument for analysts.
- Use Ambari for management and provisioning of all these tools together.

These tools together will cover most of your needs.
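To make the analyst-facing layer concrete, here is a rough sketch of exposing Flume-delivered HDFS files through a table that Impala (or Hive) can query; the table name, columns, and path are all assumptions:

```sql
-- Hypothetical: Flume writes tab-delimited log files under this HDFS path;
-- an external table makes them queryable with plain SQL.
CREATE EXTERNAL TABLE web_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/flume/web_logs';

-- Analysts can then run ordinary aggregations from Impala:
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```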

Related

Data Transformations in Snowflake - Views, Tools, etc.?

We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems - ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of this methodology - most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I'm just listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it suggests that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the 'T' part of things!
Just wondering what people's experience was here?
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only?
- Use a third-party tool?
I'm specifically interested in whether you do this to create some kind of time series in Snowflake from a non-time-series source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially without knowing your use case. For what it's worth, here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want to go for something more sustainable; this approach can become prohibitive in terms of cost and the complexity of your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there, like Stitch, Fivetran, etc. If you don't mind having on-premise technologies, why not stick with SSIS, or use something like Apache Airflow (which requires up-skilling)?
Snowflake will not help you with the 'E' of ELT; you will need a third-party tool such as SSIS to manage extracting data from your other systems. It will help with the 'L' part: for this you can use Snowpipe or the COPY command, both of which are available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
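For the 'L' part, a minimal sketch of the COPY approach, where the `raw.orders` table and `@order_stage` stage are hypothetical names:

```sql
-- Hypothetical names: load staged CSV files into a raw landing table.
COPY INTO raw.orders
FROM @order_stage
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```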
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, at which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need an additional tool. It also makes managing the code easier than something like SSIS: for instance, we can search for code using the Snowflake interface or our version-control tool instead of having to open individual SSIS packages.
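As an illustration of the views-as-'T' approach described above, a minimal hypothetical Snowflake view (schema, table, and column names are assumptions):

```sql
-- Hypothetical: one small piece of the 'T' expressed as a view over raw data.
CREATE OR REPLACE VIEW analytics.dim_customer AS
SELECT
    c.customer_id,
    INITCAP(c.customer_name)       AS customer_name,
    COALESCE(c.country, 'UNKNOWN') AS country
FROM raw.customers AS c;
```

Reporting tools then query the view, so the transformation logic lives in version-controllable SQL rather than in an ETL package.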

Apache NiFi for ETL

How effective is it to use Apache NiFi for an ETL process with HDFS as the source and an Oracle DB as the destination? What are the limitations of Apache NiFi compared to other ETL tools such as Pentaho, DataStage, etc.?
Main advantages of NiFi
- Intuitive GUI, which allows for easy inspection of the data
- Strong delivery guarantees
- Low latency; it can support both batch and streaming use cases
- It can handle any format; it is not limited to SQL tables and can also move log files, etc.
- Schema-aware, and can share schemas with solutions like Kafka, Flink, and Spark
Main limitation of NiFi
NiFi is really a tool for moving data around. You can do enrichments of individual records, but it is typically described as doing 'EtL' with a small 't'. A typical thing that you would not want to do in NiFi is joining two dynamic data sources.
For joining tables, tools like Spark, Hive, or classical ETL alternatives are often used.
For joining streams, tools like Flink and Spark Streaming are often used.
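As a rough sketch of that 'big T' done outside NiFi, here is a join in Hive or Spark SQL after NiFi has landed both datasets (all table and column names are hypothetical):

```sql
-- Hypothetical: NiFi moves 'orders' and 'customers' into the warehouse;
-- the join itself is done by a batch engine such as Hive or Spark SQL.
CREATE TABLE enriched_orders AS
SELECT o.order_id, o.amount, c.segment
FROM orders o
JOIN customers c
  ON c.customer_id = o.customer_id;
```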
Conclusion
NiFi is a great tool; you just need to make sure you use it for the right use case. Where needed, you can complement it with other tools.
Extra strong full disclosure: I am an employee of Cloudera, the company that supports NiFi and other projects such as Spark and Flink. I have used other ETL tools before, but not to the same extent as NiFi.
I'm not sure about Sqoop, but I can explain the benefits of using Apache NiFi. In your case the data in HDFS could be in any (unstructured) format; NiFi has the capability to process it and bring it into a format of your choice so that you can save it directly to any RDBMS.
NiFi handles back-pressure in a very effective way, giving you lossless transmission.
One of the critical features that NiFi provides that our competitors generally don't is the ability to stop jobs and examine the flow and downstream systems while it's running. For you, this means you can test the flow against a test HDFS folder and a test Oracle DB, let some data go through, pause the flow and poke around Oracle to make sure it's to your liking after a matter of seconds or minutes instead of waiting for a "job to complete." It makes the process extremely agile.
Actually, NiFi is a very good tool. You can easily manipulate processors, and in a short time you can migrate huge amounts of data.
But with destinations such as an RDBMS there are always problems. I used to have a lot of problems with threads that would not die; you have to be very careful about stopping processes and about the configuration of processors. Some processors, like QueryDatabaseTable, consume huge amounts of memory and can bring the server down.

Managing reports when our database is Cassandra: Spark or Solr... or both?

My DB is Cassandra (DataStax Enterprise on Linux). Since it doesn't support group-by, aggregates, etc. for reporting, it seems, by its very fundamentals, downright not a good choice for this. I googled this deficit and found some results, such as this, this, and this one.
But I really became confused! Hive uses additional tables of its own. Solr is better for full-text searching and the like. And Spark... it's useful for analysis, but I didn't understand whether it ultimately uses Hadoop or not.
I will have many reports, which will need indexing and grouping at least. But I don't want to use additional tables that will impose overhead. Also, I'm a .NET (not Java) developer, and my application is based on the .NET Framework, too.
I am not exactly sure what your question is here, and your confusion is understandable, as there is a lot going on with Cassandra and DSE.
You are correct in stating that Cassandra does not support any aggregations or group by functionality that you would want to use for reporting.
Solr (DSE Search) is used for ad-hoc and full text searching of the data stored in Cassandra. This only works on a single table at a time.
Spark (DSE Analytics) provides analytics capabilities such as Map-Reduce as well as the ability to filter and join tables. This is not done in real-time though as the processing and shuffling of data can be expensive depending on the data load.
Spark does not use Hadoop. It performs many of the same jobs but is more efficient in many scenarios as it allows for in-memory distributed processing on the data.
Since you are using DataStax Enterprise the advantage is that you have built in connectors to both Solr (DSE Search) to provide ad-hoc queries and Spark (DSE Analytics) to provide analytics on your data.
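As a rough illustration, once DSE Analytics exposes the Cassandra tables to Spark, the group-by that Cassandra itself lacks can be done in Spark SQL (table and column names here are hypothetical):

```sql
-- Hypothetical: an aggregation Cassandra cannot run natively,
-- executed by Spark over data read from the Cassandra table.
SELECT service_name, COUNT(DISTINCT user_id) AS users
FROM api_log
GROUP BY service_name;
```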
Since I don't know your exact reporting requirements it is difficult to give you a specific recommendation. If you can provide some additional details about what sort of reporting (scheduled versus ad-hoc etc.) you will be running I may be able to help you more.

HBase vs Cassandra: which is better for time-series data storage?

I use my API logs to extract information like:
- In a given period of time, how many users does my API have?
- In a given period of time, which services are called the most?

Almost all the information I extract depends on the timestamp. Currently I use MongoDB, with the timestamp as an index (for 80 GB of data, the index size is 12 GB).
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
- Analysis of time-series data.
- Both good write and good read performance are required.
- The possibility of using Hadoop to do my data analysis.
Thanks for sharing your point of view or your experience.
Advantages of Cassandra:
- Cassandra generally shows better performance (though both are excellent).
- Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that will help either way).
Advantages of HBase:
- Native to the Hadoop ecosystem.
HBase will require you to install Hadoop anyway, so you get a nice two-for-one. To use Cassandra for analytics you will probably need to go with DataStax Enterprise, a commercial, non-open-source product, OR investigate using Spark, which has an open-source connector for Cassandra.
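Whichever you choose, for the time-series use case in the question the data model matters as much as the engine. A hedged CQL sketch, with hypothetical table and column names, of how this is typically modeled in Cassandra:

```sql
-- Hypothetical CQL model: partition by day, cluster by timestamp, so that
-- "who called what in this period" becomes a single-partition range scan.
CREATE TABLE api_log (
    day     date,
    ts      timestamp,
    user_id text,
    service text,
    PRIMARY KEY ((day), ts, user_id)
) WITH CLUSTERING ORDER BY (ts DESC, user_id ASC);

-- All calls in a six-hour window of one day:
SELECT user_id, service FROM api_log
WHERE day = '2015-06-01'
  AND ts >= '2015-06-01 00:00:00'
  AND ts <  '2015-06-01 06:00:00';
```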
Chocolate or Vanilla ice cream - which is better?
I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)

Why OLAP on HBase is possible

OLAP directly on top of most NoSQL databases is not possible, but from what I have researched it actually is possible on HBase, so I was wondering which particular features HBase has that distinguish it from the others and allow us to do this.
You will have to write lots of data-processing logic in your application layer to accomplish this. HBase is a data store, not a DBMS. So yes, as long as the data goes in, you can get it out and process it in your application layer however you want.
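One well-known way to push that application-layer logic into SQL instead is to map the HBase table into Hive via its HBase storage handler; a sketch with hypothetical table, column-family, and column names:

```sql
-- Hypothetical: expose an HBase table ('events', column family 'd') to Hive,
-- then run OLAP-style aggregations over it with ordinary SQL.
CREATE EXTERNAL TABLE hbase_events (
    rowkey  STRING,
    user_id STRING,
    amount  DOUBLE
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:user_id,d:amount')
TBLPROPERTIES ('hbase.table.name' = 'events');

SELECT user_id, SUM(amount) AS total
FROM hbase_events
GROUP BY user_id;
```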
If writing that logic yourself proves inconvenient and a NoSQL platform that supports SQL for OLAP out of the box is desirable, you could try Amisa Server.
