Why is OLAP possible in HBase? - hadoop

OLAP directly on top of most NoSQL databases is not possible, but from what I have researched it actually is possible in HBase. So I was wondering: which features does HBase have in particular that distinguish it from the others and allow us to do this?

You will have to write a lot of data-processing logic in your application layer to accomplish this. HBase is a data store, not a DBMS. So yes, as long as the data goes in, you can get it out and process it in your application layer however you want.
If this proves inconvenient for you and a NoSQL platform that supports SQL for OLAP is desirable, you could try Amisa Server.
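As a sketch of what that application-layer processing can look like, here is a minimal group-by rollup over rows pulled from a wide-column store. The table, column families, and values below are entirely hypothetical; a real HBase client would return similar (row key, column map) pairs from a table scan.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the results of an HBase scan over a
# "sales" table: (row_key, {column: value}) pairs.
scanned_rows = [
    ("row1", {"cf:region": "EU", "cf:amount": "120"}),
    ("row2", {"cf:region": "US", "cf:amount": "300"}),
    ("row3", {"cf:region": "EU", "cf:amount": "80"}),
]

def rollup_by_region(rows):
    """Application-layer equivalent of GROUP BY region, SUM(amount)."""
    totals = defaultdict(int)
    for _key, cols in rows:
        totals[cols["cf:region"]] += int(cols["cf:amount"])
    return dict(totals)

print(rollup_by_region(scanned_rows))  # {'EU': 200, 'US': 300}
```

The point is that the store only has to deliver the raw rows; the OLAP-style aggregation lives entirely in your own code.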


Apache NIFI for ETL

How effective is it to use Apache NiFi for an ETL process whose source is HDFS and whose destination is an Oracle DB? What are the limitations of Apache NiFi compared to other ETL tools such as Pentaho, DataStage, etc.?
Main advantages of NiFi
The main advantages of NiFi:
Intuitive GUI, which allows for easy inspection of the data
Strong delivery guarantees
Low latency; you can support both batch and streaming use cases
It can handle any format; it is not limited to SQL tables and can also move log files, etc.
Schema-aware, and can share schemas with solutions like Kafka, Flink, and Spark
Main limitation of NiFi
NiFi is really a tool for moving data around. You can do enrichments of individual records, but it is typically described as doing 'EtL' with a small t. A typical thing that you would not want to do in NiFi is joining two dynamic data sources.
For joining tables, tools like Spark, Hive, or classical ETL alternatives are often used.
For joining streams, tools like Flink and Spark Streaming are often used.
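To make the "small t" distinction concrete, here is a sketch of a record-at-a-time enrichment, the kind of transformation NiFi processors handle well. Every field name and the conversion rate are illustrative; the key property is that each record is processed on its own, with no second data source required (which is exactly what separates this from a join).

```python
# "Small t": stateless, record-at-a-time enrichment. Each record is
# transformed independently; no other dataset has to be materialized.
def enrich(record):
    out = dict(record)
    out["amount_usd"] = round(out["amount_eur"] * 1.1, 2)  # assumed fixed rate
    out["currency"] = "USD"
    return out

print(enrich({"order_id": 7, "amount_eur": 100.0}))
```

A join, by contrast, needs both datasets in hand at once, which is why that work is usually pushed to engines like Spark or Hive.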
Conclusion
NiFi is a great tool, you just need to make sure you use it for the right usecase. Where needed you can use other tools to complement it.
Extra strong full disclosure: I am an employee of Cloudera, the company that supports NiFi and other projects such as Spark and Flink. I have used other ETL tools before, but not to the same extent as NiFi.
I am not sure about Sqoop, but I can explain the benefits of using Apache NiFi. In your case the data in HDFS could be in any format (unstructured); NiFi has the capability to process it and bring it into a format of your choice, so that you can save it directly to any RDBMS.
NiFi handles back-pressure in a very effective way to achieve lossless transmission.
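Back-pressure as a concept can be sketched with a bounded queue between a fast producer and a slower consumer: when the buffer fills, the producer blocks instead of dropping data, which is the "lossless" behaviour described above. (This is only a conceptual sketch; NiFi's actual mechanism is configurable per-connection queue thresholds, not this code.)

```python
import queue
import threading

# A bounded queue between producer and consumer: when it is full,
# put() blocks the producer - that blocking is the back-pressure.
buf = queue.Queue(maxsize=4)
received = []

def consumer():
    while True:
        item = buf.get()
        if item is None:  # sentinel: producer is done
            break
        received.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    buf.put(i)  # blocks whenever the queue already holds 4 items
buf.put(None)
t.join()
print(len(received))  # 100 - nothing was dropped
```

The producer is slowed to the consumer's pace rather than overrunning it, so no records are lost.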
One of the critical features that NiFi provides that our competitors generally don't is the ability to stop jobs and examine the flow and downstream systems while it's running. For you, this means you can test the flow against a test HDFS folder and a test Oracle DB, let some data go through, pause the flow and poke around Oracle to make sure it's to your liking after a matter of seconds or minutes instead of waiting for a "job to complete." It makes the process extremely agile.
Actually, NiFi is a very good tool. You can easily manipulate processors, and you can migrate huge amounts of data in a short time.
But with destinations such as an RDBMS, there are always problems. I used to have a lot of problems with threads that would not die; you have to be very careful about stopping processes and about the configuration of processors. Some processors, like QueryDatabaseTable, consume huge amounts of memory and can bring the server down.

Manage reports, when our database is Cassandra ...Spark or Solr...or BOTH?

My DB is Cassandra (DataStax Enterprise => Linux). Since it doesn't support GROUP BY, aggregates, etc. for reporting, by its very fundamentals it is, frankly, not a good choice for this. I googled this deficiency and found some results, like this, this, and also this one.
But I really became confused! Hive uses additional tables of its own. Solr is better for full-text searching and the like. And Spark is useful for analysis, but I didn't understand whether it ultimately relies on Hadoop or not.
I will have many reports, which need indexing and grouping at least. But I don't want to use additional tables, which would impose overhead. Also, I'm a .NET (not Java) developer, and my application is based on the .NET Framework, too.
I am not exactly sure what your question is here, and your confusion is understandable, as there is a lot going on with Cassandra and DSE.
You are correct in stating that Cassandra does not support any aggregations or group by functionality that you would want to use for reporting.
Solr (DSE Search) is used for ad-hoc and full text searching of the data stored in Cassandra. This only works on a single table at a time.
Spark (DSE Analytics) provides analytics capabilities such as Map-Reduce as well as the ability to filter and join tables. This is not done in real-time though as the processing and shuffling of data can be expensive depending on the data load.
Spark does not use Hadoop MapReduce. It performs many of the same jobs but is more efficient in many scenarios, as it allows for in-memory distributed processing of the data.
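To illustrate the kind of work Spark takes on here: a join of two tables boils down to building a hash table on one side and probing it with the other, done in memory and distributed across the cluster. In miniature, on a single node, with made-up table contents, it looks like this:

```python
# Miniature, single-node version of the join an engine like Spark
# performs across a cluster: build a hash index on the smaller side,
# then probe it with each row of the larger side.
users = [
    {"user_id": 1, "name": "alice"},
    {"user_id": 2, "name": "bob"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 25},
    {"order_id": 11, "user_id": 2, "total": 40},
    {"order_id": 12, "user_id": 1, "total": 5},
]

def hash_join(left, right, key):
    index = {row[key]: row for row in left}  # build side, held in memory
    return [{**index[r[key]], **r} for r in right if r[key] in index]

joined = hash_join(users, orders, "user_id")
print(len(joined))  # 3
```

At cluster scale the expensive part is shuffling rows so that matching keys land on the same node, which is why this is batch analytics rather than a real-time query.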
Since you are using DataStax Enterprise the advantage is that you have built in connectors to both Solr (DSE Search) to provide ad-hoc queries and Spark (DSE Analytics) to provide analytics on your data.
Since I don't know your exact reporting requirements it is difficult to give you a specific recommendation. If you can provide some additional details about what sort of reporting (scheduled versus ad-hoc etc.) you will be running I may be able to help you more.

Cassandra as Cache Front-end to RDBMS

We are using Oracle RDBMS in our system. To reduce database load we plan to use a caching layer.
I am looking to see if we can use Apache Cassandra as a Caching Storage frontend to Oracle db.
From what I have seen so far, Cassandra is more like a database with built-in caching features. So using it as a caching layer in front of Oracle would be more like using another database. I feel it would be better to use Cassandra itself as an alternative to Oracle and other RDBMSes, rather than using it alongside Oracle.
Has anyone used Cassandra as a caching layer in front of an RDBMS? I have not found any resources or examples for this. If you have, I would appreciate your help.
I'm not sure what you mean by a caching storage frontend.
Cassandra might be useful if you are expecting a large volume of writes that arrive at a rate faster than Oracle could handle. Cassandra can handle a high volume of writes since it can scale by adding more nodes.
You could then do some kind of data analysis and reduction on the data in Cassandra before inserting the crunched data into Oracle. You might then use Oracle for the tasks that suit it better such as financial reporting, ad hoc queries, etc.
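The pattern described above can be sketched briefly: absorb a high-volume event stream (as Cassandra would), periodically reduce it, and hand the much smaller summary to Oracle. The event fields and the reduction chosen here are hypothetical.

```python
from collections import defaultdict

# Raw high-rate events, as they might accumulate in Cassandra.
events = [
    {"sensor": "a", "reading": 3},
    {"sensor": "b", "reading": 5},
    {"sensor": "a", "reading": 7},
]

def reduce_for_oracle(raw_events):
    """Crunch raw events into one summary row per sensor; these summary
    rows, not the raw events, are what would be INSERTed into Oracle."""
    acc = defaultdict(lambda: {"count": 0, "total": 0})
    for e in raw_events:
        acc[e["sensor"]]["count"] += 1
        acc[e["sensor"]]["total"] += e["reading"]
    return dict(acc)

print(reduce_for_oracle(events))
```

Oracle then only ever sees the crunched rows, keeping its write load far below the raw ingest rate.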

Enterprise Data warehouse with NOSQL /Hadoop - "NO RDBMS"

Are there any EDW (enterprise data warehouse) systems designed using a NoSQL/Hadoop solution?
I know there are PDW systems (MS PDW PolyBase, Greenplum HAWQ, etc.) which connect to HDFS subsystems. These are proprietary hardware and software solutions and are expensive at scale. I am looking for a solution built with NoSQL or Hadoop, preferably open source, for an enterprise data warehouse. I would like to hear about your experiences if you have implemented any. Just to mention again: I am not looking for any type of proprietary RDBMS as a player in this EDW solution.
I did some research on the internet; though it is possible (Impala is a possible option), I did not find anyone who had really implemented one completely with NoSQL or Hadoop.
If you have done something of this type, I would like to hear how you designed it and which different tools are used by your business analysts, etc. If you could share your experience along the journey, that would be really appreciated.
Updating....
What about VoltDB and NEOdb? They are not true RDBMSes, but they claim to support ANSI SQL to a great extent.
The first problem you will face when building an EDW on top of Hadoop is the fact that its storage is not updatable, so you should forget about the SQL UPDATE and DELETE commands.
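A common workaround for append-only storage is worth sketching: each logical UPDATE is written as a new versioned record, and readers resolve "latest version wins" per key. The record fields below are illustrative.

```python
# On append-only storage there is no UPDATE in place. Each change is
# appended as a new record; reads reconstruct the current state by
# keeping only the latest version of each key.
appended_records = [
    {"id": 1, "status": "new",     "version": 1},
    {"id": 2, "status": "new",     "version": 1},
    {"id": 1, "status": "shipped", "version": 2},  # logical UPDATE of id=1
]

def latest_view(records):
    current = {}
    for r in records:
        if r["id"] not in current or r["version"] > current[r["id"]]["version"]:
            current[r["id"]] = r
    return current

print(latest_view(appended_records)[1]["status"])  # shipped
```

This is essentially the compaction model tools like Hive ACID or HBase apply under the hood, pushed up into the warehouse design.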
Second, a solution built on top of Hadoop is usually several times more expensive to maintain: more expensive specialists, more complex debugging (compare debugging a problem in a Hive query versus a SQL query problem in Oracle: which would be easier?).
Third, Hadoop usually gives you much less concurrency and much higher latency for any type of workload you put on top of it.
Given all of this, why do you think a DWH is built on top of Hadoop only at really big enterprises like Facebook, Yahoo, eBay, LinkedIn, and so on? Because it is not that simple to do, but once implemented it can be more scalable and more customizable than any proprietary solution.
So if you have firmly decided to go ahead with Hadoop or another NoSQL solution to build your DWH, I would recommend the following:
Use Hadoop HDFS as the base for data storage
Use Flume for data loading into HDFS
Use Hive with Tez for heavy ETL jobs
Provide Impala as a SQL query interface for analysts
Provide Spark as an advanced instrument for analysts
Use Ambari for management and provisioning of all of these tools together
Together, these tools will cover most of your needs.

What Cassandra client should I use for Hadoop integration?

I am trying to build a data services layer using Cassandra as the backend store. I am new to Cassandra and not sure which client to use: Thrift or CQL3? We have a lot of MapReduce jobs using Amazon Elastic MapReduce (EMR) that will be reading/writing data from Cassandra at high volume. The total data volume will be > 100 TB, with billions of rows in Cassandra. The MapReduce jobs may be read- or write-heavy, with high QPS (>1000 QPS). The requirements are as follows:
Simplicity of client code. It seems Thrift has built-in integration with Hadoop for bulk data loading using sstableloader (http://www.datastax.com/dev/blog/bulk-loading).
Ability to define new columns at run time. We may need to add more columns depending on application requirements. It seems CQL3 does not allow columns to be defined dynamically at runtime.
Performance of bulk reads/writes. I am not sure which client is better. However, I found this post claiming that the Thrift client has better performance for high data volumes: http://jira.pentaho.com/browse/PDI-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
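On point 2, the usual CQL3 workaround deserves a mention: instead of physically adding columns at runtime, dynamic attributes are modeled as rows of a (row key, column name, value) table, with the column name as a clustering key. A miniature sketch of the read side, with made-up keys and attributes:

```python
# CQL3 pattern for "dynamic columns": store (row_key, column_name, value)
# triples - roughly PRIMARY KEY (row_key, column_name) in the schema -
# instead of altering the table at runtime. A new "column" is just a row.
triples = [
    ("user:42", "email", "a@example.com"),
    ("user:42", "plan", "pro"),
    ("user:42", "beta_flag", "on"),  # attribute added at runtime
]

def assemble(row_key, rows):
    """Reassemble one logical wide row from its triples."""
    return {name: value for key, name, value in rows if key == row_key}

print(assemble("user:42", triples))
```

With this modeling, CQL3's static schema stops being a blocker for the runtime-columns requirement.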
I could not find any authoritative source of information that answers this question. Appreciate if you could help with this since I am sure this is a common problem for most folks and would benefit the overall community.
Many thanks in advance.
-Prateek
Hadoop and Cassandra are both written in Java, so definitely pick a Java-based driver. As far as simplicity of code goes, I'd go for Astyanax; its wiki page is really good and the documentation is solid all round. And yes, Astyanax does allow you to define columns at runtime as you please, but be aware that Thrift-based APIs are being superseded by CQL APIs.
If, however, you want to go down the pure CQL3 route, DataStax's driver is what I'd advise you to use. It allows for asynchronous connections and is continuously updated (view the logs). The code is also very clean, although the documentation isn't quite there yet; there are tests in the source that you can look at.
But to be honest, there are so many questions about the APIs that you should read through them and form an opinion for yourself:
Cassandra Client Java API's
About Java Cassandra Client, which one is better? How about CQL?
Advantages of using cql over thrift
Also, for performance, here are some benchmarks (they are, however, outdated!) showing that CQL is catching up with Thrift (and somewhat overtaking it when it comes to prepared statements):
compare string vs. binary prepared statement parameters
CQL benchmarking
