I am trying to write a wrapper that could easily be used by people coming from a SQL background. I have not started on this work yet and I would like to know what approach I should take.
Here's the problem statement: if someone has a lot of native SQL written against their RDBMS data and they want to switch to Hadoop, there are lots of problems. A major one, building tables on top of HDFS, has been solved by Hive. Now comes the querying part. For this we have different frameworks, but none is complete in itself: one might be slow, another might be lacking in features. For example, there's Impala and there's HiveQL, but for the end user there is no ONE framework.
I intend to do something like this: select(comma-separated string of column names, tableName).where(filter-expression)...
Something like LINQ for HDFS; underneath, it would figure out the best way to execute the select (HiveQL or Impala), the best way to do a where clause, etc.
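To make the idea concrete, here is a rough sketch of what such a fluent wrapper could look like in Java; every class and method name below is hypothetical, purely illustrative of the API shape:

```java
// Hypothetical fluent query builder; none of these classes exist yet.
public class Query {
    private final String columns;
    private final String table;
    private String filter;

    private Query(String columns, String table) {
        this.columns = columns;
        this.table = table;
    }

    public static Query select(String columns, String table) {
        return new Query(columns, table);
    }

    public Query where(String filterExpression) {
        this.filter = filterExpression;
        return this;
    }

    // The engine choice (Hive vs. Impala) would be made here, based on the
    // query shape, data size, required features, etc.
    public String toSql() {
        StringBuilder sql = new StringBuilder("SELECT ").append(columns)
                .append(" FROM ").append(table);
        if (filter != null) {
            sql.append(" WHERE ").append(filter);
        }
        return sql.toString();
    }
}

// Usage: Query.select("id, name", "users").where("age > 30").toSql();
```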
Suggestions? Ideas? Critique?
Thanks
Why not use the ODBC or JDBC drivers for Impala? These drivers are used by third-party tools like MicroStrategy or Tableau to submit queries to Impala.
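For example, Impala speaks the HiveServer2 protocol, so on an unsecured cluster you can reuse the standard Hive JDBC driver against Impala's default port 21050 (the host and table names below are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
    public static void main(String[] args) throws Exception {
        // Assumption: no Kerberos/SASL; "impala-host" and "my_table" are placeholders.
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT count(*) FROM my_table")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```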
I have more of a conceptual question. I'm using Hive to pull data, and then I want to insert all the retrieved values into IBM BigSQL (basically DB2) so that aggregating the data would be easier/faster. So I want to create a view in Hive that I will use nightly to perform a CTAS, so that I can take the resulting table, migrate it to DB2, and do the rest of the aggregation there.
Is there a better practice?
I wanted to do everything including aggregation in Hive but it is extremely slow.
Thanks for your suggestions!
Considering that you are using Cloudera, is there a reason why you don't perform the aggregations in Impala? Converting the JSON data to Parquet (I would recommend this if there is not a lot of nested structure) shouldn't be very expensive. Another alternative, depending on the kind of aggregations you are doing, is to use Spark to convert the data (this will also depend a lot on your cluster size). I would like to give you more specific hints, but without knowing what aggregations you are doing it is complicated.
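For reference, a one-off JSON-to-Parquet conversion in Spark could look roughly like this (assuming Spark 2.x and mostly flat JSON; the HDFS paths are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")
                .getOrCreate();

        // Read the raw JSON (schema is inferred) and rewrite it as Parquet,
        // which Impala can then aggregate far more efficiently.
        Dataset<Row> df = spark.read().json("hdfs:///data/raw/events.json");
        df.write().mode(SaveMode.Overwrite).parquet("hdfs:///data/parquet/events");

        spark.stop();
    }
}
```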
Are there any performance benchmarks (genuine ones) that compare Stinger vs Impala vs Drill? Also, which is preferred? My use case will mainly be ad-hoc interactive queries on top of Hive. Thanks.
There are some performance numbers on the site http://allegro.tech/fast-data-hackathon.html.
In general, we see Drill and Impala are comparable in performance for the interactive queries with the differentiation of Drill being its ability to query without metadata definitions and its ease of use working with JSON data.
Note that these tests are on much older versions of Drill, such as 0.8/0.9 (also not configured appropriately for data locality). Drill is now at 1.1, with a lot of improvements in SQL (window functions, etc.) and performance.
You cannot benchmark like this; it makes no sense, and you should never trust such a benchmark.
Everything depends on your own data. Do you have JSON files? Prefer Drill. Do you want to query more than 1 TB? Prefer Hive, and so on.
Also consider the file format: JSON, Kudu, Parquet, or ORC.
Then comes optimization: Hive + Tez seems better for parallel queries but very slow for a single query, whereas Impala is the opposite (MapReduce versus massively parallel processing).
Also, you want to consider the hardware resources: SSD disks or not, etc.
I recommend starting with Apache Drill + JSON files, then trying Apache Drill with Parquet or ORC.
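As a concrete starting point, Drill can query a raw JSON file directly through its JDBC driver; a minimal sketch (embedded Drillbit via zk=local; the file path and column names are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJsonExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.drill.jdbc.Driver");
        // "zk=local" runs an embedded Drillbit; point at your ZooKeeper quorum for a real cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // The path and the "name"/"amount" fields are placeholders for your own JSON.
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.name, t.amount FROM dfs.`/data/sample.json` t LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getDouble(2));
            }
        }
    }
}
```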
If you want help, describe exactly what you have (data + hardware) and what you want.
I'm working on a Hadoop / Cassandra integration and I have a couple of questions I was hoping someone could help me with.
First, I seem to require the source table/CF to have been created with the option WITH COMPACT STORAGE, otherwise I get a "can't read keyspace" error in my map/reduce code.
I was wondering if this is just how it needs to be?
And if this is the case, my second question is: is it possible (and how) to add the WITH COMPACT STORAGE option onto a pre-existing table, or am I going to have to re-create the tables and move the data around?
I am using Cassandra 1.2.6
thanks in advance
Gerry
I'm assuming you are using job.setInputFormatClass(ColumnFamilyInputFormat.class);
Instead, try using job.setInputFormatClass(CqlPagingInputFormat.class);
The Mapper input for this is Map<String, ByteBuffer>, Map<String,ByteBuffer>
Similarly, if you need to write out to Cassandra, use CqlPagingOutputFormat and the appropriate output types.
See http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive for more info.
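To make that concrete, here is a rough sketch of the job wiring against Cassandra 1.2.x (the keyspace, table, column names, node address, and output path are placeholders; the output side is just a dummy text output):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Cql3InputExample {

    // Note the input types: both the keys and the columns arrive as Map<String, ByteBuffer>.
    public static class MyMapper
            extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, IntWritable> {
        @Override
        protected void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer> columns, Context context)
                throws IOException, InterruptedException {
            // "id" and "count" are placeholder column names; decode whatever your table actually has.
            ByteBuffer id = keys.get("id");
            ByteBuffer count = columns.get("count");
            if (id != null && count != null) {
                context.write(new Text(ByteBufferUtil.string(id)), new IntWritable(count.getInt()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cql3-input-example");
        job.setJarByClass(Cql3InputExample.class);
        job.setMapperClass(MyMapper.class);
        job.setInputFormatClass(CqlPagingInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);

        // Placeholders: keyspace, table, node address; the partitioner must match your cluster.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table");
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

        FileOutputFormat.setOutputPath(job, new Path("/tmp/cql3-output"));
        job.waitForCompletion(true);
    }
}
```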
#Gerry
The "WITH COMPACT STORAGE" thing is CQL3 syntax to create tables structure compatible with Thrift clients and legacy column families.
Essentially, when using this option, the table, or should I say column family, is created without using any Composite.
You should know that CQL3 tables heavily rely on composites to work.
Now to answer your questions:
I was wondering if this is just how it needs to be?
Probably because your map/reduce code cannot deal with Composites. But I believe in version 1.2.6 of Cassandra, you have all the necessary code to deal with CQL3 tables. Look at classes in package org.apache.cassandra.hadoop.
is it possible/how do I add the WITH COMPACT STORAGE option on to a pre-exsting table?
No, it's not possible to modify/change a table's structure once it has been created. You'll need some kind of migration.
Is there any support for Multidimensional Expressions (MDX) for Hadoop's Hive?
Connecting an OLAP solution with Hadoop's data is possible. In icCube you can create your own data sources (check the documentation); you'll need to implement a Java interface (like JDBC).
This solution brings the data to the OLAP server. Bringing the processing to Hadoop is another question, and to my knowledge nobody does it. Aggregating the facts in parallel is possible. Another step is to have the dimensions on the nodes. This is a complicated problem (the algorithms are not easy to transform into a parallel version).
You can use Mondrian (Pentaho Analysis Services), it connects via JDBC and uses specific dialects for databases. I've seen reference to a Hive dialect, but have not tried it myself - best to search the forums.
There is a bit of a learning curve: you need to create a schema that defines the cubes in XML, but fortunately there is a GUI tool (schema workbench) that helps.
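For reference, once the schema XML exists, an MDX query can be issued from Java through olap4j roughly like this (the Hive URL, schema path, and cube/measure names are placeholders; I have not tested this against Hive specifically):

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.olap4j.CellSet;
import org.olap4j.OlapConnection;
import org.olap4j.OlapStatement;

public class MondrianHiveExample {
    public static void main(String[] args) throws Exception {
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
        // Mondrian wraps a plain JDBC connection (here HiveServer2) plus a schema file.
        String url = "jdbc:mondrian:"
                + "Jdbc=jdbc:hive2://hive-host:10000/default;"
                + "JdbcDrivers=org.apache.hive.jdbc.HiveDriver;"
                + "Catalog=file:/path/to/MySchema.xml";
        try (Connection conn = DriverManager.getConnection(url)) {
            OlapConnection olap = conn.unwrap(OlapConnection.class);
            OlapStatement stmt = olap.createStatement();
            // Cube, measure, and dimension names are made up for illustration.
            CellSet result = stmt.executeOlapQuery(
                    "SELECT {[Measures].[Sales]} ON COLUMNS, "
                    + "{[Time].[2015].Children} ON ROWS FROM [SalesCube]");
            System.out.println(result);
        }
    }
}
```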
There is the Simba MDX provider, which claims to convert MDX queries to HiveQL. I have not tried it myself, so I can't comment on its features and limitations.
I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
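For example, instead of a separate orders table joined to users, a user's orders might all live in one wide row. A sketch using the older HBase client API (the table, family, and qualifier names are made up; newer clients use addColumn instead of add):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");

        // One row per user; each order becomes a column in the "orders" family.
        Put put = new Put(Bytes.toBytes("user#12345"));
        put.add(Bytes.toBytes("orders"), Bytes.toBytes("2013-07-01#981"), Bytes.toBytes("49.99"));
        put.add(Bytes.toBytes("orders"), Bytes.toBytes("2013-07-15#1042"), Bytes.toBytes("12.50"));
        table.put(put);

        // A single Get then returns the user and all of their orders at once.
        Result result = table.get(new Get(Bytes.toBytes("user#12345")));
        System.out.println(result);

        table.close();
    }
}
```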
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and HBase and, as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered, JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't got farther with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and which implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
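A rough sketch of that pattern using HBase's map/reduce helpers (the table, family, qualifier, and output path are placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScanJob {

    public static class MyMapper extends TableMapper<Text, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] v = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
            if (v != null) {
                // Only emit what you actually need downstream.
                context.write(new Text(Bytes.toString(row.get(), row.getOffset(), row.getLength())),
                        new LongWritable(Bytes.toLong(v)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "hbase-scan-example");
        job.setJarByClass(ScanJob.class);

        // Restrict the scan to the single column we care about; the filtering
        // happens region-side, close to the data.
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
        scan.setCaching(500);

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, MyMapper.class, Text.class, LongWritable.class, job);
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-scan-output"));
        job.waitForCompletion(true);
    }
}
```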
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David