Is there any support for Multidimensional Expressions (MDX) for Hadoop's Hive?
Connecting an OLAP solution to data in Hadoop is possible. In icCube you can create your own data sources (check the documentation); you'll need a Java interface (like JDBC).
This approach brings the data to the OLAP server. Bringing the processing to Hadoop is another question, and to my knowledge nobody does it. Aggregating the facts in parallel is possible; another step is to have the dimensions on the nodes, which is a complicated problem (the algorithms are not easy to parallelize).
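For instance, here is a minimal sketch of such a data source pulling pre-aggregated facts out of Hive over JDBC; the host, port, credentials, and the sales_facts table are placeholders, not anything icCube-specific:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveFactsReader {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host, port, and table are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:hive2://hadoop-host:10000/default", "user", "");
                 Statement stmt = con.createStatement();
                 // Let Hive do the heavy aggregation, then feed the result to the cube.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS cnt FROM sales_facts GROUP BY country")) {
                while (rs.next()) {
                    System.out.println(rs.getString("country") + " -> " + rs.getLong("cnt"));
                }
            }
        }
    }

The key point is that only the aggregated result set crosses the wire; the heavy lifting stays in Hadoop.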
You can use Mondrian (Pentaho Analysis Services); it connects via JDBC and uses database-specific dialects. I've seen references to a Hive dialect, but have not tried it myself - best to search the forums.
There is a bit of a learning curve: you need to create a schema that defines the cubes in XML, but fortunately there is a GUI tool (Schema Workbench) that helps.
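For illustration, here is a rough sketch of what querying such a cube programmatically via olap4j over a Hive-backed JDBC connection might look like; the connection properties, schema file, and cube/measure names are assumptions I have not verified against a Hive dialect:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.olap4j.CellSet;
    import org.olap4j.OlapConnection;

    public class MondrianOverHive {
        public static void main(String[] args) throws Exception {
            Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
            // Jdbc/Catalog values are placeholders; the XML schema is the one
            // built with Schema Workbench.
            Connection c = DriverManager.getConnection(
                "jdbc:mondrian:Jdbc=jdbc:hive2://hadoop-host:10000/default;"
              + "JdbcDrivers=org.apache.hive.jdbc.HiveDriver;"
              + "Catalog=file:/path/to/SalesSchema.xml");
            OlapConnection olap = c.unwrap(OlapConnection.class);
            CellSet cells = olap.createStatement().executeOlapQuery(
                "SELECT {[Measures].[Sales Count]} ON COLUMNS, "
              + "{[Country].Members} ON ROWS FROM [Sales]");
            System.out.println(cells.getMetaData().getCube().getName());
            olap.close();
        }
    }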
There is the Simba MDX Provider, which claims to convert MDX queries to HiveQL. I have not tried it myself, so I cannot comment on its features and limitations.
Which Hadoop component can handle all of Oracle's functions, and which has low latency?
I am thinking of using components like Presto, Drill, and Shark.
Can anyone tell me which of the above technologies can handle all of the functions in Oracle with low latency,
or at least which has the most compatibility and can handle the most Oracle functions?
I have the flexibility to use more than one technology, but I am confused about which one to use for what: which functions are compatible with which technology, and which technology gives low latency?
Presto is designed to implement ANSI SQL and to execute queries with low latency (under 100ms for connectors that support it). Queries against Hive can execute in ~1s, depending on the speed of the Hive metastore (zero time if cached due to repeated access) and HDFS latency.
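For a sense of what this looks like from client code, here is a minimal sketch using the Presto JDBC driver; the coordinator host, port, user, and the orders table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PrestoLatencyCheck {
        public static void main(String[] args) throws Exception {
            // Presto JDBC driver; "hive" is the catalog backed by the Hive metastore.
            Class.forName("com.facebook.presto.jdbc.PrestoDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:presto://presto-coordinator:8080/hive/default", "analyst", "");
                 Statement stmt = con.createStatement()) {
                long start = System.nanoTime();
                try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM orders")) {
                    rs.next();
                    System.out.printf("count=%d in %d ms%n",
                            rs.getLong(1), (System.nanoTime() - start) / 1_000_000);
                }
            }
        }
    }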
Regarding Oracle functionality, nothing in open source comes close. Oracle is a huge product with a ton of functionality. However, no one uses all of the functionality. Most people use a small subset. You will need to evaluate the different alternatives and decide which has the functionality subset that best meets your needs.
Disclosure: I am one of the creators of Presto.
I'm working on a Hadoop/Cassandra integration and I have a couple of questions I was hoping someone could help me with.
First, I seem to require the source table/column family to have been created with the option WITH COMPACT STORAGE, otherwise I get a "can't read keyspace" error in my map/reduce code.
I was wondering if this is just how it needs to be?
And if this is the case, my second question is: is it possible to add the WITH COMPACT STORAGE option to a pre-existing table (and if so, how), or am I going to have to re-create the tables and move the data around?
I am using Cassandra 1.2.6
thanks in advance
Gerry
I'm assuming you are using job.setInputFormatClass(ColumnFamilyInputFormat.class);
Instead, try using job.setInputFormatClass(CqlPagingInputFormat.class);
The Mapper input for this is Map<String, ByteBuffer>, Map<String, ByteBuffer>.
Similarly, if you need to write out to Cassandra, use CqlPagingOutputFormat and the appropriate output types.
See http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive for more info.
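Here is a rough sketch of the job setup, based on the ConfigHelper and CqlConfigHelper classes shipped with Cassandra 1.2.x; the keyspace, column family, contact address, and partitioner are placeholders for your own values:

    import java.nio.ByteBuffer;
    import java.util.Map;

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
    import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CqlReadJob {

        // Keys and values arrive as maps of column name -> raw value.
        static class MyMapper
                extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, LongWritable> {
            @Override
            protected void map(Map<String, ByteBuffer> keys, Map<String, ByteBuffer> columns,
                               Context context) {
                // decode the ByteBuffers with the appropriate type serializers here
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "cql-read");
            job.setJarByClass(CqlReadJob.class);
            job.setMapperClass(MyMapper.class);
            job.setInputFormatClass(CqlPagingInputFormat.class);

            // keyspace / column family / cluster settings are placeholders
            ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table");
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
            ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
            CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

            // ... configure the output format / reducer as needed for your job
            job.waitForCompletion(true);
        }
    }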
@Gerry
The "WITH COMPACT STORAGE" thing is CQL3 syntax to create tables structure compatible with Thrift clients and legacy column families.
Essentially, when using this option, the table, or should I say column family, is created without using any Composite.
You should know that CQL3 tables heavily rely on composites to work.
Now to answer your questions:
I was wondering if this is just how it needs to be?
Probably because your map/reduce code cannot deal with composites. But I believe that in Cassandra 1.2.6 you have all the necessary code to deal with CQL3 tables; look at the classes in the package org.apache.cassandra.hadoop.
is it possible/how do I add the WITH COMPACT STORAGE option to a pre-existing table?
No, it's not possible to modify/change a table's structure once it has been created. You'll need some kind of migration.
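To make the difference concrete, here is a small sketch (using the DataStax Java driver; the keyspace, table names, and columns are made up) showing the same model created as a regular CQL3 table and as a Thrift-compatible compact table:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CompactStorageExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_keyspace"); // keyspace assumed to exist

            // Regular CQL3 table: the clustering column is stored using composites.
            session.execute(
                "CREATE TABLE events (id uuid, ts timestamp, payload text, "
              + "PRIMARY KEY (id, ts))");

            // Thrift-compatible variant: no composites, readable by legacy column
            // family clients; an existing table cannot be altered into this form.
            session.execute(
                "CREATE TABLE events_compact (id uuid, ts timestamp, payload text, "
              + "PRIMARY KEY (id, ts)) WITH COMPACT STORAGE");

            cluster.close(); // shutdown() on older 1.x drivers
        }
    }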
I am trying to write a wrapper that could easily be used by people coming from a SQL background. I have not started on this work yet, and I would like to know what approach I should take.
Here's the problem statement: if someone has a lot of native SQL written against their RDBMS data and they want to switch to Hadoop, there are lots of problems. A major one, building tables in HDFS, has been eliminated by Hive. Now comes the querying part: for this we have different frameworks, but none is complete in itself - one might be slow and another might be lacking in features. For example, there's Impala and there's HiveQL, but for the end user there is no single framework.
I intend to do something like this - select(comma-separated string of column names, tableName).where(filter-expression)....
Something like LINQ for HDFS: underneath, it would figure out the best way to execute the select (HiveQL or Impala), the best way to do a where clause, and so on.
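A very rough sketch of what such a wrapper could look like; every name here (the Query class, the routing heuristic, the JDBC URLs) is made up purely for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    /** Hypothetical fluent wrapper; all names and the routing rule are illustrative. */
    public class Query {
        // Placeholder endpoints; in practice these would come from configuration.
        private static final String HIVE_URL   = "jdbc:hive2://hive-host:10000/default";
        private static final String IMPALA_URL = "jdbc:hive2://impalad-host:21050/";

        private final String columns;
        private final String table;
        private String filter = "";

        private Query(String columns, String table) {
            this.columns = columns;
            this.table = table;
        }

        public static Query select(String columns, String table) {
            return new Query(columns, table);
        }

        public Query where(String filterExpression) {
            this.filter = " WHERE " + filterExpression;
            return this;
        }

        public ResultSet execute() throws Exception {
            String sql = "SELECT " + columns + " FROM " + table + filter;
            // Naive routing heuristic, purely illustrative: short interactive queries
            // go to Impala, anything with a join falls back to Hive.
            String url = sql.toLowerCase().contains("join") ? HIVE_URL : IMPALA_URL;
            Connection con = DriverManager.getConnection(url, "user", "");
            return con.createStatement().executeQuery(sql);
        }
    }

    // Intended usage, mirroring the API from the question:
    // ResultSet rs = Query.select("name, age", "users").where("age > 30").execute();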
Suggestions? Ideas? Critique?
Thanks
Why not use the ODBC or JDBC drivers for Impala? These drivers are used by third-party tools like MicroStrategy or Tableau to submit queries to Impala.
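As an illustrative sketch only: connecting to an impalad over JDBC commonly goes through the HiveServer2 driver; the host, port, and security settings below are assumptions to adapt to your cluster:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // Impala speaks the HiveServer2 protocol, so the Hive JDBC driver can be
            // reused; port 21050 and auth=noSasl are typical for an unsecured cluster
            // and are assumptions, not universal settings.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:hive2://impalad-host:21050/;auth=noSasl");
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT count(*) FROM my_table")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }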
I store a huge amount of reporting elements in a MySQL database. These elements are stored in a simple way:
KindOfEvent;FromCountry;FromGroupOfUser;FromUser;CreationDate
All these reporting elements should make it possible to display graphs from different points of view. I have tried using SQL queries for that, but it is very slow for users. As these graphs will be used by non-technical users, I need a tool to pre-process the results.
I am very new to all these data-mining, reporting, and OLAP concepts. If you know a pragmatic approach that is not too time-consuming, or a tool for this, it would help!
You could set up OLAP cubes on top of your MySQL data. The multi-dimensional model will help your users navigate through and analyse the data, either via Excel or web dashboards. One thing specific to icCube is its ability to integrate any JavaScript charting library and to embed the dashboards within your own pages.
I am not very familiar with databases, but I think MySQL is more than enough for your problem. A well-designed index or transaction will speed up the query process.
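As an example of that point (a sketch only: the events table name and the index are assumptions based on the column list in the question), a composite index aligned with the grouping columns can make the aggregation queries noticeably cheaper:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ReportingIndexDemo {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/reporting", "user", "password");
                 Statement stmt = con.createStatement()) {

                // Composite index matching a typical "events per country per day" drill-down.
                stmt.execute("CREATE INDEX idx_event_country_date "
                           + "ON events (KindOfEvent, FromCountry, CreationDate)");

                // The aggregation the graphs need can now be served far more cheaply.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT KindOfEvent, FromCountry, DATE(CreationDate) AS day, COUNT(*) AS n "
                      + "FROM events GROUP BY KindOfEvent, FromCountry, day")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getString(2)
                            + " " + rs.getDate(3) + " " + rs.getLong(4));
                    }
                }
            }
        }
    }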
I am not a DB expert, but if you want to process graphs you can use Neo4j (a Java graph processing framework), SNAP (a C++ graph processing framework), or employ cloud computing if that is possible; I would recommend either Hadoop (MapReduce) or Giraph (cloud graph processing). For graph display you can use whatever tool suits you. Of course, "the best" technology depends on the data size. If none of the above suits you, try finding something that does on the wiki page: http://en.wikipedia.org/wiki/Graph_database
InfoGrid (http://infogrid.org/trac/) looks like it might suit you.
I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
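To make the "few rows, many datapoints" idea concrete, here is a small sketch against the HBase client API; the users table, the orders family, and the row key format are invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UserOrdersLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // One row per user; every order the user ever placed lives in the same
            // row as a separate column in the "orders" family.
            HTable table = new HTable(conf, "users");
            Get get = new Get(Bytes.toBytes("user#42"));
            get.addFamily(Bytes.toBytes("orders"));
            Result row = table.get(get);

            // One round trip returns all the "joined" data for that user.
            for (byte[] qualifier : row.getFamilyMap(Bytes.toBytes("orders")).keySet()) {
                System.out.println(Bytes.toString(qualifier) + " = "
                    + Bytes.toString(row.getValue(Bytes.toBytes("orders"), qualifier)));
            }
            table.close();
        }
    }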
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and HBase and, as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered, JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem closer to what I was after. (Personally, I haven't got further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
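Here is a rough sketch of that idea with an HBase TableMapper, pushing the filtering to the region servers instead of iterating client-side; the table, column family, and filter value are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ActiveUserCount {

        static class CountMapper extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                    throws IOException, InterruptedException {
                // Only rows that survived the server-side filter reach this point,
                // so we emit exactly what we need and nothing else.
                context.write(new Text("active"), new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "active-user-count");
            job.setJarByClass(ActiveUserCount.class);

            // Push the predicate down to the region servers instead of iterating client-side.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("status"),
                    CompareOp.EQUAL, Bytes.toBytes("active")));

            TableMapReduceUtil.initTableMapperJob(
                    "users", scan, CountMapper.class, Text.class, IntWritable.class, job);
            job.setNumReduceTasks(0);
            job.setOutputFormatClass(NullOutputFormat.class);
            job.waitForCompletion(true);
        }
    }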
Regards,
David