Apache Phoenix + Pentaho Mondrian wrong join order

I am using Apache Phoenix 4.5.2 from the Cloudera Labs distribution, installed on a CDH 5.4 cluster. I'm trying to use it from a Pentaho BA 5.4 server with embedded Mondrian and the Saiku plugin installed.
I'm planning to use it as the aggregator for the Pentaho Mondrian ROLAP engine. I have imported about 65 million facts into the fact table via a slightly customized Pentaho Data Integration (if anyone's interested, I added UPSERT to the Table Output step, set Commit size to -1, pointed the thin driver phoenix-<version>-query-server-thin-client.jar URL at the Phoenix Query Server and enabled autocommit in hbase-site.xml via phoenix.connection.autoCommit), and I have about 400 rows in the time dimension table.
The problem is that Mondrian generates queries assuming that the order of tables does not matter. It generates a Cartesian-style join (comma-separated FROM with a WHERE equality) in which the dimension table comes first and the fact table comes last. If I swap the order of the tables, the query succeeds.
This ends with Phoenix trying to cache the 65M-row table in memory, so I get org.apache.phoenix.join.MaxServerCacheSizeExceededException: Size of hash cache (104857626 bytes) exceeds the maximum allowed size (104857600 bytes).
Aside from building a custom Mondrian that places the fact table first, is there any hint or index trick to force Phoenix to iterate over the fact table first? To me it's a no-brainer that it should scan the 65M-row table and hash join it against the much smaller dimension table.
Exception stack trace:
Caused by: mondrian.olap.MondrianException: Mondrian Error:Internal error: Error while loading segment; sql=[select "DAYS"."DAY" as "c0", sum("account_transactions"."AMOUNT") as "m0" from "DAYS" as "DAYS", "account_transactions" as "account_transactions" where "account_transactions"."DATE" = "DAYS"."DATE" group by "DAYS"."DAY"]
at mondrian.resource.MondrianResource$_Def0.ex(MondrianResource.java:972)
at mondrian.olap.Util.newInternal(Util.java:2404)
at mondrian.olap.Util.newError(Util.java:2420)
at mondrian.rolap.SqlStatement.handle(SqlStatement.java:353)
at mondrian.rolap.SqlStatement.execute(SqlStatement.java:253)
at mondrian.rolap.RolapUtil.executeQuery(RolapUtil.java:350)
at mondrian.rolap.agg.SegmentLoader.createExecuteSql(SegmentLoader.java:625)
... 8 more
Caused by: java.sql.SQLException: Encountered exception in sub plan [0] execution.
at org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:171)
at org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:121)
at org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:266)
at org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:256)
at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:255)
at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:1409)
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208)
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208)
at mondrian.rolap.SqlStatement.execute(SqlStatement.java:200)
... 10 more
Caused by: org.apache.phoenix.join.MaxServerCacheSizeExceededException: Size of hash cache (104857626 bytes) exceeds the maximum allowed size (104857600 bytes)
at org.apache.phoenix.join.HashCacheClient.serialize(HashCacheClient.java:109)
at org.apache.phoenix.join.HashCacheClient.addHashCache(HashCacheClient.java:82)
at org.apache.phoenix.execute.HashJoinPlan$HashSubPlan.execute(HashJoinPlan.java:353)
at org.apache.phoenix.execute.HashJoinPlan$1.call(HashJoinPlan.java:145)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask.run(JobManager.java:183)
... 3 more

Hash Join vs. Sort-Merge Join
Basic hash join usually outperforms other types of join algorithms, but it has its limitations too, the most significant of which is the assumption that one of the relations must be small enough to fit into memory. Thus Phoenix now has both hash join and sort-merge join implemented to facilitate fast join operations as well as joins between two large tables.
Phoenix currently uses the hash join algorithm whenever possible since it is usually much faster. However, we have the hint "USE_SORT_MERGE_JOIN" for forcing the usage of sort-merge join in a query. The choice between these two join algorithms, together with detecting the smaller relation for hash join, will be done automatically in the future under the guidance provided by table statistics.
You can add the USE_SORT_MERGE_JOIN hint to the query so that Phoenix does not try to fit one of the relations in memory, i.e.
SELECT /*+ USE_SORT_MERGE_JOIN*/ ...
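For reference, applied to the query Mondrian generated (taken from the stack trace above), the hinted statement would look roughly like this:
select /*+ USE_SORT_MERGE_JOIN */ "DAYS"."DAY" as "c0", sum("account_transactions"."AMOUNT") as "m0"
from "DAYS" as "DAYS", "account_transactions" as "account_transactions"
where "account_transactions"."DATE" = "DAYS"."DATE"
group by "DAYS"."DAY"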
Alternatively, you can configure a larger max cache size if you are confident that your relation will fit in memory.
https://phoenix.apache.org/tuning.html
phoenix.query.maxServerCacheBytes (default: 104857600 bytes, i.e. 100 MB)
Maximum size (in bytes) of a single sub-query result (usually the filtered result of a table) before compression and conversion to a hash map. Attempting to hash an intermediate sub-query result larger than this setting will result in a MaxServerCacheSizeExceededException.

Related

How does one run compute stats on a subset of columns from a hive table using Impala?

I have a very long and wide Hive table that takes an exorbitant amount of time to return query results, so I attempted a COMPUTE STATS on the table, but due to its width the operation often times out. Is there a way to run COMPUTE STATS on select columns? Documentation on the Cloudera website suggests it is possible, but the syntax does not work.
Here is what I've tried, to no avail; these all result in syntax errors.
COMPUTE STATS database.table field1
COMPUTE STATS database.table field1, field2
COMPUTE STATS database.table (field1, field2)
After further research, it was confirmed that the syntax is, in fact, correct, but the column-list parameter for COMPUTE STATS was not made available until CDH 5.15.x. I am leaving this here in case anybody comes across the same issue.
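On CDH 5.15.x and later, the parenthesised column-list form attempted above should therefore work, e.g.:
COMPUTE STATS database.table (field1, field2);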
Hive works by creating jobs that run in a different engine (originally MapReduce, which can be rather slow) and the underlying engine can be changed.
Rather than MapReduce, you may be able to use Apache Spark or Apache Tez, both of which are faster than MapReduce.
Newer versions of Hive also support an architecture called LLAP (Live Long And Process) which caches metadata similarly to Impala, reducing query latency.
You may want to test some typical queries against your own tables to see if one of these works better for you than Impala for interactive and ad-hoc queries.
UNDERSTANDING EXECUTION PLANS
To get a true grasp on what causes a query to take a long time, you need to understand what operations Hive or Impala will perform when it executes a query.
To find this out, you can view the execution plan for a query.
The execution plan is a description of the tasks required for a query, the order in which they'll be executed, and some details about each task.
To see an execution plan for a query, you can do this:
Prefix the query with the keyword EXPLAIN, then run it.
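For example, using the hypothetical table and column names from the question:
EXPLAIN SELECT field1, COUNT(*) FROM database.table GROUP BY field1;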
Execution plans can be long and complex.
Fully understanding them requires a deep knowledge of MapReduce.
The execution plans provided by Hive and by Impala look slightly different, but at a basic level, they provide more or less the same information.
TABLE AND COLUMN STATISTICS
The SQL engines you use do a certain amount of optimizing of the queries on their own—they look for the best way to proceed with your query, when possible.
When the query uses joins, the optimizers can do a better job when they have table statistics and column statistics.
For the table as a whole, these statistics include the number of rows, the number of files used to store the data, and the total size of the data.
The column statistics include the approximate number of distinct values and the maximum and average sizes of the values (not the maximum or average value, but rather the size used in storage).
The optimizers use this information when deciding how to perform the join tasks.
Statistics also help your system prevent issues due to memory usage and resource limitations.
These statistics are not automatically calculated; you have to trigger the calculation manually using a SQL command.
Once statistics are computed, both Hive and Impala can use them, though if you compute them in Hive, you need to refresh Impala's metadata cache.
If you make any changes to the table, such as adding or deleting data, you'll need to recompute the statistics.
Both Hive and Impala can use the statistics, even when they were calculated by the other engine.
However, when you have both Impala and Hive available, Cloudera recommends using Impala's COMPUTE STATS command to calculate and view the statistics.
The method for Hive is a bit more difficult to use.
If you do use Hive, you must refresh Impala's metadata cache for the table if you want Impala to use the statistics.
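A minimal sketch of that refresh step in Impala, using the hypothetical table name from the question (which statement you need depends on whether Impala already knows about the table):
REFRESH database.table;
-- or, if Impala has not seen the table before:
INVALIDATE METADATA database.table;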
Statistics in Impala
Impala's syntax for calculating statistics for a table (including statistics for all columns) is COMPUTE STATS dbname.tablename;
If the table is in the active database, you can omit dbname. from the command.
To see the statistics in Impala, run SHOW TABLE STATS dbname.tablename; or
SHOW COLUMN STATS dbname.tablename;
Note: If the statistics have not yet been computed, #Rows for the table shows -1.
The #Nulls statistics for each column will always be -1;
old versions of Impala would calculate this statistic, but it is not used for optimization, so newer versions skip it.
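Putting the Impala commands together for the hypothetical database.table used earlier:
COMPUTE STATS database.table;
SHOW TABLE STATS database.table;
SHOW COLUMN STATS database.table;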
Statistics in Hive
Hive's syntax for calculating statistics for a table is ANALYZE TABLE dbname.tablename COMPUTE STATISTICS;
If the table is in the active database, you can omit dbname. from the command.
To calculate column statistics, add FOR COLUMNS at the end of the command.
To see the table statistics in Hive, run DESCRIBE FORMATTED dbname.tablename;
The Table Parameters section will include numFiles, numRows, rawDataSize, and totalSize.
To see the statistics for a column, include the column name at the end:
DESCRIBE FORMATTED dbname.tablename columnname;
You can only display column statistics one column at a time.
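The equivalent Hive sequence, again using the hypothetical database.table and field1, would look roughly like this:
ANALYZE TABLE database.table COMPUTE STATISTICS;
ANALYZE TABLE database.table COMPUTE STATISTICS FOR COLUMNS;
DESCRIBE FORMATTED database.table;
DESCRIBE FORMATTED database.table field1;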

Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100 000 rows.
The following query takes around 1.5 seconds to execute, even when using the vsql client directly on one of the Vertica cluster nodes (to eliminate any network latency issue):
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? If so, why is that?
Thanks in advance.
It sounds like your query is running against the (default) superprojection, which includes all of the table's columns. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
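A minimal sketch of a query-specific projection covering only the two referenced columns (the projection name, sort order, and segmentation clause are just illustrative choices):
CREATE PROJECTION MY_TABLE_narrow_p AS
  SELECT COL_26, COL_132
  FROM MY_TABLE
  ORDER BY COL_26
  SEGMENTED BY HASH(COL_26) ALL NODES;
SELECT REFRESH('MY_TABLE');
After the refresh, the optimizer can choose the narrow projection for queries that touch only COL_26 and COL_132.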
I was running Vertica 8.1.0-1; it seems the issue was a Vertica bug in the query planning phase causing performance degradation. It was solved in versions >= 8.1.1:
https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

Generating star schema in hive

I am from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into facts and dimensions.
I am completely new to Hadoop, and I have learned that I can build a data warehouse in Hive. I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. Is the strategy below the right way to load fact and dimension tables in Hive?
Load source data into a hive table; let say Sales_Data_Warehouse
Generate Dimension from sales_data_warehouse; ex:
SELECT New_Guid(), Customer_Name, Customer_Address From Sales_Data_Warehouse
When all dimensions are done then load the fact table like
SELECT New_Guid() AS Fact_Key, Customer.Customer_Key, Store.Store_Key...
FROM Sales_Data_Warehouse AS source
JOIN Customer_Dimension AS Customer
  ON source.Customer_Name = Customer.Customer_Name
 AND source.Customer_Address = Customer.Customer_Address
JOIN Store_Dimension AS Store
  ON Store.Store_Name = source.Store_Name
JOIN Product_Dimension AS Product ON .....
Is this the way I should load my fact and dimension table in hive?
Also, in general warehouse projects we need to update dimension attributes (e.g. Customer_Address changes to something else) or, rarely, update a fact table foreign key. So how can I have an INSERT-UPDATE load in Hive (like we do with a Lookup in SSIS or a MERGE statement in T-SQL)?
We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.
The Hadoop file system is immutable; we can only add data, not update it. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, these seem to be rather buggy). Slowly Changing Dimensions therefore become the default behaviour on Hadoop. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions, as sketched below. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
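A minimal sketch of the first option in HiveQL, assuming a hypothetical customer_dim table keyed by customer_id with a load_ts column recording when each version was appended:
CREATE VIEW customer_dim_current AS
SELECT customer_id, customer_name, customer_address, load_ts
FROM (
  SELECT d.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_ts DESC) AS rn
  FROM customer_dim d
) t
WHERE rn = 1;
Queries and fact-table loads then join against customer_dim_current instead of the raw, append-only dimension table.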
The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS, tables are split into big chunks and distributed across the nodes of the cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins on Hadoop between two very large tables are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact table and a dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
I would also question the usefulness of surrogate keys. Why not use the natural key? Performance with complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.

How to improve performance with MonetDB on OSX?

I am using monetdb on a 16GB Macbook Pro with OSX 10.10.4 Yosemite.
I execute queries with SQLWorkbenchJ (configured with a minimum of 2048M RAM).
I find the performance overall erratic:
performance is acceptable / good with small size tables (<100K rows)
abysmal with tables with many rows: a query with a join of two tables (8670 rows and 242K rows) and a simple sum took 1H 20m!!
My 16GB of memory notwithstanding, I never saw mserver5 use more than 35MB of RAM in one run, and 450MB in another. On the other hand, the time is spent swapping data to disk (over 160GB of data, according to Activity Monitor!).
There are a number of performance-related issues that I would like to understand better:
I have the impression that MonetDB struggles with understanding how much RAM to use / is available in OSX. How can I "force" MonetDB to use more RAM?
I use MonetDB through R. The MonetDB.R driver converts all the character fields into CLOB. I wonder if CLOBs create memory allocation issues?
I find it difficult to explain the many GBs of writes (as mentioned, >150GB!!) even for index creation or temporary results. On the other hand, when I create the DB and load the tables, the whole DB is <50MB. Should I create an artificial integer key and set it as an index?
I join the two tables on a timestamp field (e.g. "2015/01/01 01:00") that again is seen as a text CLOB by MonetDB / MonetDB.R. Should I just convert it to an integer before saving it to MonetDB?
I have configured each table with a primary key, using a field of type integer. MonetDB (as a typical columnar database) doesn't need the user to specify an index. Is there any other way to improve performance?
Any recommendation is welcome.
For clarity the two tables I join have the following layout:
Calendar # classic calendar table with one entry per hour in a year = 8760 rows
Fields: datetime, date, month, weekbyhour, monthbyday, yearbyweek, yearbymonth # all fields are CLOBs as mentioned
Activity # around 200K rows
Fields: company, department, subdepartment, function, subfunction, activityname, activityunits, datetime, duration # all CLOBs except activityunits; datetime refers to when the activity has occurred
I have tried various types of join syntax, but an example would be (`*` used for brevity):
select * from Activity as a, Calendar as b where a.datetime=b.datetime

Why are Cassandra secondary indexes so slow on just 350k rows?

I have a column family with a secondary index. The indexed field is basically a binary flag, but I'm using a string for it. The field is called is_exported and can be 'true' or 'false'. After each request, all loaded rows are updated with is_exported = 'false'.
I'm polling this column family every ten minutes and exporting new rows as they appear.
But here is the problem: I'm seeing that the time for this query grows pretty much linearly with the amount of data in the column family, and currently it takes 12 to 20 seconds (!!!) to find 5000 rows. From my understanding, an indexed request should not depend on the number of rows in the CF but on the number of rows per index value (cardinality), since the index is just another hidden CF like:
"true" : rowKey1 rowKey2 rowKey3 ...
"false": rowKey1 rowKey2 rowKey3 ...
I'm using Pycassa to query the data; here is the code I'm using (imports shown for completeness):
import pycassa
from pycassa.index import create_index_expression, create_index_clause

column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name, read_consistency_level=2)
is_exported_expr = create_index_expression('is_exported', 'false')
clause = create_index_clause([is_exported_expr], count=5000)
column_family.get_indexed_slices(clause)
Am I doing something wrong? I expect this operation to work MUCH faster.
Any ideas or suggestions?
Some config info:
Cassandra 1.1.0
RandomPartitioner
I have 2 nodes and replication_factor = 2 (each server has a full data copy)
Using AWS EC2, large instances
Software raid0 on ephemeral drives
Thanks in advance!
I don't know the internals of indexing in Cassandra, but I'm under the assumption that it behaves in a similar fashion to PostgreSQL / MySQL, where indexing boolean true/false columns is redundant in many scenarios. If cardinality is low (true & false = 2 unique values) and the data is distributed quite evenly, e.g. ~50% true and ~50% false, then the database engine will likely perform a full table scan (which doesn't utilize the indexes).
The linear relationship between query execution and data set size would further support that Cassandra is performing a full table (keyspace) scan.
