Exact distinct count in druid - hadoop

I have loaded data into Druid from Hive and I haven't used any HLL columns.
When I run a COUNT(DISTINCT mycol) query in Druid, I do not get exact counts. The counts are close, but they do not match what I have in Hive.
Why would Druid not give an exact count even though I haven't configured anything HLL-related? Alternatively, is there a way to get exact distinct counts in Druid?
I found an old post from 2014 about the same issue (https://groups.google.com/forum/#!topic/druid-development/AMSOVGx5PhQ), but I am not sure whether the current version of Druid supports exact distinct counts.

By default, the COUNT(DISTINCT col) aggregation uses a variant of HyperLogLog, a fast approximate distinct-counting algorithm. Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (refer to http://druid.io/docs/latest/querying/sql.html).
To get exact distinct counts, set druid.sql.planner.useApproximateCountDistinct to false (refer to http://druid.io/docs/latest/configuration/index.html#broker-node-configs). Also note a limitation of exact mode: only one distinct count per query is permitted.
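As a minimal sketch, assuming a datasource named mydatasource (substitute your own) and that you can edit the broker configuration:
-- In the broker's runtime.properties, or pass "useApproximateCountDistinct": false in the query context:
--   druid.sql.planner.useApproximateCountDistinct=false

-- With exact mode enabled this returns an exact count, but remember that only
-- one COUNT(DISTINCT ...) per query is allowed in exact mode.
SELECT COUNT(DISTINCT mycol) AS exact_distinct_count
FROM mydatasource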

Hard to tell what is happening without the DDLs and more clues...
I am guessing the data got rolled up when it was indexed by Druid. When you index data with a granularity other than none, it can get rolled up to that granularity level.

I had a similar issue, and in my case it was caused by data rollup, as Slim mentioned in his answer.
Basically, if your data is more granular than your segmentGranularity, it will get rolled up automatically. If you set segmentGranularity to none, it won't get rolled up.
One more thing I observed in my case: even with segment granularity none, if the timestamp and all other columns of two different rows were the same, they still got merged into one row automatically.
This particular behavior was OK for me, as I was also looking for a distinct count, as you are.
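A rough way to check whether rollup is what you are seeing (a sketch only; it assumes your datasource is named mydatasource and that your ingestion spec defined a count metric named "count"):
-- COUNT(*) counts rows as they are stored after rollup, while SUM("count")
-- recovers the number of raw rows ingested; if the two differ, rollup merged rows.
SELECT COUNT(*) AS stored_rows, SUM("count") AS ingested_rows
FROM mydatasource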

Related

How to obtain count of update records from Hive

I am working with the Hive driver, where executeUpdate() does not return the affected record count. Is there any alternate way in which this can be obtained? We need the affected record count for further processing.
If I am not mistaken, Hive does not show (or even know?!) the number of updated records. Therefore extracting this directly is likely not going to work.
Workaround
First run a count query using the same WHERE clause and log the result
Then do the actual update
Naturally this incurs significant overhead.
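For example, a sketch with a hypothetical ACID-enabled table my_table and a made-up filter (adjust both to your own schema):
-- 1) Log how many rows the update will touch, using the same WHERE clause
SELECT COUNT(*) AS rows_to_update
FROM my_table
WHERE status = 'PENDING';

-- 2) Then run the actual update (UPDATE requires a transactional table in Hive)
UPDATE my_table
SET status = 'PROCESSED'
WHERE status = 'PENDING';

-- Rows inserted or modified between the two statements will make the logged
-- count approximate, which is part of the overhead mentioned above.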

performance issues while processing 2 tables in lockstep based on orderedBy from-to

Title is probably not very clear so let me explain.
I want to process an in-process join (Node.js) on 2 tables*, Session and SessionAction (1-N).
Since these tables are rather big (millions of records each), my idea was to get slices based on an orderBy on sessionId (which they both share) and walk through both tables in lock-step batches.
This, however, proves to be awfully slow. I'm using pseudo code like the following for both tables to get the batches:
table('x').orderBy({index: "sessionId"}).filter(row.sessionId > start && row.sessionId < y)
It seems that even though I'm essentially filtering on an attribute sessionId which has an index, the query planner is not smart enough to see this, and every query does a complete table scan to do the orderBy before filtering afterwards (or so it seems).
Of course, this is incredibly wasteful, but I don't see another option. E.g.:
Ordering after filtering is not supported by RethinkDB.
Getting a slice of the ordered table doesn't work either, since slice enumeration (i.e. the xth through the yth record), for lack of a better word, doesn't line up between the 2 tables.
Questions:
Is my approach indeed expected to be slow, due to having to do a table scan at each iteration/batch?
If so, how could I design my queries to get it working faster?
*) It's too involved to do using RethinkDB's ReQL only.
filter is never indexed in RethinkDB. (In general a particular command will only use a secondary index if you pass index as one of its optional arguments.) You can write that query like this to avoid scanning over the whole table:
r.table('x').orderBy({index: 'sessionId'}).between(start, y, {index: 'sessionId'})

Power Query 'an evaluation is in progress' when merging

When merging two tables in Power Query, an evaluation is run to determine the possible number of matches. I work with pretty large tables (merging a 10K-record table with a 500K-record table), so this can take a long time.
I know there will be matches because I have done this before, and I am not a beginner. Yet Power Query insists on running this evaluation.
Is there any way to bypass this step? It almost feels like needing to turn automatic calculation off in Excel so that you can get on with actually doing something.
Any ideas?
I would add an upstream filter to limit the rows, e.g. Keep Rows / Keep Top Rows / 100. You may need to do this on both queries. Ideally you keep enough rows, or use a specific filter, to get some matches that help your downstream query design work.
Then once the query design is finished, I would remove the filter(s) and let it rip.
This is what PQ should be doing in the Query Editor, but it does seem to go rogue on Merge in particular.

Why does the same select statement have different costs in Oracle?

Recently I used an Oracle 11g database for my homework. I have 12 tables, such as trip_data_11 and trip_data_12.
They have the same structure and almost the same number of records, and I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
The same operation to trip_data_12.
Then I used the following select statement to count the distinct taxis per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It took 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: it took more than 2 hours.
What's more, it did not even finish, and I don't know why.
Today I asked my classmate and he said to clear the cache, so I used the following statements to do it:
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement on SYS.TRIP_DATA_11, I get the same poor performance as with SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fall between -74.2593 and -73.7011? It might be a lot more than, say, between -71.00 and -68.59, even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
As a first step, learn how to use EXPLAIN PLAN. To get better plans, gather statistics on your tables and their indexes using the DBMS_STATS package.
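For example, a sketch using the table names from the question (the WHERE clause is abbreviated here, and the privileges available in your environment may differ):
-- Gather optimizer statistics on the table and its indexes (run from SQL*Plus)
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SYS', tabname => 'TRIP_DATA_12', cascade => TRUE);

-- Ask the optimizer for its plan, then display it
EXPLAIN PLAN FOR
SELECT COUNT(DISTINCT td.medallion)
FROM SYS.TRIP_DATA_12 td
WHERE td.pickup_longitude BETWEEN -74.2593 AND -73.7011
AND td.pickup_latitude BETWEEN 40.4770 AND 40.9171;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);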
One tip. Oracle only uses one index to access a table, so it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from an index designed to serve this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add MEDALLION to the index too.) Experiment with the column order. As latitude has a narrower range than longitude, it should probably go first; maybe the drop-off values should appear before the pick-up ones. You want an index in which the greatest number of related records are clustered together.
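For example, a covering variant might look like the following (a sketch only; add_11_cov is a made-up name, and pickup_datetime is included because the GROUP BY needs it if the query is to avoid the table entirely):
create index add_11_cov on trip_data_11
(pickup_latitude
, pickup_longitude
, dropoff_latitude
, dropoff_longitude
, trip_distance
, passenger_count
, medallion
, pickup_datetime )
;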
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin is right: don't use SYS for application work. Even for a school assignment, you should create a fresh schema and create your tables, etc. in that.

Does the order of columns on a covered index in Sybase affect select performance?

We have a large table, with several indices (say, I1-I5).
The usage pattern is as follows:
Application A: all select queries 100% use indices I1-I4 (assume that they are designed well enough that they will never use I5).
Application B: has only one select query (fairly frequently run), which contains 6 fields and for which a fifth index I5 was created as a covered index.
The first 2 fields of the covered index are date, and a security ID.
The table contains rows for ~100 dates (in date order, enforced by a clustered index I1), and tens of thousands of security identifiers.
Question: does the order of columns in the covered index affect the performance of the select query in Application B?
I.e., would the query performance change if we switched around the first two fields of the index (date and security ID)?
Would the query performance change if we switch around one of the last fields?
I am assuming that the logical IOs would remain unaffected by any order of fields in the covered index (though I'm not 100% sure).
But will there be other performance effects? (Optimizer speed, caching, etc...)
The question is version-generic, but if it matters, we use Sybase 12.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
It depends. If you have a WHERE clause such as the following, you will get better performance out of an index on (security_ID, date_column) than the converse:
WHERE date_column BETWEEN DATE '2009-01-01' AND DATE '2009-08-31'
AND security_ID = 373239
If you have a WHERE clause such as the following, you will get better performance out of an index on (date_column, security_ID) than the converse:
WHERE date_column = DATE '2009-09-01'
AND security_ID > 499231
If you have a WHERE clause such as the following, it really won't matter very much which column appears first:
WHERE date_column = DATE '2009-09-13'
AND security_ID = 211930
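To make the two candidate orderings concrete, the index definitions would look something like this (a sketch with a hypothetical table name big_table and placeholder trailing columns col3..col6; only the order of the two leading columns matters here):
-- better when security_ID is an equality predicate and date_column is a range
create index I5_sec_first on big_table (security_ID, date_column, col3, col4, col5, col6)

-- better when date_column is an equality predicate and security_ID is a range
create index I5_date_first on big_table (date_column, security_ID, col3, col4, col5, col6)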
We'd need to know about the selectivity and conditions on the other columns in the index to know if there are other ways of organizing your index to gain more performance.
Just like your question is version generic, my answer is DBMS-generic.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
The problem is not the size of the table. Millions of rows is nothing for Sybase.
The problem is the absence of a test system.
