Apache Druid: count outliers

I prepared an installation of Apache Druid that takes data from a Kafka topic. It works very smoothly and efficiently.
I'm currently trying to implement some queries, and I'm stuck on counting the rows (grouped by some fields) for which a column value is an outlier. In the normal SQL world, I would essentially compute the first and third quartiles (q1 and q3) and then use something like (I'm interested only in "right" outliers):
SUM(IF(column_value > q3 + 1.5*(q3-q1), 1, 0))
This approach makes use of a CTE and a join: I compute the quartiles in a CTE with grouping and then join it with the original table.
I was able to easily compute the quartiles and the outlier threshold with the DataSketches extension using a groupBy query, but I can't figure out how to write a postAggregation that performs the count.
In theory, I could run a second query using the thresholds obtained in the first. Unfortunately, there can be hundreds of thousands of distinct groups, which makes this approach unfeasible.
Do you have any suggestions on how to tackle this problem?

As of version 0.18.0, Apache Druid supports joins. This solves the problem.
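As an illustration, here is a minimal sketch in Druid SQL, assuming a datasource named events with a grouping column grp and a numeric column value (all of these names are placeholders) and assuming the DataSketches extension is loaded so that APPROX_QUANTILE_DS is available. The inner query computes q1 and q3 per group, and the outer query joins back to the raw rows and counts the right outliers:
-- sketch only: datasource and column names are illustrative
SELECT e.grp,
       SUM(CASE WHEN e."value" > t.q3 + 1.5 * (t.q3 - t.q1) THEN 1 ELSE 0 END) AS right_outliers
FROM events e
JOIN (
  SELECT grp,
         APPROX_QUANTILE_DS("value", 0.25) AS q1,
         APPROX_QUANTILE_DS("value", 0.75) AS q3
  FROM events
  GROUP BY grp
) t ON e.grp = t.grp
GROUP BY e.grp
Note that Druid materializes the subquery result and broadcasts it for the join, so with hundreds of thousands of distinct groups you may need to raise the subquery row limits and check memory usage before relying on this in production.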

Related

Neo4j 3.4.4 - Slow processing if using LIST

Two queries that are identical in terms of processing, run multiple times to avoid cache distortions in the timings:
MATCH (p:Pathway {name: {PNAME}}), (t:target {symbol: {TNAME}}) MERGE (p)-[:INVOLVES]->(t)
Above runs 11,100 commands per second
UNWIND {LIST} AS i MATCH (p:Pathway {name: i.PNAME}), (t:target {symbol: i.TNAME}) MERGE (p)-[:INVOLVES]->(t)
Above runs 547 commands per second on the same data set.
Windows 10 Pro, 64 GB RAM, SSD, Python 3.7
There are unique constraints on both properties used in the statements above, and both indexes are online.
The LIST statement is dramatically faster in other situations, so I like using it for bulk operations. I tested on Neo4j 3.4 and today on 3.4.4, with Python 3.6 and 3.7, using the latest neo4j-driver. Same results. My guess is that query planning is not using the index. There are about 40,000 nodes in Pathway and 25,000 in target.
Any suggestions? Thanks in advance.
Query plan when using a list. For this profile, the list contained one record.
Suggestion: the plan optimizer could calculate the number of records in the list to determine whether to scan in all records or to use the unique index per record. Maybe set a threshold: if fewer than 10% of rows will be needed, use the unique index. Just a thought for the Neo4j developers. In the meantime I have dropped the LIST version.
The UNWIND {list} is what's killing you. You are completely changing the dynamic of the query from one to the other.
The first query is a simple two-node lookup. The second query creates a bunch of rows, and then does a per-row two-node match.
In the first example, it is obvious to the Cypher planner to use the index for the match. In the latter, the planner doesn't know for sure what the best way to proceed is: run against the index for every row, or scan all the nodes and try to get what it needs in one pass (or something else)?
You can use Cypher hints to try to help the planner choose the right one (see the sketch below), but in your case, use the first query. The first query is simpler and easier for the Cypher planner to plan, and the planner will cache the plan so that it doesn't need to figure out what to do each time you re-run it. (The second query will be cached too, but as far as I can tell it is only trying, and failing, to reproduce the performance boost of a parameterized query, so why not just use the built-in Neo4j one?)
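For completeness, here is a sketch of what such hints could look like on the UNWIND variant, assuming the unique constraints back indexes on Pathway(name) and target(symbol); depending on the Neo4j version, the planner may still reject or ignore the hints:
// hypothetical hinted version of the UNWIND query; the indexes are assumed
UNWIND {LIST} AS i
MATCH (p:Pathway {name: i.PNAME}), (t:target {symbol: i.TNAME})
USING INDEX p:Pathway(name)
USING INDEX t:target(symbol)
MERGE (p)-[:INVOLVES]->(t)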

Vertica query optimization

I want to optimize a query in a Vertica database. I have a table like this:
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using this query:
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this:
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I run EXPLAIN on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over some part of the table. In other databases I can create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have the concept of indexes. You would want to create a query-specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
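A minimal sketch of those two suggestions, assuming the data table lives in the public schema and using a hypothetical helper table wanted_b to hold the IN-list values:
SELECT ANALYZE_STATISTICS('public.data');   -- refresh the optimizer statistics

CREATE LOCAL TEMPORARY TABLE wanted_b (b INT) ON COMMIT PRESERVE ROWS;
INSERT INTO wanted_b VALUES (1);
INSERT INTO wanted_b VALUES (2);
INSERT INTO wanted_b VALUES (3);

-- join against the small list instead of a long IN (...) predicate
SELECT d.b, d.c
FROM data d
JOIN wanted_b w ON d.b = w.b
WHERE d.a = 1;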
Creating good projections is the "secret sauce" of making Vertica perform well. Projection design is a bit of an art form, but there are three fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing them to affect only one small ROS container instead of the entire file.
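Putting the first two concepts together for the table in the question, a query-specific projection might look roughly like this (a sketch only; the projection name and segmentation key are illustrative, and the right choices depend on your cluster and workload):
-- order by the filter columns a, then b; segment to spread rows across nodes
CREATE PROJECTION data_a_b (a, b, c)
AS SELECT a, b, c
   FROM data
   ORDER BY a, b
SEGMENTED BY HASH(a, b) ALL NODES;

SELECT REFRESH('data');   -- populate the new projection from existing data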

Reducing data with data stage

I've been asked to reduce an existing data model using DataStage ETL.
It's more of an exercise and a way to get to know this program, which I'm very new to.
Of course, the data shall be reduced following some functional rules.
Table : MEMBERSHIP (..,A,B,C) # where A,B,C are different attributes (our filters)
Reducing the data from ~700k rows to 7k rows or so.
I was thinking about keeping the same percentages as in the data source.
Therefore, if we have 70% of A, 20% of B and 10% of C, we would keep pretty much the same percentages in the reduced version.
I'm looking for the best way to do so and the built-in stages to use (maybe the Aggregator stage?).
Is there any way to do some scripting similar to PL with DataStage?
I hope I've been clear enough. If you have any advice I'd be very grateful.
Thanks to all of you.
~Whitoo
DataStage does not do percentage-wise reductions.
What you can do is use a Transformer stage or a Filter stage to filter out data from the source based on certain conditions. But, like I said, the conditions have to be very specific (for example, select only those records which have A = [somevalue] or A <> [somevalue]).
DataStage PX has the Sample stage, which allows you to specify what percentage of the data you want it to sample: http://datastage4you.blogspot.com/2014/01/sample-stage-in-datastage.html.

Netezza/PureData - Bad distribution key chosen in HASH JOIN

I am using Netezza/PureData for a query. I have an INNER JOIN (which became a HASH JOIN) on two columns A and B. A is a column that has good distribution and B is a column that has bad distribution. For some reason, my query plan always uses B instead of A as the distribution key for that JOIN, which causes immense performance issues.
GENERATE STATISTICS does help alleviate this issue, but due to performance constraints it is not feasible to run GENERATE STATISTICS before every query. I do it before a batch run, but not between each query within a batch.
In a nutshell, the source tables have good distributions, but when I join them, the plan chooses a bad distribution key (a column that is never actually used as a distribution column in the sources).
So my question is: what are some good ways to influence the choice of distribution key in a JOIN without doing GENERATE STATISTICS? I've tried changing around the distribution columns of the source tables, but that didn't do much, even when I make sure all the skews are less than 0.5.
You could create a temp table and force the distribution so that both sides of the join align; this should expedite the join.
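A minimal sketch of that idea, using hypothetical table names fact_table and dim_table and assuming A is the well-distributed join column:
-- redistribute one side of the join on the good column before joining
CREATE TEMP TABLE fact_redist AS
SELECT *
FROM fact_table
DISTRIBUTE ON (a);

SELECT f.a, f.b, d.some_col   -- hypothetical column list
FROM fact_redist f
JOIN dim_table d
  ON f.a = d.a
 AND f.b = d.b;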
The workaround is to force the exhaustive planner to be used:
set num_star_planner_rels = X; -- Set X to very high.
According to the IBM Netezza team, queries with more than 7 entities (i.e., number of tables) will use a greedy query planner called "Snowflake". At 7 or fewer entities, it will use the brute-force approach to find the best plan.
The trade-off is that the exhaustive search is very expensive for a large number of entities.

Neo4j Spatial order by distance

I'm currently using Spatial for my queries as follows:
START b=node:LocationIndex('withinDistance:[70.67,12.998,6.0]')
RETURN b
ORDER BY b.score
b is an entity that has a score, and I'd like to order by this score, but I found a case in which all the entities with score 0 were not ordered by distance. I know Spatial automatically orders by distance, but once I force ordering by another field, I lose that order.
Is there any way of forcing distance as a secondary ordering field, like:
START b=node:LocationIndex('withinDistance:[70.67,12.998,6.0]')
RETURN b
ORDER BY b.score, ?distance?
Unfortunately, in the current spatial plugin there is no Cypher support at all, so the distance function (or the distance result) cannot be accessed by the ORDER BY.
As you already noticed, the withinDistance function in the index itself returns results ordered by distance. If you do not add an extra ORDER BY in the Cypher query, the distance order is maintained. However, when you add the extra ORDER BY, the original order is lost. It would be an interesting feature request to the Cypher developers to maintain the original order for elements that compare as equal in the ORDER BY.
There is also a separate plan to develop spatial functions within cypher itself, and that will solve the problem the way you want. But there is not yet any information on a development or release schedule for this.
One additional option that might help you in a shorter time frame, and is independent of the Neo4j development plans themselves, is to add an order-by extension to the spatial index query. Right now you are specifying the index query as 'withinDistance:[70.67,12.998,6.0]', but you could edit the Spatial plugin code to support passing extra parameters to this query, such as an order-by parameter. Then you would have complete control over the order.

Resources