Using HLL_COUNT.MERGE outside of SQL - algorithm

I can use the following query to general all the HLL sketches of the distinct counts:
SELECT category, count(distinct city), HLL_COUNT.INIT(city) FROM `table`
GROUP BY category
And I get something like this:
While I would normally use the HLL_COUNT.merge(...) function to get the total count, for example:
select 'all -- hll', HLL_COUNT.MERGE(x), null from (select category, count(distinct city), HLL_COUNT.INIT(city) x from `datadocs-163219.010ff92f6a62438aa47c10005fe98fc9.inv` group by category) _
For various reasons, I need to do the MERGE outside of SQL/BigQuery. Is there some sort of library/open source library where I could do something like the following:
>>> hll_set
>>> {'CHAQMBgCIAuCBz8QFBgPIBQyN8hxlqEBvMMBnLMBgWnD5gTB3AH+ROgD/YMEpM8Jr70C6Q2LwwfZlQ3QMNu8AYDSBKf7AbOSqgE=', 'CHAQDhgCIAuCBxwQBxgPIBQyFP3PBMBtibMR3sgC77oViasKwfMF', 'CHAQJxgCIAuCBzIQEBgPIBQyKshxlqEBvMMBzfECh6gJxJABoNwF/rEGwf0PgYYFvOoFmzjJPZwg2y3nbw==', 'CHAQBBgCIAuCBw4QAhgPIBQyBpSJAfapKA==', 'CHAQBRgCIAuCBxEQAxgPIBQyCbaJBfqsH57tBw==', 'CHAQGBgCIAuCBykQDRgPIBQyId6SAtNvwJ0XgO8Ct/EFlvUOskG1E87ZA7/OApwg2y3nbw==', 'CHAQZhgCIAuCB2MQIxgPIBQyW5SJAcqJAbzDAcvcAoIV2xSMFsTyA42IAYkl+Wvj/AHqdJxRlEGbywG/WNjoAqS9BP3CAuPrBNSFAfdDt+YEoeIBr+ICmIYF6CL/MaLNAqKdA8k9rxntBrPVrAE=', 'CHAQEBgCIAuCByQQChgPIBQyHN6SAqjtArAJ/esCj9wSg+8KiVKNygHrpgXIogU=', 'CHAQpgkYAiALggfZAhChARgPIBQyzwKPBMwRkAzxP+wPogyqC8qJAeBo8BHsSOypAbAJriL+MYYR/1jnKqIyzR3wJIkI/QXkecNH7WCzQZgMuDvxFLh+xkboA7QB12akDhu5E+4+3KgBjAZ4nxLBRMw0xRWvIPZYszt+v1gnz2a0BZoF4wzQggHqOewsJeAxgguGErUCjGG3KuhKgUyfCtItkjOMZZwCpi3phgHlA+wRknEhwiq1Os4slgmhELEWl1f1rgH+B6e4AdCtAdkE4R7fK/gihHSRFqipAbYY9BmqP5oBgqsBvhrvEKGRAcpj7XHEVaAUrY8BylLRDgWn1wGpT6IS6irPHewb/AbKHqgQjQPyAeU82zuSHpgQ04UBzwqkFIADiBD4X6ABjBihFsIy6wmovgHNKssPsQOvGcADrQOQevMQvxKMBtANizqbP7l21+kB0UDxY92rVYCBMcD5H8CiEA=='}
>>> hll_merge_method(hll_set)
>>> 193
Is it possible to do this in any way using a library outside of BQ with the hash generated from it?

That's a feature request you might already find in the issue tracker: the current hash is Google proprietary, but one day BigQuery could use an open one. Vote that request up.
https://issuetracker.google.com/issues/62153424
There might be news soon, and subscribing to the issue will keep you updated.
2019 Update: Find the open source version of BigQuery's HyperLogLog++ at:
https://github.com/google/zetasketch

Related

SSIS Google Analytics - Reporting dimensions & metrics are incompatible

I'm trying to extract data from Google Analytics, using KingswaySoft - SSIS Integration Toolkit, in Visual Studio.
I've set the metrics and dimensions, but I get this error message:
Please remove transactions to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
I've tried to remove transactions metric and it works, but this metric is really necessary.
Metrics: sessionConversionRate, sessions, totalUsers, transactions
Dimensions: campaignName, country, dateHour, deviceCategory, sourceMedium
Any idea on how to solve it?
I'm not sure how helpful this suggestion is but could a possible work around include having two queries.
Query 1: Existing query without transactions
Query 2: The same dimensions with transactionId included
The idea would be to use the SSIS Aggregate component to group by the original dimensions and count the transactions. You could then merge the queries together via a merge join.
Would that work?
The API supports what it supports. So if you've attempted to pair things that are incompatible, you won't get any data back. Things that seem like they should totally work go together like orange juice and milk.
While I worked on the GA stuff through Python, an approach we found helped us work through incompatible metrics and total metrics was to make multiple pulls using the same dimensions. As the data sets are at the same level of grain, as long as you match up each dimension in the set, you can have all the metrics you want.
In your case, I'd have 2 data flows, followed by an Execute SQL Task that brings the data together for the final table
DFT1: Query1 -> Derived Column -> Stage.Table1
DFT2: Query2 -> Derived Column -> Stage.Table2
Execute SQL Task
SELECT
T1.*, T2.Metric_A, T2.Metric_B, ... T2.Metric_Z
INTO
#T
FROM
Stage.T1 AS T1
INNER JOIN
Stage.T2 AS T2
ON T2.Dim1 = T1.Dim1 /* etc */ AND T2.Dim7 = T1.Dim7
-- Update you have solid data aka
-- isDataGolden exists in the "data" section of the response
-- Usually within 7? days but possibly sooner
UPDATE
X
SET
metric1 = S.metric1 /* etc */
FROM
dbo.X AS X
INNER JOIN #T AS T
ON T.Dim1 = X.Dim1
WHERE
X.isDataGolden IS NULL
AND T.isDataGolden IS NOT NULL;
-- Add new data but be aware that not all nodes might have
-- reported in.
INSERT INTO
dbo.X
SELECT
*
FROM
#T AS T
WHERE
NOT EXISTS (SELECT * FROM dbo.X AS X WHERE X.Dim1 = T.Dim1 /* etc */);

Oracle Sql group function is not allowed here

I need someone who can explain me about "group function is not allowed here" because I don't understand it and I would like to understand it.
I have to get the product name and the unit price of the products that have a price above the average
I initially tried to use this, but oracle quickly told me that it was wrong.
SELECT productname,unitprice
FROM products
WHERE unitprice>(AVG(unitprice));
search for information and found that I could get it this way:
SELECT productname,unitprice FROM products
WHERE unitprice > (SELECT AVG(unitprice) FROM products);
What I want to know is why do you put two select?
What does group function is not allowed here mean?
More than once I have encountered this error and I would like to be able to understand what to do when it appears
Thank you very much for your time
The phrase "group function not allowed here" is referring to anything that is in some way an "aggregation" of data, eg SUM, MIN, MAX, etc et. These functions must operate on a set of rows, and to operate on a set of rows you need to do a SELECT statement. (I'm leaving out UPDATE/DELETE here)
If this was not the case, you would end up with ambiguities, for example, lets say we allowed this:
select *
from products
where region = 'USA'
and avg(price) > 10
Does this mean you want the average prices across all products, or just the average price for those products in the USA? The syntax is no longer deterministic.
Here's another option:
SELECT *
FROM (
SELECT productname,unitprice,AVG(unitprice) OVER (PARTITION BY 1) avg_price
FROM products)
WHERE unitprice > avg_price
The reason your original SQL doesn't work is because you didn't tell Oracle how to compute the average. What table should it find it in? What rows should it include? What, if any, grouping do you wish to apply? None of that is communicated with "WHERE unitprice>(AVG(unitprice))".
Now, as a human, I can make a pretty educated guess that you intend the averaging to happen over the same set of rows you select from the main query, with the same granularity (no grouping). We can accomplish that either by using a sub-query to make a second pass on the table, as your second SQL did, or the newer windowing capabilities of aggregate functions to internally make a second pass on your query block results, as I did in my answer. Using the OVER clause, you can tell Oracle exactly what rows to include (ROWS BETWEEN ...) and how to group it (PARTITION BY...).

Loading items from service into a single select

I am using IBM BPM V8.5.7. I want to use a single select to select an item which are retrieved via a service. The problem is that the service returns 10k+ items and I currently use a search function. What I want to achieve basically is to search in the single select dropdown and whilst doing that it must fire the service and return the search results.
I have managed to get the search function working properly but searching through so many items slows down BPM. I've tried using a piece of code but that just cleared the items once the service was called
Your DB-Select probably looks somewhat like this:
SELECT name, value FROM YOURTABLE
This will stress your ibm-control (are you using the classic controls or something else like apex)?
Try to limit the result-set from your SQL-Query. One example could be:
SELECT name, value FROM YOURTABLE FETCH FIRST 100 ROWS ONLY
(be aware this is DB2 style) - in MS SQL this looks like this:
SELECT TOP(100) name, value FROM YOURTABLE

Is there a way to randomize search results (record ids) with Sphinx?

I have a complex SphinxQL query which, at the end, orders results by a specific field, Preferred, so that all records with that indexed value of Preferred=1 come before all records w Preferred=0. I also order by weight() so basically I end up with:
Select * from idx_X where MATCH('various parameters') ORDER by Preferred DESC,Weight() Desc
The problem is that, though Preferred records come first I end up with records sorted by ID which puts results from one field, Vendor, in blocks so for instance I get:
Beta Shipping
Beta Shipping
Beta Shipping
Acme Widgets
Acme Widgets
Acme Widgets
Acme Widgets
Acme Widgets
Which doesn't serve my purposes in this case well (often one 'Vendor' will have 1000 results)
So I'm looking to essentially do:
ORDER BY Preferred DESC,weight() DESC,ID RANDOM
So that after getting to Preferred Vendors whose weight is (e.g.) 100, I will get random Vendors vs blocks of them.
Update: Though I did find what appears to be a possible answer in another Stackoveflow Question
The issue is it seems to require the SPH_SORT_EXTENDED and I am forced to use SPH_RANK_PROXIMITY (ranker=proximity) and I am unclear if I can combine ranking and sorting.
Update 2: If I remove my existing two-level Order and just do Order by Rand() it indeed returns random IDs. However I cannot add Rand() after Order by Preferred DESC,Weight() DESC or I get the following error:
1064 - sphinxql: syntax error, unexpected '(', expecting $end near '()
Sadly yes, RAND() only works as a single sort order expression, but it DOES work as a select function....
Select *, RAND() AS r from idx_X where MATCH('various parameters')
ORDER by Preferred DESC,Weight() Desc, r DESC
Or if want a more consistent ordering, but still mixed, can for example use CRC32() function on a string atribute
Select *, CRC32(title) AS r from idx_X where MATCH('various parameters')
ORDER by Preferred DESC,Weight() Desc, r DESC
Can also just limit results to a few per vendor (vendor will need to be an attribute)
Select * from idx_X where MATCH('various parameters')
GROUP 3 BY vendor_id ORDER by Preferred DESC,Weight() Desc
Group by N is a little known by very useful sphinx feature.

Cache and SqlCacheDependency (ASP.NET MVC)

We need to return subset of records and for that we use the following command:
using (SqlCommand command = new SqlCommand(
"SELECT ID, Name, Flag, IsDefault FROM (SELECT ROW_NUMBER() OVER (ORDER BY #OrderBy DESC) as Row, ID, Name, Flag, IsDefault FROM dbo.Languages) results WHERE Row BETWEEN ((#Page - 1) * #ItemsPerPage + 1) AND (#Page * #ItemsPerPage)",
connection))
I set a SqlCacheDependency declared like this:
SqlCacheDependency cacheDependency = new SqlCacheDependency(command);
But immediately after I run the command.ExecuteReader() instruction, the hasChanged base property of the SqlCacheDependency object becomes true although I did not change the result of the query in any way! And, because of this, the result of this query is not kept in cache.
HttpRuntime.Cache.Insert( cacheKey, list, cacheDependency, Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(AppConfiguration.CacheExpiration.VeryLowActivity));
Is it because the command has 2 SELECT statements? Is it ROW_NUMBER()? If yes, is there any other way to paginate results?
Please help! After too many hours, a little will be greatly appreciated! Thank you
Running into the same issue and finding the same answers online without any help, I was reasearching the xml invalid subsicription response from profiler.
I found an example on msdn support site that had a slightly different order of code. When I tried it I realized the problem - Don't open your connection object until after you've created the command object and the cache dependency object. Here is the order you must follow and all will be good:
Be sure to enable notifications (SqlCahceDependencyAdmin) and run SqlDependency.Start first
Create the connection object
Create the command object and assign command text, type, and connection object (any combination of constructors, setting properties, or using CreateCommand).
Create the sql cache dependency object
Open the connection object
Execute the query
Add item to cache using dependency.
If you follow this order, and follow all other requirements on your select statement, don't have any permissions issues, this will work!
I believe the issue has to do with how the .NET framework manages the connection, specifically what settings are set. I tried overriding this in my sql command test but it never worked. This is only a guess - what I do know is changing the order immediately solved the issue.
I was able to piece it together from the following to msdn posts.
This post was one of the more common causes of the invalid subscription, and shows how the .Net client sets the properties that are in contrast to what notification requires.
https://social.msdn.microsoft.com/Forums/en-US/cf3853f3-0ea1-41b9-987e-9922e5766066/changing-default-set-options-forced-by-net?forum=adodotnetdataproviders
Then this post was from a user who, like me, had reduced his code to the simplest format. My original code pattern was similar to his.
https://social.technet.microsoft.com/Forums/windows/en-US/5a29d49b-8c2c-4fe8-b8de-d632a3f60f68/subscriptions-always-invalid-usual-suspects-checked-no-joy?forum=sqlservicebroker
Then I found this post, also a very simple reduction of the problem, only his was a simple issue - needing 2 part name for tables. In his case the suggestion resolved the issue. After looking at his code I noticed the main difference was waiting to open the connection object until AFTER the command object AND the dependency object were created. My only assumption is under the hood (I have not yet started reflector to check so only an assumption) the Connection object is opened differently, or order of events and command happen differently, because of this association.
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/bc9ca094-a989-4403-82c6-7f608ed462ce/sql-server-not-creating-subscription-for-simple-select-query-when-using-sqlcachedependency?forum=sqlservicebroker
I hope this helps someone else in a similar issue.
Just a guess, but could it be because your SELECT statement doesn't have an ORDER BY clause?
If you don't specify an explicit ordering then it's possible for the query to return the results in any order each time it is run. Maybe this is causing the SqlCacheDependency object to think that the results have changed.
Try adding an ORDER BY clause:
SELECT ID, Name, Flag, IsDefault
FROM
(
SELECT ROW_NUMBER() OVER (ORDER BY #OrderBy DESC) AS Row,
ID, Name, Flag, IsDefault
FROM dbo.Languages
) AS results
WHERE Row BETWEEN ((#Page - 1) * #ItemsPerPage + 1) AND (#Page * #ItemsPerPage)
ORDER BY Row
i'm no expert on SqlCacheDependency, in fact, i found this question whilst looking for answers to my own issues with it! However, i believe the reason your SqlCacheDependency is not working is because your SQL contains a nested sub query.
Take a look at the documentation which lists what you can/can not use in your SQL: Creating a Query for Notification
"....The statement must not contain subqueries, outer joins, or self-joins....."
I also found some invaluable troubleshooting info from a guy at Redgate here: Using and Monitoring SQL 2005 Query Notification that helped me solve my own problem: By using Sql Profiler to trace the QN events he suggests, i was able to spot my connection was incorrectly using the 'SET ARITHABORT OFF' option, causing my notifications to fail.

Resources