How to sort an optimized keys-only query

I'm using the native Datastore API to access the GAE database (for well-studied, specific reasons). I wanted to optimize the code and use memcache in my requests instead of directly grabbing the values; the issue is that my query is sorted.
When I call findProductsByFiltersQuery.setKeysOnly() on my query, I receive this error:
The provided keys-only multi-query needs to perform some sorting in
memory. As a result, this query can only be sorted by the key
property as this is the only property that is available in memory.
The weird thing is that it only starts happening once the request reaches a certain complexity. For example, this request fails:
SELECT __key__ FROM Product WHERE dynaValue = _Rs:2 AND productState = PUBLISHED AND dynaValue = _RC:2 AND dynaValue = RF:1 AND dynaValue = ct:1030003 AND dynaValue = _RS:4 AND dynaValue = _px:2 AND itemType = NEWS ORDER BY modificationDate DESC
while this one passes:
SELECT __key__ FROM Product WHERE itemType = CI AND productState = PUBLISHED ORDER BY modificationDate DESC
Can someone explain why this is happening, and if ordering is not possible when fetching only keys, what is that feature for? Since results are paginated, it is useless to get a bad set of keys from the first filtering request. So how is this meant to be used?
Please also note that when I run a very long non-keys-only request, I receive this message:
Splitting the provided query requires that too many subqueries are
merged in memory.
at com.google.appengine.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:99)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:71)
Can someone explain how there can be in-memory processing when the values are indexed? Or is it only the dev-mode server that raises this error?

In-memory queries are necessary when you use the OR, IN, and != operators in Datastore. As described in this blog post, queries using these operators are split on the client into multiple Datastore queries. For example:
SELECT * FROM Foo WHERE A = 1 OR A = 2
gets split into two queries:
SELECT * FROM Foo WHERE A = 1
SELECT * FROM Foo WHERE A = 2
If you add ORDER BY B to your query, both sub-queries get this order:
SELECT * FROM Foo WHERE A = 1 ORDER BY B
SELECT * FROM Foo WHERE A = 2 ORDER BY B
While each of these Datastore queries returns results sorted by B, the union of the queries is not automatically in that order. On the client side, the SDK merges the results from the two queries, ordering by B.
In order to do this, the Datastore queries must actually return the ordered property, otherwise the SDK won't know the correct way to merge these together.
If you are writing queries with a large number of filters, make sure to only use AND filters. This will allow all the operations to be performed only in the Datastore, in which case no in-memory sorting is necessary.
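As a rough illustration (not from the original answer, and assuming a matching composite index on productState, itemType and modificationDate exists), here is a minimal keys-only query using the low-level Java Datastore API with equality-only AND filters; because nothing gets split into sub-queries, the sort on modificationDate is served straight from the index. The dynaValue filters from the question are omitted for brevity.
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import com.google.appengine.api.datastore.Query.SortDirection;
import java.util.ArrayList;
import java.util.List;

public class KeysOnlySortExample {

    // Sketch: equality-only filters, so the Datastore never splits the query
    // and the sort on modificationDate comes directly from the index.
    public static List<Key> publishedNewsKeys() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        Query q = new Query("Product")
                .setFilter(CompositeFilterOperator.and(
                        new FilterPredicate("productState", FilterOperator.EQUAL, "PUBLISHED"),
                        new FilterPredicate("itemType", FilterOperator.EQUAL, "NEWS")))
                .addSort("modificationDate", SortDirection.DESCENDING);
        q.setKeysOnly();

        List<Key> keys = new ArrayList<Key>();
        for (Entity e : ds.prepare(q).asIterable(FetchOptions.Builder.withLimit(20))) {
            keys.add(e.getKey());  // only the key is populated in a keys-only result
        }
        return keys;
    }
}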

Related

SSIS Google Analytics - Reporting dimensions & metrics are incompatible

I'm trying to extract data from Google Analytics, using KingswaySoft - SSIS Integration Toolkit, in Visual Studio.
I've set the metrics and dimensions, but I get this error message:
Please remove transactions to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
I've tried removing the transactions metric and it works, but this metric is really necessary.
Metrics: sessionConversionRate, sessions, totalUsers, transactions
Dimensions: campaignName, country, dateHour, deviceCategory, sourceMedium
Any idea on how to solve it?
I'm not sure how helpful this suggestion is, but could a possible workaround be to use two queries?
Query 1: Existing query without transactions
Query 2: The same dimensions with transactionId included
The idea would be to use the SSIS Aggregate component to group by the original dimensions and count the transactions. You could then merge the queries together via a merge join.
Would that work?
The API supports what it supports. So if you've attempted to pair things that are incompatible, you won't get any data back. Things that seem like they should totally work go together like orange juice and milk.
While I worked on the GA stuff through Python, an approach we found helpful for working around incompatible metrics and totals was to make multiple pulls using the same dimensions. Since the data sets are at the same level of grain, as long as you match up each dimension in the set, you can have all the metrics you want.
In your case, I'd have two data flows, followed by an Execute SQL Task that brings the data together for the final table:
DFT1: Query1 -> Derived Column -> Stage.Table1
DFT2: Query2 -> Derived Column -> Stage.Table2
Execute SQL Task
SELECT
    T1.*, T2.Metric_A, T2.Metric_B, ... T2.Metric_Z
INTO
    #T
FROM
    Stage.T1 AS T1
    INNER JOIN Stage.T2 AS T2
        ON T2.Dim1 = T1.Dim1 /* etc */ AND T2.Dim7 = T1.Dim7

-- Update rows where you now have solid data, i.e.
-- isDataGolden exists in the "data" section of the response
-- (usually within 7? days but possibly sooner)
UPDATE
    X
SET
    metric1 = T.metric1 /* etc */
FROM
    dbo.X AS X
    INNER JOIN #T AS T
        ON T.Dim1 = X.Dim1
WHERE
    X.isDataGolden IS NULL
    AND T.isDataGolden IS NOT NULL;

-- Add new data, but be aware that not all nodes might have
-- reported in.
INSERT INTO
    dbo.X
SELECT
    *
FROM
    #T AS T
WHERE
    NOT EXISTS (SELECT * FROM dbo.X AS X WHERE X.Dim1 = T.Dim1 /* etc */);

How to implement conditional relationship clause in cypher for Spring Neo4j query?

I have a cypher query called from Java (Spring Neo4j) that I'm using, which works.
The basic query is straightforward - it asks what nodes lead to another node.
The gist of the challenge is that there are a set (of three) items - contributor, tag, and source - that the query might need to filter its results by.
Two of these - tag and source - bring with them a relationship that needs to be tested. But if these are not provided, then that relationship does not need to be tested.
The Neo4j browser warns me that the query I'm using will be inefficient, probably for obvious reasons:
#Query("MATCH (p:BoardPosition)<-[:PARENT]-(n:BoardPosition)<-[:PARENT*0..]-(c:BoardPosition),
(t:Tag), (s:JosekiSource) WHERE
id(p)={TargetID} AND
({ContributorID} IS NULL or c.contributor = {ContributorID}) AND
({TagID} IS NULL or ((c)-->(t) AND id(t) = {TagID}) ) AND
({SourceID} IS NULL or ((c)-->(s) AND id(s) = {SourceID}) )
RETURN DISTINCT n"
)
List<BoardPosition> findFilteredVariations(#Param("TargetID") Long targetId,
#Param("ContributorID") Long contributorId,
#Param("TagID") Long tagId,
#Param("SourceID") Long sourceId);
(Java multi-line query string concatenation removed from the code for clarity)
It would appear that this would be more efficient if the conditional inclusion of the relationship-test could be in the MATCH clause itself. But that does not appear possible.
Is there some in-cypher way I can improve the efficiency of this?
FWIW: the Neo4j browser warning is:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing.

Equal distribution of couchdb document keys for parallel processing

I have a CouchDB instance where each document has a unique id (string). I would like to go over each document in the db and perform some external operation based on the contents of each document (e.g. connecting to another web server to get specific details, etc.). However, instead of sequentially going over each document, is it possible to first get a list of k buckets of these document keys, each represented by a starting key + ending key (the id being the key), then query for all documents in each of these buckets separately and do the external operation on each bucket's documents in parallel?
I currently use couchdb-python to access my db and views. For example, this is the code I currently use:
for res in db.view("mydbviews/id"):
    doc = db[res.id]
    do_external_operation(doc)  # Time-consuming operation
It would be great if I could do something like 'parallel for' for the above loop.
Assuming that you're only emitting one result per document in the view, then presumably running the view with start and end keys, along with some Python parallelisation technique, is sufficient here. As @Ved says, the bigger issue here is the parallel processing, rather than generating the subsets of documents. I'd recommend the multiprocessing module, like so:
import multiprocessing

def work_on_subset(viewname, key_low, key_high):
    rows = db.view(viewname, startkey=key_low, endkey=key_high)
    for row in rows:
        pass  # Do your work here

viewname = '_design/designname/_view/viewname'
key_list = [('a', 'z'), ('1', '10')]  # Or whatever subsets you want
pool = multiprocessing.Pool(processes=10)  # Or however many you want
result = []
for (key_low, key_high) in key_list:
    result.append(pool.apply_async(work_on_subset, args=(viewname, key_low, key_high)))
pool.close()
pool.join()

Is it a good idea to store and access an active query resultset in ColdFusion vs re-querying the database?

I have a product search engine using ColdFusion 8 and MySQL 5.0.88.
The product search has two display modes: Multiple View and Single View.
Multiple View displays basic record info; Single View requires additional data to be polled from the database.
Right now a user does a search and I'm polling the database for
(a) total records and
(b) records FROM to TO.
The user always goes to Single view from his current resultset, so my idea was to store the current resultset for each user and not have to query the database again to get (waste a) the overall number of records and (waste b) the single record I already queried before, and then fetch only the detail information I still need for the Single view.
However, I'm getting nowhere with this.
I cannot cache the current resultset-query, because it's unique to each user(session).
The queries are running inside a CFINVOKED method inside a CFC I'm calling through AJAX, so the whole query runs and afterwards the CFC and CFINVOKE method are discarded, so I can't use query of query or variables.cfc_storage.
So my idea was to store the current resultset in the Session scope, which will be updated with every new search, the user runs (either pagination or completely new search). The maximum results stored will be the number of results displayed.
I can store the query allright, using:
<cfset Session.resultset = query_name>
This stores the whole query with results, like so:
query
CACHED: false
EXECUTIONTIME: 2031
SQL: SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON
...
LEFT JOIN f ON
...
WHERE a.aktiv = "ja"
AND
... 20 conditions ...
SQLPARAMETERS: [array]
1) ... 20+ parameters
RESULTSET:
[Record # 1]
a: true
style: 402
price: 2.3
currency: CHF
...
[Record # 2]
a: true
style: 402abc
...
This would be overwritten every time a user does a new search. However, if a user wants to see the details of one of these items, I don't need to run the queries (total number of records & get one record) if I can access the record I need from my temp storage. This way I would save two database trips worth 2031 in execution time each, to get data which I already pulled before.
The tradeoff would be every user having a resultset of up to 48 results (max number of items per page) in Session.scope.
My questions:
1. Is this feasible, or should I re-query the database?
2. If I have a structure/array/object like the above, how do I pick the record I need out of it by style number, i.e. how do I access the resultset? I can't just loop over the stored query (tried this for a while now...).
Thanks for help!
KISS rule. Just re-query the database unless you find that performance is really an issue. With the correct index, it should scale pretty well. When it is an issue, you can simply add a query cache there.
QoQ would introduce overhead (on the CF side: memory and computation), and might return stale data (where the query in session is older than the one in the DB). I only use QoQ when the same query is used on the same view, but not throughout a Session's time span.
Feasible? Yes, depending on how many users and how much data this stores in memory, it's probably much better than going to the DB again.
It seems like the best way to get the single record you want is a query of query. In CF you can create another query that uses an existing query as its data source. It would look like this:
<cfquery name="subQuery" dbtype="query">
SELECT *
FROM Session.resultset
WHERE style = #SelectedStyleVariable#
</cfquery>
Note that if you are using CFBuilder, it will probably scream "Error" at you for not having a datasource. This is a bug in CFBuilder; you are not required to have a datasource if your dbtype is "query".
Depending on how many records, what I would do is have the detail data stored in application scope as a structure where the ID is the key. Something like:
APPLICATION.products[product_id].product_name
.product_price
.product_attribute
Then you would really only need to query for the ID of the item on demand.
And to improve the "on demand" query, you have at least two "in code" options:
1. A query of query, where you query the entire collection of items once, and then query from that for the data you need.
2. Verity or SOLR to index everything and then you'd only have to query for everything when refreshing your search collection. That would be tons faster than doing all the joins for every single query.

What is the best way to integrate Solr as an index with Oracle as a storage DB?

I have an Oracle database with all the "data", and a Solr index where all this data is indexed. Ideally, I want to be able to run queries like this:
select * from data_table where id in ([solr query results for 'search string']);
However, one key issue arises:
Oracle WILL NOT allow more than 1000 items in the list of items in the "in" clause (BIG DEAL, as the list of objects I find is very often > 1000 and will usually be around 50-200k items)
I have tried to work around this using a "split" function that will take a string of comma-separated values, and break them down into array items, but then I hit the 4000 char limit on the function parameter using SQL (PL/SQL is 32k chars, but it's still WAY too limiting for 80,000+ results in some cases)
I am also hitting performance issues using WHERE IN (...); I am told that this causes a very slow query, even when the field referenced is an indexed field.
I've tried chaining "OR"s to get around the 1000-item limit (i.e.: id in (1...1000) or id in (1001....2000) or id in (2001....3000)) - and this works, but is very slow.
I am thinking that I should load the Solr Client JARs into Oracle, and write an Oracle Function in Java that will call solr and pipeline back the results as a list, so that I can do something like:
select * from data_table where id in (select * from table(runSolrQuery('my query text')));
This is proving quite hard, and I am not sure it's even possible.
Things that I can't do:
Store the full data in Solr (security + storage limits)
Use Solr as the controller of pagination and ordering (this is why I am fetching data from the DB)
So I have to cook up a hybrid approach where Solr really acts as the full-text search provider for Oracle. Help! Has anyone faced this?
Check this out:
http://demo.scotas.com/search-sqlconsole.php
This product seems to do exactly what you need.
cheers
I'm not a Solr expert, but I assume that you can get the Solr query results into a Java collection. Once you have that, you should be able to use that collection with JDBC. That avoids the limit of 1000 literal items because your IN list would be the result of a query, not a list of literal values.
Dominic Brooks has an example of using object collections with JDBC. You would do something like
Create a couple of types in Oracle
CREATE TYPE data_table_id_typ AS OBJECT (
    id NUMBER
);
CREATE TYPE data_table_id_arr AS TABLE OF data_table_id_typ;
In Java, you can then create an appropriate STRUCT array, populate this array from Solr, and then bind it to the SQL statement (see the sketch after the SQL below):
SELECT *
FROM data_table
WHERE id IN (SELECT * FROM TABLE( CAST (? AS data_table_id_arr)))
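A hedged sketch of the Java side under these assumptions: the two types above already exist in the schema, the ids coming back from Solr are numeric, and an Oracle 12c-or-later JDBC driver is available (createOracleArray; older drivers expose the same idea through oracle.sql.ARRAY and ArrayDescriptor).
import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Struct;
import java.util.List;
import oracle.jdbc.OracleConnection;

public class SolrIdJoin {

    // ids: the document ids returned by Solr
    public static void fetchRows(Connection conn, List<Long> ids) throws SQLException {
        OracleConnection oconn = conn.unwrap(OracleConnection.class);

        // Wrap each id in the object type, then build the collection type from them.
        Struct[] elements = new Struct[ids.size()];
        for (int i = 0; i < ids.size(); i++) {
            elements[i] = oconn.createStruct("DATA_TABLE_ID_TYP", new Object[] { ids.get(i) });
        }
        Array idArray = oconn.createOracleArray("DATA_TABLE_ID_ARR", elements);

        String sql = "SELECT * FROM data_table "
                   + "WHERE id IN (SELECT * FROM TABLE(CAST(? AS data_table_id_arr)))";
        try (PreparedStatement ps = oconn.prepareStatement(sql)) {
            ps.setArray(1, idArray);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process each matching row
                }
            }
        }
    }
}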
Instead of using a long BooleanQuery, you can use TermsFilter (it works like RangeFilter, but the items don't have to be in sequence).
Like this (first fill your TermsFilter with terms):
TermsFilter termsFilter = new TermsFilter();
// Loop through terms and add them to filter
Term term = new Term("<field-name>", "<query>");
termsFilter.addTerm(term);
then search the index like this:
DocList parentsList = searcher.getDocList(new MatchAllDocsQuery(),
        searcher.convertFilter(termsFilter), null, 0, 1000);
where searcher is a SolrIndexSearcher (see the Javadoc for more info on the getDocList method):
http://lucene.apache.org/solr/api/org/apache/solr/search/SolrIndexSearcher.html
Two solutions come to mind.
First, look into using the Oracle-specific Java extensions to JDBC. They allow you to pass in an actual array/list as an argument. You may need to create a stored proc (it has been a while since I had to do this), but if this is a focused use case, it shouldn't be overly burdensome.
Second, if you are still running into a boundary like the 1000-object limit, consider using the "rows" setting when querying Solr and leveraging its inherent pagination feature.
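To illustrate that second point, here is a rough SolrJ sketch (assuming a SolrJ 4.x-era client; the core URL, the id field name, and the page size of 1000 are all made-up values) that pages through the index with rows/start and collects only the ids to hand to Oracle:
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrIdPager {

    public static List<String> fetchAllIds(String queryText) throws SolrServerException {
        // Assumed core URL; adjust to your deployment.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/data_core");

        SolrQuery q = new SolrQuery(queryText);
        q.setFields("id");   // only the key field; the real data stays in Oracle
        q.setRows(1000);     // page size

        List<String> ids = new ArrayList<String>();
        int start = 0;
        while (true) {
            q.setStart(start);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                ids.add((String) doc.getFieldValue("id"));
            }
            start += 1000;
            if (start >= rsp.getResults().getNumFound()) {
                break;
            }
        }
        return ids;
    }
}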
I've used this bulk fetching method with stored procs to fetch large quantities of data which needed to be put into Solr. Involve your DBA. If you have a good one, and use the Oracle specific extensions, I think you should attain very reasonable performance.
