MongoTemplate bulk operations ignore limit - Spring

I've got an employees collection that looks like:
{
_id
company_id
job_type
name
age
...
}
And a remove query that looks like:
val criteria = Criteria.where("company_id").isEqualTo(companyId).and("job_type").isEqualTo(jobType)
val query = Query(criteria).limit(3)
When I run it like:
mongoTemplate.remove(query, null, "collection_name")
It works great:)
But when I run it with bulk ops like:
val bulkOp = mongoTemplate.bulkOps(BulkOperations.BulkMode.UNORDERED, null, "collection_name")
bulkOp.remove(query)
...
bulkOp.execute()
It removes all of the documents for that company & job type while completely ignoring the limit=3
Is there a solution for the issue?

Bulk writes with cursor options are not supported, and a cursor is the opposite of what bulk is designed for: a cursor processes results in an iterative fashion and makes multiple trips to the server, whereas bulk updates are applied as a batch on the server side.
The reason remove works with MongoTemplate is that behind the scenes it makes multiple queries: one to fetch all the matches for the query and collect their ids, followed by another call to remove those ids.
You can do the same for the bulk operation by collecting the ids first and then running the bulk remove against those ids, as sketched below.
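A minimal sketch of that approach with MongoTemplate (Java here, while the question uses Kotlin; the field and collection names come from the question, and mongoTemplate, companyId and jobType are assumed to be in scope):
import java.util.List;
import java.util.stream.Collectors;
import org.bson.Document;
import org.springframework.data.mongodb.core.BulkOperations;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

// 1. Fetch only the _ids of at most 3 matching documents.
Query idQuery = new Query(Criteria.where("company_id").is(companyId)
        .and("job_type").is(jobType))
        .limit(3);
idQuery.fields().include("_id");
List<Object> ids = mongoTemplate.find(idQuery, Document.class, "collection_name")
        .stream()
        .map(doc -> doc.get("_id"))
        .collect(Collectors.toList());

// 2. Remove exactly those _ids in the bulk operation.
BulkOperations bulkOp = mongoTemplate.bulkOps(BulkOperations.BulkMode.UNORDERED, "collection_name");
bulkOp.remove(new Query(Criteria.where("_id").in(ids)));
bulkOp.execute();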

Related

Is there some way to perform a different update to each of many documents (bulk update) in spring-data-mongodb-reactive?

I am using spring-boot-starter-data-mongodb-reactive at the current latest version. I need to update a count field in many documents in my collection, but the count field is different for each document. I am looking for a way to perform a bulk update so that I do not have to perform an update for each item, separately, a million times.
My first approach has been to create a list of UpdateOneModel containing the Criteria and the Update. I can get the collection from the ReactiveMongoOperations instance, but this feels like quite an awkward way to do it. It looks like this:
Mono<MongoCollection<Document>> collection = mongoOps.getCollection(mongoOps.getCollectionName(Foo.class));
BulkWriteOptions options = new BulkWriteOptions()
        .bypassDocumentValidation(true)
        .ordered(false);
return result.getCounts()
        .reduce(<creating a map of ID to new count>)
        .map(<creating a list of UpdateOneModel<Document> instances>)
        .map(updates -> collection.map(c -> c.bulkWrite(updates, options)))
        .then();
This feels like an odd (and almost brute-force) way to try to perform a bulk update. Am I missing something? Spring usually includes methods for performing bulk updates, but its reactive Mongo library apparently does not include one. What else might I try?
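For what it's worth, here is a tidied, self-contained sketch of the approach described above, going through the reactive driver's bulkWrite directly; Foo and the count field come from the question, while the id-to-count map and the _id filter are assumptions:
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.bson.Document;
import org.springframework.data.mongodb.core.ReactiveMongoOperations;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import reactor.core.publisher.Mono;

// counts: id -> new count, e.g. the result of the reduce step above (assumed shape)
public Mono<Void> bulkUpdateCounts(ReactiveMongoOperations mongoOps, Map<String, Long> counts) {
    List<WriteModel<Document>> updates = counts.entrySet().stream()
            .map(e -> (WriteModel<Document>) new UpdateOneModel<Document>(
                    Filters.eq("_id", e.getKey()),
                    Updates.set("count", e.getValue())))
            .collect(Collectors.toList());

    BulkWriteOptions options = new BulkWriteOptions()
            .bypassDocumentValidation(true)
            .ordered(false);

    // One round trip: the driver sends all UpdateOneModels in a single bulkWrite.
    return mongoOps.getCollection(mongoOps.getCollectionName(Foo.class))
            .flatMap(collection -> Mono.from(collection.bulkWrite(updates, options)))
            .then();
}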

How to get data from a view table (which has resource_full) in Google BigQuery (google-cloud-ruby gem)

I am working on a Sinatra/Ruby application.
I need to get data from a view table in Google BigQuery, and I am using the google-cloud-bigquery gem in my application.
Here is how I am querying BigQuery (Ruby code):
bigquery = Google::Cloud::Bigquery.new(<necessary credentials for the application>)
query = "select * from `dataset.table_name` limit 10"
bigquery.query query # this query gives me the expected output
But when I query without the limit, like the following:
query = "select * from `dataset.table_name`"
bigquery.query query
I will get this kind of response.
Google::Cloud::InvalidArgumentError (resourcesExceeded: Resources exceeded during query execution: The query could not be executed)
How can I handle this case? I need all of the data from that table, so I shouldn't have to apply a limit.
Instead of querying the whole table, you should use the Tabledata: list API (or the respective method in the client of your choice).
Using list is free and also comes with paging, so you can get the whole table's data page by page.
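For illustration, a minimal sketch of the tabledata.list approach using the Google Cloud BigQuery Java client (the Ruby gem exposes an equivalent data-listing call on the table object); the dataset and table names are placeholders:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableDataListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
TableId tableId = TableId.of("dataset", "table_name");

// listTableData reads the stored rows page by page without running a query job,
// so there is no per-query resource limit to exceed.
TableResult rows = bigquery.listTableData(tableId, TableDataListOption.pageSize(10_000));
for (FieldValueList row : rows.iterateAll()) {
    // process each row here
}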

How to sort an optimized keys-only query

I'm using the Datastore native API to access the GAE database (for well-studied, specific reasons). I wanted to optimize the code and use memcache in my requests instead of directly grabbing the values; the issue is that my query is sorted.
When I call findProductsByFiltersQuery.setKeysOnly() on my query, I receive this error:
The provided keys-only multi-query needs to perform some sorting in
memory. As a result, this query can only be sorted by the key
property as this is the only property that is available in memory.
The weird thing is that it only starts happening at a certain level of query complexity; for example, this request fails:
SELECT __key__ FROM Product WHERE dynaValue = _Rs:2 AND productState = PUBLISHED AND dynaValue = _RC:2 AND dynaValue = RF:1 AND dynaValue = ct:1030003 AND dynaValue = _RS:4 AND dynaValue = _px:2 AND itemType = NEWS ORDER BY modificationDate DESC
while this one passes:
SELECT __key__ FROM Product WHERE itemType = CI AND productState = PUBLISHED ORDER BY modificationDate DESC
Can someone explain why this is happening, and if ordering is not possible when fetching only keys, what is that feature for? Since results are paginated, it is useless to get a badly ordered set of keys from the first filtering request. So how is this meant to be used?
Please also note that when I run a very long non-keys-only request, I receive this message:
Splitting the provided query requires that too many subqueries are
merged in memory.
at com.google.appengine.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:99)
at com.google.appengine.api.datastore.QuerySplitHelper.splitQuery(QuerySplitHelper.java:71)
Can someone explain how there can be in-memory processing when the values are indexed? Or is it only the dev-mode server that produces this error?
In-memory queries are necessary when you use OR, IN, and != operators in Datastore. As described in this blog post, queries using these operators are split in the client into multiple Datastore queries. For example:
SELECT * FROM Foo WHERE A = 1 OR A = 2
gets split into two queries:
SELECT * FROM Foo WHERE A = 1
SELECT * FROM Foo WHERE A = 2
If you add ORDER BY B to your query, both sub-queries get this order:
SELECT * FROM Foo WHERE A = 1 ORDER BY B
SELECT * FROM Foo WHERE A = 2 ORDER BY B
While each of these Datastore queries returns results sorted by B, the union of the queries is not necessarily sorted. On the client side, the SDK merges the results of the two, ordered by B.
In order to do this, the Datastore queries must actually return the ordered property, otherwise the SDK won't know the correct way to merge these together.
If you are writing queries with a large number of filters, make sure to only use AND filters. This will allow all the operations to be performed only in the Datastore, in which case no in-memory sorting is necessary.
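For example, the question's passing query could be built with the low-level Java Datastore API roughly like this (a sketch; the property values are assumed to be stored as plain strings):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.SortDirection;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// Only equality AND filters plus one sort: the Datastore can answer this with a
// single index scan, so no client-side merge (and no in-memory sort) is needed,
// and keys-only works together with ORDER BY.
Query q = new Query("Product")
        .setFilter(CompositeFilterOperator.and(
                FilterOperator.EQUAL.of("itemType", "CI"),
                FilterOperator.EQUAL.of("productState", "PUBLISHED")))
        .addSort("modificationDate", SortDirection.DESCENDING)
        .setKeysOnly();

for (Entity e : datastore.prepare(q).asIterable()) {
    // e.getKey() is the key you can cache in memcache
}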

Is it a good idea to store and access an active query resultset in ColdFusion vs re-querying the database?

I have a product search engine using ColdFusion 8 and MySQL 5.0.88.
The product search has two display modes: Multiple View and Single View.
Multiple view displays basic record info; Single view requires additional data to be pulled from the database.
Right now a user does a search and I'm polling the database for
(a) total records and
(b) records FROM to TO.
The user always goes to Single view from his current resultset, so my idea was to store the current resultset for each user and not have to query the database again for (a) the overall number of records and (b) the single record I already queried before, and then fetch only the detail information I still need for the Single view.
However, I'm getting nowhere with this.
I cannot cache the current resultset query, because it's unique to each user (session).
The queries run inside a CFINVOKE'd method inside a CFC I'm calling through AJAX, so the whole query runs and afterwards the CFC and the CFINVOKE'd method are discarded, which means I can't use query of query or variables.cfc_storage.
So my idea was to store the current resultset in the Session scope, to be updated with every new search the user runs (either pagination or a completely new search). The maximum number of results stored would be the number of results displayed.
I can store the query all right, using:
<cfset Session.resultset = query_name>
This stores the whole query with results, like so:
query
CACHED: false
EXECUTIONTIME: 2031
SQL: SELECT a.*, p.ek, p.vk, p.x, p.y
FROM arts a
LEFT JOIN p ON
...
LEFT JOIN f ON
...
WHERE a.aktiv = "ja"
AND
... 20 conditions ...
SQLPARAMETERS: [array]
1) ... 20+ parameters
RESULTSET:
[Record # 1]
a: true
style: 402
price: 2.3
currency: CHF
...
[Record # 2]
a: true
style: 402abc
...
This would be overwritten every time a user does a new search. However, if a user wants to see the details of one of these items, I don't need to query the database again (for the total number of records and the single record) if I can access the record I need from my temp storage. This way I would save two database trips, worth 2031 ms of execution time each, to get data I have already pulled before.
The tradeoff would be every user having a resultset of up to 48 results (max number of items per page) in Session.scope.
My questions:
1. Is this feasible, or should I re-query the database?
2. If I have a structure/array/object like the above, how do I pick out the record I need by style number, i.e. how do I access the resultset? I can't just loop over the stored query (I've tried this for a while now...).
Thanks for the help!
KISS rule. Just re-query the database unless you find that performance is really an issue. With the correct index, it should scale pretty well. When it does become an issue, you can simply add query caching there.
QoQ would introduce overhead (on the CF side: memory and computation) and might return stale data (where the query in session is older than the data in the DB). I only use QoQ when the same query is reused on the same view, not across a session's time span.
Feasible? Yes, depending on how many users and how much data this stores in memory, it's probably much better than going to the DB again.
It seems like the best way to get the single record you want is a query of query. In CF you can create another query that uses an existing query as its data source. It would look like this:
<cfquery name="subQuery" dbtype="query">
SELECT *
FROM Session.resultset
WHERE style = #SelectedStyleVariable#
</cfquery>
Note that if you are using CFBuilder, it will probably scream "Error" at you for not having a datasource. This is a bug in CFBuilder; you are not required to have a datasource if your dbtype is "query".
Depending on how many records there are, what I would do is store the detail data in the application scope as a structure where the ID is the key. Something like:
APPLICATION.products[product_id].product_name
.product_price
.product_attribute
Then you would really only need to query for the ID of the item on demand.
And to improve the "on demand" query, you have at least two "in code" options:
1. A query of query, where you query the entire collection of items once, and then query from that for the data you need.
2. Verity or SOLR to index everything and then you'd only have to query for everything when refreshing your search collection. That would be tons faster than doing all the joins for every single query.

What is the best way to integrate Solr as an index with Oracle as a storage DB?

I have an Oracle database with all the "data", and a Solr index where all this data is indexed. Ideally, I want to be able to run queries like this:
select * from data_table where id in ([solr query results for 'search string']);
However, one key issue arises:
Oracle WILL NOT allow more than 1000 items in the "IN" clause (a BIG DEAL, as the list of objects I find is very often > 1000 and will usually be around 50-200k items).
I have tried to work around this using a "split" function that will take a string of comma-separated values, and break them down into array items, but then I hit the 4000 char limit on the function parameter using SQL (PL/SQL is 32k chars, but it's still WAY too limiting for 80,000+ results in some cases)
I am also hitting performance issues using WHERE IN (...); I am told that this causes a very slow query, even when the field referenced is indexed.
I've tried chaining "OR"s to get around the 1000-item limit (i.e. id in (1...1000) OR id in (1001...2000) OR id in (2001...3000)), and this works, but it is very slow.
I am thinking that I should load the Solr Client JARs into Oracle, and write an Oracle Function in Java that will call solr and pipeline back the results as a list, so that I can do something like:
select * from data_table where id in (select * from table(runSolrQuery('my query text')));
This is proving quite hard, and I am not sure it's even possible.
Things that I can't do:
- Store full data in Solr (security + storage limits)
- Use Solr as the controller of pagination and ordering (this is why I am fetching data from the DB)
So I have to cook up a hybrid approach where Solr really acts as the full-text search provider for Oracle. Help! Has anyone faced this?
Check this out:
http://demo.scotas.com/search-sqlconsole.php
This product seems to do exactly what you need.
cheers
I'm not a Solr expert, but I assume that you can get the Solr query results into a Java collection. Once you have that, you should be able to use that collection with JDBC. That avoids the limit of 1000 literal items because your IN list would be the result of a query, not a list of literal values.
Dominic Brooks has an example of using object collections with JDBC. You would do something like
Create a couple of types in Oracle
CREATE TYPE data_table_id_typ AS OBJECT (
  id NUMBER
);
CREATE TYPE data_table_id_arr AS TABLE OF data_table_id_typ;
In Java, you can then create an appropriate STRUCT array, populate this array from Solr, and then bind it to the SQL statement
SELECT *
FROM data_table
WHERE id IN (SELECT * FROM TABLE( CAST (? AS data_table_id_arr)))
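A hedged JDBC-side sketch; for simplicity it assumes a scalar collection type (CREATE TYPE data_table_id_list AS TABLE OF NUMBER) instead of the object type above, and that the ids coming back from Solr are numeric:
import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import oracle.jdbc.OracleConnection;

public void fetchByIds(Connection conn, List<Long> ids) throws SQLException {
    // Bind the whole id list as a single collection parameter; the 1000-literal limit does not apply.
    Array idArray = conn.unwrap(OracleConnection.class)
            .createOracleArray("DATA_TABLE_ID_LIST", ids.toArray());

    String sql = "SELECT * FROM data_table "
               + "WHERE id IN (SELECT column_value FROM TABLE(?))";

    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setArray(1, idArray);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // map each row to your domain object here
            }
        }
    }
}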
Instead of using a long BooleanQuery, you can use TermsFilter (works like RangeFilter, but the items don't have to be in sequence).
Like this (first fill your TermsFilter with terms):
TermsFilter termsFilter = new TermsFilter();
// Loop through terms and add them to filter
Term term = new Term("<field-name>", "<query>");
termsFilter.addTerm(term);
then search the index like this:
DocList parentsList = null;
parentsList = searcher.getDocList(new MatchAllDocsQuery(), searcher.convertFilter(termsFilter), null, 0, 1000);
Where searcher is a SolrIndexSearcher (see the Javadoc for more info on the getDocList method):
http://lucene.apache.org/solr/api/org/apache/solr/search/SolrIndexSearcher.html
Two solutions come to mind.
First, look into using Oracle-specific Java extensions to JDBC. They allow you to pass in an actual array/list as an argument. You may need to create a stored proc (it has been a while since I had to do this), but if this is a focused use case, it shouldn't be overly burdensome.
Second, if you are still running into a boundary like the 1000-object limit, consider using the "rows" setting when querying Solr and leveraging its inherent pagination feature (see the sketch below).
I've used this bulk fetching method with stored procs to fetch large quantities of data which needed to be put into Solr. Involve your DBA. If you have a good one, and use the Oracle specific extensions, I think you should attain very reasonable performance.
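As a minimal illustration of the second option, paging through Solr results with rows/start via SolrJ (the core URL, query text, field name, and page size are assumptions):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public void collectIds() throws Exception {
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
    SolrQuery q = new SolrQuery("my query text");
    q.setFields("id");   // only the ids are needed on the Oracle side
    q.setRows(1000);     // page size

    long fetched = 0;
    long total;
    do {
        q.setStart((int) fetched);
        QueryResponse rsp = solr.query(q);
        total = rsp.getResults().getNumFound();
        for (SolrDocument doc : rsp.getResults()) {
            Object id = doc.getFieldValue("id");
            // collect this page of ids and hand them to the Oracle side
        }
        fetched += rsp.getResults().size();
    } while (fetched < total);
    solr.close();
}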
