Is there some way to perform a different update to each of many documents (bulk update) in spring-data-mongodb-reactive?

I am using spring-boot-starter-data-mongodb-reactive at the current latest version. I need to update a count field in many documents in my collection, but the count field is different for each document. I am looking for a way to perform a bulk update so that I do not have to perform an update for each item, separately, a million times.
My first approach has been to create a list of UpdateOneModel instances, each containing the Criteria and the Update. I can get the collection from the ReactiveMongoOperations instance, but this feels like quite an awkward way to do it. It looks like this:
Mono<MongoCollection<Document>> collection = mongoOps.getCollection(mongoOps.getCollectionName(Foo.class));
BulkWriteOptions options = new BulkWriteOptions()
        .bypassDocumentValidation(true)
        .ordered(false);
return result.getCounts()
        .reduce(<creating a map of ID to new count>)
        .map(<creating a list of UpdateOneModel<Document> instances>)
        .map(updates -> collection.map(c -> c.bulkWrite(updates, options)))
        .then();
This feels like an odd (and almost brute-force) way to try to perform a bulk update. Am I missing something? Spring usually includes methods for performing bulk updates, but its reactive Mongo library apparently does not include them. What else might I try?
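For what it's worth, here is a minimal sketch of that driver-level approach pulled together into one method. It assumes the documents are matched on _id and that the field being updated is literally named "count" (both assumptions for illustration; Foo is the mapped entity class from the question). Note that bulkWrite on the reactive driver's MongoCollection returns a Publisher, so the write only executes once the returned Mono is subscribed to:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.bson.Document;
import org.springframework.data.mongodb.core.ReactiveMongoOperations;

import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;

import reactor.core.publisher.Mono;

public class FooCountUpdater {

    private final ReactiveMongoOperations mongoOps;

    public FooCountUpdater(ReactiveMongoOperations mongoOps) {
        this.mongoOps = mongoOps;
    }

    // Applies a different "count" value to each document, keyed by _id.
    // Foo is the mapped entity class from the question.
    public Mono<BulkWriteResult> updateCounts(Map<String, Long> countsById) {
        List<UpdateOneModel<Document>> updates = countsById.entrySet().stream()
                .map(e -> new UpdateOneModel<Document>(
                        Filters.eq("_id", e.getKey()),
                        Updates.set("count", e.getValue())))
                .collect(Collectors.toList());

        BulkWriteOptions options = new BulkWriteOptions()
                .bypassDocumentValidation(true)
                .ordered(false);

        // One round trip to the server for the whole batch of per-document updates.
        return mongoOps.getCollection(mongoOps.getCollectionName(Foo.class))
                .flatMap(collection -> Mono.from(collection.bulkWrite(updates, options)));
    }
}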

Related

Marklogic - get list of all unique document structures in a Marklogic database

I want to get a list of all distinct document structures with a count in a Marklogic database.
e.g. a database with these 3 documents:
1) <document><name>Robert</name></document>
2) <document><name>Mark</name></document>
3) <document><fname>Robert</fname><lname>Smith</lname></document>
Would return that there are two unique document structures in the database, one used by 2 documents, and the other used by 1 document.
I am using this XQuery and am getting back the list of unique sequences of elements correctly:
for $i in distinct-values(for $document in doc()
return <div>{distinct-values(
for $element in $document//*/*/name() return <div>{$element}</div>)} </div>)
return $i
I appreciate that this code will not handle duplicate element names but that is OK for now.
My questions are:
1) Is there a better/more efficient way to do this? I am assuming yes.
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
If you have a finite list of elements for which you need to do this, consider co-occurrence or other similar solutions: https://docs.marklogic.com/cts:value-co-occurrences
This requires a range index on each element in question.
MarkLogic works best when it can use indexes whenever possible. The other solution I can think of is to create a hash/checksum for the values of the target content for each document in question and store this with the document (or in a triple if you happen to have a licence for semantics). Then you would already have a key for the unique combinations.
1) Is there a better/more efficient way to do this? I am assuming yes.
If it were up to me, I would create the document structured in a consistent fashion (like you're doing), then hash it, and attach the hash to each document as a collection. Then I could count the docs in each collection. I can't see any efficient way (using indexes) to get the counts without first writing to the document content or metadata (collection is a type of metadata) then querying against the indexes.
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
After you get the counts for each collection, you could retrieve one doc from each collection and walk through it to build an empty XML structure. XSLT would probably be a good way to do this if you already know XSLT.
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
Turn on the collection lexicon on your database. Then do something like the following:
for $collection in cts:collections()
return ($collection, cts:frequency($collection))
Not sure I follow exactly what you are after, but I am wondering if this is more what you are looking for: functx:distinct-element-paths($doc)
http://www.xqueryfunctions.com/xq/functx_distinct-element-paths.html
Here's a quick example:
xquery version "1.0-ml";
import module namespace functx = "http://www.functx.com" at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";
let $doc := <document><fname>Robert</fname><lname>Smith</lname></document>
return
functx:distinct-element-paths($doc)
Outputs the following strings (which could be parsed, of course):
document
document/fname
document/lname
There are existing third-party tools that may work, depending on the size of the data and the coverage required (is 100% sampling needed?).
Search for "Generate Schema from XML".
Such tools will look at a sample set and infer a schema (XSD, DTD, RNG, etc.).
They do an accurate job, but not always in the same way a human would.
If they do not have native MarkLogic integration then you need to expose a service or export the data for analysis.
Once you have a schema, load it into MarkLogic, and you can query the schema (and elements validated by it) directly and programmatically in MarkLogic.
If you find a "generate schema" tool that is implemented in XSLT, XQuery, or JavaScript, you may be able to import and execute it in-server.

Transform select statement

How can I dynamically transform an SQL query?
I know there is Select.getSelect(), but how can I add fields to the select query?
Use case: for a REST query I have a lot of paginated resources, and I have an abstraction to create the paginated query. It takes the SelectConditionStep and adds the rest, depending on additional parameters. It works really well for simple queries, but for queries containing joins a little bit of transformation of the query would be required. (Mainly because I can't naively limit the number of results, since the join can be a one-to-many relationship.)
The easiest way is to keep a List<Field<?>> where you add the fields for your select() clause, and then create the Select statement only when you actually execute it, instead of passing a Select object around. Example:
List<Field<?>> fields = new ArrayList<>();
// Just some examples:
fields.addAll(getDefaultFields());
fields.addAll(getFieldsFromUI());
fields.addAll(getCalculatedFields());
// Much later on, you finally create the statement:
DSL.using(configuration)
   .select(fields)
   .from(...)
   .fetch();
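To address the one-to-many concern from the question, one option (a sketch only, not the library's prescribed pagination API) is to page the parent rows in a subquery so that the limit applies before the join fans out. AUTHOR and BOOK below are hypothetical generated tables standing in for the real schema, and fields is the dynamic list built above:

// AUTHOR and BOOK are hypothetical generated table references; "fields" is the
// dynamic List<Field<?>> assembled earlier in this answer.
Result<?> result = DSL.using(configuration)
        .select(fields)
        .from(AUTHOR)
        .join(BOOK).on(BOOK.AUTHOR_ID.eq(AUTHOR.ID))
        // Page the parent ("one") side only, so LIMIT/OFFSET is not distorted
        // by the one-to-many fan-out of the join.
        .where(AUTHOR.ID.in(
                DSL.select(AUTHOR.ID)
                   .from(AUTHOR)
                   .orderBy(AUTHOR.ID)
                   .limit(20)
                   .offset(40)))
        .orderBy(AUTHOR.ID)
        .fetch();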

Remove duplicates from custom entities in Microsoft Dynamics CRM

Has anyone found a good way to either merge or remove duplicates that are in custom entities? In our case we have two custom entities, literature history and subscriptions, which relate contacts back to a custom entity named literature.
I can run a duplicate detection job, but this returns thousands of records and deleting them one at a time is impractical at best. We would like to either be able to merge them or just delete the duplicates. However, much Google searching has not turned up any good suggestions other than "you can write something."
Okay, but where to even get started? Should I be bulk deleting from the duplicate detection job? Should I try just writing a quick and dirty c# program with the SDK? Is there a way to merge custom entities that just requires some magical workflow voodoo?
EDIT: FYI, what I eventually did was set the deletion state code, using some fun SQL to quickly find duplicates:
UPDATE T1 SET DeletionStateCode = 2
FROM New_subscriptionhistory T1
INNER JOIN New_subscriptionhistory T2
    ON t1.New_LiteratureId = T2.New_LiteratureId
    AND t1.New_ContactId = t2.New_ContactId
    AND t1.CreatedOn > t2.CreatedOn
    AND t1.statecode = 0 AND t2.statecode = 0
You should look into creating a Bulk Delete Job using the SDK.
Here's a short tutorial.
I won't say with certainty that this is the only or the best way, but we've used SQL queries in the _MSCRM database, setting the DeletionStateCode of any duplicated entity to 2.

What is the best way to integrate Solr as an index with Oracle as a storage DB?

I have an Oracle database with all the "data", and a Solr index where all this data is indexed. Ideally, I want to be able to run queries like this:
select * from data_table where id in ([solr query results for 'search string']);
However, one key issue arises:
Oracle WILL NOT allow more than 1000 items in the list of items in the "IN" clause (a BIG DEAL, as the list of objects I find is very often > 1000 and will usually be around 50-200k items)
I have tried to work around this using a "split" function that will take a string of comma-separated values, and break them down into array items, but then I hit the 4000 char limit on the function parameter using SQL (PL/SQL is 32k chars, but it's still WAY too limiting for 80,000+ results in some cases)
I am also hitting performance issues using WHERE IN (....); I am told that this causes a very slow query, even when the field referenced is an indexed field.
I've tried chaining "OR"s to get around the 1000-item limit (i.e. id in (1...1000) or id in (1001....2000) or id in (2001....3000)), and this works, but is very slow.
I am thinking that I should load the Solr Client JARs into Oracle, and write an Oracle Function in Java that will call solr and pipeline back the results as a list, so that I can do something like:
select * from data_table where id in (select * from table(runSolrQuery('my query text')));
This is proving quite hard, and I am not sure it's even possible.
Things that I can't do:
Store full data in Solr (security + storage limits)
Use Solr as controller of pagination and ordering (this is why I am fetching data from the DB)
So I have to cook up a hybrid approach where Solr really acts as the full-text search provider for Oracle. Help! Has anyone faced this?
Check this out:
http://demo.scotas.com/search-sqlconsole.php
This product seems to do exactly what you need.
cheers
I'm not a Solr expert, but I assume that you can get the Solr query results into a Java collection. Once you have that, you should be able to use that collection with JDBC. That avoids the limit of 1000 literal items because your IN list would be the result of a query, not a list of literal values.
Dominic Brooks has an example of using object collections with JDBC. You would do something like
Create a couple of types in Oracle
CREATE TYPE data_table_id_typ AS OBJECT (
id NUMBER
);
CREATE TYPE data_table_id_arr AS TABLE OF data_table_id_typ;
In Java, you can then create an appropriate STRUCT array, populate this array from Solr, and then bind it to the SQL statement
SELECT *
FROM data_table
WHERE id IN (SELECT * FROM TABLE( CAST (? AS data_table_id_arr)))
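To make the binding step concrete, here is a rough Java sketch, assuming the Oracle JDBC driver (ojdbc) is on the classpath and the two types above exist. createStruct is standard JDBC, createOracleArray and OracleConnection are Oracle-specific extensions, the SQL mirrors the statement above, and conn/solrIds are placeholder names:

import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Struct;
import java.util.List;

import oracle.jdbc.OracleConnection;

public class SolrIdLookup {

    // solrIds holds the ids returned by the Solr query
    public void fetchMatching(Connection conn, List<Long> solrIds) throws SQLException {
        OracleConnection oraConn = conn.unwrap(OracleConnection.class);

        // One STRUCT per id, matching data_table_id_typ(id NUMBER)
        Struct[] elements = new Struct[solrIds.size()];
        for (int i = 0; i < solrIds.size(); i++) {
            elements[i] = oraConn.createStruct("DATA_TABLE_ID_TYP",
                    new Object[] { solrIds.get(i) });
        }

        // Bind the whole array as the collection type from the DDL above
        Array idArray = oraConn.createOracleArray("DATA_TABLE_ID_ARR", elements);

        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM data_table WHERE id IN "
              + "(SELECT * FROM TABLE(CAST(? AS data_table_id_arr)))")) {
            ps.setArray(1, idArray);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process each matching data_table row here
                }
            }
        }
    }
}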
Instead of using a long BooleanQuery, you can use TermsFilter (works like RangeFilter, but the items don't have to be in sequence).
Like this (first fill your TermsFilter with terms):
TermsFilter termsFilter = new TermsFilter();
// Loop through terms and add them to filter
Term term = new Term("<field-name>", "<query>");
termsFilter.addTerm(term);
then search the index like this:
DocList parentsList = null;
parentsList = searcher.getDocList(new MatchAllDocsQuery(), searcher.convertFilter(termsFilter), null, 0, 1000);
Where searcher is a SolrIndexSearcher (see the Javadoc for more info on the getDocList method):
http://lucene.apache.org/solr/api/org/apache/solr/search/SolrIndexSearcher.html
Two solutions come to mind.
First, look into using Oracle-specific Java extensions to JDBC. They allow you to pass in an actual array/list as an argument. You may need to create a stored proc (it has been a while since I had to do this), but if this is a focused use case, it shouldn't be overly burdensome.
Second, if you are still running into a boundary like 1000-object limits, consider using the "rows" setting when querying Solr and leveraging its inherent pagination feature.
I've used this bulk fetching method with stored procs to fetch large quantities of data which needed to be put into Solr. Involve your DBA. If you have a good one, and use the Oracle specific extensions, I think you should attain very reasonable performance.
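As a rough illustration of that second point (not taken verbatim from the answer): with SolrJ you can page through results using start and rows, collecting only the ids needed for the database lookup. The core URL, query string, field name, and page size below are placeholders, and the client class name varies between SolrJ versions:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrIdPager {

    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        int pageSize = 1000;   // "rows" per request
        int start = 0;

        while (true) {
            SolrQuery query = new SolrQuery("my query text");
            query.setFields("id");       // only fetch the id needed for the Oracle lookup
            query.setStart(start);
            query.setRows(pageSize);

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                // collect doc.getFieldValue("id") for the database IN-list / array bind
            }

            start += pageSize;
            if (start >= response.getResults().getNumFound()) {
                break;
            }
        }
        solr.close();
    }
}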

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
            select new { FirstCode = record.Substring(0, 2),
                         OtherCode = record.Substring(18, 4) };
For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.
I don't think there's a better way out of the box.
One could define a Flat-File Linq Provider which could make the whole thing much simpler, but as far as I know, no one has yet.
