Locayta indexer update - full-text-search

I am using Locayta for full-text search in one of my projects. The entity I need to search has text content and tags. Looking at the LocNotes code, it looks like every time any aspect of the entity changes, I need to update it in the Locayta indexer. So if the text content is 100KB and I only modified the tags, does it still re-index the 100KB text content even though it hasn't changed?
Thanks!

I'm one of Locayta's developers; your assumption there is correct. Locayta Search treats each document as a single entity, so at the moment the full document has to be re-sent if any aspect of it has been updated.
In the future, if there's enough demand for it, we may add the ability to update the boolean values stored for the document separately, but there would probably be a lot of provisos around such an update due to how Locayta Search stores various aspects of each document separately. The most reliable way to update a document will always be to send the entire document data each time.

Related

Check for queries not used in an Oracle report

I'm using Oracle Report Builder 9.0.4.1.0 and I have a heavy report that defines a large number of queries. I suspect that not all of those queries are used in the report, and that some are not linked to any layout object.
Is there an easy way to detect which queries (or other objects) aren't used at all in a specific report, instead of deleting each query, compiling, running, and verifying one by one whether it is used or not?
Thanks
If there is an easy way to do that, I don't know it. A long time ago, when Reports 1.x was in use, reports were saved in the database, so you could write a query to fetch the metadata you're interested in. I never did that, but it would have been an option. Now, all you have is an RDF (or a JSP) file.
However, a few suggestions, if I may.
Open the Paper Layout editor. Click a repeating frame and observe its property palette, as it contains information about the group it belongs to ("Group" can also be viewed in the Data Model editor).
As there aren't that many repeating frames, you should be able to eliminate queries that don't have any frames, i.e. that don't contribute to the final result.
Another option is to put a condition
WHERE 1 = 2
into every query so that it won't return any rows. Run the report and check what's missing, then remove that condition so that you get values again. Move on to the second query, and so forth. That's a little tedious and time consuming, but should still be faster than deleting queries.
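For example, a suspect query can be temporarily neutralized like this (the table and columns here are only placeholders):
SELECT emp.empno, emp.ename
FROM emp
WHERE 1 = 2  -- remove this line afterwards to restore the data
If nothing disappears from the report output, that query probably isn't feeding any layout object.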
You can also output the report's results to an XML file. Each query that returns data will contribute something within the XML tags, so empty sections point to unused queries.

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties: ID, name, taxonomy, and their relationship to the content. They're indexed as nested objects so that we can aggregate on them etc.
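Roughly, the mapping looks like this (index and field names here are illustrative rather than our exact schema; Elasticsearch 7+ syntax, older versions need a type name under "mappings"):
curl -X PUT 'http://localhost:9200/content' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "tags": {
        "type": "nested",
        "properties": {
          "id": { "type": "keyword" },
          "name": { "type": "text" },
          "taxonomy": { "type": "keyword" },
          "relationship": { "type": "keyword" }
        }
      }
    }
  }
}'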
This is where it gets interesting... tags used to be immutable, but we have recently changed metadata systems and they may now change: names will be updated, IDs may change as tags move between taxonomies, etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, plus a few others used in tens of thousands of documents, are due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question covering your use case of the Nested datatype.
Here is the link to that answer on maintaining parent-child relational data in ES using the Nested datatype.
Try this, and do let me know if this solution helps solve your problem.
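On the second question: the update by query API with a script can rewrite nested tags in place. A minimal sketch, assuming an index named content, the nested tags field described in the question, and Painless scripting (Elasticsearch 5+; on 5.x use "inline" instead of "source"):
# rewrite one tag's id and name across every document that carries it
curl -X POST 'http://localhost:9200/content/_update_by_query?conflicts=proceed' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "nested": {
      "path": "tags",
      "query": { "term": { "tags.id": "old-tag-id" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "for (t in ctx._source.tags) { if (t.id == params.oldId) { t.id = params.newId; t.name = params.newName; } }",
    "params": { "oldId": "old-tag-id", "newId": "new-tag-id", "newName": "New tag name" }
  }
}'
Note that this still rewrites every matching document (Lucene segments are immutable, so there is no cheaper path), but it avoids re-fetching and re-transforming content from the upstream APIs, which is usually the slow part of a reingest.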

CouchDB change API on view

Say I have a document "schema" that includes a show_from field that contains a timestamp as a Unix epoch. I then create a view keyed by this show_from date and only return those documents with a key on or before the current timestamp (per the request). Thus documents will appear in the view "passively", rather than from any update request.
Is it possible to use the CouchDB change API to monitor this change of view state, or would I have to poll the view to watch for changes? (My guess is the latter, because the change API seems only to be triggered by updates, but just for the sake of confirmation!)
The _changes feed can be filtered in a number of ways.
One of the ways of filtering the _changes feed is reusing a view's map function.
GET /[DB]/_changes?filter=_view&view=[DESIGN_DOC]/[VIEW_NAME]
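For example, with database and design document names assumed, a long-polling request would be:
curl 'http://127.0.0.1:5984/mydb/_changes?filter=_view&view=docs/by_show_from&feed=longpoll&since=now'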
Note:
For every _changes request, CouchDB is going to look at each change and run it through the filter function (or, in this case, the view's map function). None of this is cached for subsequent requests (as it is for map/reduce views), so it can be quite taxing on resources unless the changeset is small.
For a large dataset (with many changes) it can be useful to bootstrap with the view, and only incrementally keep track of changes.
Additional info:
Using _changes you can poll for changes since a given sequence point, for the latest N changes, etc. You can also use long polling, or a continuous feed. As long as the changeset to consider (and filter through) is small, it makes sense to use _changes.
But if the view is itself ordered chronologically, as seems to be the case here, it may be pointless to use _changes. Just query the view.
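A sketch of that direct query, assuming a design document named docs holding the by_show_from view, with show_from emitted as the key:
# all documents whose show_from key is on or before the current time
curl 'http://127.0.0.1:5984/mydb/_design/docs/_view/by_show_from?include_docs=true&endkey='$(date +%s)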

A space-efficient view

For all documents with a certain type I have a single query in my app, which selects just a single field of the last document. I map those documents by date, so a descending query limited to 1 should certainly do the trick. What bothers me is that this view would cache all documents of this type, occupying obviously redundant space.
So my questions are:
Would adding a reduce function to this view, one which reduces to just the last document, save any space, or would the view still have to store all the documents involved?
If not, is there any other space-efficient strategy?
No. Space will still be taken up by the results of the map function.
Some things in my mind at the moment:
Change the design of the database. If the document id includes the type and date, you can do some searching without map/reduce, like this: http://127.0.0.1:5984/YOURDB/_all_docs?start_key="<TYPE>_<CURRENT_TIME>"&descending=true&limit=1.
Make the best use of map that you can. Emit no value, and the view will store only the key and the id/rev of each document; use include_docs=true to retrieve the document body when querying (see the sketch after this list).
Add an additional field saying that the document is a candidate for being the last one. Map only those candidates that have the field, and periodically run a cleanup that removes the field from all documents except the latest one. Note: this can be difficult when deleting the last added document is supported.
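A minimal sketch of the second idea, with database, design document, and type names assumed:
# a map that emits the date as key and no value, so the view stores almost nothing
curl -X PUT 'http://127.0.0.1:5984/YOURDB/_design/last' -H 'Content-Type: application/json' -d '
{
  "views": {
    "by_date": {
      "map": "function (doc) { if (doc.type === \"mytype\") { emit(doc.date, null); } }"
    }
  }
}'
# fetch just the newest document, pulling the body from the database rather than the view
curl 'http://127.0.0.1:5984/YOURDB/_design/last/_view/by_date?descending=true&limit=1&include_docs=true'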
That seems to me to be the idea of CouchDB: "waste" space by caching the queries, so they can be answered quickly if the data is not changing too frequently. Perhaps, if you care this much about the space, CouchDB is not the answer in your case?
My CouchDB setup has the data and the indexes on separate RAID drives. My maps are written in Erlang, which I find 8x faster than JavaScript, and my maps of course return null. I keep the keys small, I break up my views across many design documents, and I keep my data very flat, which improves serialization performance.

Solr query - Is there a way to limit the size of a text field in the response

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....
I have 2 fields:
docId - int
text - string.
I will query the docId field and want to get a "preview" text from the text field of 200 chars. On average, the text field has anything from 600-2000 chars but I only need a preview.
e.g. [mySolrCore]/select?q=docId:123&fl=text
Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?
I'm not looking at hit highlighting, since I'm not searching for specific text within the text field, but if there is functionality similar to the hl.fragsize parameter that would be great!
Hope someone can point me in the right direction!
Cheers!
You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.
http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50
Notes:
Make sure your alternate field does not exist in the field list (fl) parameter
Make sure your highlighting field (hl.fl) does not actually contain the text you want to search
I find that the cpu cost of running the highlighter sometimes is more than the cpu cost and bandwidth of just returning the whole field. You'll have to experiment.
I decided to turn my comment into an answer.
I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.
Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.
So, I would suggest 2 things.
First, optimally: remove the text storage entirely from your search index. Fetch the preview text and the whole text from a secondary system that is optimized for holding documents, like a file server.
Second, sub-optimally: store only the preview text in your search index, and store the entire document elsewhere, like a file server.
You can add an additional field, like excerpt/summary, that contains the first 200 chars of the text, and return that field instead.
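If the excerpt only needs to be a plain prefix of the text, a copyField with maxChars in schema.xml can populate it at index time; a sketch, with field names assumed:
<!-- store only the first 200 chars of "text" in a separate retrievable field -->
<field name="excerpt" type="string" indexed="false" stored="true"/>
<copyField source="text" dest="excerpt" maxChars="200"/>
Then query with fl=docId,excerpt instead of fl=text.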
My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or the equivalent. This is a normal (see Google as an example) and productive technique.
Presently we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and creating the snippet there, together with many others in a set of responses, as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).
The sensible thing is for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with highlights and so forth are just that, and interfere with proper usage. Keep in mind that the mechanisms feeding material into Solr/Lucene may not be parsing the files, so those feeders can't create the snippets.
LinkedIn real-time search: http://snaprojects.jira.com/browse/ZOIE
For storing big data: http://project-voldemort.com/
