Solr query - Is there a way to limit the size of a text field in the response - full-text-search

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....
I have 2 fields:
docId - int
text - string.
I will query the docId field and want to get a "preview" text from the text field of 200 chars. On average, the text field has anything from 600-2000 chars but I only need a preview.
eg. [mySolrCore]/select?q=docId:123&fl=text
Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?
I'm not looking at hit highlighting since I'm not searching for specific text within the text field, but if there is similar functionality to the hl.fragsize parameter that would be great!
Hope someone can point me in the right direction!
Cheers!

You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.
http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50
Notes:
Make sure your alternate field does not exist in the field list (fl) parameter
Make sure your highlighting field (hl.fl) does not actually contain the text you want to search
I find that the CPU cost of running the highlighter is sometimes more than the CPU cost and bandwidth of just returning the whole field. You'll have to experiment.
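For the scenario in the question, the same trick might look roughly like the following (an untested sketch; the parameter values are illustrative, and hl.requireFieldMatch=true is added so the docId query cannot produce a highlight inside the text field):
[mySolrCore]/select?q=docId:123&fl=docId&hl=true&hl.snippets=0&hl.fl=text&hl.fragsize=0&hl.requireFieldMatch=true&hl.alternateField=text&hl.maxAlternateFieldLength=200
The 200-character preview should then come back in the highlighting section of the response rather than as a stored field.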

I decided to turn my comment into an answer.
I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.
Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.
So, I would suggest 2 things.
First, optimally, remove the text storage entirely from your search index. Fetch the preview text and the whole text from a secondary system that is optimized for holding documents, like a file server.
Second, sub-optimally, store only the preview text in your search index and store the entire document elsewhere, like a file server.

You can add an additional field, like excerpt/summary, that consists of the first 200 chars of text, and return that field instead.
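One way to do that inside Solr itself, assuming your Solr version supports the maxChars attribute on copyField (the preview field name here is illustrative), is a schema.xml entry along these lines:
<field name="preview" type="string" indexed="false" stored="true" />
<copyField source="text" dest="preview" maxChars="200" />
You would then request fl=docId,preview instead of fl=text.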

My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or the equivalent. This is a normal (see Google as an example) and productive technique.
Presently we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and creating the snippet there, together with many others in a set of responses, as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).
The sensible thing is for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with highlights and so forth are just that, and interfere with proper usage. Keep in mind that the mechanisms feeding material into Solr/Lucene may not be parsing the files, so those feeders can't create the snippets.

LinkedIn real-time search:
http://snaprojects.jira.com/browse/ZOIE
For storing big data:
http://project-voldemort.com/

Related

Storing long text in Datastore

Is Datastore suitable to store really long text, e.g. profile descriptions and articles?
If not, what's the Google Cloud alternative?
If yes, what would be the ideal way to store it in order to maintain formatting such as line breaks and Markdown-supported keywords? Simply store it as a string, or convert it to []byte? And should I be worried about dirty user input?
I need it for a Go project (I don't think the language is relevant, but maybe Go has some useful features for this).
Yes, it's suitable if you're OK with certain limitations.
These limitations are:
the overall entity size (properties + indices) must not exceed 1 MB (this should be OK for profiles and most articles)
texts longer than a certain limit (currently 1500 bytes) cannot be indexed, so the entity may store a longer string, but you won't be able to search in it / include it in query filters; don't forget to tag these fields with "noindex"
As for the type, you may simply use string, e.g.:
type Post struct {
    UserID  int64  `datastore:"uid"`
    Content string `datastore:"content,noindex"`
}
string values preserve all formatting, including newlines, HTML, markup, and whatever other formatting.
"Dirty user input?" That's an issue of rendering / presenting the data. The Datastore will not try to interpret it or attempt to perform any action based on its content, nor will it transform it. So from the Datastore's point of view, you have nothing to worry about (you don't create text GQLs by appending text, ever, right?!).
Also note that if you're going to store large texts in your entities, those large texts will be fetched whenever you load / query such entities, and you also must send them whenever you modify and (re)save such an entity.
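As a minimal save/load sketch of that round trip, assuming the cloud.google.com/go/datastore client (the project ID and key name are placeholders; the older App Engine datastore package has equivalent Put and Get calls):

package main

import (
    "context"
    "log"

    "cloud.google.com/go/datastore"
)

type Post struct {
    UserID  int64  `datastore:"uid"`
    Content string `datastore:"content,noindex"`
}

func main() {
    ctx := context.Background()
    client, err := datastore.NewClient(ctx, "my-project") // placeholder project ID
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Saving sends the whole Content string, however long it is.
    key := datastore.NameKey("Post", "post-1", nil)
    post := &Post{UserID: 1, Content: "a very long article body..."}
    if _, err := client.Put(ctx, key, post); err != nil {
        log.Fatal(err)
    }

    // Loading fetches the whole Content string again.
    var loaded Post
    if err := client.Get(ctx, key, &loaded); err != nil {
        log.Fatal(err)
    }
    log.Printf("loaded %d bytes of content", len(loaded.Content))
}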
Tip #1: Use projection queries if you don't need the whole texts in certain queries to avoid "big" data movement (and so to ultimately speed up queries).
Tip #2: To "ease" the burden of not being able to index large texts, you may add duplicate properties like a short summary or title of the large text, because string values shorter than 1500 bytes can be indexed.
Tip #3: If you want to go over the 1 MB entity size limit, or you just want to generally decrease your datastore size usage, you may opt to store large texts compressed inside entities. Since they are long, you can't search / filter them anyway, but they are very well compressed (often below 40% of the original). So if you have many long texts, you can shrink your datastore size to like 1 third just by storing all texts compressed. Of course this will add to the entity save / load time (as you have to compress / decompress the texts), but often it is still worth it.
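As a rough sketch of Tip #3, assuming the standard compress/gzip package (the CompressedPost type and helper names here are illustrative, not part of any Datastore API):

package post

import (
    "bytes"
    "compress/gzip"
    "io"
)

// CompressedPost stores the article body gzip-compressed. It cannot be
// indexed anyway, so only the raw compressed bytes are kept, unindexed.
type CompressedPost struct {
    UserID  int64  `datastore:"uid"`
    Content []byte `datastore:"content,noindex"`
}

// compress gzips a text before it is saved into the entity.
func compress(text string) ([]byte, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write([]byte(text)); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// decompress restores the original text after the entity is loaded.
func decompress(data []byte) (string, error) {
    zr, err := gzip.NewReader(bytes.NewReader(data))
    if err != nil {
        return "", err
    }
    defer zr.Close()
    text, err := io.ReadAll(zr)
    if err != nil {
        return "", err
    }
    return string(text), nil
}

The compression / decompression cost is paid on every save and load, which is the trade-off the tip mentions.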

Make better LINQ queries in MVC3

I had a question in my mind: to get better performance from a LINQ query, is it better to use the Select extension to select just the fields we need rather than all of the fields?
I have a table "News" with fields (id, title, text, regtime, regdate, username, ...), where the text field is very long text and takes up a lot of space in each row.
So I decided to change this query on the index page (which does not show the text of the news) from this
var model = db.News.ToList();
to this one
var model = db.News.Select(n => new NewsVM() { id = n.id, title = n.title, regtime = n.regtime, ... });
and when I ran both through Fiddler, the Bytes Received was the same.
Fiddler doesn't monitor the data sent from SQL Server to the web server. Fiddler will only show you the size of the HTML that the view generates.
So to answer your question: YES! The performance is much better if you ask only for the fields you need rather than asking for all of them (or blindly selecting every column). SQL Server should/may run the query faster; it may be able to retrieve all the fields you are asking for directly from an index rather than having to actually read each row. There are many other reasons as well, but they get more technical.
As for the web server, it too will execute faster, since it doesn't have to receive as much data from SQL Server, and it will use less memory (leaving more memory available for caching, etc.).
A good analogy: if I asked you for the first word of the first 10 books in your library, would it be faster to copy the entire contents of each book into your notebook first and then give me the first words, or to just write down the first word of each book? Both answers are only 10 words long.

A space-efficient view

For all documents with a certain type, I have a single query in my app, which selects just a single field of the last document. I map those documents by date, so making a descending query limited to 1 should certainly do the trick. The problem I'm bothered by is that this view would cache all documents of this type, occupying obviously redundant space.
So my questions are:
Would adding a reduce function to this view, which would reduce to the single last document, save any space for me, or would the view still have to store all the documents involved?
If not, is there any other space-efficient strategy?
No. Space will still be wasted by the result of the map function.
Some things on my mind at the moment:
Change the design of the database. If the id of the document includes the type and date, you could do some searching without the map/reduce, like this: http://127.0.0.1:5984/YOURDB/_all_docs?start_key="<TYPE>_<CURRENT_TIME>"&descending=true&limit=1.
Make the best use of map that you can. Emit no value, and the view will store just the key and the id of the documents. Use include_docs to retrieve the doc when querying (see the example query after this list).
Add an additional field saying that the document is a candidate for the last one. Map only those candidates that do have the field. Periodically run a cleanup, removing the field from all the documents except the latest one. Note: this can be difficult when deleting the last added document is supported.
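For the second suggestion, the query for the latest document might then look something like this (the design document and view names are illustrative):
http://127.0.0.1:5984/YOURDB/_design/docs/_view/by_date?descending=true&limit=1&include_docs=true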
To me, that seems to be the idea of CouchDB: "waste" space by caching the queries so they can be answered quickly if the data is not changing too frequently. Perhaps, if you care so much about wasting space, the answer in your case is not CouchDB?
My CouchDB setup has the data and the indexes on separate RAID drives. Maps are written in Erlang, which I find 8x faster than JavaScript, and my maps of course return null. I keep the keys small, I break up my views across many design documents, and I keep my data very flat, which improves serialization performance.

Parsing structured and semi-structured text with hundreds of tags with Ruby

I will be processing batches of 10,000-50,000 records with roughly 200-400 characters in each record. I expect the number of search terms I could have would be no more than 1500 (all related to local businesses).
I want to create a function that compares the structured tags with a list of terms to tag the data.
These terms are based on business descriptions. So, for example, a [Jazz Bar], [Nightclub], [Sports Bar], or [Wine Bar] would all correspond to queries for [Bar].
Usually this data has some kind of existing tag, so I can also create a strict hierarchy for the first pass and then do a second pass if there is no definitive existing tag.
What is the most performant way to implement this? I could have a table with all the keywords and try to match them against each piece of data. This is straightforward in the case where I am matching the existing tag, and less straightforward when processing free text.
I'm using Heroku/PostgreSQL.
It's a pretty safe bet to use the Sphinx search engine and the ThinkingSphinx Ruby gem. Yes, there is some configuration overhead, but I have yet to find a scenario where Sphinx has failed me. :-)
If you have 30-60 minutes to tinker with setting this up, give it a try. I have been using Sphinx to search a DB table with 600,000+ records with complex queries (3 separate search criteria + 2 separate field groupings / sortings) and I was getting results in 0.625 seconds, which is not bad at all and, I am sure, is a lot better than anything you could accomplish yourself with pure Ruby code.

Locayta indexer update

I am using Locayta for full-text search in one of my projects, and the entity I need to search has text content and tags. Looking at the LocNotes code, it looks like every time any aspect of the entity changes, I need to update it in the Locayta indexer. So if the text content is 100 KB and I only modified the tags, does it still re-index the 100 KB text content even though it has not changed?
Thanks!
I'm one of Locayta's developers; your assumption there is correct. Locayta Search treats each document as a single entity, so at the moment the full document has to be re-sent if any aspect of it has been updated.
In the future, if there's enough demand for it, we may add the ability to update the boolean values stored for the document separately, but there would probably be a lot of provisos around such an update due to how Locayta Search stores various aspects of each document separately. The most reliable way to update a document will always be to send the entire document data each time.
