Why is it bad to return a document from a Couchbase view?

I am trying to enter the Couchbase world and am learning about views.
Several times in presentations and demos I have heard that it's bad to return the whole doc from a view:
emit(meta.id, doc);
My question is: why? What should I return instead, and how can I get the proper values of the document?

It's a bad idea because it's actually counterproductive. Emitting the document from the view means it will be stored on disk with the view index itself. You pay the I/O price for writing the document to disk again (a duplicate of the original key/value doc), and you pay it again for reading it at query time. Because view queries are served from disk (or the filesystem cache), you never take advantage of the integrated cache layer to retrieve the document faster. In short, on average it is faster to get the document ID from the view and retrieve the document by ID than it is to read the whole document straight from the view. This is especially true for operations on multiple documents.

It's bad because it's a large drain on resources: views often update and overwrite their indexes, so if you are writing a whole doc into them repeatedly, that requires a large amount of processor time and disk I/O (along with filesystem cache).
Therefore, it is recommended (and far more efficient) to have the view return the document ID and then use a standard get to retrieve the whole doc.
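For example, here is a minimal sketch of the recommended pattern (the doc.type check and the field name are hypothetical, not part of the question):
// Map function of the view: emit only the key, never the whole doc
function (doc, meta) {
  if (doc.type === "order") {
    emit(meta.id, null);
  }
}
At query time you read the IDs from the view rows and issue a regular get (or a bulk get for many rows) through the SDK, so the documents are served from Couchbase's managed cache instead of the view file on disk.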

Related

Core Data or sqlite for fast search?

This is a description of the application I want to build, and I'm not sure whether to use Core Data or SQLite (or something else?):
Single user, desktop, not networked, only one frontend accessing the data storage
User occasionally enters some data; no bulk importing or large inserts
Simple data model: an entity with up to 20-30 attributes
User searches in the data (about 50k records max.)
Search takes place mostly in attribute values, not in keys; I'm searching for text in the values
Writing the data is nothing I see as critical; it happens infrequently and with small amounts of data. The text search in the attributes has to be blazingly fast; a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.

A space-efficient view

For all documents of a certain type I have a single query in my app, which selects just a single field of the latest document. I map those documents by date, so a descending query limited to 1 should certainly do the trick. What bothers me is that this view would store all documents of this type, taking up obviously redundant space.
So my questions are:
Would adding a reduce function to this view, one that reduces to just the last document, save any space, or would the view still have to store all the documents involved?
If not, is there any other space-efficient strategy?
No. Space will still be taken up by the result of the map function.
Some ideas that come to mind:
Change the design of the database. If the ID of the document includes the type and the date, you can do the search without map/reduce, like this: http://127.0.0.1:5984/YOURDB/_all_docs?start_key="<TYPE>_<CURRENT_TIME>"&descending=true&limit=1.
Make the best use of map that you can. Emit no value, and the map will store only the key and the id/rev of the documents. Use include_docs=true to retrieve the doc when querying (see the sketch below).
Add an additional field saying that the document is a candidate for being the last one. Map only the candidates that have that field. Periodically run a cleanup that removes the field from all documents except the latest one. Note: this can be difficult when deleting the most recently added document is supported.
That seems to me to be the idea of CouchDB: "waste" space by caching the queries, so they can be answered quickly if the data is not changing too frequently. Perhaps, if you care so much about the wasted space, CouchDB is not the answer in your case?
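A minimal sketch of the second idea (emit no value), assuming the documents carry type and date fields; the names are hypothetical:
// Map function: key on the date, emit no value so the index stays small
function (doc) {
  if (doc.type === "my_type" && doc.date) {
    emit(doc.date, null);
  }
}
Querying this view with descending=true&limit=1&include_docs=true returns just the newest row, and the full document is read from the primary database file instead of being duplicated inside the view index.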
My CouchDB setup has the data and the indexes on separate RAID drives. My maps are written in Erlang, which I find about 8x faster than JavaScript, and they of course emit null values. I keep the keys small, I break up my views across many design documents, and I keep my data very flat, which improves serialization performance.

Serializeable In-Memory Full-Text Index Tool for Ruby

I am trying to find a way to build a full-text index stored in-memory in a format that can be safely passed through Marshal.dump/Marshal.load so I can take the index and encrypt it before storing it to disk.
My rationale for needing this functionality: I am designing a system where a user's content needs to be both encrypted using their own key, and indexed for full text searching. I realize there would be significant overhead and memory usage if for each user of the system I had to un-marshal and load the entire index of their content into memory. For this project security is far more important than efficiency.
A full text index would maintain far too many details about a user's content to leave unencrypted, and simply storing the index on an encrypted volume is insufficient as each user's index would need to be encrypted using the unique key for that user to maintain the level of security desired.
User content will be encrypted and likely stored in a traditional RDBMS. My thought is that loading/unloading the serialized index would be less overhead for a user with large amounts of content than decrypting all the DB rows belonging to them and doing a full scan for every search.
My trials with ferret got me to the point of successfully creating an in-memory index. However, the index failed a Marshal.dump due to the use of Mutex. I am also evaluating xapian and solr but seem to be hitting roadblocks there as well.
Before I go any further, I would like to know whether this approach is even a sane one, and what alternatives I might want to consider if it's not. I would also like to know whether anyone has had success serializing a full-text index in this manner, what tool you used, and any pointers you can provide.
Why not use a standard full-text search engine and keep each client's index on a separate encrypted disk image, like TrueCrypt? Each client's disk image could have a unique key, it would use less RAM, and would probably take less time to implement.

How is the WordWeb English dictionary implemented?

We need some in-memory data structure to keep an English word dictionary in memory.
When the computer/WordWeb starts, we need to read the dictionary from disk into that in-memory data structure.
This question asks: how do typical real-world dictionaries, say WordWeb, populate the in-memory data structure from disk?
Ideally we would like to keep the dictionary on disk in the same layout we need in memory, so that we don't have to spend time building the in-memory data structure and can just read it off the disk. But for linked lists, pointers, etc., how do we store the same image on disk? Would some form of relative addressing help here?
Typically, is the entire dictionary read and stored in memory, or is only a part of it loaded, with leaf-page I/Os done when searching for a specific word?
If somebody wants to describe what that in-memory data structure typically is, please go ahead.
Thanks,
You mentioned pointers, so I'm assuming you're using C++; if that's the case and you want to read directly from disk into memory without having to "rebuild" your data structure, then you might want to look into serialization: How do you serialize an object in C++?
However, you generally don't want to load the entire dictionary anyway, especially if it's a user application. If the user is looking up dictionary words, then reading from disk happens so fast that the user will never notice the "delay." If you're servicing hundreds or thousands of requests, then it might make sense to cache the dictionary into memory.
So how many users do you have?
What kind of load are you expecting to have on the application?
WordWeb uses an SQLite database as its backend. It makes sense to me to use a database system to store the content, so that whatever the user is looking for can be fetched quickly.
WordWeb has word prediction as well, so that becomes a query to the database like:
select word from table where word like 'ab%';
On the other hand, when the user presses Enter for the word:
select meaning from table where word='abandon';
You do not want to be deserializing the content from disk into memory while the user is typing or after they have pressed Enter to search. Since the data is large (a dictionary), serialization will probably take more time per word search than the user will tolerate.
Otherwise, why don't you create a JSON file containing all the meanings, creating a short form of the dictionary?
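As a rough illustration of the SQLite approach in Node.js (the better-sqlite3 package and the words(word, meaning) schema are assumptions, not what WordWeb actually ships):
// Prefix lookup for word prediction, exact lookup for the definition
const Database = require("better-sqlite3");
const db = new Database("dictionary.db", { readonly: true });

const predict = db.prepare("SELECT word FROM words WHERE word LIKE ? ORDER BY word LIMIT 10");
const lookup = db.prepare("SELECT meaning FROM words WHERE word = ?");

console.log(predict.all("ab%"));    // completions while the user types
console.log(lookup.get("abandon")); // full definition once Enter is pressed
With a suitable index on the word column, both queries touch only a few pages of the database file, so the whole dictionary never has to be loaded into memory.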

Solr query - Is there a way to limit the size of a text field in the response

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....
I have 2 fields:
docId - int
text - string.
I will query the docId field and want to get a 200-char "preview" from the text field. On average, the text field has anything from 600-2000 chars, but I only need a preview.
eg. [mySolrCore]/select?q=docId:123&fl=text
Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?
I'm not looking at hit highlighting, since I'm not searching for specific text within the text field, but if there is functionality similar to the hl.fragsize parameter, that would be great!
Hope someone can point me in the right direction!
Cheers!
You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.
http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50
Notes:
Make sure your alternate field does not exist in the field list (fl) parameter
Make sure your highlighting field (hl.fl) does not actually contain the text you want to search
I find that the CPU cost of running the highlighter is sometimes more than the CPU cost and bandwidth of just returning the whole field. You'll have to experiment.
I decided to turn my comment into an answer.
I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.
Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.
So, I would suggest 2 things.
First, and optimally: remove the text storage entirely from your search index. Fetch the preview text and the whole text from a secondary system that is optimized for holding documents, like a file server.
Second, and sub-optimally: store only the preview text in your search index. Store the entire document elsewhere, like a file server.
You can add an additional field, like excerpt/summary, that contains the first 200 chars of the text, and return that field instead.
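A rough sketch of that approach, building the excerpt on the client before indexing (the core name, field names, and the use of a global fetch are assumptions about a fairly standard setup, not something Solr does for you):
// Index the full text for searching, but store only a short preview for display
const fullText = "...the complete document text...";
const doc = { docId: 123, text: fullText, excerpt: fullText.slice(0, 200) };

fetch("http://localhost:8983/solr/mySolrCore/update?commit=true", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify([doc]),
}).then((res) => console.log("indexed:", res.status));

// At query time, request only the preview:
// /select?q=docId:123&fl=docId,excerpt
In the schema, text can then be indexed but not stored, while excerpt is stored, which keeps the index small along the lines of the answer above.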
My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or the equivalent. This is a normal (see Google for an example) and productive technique.
Presently we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and creating the snippet there, together with many others in a set of responses, as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).
The sensible thing would be for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with highlights and so forth are just that, and they interfere with proper usage. Keep in mind that the mechanisms feeding material into Solr/Lucene may not be parsing the files, so those feeders cannot create the snippets.
LinkedIn real-time search:
http://snaprojects.jira.com/browse/ZOIE
For storing big data:
http://project-voldemort.com/

Resources