Storing long text in Datastore - go

Is Datastore suitable to store really long text, e.g. profile descriptions and articles?
If not, what's the Google Cloud alternative?
If yes, what would be the ideal way to store it in order to preserve formatting such as line breaks and Markdown keywords? Simply store it as a string, or convert it to bytes? And should I be worried about dirty user input?
I need it for a Go project (I don't think the language is relevant, but maybe Go has some useful features for this).

Yes, it's suitable if you're OK with certain limitations.
These limitations are:
the overall entity size (properties + indices) must not exceed 1 MB (this should be OK for profiles and most articles)
texts longer than a certain limit (currently 1500 bytes) cannot be indexed, so the entity may store a longer string, but you won't be able to search in it / include it in query filters; don't forget to tag these fields with "noindex"
As for the type, you may simply use string, e.g.:
type Post struct {
    UserID  int64  `datastore:"uid"`
    Content string `datastore:"content,noindex"`
}
string values preserve all formatting, including newlines, HTML and any other markup.
"Dirty user input?" That's an issue of rendering / presenting the data. The Datastore will not try to interpret the text, perform any action based on its content, or transform it in any way. So from the Datastore's point of view you have nothing to worry about (as long as you never build GQL queries by concatenating user text, right?!).
Also note that if you store large texts in your entities, those large texts will be fetched whenever you load or query such entities, and you must also send them whenever you modify and (re)save such an entity.
Tip #1: Use projection queries if you don't need the whole texts in certain queries to avoid "big" data movement (and so to ultimately speed up queries).
Tip #2: To "ease" the burden of not being able to index large texts, you may add duplicate properties like a short summary or title of the large text, because string values shorter than 1500 bytes can be indexed.
Tip #3: If you want to go over the 1 MB entity size limit, or you just want to generally decrease your datastore size usage, you may opt to store large texts compressed inside entities. Since they are long, you can't search / filter them anyway, but they are very well compressed (often below 40% of the original). So if you have many long texts, you can shrink your datastore size to like 1 third just by storing all texts compressed. Of course this will add to the entity save / load time (as you have to compress / decompress the texts), but often it is still worth it.
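For Tip #3, a minimal sketch of compressing the text with the standard library's compress/gzip before saving it in a []byte property (the property and helper names are made up for illustration):

import (
    "bytes"
    "compress/gzip"
    "io"
)

type Article struct {
    UserID    int64  `datastore:"uid"`
    ContentGZ []byte `datastore:"content_gz,noindex"` // gzip-compressed text
}

// compressText gzips s so it can be stored in ContentGZ.
func compressText(s string) ([]byte, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write([]byte(s)); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// decompressText restores the original text after loading the entity.
func decompressText(b []byte) (string, error) {
    zr, err := gzip.NewReader(bytes.NewReader(b))
    if err != nil {
        return "", err
    }
    defer zr.Close()
    out, err := io.ReadAll(zr)
    if err != nil {
        return "", err
    }
    return string(out), nil
}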

Related

Why is it bad to return a whole document from a Couchbase view?

I am trying to enter the Couchbase world and am learning about views.
Several times in presentations and demos I heard it's bad to return the whole doc from a view:
emit(meta.id, doc);
My question is why? What should I return then, and how can I grab the proper values of the document?
It's a bad idea because it's actually counterproductive. Writing a document to the view means it will be stored on disk with the view index itself. You pay the I/O price for writing the document to disk again (a duplicate of the original key/value doc), and you pay it again for reading it at query time. Because view queries are served from disk (or the file system cache), you will never take advantage of the integrated cache layer to retrieve the document faster. In short, on average it will be faster to get the document ID from the view and retrieve the document by ID than to just read the whole document from the view. This is especially true for operations on multiple documents.
It's bad because it's a large drain on resources: views are frequently updated and their indexes rewritten, so if you are writing a whole doc into them repeatedly, it's going to require a large amount of processor time and disk I/O (and fill the filesystem cache).
Therefore, it is recommended (and far more efficient) to have the view return just the document ID and then use a standard get to retrieve the whole doc.

Core Data or sqlite for fast search?

This is a description of the application I want to build and I'm not sure whether to use Core Data or Sqlite (or something else?):
Single user, desktop, not networked, only one frontend is accessing datastorage
User occasionally enters some data, no bulk data importing or large data inserts
Simple datamodel: entity with up to 20-30 attributes
User searches the data (about 50k records max.)
Search takes place mostly in attribute values; I'm not looking up keys here, but searching for text in the values
Writing the data is nothing I see as critical, it happens not very often and with small amounts of data. The text search in the attributes has to be blazingly fast, a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish
Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.
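To make the token-entity idea concrete: a rough sketch of the approach, written in Go against a plain SQLite table rather than Core Data (the table and column names are made up), just to show the shape of it. Every findable word is stored once per record it appears in, and searches hit the small, indexed tokens table instead of scanning the full text.

import (
    "database/sql"
    "strings"
)

// indexTokens stores one row per distinct lowercase word of text for recordID.
func indexTokens(db *sql.DB, recordID int64, text string) error {
    seen := map[string]bool{}
    for _, w := range strings.Fields(strings.ToLower(text)) {
        if seen[w] {
            continue
        }
        seen[w] = true
        if _, err := db.Exec(
            `INSERT INTO tokens (word, record_id) VALUES (?, ?)`, w, recordID); err != nil {
            return err
        }
    }
    return nil
}

// findRecords returns the IDs of records containing a word with the given prefix.
func findRecords(db *sql.DB, prefix string) ([]int64, error) {
    rows, err := db.Query(
        `SELECT DISTINCT record_id FROM tokens WHERE word LIKE ?`, prefix+"%")
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    var ids []int64
    for rows.Next() {
        var id int64
        if err := rows.Scan(&id); err != nil {
            return nil, err
        }
        ids = append(ids, id)
    }
    return ids, rows.Err()
}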

Does Core Data/SQLite compress redundant information?

I want to use Core Data (probably with SQLite backing) to store a large database. Much of the string data will be the same between numerous rows. Does Core Data/SQLite see such redundancy, and automatically save space in the db files?
Do I need to make sure that the same text in different rows is the same string object before adding it to the db? If so, how do I detect that a new piece of text matches something anywhere in the existing db?
No, Core Data does not attempt to analyze your data to avoid duplication. If you want to save 10 million objects with the same attributes, you'll get 10 million copies.
If you want to avoid creating duplicate instances, you need to do a fetch for matching instances before creating a new one. The general approach is
Fetch objects matching new data-- according to whatever standard indicates a duplicate for your app. Use a predicate with the fetch that contains the attribute(s) that you don't want to duplicate.
If you find anything, either (a) update the instances you find with any new values you have, or (b) if there are no new values, do nothing.
If you don't find anything, create a new instance.
Application-layer logic can help reduce space at the cost of application complexity.
Say your name field can contain either an integer or a string. (SQLite's weak typing makes this easy to do).
If string -- that's the name right there.
If integer -- go look it up on a name table, using the int as key
Of course you have to create that name table, either on the fly as data is inserted, or a once-in-a-while trawl through the data for new names that are worth surrogating in this way.
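A minimal sketch of that name-table idea, written in Go with database/sql against SQLite (the table and column names are made up); it also shows the fetch-before-create pattern from the previous answer: look for an existing row first, and only insert when nothing matches.

import "database/sql"

// internName returns the id of name in the names table, inserting it first if
// it is not there yet (find-or-create). Callers then store only the integer key.
func internName(db *sql.DB, name string) (int64, error) {
    var id int64
    err := db.QueryRow(`SELECT id FROM names WHERE value = ?`, name).Scan(&id)
    if err == nil {
        return id, nil // already known, reuse the existing row
    }
    if err != sql.ErrNoRows {
        return 0, err
    }
    res, err := db.Exec(`INSERT INTO names (value) VALUES (?)`, name)
    if err != nil {
        return 0, err
    }
    return res.LastInsertId()
}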

A space-efficient view

For all documents with a certain type I have a single query in my app, which selects just a single field of the last document. I map those documents by date, so making a descending query limited to 1 should certainly do the trick. The problem I'm bothered by is that this view would cache all documents of this type, occupying an obviously redundant space.
So my questions are:
Would adding a reduce function, which would reduce to the single last document, to this view save any space for me or the view would still have to store all the documents involved?
If not, is there any other space-efficient strategy?
No. Space will still be wasted by the result of the map function.
Some things in my mind at the moment:
Change the design of the database. If the id of the document includes the type and date, you could do this search without map/reduce, like this (see the Go sketch after this list): http://127.0.0.1:5984/YOURDB/_all_docs?start_key="<TYPE>_<CURRENT_TIME>"&descending=true&limit=1.
Make use of map the best you can. Emit no value, and the map will store only the key and the id/rev of each document. Use include_docs to retrieve the doc when querying.
Add an additional field saying that the document is a candidate for the last one. Map only the candidates that have the field. Periodically run a cleanup, removing the field from all documents except the latest one. Note: this can be difficult when deleting the last added document is supported.
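A minimal sketch of the first option in Go, using only net/http and net/url; the database name "mydb" and the "<TYPE>_<UNIX_TIMESTAMP>" id scheme are assumptions for illustration (the ids must sort chronologically as strings for this to work):

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

// latestDoc asks _all_docs for the newest document of docType, scanning
// backwards from an upper-bound key built from the current time.
func latestDoc(docType string) (string, error) {
    params := url.Values{}
    params.Set("startkey", fmt.Sprintf(`"%s_%d"`, docType, time.Now().Unix())) // upper bound for the descending scan
    params.Set("descending", "true")
    params.Set("limit", "1")
    params.Set("include_docs", "true") // return the doc body instead of storing it in a view

    resp, err := http.Get("http://127.0.0.1:5984/mydb/_all_docs?" + params.Encode())
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}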
That seems to me to be the idea of CouchDB: "waste" space by caching the queries so that they can be answered quickly if the data is not changing too frequently. Perhaps, if you care so much about wasted space, CouchDB is not the answer in your case?
My CouchDB setup has the data and the indexes on separate RAID drives. My maps are written in Erlang, which I find 8x faster than JavaScript, and they of course return null. I keep the keys small, I break up my views across many design documents, and I keep my data very flat, which improves serialization performance.

Solr query - Is there a way to limit the size of a text field in the response

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....
I have 2 fields:
docId - int
text - string.
I will query the docId field and want to get a "preview" text from the text field of 200 chars. On average, the text field has anything from 600-2000 chars but I only need a preview.
eg. [mySolrCore]/select?q=docId:123&fl=text
Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?
I'm not looking at hit highlighting since I'm not searching for specific text within the text field, but if there is functionality similar to the hl.fragsize parameter it would be great!
Hope someone can point me in the right direction!
Cheers!
You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.
http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50
Notes:
Make sure your alternate field does not exist in the field list (fl) parameter
Make sure your highlighting field (hl.fl) does not actually contain the text you want to search
I find that the cpu cost of running the highlighter sometimes is more than the cpu cost and bandwidth of just returning the whole field. You'll have to experiment.
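For reference, a minimal sketch that just builds the work-around query above in Go with the standard net/url package (the host, handler path and field names are examples, not taken from your schema):

import "net/url"

// previewQueryURL builds the highlighting work-around URL for one document ID.
func previewQueryURL(docID string) string {
    params := url.Values{}
    params.Set("q", "docId:"+docID)
    params.Set("fl", "docId")                       // leave the big text field out of fl
    params.Set("hl", "true")
    params.Set("hl.snippets", "0")
    params.Set("hl.fl", "sku")                      // a highlighting field that will not match
    params.Set("hl.fragsize", "0")
    params.Set("hl.alternateField", "text")         // the big field, returned truncated
    params.Set("hl.maxAlternateFieldLength", "200") // preview length in characters
    return "http://localhost:8080/solr/select/?" + params.Encode()
}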
I decided to turn my comment into an answer.
I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.
Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.
So, I would suggest 2 things.
First, optimally, remove the text storage entirely from your search index. Fetch the preview text and the whole text from a secondary system that is optimized for holding documents, like a file server.
Second, sub-optimally, store only the preview text in your search index. Store the entire document elsewhere, like a file server.
You can add an additional field like excerpt/summary that holds the first 200 chars of text, and return that field instead.
My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or the equivalent. This is a normal (see Google as an example) and productive technique.
Presently we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and creating the snippet there, together with many others in a set of responses, as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).
The sensible thing would be for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with highlights and so forth are just that, and interfere with proper usage. Keep in mind that the mechanisms feeding material into Solr/Lucene may not be parsing the files, so those feeders can't create the snippets.
LinkedIn real-time search: http://snaprojects.jira.com/browse/ZOIE
For storing big data: http://project-voldemort.com/
