A space-efficient view - view

For all documents with a certain type I have a single query in my app, which selects just a single field of the last document. I map those documents by date, so making a descending query limited to 1 should certainly do the trick. The problem I'm bothered by is that this view would cache all documents of this type, occupying an obviously redundant space.
So my questions are:
Would adding a reduce function, which would reduce to the single last document, to this view save any space for me or the view would still have to store all the documents involved?
If not, is there any other space-efficient strategy?

No. Space will still be wasted by the result of map function.
Some things in my mind at the moment:
Change the design of the database. If the id of the document will include the type and date you could do some searching without the map/reduce like this: http://127.0.0.1:5984/YOURDB/_all_docs?start_key="<TYPE>_<CURRENT_TIME>"&descending=true&limit=1.
Make use of map the best you can. Emit no value, and map will store the key and the id/ver of the documents. Use include_doc to retrieve the doc when querying.
Add additional field saying that the document is a candidate for the last one. Map only those candidates who does have the field. Periodically run cleanup, removing the field from all the documents except the latest one. Note: this can by difficult when deleting the last added document is supported.
That seems to be for me the idea of CouchDB: "waste" space by caching the queries, so they can by answered quickly if the data is not changing to frequently. Perhaps if you care so much about wasting the space, the answer in your case in not the CouchDB?

My couchdb setup has the data and the indexes on sperate RAID drives. Maps are written in erlang which I find 8x faster faster than javascript and maps of course return null. I keep the keys small and I also break up my views across many design documents and I keep my data very flat which improves serialization performance.

Related

Is it bad practice to store JSON members with Redis GEOADD?

My application should handle a lot of entities (100.000 or more) with location and needs to display them only within a given radius. I basically store everything in SQL but using Redis for caching and optimization (mainly GEORADIUS).
I am adding the entities like the following example (not exactly this, I use Laravel framework with the built-in Redis facade but it does the same as here in the background):
GEOADD k 19.059982 47.494338 {\"id\":1,\"name\":\"Foo\",\"address\":\"Budapest, Astoria\",\"lat\":47.494338,\"lon\":19.059982}
Is it bad practice? Or will it make a negative impact on performance? Should I store only ID-s as member and make a following query to get the corresponding entities?
This is a matter of the requirements. There's nothing wrong with storing the raw data as members as long as it is unique (and it unique given the "id" field). In fact, this is both simple and performant as all data is returned with a single query (assuming that's what actually needed).
That said, there are at least two considerations for storing the data outside the Geoset, and just "referencing" it by having members reflect some form of their key names:
A single data structure, such as a Geoset, is limited by the resources of a single Redis server. Storing a lot of data and members can require more memory than a single server can provide, which would limit the scalability of this approach.
Unless each entry's data is small, it is unlikely that all query types would require all data returned. In such cases, keeping the raw data in the Geoset generates a lot of wasted bandwidth and ultimately degrades performance.
When data needs to be updated, it can become too expensive to try and update (i.e. ZDEL and then GEOADD) small parts of it. Having everything outside, perhaps in a Hash (or maybe something like RedisJSON) makes more sense then.

Postgres tsvector_update_trigger sometimes takes minutes

I have configured free text search on a table in my postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do however sometimes experience looong delays when inserting a new entry into the table, where the insert keeps running for minutes and also generates huge WAL files. (We use the WAL files for replication).
Is there anything I need to be aware of with my free text index? Like Postgres maybe randomly restructuring it for performance reasons? My index is currently around 400 MB big.
Thanks in advance!
Christian
Given the size of the WAL files, I suspect you are right that it is an index update/rebalancing that is causing the issue. However I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to run an index on to_tsvector()'s output. You can have multiple indexes for multiple languages if you need. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing:
CREATE INDEX mytable_description_tsvector_idx ON mytable(to_tsvector(description));
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
Now a second thing you should be aware of are partial indexes. If you need to, you can index only records of interest. For example, if most of my queries only check the last year, I can:
CREATE INDEX mytable_description_tsvector_idx ON mytable(to_tsvector(description))
WHERE created_at > now() - '1 year'::interval;

How to optimize "text search" for inverted index and relational database? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
Update 2022-08-12
I re-thought about it and realized I was overcomplicating it. I found the best way to enhance this system is by using good old information retrieval techniques ie using 'location' of a word in a sentence and 'ranking' queries to display best hits. The approach is illustrated in this following picture.
Update 2015-10-15
Back in 2012, I was building a personal online application and actually wanted to re-invent the wheel because am curious by nature, for learning purposes and to enhance my algorithm and architecture skills. I could have used apache lucene and others, however as I mentioned I decided to build my own mini search engine.
Question: So is there really no way to enhance this architecture except by using available services like elasticsearch, lucene and others?
Original question
I am developing a web application, in which users search for specific titles (say for example : book x, book y, etc..) , which data is in a relational database (MySQL).
I am following the principle that each record that was fetched from the db, is cached in memory , so that the app has less calls to the database.
I have developed my own mini search engine , with the following architecture:
This is how it works:
a) User searches a record name
b) The system check what character the query starts with, checks if query there : get record. If not there, adds it and get all matching records from database using two ways:
Either query already there in the Table "Queries" (which is a sort of history table) thus get record based on IDs (Fast performance)
Or, otherwise using Mysql LIKE %% statement to get records/ids (Also then keep the used query by the user in history table Queries along with the ids it maps to).
-->Then It adds records and their ids to the cache and Only the ids to the inverted index map.
c) results are returned to the UI
The system works fine, however I have Two main issues, that i couldn't find a good solution for (been trying for the past month):
First issue:
if you check point (b) , case where no query "history" is found and it has to use the Like %% statement : this process becomes time consuming when the query matches numerous records in the database (instead of one or two):
It will take some time to get records from Mysql (this is why i used INDEXES on the specific columns)
Then time to save query history
Then time to add records/ids to cache and inverted index maps
Second issue:
The application allows users to add themselves new records, that can immediately be used by other users logged in the to application.
However to achieve this, inverted index map and table "queries" have to be updated so that in case any old query matches to the new word. For example if a new record "woodX" is being added, still the old query "wood" does map to it. So in order to re-hook query "wood" to this new record, here is what i am doing now:
new record "woodX" gets added to "records" table
then i run a Like %% statement to see which already existing query in table "queries" does map to this record(for example "wood"), then add this query with the new record id as a new row: [ wood, new id].
Then in memory, update inverted index Map's "wood" key's value (ie the list), by adding the new record Id to this list
--> Thus now if a remote user searches "wood" it will get from memory : wood and woodX
The Issue here is also time consumption. Matching all query histories (in table queries) with the newly added word takes a lot of time (the more matching queries, the more time). Then the in memory update also takes a lot of time.
What i am thinking of doing to fix this time issue, is to return the desired results to the user first , then let the application POST an ajax call with the required data to achieve all these UPDATE tasks. But i am not sure if this is a bad practice or an unprofessional way of doing things?
So for the past month ( a bit more) i tried to think of the best optimization/modification/update for this architecture, but I am not an expert in the document retrieval field (actually its my first mini search engine ever built).
I would appreciate any feedback or guidance on what i should do to be able to achieve this kind of architecture.
Thanks in advance.
PS:
Its a j2ee application using servlets.
I am using MySQL innodb (thus i cannot use full-text search option)
I would strongly recommend Sphinx Search Server, wchich is best optimized in full-text searching. Visit http://sphinxsearch.com/.
It's designed to work with MySQL, so it's an addition to Your current workspace.
I do not pretend to have THE solution but here is my ideas.
First, I though like you for time-consuming queries LIKE%% : I would execute a query limited to a few answers in MySQL, like a dozen, return that to user, and wait to see if user wants more matching records, or launch in background the full-query, depending on you indexation needs for future searches.
More generally, I think that storing everything in memory could lead, one day, to too-much memory consumption. And althrough the search-engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up-to-date when data is added or updated and it will certainly take more and more time.
That's why I think the solution I saw a day in an "open-source forum software" (I couldn't remember its name) is not too bad for text searching in posts : each time a data is inserted, a table named "Words" keeps tracks of every existing word, and another table (let's say "WordsLinks") the link between each word and posts it appears in.
This kind of solution has some drawbacks:
Each Insert, Delete, Update in database is a lot slower
Data selection for search engine must be anticipated : if you choose to keep two letter words you never kept, it is too late for already recorded data, unless you launch a complete data re-processing.
You must take care of DELETE as well as UPDATE and INSERT
But I think there are some big advantages:
Computing time is probably the same than the "memory solution" (eventually), but it is divided in each database Create/Update/Delete, rather than at query time.
Looking for a whole word, or words "starting with" is instantaneous : when indexed, searching in "Words" table is dichotomic. And "WordLinks" table query is very fast either with an index.
Looking for multiple words at the same time could be simple : gather a group of "WordLinks" for each found Word, and execute an intersection on them to keep only "Database Ids" common to all these groups. For example with the words "tree" and "leaf", the first one could give Table records {1, 4, 6}, and the second one could give {1, 3, 6, 9}. So with an intersection it is simple to keep only common parts : {1, 6}.
A "Like %%" in a single-column table is probably faster than a lot of "Like %%" in different fields of different tables. And each database engine handles some cache : "Words" table could be little enough to be kept in memory
I think there is a small risk of performance and memory problems if data becomes huge.
As every search is fast, you can even look for synonyms. For example search "network" if user didn't find anything with "ethernet".
You can apply rules, like splitting camel case words to generate for example the 3 words "wood", "X", "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.
I think the solution you need could be a blend of methods : for example you can keep lightweight UPDATE, INSERT, DELETE, and launch "Words" and "WordsLinks" feeding from a TRIGGER.
Just for anecdote, I saw a software developped by my company in which it was decided to keep "everything" (!) in memory. It leads us to recommend to our customers to buy servers with 64GB RAM. A little bit expensive. It explains why I am very prudent when I see solutions that could lead, eventually, to memory filling.
I have to say, I don't think your design fits the problem very well. The issues that you see now are consequences of that. And apart from that, your current solution doesn't scale.
Here is a possible solution:
Redesign your database to only contain authoritative data, but no derived data. So all cache entries must vanish from MySQL.
Keep data only for the duration of a request in memory within your application. This makes the design of your application much simpler (think race conditions) and enables you to scale to a sensible number of clients.
Introduce a caching layer. I'd strongly recommend to use an established product, rather than building this yourself. This frees you of all the custom built caching logic in your application and even does the job much better.
You can take a look at Redis or Memcached for the caching layer. I think an LRU strategy should fit here. Depending on how complex your queries become, a dedicated indexed search mechanism like Lucene might make sense as well.
I'm sure this can be implemented in MySQL but it would be a lot less effort to just use an existing search-oriented database such as Elasticsearch. It uses Lucene library to implement the inverted index, has extensive documentation, supports horizontal scaling, fairly simple query language and so forth. I guess it has been quite a lot of work to get this far, and it will be even more work to handle caches, race conditions, bugs, performance issues etc. to make the solution "production grade".

Solr query - Is there a way to limit the size of a text field in the response

Is there a way to limit the amount of text in a text field from a query? Here's a quick scenario....
I have 2 fields:
docId - int
text - string.
I will query the docId field and want to get a "preview" text from the text field of 200 chars. On average, the text field has anything from 600-2000 chars but I only need a preview.
eg. [mySolrCore]/select?q=docId:123&fl=text
Is there any way to do it since I don't see the point of bringing back the entire text field if I only need a small preview?
I'm not looking at hit highlighting since i'm not searching for specific text within the Text field but if there is similar functionaly of the hl.fragsize parameter it would be great!
Hope someone can point me in the right direction!
Cheers!
You would have to test the performance of this work-around versus just returning the entire field, but it might work for your situation. Basically, turn on highlighting on a field that won't match, and then use the alternate field to return the limited number of characters you want.
http://solr:8080/solr/select/?q=*:*&rows=10&fl=author,title&hl=true&hl.snippets=0&hl.fl=sku&hl.fragsize=0&hl.alternateField=description&hl.maxAlternateFieldLength=50
Notes:
Make sure your alternate field does not exist in the field list (fl) parameter
Make sure your highlighting field (hl.fl) does not actually contain the text you want to search
I find that the cpu cost of running the highlighter sometimes is more than the cpu cost and bandwidth of just returning the whole field. You'll have to experiment.
I decided to turn my comment into an answer.
I would suggest that you don't store your text data in Solr/Lucene. Only index the data for searching and store a unique ID or URL to identify the document. The contents of the document should be fetched from a separate storage system.
Solr/Lucene are optimized for searches. They aren't your data warehouse or database, and they shouldn't be used that way. When you store more data in Solr than necessary, you negatively impact your entire search system. You bloat the size of indices, increase replication time between masters and slaves, replicate data that you only need a single copy of, and waste cache memory on document caches that should be leveraged to make search faster.
So, I would suggest 2 things.
First, optimally, remove the text storage entire from your search index. Fetch the preview text and whole text from a secondary system that is optimized for holding documents, like a file server.
Second, sub-optimal, only store the preview text in your search index. Store the entire document elsewhere, like a file server.
you can add an additional field like excerpt/summary that consist the first 200 chars on text, and return that field instead
My wish, which I suspect is shared by many sites, is to offer a snippet of text with each query response. That upgrades what the user sees from mere titles or equivalent. This is normal (see Google as an example) and productive technique.
Presently we cannot easily cope with sending the entire content body from Solr/Lucene into a web presentation program and create the snippet there, together with many others in a set of responses as that is a significant network, CPU, and memory hog (think of dealing with many multi-MB files).
The sensible thing is for Solr/Lucene to have a control for sending only the first N bytes of content upon request, thereby saving a lot of trouble in the field. Kludges with hightlights and so forth are just that, and interfere with proper usage. We keep in mind that mechanisms feeding material into Solr/ucene may not be parsing the files, so those feeders can't create the snippets.
Linkedin real time search
http://snaprojects.jira.com/browse/ZOIE
For storing big data
http://project-voldemort.com/

Should I store reference data in my application memory, or in the database?

I am faced with the choice where to store some reference data (essentially drop down values) for my application. This data will not change (or if it does, I am fine with needing to restart the application), and will be frequently accessed as part of an AJAX autocomplete widget (so there may be several queries against this data by one user filling out one field).
Suppose each record looks something like this:
category
effective_date
expiration_date
field_A
field_B
field_C
field_D
The autocomplete query will need to check the input string against 4 fields in each record and discrete parameters against the category and effective/expiration dates, so if this were a SQL query, it would have a where clause that looks something like:
... WHERE category = ?
AND effective_date < ?
AND expiration_date > ?
AND (colA LIKE ? OR colB LIKE ? OR colC LIKE ?)
I feel like this might be a rather inefficient query, but I suppose I don't know enough about how databases optimize their indexes, etc. I do know that a lot of really smart people work really hard to make database engines really fast at this exact type of thing.
The alternative I see is to store it in my application memory. I could have a list of these records for each category, and then iterate over each record in the category to see if the filter criteria is met. This is definitely O(n), since I need to examine every record in the category.
Has anyone faced a similar choice? Do you have any insight to offer?
EDIT: Thanks for the insight, folks. Sending the entire data set down to the client is not really an option, since the data set is so large (several MB).
Definitely cache it in memory if it's not changing during the lifetime of the application. You're right, you don't want to be going back to the database for each call, because it's completely unnecessary.
There's can be debate about exactly how much to cache on the server (I tend to cache as little as possible until I really need to), but for information that will not change and will be accessed repeatedly, you should almost always cache that in the Application object.
Given the number of directions you're coming at this data (filtering on 6 or more columns), I'm not sure how much more you'll be able to optimize the information in memory. The first thing I would try is to store it in a list in the Application object, and query it using LINQ-to-objects. Or, if there is one field that is used significantly more than the others, or try using a Dictionary instead of a list. If the performance continues to be a problem, try using storing it in a DataSet and setting indexes on it (but of course you loose some code-simplicity and maintainability this way).
I do not think there is a one size fits all answer to your question. Depending on the data size and usage patterns the answer will vary. More than that the answer may change over time.
This is why in my development I built some intermediate layer which allows me to change how the caching is done by changing configuration (with no code changes). Every while we analyze various stats (cache hit ratio, etc.) and decide if we want to change cache behavior.
BTW there is also a third layer - you can push your static data to the browser and cache it there too
Can you just hard-wire it into the program (as long as you stick to DRY)? Changing it only requires a rebuild.

Resources