Previous and next documents in CouchDB

I have a database containing items with a floating-point timestamp. When the user visits a page with url /item/id/<the-id-here>, the document with the corresponding identifier should be displayed (this is easy) along with links to the previous and next documents in chronological order, if they exist.
These links must be persistent, and items tend to appear and disappear at various timestamps based on external conditions, so I have no option to cache one or both of these identifiers in URLs.
Right now, I'm using a view sorted by timestamp to make two queries (one descending, one ascending) in order to fetch the previous and next documents.
Can I do this with fewer requests?
I'm afraid that floating-point precision will cause mayhem if objects have very close timestamps. How can I avoid this?

Isn't it possible to maintain the previous/next document for each document in the database?
It depends on your ratio of reads to writes, but it might be faster to run these two queries at each insert/delete than at each page view.
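A minimal sketch of that idea, using a plain Python dict as a stand-in for the database. The field names ts, prev and next and both function names are made up for illustration; in CouchDB the two neighbour lookups would be the descending and ascending view queries from the question. Ties are broken by document id, so equal timestamps still give a stable order.

```python
def insert_linked(docs, new_id, timestamp):
    """Add a document and repair the prev/next pointers of its neighbours."""
    # Order by (timestamp, id) so documents with equal timestamps stay ordered.
    key = (timestamp, new_id)
    older = [(d["ts"], i) for i, d in docs.items() if (d["ts"], i) < key]
    newer = [(d["ts"], i) for i, d in docs.items() if (d["ts"], i) > key]
    prev_id = max(older)[1] if older else None
    next_id = min(newer)[1] if newer else None

    docs[new_id] = {"ts": timestamp, "prev": prev_id, "next": next_id}
    if prev_id is not None:
        docs[prev_id]["next"] = new_id
    if next_id is not None:
        docs[next_id]["prev"] = new_id


def delete_linked(docs, doc_id):
    """Remove a document and link its former neighbours to each other."""
    doc = docs.pop(doc_id)
    if doc["prev"] is not None:
        docs[doc["prev"]]["next"] = doc["next"]
    if doc["next"] is not None:
        docs[doc["next"]]["prev"] = doc["prev"]
```

Each write then touches up to three documents, while a page view needs only the one document it displays.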

Not sure if you're still working on solving this, but if so, you might try the following:
Output the timestamp as a date array: [2012, 4, 30, 3, 20, 35]
Use ?startkey=[2012, 4, 30, 3, 20, 0]&endkey=[2012, 4, 30, 3, 20, 60] to get the range of key/values in the index for that minute (or hour, day, etc., depending on the granularity of what you're emitting).
In the client code use the timestamp of the document you're looking for siblings of, then get its siblings from the array/hash table/dictionary/vector/whatever.
That would get you down to a single GET request, with the trade-off of a larger response size. However, if you're using the HTTP If-None-Match/ETag headers and some client-side caching, you may be able to get by with a single request for a larger range of documents: cache a list for the day or week range and use that cached array/whatever for later look-ups, with much "lighter" hits to the server (as it would only check the ETag values and return "just" the 304 Not Modified response code).
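The client-side look-up could be sketched like this. The function name siblings and the row shape (key, doc_id) are assumptions for illustration, and the rows are assumed to arrive sorted by key with equal keys ordered by document id, which also sidesteps the worry about documents with identical timestamps.

```python
import bisect


def siblings(rows, key, doc_id):
    """Return (previous_id, next_id) for a document in a sorted list of rows.

    rows is a list of (key, doc_id) tuples in view order, where key is the
    emitted date array as a tuple, e.g. (2012, 4, 30, 3, 20, 35).
    """
    i = bisect.bisect_left(rows, (key, doc_id))
    if i == len(rows) or rows[i] != (key, doc_id):
        raise KeyError(doc_id)
    previous_id = rows[i - 1][1] if i > 0 else None
    next_id = rows[i + 1][1] if i + 1 < len(rows) else None
    return previous_id, next_id


rows = [
    ((2012, 4, 30, 3, 20, 5), "a"),
    ((2012, 4, 30, 3, 20, 35), "b"),
    ((2012, 4, 30, 3, 20, 35), "c"),  # same timestamp as "b", ordered by id
    ((2012, 4, 30, 3, 20, 50), "d"),
]
print(siblings(rows, (2012, 4, 30, 3, 20, 35), "c"))  # ('b', 'd')
```

A None at either end only means the neighbour is not in the fetched window; it may still exist outside the startkey/endkey range, so the range has to be widened before concluding there is no previous or next document.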
FWIW, this works with the Couchbase Server 2.0 API as well (as it's based on CouchDB's view system/API).
Hope that helps.

Related

How to sort by a derived value that includes a moving date in ElasticSearch?

I have a requirement to sort the results returned by ElasticSearch by a special value I define, let's call it 'X'.
Now, the problem is that 'X' is a value derived from:
1. field A in the document (which is a 'term')
2. field B (which is a 'date')
3. the current date (UTC)
So, the problem is obviously 3. The date always changes, so I'm not sure how to include it in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about the performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly and store it in each document. But that has a few drawbacks:
I need to have something running in the background to update this value.
There could be a lot of documents to update (60%+ every night).
I lose precision for the value depending on the time between script runs (if I run nightly, the value can be 23 hours 'stale').
Any advice?
Thanks

This can be done by having an ES script run nightly that calculates the value and stores it in each document.

Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time but I haven't written any code yet because I first want to solve some general problems I am struggling with. This is the main one.
Background
A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.
Constraints
Assume a server-side database of products containing roughly 50,000 products (50 MB)
Assume no db type, we interact with it via REST/GraphQL interface
Assume a single product record is < 1kB
Assume a max payload for a resultset of 256kB
Assume max 5MB storage on the client
Assume search result sets ranging between 0 ... 5000 items per search
Challenge
The challenge is to define a stateless but (network) efficient way to fetch pages from a result set so that it is deterministic which results we will get.
Example
In traditional paging, when getting the next 100 results for some query using this url:
https://example.com/products?category=shoes&firstResult=100&pageSize=100
the search result may look like this:
{
  "totalResults": 2458,
  "firstResult": 100,
  "pageSize": 100,
  "results": [
    {"some": "item"},
    {"some": "other item"},
    // 98 more ...
  ]
}
The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page. Because by the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of the result set. Even a small change can have a big impact: one item removed from the DB, that happened to be on page 0 of the result set, will change what results we will get when requesting all subsequent pages.
Goal
I am looking for a mechanism to make the definition of the result set independent of future database changes, so if someone was looking for shoes and got a result set of 2458 items, he could actually fetch all pages of that result set reliably even if it got influenced by later changes in the DB (I plan to not really delete items, but set a removed flag on them, for this purpose)
Ideas so far
I have seen a solution where the result set included a "pages" property, which was an array with the first and last id of the items in that page. Assuming your IDs keep going up in number and you don't really delete items from the DB ever, the number of items between two IDs is constant. Meaning the app could get all items between those two IDs and always get the exact same items back. The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
The only way I have come up with for now is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3,4,6,9,...)... but this feels rather inelegant...
Anyway, I am hoping this is not too broad or theoretical. I have a web-based DB, just no good idea on how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.

A versioned DB is the answer for result set consistency.
Each record has a primary id, a modification counter (version number) and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1 and sysdate as the modification time.
In the fetch response you include the DB request_time (do not use the client timestamp, because the client and server clocks may differ). The first page is served normally, but you return sysdate as request_time. Other pages are served differently: you add a condition like modification_time <= request_time for each versioned table.
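A rough in-memory sketch of that scheme, with a list of dicts standing in for the versioned table. The field names id, version, modified and removed and the functions snapshot and page are hypothetical; a real DB would express the same condition in its query language.

```python
def snapshot(versions, request_time):
    """Return the newest version of each record as of request_time."""
    latest = {}
    for row in versions:
        if row["modified"] > request_time:
            continue  # written after the first page was served: ignore it
        current = latest.get(row["id"])
        if current is None or row["version"] > current["version"]:
            latest[row["id"]] = row
    return latest


def page(versions, request_time, sort_key, first, size):
    """Serve one page of the result set as it looked at request_time."""
    rows = [r for r in snapshot(versions, request_time).values()
            if not r["removed"]]
    # The id is a tie-breaker so the order is fully deterministic.
    rows.sort(key=lambda r: (r[sort_key], r["id"]))
    return rows[first:first + size]
```

Because every page request carries the same request_time, later inserts, edits and soft deletes cannot shift items between pages, and any sort order can be used.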

You can cache the result set of IDs on the server side when a query arrives for the first time and return a unique ID to the frontend. This unique ID corresponds to the result set for that query. The frontend can then request something like next_page with the unique ID it got the first time it made the query. You should still go ahead with your approach of replacing the DELETE operation with a removed flag, because it ensures that none of the entries in the result set is deleted. You can discard the result set from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry.
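Something along these lines, as a sketch only. The class name ResultSetCache, its methods and the ten minute default lifetime are invented for illustration; a production version would also need a storage backend shared between server processes.

```python
import time
import uuid


class ResultSetCache:
    """Maps an opaque token to the ordered list of ids of one query result."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # token -> (expires_at, ids)

    def store(self, ids):
        """Remember a result set and return the token the client sends back."""
        token = uuid.uuid4().hex
        self._entries[token] = (self._clock() + self._ttl, list(ids))
        return token

    def page(self, token, first, size):
        """Return one page of ids; raises KeyError for unknown or expired tokens."""
        expires_at, ids = self._entries[token]
        if self._clock() >= expires_at:
            del self._entries[token]
            raise KeyError(token)
        return ids[first:first + size]
```

The client gets the token with the first page and sends it with every later page request; the ids returned by page() can then be loaded with a query like the WHERE id IN (...) from the question.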

How to properly read all changed entities from external API

I need to properly traverse all items in some external API. All items have an "update_time" property and I can query the items from the API in ascending or descending order. Which should I use to get all items without missing any of them?
Facts:
The external API is paginated (the limit and page parameters are fixed), so I cannot fetch all items in one query.
Querying items takes some time.
Processing the received items takes some time.
While a page of items is being queried or processed, items in the external system can change. This updates their 'update_time' property and affects ordering (and therefore pagination), so the next page call can leave a "gap" in the list of received items.
I don't want to process all items every time, only those updated since the last traversal (the task is scheduled to run every hour, for example). I store the max of the "last_update" property of all received items and skip older items on the next traversal.
Thanks, I find this really hard to picture.

CouchDb filter and sort in one view

I'm new to CouchDB.
I have to filter records by date (the date must be between two values) and sort the data by name, by date, etc. (it depends on the user's selection in the table).
In MySQL it looks like
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out if I can use one view for all these issues.
Here is my map example:
function(doc) {
  // Key: date first, so startkey/endkey can select a date range.
  emit(
    [doc.date, doc.name, doc.email],
    {
      email: doc.email,
      name: doc.name,
      date: doc.date
    }
  );
}
I tried to filter the data using startkey and endkey, but I'm not sure how to sort the data this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or do I have to create several views with the key order depending on my current sort field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email], etc.?
Thanks for your help!

As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing. Its query optimizer will pick an index into your table, it will scan a range from that index, load what it needs into memory, and execute query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
1. The view sorts by date, the query selects the date range, and a list function loads everything into memory and sorts by email/etc.
2. The views sort by email/etc., and a list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
If you need this done server side, you'll need your list function to load every matching document into an array, then sort it and JSON serialize it. This obviously falls to pieces if there are so many matching documents that they don't fit into memory or take too long to sort.
Option 2 scans through preordered documents and only sends those matching the dates. Done right, this avoids loading everything into memory. OTOH it might scan way too many documents, thrashing your disk IO.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.
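To make the two options concrete, here is a sketch of both as client-side Python over rows that have already been fetched from a view. The function names and the row shape (dicts with date, name and email) are assumptions; the strict comparisons mirror the date > ... AND date < ... condition in the question's SQL, and ISO date strings are assumed so that string order equals date order.

```python
import bisect
from operator import itemgetter


def filter_and_sort(rows_by_date, start, end, order_by, descending=False):
    """Option 1: rows come sorted by date; cut the range, then sort in memory."""
    dates = [r["date"] for r in rows_by_date]
    lo = bisect.bisect_right(dates, start)  # first row with date > start
    hi = bisect.bisect_left(dates, end)     # first row with date >= end
    return sorted(rows_by_date[lo:hi], key=itemgetter(order_by),
                  reverse=descending)


def scan_in_range(rows_by_field, start, end):
    """Option 2: rows come sorted by the display field; drop those out of range."""
    for row in rows_by_field:
        if start < row["date"] < end:
            yield row  # streamed one at a time, nothing is buffered
```

Option 1 holds the whole date range in memory at once, while option 2 holds one row at a time but has to read every row of the view.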

You COULD use a list function for that, in two ways:
1.) The Couch view is ordered by date and you sort by e-mail. Please be aware that you'd have to have ALL items in memory to do this sort by e-mail (i.e. you can do this only when your result set is small).
2.) The Couch view is ordered by e-mail and a list function drops everything outside the date range (you can only do that when the overall list is small, so this one is most probably bad).
Possibly #1 can help you.

Querying MongoDB for last-items-before

Consider I have two collections in MongoDB. One for products with documents like:
{'_id': ObjectId('lalala'), 'title': 'Yellow banana'}
And another stores price changes with documents like:
{'product': DBRef('products', ObjectId('lalala')),
'since': datetime(2011, 4, 5),
'new_price': 150 }
One product may have many price changes. A price lasts until a new change with a later timestamp. I guess you get the idea.
Say I have 100 products. I want to query my DB to find out the price of each product as of June 9, 2011. What is the most efficient (quickest) way to perform this query in MongoDB? Suppose I have no cache solution or the cache is empty.
I thought about a group statement on the prices collection, where the reduce function would select the last since before the provided date, grouping by product.$id. But in this case I would not benefit from an index on the since field and all documents would be scanned.
Any ideas?

I had a similar problem, but for GPS locations. I found the fastest way was to set up a query for each item, which is rather counter-intuitive if you're used to SQL databases.
Query for the item whose timestamp is less than or equal to the date you're looking for, and limit the result to 1. Repeat for each item. To really speed things up, run multiple queries in parallel to utilise all the cores on the MongoDB server.
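Applied to the prices collection from the question, that could look roughly like this with PyMongo. The function names price_at and prices_at and the worker count are made up; prices is assumed to be a PyMongo collection (or anything with a compatible find_one), and the queries would presumably want a compound index on product.$id and since to be fast.

```python
from concurrent.futures import ThreadPoolExecutor


def price_at(prices, product_id, moment):
    """Return the price in effect for one product at `moment`, or None."""
    change = prices.find_one(
        {"product.$id": product_id, "since": {"$lte": moment}},
        sort=[("since", -1)],  # newest change first, find_one keeps only that one
    )
    return change["new_price"] if change else None


def prices_at(prices, product_ids, moment, workers=8):
    """Run one query per product in parallel and collect {product_id: price}."""
    ids = list(product_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda pid: price_at(prices, pid, moment), ids)
        return dict(zip(ids, results))
```

Products with no price change on or before the given moment map to None.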
