Caching results for TinyURLs

I was reading about TinyURLs and have a question. If I am an API user trying to get tiny URLs for a long URL, would it be OK to cache the tiny URLs or save them at the developer's end (in this case, the API caller's end) instead of making repeated calls?
One way to think about this is that caching saves an explicit API call, so the quota can be used for a new call instead. But I also feel that it might be ethically wrong to do so (maybe I'm wrong?).
At the end of the day, tiny URLs are all about saving and sharing with others. From that perspective it should be OK to cache, but I just wanted to know what the right thing to do here is.
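To make the idea concrete, here is a minimal sketch of what caching at the caller's end could look like (shortenViaApi() is a hypothetical stand-in for whatever shortener API call is actually used):

$cache = []; // long URL => short URL, persisted however the app prefers

function shorten(string $longUrl, array &$cache): string
{
    if (!isset($cache[$longUrl])) {
        // only hit the shortener API on a cache miss; hypothetical helper
        $cache[$longUrl] = shortenViaApi($longUrl);
    }

    return $cache[$longUrl];
}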

Process request and return response in chunks

I'm making a search aggregator and I've been wondering how I could improve the performance of the search.
Since I'm getting results from different websites, I currently have to wait for the results from each provider, but this happens one after another, so the whole request takes a while to respond.
The easiest solution would be to just make a request from the client for each provider, but this would end up with a ton of requests per search (though if this is the proper way, I'll just do it).
What I've been wondering is whether there's a way to return results every time a provider responds, so if we have providers A, B and C and B has already returned results, send those back to the client right away. For this to work, all the searches would need to run in parallel, of course.
Do you know a way of doing this?
I'm trying to build a search experience similar to Skyscanner, which loads results and then visibly keeps fetching more records, sorting them on the fly (on the client side, as far as I can see).
Caching is the key here. Best practice with an external API (or scraping) is to be as little of a 'taker' as possible. So in your Laravel setup, get your results, but cache them for as long as makes sense for your app. Although in a Skyscanner-like situation the odds are low that two users will make the exact same request, the odds are much higher that a single user will make the same request multiple times, or may share the link, etc.
https://laravel.com/docs/8.x/cache
// store a value in the cache for 10 minutes
cache(['key' => 'value'], now()->addMinutes(10));
// retrieve it later
$value = cache('key');
To actually scrape the content, you could use this:
https://github.com/softonic/laravel-intelligent-scraper
Or, to use an API, which is the nicer route:
https://docs.guzzlephp.org/en/stable/
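As a rough sketch of how Guzzle and the cache could fit together (the provider URL, query parameter and cache key below are made up for illustration):

use GuzzleHttp\Client;
use Illuminate\Support\Facades\Cache;

// Cache each provider's results for 10 minutes so repeated or shared
// searches don't hit the external provider again.
function searchProvider(string $query): array
{
    return Cache::remember('search:providerA:' . $query, now()->addMinutes(10), function () use ($query) {
        $client = new Client();
        $response = $client->get('https://provider-a.example.com/search', [
            'query' => ['q' => $query],
        ]);

        return json_decode((string) $response->getBody(), true);
    });
}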
On the client side, you could just make a few calls to your own service in separate requests, and that would give you the asynchronous feel you're looking for.

REST API for main page - one JSON or many?

I'm providing a RESTful API to my (JS) client from a (Java Spring) server.
The main site page contains a number of logical blocks (news, latest comments, some trending stuff), each of which has a corresponding entity on the server. Which is the right way to go: handle one request like
/api/main_page/ ->
{
news: {...}
comments: {...}
...
}
or let the client do a few requests like
/api/news/
/api/comments/
...
I know that in general it's better to have one large request/response, but does that advice apply to this situation as well?
Ideally, you should have separate API calls for fetching the individual, configurable content blocks of the page.
This way your content blocks are loosely coupled to each other: you can extend them, port them (to a new framework) and modify them independently at any time you want.
This becomes extremely useful as the application grows. Switching off a feature is fairly easy in this case, A/B testing is easy, and writing automation is also easy, which overall helps reduce the testing effort.
But if you really want to fetch everything in one call, then you should add an additional parameter to the request; when the server sees that parameter, it adds the extra, independent JSON blocks to the response by calling its own methods from the business-logic layer.
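For example (the parameter and block names are just an illustration), the client could ask for specific blocks in one request:
/api/main_page?include=news,comments ->
{
news: {...},
comments: {...}
}
and the server only assembles the blocks that were asked for.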
And if speed is your concern, then try caching these calls on the server for some time (how long depends on the type of application).
I think multiple requests can generally be justified when the requested resources reflect parts of the system state (my personal rule of thumb, still a work in progress).
That is, if a news item gets displayed a lot in your client application, I would request it once and reuse it wherever I can. If you aggregate here, you would have to request it again later, some items may never actually get displayed, and you have some magic to do if the representation of a news item differs between the aggregation and the /news/{id} resource.
This approach increases communication when the page is loaded for the first time, but decreases communication throughout your client application the longer it runs.
The state on the server gets copied to your client request by request, or updated when needed (ETags, Last-Modified, etc.).
In your example it looks like /news and /comments are some sort of 'latest' or 'since last visit' collection, not everything.
If this is true, I would design them as resources as well, like /comments/latest or similar.
But in any case I would have them contain only self-links to /news/{id} or /comments/{id} respectively. Then a request to /comments/latest results in a list of self-links, and I would only start a request for one of them if I don't already have that item (maybe to check whether the cached copy is still up to date).
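As a rough illustration (the exact shape is hypothetical), /comments/latest could then return nothing but links:
{
comments: [
{ self: "/comments/4711" },
{ self: "/comments/4712" }
]
}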
It is also possible to trigger the request to /news/{id} only when the item actually gets displayed (scrolling, swiping).
Probably the lifespan of a news item or a comment is a criterion for answering this question: caching in the client is not that vital to the system here, as opposed to, say, a book in a book-store app.

Incremental updates using browser cache

The client (an AngularJS application) gets rather big lists from the server. The lists may have hundreds or thousands of elements, which can mean a few megabytes uncompressed (and some users (admins) get much more data).
I'm not planning to let the client get partial results as sorting and filtering should not bother the server.
Compression works fine (factor of about 10) and as the lists don't change often, 304 NOT MODIFIED helps a lot, too. But another important optimization is missing:
As a typical change to the lists is rather small (e.g., modifying two elements and adding a new one), transferring only the changes sounds like a good idea. I wonder how to do it properly.
Something like GET /offer/123/items should always return all the items in the offer number 123, right? Compression and 304 can be used here, but no incremental update. A request like GET /offer/123/items?since=1495765733 sounds like the way to go, but then browser caching does not get used:
either nothing has changed and the answer is empty (and caching it makes no sense)
or something has changed, the client updates its state and never asks for changes since 1495765733 again (and caching it makes even less sense)
Obviously, when using the "since" query, nothing will be cached for the "resource" (the original query gets used just once or not at all).
So I can't rely on the browser cache and I can only use localStorage or sessionStorage, which have a few downsides:
it's limited to a few megabytes (the browser HTTP cache may be much bigger and gets handled automatically)
I have to implement some replacement strategy when I hit the limit
the browser cache stores already-compressed data which I can't access (I'd have to re-compress the data myself)
it doesn't work for the users (admins) getting bigger lists, as even a single list may already be over the limit
it gets emptied on logout (a customer's requirement)
Given that there's HTML5 and HTTP/2, that's pretty unsatisfactory. What am I missing?
Is it possible to use the browser HTTP cache together with incremental updates?
I think there is one thing you are missing: in short, headers. What I'm thinking you could do, and what would match most of your requirements, is the following:
The first GET /offer/123/items is done normally, nothing special.
Subsequent GET /offer/123/items requests are sent with a Fetched-At: 1495765733 header, telling your server when the initial request was made.
From this point on, two scenarios are possible.
Either there is no change, and you can send a 304.
If there is a change, however, return the new items since the timestamp previously sent in the header, but set Cache-Control: no-cache on your response.
This gets you to the point where you have incremental updates, with caching of the initial megabytes-sized payload.
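A rough sketch of that server-side branching, written in PHP/Laravel style purely for illustration (your actual backend will differ, and the item-lookup helpers below are hypothetical):

// inside the controller action for GET /offer/{offerId}/items
$fetchedAt = $request->header('Fetched-At');

if ($fetchedAt === null) {
    // Initial request: return the full list, cacheable as usual.
    return response()->json(allItems($offerId));
}

if (!hasChangesSince($offerId, (int) $fetchedAt)) {
    // Nothing new: the browser keeps using its cached copy.
    return response('', 304);
}

// Incremental update: only the changed items, and never cache it.
return response()->json(itemsChangedSince($offerId, (int) $fetchedAt))
    ->header('Cache-Control', 'no-cache');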
There is still one drawback though: the caching is only done once; it won't cache the updates. You said that your lists are not updated often, so it might already work for you, but if you really want to push this further, I can think of one more thing.
Upon receiving an incremental update, you could trigger in the background another request without the Fetched-At header that won't be used at all by your application, but will just be there to refresh your HTTP cache. It should not be as bad as it sounds performance-wise, since your framework won't update its data with the new response (and potentially trigger re-renders); the only notable drawback would be in terms of network and memory consumption. On mobile it might be problematic, but it doesn't sound like an app intended to be displayed on mobile anyway.
I absolutely don't know your use case and will just throw this out there, but are you really sure that some sort of pagination won't work? Megabytes of data sounds like a lot to display and process for normal humans ;)
I would ditch the request/response cycle entirely and move to a push model.
Specifically, WebSockets.
This is the standard technology used on financial trading websites serving tables of real-time ticker data. Here is one such production application demonstrating the power of WebSockets:
https://www.poloniex.com/exchange#btc_eth
WebSocket applications have two types of state: global and user. The above link will show three tables of global data. When you're logged in, two additional tables of user data are displayed at the bottom.
This is not HTTP; you won't be able to just slap this into a Java Servlet. You'll need to run a separate process on your server which communicates over TCP. The good news is, there are mature solutions readily available. A Java-based solution with a very decent free licensing option, which includes both client and server APIs (and does integrate with Angular2) is Lightstreamer. They have a well-organized demo page too. There are also adapters available to integrate with your data sources.
You may be hesitant to ditch your existing servlet approach, but this will mean fewer headaches in the long run, and it scales marvelously. HTTP polling, even with well-designed header-only requests, does not scale well with large lists that update frequently.
---------- EDIT ----------
Since the list updates are infrequent, WebSockets are probably overkill. Based on the further details provided by comments on this answer, I would recommend a DOM-based, AJAX-updated sorter and filterer such as DataTables, which has some built-in options for caching. In order to reuse client data across sessions, the Ajax requests in the previous link should be modified to save the table's current data to localStorage after every request, and when the client starts a new session, the table should be populated with this data. This will allow the plugin to manage the filtering, sorting, caching and browser-based persistence.
I'm thinking about something similar to Aperçu's idea, but using two requests. The idea is still incomplete, so bear with me...
The client asks for GET /offer/123/items, possibly with the ETag and Fetched-At headers.
The server answers with
200 and a full list if either header is missing, or when there are too many changes since the Fetched-At timestamp
304 if nothing has changed since then
otherwise, 304 and a special Fetch-More header telling the client that more data is to be fetched
The last case violates how HTTP should work, but AFAIK it's the only way to let the browser cache everything I want it to cache. Since the whole communication is encrypted, proxies can't punish me for violating the spec.
The client reacts to Fetch-More by requesting GET /offer/123/items/errata. This way, the resource gets split into two requests. The split is ugly, but an Angular $http interceptor can hide the ugliness from the application.
The second request is cacheable, too, and there can also be a Fetched-At header. The details are unclear, but some strong handwavium makes me believe it can work. Actually, the errata could itself be inaccurate but still useful, and get an errata of its own... etc.
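To make the intended flow concrete (the header names follow the sketch above; the exact semantics are still handwavium):

GET /offer/123/items          -> 304 Not Modified, Fetch-More: 1
GET /offer/123/items/errata   -> 200 OK (cacheable), only the changed items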
With HTTP/1.1, more requests may mean more latency, but having a couple of them should still be profitable because of the saved bandwidth. The server can decide when to stop.
With HTTP/2, multiple requests could be sent at once. The server could be made to handle them efficiently, as it knows that they belong together. Some more handwavium...
I find the idea strange, but interesting and I'm looking forward to comments. Feel free to downvote me, but please leave an explanation.

Oracle CRM On Demand, Client Side Extensions

After lurking for a long time, I finally have a question I can't google my way out of.
I am working on an Oracle CRM On Demand implementation, using client-side extensions to enhance some functionality for our clients.
At the moment the most pressing issue is the fact that when I am pinging an object, let's say an Activity, I get some but not all of its attributes.
For example, '/OnDemand/....../Activity/1234-56789A' ends up returning the Id of the Activity and the two URLs. There is so much more content, though, that I need to get. I could spend the next day going through every single attribute and tacking it onto the end of a '?fields=SomeThing,OrAnother' string, but that seems like massive overkill.
Is there anyone who knows how to do this?
All my thanks,
Riye.

How do I get around the Twitter API caching problem?

I'm building a Twitter app that needs to check user data somewhat frequently, but I'm facing trouble with a cache that's oddly on Twitter's side, not mine.
Try the following user:
users/show in XML: http://twitter.com/users/show.xml?screen_name=technolocus
users/show in JSON: http://twitter.com/users/show.json?screen_name=technolocus
normal page: http://twitter.com/technolocus
All these methods of accessing data should return the same values, right? Check the statuses_count for each of them.
XML: 12548
JSON: 12513
normal: 12498
The normal method (i.e. just visiting the profile non-programmatically) serves up the most correct value of 12498. If I post or delete tweets from this account, the profile page gets updated instantly, but the XML and JSON methods still return cached data.
At this point, the values of the XML and JSON methods are 12 to 18 hours old respectively.
I first tried to access these methods from my website (hosted on Dreamhost). I thought it was Dreamhost caching the responses. Then I tried to access the API directly from my browser. After that I did a cURL from the command line on my machine. It wasn't Dreamhost. I thought it was probably my ISP (I think they use NetApp or something like that). Then I asked a friend in another corner of India to try it. He's getting the exact same cached responses as I am.
So it isn't Dreamhost's cache; it isn't my ISP or my country's cache. There's only one conclusion - Twitter is caching responses.
How in the heavens do I get around this?!?
Forgot to mention this: The script on the server is in PHP and is using cURL to retrieve the XML and JSON data from Twitter, while the local tests have been just using the browser. Both have the exact same result!
First, I think you should report this as a bug to Twitter. I see the same discrepancy as you, and no matter what, that seems like a bug. Even if they're caching, I'd expect a cache on their side to store an abstract form that would then be rendered into HTML, JSON, and XML. I wonder if what's actually going on is that these requests are performing similar but different queries.
Are you sure that the values are "old"? For example, did you actually delete about 50 updates recently (since you say the HTML one is newest but shows a lower count than the other two)? If you create another update do you see the HTML number increment while the other numbers stay the same, or do they all increment simultaneously?
If what you are saying is accurate, and it probably is, then generally you can't get around it. Twitter will want to cache its responses, since they are costly to reproduce every single time.
When you use Twitter's APIs, you end up being bound by its conventions, even if that includes caching.
Your best bet is to tweet to #twitterapi and get them to give you a response as to why the two representations are divergent.
Add ?blah=xxxx to all URLs.
I don't develop anything against Twitter and occasionally manually "follow" three tweets by going to them in my browser. They always lag behind by half a day. I add ?asdsadsadsad to the URL (something different every time) and it always updates. I don't know what Twitter is doing here and came here while searching for the problem. But I guess this trick of appending a random value to the URL via GET will probably work for your API requests, too.
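A minimal sketch of that trick with PHP and cURL (the cache-busting parameter name is arbitrary, and of course Twitter may still serve stale data):

// Append a throwaway random parameter so any intermediate cache
// treats the URL as brand new on every request.
$url = 'http://twitter.com/users/show.json?screen_name=technolocus'
     . '&nocache=' . uniqid();

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$user = json_decode($json, true);
echo $user['statuses_count'];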
