We have a "big data" service that receives requests in JSON format and returns results in JSON as well. Requests and especially responses sometime can be quite huge - up to 1GB in size ... We are logging all requests and responses and now I'm building a simple user web interface to search and show all requests we've processed.
My problem is that I have JSON which can be up to 40 levels deep and contain a lot of arrays. How can I give the user the ability to drill down and explore the content?
For what it's worth, the users have the latest stable version of Chrome and 64GB of RAM.
The Apollo Server documentation states that batching and caching should not be used together with REST API data sources:
Most REST APIs don't support batching. When they do, using a batched
endpoint can jeopardize caching. When you fetch data in a batch
request, the response you receive is for the exact combination of
resources you're requesting. Unless you request that same combination
again, future requests for the same resource won't be served from
cache.
We recommend that you restrict batching to requests that can't be
cached. In these cases, you can take advantage of DataLoader as a
private implementation detail inside your RESTDataSource [...]
Source: https://www.apollographql.com/docs/apollo-server/data/data-sources/#using-with-dataloader
I'm not sure why they say: "Unless you request that same combination again, future requests for the same resource won't be served from cache.".
Why shouldn't future requests be loaded from the cache again? I mean, here we have two caching layers: the DataLoader, which batches requests and remembers - with a per-request cache - which objects were requested, and returns the same object from its cache if it is requested multiple times within one request.
And we have a second-level cache that caches individual objects across multiple requests (or at least it could be implemented so that it caches the individual objects, not the whole result set).
Wouldn't that ensure that future requests are served from the second-layer cache even when the whole request changes, as long as it includes some of the objects that were requested in a previous request?
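To make the two layers I have in mind concrete, here is a minimal sketch, with DataLoader as the per-request cache and a plain in-memory Map standing in for the second-level cache (the Map, the fetchObject helper and the /api/objects endpoint are made up for illustration, not taken from the Apollo docs):

const DataLoader = require('dataloader');

// 2nd layer: cross-request cache for individual objects.
// A plain Map here; in practice this could be Redis or an HTTP cache.
const objectCache = new Map();

// Hypothetical single-object fetcher.
async function fetchObject(id) {
  if (objectCache.has(id)) return objectCache.get(id);      // 2nd-layer hit
  const res = await fetch(`https://example.com/api/objects/${id}`);
  const obj = await res.json();
  objectCache.set(id, obj);                                 // fill 2nd layer
  return obj;
}

// 1st layer: a fresh DataLoader per GraphQL request (per-request cache + batching).
function createLoaders() {
  return {
    objectLoader: new DataLoader(ids => Promise.all(ids.map(fetchObject))),
  };
}

// In a resolver: context.loaders.objectLoader.load('42')
// The DataLoader dedupes within one request; objectCache spans requests.

With something like this, even if the next GraphQL request asks for a different combination of objects, any object already in objectCache would come back without hitting the REST API - which is exactly the part of the quoted warning I don't understand.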
Many REST APIs implement some sort of request caching for GET requests based on URLs. When you request an entity from a REST endpoint a second time, the result can be returned faster.
For example, let's imagine a fictional API, "Weekend City Trip".
Your GraphQL API fetches the three largest cities around you and then checks the weather in these cities on the weekend. In this fictional example you receive two requests. The first request is from someone in Germany. You find the three largest cities around them: Cologne, Hamburg and Amsterdam. You can now call the weather API either in a batch or one by one.
/api/weather/Cologne
/api/weather/Hamburg
/api/weather/Amsterdam
or
/api/weather/Cologne,Hamburg,Amsterdam
The next person is in Belgium and we find Cologne, Amsterdam and Brussels.
/api/weather/Cologne
/api/weather/Amsterdam
/api/weather/Brussels
or
/api/weather/Cologne,Amsterdam,Brussels
Now, as you can see, without batching we have requested some URLs twice. The API provider can use a CDN to return these results quickly and not strain their application infrastructure. And since you are probably not the only one using the API, all of these URLs might already be cached in the first place, meaning you will receive responses much faster. The number of possible batch URLs, on the other hand, grows massively with the number of cities offered. If the API offers only 1000 cities, there are 166167000 possible combinations of three cities that could be requested. Therefore, the chance that someone else has already requested your exact combination of three cities is rather low.
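To make the combinatorics concrete, the 166167000 figure is simply "1000 choose 3"; here is a quick sanity check, together with the per-city variant that keeps every URL individually cacheable (the URLs are the fictional ones from above):

// Number of distinct 3-city batch URLs for 1000 cities: C(1000, 3)
function combinations(n, k) {
  let result = 1;
  for (let i = 0; i < k; i++) result = result * (n - i) / (i + 1);
  return result;
}
console.log(combinations(1000, 3)); // 166167000

// Versus only 1000 distinct single-city URLs, each of which a CDN can cache:
async function weekendWeather(cities) {
  return Promise.all(
    cities.map(city => fetch(`/api/weather/${city}`).then(res => res.json()))
  );
}
// weekendWeather(['Cologne', 'Hamburg', 'Amsterdam'])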
Conclusion
The caching really happens on the API provider's side, but it can greatly benefit your response times as a consumer. Often, GraphQL is used as an API gateway to your own REST services; if you don't cache those services, batching can be worth it.
I'm trying to use data from Google Analytics for an existing website to load test a new website. In our busiest month, we had 8361 page requests in one hour. Should I get a list of the URLs for these page requests and feed them to JMeter? Would that be a sensible approach? I'm hoping to compare page response times against the existing website.
If you need to do this very quickly - say you have less than an hour for scripting - then that is a reasonable way to check that there are no major differences between the two instances.
If you would like to go deeper:
8361 requests per hour is only about 2.3 requests per second, so it doesn't make much sense to replicate this load pattern exactly; I'm more than sure your application will easily survive such a low load.
Performance testing is not only about hitting URLs from a list and measuring response times; normally, the main questions that need to be answered are:
how many concurrent users the application can support while providing acceptable response times (at this point you may also be interested in requests per second)
what happens when the load exceeds that threshold, what types of errors start occurring and what their impact is
whether the application recovers when the load returns to normal
what the bottleneck is (e.g. lack of RAM, slow DB queries, low network bandwidth on the server/router, whatever)
So the options are:
If you need a "quick and dirty" solution, you can use the list of URLs from Google Analytics with e.g. the CSV Data Set Config or the Access Log Sampler, or parse your application logs to replay production traffic with JMeter (see the sketch after these options).
A better approach would be to check Google Analytics to identify which groups of users you have and their behavioural patterns, e.g. X% of unauthenticated users browse the site, Y% of authenticated users search, Z% of users go through checkout, etc. After that, you need to properly simulate all these groups using separate JMeter Thread Groups, keeping cookies, headers, cache and think times in mind. Once you have this form of test, gradually and proportionally increase the number of virtual users and monitor how response times grow with the number of virtual users until you hit any form of bottleneck.
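For the "quick and dirty" option, here is a small Node.js sketch of turning an access log into a one-URL-per-line file for the CSV Data Set Config (the log path and the combined log format are assumptions):

const fs = require('fs');

// Assumed: Apache/Nginx "combined" log format, e.g.
// 1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET /some/page HTTP/1.1" 200 ...
const lines = fs.readFileSync('/var/log/nginx/access.log', 'utf8').split('\n');

const urls = lines
  .map(line => line.match(/"GET ([^ ]+) HTTP/))   // keep GET requests only
  .filter(Boolean)
  .map(match => match[1]);

// One URL per line; point JMeter's CSV Data Set Config at this file.
fs.writeFileSync('urls.csv', urls.join('\n'));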
The "sensible approach" would be to know the profile, the pattern of your load.
For that, it's excellent that you already have this data.
Yes, you can feed it in as is, but that would be the quick & dirty approach; getting the data analysed, distilling patterns out of it and applying them to your test plan seems smarter.
Does the Google Analytics API throttle requests?
We have a batch script that I have just moved from v2 to v3 of the API and the requests go through quite well for the first bit (50 queries or so) and then they start taking 4s or so each. Is this Google throttling us?
While Matthew is correct, I have another possibility for you. The Google Analytics API caches your requests to some extent. Let me try and explain.
I have a customer / site that I request data from. While testing I noticed some strange things.
The first million rows of results would come back within an acceptable amount of time.
After a million rows, things started to slow down; we were seeing results take around five times as long - instead of 5 minutes, we were waiting 20 minutes or more for the results to return.
Example:
Request URL:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:34896748&dimensions=ga:date,ga:sourceMedium,ga:country,ga:networkDomain,ga:pagePath,ga:exitPagePath,ga:landingPagePath&metrics=ga:entrances,ga:pageviews,ga:exits,ga:bounces,ga:timeOnPage,ga:uniquePageviews&filters=ga:userType%3D%3DReturning+Visitor;ga:deviceCategory%3D%3Ddesktop&start-date=2014-05-12&end-date=2014-05-22&start-index=236001&max-results=2000&oauth_token={OauthToken}
Request Time (seconds:milliseconds): 0:484
Request URL:
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:34896748&dimensions=ga:date,ga:sourceMedium,ga:country,ga:networkDomain,ga:pagePath,ga:exitPagePath,ga:landingPagePath&metrics=ga:entrances,ga:pageviews,ga:exits,ga:bounces,ga:timeOnPage,ga:uniquePageviews&filters=ga:userType%3D%3DReturning+Visitor;ga:deviceCategory%3D%3Ddesktop&start-date=2014-05-12&end-date=2014-05-22&start-index=238001&max-results=2000&oauth_token={OauthToken}
Request Time (seconds:milliseconds): 7:968
I did a lot of testing, stopping and starting my application, and I couldn't figure out why the data came back so fast in the beginning and so slowly later.
Now, I have some contacts on the Google Analytics development team, the guys in charge of the API. So I made a nice test app, logged some results showing my issue and sent it off to them with the question: are you throttling me?
They were also perplexed, and told me there is no throttle on the API, only the flood protection limit that Matthew speaks of. My developer contact forwarded it to the guys in charge of the traffic.
Fast forward a few weeks. It seems that when we make a request for a bunch of data, Google caches the data for us. It's saved on the server in case we request it again. By restarting my application I was accessing the cached data, so it returned quickly. When I let the application run longer, I would eventually reach non-cached data and it would take longer for the requests to return.
I asked how long data is cached for; the answer was that there is no set time. So I don't think you are being throttled. I think your initial speedy requests hit cached data and your slower requests hit non-cached data.
Email back from Google:
Hi Linda,
I talked to the engineers and they had a look. The response was
basically that they thinks it's because of caching. The response is
below. If you could do some additional queries to confirm the behavior
it might be helpful. However, what they need to determine is if it's
because you are querying and hitting cached results (because you've
already asked for that data). Anyway, take a look at the comments
below and let me know if you have additional questions or results that
you can share.
Summary from talking to engineer: "Items not already in our cache will
exhibit a slower retrieval processing time than items already present
in the cache. The first query loads the response into our cache and
typical query times without using the cache is about 7 seconds and
with using the cache is a few milliseconds. We can also confirm that
you are not hitting any rate limits on our end, as far as we can tell.
To confirm if this is indeed what's happening in your case, you might
want to rerun verified slow queries a second time to see if the next
query speeds up considerably (this could be what you're seeing when
you say you paste the request URL into a browser and results return
instantly)."
-- IMBA Google Analytics API Developer --
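If you want to confirm the same behaviour yourself, a rough way to do it is to time a verified slow query twice in a row, as the engineers suggest (the view ID and access token below are placeholders):

// Time the same Core Reporting query twice; a large drop on the second run
// points at server-side caching rather than throttling.
const url = 'https://www.googleapis.com/analytics/v3/data/ga'
  + '?ids=ga:XXXXXXXX'                       // placeholder view ID
  + '&dimensions=ga:pagePath&metrics=ga:pageviews'
  + '&start-date=2014-05-12&end-date=2014-05-22&max-results=2000';

async function timedRequest(accessToken) {
  const start = Date.now();
  await fetch(url, { headers: { Authorization: 'Bearer ' + accessToken } });
  return Date.now() - start;
}

// const first = await timedRequest(token);   // e.g. several seconds, uncached
// const second = await timedRequest(token);  // e.g. milliseconds, cached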
Google's Analytics API does have a rate limit per their docs: https://developers.google.com/analytics/devguides/reporting/core/v3/coreErrors
However, it should not cause delayed requests; rather, the request should be returned with a response of 403 userRateLimitExceeded.
Description of that error:
Indicates that the user rate limit has been exceeded. The maximum rate limit is 10 qps per IP address. The default value set in Google Developers Console is 1 qps per IP address. You can increase this limit in the Google Developers Console to a maximum of 10 qps.
Google's recommended course of action:
Retry using exponential back-off. You need to slow down the rate at which you are sending the requests.
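A minimal sketch of that exponential back-off, where sendReport stands in for whatever function makes your Analytics API call:

// Retry with exponential back-off when the API answers 403 userRateLimitExceeded.
async function withBackoff(sendReport, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await sendReport();
    if (response.status !== 403) return response;
    // Wait 2^attempt seconds plus some random jitter before retrying.
    const delayMs = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }
  throw new Error('Rate limit still exceeded after retries');
}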
I'm working on a little search engine, where I'm trying to find out how to cache query results.
These results are simple JSON text, retrieved using an ajax request.
Storing results in memory is not an option, so I can see two options remaining:
Use a NoSQL database to retrieve cached results.
Store results on a CDN and redirect the HTTP request (307 Temporary Redirect) if the result is already cached.
However, I don't have much experience with CDNs, and I wonder whether using one for a huge number of temporary small text files is good practice.
Is it a good practice to use redirection on an ajax request?
Is a CDN an appropriate solution to cache small text files?
Short answer: no.
Long: usually, you use a CDN for large static files that you want the CDN to mirror all around the world so they are close to a user when she requests them. When you have data that changes a lot, it will always take a while to propagate the changes to all nodes of the CDN, and in the meantime users get inconsistent results (this may or may not matter to you).
Also, to avoid higher latency I wouldn't use an HTTP redirect (where you tell the client to make a second request to somewhere else) but rather figure out whether to get the data from the cache or the engine on your end (e.g. using a caching proxy or a load balancer) and then serve it directly to the client.
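Here is a sketch of the "figure it out on your end" idea, assuming a Node/Express front end and an in-memory Map standing in for whatever cache store you actually pick (runSearchEngine is a hypothetical call to your engine):

const express = require('express');
const app = express();

const cache = new Map();   // stand-in for Redis, Memcached, a caching proxy, ...

app.get('/search', async (req, res) => {
  const key = req.query.q;
  if (cache.has(key)) {
    return res.json(cache.get(key));            // serve cached JSON directly
  }
  const results = await runSearchEngine(key);   // hypothetical engine call
  cache.set(key, results);
  res.json(results);                            // one round trip, no redirect
});

app.listen(3000);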
I have a web page which, upon loading, needs to do a lot of JSON fetches from the server to populate various things dynamically. In particular, it updates parts of a large-ish data structure from which I derive a graphical representation of the data.
It works great in Chrome; however, Safari and Firefox appear to suffer somewhat. While the numerous JSON requests are being processed, the browsers become sluggish and unusable. I am under the assumption that this is due to the rather expensive iteration of said data structure. Is this a valid assumption?
How can I mitigate this without changing the query language so that it's a single fetch?
I was thinking of applying a queue that could limit the number of concurrent Ajax queries (and hence also limit the number of concurrent updates to the data structure)... Any thoughts? Useful pointers? Other suggestions?
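For reference, the kind of queue I have in mind would look roughly like this (the limit of 4 is an arbitrary number to tune per browser):

// Run at most `limit` fetches at a time; the rest wait in a queue.
function createLimiter(limit) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= limit || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    task().then(resolve, reject).finally(() => { active--; next(); });
  };

  return task => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}

// const limited = createLimiter(4);
// limited(() => fetch('/api/part1').then(r => r.json())).then(updateStructure);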
In browser-side JS, create a wrapper around jQuery.post() (or whichever method you are using) that appends the requests to a queue.
Also create a function 'queue_send' that will actually call jQuery.post(), passing the entire queue structure.
On the server, create a proxy function called 'queue_receive' that replays the JSON to your server interfaces as though it came from the browser, collects the results into a single response, and sends it back to the browser.
Browser-side queue_send_success() (success handler for queue_send) must decode this response and populate your data structure.
With this, you should be able to reduce your initialization traffic to one actual request, and maybe consolidate some other requests on your website as well.
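A rough sketch of that idea (queuedPost and the /queue_receive route are names invented for this answer, not an existing library):

// Browser side: wrap jQuery.post() so requests are queued instead of sent.
var queue = [];

function queuedPost(url, data) {
  return new Promise(function (resolve) {
    queue.push({ url: url, data: data, resolve: resolve });
  });
}

// Send the whole queue as one request to the server-side proxy.
function queue_send() {
  var batch = queue.splice(0);
  jQuery.post('/queue_receive', {
    requests: JSON.stringify(batch.map(function (item) {
      return { url: item.url, data: item.data };
    }))
  }, function (responses) { queue_send_success(responses, batch); }, 'json');
}

// Success handler: the server returns one result per queued request, in order.
function queue_send_success(responses, batch) {
  responses.forEach(function (response, i) {
    batch[i].resolve(response);   // hand each result back to its caller
  });
}

// On the server, /queue_receive loops over `requests`, replays each one
// against the existing JSON interfaces, and returns the array of results.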
In particular, it updates parts of a largish data structure from which I derive a graphical representation of the data.
I'd try:
Queuing responses as they come in, then updating the structure once (see the sketch below)
Keeping the representation hidden until all the responses are in
Magicianeer's answer is also good - but I'm not sure if it fits your definition of "without changing the query language so that it's a single fetch" - it would avoid re-engineering existing logic.
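A small sketch of the first two points (the endpoint list, the helper functions and the #graph element are placeholders):

var endpoints = ['/api/part1', '/api/part2', '/api/part3'];   // placeholders

// Collect all responses first, then touch the data structure and the DOM once.
Promise.all(endpoints.map(function (url) {
  return fetch(url).then(function (res) { return res.json(); });
})).then(function (parts) {
  parts.forEach(applyToDataStructure);   // hypothetical merge step
  renderGraph();                         // hypothetical: rebuild representation once
  document.getElementById('graph').style.visibility = 'visible';
});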