Best strategy for Foursquare venue/search caching

We are looking to use Foursquare as the location database for our application. Their API states that an application can make up to 5,000 userless requests per hour to venues/* endpoints. To help reduce the number of requests, they recommend using caching to avoid making repetitive calls to the Foursquare API when different users are requesting the same information.
For our application, we are looking to use the venues/search endpoint to get check-in data around a location. What is the best way to go about caching this data so that we make the fewest possible calls to the Foursquare API?
The current idea we have is to cache km-by-km "boxes" that represent an area on the earth. When a user requests nearby venues, we would make a call to Foursquare from the center point of the box they are currently in, and cache the results for that box. Now when another user comes along, if they too are in that box, we can return the cached results for that box that are closest to the user. If a user is close to the edge of a box, we would return the nearby results from the box they are currently in, as well as the nearby results from the adjacent box.
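As a rough illustration of what we mean (the grid-key math, names, and cache shape below are just our sketch, not anything from the Foursquare API):

```typescript
// Rough 1 km x 1 km grid: one degree of latitude is ~111 km, so we bucket
// lat/lng into 1/111-degree cells (coarse; ignores longitude shrinking away from the equator).
const CELLS_PER_DEGREE = 111;

interface Venue {
  id: string;
  name: string;
  lat: number;
  lng: number;
}

// Placeholder for the real venues/search call (would hit the Foursquare venues/search endpoint).
async function searchVenues(lat: number, lng: number): Promise<Venue[]> {
  return []; // TODO: call the Foursquare API here
}

// Grid-cell key -> venues cached for that box.
const boxCache = new Map<string, Venue[]>();

async function venuesForUser(lat: number, lng: number): Promise<Venue[]> {
  const row = Math.floor(lat * CELLS_PER_DEGREE);
  const col = Math.floor(lng * CELLS_PER_DEGREE);
  const key = `${row}:${col}`;

  let venues = boxCache.get(key);
  if (!venues) {
    // First user in this box: query Foursquare from the box's center point and cache the result.
    const centerLat = (row + 0.5) / CELLS_PER_DEGREE;
    const centerLng = (col + 0.5) / CELLS_PER_DEGREE;
    venues = await searchVenues(centerLat, centerLng);
    boxCache.set(key, venues);
  }

  // Return the cached venues ordered by rough closeness to this particular user.
  return [...venues].sort(
    (a, b) =>
      (a.lat - lat) ** 2 + (a.lng - lng) ** 2 - ((b.lat - lat) ** 2 + (b.lng - lng) ** 2),
  );
}
```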
Is this a good way to go about things to limit the requests? We fear this technique may use way too much memory. How do you go about it in your applications? Any insights would be great, thanks!

This sounds like a good strategy for caching venue searches. However, just to be super clear on Foursquare's policies: the statement that "Server-side caching of venue details is generally required for apps requesting an increase" applies to venue details. We don't make caching of search results an explicit requirement before granting rate limit increases, only caching of calls to venue details.

Related

What is "sf_max_daily_api_calls"?

Does anyone know what the "sf_max_daily_api_calls" parameter in Heroku mappings does? I do not want to assume it is a daily limit for write operations per object, and I cannot find an explanation.
I tried to open a ticket with Heroku, but in their support ticket form the "Which application?" drop-down is required, and none of the support categories offer anything to choose from there; the only option is "Please choose..."
I tried to find any reference to this field and can't - I can only see it used in Heroku's Quick Start guide, but without an explanation. I have a very busy object I'm working on, read/write, and want to understand any limitations I need to account for.
Salesforce orgs have a rolling 24-hour limit on API calls. The limit is generally very generous in test orgs (sandboxes), 5M calls, because you can make stupid mistakes there. In production it's lower. A bit counterintuitive, but it protects their resources and forces you to write optimised code/integrations...
You can see your limit in Setup -> Company Information. There's a formula in the documentation; roughly speaking, you gain more of that limit with every user license you purchase (more for "real" internal users, less for community users), same as with data storage limits.
Also, every API call is supposed to return the current usage (in a special tag for the SOAP API, in a header for the REST API), so I'm not sure why you'd have to hardcode anything...
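For example, a rough sketch of reading that REST header (the API version below is illustrative):

```typescript
// Every Salesforce REST API response carries a Sforce-Limit-Info header,
// e.g. "api-usage=1234/100000" (calls used / rolling 24h limit).
async function checkApiUsage(instanceUrl: string, accessToken: string): Promise<void> {
  // Any REST call works; /limits is a cheap one. The API version is illustrative.
  const res = await fetch(`${instanceUrl}/services/data/v58.0/limits`, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });

  const limitInfo = res.headers.get("Sforce-Limit-Info"); // "api-usage=used/max"
  const match = limitInfo?.match(/api-usage=(\d+)\/(\d+)/);
  if (!match) return;

  const used = Number(match[1]);
  const max = Number(match[2]);
  console.log(`API calls used: ${used} of ${max}`);
  if (used / max > 0.3) {
    console.warn("Past 30% of the rolling 24h API limit");
  }
}
```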
If you write your operations right, the limit can be very generous. I have no idea how Heroku Connect works internally. Ideally you'd spot some mention of "Bulk API 2.0" in its documentation, or at least try to find out whether it works synchronously or asynchronously.
A normal, old-school synchronous update via the SOAP API lets you process 200 records at a time at the cost of 1 API call. The REST Bulk API accepts CSV/JSON/XML of up to 10K records and processes them asynchronously; you poll for an "is it done yet" result... So starting the job, uploading the files, committing the job and then checking only, say, once a minute can easily be 4 API calls, and you can process millions of records before hitting the limit.
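Roughly, a Bulk API 2.0 ingest flow looks like the sketch below (the API version, object, and operation are illustrative; check the docs for your org):

```typescript
// Sketch of a Bulk API 2.0 ingest job: ~4 API calls no matter how many records the CSV holds.
async function bulkUpdate(instanceUrl: string, token: string, csv: string): Promise<void> {
  const jsonHeaders = { Authorization: `Bearer ${token}`, "Content-Type": "application/json" };
  const base = `${instanceUrl}/services/data/v58.0/jobs/ingest`; // version is illustrative

  // 1. Create the job (object and operation are examples).
  const job = await (
    await fetch(base, {
      method: "POST",
      headers: jsonHeaders,
      body: JSON.stringify({ object: "Account", operation: "update", contentType: "CSV" }),
    })
  ).json();

  // 2. Upload the CSV data in a single call.
  await fetch(`${base}/${job.id}/batches`, {
    method: "PUT",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "text/csv" },
    body: csv,
  });

  // 3. Mark the upload complete so Salesforce processes it asynchronously.
  await fetch(`${base}/${job.id}`, {
    method: "PATCH",
    headers: jsonHeaders,
    body: JSON.stringify({ state: "UploadComplete" }),
  });

  // 4. Poll about once a minute; each poll costs one API call.
  let state = "UploadComplete";
  while (state !== "JobComplete" && state !== "Failed" && state !== "Aborted") {
    await new Promise((resolve) => setTimeout(resolve, 60_000));
    const info = await (await fetch(`${base}/${job.id}`, { headers: jsonHeaders })).json();
    state = info.state;
  }
}
```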
When all else fails, you've exhausted your options, can't optimise it any more, can't purchase more user licenses... I think they sell "packets" of additional API call limit; contact your account representative. But there are lots of things you can try before that, not least setting up a warning when you hit, say, a 30% threshold.

How can I track AJAX performance using Google Analytics?

My web application uses many AJAX requests, so it is categorized as a Single Page Application.
What I want is to track the technical performance of these AJAX calls using Google Analytics.
The GA documentation suggests implementing virtual pageview tracking, as detailed in this link:
https://developers.google.com/analytics/devguides/collection/analyticsjs/single-page-applications
After implementing virtual pageview tracking, the pageview stats and page URIs seem to feed into GA correctly, but timing stats such as Avg. Page Load Time (sec) do not; none of them have any value!
I tried three scenarios to implement virtual page tracking, but none of them are working.
Am I missing something? Or is it a GA limitation, so we cannot collect timing stats for virtual pageviews the way we can for real pageviews?
Are there any other tool suggestions for tracking AJAX performance?
GA is not meant to be used to track page performance, and the "value" field in GA implies monetary value.
When it says "tracking pageviews", it's not about measuring performance, it's about tracking user activity: how many pages per session, which pages, what led to conversions, where users have trouble getting through, and so forth. It is not a technical tool but an analytics/marketing tool.
Technically, you could still use it to track page performance, and people do. But not the way you've done it: you have to remove any network influence from your timestamps, since the normal fluctuation there would swamp the timings you actually care about.
I think the most elegant way of doing it would be to create a custom metric in the GA interface and then populate it with performance-measuring events (or pageviews). So:
You take a new Date() timestamp (or whatever you use in jQuery to get the current timestamp) right before the POST request
You get another new Date() in the POST callback
You calculate the difference in milliseconds and send that as the value of the custom metric with the pageview
You wait for two days for the new data to get processed and build a custom report using your custom metric.
Now when you improve performance of your endpoint, you will be able to see statistical improvements in that report.
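A minimal sketch of those steps with analytics.js (it assumes a custom metric already created as a hit-scoped integer named metric1 in the GA admin; the endpoint path and virtual page path are placeholders):

```typescript
declare const ga: (...args: unknown[]) => void; // analytics.js global loaded by the GA snippet

// Wrap an AJAX call, time it, and attach the duration to a virtual pageview
// as custom metric "metric1".
async function trackedSearch(query: string): Promise<unknown> {
  const start = Date.now(); // timestamp right before the request

  const response = await fetch("/api/search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const data = await response.json();

  const elapsedMs = Date.now() - start; // difference in milliseconds

  // Virtual pageview with the timing attached as a custom metric.
  ga("send", "pageview", "/virtual/search", { metric1: elapsedMs });

  return data;
}
```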
This is usually done on the backend though, with Datadog or a similar tool that has endpoint monitoring functionality.
When performance is measured on the front end, we usually use the native Performance API, i.e. the window.performance object, or whatever your front-end rendering library suggests for that. Here's a bit more on this: https://developer.mozilla.org/en-US/docs/Web/API/performance_property That way you're taking a bit more data into account, not just one endpoint's response time.
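For instance, a minimal sketch using the User Timing and Resource Timing parts of that API (the URL handling is a placeholder):

```typescript
// Time one AJAX request with the User Timing API, then pull richer
// network detail from Resource Timing.
async function measuredSearch(url: string): Promise<void> {
  performance.mark("search-start");
  const res = await fetch(url);
  await res.json();
  performance.mark("search-end");

  performance.measure("search", "search-start", "search-end");
  const [measure] = performance.getEntriesByName("search", "measure");
  console.log(`search took ${measure.duration.toFixed(0)} ms end to end`);

  // Resource Timing entry for the same request: DNS, TCP, TTFB, etc.
  const resource = performance
    .getEntriesByType("resource")
    .find((e) => e.name.endsWith(url)) as PerformanceResourceTiming | undefined;
  if (resource) {
    console.log(
      `time to first byte: ${(resource.responseStart - resource.requestStart).toFixed(0)} ms`,
    );
  }
}
```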

Attribute Based Access Control (ABAC) in a microservices architecture for lists of resources

I am investigating options to build a system to provide "Entity Access Control" across a microservices based architecture to restrict access to certain data based on the requesting user. A full Role Based Access Control (RBAC) system has already been implemented to restrict certain actions (based on API endpoints), however nothing has been implemented to restrict those actions against one data entity over another. Hence a desire for an Attribute Based Access Control (ABAC) system.
Given the requirement for the system to be fit for purpose, and my own priority of following best practice by keeping security logic in a single location, I decided to create an externalised "Entity Access Control" (EAC) API.
The end result of my design was something similar to the following image I have seen floating around (I think from axiomatics.com)
The problem is that the whole thing falls over the moment you start talking about an API that responds with a list of results.
E.g. a /api/customers endpoint on a Customers API that takes parameters such as a query filter, sort, order, and limit/offset values to facilitate pagination, and returns a list of customers to a front end. How do you then also apply ABAC to each of these entities in a microservices landscape?
Terrible solutions to the above problem tested so far:
Get the first page of results, send all of those to the EAC API, get the responses, drop the ones that are rejected, get more customers from the DB, check those... and repeat until you either fill a page of results or run out of customers in the DB (roughly the loop sketched after this list). Testing showed that with 14,000 records (which is absolutely within reason in my situation) it would take 30 seconds to get an API response for someone who had zero permission to view any customers.
On every request to the all-customers endpoint, a request would be sent to the EAC API for every customer available to the original requesting user. Testing showed that with 14,000 records the response payload would be over half a megabyte for someone who had permission to view all customers. I could split it into multiple requests, but then you are just trading payload size for request spam, and the performance penalty doesn't go anywhere.
Give up on the ability to view multiple records in a list. This totally breaks the API's usefulness for customer needs.
Store all the data and logic required to perform the ABAC controls in each API. This is fraught with danger and basically guaranteed to fail in a way that is beyond my risk appetite considering the domain I am working within.
Note: I tested with 14,000 records just because it's a benchmark of our current state of data. It is entirely feasible that a single API could serve 100,000 or 1M records, so anything that involves iterating over the whole data set, or transferring the whole data set over the wire, is entirely unsustainable.
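To make option 1 above concrete, here is roughly the filter-and-refill loop we tested (the function names and service shapes are illustrative, not our actual code):

```typescript
interface Customer {
  id: string;
  name: string;
}

// Keep paging through the DB, asking the external EAC API which ids the user
// may see, until a page is filled or the data runs out.
// fetchPage and eacAllowedIds are stand-ins for the real DB and EAC calls.
async function permittedCustomerPage(
  userId: string,
  pageSize: number,
  fetchPage: (offset: number, limit: number) => Promise<Customer[]>,
  eacAllowedIds: (userId: string, ids: string[]) => Promise<Set<string>>,
): Promise<Customer[]> {
  const result: Customer[] = [];
  let offset = 0;

  while (result.length < pageSize) {
    const batch = await fetchPage(offset, pageSize);
    if (batch.length === 0) break; // ran out of customers in the DB
    offset += batch.length;

    // One round trip to the EAC API per DB page; this is where the 30 s worst case comes from.
    const allowed = await eacAllowedIds(userId, batch.map((c) => c.id));
    result.push(...batch.filter((c) => allowed.has(c.id)));
  }

  return result.slice(0, pageSize);
}
```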
So, here lies the question... How do you implement an externalised ABAC system in a microservices architecture (as per the diagram) whilst also being able to service requests that respond with multiple entities, with a query filter, sort, order, and limit/offset values to facilitate pagination?
After dozens of hours of research, it was decided that this is an entirely unsolvable problem and is simply a side effect of microservices (and more importantly, segregated entity storage).
If you want the benefits of a maintainable (as in single piece of externalised infrastructure) entity level attribute access control system, a monolithic approach to entity storage is required. You cannot simultaneously reap the benefits of microservices.

JMeter and page views

I'm trying to use data from Google Analytics for an existing website to load test a new website. In our busiest month, over an hour we had 8,361 page requests. So should I get a list of all the URLs for these page requests and feed them to JMeter; would that be a sensible approach? I'm hoping to compare the page response times against the existing website.
If you need to do this very quickly, say you have less than an hour for scripting, then you can do it this way to check that there are no major differences between the two instances.
If you would like to go deeper:
8,361 requests per hour is roughly 2.3 requests per second, so it doesn't make any sense to replicate this load pattern as I'm more than sure that your application will survive such an enormous load.
Performance testing is not only about hitting URLs from list and measuring response times, normally the main questions which need to be answered are:
how many concurrent users my application can support providing acceptable response times (at this point you may be also interested in requests/second)
what happens when the load exceeds the threshold, what types of errors start occurring and what is the impact.
does the application recover when the load gets back to normal
what is the bottleneck (i.e. lack of RAM, slow DB queries, low network bandwidth on server/router, whatever)
So the options are:
If you need a "quick and dirty" solution, you can use the list of URLs from Google Analytics with, for example, the CSV Data Set Config or the Access Log Sampler, or parse your application logs to replay production traffic with JMeter.
A better approach would be checking Google Analytics to identify which groups of users you have and their behavioural patterns, e.g. X% of unauthenticated users are browsing the site, Y% of authenticated users are searching, Z% of users are doing checkout, etc. After that you need to properly simulate all of these groups using separate JMeter Thread Groups, keeping in mind cookies, headers, cache, think times, etc. Once you have this form of test, gradually and proportionally increase the number of virtual users and monitor how response times grow with the number of virtual users until you hit some form of bottleneck.
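As a back-of-the-envelope illustration of the sizing arithmetic (the response time and think time below are assumed values, not measurements):

```typescript
// Back-of-the-envelope sizing from the Google Analytics figure above.
const requestsPerHour = 8361;
const requestsPerSecond = requestsPerHour / 3600; // ~2.3 req/s

// Little's Law: concurrent users ~ arrival rate * time each user occupies the system.
// Response time and think time here are assumptions for illustration only.
const responseTimeSec = 2;
const thinkTimeSec = 10;
const virtualUsers = Math.ceil(requestsPerSecond * (responseTimeSec + thinkTimeSec));

console.log(`${requestsPerSecond.toFixed(1)} requests/second -> about ${virtualUsers} JMeter threads`);
```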
The "sensible approach" would be to know the profile, the pattern of your load.
For that, it's excellent that you already have this data.
Yes, you can feed it in as is, but that would be the quick & dirty approach; getting the data analysed, distilling patterns out of it, and applying them to your test plan seems smarter.

The ways to make a scraper bot look more like a human

Due to a limitation of the API of a website I use for searching for some products, I have to do HTML scraping of its products page. There's no other way because it offers only a free API with this limitation. I just need 10 or 100 times more items than its API returns; even if I call it 5 times, it returns the same set of products as if it were 1 call.
I don't need to scrape many pages in a short period of time. Normally a scraper bot would grab all that data in a few minutes; for me, a few hours is acceptable, so my scraper can be more like a human.
The question is: what are the ways to make my scraper look like a normal user?
First, make fewer calls in a short period of time.
Use a headless browser, maybe?
Use vpn? or proxy? or both?
What are other pointers?
Note: in my case scraping is the only way to achieve what I want because the API doesn't work for my needs. So there's no question of whether I should use the API or scraping; I simply can only use scraping.
You are basically heading in the right direction.
Yet I suspect that you don't really have the API mastered (or it's a weird one) if calling it 5 times returns the same set of products as a single call. An API should let users access all the available data (with a frequency limit, though).
The items you've asked about:
Make fewer calls in a short period of time. - Kind of true, yet you should still be clear about what request frequency is acceptable for a given site (so you're neither detected nor bandwidth-throttled).
Use a headless browser. - Yes. Abandon cookies, be anonymous.
Use a VPN? Or a proxy? - Proxy, yes; use an appropriate proxy service that gives you enough flexibility to avoid detection. A VPN does not help, since its network nodes (where you scrape from) are limited in number and have, basically, static IPs.
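A rough sketch pulling those points together with Puppeteer (the proxy address, user agent, and delay range are placeholders, not recommendations):

```typescript
import puppeteer from "puppeteer";

// Spread the product-page visits over hours with randomized pauses, through a
// proxy, from a headless browser.
const randomDelayMs = (): number => 60_000 + Math.random() * 120_000; // 1-3 minutes between pages

async function scrape(urls: string[]): Promise<void> {
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--proxy-server=http://rotating-proxy.example.com:8000"], // placeholder proxy
  });
  const page = await browser.newPage();
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  );

  for (const url of urls) {
    await page.goto(url, { waitUntil: "networkidle2" });
    const html = await page.content(); // hand this off to your product parser
    console.log(`${url}: fetched ${html.length} characters of HTML`);
    await new Promise((resolve) => setTimeout(resolve, randomDelayMs()));
  }

  await browser.close();
}
```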
I think this post might be of help to you.
