Handling the Parse 64KB limit

We are using Parse to send out push notifications via the Parse REST API. We compute the audience of each push notification from dynamic user data, such as the user's current location. In our production system we observe that, depending on the time of day, this audience can be quite large. During such times we have seen the ParseException:
org.parse4j.ParseException: Neither where clause nor data may exceed 64KB
This is because the where clause specifies a large number of installation ids or device tokens when we find many users in a given location.
I understand that Channels / Parse Audience is a way to deal with larger sets of users. But this requires me to store dynamic data, such as a user's current location, in the Parse database as part of the Installation metadata.
My questions are:
Is storing the user's location in Parse the right way to implement this, given that it would mean updating each user's Installation object very frequently?
Is it advisable to just send the push notifications via Parse in chunks, i.e. first to a set of 2000 users, then the next 1000, and so on? (A sketch of this follows below.)
Is there any other way to handle such a case?
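For option 2 (sending in chunks), here is a minimal sketch of what the batching could look like against the Parse REST push endpoint, using Python's requests library for illustration. The endpoint URL, credentials, batch size, and the installationId field in the where clause are assumptions you would adapt to your own deployment, not a tested implementation:

```python
import requests

# Placeholders; adjust for your Parse / parse-server deployment.
PARSE_PUSH_URL = "https://api.parse.com/1/push"
HEADERS = {
    "X-Parse-Application-Id": "YOUR_APP_ID",
    "X-Parse-REST-API-Key": "YOUR_REST_KEY",
    "Content-Type": "application/json",
}
BATCH_SIZE = 1000  # sized so the serialized where clause stays well under 64KB

def push_in_chunks(installation_ids, message):
    """Send the same alert to a large audience in batches of installation ids."""
    for i in range(0, len(installation_ids), BATCH_SIZE):
        batch = installation_ids[i:i + BATCH_SIZE]
        payload = {
            "where": {"installationId": {"$in": batch}},  # field name is illustrative
            "data": {"alert": message},
        }
        resp = requests.post(PARSE_PUSH_URL, json=payload, headers=HEADERS)
        resp.raise_for_status()
```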

Related

Solutions for Recording Google Protocol Buffer Messages

My project is using protobufs for our data types. We need to be able to record our data so it can be played back later. Our use case is to recreate the event or to reprocess the same data but with new algorithms and check for improvements.
As data flows through our system it is all protobufs. These are easily serialized to a byte array, which could be recorded to files or perhaps stored as blobs in a database. Playback would simply mean reading the byte array back, converting it to a protobuf, and sending it off into our software again.
Are there any existing technologies used for recording protobufs?
Even though the initial use case is very simple, eventually the solution will get more complex. It will probably need to:
Farm out recording to multiple hosts to keep up with the input data rate
Allow querying to find out how much data exists during a specific time period
Play back only those data records where some field has a specific value
Save the data for long term storage, e.g. never delete a record but instead move it to a tape backup
I think the above is best accomplished using a database that stores some subset of metadata along with the protobuf byte array itself. Before I go reinventing the wheel, I would like opinions on anything that already exists that might do this job.
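Since the question is library-agnostic, here is a minimal sketch of that "metadata plus blob" idea in Python, using SQLite as the store purely for illustration. The MyMessage class and the column set are assumptions; any database with a BLOB type and a timestamp column would work the same way:

```python
import sqlite3
import time

# from my_protos_pb2 import MyMessage  # hypothetical generated protobuf class

def open_store(path="recordings.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS records (
            id INTEGER PRIMARY KEY,
            recorded_at REAL,      -- unix timestamp, enables time-range queries
            message_type TEXT,     -- lets playback pick the right parser
            payload BLOB           -- the serialized protobuf bytes
        )""")
    return conn

def record(conn, msg):
    """Store one protobuf message alongside minimal metadata."""
    conn.execute(
        "INSERT INTO records (recorded_at, message_type, payload) VALUES (?, ?, ?)",
        (time.time(), type(msg).__name__, msg.SerializeToString()),
    )
    conn.commit()

def replay(conn, start_ts, end_ts, parser):
    """Yield deserialized messages recorded in [start_ts, end_ts], oldest first."""
    rows = conn.execute(
        "SELECT payload FROM records WHERE recorded_at BETWEEN ? AND ? ORDER BY recorded_at",
        (start_ts, end_ts),
    )
    for (payload,) in rows:
        yield parser(payload)  # e.g. MyMessage.FromString
```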

Transform data before adding to cache system or after when reading it

I have a situation and I'm not sure which approach fits it better.
I have a solution where users can search for soccer players. I only store their names and teams, but when a user comes to my website and clicks on a player, I show detailed information about that player, which I get from various external providers (usually chosen by country).
I know which external provider to use when a call is made, and I pay the external providers each time I fetch data. To mitigate this, I try to fetch as few times as possible: I fetch once when a user clicks on the player info, and the next time, if it is in my database cache, I show the cached info. After 10 days I fetch the data for that specific player again from the external provider, as I want the info to stay reasonably up to date.
I need to transform the data from the different providers, which usually comes as JSON, into my own structure so I can handle it the right way. I have my own object structure, so the fields coming from the external providers always map onto the same naming and structure in my code.
So my problem is deciding when I should map/transform the data coming from the providers. The options are:
Option 1: I fetch the data from the provider, transform it into my JSON structure, and store it in the database cache once, already in my main structure. Then, every time a user clicks on a soccer player's details, all my code has to do is read this JSON field from the database cache and convert it directly into an object I know how to use.
Option 2: I fetch the data from the provider and keep it as-is in my database cache. Then, every time someone clicks to see a soccer player's detailed info, I read the JSON record from my database cache, transform it into my naming and structure, and convert it into an object.
Notes:
- this is a cache database; records won't be kept forever. If during a call I see that a record is more than 10 days old, I will get new data from the appropriate external provider
Deciding at which layer to cache data is an art form all its own. The higher the layer at which you cache data, the more performant it will be (less reprocessing needed), but the lower the re-use potential will be (different parts of the application may use the same cache, and they find more value in it if it hasn't been transformed too much).
Yours is another case of this. If you store the data as the provider delivers it and you later need to change the way you transform it, you won't have to pay to re-retrieve it. If, on the other hand, you store it in the form you need right now, you may have to discard it all if you decide to change the transformation.
Like all architectural design decisions, it's all about trade-offs. You have to decide what is more important to you and your application.
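To make the trade-off concrete, here is a small sketch of Option 2 (store raw, transform on read) with the 10-day freshness check from the question. The db.get/db.put cache helpers, the provider object, and the field names are all hypothetical:

```python
import json
import time

TEN_DAYS = 10 * 24 * 3600  # cache freshness window, in seconds

def get_player_details(db, player_id, provider):
    """Return player details in our own structure, refreshing the cache if stale."""
    row = db.get(player_id)  # hypothetical cache lookup: {"raw": ..., "fetched_at": ...} or None
    if row is None or time.time() - row["fetched_at"] > TEN_DAYS:
        raw = provider.fetch(player_id)                      # the paid external call
        db.put(player_id, {"raw": raw, "fetched_at": time.time()})
    else:
        raw = row["raw"]
    return transform(raw, provider.name)                     # provider-specific mapping on read

def transform(raw_json, provider_name):
    """Map a provider payload onto our canonical field names."""
    data = json.loads(raw_json)
    if provider_name == "provider_a":                        # illustrative provider
        return {"name": data["fullName"], "team": data["club"]}
    return {"name": data.get("name"), "team": data.get("team")}
```

With Option 1 the transform call would move to the write path instead, and a later change to transform() would force you to re-fetch (and re-pay for) everything still in the cache.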

Simulating server-side group and sort in Azure table storage

I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using insert-or-update to maintain a table that keeps a pointer to the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read the excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from the Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch most recent lines for Resource A and B, here's what your entity structure would look like:
PartitionKey: Date/Time (in ticks), reversed, i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reversed ticks are required so that the most recent entries show up at the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
In short, your data storage approach should be primarily governed by how you want to fetch the data. It may even mean that you have to save the same data multiple times.
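The question doesn't name a language, so here is a sketch of that reversed-ticks entity shape using the Python azure-data-tables package for illustration; the table name, field names, and the assumption of timezone-aware UTC datetimes are mine, but the same key scheme applies in any SDK:

```python
from datetime import datetime, timezone
from azure.data.tables import TableClient

MAX_TICKS = 3155378975999999999  # DateTime.MaxValue.Ticks in .NET

def to_ticks(dt: datetime) -> int:
    """Convert a UTC datetime to .NET-style ticks (100-nanosecond units since year 1)."""
    epoch = datetime(1, 1, 1, tzinfo=timezone.utc)
    return int((dt - epoch).total_seconds() * 10_000_000)

def record_access(table: TableClient, resource: str, user: str, accessed_at: datetime) -> None:
    """Insert one access row keyed so that the newest rows sort first."""
    reversed_ticks = MAX_TICKS - to_ticks(accessed_at)
    table.upsert_entity({
        "PartitionKey": f"{reversed_ticks:019d}",  # zero-padded so lexical order == numeric order
        "RowKey": resource,
        "AccessDate": accessed_at,
        "User": user,
    })

# table = TableClient.from_connection_string(conn_str, table_name="RecentAccess")
# record_access(table, "ResourceA", "UserA", datetime.now(timezone.utc))
```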
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
Now, to find the most accessed resources in a time span, you simply query this table. Assuming there is a limited number of resources (say in the hundreds), you can get this information in as little as one request. Since you're dealing with a small amount of data, you can simply download it on the client side and order it any way you see fit. However, to see the access details for a particular resource, you would have to fetch the detailed data (1000 entities at a time).
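Here is a sketch of the queue worker's update logic described above, again using the Python azure-data-tables package for illustration. The day-level partition, field names, and how the message reaches handle_access_message (a queue poller would call it) are assumptions:

```python
from datetime import datetime
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient, UpdateMode

def handle_access_message(counts: TableClient, resource: str, accessed_at: datetime) -> None:
    """Apply one 'resource accessed' message to the per-day counts table."""
    partition = accessed_at.strftime("%Y%m%d")  # day-level precision, per reporting needs
    try:
        entity = counts.get_entity(partition_key=partition, row_key=resource)
        entity["AccessCount"] += 1                       # always bump the count
        if entity["LastAccessDateTime"] < accessed_at:   # only move the timestamp forward
            entity["LastAccessDateTime"] = accessed_at
        counts.update_entity(entity, mode=UpdateMode.REPLACE)
    except ResourceNotFoundError:
        counts.create_entity({
            "PartitionKey": partition,
            "RowKey": resource,
            "AccessCount": 1,
            "LastAccessDateTime": accessed_at,
        })
```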
Part of your brain might still be unconsciously trapped in relational-table design paradigms; I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions are. Because Azure table storage doesn't (yet?) offer aggregate functions I can't simply go .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to azure.
I'll then pass the result of that sum into Azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by incrementing the value of RunningTotal each time I get new transactions.
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. In effect, I'm using table storage as a long-term cache rather than a database.
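A tiny sketch of that running-total idea, in the same illustrative Python azure-data-tables style as above; the "totals" partition, per-account row key, and storing the decimal as a string are all assumptions, and the transactions themselves would be written elsewhere:

```python
from decimal import Decimal
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient, UpdateMode

def add_transaction(totals: TableClient, account: str, amount: Decimal) -> None:
    """Maintain a precomputed RunningTotal so reads never have to pull every transaction."""
    try:
        entity = totals.get_entity(partition_key="totals", row_key=account)
        entity["RunningTotal"] = str(Decimal(entity["RunningTotal"]) + amount)  # string, as tables lack a decimal type
        totals.update_entity(entity, mode=UpdateMode.REPLACE)
    except ResourceNotFoundError:
        totals.create_entity({
            "PartitionKey": "totals",
            "RowKey": account,
            "RunningTotal": str(amount),
        })
```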

Scaling message queues with lots of API calls

I have an application where some of my users' actions must be retrieved via a 3rd-party API.
For example, let's say I have a user who can receive tons of phone calls. These call records should be updated often because my users want to see their call history, so I should do this "almost in real time". The way I currently do this is to retrieve, every 10 minutes, the list of all my logged-in users and, for each user, enqueue a task that retrieves the call records from the timestamp of the latest saved record up to the current timestamp and saves them all to my database.
This doesn't seem to scale well, because the more users I have, the more connected users I'll have and the more tasks I'll enqueue.
Is there any other approach to achieve this?
This seems straightforward with a background queue of jobs. It is unlikely that all users use the system at the same rate, so queue jobs based on each user's usage, with a fallback to a daily sync.
At some point you will likely need more workers taking jobs from the queue, and then multiple queues, so that if you had a thousand users the ones with a later queue slot are not always left waiting.
It also depends on how fast you need this updated and on the limits of the API calls.
There will be some sort of limit, so I suggest you start by committing to updates with a 4-hour or 1-hour delay, to always give yourself some headroom, and then work on improving this to a level you can sustain.
Make sure your users are seeing your own stored data from cached API results, not live API call data, in case the API goes away.
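A rough sketch of scheduling sync jobs by activity level rather than polling every user every 10 minutes; the intervals, the user attributes, and the queue.enqueue helper are all hypothetical:

```python
import time

# Hypothetical poll intervals by activity level, in seconds.
INTERVALS = {"active": 10 * 60, "occasional": 60 * 60, "dormant": 24 * 3600}

def schedule_syncs(users, queue, now=None):
    """Enqueue a call-history sync only for users whose polling interval has elapsed."""
    now = now or time.time()
    for user in users:
        interval = INTERVALS.get(user.activity_level, INTERVALS["dormant"])
        if now - user.last_synced_at >= interval:
            # The job itself would call the 3rd-party API from user.last_record_ts onward.
            queue.enqueue("sync_call_history", user.id, since=user.last_record_ts)
            user.last_synced_at = now
```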

Redis multiple requests

I am writing a very simple social networking app that uses Redis.
Each user has a sorted set that contains ids of items in their feed. If I want to display their feed, I do the following steps:
use ZREVRANGE to get ids of items in their feed
use HMGET to get the feed (each feed item is a string)
But now, I also want to know if the user has liked a feed item or not. So I have a set associated with each feed item that contains ids of user who have liked a feed item.
If I get 15 feed items, I now have to execute an additional 15 requests to Redis to find out, for each feed item, whether the current user has liked it or not (by checking if the user's id exists in each feed item's set).
So that will take 15+1 requests.
Is this type of querying considered 'normal' when using Redis? Are there better ways I can structure the data to avoid this many requests?
I am using redis-rb gem.
You can easily refactor your code to collapse the 15 requests into one by using pipelining (which redis-rb supports).
You get the ids from the sorted set with the first request, and then use them to fetch the many keys you need based on those results (using the pipeline).
With this approach you should have 2 requests in total instead of 16, and keep your code quite simple.
As an alternative, you can use a Lua script and fetch everything in one request.
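The question uses redis-rb, whose pipelining works the same way; here is the shape of the two-round-trip version sketched with Python's redis-py for illustration. The key names feed:<user>, items, and likers:<item> are assumptions based on the description:

```python
import redis

r = redis.Redis()

def load_feed(user_id, count=15):
    """Fetch the newest feed items plus a 'liked by me?' flag in two round trips."""
    item_ids = r.zrevrange(f"feed:{user_id}", 0, count - 1)   # request 1: ids from the sorted set
    if not item_ids:
        return []

    pipe = r.pipeline()
    pipe.hmget("items", item_ids)                             # all item bodies in one command
    for item_id in item_ids:
        pipe.sismember(f"likers:{item_id.decode()}", user_id)  # one membership check per item
    results = pipe.execute()                                  # request 2: everything queued above

    bodies, liked_flags = results[0], results[1:]
    return list(zip(item_ids, bodies, liked_flags))
```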
With this kind of database (a non-relational database), you have to make a trade-off between making multiple requests and including some data redundancy.
You should analyze each case separately and consider some aspects, like:
How frequently will this data be accessed?
How much space will this redundancy consume?
How many requests will I have to make in order to get all the data without redundancy?
Is performance an issue?
In your case, I would suggest keeping a set/hash, or just JSON-encoded data, for each user with a history of all recent user interactions, such as comments, likes, etc. Every time the user accesses the feed, you just have to read the feed and the history; only two requests.
One thing to keep in mind: on every user interaction, you must update all the redundant data as well.
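A sketch of that denormalized alternative, keeping a per-user set of liked item ids alongside the per-item likers set, so the feed needs only one extra read; key names are assumptions, and the write path must update both sets on every like:

```python
def like_item(r, user_id, item_id):
    """Record a like in both directions so reads stay cheap."""
    pipe = r.pipeline()
    pipe.sadd(f"likers:{item_id}", user_id)       # who liked this item
    pipe.sadd(f"liked_by:{user_id}", item_id)     # redundant copy keyed by user
    pipe.execute()

def feed_with_like_flags(r, user_id, count=15):
    """Feed ids plus 'liked by me?' flags: one extra request instead of one per item."""
    item_ids = r.zrevrange(f"feed:{user_id}", 0, count - 1)
    liked = r.smembers(f"liked_by:{user_id}")
    return [(item_id, item_id in liked) for item_id in item_ids]
```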
