I'm trying to create a feed to display the combined results of 3 queries to the Twitter API, since there seems to be no way to get what I want with one API call (2 user timelines and 1 search result for a hashtag). I want the results to be sorted by date so that the most recent appears at the start (just like when I get a result back from the Twitter API).
How can I combine these 3 JSONs (from the Twitter API), whilst maintaining the date order?
Thanks
there are several ways of doing this. i would probably persist all the stuff in a database.
that comes with several benefits:
first of all, caching is super easy, you can fetch the data from your own database until it gets stale.
secondly, databases are reaaaaaaaaaaaaaaaaly good at sorting and all this date shizzle. it's hard doing this manually and usually slow. i tend to mess it up all the time, so i let the database do the job for me.
Related
I'm building an internal server which contains a database of customer events. The webpage which allows access to the events is going to utilize an infinite scroll/dynamic loading scheme for display of live events as well as for browsing the results of queries to the database. So, you might query the database and maybe get 200k results. The webpage would display the 'first' 50 and allow you to scroll and scroll and scroll to see more and more results (loading perhaps 50 more at time).
I'm supposed to be using a REST api for the database access (a C# server). I'm unsure what the API should be so it remains RESTful. I've come up with 3 options. The question is, are any of them RESTful and which is most RESTful(is there such a thing -- if not I'll pick one of the RESTful).
Option 1.
GET /events?query=asdfasdf&first=1&last=50
This simply does the query and specifies the range of results to return. The server, unable to keep state, would have to requery the database each time (though perhaps utilizing the first/last hints to stop early) the infinite scroll occurs. Seems bad and there isn't any feedback about how many results are forthcoming.
Option 2 :
GET /events/?query=asdfasdf
GET /events/details?id1=asdf&id2=qwer&id3=zxcv&id4=tyui&...&id50=vbnm
This option first does a query which then returns the list of event ids but no further details. The webpage simply has the list of all the ids(at least it knows the count). The webpage holds onto the event id list and as infinite scroll/dynamic load is needed, makes another query for the event details of the specified ids. Each id is would nominally be a guid, so about 36 characters per id (plus &id##= for 41 characters). At 50 queries per hit, the URL would be quite long, 2000+ characters. The URL limit mentioned elsewhere on SO is around 2k. Maybe if I limit it to 40 ids per query this would be fine. It'd be nice to simply have a comma separated list instead of all the query parameters. Can you make a query parameter like ?ids=qwer,asdf,zxcv,wert,sdfg,rtyu,gfhj, ... ,vbnm ?
Option 3 :
POST /events/?query=asdfasdf
GET /events/results/{id}?first=1&last=50
This would post the query to the server and cause it to create a results resource. The ID of the results resource would be returned and would then be used to get blocks of the query results which in turn contain the event details needed for the webpage. The return from the POST XML could contain the number of records and other useful information besides the ID. Either the webpage would have to later delete the resource when the query page closed or the server would have to clean them up once they expire (days or weeks later).
I am concerned at Option 1, while RESTful is horrible for the server. I'm not sure requesting so many simultaneous resources, like the second GET in Option 2 is really RESTful or practical(seems like there has to be a better way). I'm not sure Option 3 is RESTful at all or if it is, its sort of cheating the REST thing by creating state via a POST(or should that be PUT).
Option 3 worked out fine. It required the server to maintain the query results and there was a bit of debate about how many queries (from various users) should simultaneously be saved as there would be no way to know when a user was actually done with a query.
We are building an iOS app with Parse.com, but still can't figure out the right way to backup data efficiently.
As a premise, we have and will have a LOT of data store rows.
Say we have a class with 1million rows, assume we have it backed up, then want to bring it back to Parse, after a hazardous situation (like data loss on production).
The few solutions we have considered are the following:
1) Use external server for backup
BackUp:
- use the REST API to constantly back up data to a remote MySQL server (we chose MySQL for customized analytics purpose, since it's way faster and easier to handle data with MySQL for us)
ImportBack:
a) - recreate JSON objects from MySQL backup and use the REST API to send back to Parse.
Say we use the batch operation which permits 50 simultaneous objects to be created with 1 query, and assume it takes 1 sec for every query, 1million data sets will take 5.5hours to transfer to Parse.
b) - recreate one JSON file from MySQL backup and use the Dashboard to import data manually.
We just tried with 700,000 records file with this method: it took about 2 hours for the loading indicator to stop and show the number of rows in the left pane, but now it never opens in the right pane (it says "operation time out") and it's over 6hours since the upload started.
So we can't rely on 1.b, and 1.a seems to take too long to recover from a disaster (if we have 10 million records, it'll be like 55 hours = 2.2 days).
Now we are thinking about the following:
2) Constantly replicate data to another app
Create the following in Parse:
- Production App: A
- Replication App: B
So while A is in production, every single query will be duplicated to B (using background job constantly).
The downside is of course that it'll eat up the burst limit of A as it'll simply double the amount of query. So not ideal thinking of scaling up.
What we want is something like AWS RDS which gives an option to automatically backup daily.
I wonder how this could be difficult for Parse since it's based on AWS infra.
Please let me know if you have any idea on this, will be happy to share know-hows.
P.S.:
We’ve noticed an important flaw in the above 2) idea.
If we replicate using REST API, all the objectIds of all Classes will be changed, so every 1to1 or 1toMany relations will be broken.
So we think about putting a uuid for every object class.
Is there any problem about this method?
One thing we want to achieve is
query.include(“ObjectName”)
( or in Obj-C “includeKey”),
but I suppose that won’t be possible if we don’t base our app logic on objectId.
Looking for a work around for this issue;
but will uuid-based management be functional under Parse’s Datastore logic?
Parse has never lost production data. While we don't currently offer automated backups, you can request one any time you like, and we're working on making all of this even nicer. Additionally, it's easier in most cases to import the JSON export file through the data browser rather than using the REST batch.
I can confirm that today, Parse did lost my data. Or at least it appeared to be so.
After several errors where detected on multiple apps (agreed by Parse Status twitter account), we could not retrieve data for an app, without any error.
It was because an entire column of one of our class (type pointer) disappeared and data was not present anymore in the dashboard.
We are using this pointer column to filter / retrieve data, so the returned queries and collections were empty.
So we decided to recreate the column manually. By chance, recreating the column, with the same name and type, solved the issue and the data was still there... I can't explain it but I really thought, and the app reacted as if, data were lost.
So an automated backup and restore option is mandatory, it is not an option.
On December 2015 parse.com released a new dashboard with an improved export feature.
Just select your app, click on "App Settings" -> "General" -> "Export app data". Parse generates a json-file for every class in your app and sends an email to you, if the export-progress is done.
UPDATE:
Sad but true, parse.com is winding down: http://blog.parse.com/announcements/moving-on/
I had the same issue of backing up parse server data. As parse server is using mongodb that is why backing up data is not an issue I have just done a simple thing. downloaded the mongodb backup from the server. And then restored it using
mongorestore /path-to-mongodump (extracted files)
As parse has been turned to open source.Therefore we can adopt this technique.
For accidental deletes, writing a cloud function 'beforedelete' to backup the current row to another class would work.
For regular backups, manual export of changed records (use filter) will be useful. For recovery this requires you to write scripts / use import option (not so sure) in data browser. You could also write a cloud function replicate data on your backup server (haven't tried this yet).
However there are some limitations to cloud code that you should consider before venturing into it:
https://parse.com/docs/cloud_code_guide#functions-resource
I have an application who is doing a job aggregating data from different Social Network sites Back end processes done Java working great.
Its front end is developed Rails application deadline was 3 weeks for some analytics filter abd report task still few days left almost completed.
When i started implemented map reduce for different states work great over 100,000 record over my local machine work great.
Suddenly my colleague gave me current updated database which 2.7 millions record now my expectation was it would run great as i specify date range and filter before map_reduce execution. My believe was it would result set of that filter but its not a case.
Example
I have a query just show last 24 hour loaded record stats
result comes 0 record found but after 200 seconds with 2.7 million record before it comes in milliseconds..
CODE EXAMPLE BELOW
filter is hash of condition expected to check before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestion please.. what would be ideal solution in remaining time frame as database is growing exponentially in days.
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R can not make use of indexes. You have not shown your Map and Reduce functions so it's not possible to point out mistakes. Update the question with those functions, and what they are supposed to do and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
I am writing a very simple social networking app that uses Redis.
Each user has a sorted set that contains ids of items in their feed. If I want to display their feed, I do the following steps:
use ZREVRANGE to get ids of items in their feed
use HMGET to get the feed (each feed item is a string)
But now, I also want to know if the user has liked a feed item or not. So I have a set associated with each feed item that contains ids of user who have liked a feed item.
If I get 15 feed items, now I have to execute an additional 15 requests to Redis to find out, for each feed item if current user has commented on it or not (by checking if id exists in each set for each feed).
So that will take 15+1 requests.
Is this type of querying considered 'normal' when using Redis? Are there better ways I can structure the data to avoid this many requests?
I am using redis-rb gem.
You can easily refactor your code to collapse the 15 requests in one by using pipelines (which redis-rb supports).
You get the ids from the sorted sets with the first request and then you use them to get the many keys you need based on those results (using the pipeline)
With this approach you should have 2 requests in total instead of 16 and keep your code quite simple.
As an alternative you can use a lua script and fetch everything in one request.
This kind of database (Non-relational database), you have to make a trade-off between multiple requests and include some data redundancy.
You should analyze each case separately and consider some aspects, like:
How frequently this data will be accessed?
How much space this redundancy will consume?
How many requests I will have to do, in order to have all data, without redundancy?
Performance is an issue?
In your case, I would suggest to keep a Set/Hash or just a JSON encoded data for each user with a historical of all recent user interaction, such as comments, likes, etc. Every time the user access the feeds you just have to read the feeds and the historical; only two requests.
One thing to keep in mind, every user interaction, you must update all redundant data as well.
I've already experiences with MongoDB, CouchDB, Redis, Tokyo Cabinet, and other NoSQL Databases. Recently I stumbled upon Riak and it looks very interesting to me. To getting started with it, I decided to write a small Twitter clone, the "hello world" in the NoSQL World. To get a fully working clone, it's necessary to order the tweets chronologically. After reading the Riak docs I discovered that Map-Reduce is the right tool for this job. In my development-environment it works quite well, but how's the performance in production, with hundreds of parallel queries? Are there other, maybe faster, methods for sorting data, or is it possible to store data in an ordered form (like Cassandra)?
I think I've found another solution to this problem - a simple linked list. So one possible implementation could be, that every user gets his/her own "timeline bucket", where links to the tweets-data itself gets stored (tweets gets stored separately in the "tweets" bucket). As you would know, this timeline-bucket must contain a key named "first", which links to the latest timeline-object and is the starting point of the list. To insert a new tweet in the timeline, just insert a new item in the timeline bucket, set the "next"-link of this new item to the "first"-item, after that, make the new item to "first".
In short: Insert an item as you would do in a linked list...
As with Twitter, the personal timeline just holds 20 tweets shown to the user. To receive the last 20 tweets, there are only 2 queries necessary. To speed things up, the first query uses the link-walking ability of Riak to get the latest 20 objects, tagged by "next". Finally, the second, and last query uses the keys computed by the first query to receive the tweets itself (using map/reduce).
To remove the tweets of users you've just unfollowed, I would use the secondary index ability of Riak 1.0 to receive the related timeline-objects/tweets.
It is not possible to store data in an ordered form in Riak without resorting to re-writing portions of the Riak core. Data is stored, roughly, in bucket + key order. The actual order depends on the backend storage mechanism that you're using for Riak.
Riak 1.0 has some features that might help you, too. There's support for secondary indexes as well as improvements to Map Reduce operations - in particular, they perform much better in highly concurrent scenarios.
Alexander Siculars wrote an article about Pagination with Riak. It outlines the problem pretty well. Yammer also make extensive use of Riak and two of their engineers put together a presentation about Riak at Yammer. It doesn't go into a lot of implementation details, but you can learn a lot about how they have designed their solution.
Combining secondary index queries and Map Reduce makes it possible to solve your problem very easily.
As Jeremiah says it's not possible to store the data in sorted order, but you can still make it return sorted results by using secondary indexes and map/reduce. The problem, as described, is that you can't efficiently limit the query in a sorted way.
Here is an example using range query to list all keys and then sorting them using the built in functions in *riak_kv_mapreduce*::
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
riakc_pb_socket:mapred(Pid
, {index, colonel_riak:bucket(context), <<"$key">>, <<0>>, <<255>>}
, [{reduce, {modfun, riak_kv_mapreduce, reduce_sort}, none, true}])
You can use functions in the lists module in erlang or use the native javascript sort function. Order by can be achieved by lists:reverse/1 in erlang.