Streaming, Daemons, Cronjobs, how do you use them? (in Ruby) - ruby

I've finally had a second to look into streaming, daemons, and cron
tasks and all the neat gems built around them! But I'm not clear on
how/when to use these things.
I have a few questions:
1) If I wanted to have a website that stayed constantly updated, in real time, with my Facebook friends' activity feeds, up-to-the-minute Amazon book reviews on my favorite books, and my Twitter feed, would I just create some custom streaming implementation using the Daemon gem, the ruby-yali gem for streaming the content, and the Whenever gem, which could, say, check those sites every 3-10 seconds to see if the content I'm looking for has changed? Is that how it would work? Or is it typically/preferably done differently?
2) Is (1) too processor intensive? Is there a better way you do it, a better way for live content streaming, given that the website you want realtime updates on doesn't have a streaming api? I'm thinking about just sending a request every few seconds in a separate small ruby app (with daemons and cronjobs), getting the json/xml result, using nokogiri to remove the stuff I don't need, and then just going through the small list of comments/books/posts/etc., building a feed of what's changed, and using Juggernaut or something to push those changes to some rails app. Would that work?
I guess it all boils down to the question:
How does real-time streaming of the latest content of some website work? How do YOU do it?
...so if someone is on my site, they can see in real time the new message or new book that just came out?
Looking forward to your answers,
Lance

Well, first: if a website doesn't provide an API, that's a strong indication that it's not legal to parse and extract its data. Either way, you'd better check its terms of use and privacy policy.
Personally, I'm not aware of anything called a "Streaming API", but supposing the site has an API, you still need to pull the results it provides (XML, JSON, ...), parse them, and present them back to the user. The strategy will vary depending on your app type:
Desktop app: you can just pull the data directly, parse it, and present it to the user. Many apps work like that, Twhirl for example.
Web app: here you need to cut down the time spent extracting the data. Typically you pull the data from the API and store it. However, storing the data is a bit tricky: you don't want your database to become a bottleneck for the app because of the heavy volume of pull queries it will get when the data is retrieved. One way around this is a push approach: follow option 2 in this case to get the data and then push it to the user. If you want instant updates, as in chat for example, have a look at Orbited. If it's OK to save the data to some kind of user and followers' 'inboxes', then the simplest way, as far as I can tell, is to use IMAP to send the updates to the user's inbox.
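The "poll, diff, then push" idea from option 2 can be sketched roughly as follows. This is only a minimal illustration: the `new_items` helper, the `"id"` field, and the sample JSON are assumptions, not any particular site's API. The two `JSON.parse` calls stand in for two successive HTTP polls, and the final `puts` stands in for whatever push mechanism is used (Juggernaut, Orbited, ...). Note that cron schedules at one-minute granularity, so a 3-10 second interval would need a looping daemon rather than a cron entry.

```ruby
require "json"
require "set"

# Hypothetical helper: returns the items in `current` whose "id" has not
# been seen before, and records those ids in `seen_ids`.
def new_items(seen_ids, current)
  fresh = current.reject { |item| seen_ids.include?(item["id"]) }
  fresh.each { |item| seen_ids << item["id"] }
  fresh
end

seen = Set.new

# Stand-ins for the parsed responses of two successive polls.
first_poll  = JSON.parse('[{"id": 1, "text": "hello"}]')
second_poll = JSON.parse('[{"id": 1, "text": "hello"}, {"id": 2, "text": "new review"}]')

new_items(seen, first_poll)             # everything is new on the first poll
changes = new_items(seen, second_poll)  # only the item that appeared since

# Stand-in for the push step: only the changes would be sent to clients.
changes.each { |item| puts item["text"] }
```

Only the difference between polls is pushed, so connected browsers receive small updates instead of re-fetching the whole feed.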

Related

Golang Cache HTTP GET Results In Memory

I am working on a CLI in Go that scrapes a webpage to collect the href attributes of all the links on the page into a slice. I want to store this slice in memory for some time so that the scraper is not being called on every execution of the CLI command. Ideally, the scraper would only be called after the cache expires or the user provides some sort of --update flag.
I came across the library go-cache and other similar libraries, but from what I could tell they only work for something that is continuously running, like a server.
I thought about writing the links to a file, but then how would I expire the results after a specific duration? Would it make sense to create a small server in the background that shuts down after a while in order to use a library like go-cache? Any help is appreciated.
There are two main approaches in these scenarios:
Create a daemon, service or background application that acts as your data repository. You can run it as an HTTP server / RPC server depending on your requirements. Your CLI application then interacts with this daemon as required;
Implement a persistence mechanism that will allow data to be written and read across multiple CLI application executions. You may use normal text files, databases or even an implementation of golang's encoding/gob to write and read your slice (a map would probably be better) to and from a binary file.
You can timestamp entries and simply remove them after their ttl expires by explicitly deleting them, or by simply not rewriting them during subsequent executions, according to the strategy / approach selected above.
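As a rough sketch of the second approach (a timestamped file that is ignored once its TTL has passed), shown here in Ruby, the language of this page's main question; the same shape would carry over to Go with encoding/json or encoding/gob. The `FileCache` class, the `links` helper, the one-hour TTL, and the example URLs are all hypothetical, and the lambda stands in for the real scraper.

```ruby
require "json"
require "tmpdir"

# Hypothetical file-backed cache: stores a value together with the time it
# was fetched and treats it as missing once it is older than `ttl` seconds.
class FileCache
  def initialize(path, ttl)
    @path = path
    @ttl = ttl
  end

  def write(value, now = Time.now)
    File.write(@path, JSON.generate("fetched_at" => now.to_i, "value" => value))
  end

  # Returns the cached value, or nil if the file is absent or expired.
  def read(now = Time.now)
    return nil unless File.exist?(@path)
    entry = JSON.parse(File.read(@path))
    return nil if now.to_i - entry["fetched_at"] > @ttl
    entry["value"]
  end
end

# Fetches links through the cache; `force` plays the role of an --update flag.
def links(cache, force: false)
  cached = force ? nil : cache.read
  return cached if cached
  fresh = yield # the block stands in for the real scraper
  cache.write(fresh)
  fresh
end

path = File.join(Dir.mktmpdir, "links.json")
cache = FileCache.new(path, 3600)
calls = 0
scraper = -> { calls += 1; ["https://example.com/a", "https://example.com/b"] }

links(cache) { scraper.call }               # cache miss: scraper runs
links(cache) { scraper.call }               # cache hit: scraper is skipped
links(cache, force: true) { scraper.call }  # forced refresh: scraper runs again
puts calls
```

Because expiry is checked at read time against the stored timestamp, nothing has to keep running between CLI invocations.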
The range of possible examples for such an open-ended question is too large to cover in a single answer and will most likely require multiple specific questions.
Use a database and store as much detail as you can (fetched_at, host, path, title, meta_desc, anchors etc). You'll be able to query over the data later and it will be useful to have it in a structured format. If you don't want to deal with a db dependency you could embed something like boltdb (pure go) or sqlite (cgo).

Heroku or Amazon to host a backend Rails JSON API? Which should I use?

The answer to this no doubt lies in answering exactly what I need. The thing is... I don't really know.
The criterion for my choice will be price: whatever is cheapest, unless the two are closely similar and the ever so slightly more expensive one is a much better service.
I'm creating an iOS application, and have a Rails backend JSON API that serves my app.
I have a Post/Comment style app. I don't store any images, just text throughout various tables, etc. I shouldn't need much storage, since there are no images and I will be purging old data (old posts / comments that are no longer relevant are just deleted).
I need a scheduler that runs daily at most; anything more frequent (hourly, etc.) is not needed. So I need to run cron tasks daily.
My application does have a user sign-in. Sign-up and you can post and comment, otherwise you can only view. Does that mean I'm going to need an SSL endpoint, or is that not necessary?
Other than that I'm just serving GETting/POSTing data. I don't need anything else that I can think of. As a beginner, am I possibly overlooking anything?
Which service should I go with, given the above? This is my first iOS app and my first Rails backend (first time working with either), and my first time deploying anything to either service, so I'm looking for some advice in this area.
Thanks!
Short googling gave me these:
HEROKU VS. AMAZON WEB SERVICES
Ruby hosting in the cloud – Elastic Beanstalk vs Heroku vs EngineYard

What is pump.io?

Recently I have been looking into the development of social networks and I often find references to pump.io. There is however very limited information available on what pump.io actually is. The official website says nothing more than: "It's a stream server that does most of what people really want from a social network." I found some more information on this website (http://slid.es/evanp/understanding-pumpio/fullscreen#/) but that still doesn't say a lot to me.
Could someone please provide an elaborate discussion on what pump.io actually is (and does) to someone who does not know anything about (activity) stream servers? Maybe the better question is: "What is an activity stream server?"
Yeah, the term is one a lot of people are unfamiliar with and it makes a couple of distinctions that aren't immediately obvious even if you use and post to a pump.io site.
pump.io, as it is distributed, is really two programs with different sets of functions. One is the Activity Stream Server and the other is the Web Client.
At the risk of being pedantic, let me define each of the words. I know you know what the words mean, but I hope the specific contexts/usage will help:
Server: a program which distributes information (usually) across a network.
Stream: a (usually) chronological series of some sorts of pieces of information.
Activity: a description or depiction of something someone is doing.
The Activity Stream Server is a program which distributes (server) a chronological series (stream) of posts about stuff people do (activities).
The distinction is important because the website part of a pump.io website is a client for the pump server—essentially no different from a desktop or smartphone pump.io client. It listens to the pump's stream of posts and sends new posts to the pump using the same API and data formats that standalone applications—or other pumps—do.
You could actually totally decouple the Web Client and have a fully-functioning pump.io instance without any website. Users on other pump sites could see your posts and you could see theirs, and you could comment back and forth. It would make no difference.
Activity Streams is a JSON-based data format for describing "activities". The specification of Activity Streams 2.0 can be found at https://www.w3.org/TR/activitystreams-core/ and the vocabulary of activities at https://www.w3.org/TR/activitystreams-vocabulary/. To get a feel for what the data format looks like, you can have a look at the few examples at https://www.w3.org/TR/activitystreams-core/#examples. More examples can be found throughout the two specifications.
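To make the format a little more concrete, here is a minimal sketch that builds one activity in the style of the Activity Streams 2.0 specification linked above and serializes it to JSON. The actor URL, timestamp, and note text are made-up illustration values, and the sketch is not taken from pump.io's own API documentation.

```ruby
require "json"

# One "Create" activity: an actor created a Note.
activity = {
  "@context"  => "https://www.w3.org/ns/activitystreams",
  "type"      => "Create",
  "actor"     => "https://example.org/users/lance", # hypothetical actor URL
  "published" => "2015-02-10T15:04:55Z",            # made-up timestamp
  "object"    => {
    "type"    => "Note",
    "content" => "Just finished a great book."
  }
}

json = JSON.pretty_generate(activity)
puts json

# Round-trip to show that a consumer can read the same structure back.
parsed = JSON.parse(json)
```

An activity stream is then simply a time-ordered collection of documents like this one.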
pump.io is an activity stream server that does most of what people
really want a social network server to do.
That's a pretty packed sentence, I understand, but I can try to unwind
it a little.
"Activities" are the things we do in our on-line or off-line
life—waking up in the morning, going for a run, tasting a beer,
uploading a photo, adding a friend, eating a burrito, joining a group,
liking a blog post.
pump.io uses a simple JSON format to represent all these kinds of
activities and many more. It organizes activities into streams—time
ordered lists of activities, with the newest first. Most streams are
organized by theme, like: all the things that my friends did, or all
the things that I did, or all the things anyone has done to this
picture.
Programmers use a simple API to connect to a pump.io server and add
new activities. pump.io automatically organizes the activities into
streams and makes sure the activities get to the people who are
interested in them.
And, really, that's what we want from a social network
Behrenshausen, B. (2013). 'Interview with Evan Prodromou, lead developer of pump.io'. Retrieved from: https://opensource.com/life/13/7/pump-io
If you peer a few centimeters down the page on the official website, you'll see:
What's it for? I post something and my followers see it. That's the
rough idea behind the pump.
There's an API defined in the API.md file. It uses activitystrea.ms
JSON as the main data and command format.
You can post almost anything that can be represented with activity
streams -- short or long text, bookmarks, images, video, audio,
events, geo checkins. You can follow friends, create lists of people,
and so on.
The software is useful for at least these scenarios:
Mobile-first social networking
Activity stream functionality for an existing app
Experimenting with social software
Those last 3 items hopefully answer your question.
Currently, you can:
install the nodejs-based pump.io server
(or) sign up for an account on a public service
post notes and pictures with configurable permissions
log in to web and client applications using your webfinger ID

Should I make my CouchDB database server public-facing?

I'm new to CouchDB and am trying to comprehend how to properly make use of it. I'm coming from MongoDB, where I would always write a web layer and put it in front of Mongo so that I could allow users to access the data inside of it, etc. In fact, this is how I've used all databases for every web site that I've ever written. So, looking at Couch, I see that its native API is HTTP and that it has built-in things like OAuth support, and other features that hint to me that perhaps I should no longer have my code layer sitting in front of Couch, but instead write Views and things and just give out Couch accounts to my users? I'm thinking in terms of an HTTP-based API for a site of mine, or something that users would consume my data through. Opening up Couch like this seems odd to me, though. Is OAuth, in Couch's sense, meant more for remote access by software that I'd write and run internal to my own network "officially", or is it literally meant for the end users?
I know there might be things that could only be done through a code layer on top of CouchDB, like if you wanted additional non-database related things to occur during API requests, also. So thinking along those lines I think I will still need a code layer, anyway.
Dealer's choice.
Nodejitsu has a great writeup on this sort of topic here.
Not knowing your application specifics I'll take a broad approach...
Back-end
If you want to prevent users from ever seeing your database then make it back-end. You can pipe everything through something like node.js and present only what the user needs to see and they'll never know anything about the database.
See Resource View Presenter
Front-end
If you are not concerned about data security, you can host an entire app on CouchDB; see CouchApp. This approach has the benefit of using the replication mechanism to control publishing your site/data. The drawback here is that you will almost certainly run into some technical limitations that will require moving CouchDB closer to the backend.
Bl-end
Have the app server present the interface and the client pull the data from the database separately. This gives the most flexibility but can be a bag of hurt because even with good design this could lead to supportability and scalability issues.
My recommendation
Use CouchDB on the backend. If you need mobile clients to synchronize then use a secondary DB publicly exposed for this purpose and selectively sync this data to wherever it needs to go.
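The back-end option, where a code layer presents only what the user needs to see, can be sketched as a small presenter that whitelists fields before anything leaves the server. The `PUBLIC_FIELDS` list, the `present` helper, and the sample document are hypothetical; the `_id` and `_rev` keys are the standard CouchDB document metadata fields.

```ruby
require "json"

# Hypothetical whitelist of fields that are safe to expose to clients.
PUBLIC_FIELDS = %w[title body author].freeze

# Returns a copy of the document containing only whitelisted fields.
def present(doc)
  doc.select { |key, _value| PUBLIC_FIELDS.include?(key) }
end

# Stand-in for a document fetched from the database by the app server.
raw = JSON.parse(<<~DOC)
  {"_id": "post-1", "_rev": "1-abc", "title": "Hello",
   "body": "First post", "author": "lance", "email": "lance@example.com"}
DOC

public_view = present(raw)
puts JSON.generate(public_view)
```

The client never sees the revision metadata or the email address, and never learns which database sits behind the API.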
Simply put, no.
There's no way to secure Couch properly on a public-facing site. There's no way to discriminate access at a fine enough level of granularity. If someone has access to any of the data, they have access to all of the data.
Not all data on a site is meant for public consumption, save for the most trivial of sites.

How would you make an RSS feed's entries available longer than they're accessible from the source?

My computer at home is set up to automatically download some stuff from RSS feeds (mostly torrents and podcasts). However, I don't always keep this computer on. The sites I subscribe to have a relatively large throughput, so when I turn the computer back on it has no idea what it missed between the time it was turned off and the latest update.
How would you go about storing a feed's entries for a longer period of time than they're available on the actual sites?
I've checked out Yahoo Pipes and found no such functionality. Google Reader can sort of do it, but it requires manually marking each item. Magpie RSS for PHP can do caching, but that's only to avoid retrieving the feed too often, not really to store more entries.
I have access to a webserver (LAMP) that's on 24/7, so a solution using a php/mysql would be excellent, any existing web-service would be great too.
I could write my own code to do this, but I'm sure this has to be an issue previously encountered by someone?
What I did:
I wasn't aware you could share an entire tag using Google reader, thanks to Mike Wills for pointing this out.
Once I knew I could do this, it was simply a matter of adding the feed to a separate Google account (so as not to clog up my personal reading list). I also did some selective matching using Yahoo Pipes to get just the specific entries I was interested in, again to minimize the risk that anything would be missed.
It sounds like Google Reader does everything you're wanting. Not sure what you mean by marking individual items--you'd have to do that with any RSS aggregator.
I use Google Reader for my podiobooks.com subscriptions. I add all of the feeds to a tag, in this case podiobooks.com, that I share (but don't share the URL). I then add the RSS feed to iTunes. Example here.
Sounds like you want some sort of service that checks the RSS feed every X minutes, so you can download every single article/item published to the feed while you are "watching" it, rather than only seeing the items displayed on the feed when you go to view it. Do I have that correct?
Instead of coming up with a full-blown software solution, can you just use cron or some other sort of job scheduling on the webserver with whatever solution you are already using to read the feeds and download their content?
Otherwise it sounds like you'll end up coming close to re-writing a full-blown service like Google Reader.
Writing an aggregator for keeping longer history shouldn't be too hard with a good RSS library.
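The core of such an aggregator is small: each time the feed is fetched (for example from a cron job on the always-on server), merge its entries into a persistent archive keyed by a unique identifier. The sketch below is in Ruby rather than the PHP/MySQL mentioned in the question, uses a JSON file in place of a database, and assumes the entries have already been parsed into hashes with a `"guid"` field by an RSS library. The `archive` helper and the sample entries are hypothetical.

```ruby
require "json"
require "tmpdir"

# Merges freshly fetched entries into an on-disk archive keyed by guid,
# so entries survive after they drop off the live feed.
def archive(path, fetched)
  stored = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  fetched.each { |entry| stored[entry["guid"]] ||= entry }
  File.write(path, JSON.generate(stored))
  stored
end

path = File.join(Dir.mktmpdir, "archive.json")

# First run: the feed currently lists episodes 1 and 2.
archive(path, [{ "guid" => "ep-1", "title" => "Episode 1" },
               { "guid" => "ep-2", "title" => "Episode 2" }])

# Later run: the feed has rotated, ep-1 is gone and ep-3 is new.
history = archive(path, [{ "guid" => "ep-2", "title" => "Episode 2" },
                         { "guid" => "ep-3", "title" => "Episode 3" }])

puts history.keys.inspect
```

The home computer can then read the archive instead of the live feed and pick up everything published while it was off.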
