Scaling Redis for Online Friends List - ruby

I'm having trouble working out how to implement an online friends list with Ruby and Redis (or with any NoSQL solution), like the one in any chat/IM application, e.g. Facebook Chat. My requirements are:
Approximately 1 million total users
The DB stores only each user's friend ids (a set of integer values)
I'm thinking of using a Redis cluster (which I actually don't know too much about) and implement something along the lines of http://www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/.
UPDATE: Our application really won't use Redis for anything other than (potentially) the online friends list. Additionally, it's not write heavy; most of our queries, I anticipate, will be reads of online friends.

After discussing this question in the Redis DB Google Group, my proposed solution (inspired by this article) is to SADD all my online users into a single set and, for each of my users, to create a user:#{user_id}:friends_list set holding their friends' ids. Whenever a user logs in, I would SINTER the user's friends list with the current online users set. Since we're read heavy rather than write heavy, we would use a single master node for writes. To make it scale, we'd have a cluster of slave nodes replicating from the master, and our application would use a simple round-robin algorithm to spread the SINTER calls across them.
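A minimal sketch of that layout, written here in Python with the redis-py client (the question is about Ruby, but the commands map one-to-one); the key names are the ones proposed above:

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)  # master for writes; point reads at a slave

def user_logged_in(user_id):
    r.sadd('online_users', user_id)          # mark the user as online

def user_logged_out(user_id):
    r.srem('online_users', user_id)

def online_friends(user_id):
    # SINTER of the user's friend set with the global online set;
    # this read can be round-robined across the slave nodes
    return r.sinter(f'user:{user_id}:friends_list', 'online_users')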
Josiah Carlson suggested a more refined approach (a rough sketch follows below):
- When a user X logs on, you intersect their friends set with the online users set to get their initial online set, and you keep it with a Y-minute TTL (any time they do anything on the site, you push the expire time another Y minutes into the future).
- For every user in that 'online' set, you find their own 'initial set' and add X to it.
- Any time a user Z logs out, you scan their friends set and remove Z from each friend's online set (whether or not that set exists).
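A rough sketch of that scheme, again in Python with redis-py; the key names and the 15-minute TTL are illustrative choices, not part of the suggestion:

import redis

r = redis.Redis(decode_responses=True)
ONLINE_TTL = 15 * 60   # "Y minutes", picked arbitrarily here

def login(user_id):
    friends_key = f'user:{user_id}:friends_list'
    online_key = f'user:{user_id}:online_friends'
    # build X's initial online set and give it a TTL
    r.sinterstore(online_key, friends_key, 'online_users')
    r.expire(online_key, ONLINE_TTL)
    r.sadd('online_users', user_id)
    # add X to each online friend's own set
    for friend_id in r.smembers(online_key):
        r.sadd(f'user:{friend_id}:online_friends', user_id)

def touch(user_id):
    # call on any site activity to push the expiry another Y minutes out
    r.expire(f'user:{user_id}:online_friends', ONLINE_TTL)

def logout(user_id):
    r.srem('online_users', user_id)
    r.delete(f'user:{user_id}:online_friends')
    # remove Z from all friends' sets, whether they exist or not
    for friend_id in r.smembers(f'user:{user_id}:friends_list'):
        r.srem(f'user:{friend_id}:online_friends', user_id)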

What about XMPP/Jabber? XMPP servers are built for massive concurrent use and are highly reliable; all you'd need to do is write an adapter for your user login.

You should consider In-Memory Data Grids; these platforms are targeted at exactly this kind of scalability, and they can usually be deployed as a cluster on any hardware or cloud.
See: GigaSpaces XAP, VMware GemFire or Oracle Coherence.
If you're looking for a free edition, XAP provides a Community Edition.

Related

Fast delivery webpages on shared hosting

I have a website (.org) for a project of mine on LAMP hosted on a shared plan.
It started very small, but I've now extended this community to other states (in the US) and it's growing fast.
About 4 months ago I had 30,000 or so visits per day and my site was doing fine; today I reached 100,000 visits.
I want to make sure my site will load fast for everyone and since it's not making any money I can't really move it to a private server. (It's volunteer work).
Here's my setup:
- Apache 2
- PHP 5.1.6
- MySQL 5.5
I have 10 pages per state, and on each page people can contribute, write articles, like, share, etc... On a few pages I can hit 10,000 hits per hour during lunch time, and the rest of the day it's quiet.
All databases are set up properly (I personally paid a DBA expert to build that part). I am pretty sure the code is also good. Now, I could make pages faster if I used memcached, but the problem is I can't use it since I am on shared hosting.
Will MySQL be able to support that many people, with lots of requests per minute? Or should I create a fund to move to a private server and install all the tools I need to make it fast?
Thanks
To be honest, there's not much you can do on shared hosting. There's a reason why it's cheap ... it limits what you can do, including the kind of thing you want to do here.
Either you move to a VPS that allows memcached (they can be cheap) and put up some Google ads, OR you keep going on your shared hosting using a pre-generated content system.
A VPS can be very cheap (look for coupons) and you can install whatever you want since you are root.
For example, hostmysite.com with the coupon 50OffForLife is $20 per month for life ... vs. $5 per month for shared hosting ...
If you want to keep the current hosting, then what you can do is this:
Pages are generated by a process (a cronjob or on the fly) every time someone writes a comment or makes an update. This process fetches all the data for the page and saves it as a static web page.
So let's say you have a page with comments: grab the contents (meta, h1, p, etc.) and the comments, and save both into the file.
Example: (using .htaccess - based on your answer you are familiar with this)
// /topic/1/ is rewritten by .htaccess to this script with ?page_id=1
$page_id = (int) $_GET['page_id'];
$cache_file = "/my/public_html/topic/{$page_id}/index.html";
if (file_exists($cache_file)) {
    echo file_get_contents($cache_file);        // cached copy: no DB call at all
} else {
    $page     = $db->query("SELECT * FROM pages WHERE page_id = {$page_id}")->fetch();
    $comments = $db->query("SELECT * FROM comments WHERE page_id = {$page_id}")->fetchAll();
    $content  = build_html($page, $comments);   // your own templating/rendering
    file_put_contents($cache_file, $content);   // save it for the next visitor
    echo $content;
}
Or something along these lines.
Therefore, serving the saved static HTML will be very fast since you don't have to hit the DB at all; once the file is generated, it just gets loaded.
I know I'm stepping on unstable ground providing an answer to this question, but I think it is very indicative.
Pat R Ellery didn't provide enough details to do any kind of assessment, but the good news is that there can never be enough details. The explanation is quite simple: anybody can build as many mental models as they want, but the real system will always behave a bit differently.
So Pat, do test your system all the time, as much as you can. What you are trying to do is to plan the capacity of your solution.
You need the following:
Capacity test - To determine how many users and/or transactions a given system will support and still meet performance goals.
Stress test - To determine or validate an application’s behavior when it is pushed beyond normal or peak load conditions.
Load test - To verify application behavior under normal and peak load conditions.
Performance test - To determine or validate speed, scalability, and/or stability.
See details here:
Software performance testing
Types of Performance Testing
In other words (and a bit primitively): if you want to know whether your system can handle N requests per time_period, simulate N requests per time_period and see the result.
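As a very crude illustration of "simulate N requests per time_period" (before reaching for the dedicated tools listed below), here's a small Python sketch using only the standard library; the URL and the numbers are placeholders:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = 'http://www.example.org/'   # placeholder target
N = 200                           # requests to simulate
CONCURRENCY = 20

def hit(_):
    start = time.time()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(hit, range(N)))

print(f'{N} requests, avg {sum(latencies)/N:.3f}s, max {max(latencies):.3f}s')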
There are a lot of tools available:
Load Tester LITE
Apache JMeter
Apache HTTP server benchmarking tool
See list here

ec2 spot complete price history

I am using the API to get EC2 spot price history, but I cannot get anything except the last 90 or so days, and I cannot specify the frequency of observations. Is there a way to get a complete history of spot prices, preferably at minute or hourly frequency?
While not explicitly documented for the DescribeSpotPriceHistory API action, this restriction is at least mentioned for the AWS Management Console (which uses that API in turn), see Viewing Spot Price History:
You can view the Spot Price history over a period from one to 90 days based on the instance type, the operating system you want the instance to run on, the time period, and the Availability Zone in which it will be launched.
Since anybody could have retrieved and logged the entire spot price history ever since this API became available (and without a doubt quite a few users and researchers have done just that; even the AWS blog listed some dedicated third-party AWS tracking sites, though these all appear defunct at first sight), this restriction admittedly seems a bit arbitrary. It is certainly pragmatic from a strictly operational point of view, though: you have all the information you need to base future bids upon (especially given that AWS has only ever reduced prices so far, and in fact does so regularly, much to the delight of its customers).
Likewise there's no option to change the frequency, so you'd need to resort to client side code for the hourly aggregation.
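If you do want to pull whatever history the API will give you (the last ~90 days) for your own aggregation, a minimal sketch with boto3 looks roughly like this; the region, instance type and product description are placeholders:

import datetime
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')           # placeholder region
paginator = ec2.get_paginator('describe_spot_price_history')

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(days=90)                     # the API won't go back further

prices = []
for page in paginator.paginate(StartTime=start, EndTime=end,
                               InstanceTypes=['m3.large'],            # placeholder
                               ProductDescriptions=['Linux/UNIX']):
    prices.extend(page['SpotPriceHistory'])

# each record has Timestamp, SpotPrice (as a string), AvailabilityZone and InstanceType
print(len(prices), 'observations')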
This website has re-sampled EC2 spot price histories for some regions; you can access them via a simple API directly from your Python script:
http://ec2-spot-prices.ai-mmo-games.de/
I hope this helps.
AWS only provides 90 days of history, and the data is raw, i.e. not normalized by hour or even minute, so there are sometimes holes in the data.
One approach would be to pull the data into an IPython notebook and use pandas' excellent time series tools to resample by minute, 5 minutes, etc. Here's a short tutorial:
https://medium.com/cloud-uprising/the-data-science-of-aws-spot-pricing-8bed655caed2
Here are more details on using pandas for time-series resampling:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
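For instance, a minimal resampling sketch with pandas, assuming you already have the raw records in a list of dicts with Timestamp and SpotPrice fields (as the API returns them):

import pandas as pd

# `prices` is a list of dicts with at least 'Timestamp' and 'SpotPrice' keys
df = pd.DataFrame(prices)
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['SpotPrice'] = df['SpotPrice'].astype(float)   # the API returns the price as a string
df = df.set_index('Timestamp').sort_index()

# forward-fill onto a regular 5-minute grid (a price holds until the next change)
regular = df['SpotPrice'].resample('5min').ffill()

# or hourly averages
hourly = df['SpotPrice'].resample('1H').mean()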
hope that helps...

I have many separate installs for magento can I merge them together?

I've just started a new job and we have several installs of Magento, all of different versions!
Now it really seems to me that we first need to upgrade them all and then get them all under one installation of Magento using one database.
What is the best way (in general terms) of doing this?
Is it even possible, or is my best bet to rebuild the sites under one installation and import the products into it?
There is some talk by a fellow developer that having them under different installs helps with performance. Is this true?
Once we have them all under one install, things like stock control and orders, as well as putting products on multiple sites, should also be very straightforward - correct?
We are talking about quite a few stores, say around 15, and quite a few products, I would say around 4,000, maybe more.
My first suggestion is to consider the reasons why you need all the Magento instances moved under one installation; they are not clear from your question. The best developer's advice is "Does it work? Then don't touch it" :)
If there are no specific reasons, then you'd better leave it as is. All reorganization processes (upgrading, infrastructure configuration, access setup, etc.) for a software system are hard, costly, time-consuming, error-prone, usually of little value from a business point of view, and a little boring. This is not a Magento-specific thing; it is just a general characteristic of any software.
Also note that it is the holiday season, so it is better not to do anything with e-commerce stores until the middle of January.
If you see value in a reorganization of your Magento stores, then the best way to do it is to go gradually - step by step, store by store:
- Take your most complex store. Prepare everything you need for the further steps - i.e. get the tools ready, write automation scripts, and go through the process with a copy of the store on a test server. Write a set of functional tests to cover it with at least smoke tests. You'll have to repeat such light checks many times to be sure the store still appears to be working, and the automated tests will save a lot of time. All these preparations will decrease your downtime.
- Close public access to the store.
- Upgrade the store to the Magento version you need. Move it to the new infrastructure.
- Verify all the user scenarios manually and with the automated tests. Fix the issues, if any.
- Open public access to the store. Monitor logs and load reports on the server machines. Fix issues, if any.
- Take the next store (let's call it NextStore). Make a copy of it on a sandbox server.
- Make a copy of your already converted store (let's call it ConvertedStore) on the sandbox server.
- Export all the data from the copy of NextStore and import it into the copy of ConvertedStore. You can use the Magento Dataflow or Import/Export modules to do that. Not all data can be imported/exported with those modules - just Catalog, Orders and Customers. You will need to develop custom scripts to import/export other entities, if you need them.
- Verify the result manually and with the automated tests. Write scripts that fix any issues found; you will need those scripts later during the real conversion.
- Close NextStore.
- Move it to the new infrastructure, using the already prepared procedures and scripts. Consider whether to close ConvertedStore during the conversion; it depends on whether you feel it is OK to keep it open, but for safety reasons it is better to close it.
- Verify that everything works fine. Monitor logs and reports on the server machines. Fix issues, if any.
- Proceed with the rest of your stores.
That is my (totally personal) view on the procedures.
There is some talk by a fellow developer that having them under
different installs helps with performance. Is this true?
Yes, your friend is right. Splitting Magento (actually, anything in this world) into smaller instances makes each one lighter to handle. The performance difference is very small for your case of around 4,000 products, but it is inevitable. Consider that after combining the instances (suppose there are ten of them with 400 products each) you'll be handling data for 10x more customers, reports, products, stores, etc. Therefore any search will have to go through ten times more products in order to return data. Of course, it doesn't matter if the search takes 0.00001 seconds, because 0.0001 seconds for the combined instance is fine as well, but some things, like sorting or matching sets, grow non-linearly. Still, as said before, for 4,000 products you won't see a big difference.
Once we have them all under one install things like stock control and
orders as well as putting products on multiple site should also be
very straightforward - correct?
You're right - after combining the stores, handling orders, stock and customers will be a much simpler and more straightforward process.
Good luck! :)
The most important thing to consider is what problem you're solving by having all these sites on one Magento "instance". What's more important to your business/team: having these sites share product and inventory or having the flexibility of independently modifying these sites? Any downtime or impacts to availability may affect all sites.
Further questions/areas of investigation:
How much does the product hierarchy (categories and attributes) differ?
Is pricing the same across each site or different?
Are any of these sites multi-regional and how is pricing handled for each region?
It's certainly possible to run multiple sites on one Magento instance, even if there are some rough edges within the platform.
Since there's no way to export all entities in Magento, there's no functionality to merge stores. You'd have to write custom code - it would have to take all the records from the old store, assign them new IDs while preserving referential integrity, and insert them into the new store (this is what the product import does, but that doesn't exist for categories, orders, customers, etc.).
The amount of code you'd have to write for that would take almost as long as just starting over, in my opinion. You'd basically be writing the missing functionality for Magento. If it were easy, they would have done it already.
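To make the ID-reassignment idea concrete, here is a deliberately simplified sketch in Python; the table names and the old_db/new_db/strip_id helpers are made up for illustration, and Magento's real EAV schema is far more involved, so treat this as a picture of preserving referential integrity rather than working migration code:

# hypothetical: copy customers and their orders from old_db into new_db,
# letting the new store assign fresh IDs and remapping foreign keys as we go
id_map = {}   # old customer id -> new customer id

for customer in old_db.query('SELECT * FROM customer'):
    new_id = new_db.insert('customer', strip_id(customer))   # new store assigns the ID
    id_map[customer['customer_id']] = new_id

for order in old_db.query('SELECT * FROM sales_order'):
    row = strip_id(order)
    row['customer_id'] = id_map[row['customer_id']]           # remap the foreign key
    new_db.insert('sales_order', row)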
However splitting two stores apart is very easy, since you don't have to worry about reassigning unique identifiers in the DB.

memcached usage patterns

I'm planning to introduce a caching system into my website, and I will use it in different layers (data, presentation and maybe elsewhere). Since my stack is LAMP and my infrastructure is 100% cloud on AWS, I thought the natural choice would be Amazon ElastiCache (a managed installation of memcached). But...
Surprisingly (for me), I discovered that memcached completely lacks dependency management. I don't need "advanced" stuff like the ASP.NET cache's SqlDependency or FileDependency, but memcached doesn't offer a simple other-key dependency either - something pretty useful for building a dependency tree that would greatly simplify the invalidation process.
So, as I know memcached is used in many complex systems, am I missing something? Are there usage patterns that make this lack irrelevant?
thanks
UPDATE
As asked, I'm adding some pseudocode to clarify what I mean:
dependency = 'ROOT_KEY';
cache:set(dependency, 0, NEVER_EXPIRE);
expire = 600;
cache:set('key1', obj1, expire, dependency);
cache:set('key2', obj2, expire, dependency);
...
cache:set('keyN', objN, expire, dependency);
//later, when I have to invalidate
cache:remove(dependency); //this will cause all keyX to be invalidated too
Based on the example in your question: memcached (and thus ElastiCache) does not support any sort of key metadata like you are looking for, by which you could relate keys and operate on them as a group.
I suppose if you had only a handful of different "dependencies" you could simply use multiple ElastiCache instances, which would allow you to invalidate all items within each instance/dependency simultaneously. This of course might end up costing you more in AWS hardware costs than you would like, since you can only increase your cache sizes in discrete amounts. It would also eliminate the ability to do a cache lookup without knowing the dependency/instance against which the lookup is to occur.
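A small sketch of that "one cache pool per dependency" idea, using the pymemcache client in Python; the endpoints are placeholders (with ElastiCache they would be the node endpoints of each cluster):

from pymemcache.client.base import Client

# one memcached/ElastiCache endpoint per "dependency" (placeholder hostnames)
pools = {
    'catalog':  Client(('catalog-cache.example.internal', 11211)),
    'sessions': Client(('sessions-cache.example.internal', 11211)),
}

def cache_set(dependency, key, value, ttl=600):
    pools[dependency].set(key, value, expire=ttl)

def cache_get(dependency, key):
    # you have to know the dependency to look a key up -- the drawback noted above
    return pools[dependency].get(key)

def invalidate(dependency):
    # flushing the whole instance invalidates every key in that dependency at once
    pools[dependency].flush_all()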
For what you are trying to do, you might be able to use something like memory tables in MySQL/RDS if you are looking for more of a works-out-of-the-box type of solution. Of course you would not want to use RDS high-availability features or point-in-time restoration, as these will break, since they require writing to disk. You would basically need a standalone RDS instance doing nothing but these memory tables.
It seems none of these options however is really an exact fit for what you are looking to do, so you might need to look into either adjusting your approach (if you want to use basic AWS components), or deploying an alternate caching system on EC2.

What is a reliable method to record votes from anonymous users, without allowing duplicates

First of all, I searched as best I could and read all SO questions that seem relevant, but nothing specifically answered this. This is not a duplicate, afaik.
Obviously if anonymous voting on a website is allowed, there is no foolproof way to prevent someone from voting more than once.
However, I am wondering if someone with experience can aid me in coming up with a reasonably reliable way of tracking absolutely unique visitors and recording votes against those credentials.
Currently I am ensuring that only one vote per item/session combo is allowed, however this is easily circumvented by restarting browser, changing browsers/computers, or clearing your session data.
Recording against IP seems the next reasonable solution, but I wonder if this will produce false positives too often (multiple people on the same LAN behind a NAT will have the same external IP, etc.).
Is there a middle ground to be had here or some other method/combination I am overlooking?
I'd collect as much data about the session as possible without asking any questions directly (browser, OS, installed plugins, all with version numbers, IP address, etc.) and hash it.
Record the hash and increment a counter if you want multiple votes to be allowed. Include a timestamp (daily, hourly, etc.) in the salt to make votes time-sensitive, say 5 votes per day.
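A minimal sketch of that hashing idea in Python; the attribute list and the "per day" salt are just examples of what you might fold in:

import datetime
import hashlib

def visitor_fingerprint(ip, user_agent, accept_language, plugins=''):
    # the date in the salt makes the fingerprint (and any vote limit keyed on it) reset daily
    salt = datetime.date.today().isoformat()
    raw = '|'.join([salt, ip, user_agent, accept_language, plugins])
    return hashlib.sha256(raw.encode('utf-8')).hexdigest()

# then: increment a counter per fingerprint and allow, say, 5 votes per day per fingerprint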
The simplest answer is to use a cookie. Obviously it's vulnerable to people clearing their cookies, but anonymous voting is inherently approximate anyway.
In practice, unless the topic being voted on is in some way controversial or inflammatory, people aren't going to have a motive behind rigging the vote anyway.
IP is more 'reliable' but will produce an unacceptably high level of collisions due to NATs.
How about a more unique identifier composed of IP + user-agent (maybe a hash)? That effectively means for each IP, each exact OS/browser version pair gets 1 vote, which is a lot closer to 1 vote per person. Most browsers provide detailed version information in the user-agent -- I'm not sure, but my gut feel is that this would prevent the majority of collisions caused by NATs.
The only place that would still produce lots of collisions is a corporate environment with a standardised network, where everyone is using an identical machine.
Many users in China have to share one IPv4 address with hundreds of others, while HP/Compaq/DEC holds almost 50 million addresses. IPv6 doesn't help, as everyone gets addresses by the billion. A person just is not the same as an IP address, and that notion is becoming ever more false.
There are just no proper ways to do this on the Internet. Persons are simply a concept unknown on the Internet, and any idea to introduce the concept is unlikely to succeed. (Too many governments would not want this to happen, for instance.)
Of course, you can relate the number of votes per IP to the amount of repeat page visits from that IP, especially in combination with cookie tracking. This works best if you estimate that number before you start the voting period. If the top 5% most popular articles are typically read 10 times from a single IP, it's likely that 10 people share that IP and they should get 10 votes. Cookies can be used to prevent them from stealing each other's votes, but on the whole they can't skew your poll. (Note: this fails in small communities where a large group of voters comes from a small number of IPs, which in particular happens around universities.)
If you're not looking at authenticating voters, then you're going to get some duplicate votes no matter what you use. I'd use a cookie and be done with it for the anonymous users.
UserVoice allows both anonymous voting and voting when logged in, but then allows the admin to filter out anonymous votes - a nice solution to this problem.
Anything based on IP addresses isn't an option - the case of NAT has been mentioned, but only for home users. There are many larger installations that use NAT - some corporations can have thousands of users pooled behind a single IP address. There are also ISPs that use proxy servers for their users - another case where many thousands of users can appear to your application as a single address. Adding unique UA combinations to this won't help, as there isn't enough variation.
A persistent cookie is going to be your best bet - and you'll have to live with the fact that it is easy to game. At least when the cookie is persistent (as opposed to session based) you'll catch the majority of users who run a single browser.
If you really want to rely on the results, you are going to have to add some form of identification in the process (like e-mail validation, which is still gameable).
At the end of the day any internet survey is going to have flaws (like: http://www.time.com/time/arts/article/0,8599,1894028,00.html), and you'll have to live with this.
Use a persistent cookie to allow only one vote per item
and record the IP; if there are more than 100 (1,000? 10,000?) requests in less than X minutes, then "soft block" the IP.
The "soft block": don't show a page saying "your IP has been blocked"; show your "thank you for your vote" page instead and DON'T record the vote in your DB. You can even increase the counter for that IP only. You want to prevent them from knowing that you are blocking their IP.
Two ideas not mentioned yet are:
Asking for the user's email address and emailing them a verification link
Using a captcha
Obviously the former can be circumvented with disposable email addresses and so on, but gives you an audit trail, and provides a significant hurdle to casual/bot vote-stuffing. A good captcha likewise severely limits vote-stuffing, but with all the usual caveats surrounding their use.
I have the same problem, and here's what I am planning on doing...
Set a persistent cookie. Check the cookie to decide whether a particular vote could be cast.
Additionally, store some data about the vote request in the form of a combination of IP address + User-Agent, and then use this value to limit the number of votes to, say, 10 per day.
What is the best way of going about creating this hash (IP + UA String)?
