proxy for scale, performance (to load external content)? - proxy

I am sure answer for this question will be very subjective, I simply want to know what the options are out there (for building a proxy to load external contents).
Typically I used cURL in php and pass a variable like proxy.url to fetch content. Then make an AJAX call with Javascript to populate the contents.
EDIT:
YQL (Yahoo Query language) seems a very promising solution to me, however, it has a daily usage limit which essentially prevents me from using it for large scale projects.
What other options do I have? I am open to any language, any platform, key criteria are: performance and scalability.
Please share your ideas, thoughts and experience on this topic.
Thanks,

you dont need a proxy server or something else.
Just create a cronjob to fetch the contents every 5 minutes (or whenever you want).
You just need to create a script that grabs the content from the web and saves it (to a file, a database, ...), which will be started by the cronjob.
If somebody requests your page, you just need to send the cached content out and do with it whatever you want to do.
I think scalability and performance will be no problem.

Depending on what you need to do with the content, you might consider Erlang. It's lightening fast, ridiculously reliable, and great for scaling.

Related

Golang Cache HTTP GET Results In Memory

I am working on a CLI in Go that scrapes a webpage to collect the href attributes of all the links on the page into a slice. I want to store this slice in memory for some time so that the scraper is not being called on every execution of the CLI command. Ideally, the scraper would only be called after the cache expires or the user provides some sort of --update flag.
I came across the library go-cache and other similar libraries, but from what I could tell they only work for something that is continuously running, like a server.
I thought about writing the links to a file, but then how would I expire the results after a specific duration? Would it make sense to create a small server in the background that shuts down after a while in order to use a library like go-cache? Any help is appreciated.
There are two main approaches in these scenarios:
Create a daemon, service or background application that acts as your data repository. You can run it as an HTTP server / RPC server depending on your requirements. Your CLI application then interacts with this daemon as required;
Implement a persistence mechanism that will allow data to be written and read across multiple CLI application executions. You may use normal text files, databases or even an implementation of golang's encoding/gob to write and read your slice (a map would probably be better) to and from a binary file.
You can timestamp entries and simply remove them after their ttl expires by explicitly deleting them, or by simply not rewriting them during subsequent executions, according to the strategy / approach selected above.
The scope and number of examples for such an open ended question is too myriad to post in a single answer and will most likely require multiple specific questions.
Use a database and store as much detail as you can (fetched_at, host, path, title, meta_desc, anchors etc). You'll be able to query over the data later and it will be useful to have it in a structured format. If you don't want to deal with a db dependency you could embed something like boltdb (pure go) or sqlite (cgo).

ColdFusion Caching Solutions for Fusebox 4

I have an application that was built using Fusebox 4 with ColdFusion. Can anyone recommend a good caching solution, that is a plugin, which works directly with this older version of the framework?
Another idea I've been tinkering around with is to take the most commonly used queries in the system and applying cachedWithin. The value would be a variable stored in the application scope. Basically anytime we update any of the most commonly accessed tables in the db, we update the application.cachedwithin variable as well. So whenever these tables are updated the data is refreshed. Anything else that isn't used frequently will simply query the DB to get the content.
Also to add to this very simple caching methodology would be to simply store strings, or other frequently used content, directly within the application scope.
This bulk of this application is around 30 pages, comprised of approximately 200 products. So its quite a small website.
Can anyone recommend a good Fusebox 4 cache plugin or confirm if this simple caching methodology is a good idea? If not, could you recommend a simple alternative?
thanks in advance
I would suggest you to use cfcache to store all pages output into statistics HTML files.
Then on any update, you can clear the cache of the updated pages or all the cache:
<cfcache action="flush" />
<cfobjectcache action="clear" />
make sure to disable the urlSessionFormat() in URL.
I'm not sure that you even need to be caching given the size of the site, unless you are getting a huge amount of traffic. If you are currently having performance problems, the first thing to do is make sure that Fusebox is in production mode, so that it isn't recreating the parsed files on each request.
Caching the queries should certainly aid performance - how long are the queries currently taking to execute? With Fusebox 4, it can be problematic to have "Report execution times" turned on in CF when debugging, as it can significantly affect the time the request takes to execute.

Embedding code (css, js) into a document on high profile sites

Is there an advantage of some sort (speed or performance wise) to embed your CSS and JS into your web page, as opposed to keeping the code in sparate files? I was raised to believe that keeping code separate in separate files makes things easier to maintain. However, on high profile websites like amazon or google even facebook, I see a lot of embed code. Is there a performance reason they choose to do so or is it just an old/new way of doing things. I suppose my question is similar to this one: Should I inline CSS & JS in mobile sites to save bandwidth?
But I would like to hear form experts, most notably from people who worked on high profile web sties and have done so, if any.
P.S.
Bonus Question: Last html comment on amazon web pages is <!-- MEOW --> does it mean anything or is it just a funny prank?
There are good reasons to inline resources, but as with most things, it also has its tradeoffs. The simplest case for inlining is cases where the cost of an HTTP connection is much more than the resource itself, ex: if you have a 10x10 icon you need to show, a dedicated request for that may not be worth it vs. inlining the data via a data URI.
This is especially true when and if you have many small resources that need to be fetchd. Most browsers limit themselves to a max of 6 connections per host, so if you have 60 resources which need to be fetched, then you'll be blocked for a significant chunk of time.
To work around these case we've invented other workarounds: domain sharding to go over the 6 connection limit, and "spriting" to fetch one resource vs multiple.
If you take a look at mod_pagespeed (Apache module), which does many of these optimizations on the fly for you, then the recommended setting we provide is to inline any resource that's below 2kb. That's a pretty good rule of thumb for today's stack.
Once SPDY is more widely deployed, many of these workarounds can be eliminated: no need to do domain sharding, cost of extra requests is much less, etc.
Stoyan did an experiment that you might find interesting http://www.phpied.com/style-tag-to-inline-style-attrrib/
CSS/JS external files typically get cached on the user's hard drive under that users browser's profile. So unless you change the code frequently, you won't really be doing yourself a favor by putting it inline.
Definitely saves you time from maintenance, but you can easily call in a javascript/css file and embed the code on the page you're populating on the server side, but that also means you're making your server do additional work.
As for the MEOW - yeah, them trying to be funny, or it's code... for... cat...

How would you stress test a dynamic site, when you don't know what the URLs will be ahead of time?

This isn't a question of what stress testing tools are out there. I'm afraid it's a lot harder than that. (At least for me)
Consider a restful architecture for a forum or blog that generates random IDs for each post.
Simulating creating those topics/articles would be simple, because you'd just be posting form data to an endpoint like: /article, or /topic
But how do you then stress test commenting on those articles/topics? This is different, because the comments need to belong to an article/topic, which means that you need the ids of those items. However, if all you can do is issue posts, and you have no way of pulling those ids, you'd be unable to create them.
I'm creating a site that is similar in this regard, and I have no idea how to stress test the creation of the comments.
I have two ideas, and they're both pretty awful:
Generate a massive system ahead of time with some kind of factory, and then freeze it. From there, I figure I'd have to use some kind of browser automation to create my 'comments' on all of this. The automation would I suppose go through a recording proxy, like what JMeter offers. Then, to run the test, I reload the database, and replay the massive log file.
Use browser automation for the whole thing, taking advantage of the dynamic links delivered in the HTML page. The only option here would be Selenium, and really, we're talking a massive selenium grid that would be extremely expensive. Probably very difficult to maintain also.
Option 2 is completely infeasible near as I can tell, but option 1 sounds excruciating. I'm really hoping someone can suggest something more clever.
Option 1.
I mean, implementation notes aside, you're basically just asking for a testing environment. So, the answer is to make one. In whatever fashion:
Generate it
Make it once and reload it
Randomise it
Whatever. It's the approach to go with.
How do you your testing is kind of a side issue (unit testing/browser/whatever, up to you).
But you've reached a point where you need to test with real data. So make it happen.
This is a common problem. We handle it by extracting the dynamic parts of the URLs from the server responses. I presume this system uses web browser client - which implies that those URLs are being sent in the server responses. If they are in the responses, then you CAN get them. However, since you said "if all you can do is issue posts, and you have no way of pulling those ids", then perhaps this is not the case? In that case, can you clarify?
We've recently been doing a lot of testing of Drupal systems for our customers - which has exactly the problem you've described. We either solve it by extracting the IDs dynamically from the page as the user browses to the page they want to comment on, or we use option 1, or a combination of both. Note that if you have a load testing tool handy, then generation of content is not too difficult - use the tool to do it. I.e. run a "content generation" load test. Besides yielding useful data on its own accord, that will give you a test database that you can then backup/restore as needed to maintain your test infrastructure. Now you can run the test on a more realistic environment - one that has lots of content already in it (assuming, of course, that this is realistic for your purposes).
If you are interested, I'd be happy to demo how we solve the problem using our software (Web Performance Load Tester).
I have used Visual Studio to solve this kind of problem. Visual Studio allows C# coded web tests that can programatically parse the html returned and take action based on that.
I was load testing a SharePoint website and required information to be populated ahead of time. I did create a load test that was specifically for creating "random" pages of content ahead of time. I populated a test harness database with the urls ahead of time, allowing some control over the pages that were loaded.
With a list of "articles" available and a list of potential comments, it is possible to code a pseudo-random number generator (inside a stored procedure because of the asynchronous nature of the test harness) to get a repeatable load test. That meant that the site would be populated in the same way each time the load test was run.
It does take some effort to create a decent way of populating the site off the bat, but the return in the relevance of the load test is quite good.

Client-side logic OR Server-side logic?

I've done some web-based projects, and most of the difficulties I've met with (questions, confusions) could be figured out with help. But I still have an important question, even after asking some experienced developers: When functionality can be implemented with both server-side code and client-side scripting (JavaScript), which one should be preferred?
A simple example:
To render a dynamic html page, I can format the page in server-side code (PHP, python) and use Ajax to fetch the formatted page and render it directly (more logic on server-side, less on client-side).
I can also use Ajax to fetch the data (not formatted, JSON) and use client-side scripting to format the page and render it with more processing (the server gets the data from a DB or other source, and returns it to the client with JSON or XML. More logic on client-side and less on server).
So how can I decide which one is better? Which one offers better performance? Why? Which one is more user-friendly?
With browsers' JS engines evolving, JS can be interpreted in less time, so should I prefer client-side scripting?
On the other hand, with hardware evolving, server performance is growing and the cost of sever-side logic will decrease, so should I prefer server-side scripting?
EDIT:
With the answers, I want to give a brief summary.
Pros of client-side logic:
Better user experience (faster).
Less network bandwidth (lower cost).
Increased scalability (reduced server load).
Pros of server-side logic:
Security issues.
Better availability and accessibility (mobile devices and old browsers).
Better SEO.
Easily expandable (can add more servers, but can't make the browser faster).
It seems that we need to balance these two approaches when facing a specific scenario. But how? What's the best practice?
I will use client-side logic except in the following conditions:
Security critical.
Special groups (JavaScript disabled, mobile devices, and others).
In many cases, I'm afraid the best answer is both.
As Ricebowl stated, never trust the client. However, I feel that it's almost always a problem if you do trust the client. If your application is worth writing, it's worth properly securing. If anyone can break it by writing their own client and passing data you don't expect, that's a bad thing. For that reason, you need to validate on the server.
Unfortunately if you validate everything on the server, that often leaves the user with a poor user experience. They may fill out a form only to find that a number of things they entered are incorrect. This may have worked for "Internet 1.0", but people's expectations are higher on today's Internet.
This potentially leaves you writing quite a bit of redundant code, and maintaining it in two or more places (some of the definitions such as maximum lengths also need to be maintained in the data tier). For reasonably large applications, I tend to solve this issue using code generation. Personally I use a UML modeling tool (Sparx System's Enterprise Architect) to model the "input rules" of the system, then make use of partial classes (I'm usually working in .NET) to code generate the validation logic. You can achieve a similar thing by coding your rules in a format such as XML and deriving a number of checks from that XML file (input length, input mask, etc.) on both the client and server tier.
Probably not what you wanted to hear, but if you want to do it right, you need to enforce rules on both tiers.
I tend to prefer server-side logic. My reasons are fairly simple:
I don't trust the client; this may or not be a true problem, but it's habitual
Server-side reduces the volume per transaction (though it does increase the number of transactions)
Server-side means that I can be fairly sure about what logic is taking place (I don't have to worry about the Javascript engine available to the client's browser)
There are probably more -and better- reasons, but these are the ones at the top of my mind right now. If I think of more I'll add them, or up-vote those that come up with them before I do.
Edited, valya comments that using client-side logic (using Ajax/JSON) allows for the (easier) creation of an API. This may well be true, but I can only half-agree (which is why I've not up-voted that answer yet).
My notion of server-side logic is to that which retrieves the data, and organises it; if I've got this right the logic is the 'controller' (C in MVC). And this is then passed to the 'view.' I tend to use the controller to get the data, and then the 'view' deals with presenting it to the user/client. So I don't see that client/server distinctions are necessarily relevant to the argument of creating an API, basically: horses for courses. :)
...also, as a hobbyist, I recognise that I may have a slightly twisted usage of MVC, so I'm willing to stand corrected on that point. But I still keep the presentation separate from the logic. And that separation is the plus point so far as APIs go.
I generally implement as much as reasonable client-side. The only exceptions that would make me go server-side would be to resolve the following:
Trust issues
Anyone is capable of debugging JavaScript and reading password's, etc. No-brainer here.
Performance issues
JavaScript engines are evolving fast so this is becoming less of an issue, but we're still in an IE-dominated world, so things will slow down when you deal with large sets of data.
Language issues
JavaScript is weakly-typed language and it makes a lot of assumptions of your code. This can cause you to employ spooky workarounds in order to get things working the way they should on certain browsers. I avoid this type of thing like the plague.
From your question, it sounds like you're simply trying to load values into a form. Barring any of the issues above, you have 3 options:
Pure client-side
The disadvantage is that your users' loading time would double (one load for the blank form, another load for the data). However, subsequent updates to the form would not require a refresh of the page. Users will like this if there will be a lot of data fetching from the server loading into the same form.
Pure server-side
The advantage is that your page would load with the data. However, subsequent updates to the data would require refreshes to all/significant portions of the page.
Server-client hybrid
You would have the best of both worlds, however you would need to create two data extraction points, causing your code to bloat slightly.
There are trade-offs with each option so you will have to weigh them and decide which one offers you the most benefit.
One consideration I have not heard mentioned was network bandwidth. To give a specific example, an app I was involved with was all done server-side and resulted in 200Mb web page being sent to the client (it was impossible to do less without major major re-design of a bunch of apps); resulting in 2-5 minute page load time.
When we re-implemented this by sending the JSON-encoded data from the server and have local JS generate the page, the main benefit was that the data sent shrunk to 20Mb, resulting in:
HTTP response size: 200Mb+ => 20Mb+ (with corresponding bandwidth savings!)
Time to load the page: 2-5mins => 20 secs (10-15 of which are taken up by DB query that was optimized to hell an further).
IE process size: 200MB+ => 80MB+
Mind you, the last 2 points were mainly due to the fact that server side had to use crappy tables-within-tables tree implementation, whereas going to client side allowed us to redesign the view layer to use much more lightweight page. But my main point was network bandwidth savings.
I'd like to give my two cents on this subject.
I'm generally in favor of the server-side approach, and here is why.
More SEO friendly. Google cannot execute Javascript, therefor all that content will be invisible to search engines
Performance is more controllable. User experience is always variable with SOA due to the fact that you're relying almost entirely on the users browser and machine to render things. Even though your server might be performing well, a user with a slow machine will think your site is the culprit.
Arguably, the server-side approach is more easily maintained and readable.
I've written several systems using both approaches, and in my experience, server-side is the way. However, that's not to say I don't use AJAX. All of the modern systems I've built incorporate both components.
Hope this helps.
I built a RESTful web application where all CRUD functionalities are available in the absence of JavaScript, in other words, all AJAX effects are strictly progressive enhancements.
I believe with enough dedication, most web applications can be designed this way, thus eroding many of the server logic vs client logic "differences", such as security, expandability, raised in your question because in both cases, the request is routed to the same controller, of which the business logic is all the same until the last mile, where JSON/XML, instead of the full page HTML, is returned for those XHR.
Only in few cases where the AJAXified application is so vastly more advanced than its static counterpart, GMail being the best example coming to my mind, then one needs to create two versions and separate them completely (Kudos to Google!).
I know this post is old, but I wanted to comment.
In my experience, the best approach is using a combination of client-side and server-side. Yes, Angular JS and similar frameworks are popular now and they've made it easier to develop web applications that are light weight, have improved performance, and work on most web servers. BUT, the major requirement in enterprise applications is displaying report data which can encompass 500+ records on one page. With pages that return large lists of data, Users often want functionality that will make this huge list easy to filter, search, and perform other interactive features. Because IE 11 and earlier IE browsers are are the "browser of choice"at most companies, you have to be aware that these browsers still have compatibility issues using modern JavaScript, HTML5, and CSS3. Often, the requirement is to make a site or application compatible on all browsers. This requires adding shivs or using prototypes which, with the code included to create a client-side application, adds to page load on the browser.
All of this will reduce performance and can cause the dreaded IE error "A script on this page is causing Internet Explorer to run slowly" forcing the User to choose if they want to continue running the script or not...creating bad User experiences.
Determine the complexity of the application and what the user wants now and could want in the future based on their preferences in their existing applications. If this is a simple site or app with little-to-medium data, use JavaScript Framework. But, if they want to incorporate accessibility; SEO; or need to display large amounts of data, use server-side code to render data and client-side code sparingly. In both cases, use a tool like Fiddler or Chrome Developer tools to check page load and response times and use best practices to optimize code.
Checkout MVC apps developed with ASP.NET Core.
At this stage the client side technology is leading the way, with the advent of many client side libraries like Backbone, Knockout, Spine and then with addition of client side templates like JSrender , mustache etc, client side development has become much easy.
so, If my requirement is to go for interactive app, I will surely go for client side.
In case you have more static html content then yes go for server side.
I did some experiments using both, I must say Server side is comparatively easier to implement then client side.
As far as performance is concerned. Read this you will understand server side performance scores.
http://engineering.twitter.com/2012/05/improving-performance-on-twittercom.html
I think the second variant is better. For example, If you implement something like 'skins' later, you will thank yourself for not formatting html on server :)
It also keeps a difference between view and controller. Ajax data is often produced by controller, so let it just return data, not html.
If you're going to create an API later, you'll need to make a very few changes in your code
Also, 'Naked' data is more cachable than HTML, i think. For example, if you add some style to links, you'll need to reformat all html.. or add one line to your js. And it isn't as big as html (in bytes).
But If many heavy scripts are needed to format data, It isn't to cool ask users' browsers to format it.
As long as you don't need to send a lot of data to the client to allow it to do the work, client side will give you a more scalable system, as you are distrubuting the load to the clients rather than hammering your server to do everything.
On the flip side, if you need to process a lot of data to produce a tiny amount of html to send to the client, or if optimisations can be made to use the server's work to support many clients at once (e.g. process the data once and send the resulting html to all the clients), then it may be more efficient use of resources to do the work on ther server.
If you do it in Ajax :
You'll have to consider accessibility issues (search about web accessibility in google) for disabled people, but also for old browsers, those who doesn't have JavaScript, bots (like google bot), etc.
You'll have to flirt with "progressive enhancement" wich is not simple to do if you never worked a lot with JavaScript. In short, you'll have to make your app work with old browsers and those that doesn't have JavaScript (some mobile for example) or if it's disable.
But if time and money is not an issue, I'd go with progressive enhancement.
But also consider the "Back button". I hate it when I'm browsing a 100% AJAX website that renders your back button useless.
Good luck!
2018 answer, with the existence of Node.js
Since Node.js allows you to deploy Javascript logic on the server, you can now re-use the validation on both server and client side.
Make sure you setup or restructure the data so that you can re-use the validation without changing any code.

Resources