How to read content from a reCAPTCHA-protected site

My client needs data scraped from a website, and I am planning to use php_curl. The problem is that the site uses Google reCAPTCHA. A few valuable data items are visible only when you click a "show this information" link; the reCAPTCHA then appears in a lightbox and, once solved, vanishes and the information is displayed.
I have checked the source HTML: the protected item is only loaded when someone clicks, and there is no way for me to automate that click. I even tried opening the site in an iframe and using JavaScript to click it, but that fails because the two domains are different. I also tried the Selenium standalone version, but its downloads are corrupt.

Unless there is a design flaw with the website, the reCAPTCHA will prevent you from scraping the material without human intervention.
Technically, your best bet is to employ humans to solve CAPTCHAs all day and write some software that automatically scrapes the material each solved CAPTCHA protects. A number of viable businesses have been created this way, where the data is valuable and there is a genuine public interest in opening the data-set. (For example, I have heard that airlines use CAPTCHA devices to prevent price-comparison sites from driving down the cost to the consumer, and I'd argue that in such a case there is an overwhelming public interest in defeating such defences.)
Morally, however, you would need to tell us what you are doing in order for us to advise you. It is possible your client is merely planning to steal other people's material and then attempt to monetise it for him/herself, even though they had no hand in creating it. That may breach some copyright laws, but moreover, they (and you) need to decide if the scraping is fair.

I faced the same problem but resolved it by clearing the cookies on my HTTP request/user agent, waiting for some time after clearing them (Thread.Sleep), and then starting to scrape again. I am doing this in C#, not PHP, but applying the same logic may help you.
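A minimal C# sketch of that idea (a hedged illustration, not the poster's actual code): the URL, user-agent string, and delay below are placeholders, and a fresh CookieContainer per request stands in for "clearing cookies".

using System;
using System.IO;
using System.Net;
using System.Threading;

class Scraper
{
    static void Main()
    {
        // Placeholder URL; substitute the page you are allowed to scrape.
        const string url = "https://example.com/page";

        for (int attempt = 0; attempt < 5; attempt++)
        {
            // A fresh CookieContainer on every request effectively clears cookies.
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = new CookieContainer();
            request.UserAgent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd();
                Console.WriteLine("Fetched {0} characters", html.Length);
            }

            // Wait between requests, as the answer suggests (Thread.Sleep).
            Thread.Sleep(TimeSpan.FromSeconds(30));
        }
    }
}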

Related

Can I block adblock-enabled users from viewing content by wrapping content in an ad?

Just trying to find a good, simple way to block users who use adblock plus (ads are the only way writers can receive any money on our website, as with many sites).
It seems the plugins that block adblock-plus users can also be blocked.
I considered a simple solution. If their computers block ads, can't I just wrap my content in the ad, so they view all or nothing?
If not, can anyone think of any similar methods of denying users who want to deny any chance of compensation to the hardworking writers whose content they already enjoy without charge?
Possibly using a conditional statement dependent on the ad script?
It is impossible to do that.
What you can do, however (and this is a very popular technique), is put an image behind your ad telling your viewers how ad-blocking hurts the site.
If your viewer has ad-block on, they'll be able to see the image. If they don't have ad-block on, the ad will overlap the image, so they'll only see the ad.
By doing that, at least you can convince sympathetic people to perhaps disable their ad-block.
Aside from that, you're really out of luck.

Why is CoreGui RobloxLocked in the DataModel and why can't trusted users use CoreScripts?

We should be able to access some of it so that we can edit the placement of each GUI object inside of CoreGui. So, other than security reasons, why are we not allowed to edit placement of GUI objects?
Also, why can't trusted users use CoreScripts? What if they need to access HttpGet so they can provide a nice display showing where their best friend is at the current time and place? SocialService won't always do the trick.
Can a developer (or any other experienced Roblox player, particularly one that knows the UI in and out) please answer these questions to the best of his/her ability?
I asked this in the OBC cast, specifically about editing the UI inside CoreGui. I'm not sure what security reasons could be preventing this, however. They did reply - the answer was, "Well, we definitely don't want you moving the little help icon, or the exit button."
I got the feeling the general reason is that users would become confused if everything were misplaced. For example, if you went to a website where you could play several games all made by that company (like ROBLOX), would you expect the exit or help buttons to be placed differently in every game?
They did say we will be able to change the colours.
Hope this clears things up.
There are some GUI objects, like the report-abuse button, that we don't want users to be able to remove. Another sensitive area is the chat window: if it were completely scriptable, you could write a script to make it look like another user was saying something that he wasn't. This is not really desirable.
HttpGet is currently a privileged function for two main reasons:
It would allow users to get dynamic content into levels, which would make moderation a more difficult task.
Poorly or maliciously written scripts could HttpGet roblox.com in an infinite loop, sapping our server resources.
There was no obvious benefit, but some obvious downsides. We prefer to solve only the problems that need to be solved in order to ship features, so we err on the side of caution for things like this. If we later decide to open up new functionality, like making the ROBLOX social graph available through an API, we can do that with a dedicated interface that limits the number of requests you can make to the website in a given period and returns only the info that we are sure we want you to be able to get.
It's interesting to note that for a very long time Adobe Flash player didn't support TCP sockets for the same reason.

How do I find out how many users have images turned off in their browser?

I'm working on a quite popular website, which looks good if the user has the "Load images" option turned on in their browser settings.
When you open the website with images turned off, it becomes unusable: many components won't work, because the user won't see "important" buttons (we don't use standard OS buttons).
So we can't understand or measure the negative business impact of this mistake (missing alt/title attributes).
We can't set a priority for this task, because we don't know how many such users come to our website.
Can you give me some advice on how this problem can be solved?
Look in the logs for how many hits you get on a page without the subsequent requests from the browser for the other images.
Of course the browser might have images cached, so look for the first time you get a hit.
You can even use the IP address for this, since it's OK if you throw out some good data (that is, hits that are actually new but that you disregard). The question is just: of the hits you know are first-time, how many don't fetch images?
If this is a public page (i.e. not a web application that you've logged in to), also disregard search engine bots to the greatest extent possible; they usually won't retrieve images.
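A rough C# sketch of that log check, purely illustrative: it assumes a space-separated access log whose first two fields are the client IP and the requested path (adjust to your real log format), and it does not yet filter out bots or cached repeat visits.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ImagesOffEstimator
{
    static void Main(string[] args)
    {
        // Placeholder path; point this at your real access log.
        string logPath = args.Length > 0 ? args[0] : "access.log";

        var pageIps = new HashSet<string>();   // IPs that requested pages
        var imageIps = new HashSet<string>();  // IPs that also requested images

        foreach (var line in File.ReadLines(logPath))
        {
            // Assumed format: "<client-ip> <requested-path> ..." - adjust as needed.
            var fields = line.Split(' ');
            if (fields.Length < 2) continue;

            string ip = fields[0];
            string path = fields[1].ToLowerInvariant();

            bool isImage = path.EndsWith(".png") || path.EndsWith(".gif")
                        || path.EndsWith(".jpg") || path.EndsWith(".jpeg");

            if (isImage)
                imageIps.Add(ip);
            else
                pageIps.Add(ip);
        }

        int withoutImages = pageIps.Count(ip => !imageIps.Contains(ip));
        Console.WriteLine("IPs that requested pages: {0}", pageIps.Count);
        Console.WriteLine("...of which never requested an image: {0}", withoutImages);
    }
}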

Mixing Secure and Non-Secure Content on Web Pages - Is it a good idea?

I'm trying to come up with ways to speed up my secure web site. Because there are a lot of CSS images that need to be loaded, the site can be slow, since secure resources are not cached to disk by the browser and must be retrieved more often than they really need to be.
One thing I was considering is perhaps moving style-based images and javascript libraries to a non-secure sub-domain so that the browser could cache these resources that don't pose a security risk (a gradient isn't exactly sensitive material).
I wanted to see what other people thought about doing something like this. Is this a feasible idea or should I go about optimizing my site in other ways like using CSS sprite-maps, etc. to reduce requests and bandwidth?
Browsers (especially IE) get jumpy about this and alert users that there's mixed content on the page. We tried it and had a couple of users call in to question the security of our site. I wouldn't recommend it. Having users lose their sense of security when using your site is not worth the added speed.
Do not mix content; there is nothing more annoying than having to go and click the Yes button on that dialog. I wish IE would let me always choose to show mixed-content sites. As Chris said, don't do it.
If you want to optimize your site, there are plenty of ways; if SSL overhead is the only thing left, buy a hardware accelerator. Hmm, if you load an image using HTTP, will it be served from cache if you later load it over HTTPS? Just a side question that I need to go find out.
Be aware that in IE 7 there are issues with mixing secure and non-secure items on the same page, so this may result in some users not being able to view all the content of your pages properly. Not that I endorse IE 7, but recently I had to look into this issue, and it's a pain to deal with.
This is not advisable at all. The reason browsers give you such trouble about insecure content on secure pages is that it exposes information about the current session and leaves you vulnerable to man-in-the-middle attacks. I'll grant there probably isn't much a third party could do to sniff sensitive info if the only insecure content is images, but CSS can contain references to JavaScript/VBScript via behavior files (IE). If your JavaScript is served insecurely, there isn't much that can be done to prevent a rogue script from scraping your web page at an inopportune time.
At best, you might be able to get away with iframing the secure content to keep the look and feel. As a consumer I really don't like it, but as a web developer I've had to do it before when there were no other pragmatic options. Frankly, though, there are just as many, if not more, defects with that approach: you're still hoping that nothing violates the integrity of the insecure content, so that it hosts the secure content and not some alternate content.
It's just not a great idea from a security perspective.

How do I Extend Blogengine.Net to collect statistics of visitors?

I love BlogEngine, but from what I can see it does not collect the standard information about visitors that I would like to see (referrer, browser type and so on).
When I log in as Admin I have a menu item named "Referrer". I can choose a weekday and then I'll be presented with one or two rows such as
"google.com 4 hits", "itmaskinen.se 6 hits" and so on. But that's not what I want to see; I want to see where my visitors come from (country, IP if possible), how many visitors there are, and so on.
If any of you are familiar with BlogEngine.NET and can point me in the right direction as to where I would put my own logging code, or if you know of a visitor-statistics extension that can do it for me, I would be really happy to know. I prefer an extension, because if I make changes to BlogEngine myself, they may break when I install later updates.
BlogEngine.NET is blog software made in .NET, found here: http://www.dotnetblogengine.net/
And yes, I prefer to ask this question here rather than in the BlogEngine.NET forum; you know why. ;)
This isn't an extension, but it's what I use to collect all my blogengine.net data and it should be upgrade safe.
When you log into the BlogEngine.NET admin screens, you can go to "Settings > Custom Code > Tracking Script"; here you can put your http://www.google.com/analytics/ logging script. Google Analytics provides all the referrer, browser type, etc. data you were wanting. And what's nice is that you can then create additional accounts for other sites if you choose.
I use both Google Analytics and StatCounter to track visitor stats. I find that each one provides useful information that the other doesn't. And they're both free to a certain extent.
I place their JavaScript code in the site.master file of my custom BE.NET skin.
For Google Analytics I go a step further and pass the username of authenticated users as a custom variable. That way I can match user names up with the stats. To do this you can use the _setVar JavaScript method on the GA pageTracker, like so:
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-129049-25");
var userDefinedValue = '<%= System.Web.Security.Membership.GetUser() != null ? System.Web.Security.Membership.GetUser().UserName : "" %>';
pageTracker._setVar(userDefinedValue);
pageTracker._trackPageview();
</script>
Has anyone noticed that we miss all the hits coming from RSS readers? Syndication.axd does not run the analytics JavaScript, so the vast majority of readers are missing from the statistics, and we happily analyze only the part that is not that important: the ad-hoc visitors.
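One hedged way to get those feed hits back into your numbers is to count them on the server, where the JavaScript tracker never runs. The sketch below is a hypothetical ASP.NET IHttpModule (not something that ships with BlogEngine.NET); the module name and log-file location are made up for illustration.

using System;
using System.IO;
using System.Web;

public class FeedHitCounterModule : IHttpModule
{
    public void Init(HttpApplication context)
    {
        context.BeginRequest += (sender, e) =>
        {
            var app = (HttpApplication)sender;

            // RSS readers fetch the feed handler directly, so count those requests here.
            if (app.Request.Path.EndsWith("syndication.axd", StringComparison.OrdinalIgnoreCase))
            {
                string line = string.Format("{0:o}\t{1}\t{2}",
                    DateTime.UtcNow,
                    app.Request.UserHostAddress,
                    app.Request.UserAgent);

                // Appending to a flat file keeps the example simple; a database would scale better.
                File.AppendAllText(app.Server.MapPath("~/App_Data/feed-hits.log"),
                    line + Environment.NewLine);
            }
        };
    }

    public void Dispose() { }
}

You would still need to register the module in web.config (under system.webServer/modules on IIS 7+) for it to run.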
For the vast majority of cases, Google Analytics does just fine. It all depends on how much data you want. For example, if you want to keep note of IP addresses and resolve them to get domain names, and also highlight all visits to your blog from, say, your coworkers at the company where you work, you'd have to write some custom code yourself. However, it's all fairly primitive - these sorts of things are easily achievable using ASP.NET.
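If you do go the custom-code route, reverse DNS is the core piece. A small hedged sketch in C# (the IP address and company domain below are invented for illustration):

using System;
using System.Net;
using System.Net.Sockets;

class VisitorLookup
{
    static void Main()
    {
        string visitorIp = "203.0.113.42";          // placeholder visitor IP
        string companySuffix = ".example-corp.com"; // hypothetical employer domain

        try
        {
            // Reverse-resolve the IP address to a host name.
            IPHostEntry entry = Dns.GetHostEntry(visitorIp);
            Console.WriteLine("Resolved {0} to {1}", visitorIp, entry.HostName);

            if (entry.HostName.EndsWith(companySuffix, StringComparison.OrdinalIgnoreCase))
                Console.WriteLine("This looks like a visit from a coworker.");
        }
        catch (SocketException)
        {
            // Not every address has a reverse DNS entry.
            Console.WriteLine("No reverse DNS entry for {0}", visitorIp);
        }
    }
}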
I set up statistics gathering on the IIS web site of my BlogEngine instance and then analyze the logs using WebLog Expert - http://www.weblogexpert.com.
It is more reliable than Google Analytics, since I see truly ALL requests that come to my IIS, whether they are for an .axd handler or for static content. And once I found out that Google was misreporting my number of visits, I started trusting my IIS statistics much more than Google's.
There is a widget which can be used to display visits and online-user statistics.
You can find it at the following links:
http://www.nuget.org/packages/Statistics/
http://www.itnerd.ir/post/2013/07/25/Visits-and-Online-Users-Statistics-widget-for-BlogEngine-2
The instructions are at the second link.
