I had some time off recently and thought it would be a neat exercise to see how quickly I could put together a working program to automatically retrieve '.torrent' files for me. I'm aware there are existing solutions, but this was more of a programming exercise.
All was well: it ran, checked the sites for new torrents, and attempted to download them. But this is where I'm running into a problem; one of the sites I'm trying to download the .torrent file from is giving me a file containing this instead of the torrent file when I try to download it:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
</p>
<hr>
<address>Apache/2.2.3 (CentOS) Server at forums.mvgroup.org Port 80</address>
</body></html>
My first thought was maybe a broken link, but I went and successfully downloaded the file in my browser, so it's not a broken link. My next thought was that maybe I'm not downloading the file correctly. This is the example that I used, and this is the actual code that does the downloading in my program.
I have a sneaking suspicion this is going to turn out to be one of those brain-dead simple gotchas, but I'm having a heck of a time figuring it out. Does anyone know why I'm getting a 400, or how to fix this?
A broken link should return a 404 Not Found error. Since you can retrieve the file with a browser, there are two other likely issues: either your code isn't handling redirects that the browser follows automatically, or it's missing a needed session ID, cookie, or some other state value. Again, a browser handles those for you, but your code won't unless you write it in or take advantage of the right gem.
The sample code you link to at http://snippets.dzone.com/posts/show/2469 is rudimentary and isn't wired to follow redirects, which is what I suspect you need. I glanced at your code and it doesn't handle them either. The "Following Redirection" sample code in the docs for Net::HTTP shows how to do it.
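For illustration, here's a minimal sketch in the spirit of that docs example; the torrent URL and output filename are placeholders:

require 'net/http'
require 'uri'

# Follow redirects by hand, capped to avoid redirect loops.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))
  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else response.value # raises for 4xx/5xx responses
  end
end

resp = fetch('http://forums.example.org/some.torrent') # placeholder URL
File.open('some.torrent', 'wb') { |f| f.write(resp.body) }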
Rather than write the code to retrieve the URL yourself, which amounts to reinventing the wheel, I recommend using Ruby's OpenURI (open-uri), because it handles redirects automatically along with time-out retries. It's easy to use and a good workhorse for those normal "get a URL" jobs.
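A minimal open-uri sketch (again with a placeholder URL and filename); open-uri follows redirects on its own:

require 'open-uri'

# Note: on Ruby 2.5+ prefer URI.open; Kernel#open with a URL works on older versions.
open('http://forums.example.org/some.torrent') do |remote|
  File.open('some.torrent', 'wb') { |local| local.write(remote.read) }
end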
If you want a gem that handles redirects, cookies, and session IDs, look at Mechanize. It's a very good gem for general-purpose tasks, though it is really designed for navigating web sites.
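A rough sketch of the same download through Mechanize (placeholder URLs; older versions expose the class as WWW::Mechanize):

require 'mechanize'

# Mechanize keeps cookies across requests and follows redirects itself.
agent = Mechanize.new
agent.get('http://forums.example.org/index.php') # warm up a session if the site needs one
torrent = agent.get('http://forums.example.org/some.torrent')
torrent.save('some.torrent')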
For more robust tasks, Curb and Typhoeus are good because they can handle multiple requests, though you'll need to write a bit more code for managing the files and navigating sites. For a file download they'd be fine.
You need a logging proxy in between, so you can see which bytes go over the wire.
If you use Eclipse, it has an HTTP proxy available. I believe it is part of the Eclipse Java EE download.
We have an ASP.NET website on IIS with a Lead Forensics tracking link, which had been working fine prior to switching to require SSL on all pages. It is something similar to:
<script type="text/javascript" src="http://lead-123.com/js/8303.js"></script>
Since requiring SSL, however, the tracking no longer seems to be working.
Obviously this is caused by requesting an http link from the original https page. But the following two attempts are also failing:
src="//lead-123.com/js/8303.js"
src="https://lead-123.com/js/8303.js"
Visiting the https URL to the tracking script shows that it is being served (albeit with security errors).
I'm sure Lead Forensics have considered this. Does anyone know if there are any conventions or workarounds that can somehow be used so that security errors aren't reported on the site and for tracking to work? I can't find any documentation on this, and attempts to contact them haven't proven successful to date.
Update
I'm not sure the script is hosted at the https link after all (it only responds in my browser after I have successfully received a response from the http link). Nevertheless, I am still looking for a convention on how to handle this situation, whether a separate link is provided when using SSL, or indeed whether the technology is even capable of working over SSL.
Call Lead Forensics support. They can configure a secure endpoint for the tracker upon request:
<script type="text/javascript" src="https://secure.leadforensics.com/js/XXXXX.js"></script>
<noscript>
<img src="https://secure.leadforensics.com/XXXXX.png" style="display:none;" />
</noscript>
There is nothing you can do about this on your side. The CN (Common Name) assigned to the certificate is *.leadforensics.com, yet they keep handing out other domain names bound to that certificate.
ERR_CERT_COMMON_NAME_INVALID is the error we get.
Because this entire process runs in the background, the browser doesn't load the JS and PNG files, and tracking doesn't happen.
I am not sure how Lead Forensics can even do this!
We could easily create a workaround by using the HttpWebRequest class and overriding the X509 certificate-validation callback to always return true, but such a workaround would violate security norms and may mask other vulnerabilities.
So I've asked Lead Forensics to correct it.
I am developing a website using Microsoft MVC3, and have built it upon the default MVC3 Application template. It accesses an external database and works on localhost.
I have deployed it to a shared server I rent from Storm Internet via the publish tool using the FTP method (Storm Internet do not yet support Web Deploy), and it runs well. It accesses the database okay and GET requests work fine.
However, any form that submits via POST returns a 404 Not Found error (this is on actions where I have applied [HttpPost]).
Storm Internet assure me that POST and GET are allowed by default, and since the helpdesk are not developers, I'm unsure who to turn to. I don't have an excellent understanding of web.config, although I can read and understand XML and see what's going on by reading through and googling. I have tried adding the protocols to the root web.config, and I think I might be barking up the wrong tree.
Has anyone else had this problem, or might anyone know how to help me?
To replicate my error, my site is here... 213.229.125.117/$sitepreview/ase-limited.com/Dev (sorry it isn't a clickable link; the dollar sign gets parsed into a percent-encoding)
and the quickest route to a POST request is to click 'Add Building' at the top of the left-hand side and then click 'Save' at the top of the dialogue box.
Any help will be gratefully received. I've been stuck on this for days without luck.
Best Regards
Nick
STOP PRESS: It turned out to be a known issue with sitepreview. Switching to the proper domain sorted everything.
I have noticed that you have some 404 JavaScript errors when performing your AJAX requests. For example, you have a request to:
http://213.229.125.117/$sitepreview/ase-limited.com/BuildingManager/Employees/2
instead of:
http://213.229.125.117/$sitepreview/ase-limited.com/Dev/BuildingManager/Employees/2
Notice how /Dev is missing. That's because in your JavaScript you have hardcoded your URLs instead of using URL helpers to generate them. For example, you wrote something like this:
$.ajax({
url: '/BuildingManager/Employees/2',
....
});
which works fine on localhost because you don't have a virtual directory name, but doesn't work when you deploy on your server because now the correct path is:
$.ajax({
url: '/Dev/BuildingManager/Employees/2',
....
});
For this reason you should absolutely never hardcode URLs like that.
And when I try to POST the form it tries to post to http://213.229.125.117/Dev/BuildingManager/SaveBuilding, which is a very odd URL as it is missing the $sitepreview/ase-limited.com part of the path. Once again: never hardcode URLs. Always use URL helpers.
What do I have to do to add ?_escaped_fragment_= support to my server? I want Google to be able to crawl my AJAX site. My hashes are already in #! form.
But I have no idea how to tell my server that when I enter mywebsite.com/?_escaped_fragment_=section in my browser, it should be treated as equivalent to mywebsite.com/#!section.
Thanks
Simple answer: my method (soon to be used for a site with ca. 50,000 AJAX-generated URLs) is to have a node.js server use a headless environment (try zombie, phantomjs, or any other) to load the site, making sure it's able to execute JavaScript and read the DOM. Then at runtime, if it's Google requesting the fragment, fire a request to the node.js server, which loads the site, executes the JavaScript, waits for the response, and delivers back the HTML, which is output to the browser.
If that sounds like a lot of work - I'm about 90% finished on the code that does it all for you, where you'd simply drop one line of (PHP) code at the top of your site/app and it does the rest for you, using a remote node.js server.
The code will be open source, so if you want to set it up yourself on a node server, you can - or if it's a PITA to set up yourself, I'll probably have a live server up and running which your app/website would fire ?_escaped_fragment_ requests to and get the HTML snapshot back. It also implements caching so that these are only requested once every X days.
Watch this space - just got a few kinks to work out, and it'll be on my site (josscrowcroft.com) and I'll put it in a github repo too.
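Whatever language the drop-in ends up in, the serving half boils down to spotting the _escaped_fragment_ parameter and returning a pre-rendered snapshot instead of the normal shell page. Here's a rough Ruby/Sinatra sketch of just that half, assuming a hypothetical headless rendering service is already listening at RENDERER:

require 'sinatra'
require 'open-uri'
require 'uri'

RENDERER = 'http://localhost:3001/render?url=' # hypothetical headless rendering service

get '/*' do
  if params['_escaped_fragment_']
    # Google asked for /?_escaped_fragment_=section, i.e. the snapshot of /#!section
    target = "http://mywebsite.com/#!#{params['_escaped_fragment_']}"
    open(RENDERER + URI.encode_www_form_component(target)).read
  else
    send_file 'public/index.html' # the normal AJAX application shell
  end
end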
We are grabbing our feed from FeedBurner using the jQuery jGFeed plugin.
This works great until the moment our users are on an https:// page.
When we try to load the feed on that page, the user gets the message that there is mixed content, secure and insecure, on the page.
A solution would be to load the feed over https, but Google doesn't allow that; the certificate isn't working.
$.jGFeed('httpS://feeds.feedburner.com/xxx')
Does anyone know a workaround for this? The way it functions now, we simply cannot serve the feed in our pages when on https.
At this time Feedburner does not offer feeds over SSL (https scheme). The message that you're getting regarding mixed content is by design; in fact, any and all content that is not being loaded from a secured connection will trigger that message, so making sure that all content is loaded over SSL is really your only alternative to avoid that popup.
As I mentioned, Feedburner doesn't offer feeds over SSL, so realistically you'll need to look into porting your feed to another service that DOES offer feeds over SSL. Keep in mind what I said above, however, with respect to your feed's content as well. If you have any embedded content that is not delivered via SSL then that content will also trigger the popup that you're trying to avoid.
This comes up from time to time with other services that don't have an SSL cert (Twitter's API is a bit of a mess that way too.) Brian's comment is correct about the nature of the message, so you've got a few options:
If this is on your server, and the core data is on your server too, then you've got end to end SSL capabilities; just point jGFeed to the local RSS feed that FeedBurner's already importing.
Code up a proxy on your server to marshal the call to FeedBurner and return the response over SSL (a rough sketch follows this list).
Find another feed service that supports SSL, and either pass it the original feed or the Feedburner one.
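A minimal sketch of that proxy option, assuming a Ruby/Sinatra app that is itself served over HTTPS; the feed URL and route name are placeholders:

require 'sinatra'
require 'open-uri'

FEED_URL = 'http://feeds.feedburner.com/xxx' # placeholder feed

get '/feed-proxy' do
  content_type 'application/rss+xml'
  # Fetch the feed server-side over plain HTTP and hand it to the page over
  # the site's own HTTPS connection, so the browser never sees mixed content.
  open(FEED_URL).read
end

The page would then request /feed-proxy over HTTPS and parse the feed itself (or the proxy could translate it to JSON first).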
I have started using the paid WordPress theme Schema for several of my blogs. In general, it is a nice theme, fast and SEO friendly. However, since my blogs are all on HTTPS, I noticed that if I had a Google FeedBurner widget in the sidebar, Chrome would show a security error for any secure page with an insecure form call on the page.
To fix this, it is really simple: you just need to edit the file widget-subscribe.php located at /wp-content/themes/schema/functions/ and replace every "http://feedburner.google.com" with "https://feedburner.google.com".
Save the file and clear the cache, and your browser will show a green padlock.
I fixed this on my own blog, www.androidloud.com.
I am using Ruby to screen-scrape a web page (created in ASP.NET) which uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I am unable to figure out how to move to the next page in the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript links that cause a postback to the same page.
An example of such a hyperlink is the page number "6", whose href is a javascript: postback call rather than a URL.
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |page_number|
  ie.link(:text, page_number.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags); a short sketch follows this list of options.
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
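As a tiny illustration of option 1b, this sketch pulls down the page, greps for the __doPostBack definition, and lists the external script files worth fetching too (the URL is a placeholder):

require 'net/http'
require 'uri'

body = Net::HTTP.get(URI('http://example.com/GridPage.aspx')) # placeholder URL
puts body.lines.grep(/__doPostBack/)                          # where the postback is defined, if inline
puts body.scan(/<script[^>]+src="([^"]+)"/).flatten           # external scripts that may define it instead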
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by just replacing your get with a post, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url), where form_data holds the fields the page would normally submit.
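A hedged sketch of what that postback might look like for a GridView pager; the control ID and page argument are guesses, and the hidden-field values have to be scraped out of the page you just fetched:

require 'net/http'
require 'uri'

uri   = URI('http://example.com/GridPage.aspx') # placeholder URL
page1 = Net::HTTP.get_response(uri)

form = {
  '__EVENTTARGET'     => 'GridView1', # taken from the javascript: link; hypothetical control ID
  '__EVENTARGUMENT'   => 'Page$6',
  '__VIEWSTATE'       => page1.body[/id="__VIEWSTATE" value="([^"]*)"/, 1],
  '__EVENTVALIDATION' => page1.body[/id="__EVENTVALIDATION" value="([^"]*)"/, 1]
}

page6 = Net::HTTP.post_form(uri, form)
puts page6.body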
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or go through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. Like Markus said, use something like Firebug or the Developer Tools in IE 8 together with Fiddler to watch the traffic. But honestly, this is a Web Forms page using the bloated GridView, so you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
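To give a feel for that, here is a rough Mechanize sketch of driving the same postback once you know the event target and argument; the page URL and control names are hypothetical, and it assumes the page already contains the standard ASP.NET hidden fields:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/GridPage.aspx') # placeholder URL

# ASP.NET pages post back through a single server-side form; fill in the
# fields __doPostBack would normally set and submit it.
form = page.forms.first
form['__EVENTTARGET']   = 'GridView1' # hypothetical control ID
form['__EVENTARGUMENT'] = 'Page$2'
page2 = agent.submit(form)

# page2 is already parsed with Nokogiri, so the grid rows are easy to reach.
rows = page2.search('table tr')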