I am trying to scrape a web page with Nokogiri/Mechanize. When I run
page = agent.get(url)
page.class
=> Mechanize::File
sometimes I get a Page object and sometimes a File object, but I need a Page object every time. I tried adding a pluggable_parser for plain/text, but that doesn't work for me.
Does anyone have an idea how I can fix this, how I can find out the content type of a File object, or how I can cast a File to a Page object?
Thanks, Michael
Most likely the page you're requesting is unavailable and the server returns a plaintext error page.
See the docs on Mechanize::File.
The content type is in page.response['content-type'].
It's definitely possible to change the content type of the response and then create a Mechanize::Page from the data without having to download it again - but I don't think that would give you anything useful.
Check the response code as well, it's in page.code.
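To illustrate the two checks above, here is a minimal sketch that reads the status code and content type from a fetched object. The helper name describe_response and the Struct stand-in are hypothetical, not part of Mechanize; only page.code and page.response['content-type'] come from the answer.

```ruby
# Summarize what the server actually sent. `page` can be any object that
# responds to #code and #response, as Mechanize::File and Mechanize::Page do.
def describe_response(page)
  content_type = page.response['content-type'].to_s
  {
    code: page.code.to_s,
    content_type: content_type,
    html: content_type.downcase.start_with?('text/html')
  }
end

# Hypothetical stand-in for the result of agent.get(url), so the sketch
# runs without Mechanize or a network connection.
fake_page = Struct.new(:code, :response)
sample = fake_page.new('404', { 'content-type' => 'text/plain; charset=utf-8' })

info = describe_response(sample)
puts "status #{info[:code]}, type #{info[:content_type]}, html? #{info[:html]}"
```

With a real agent you would pass the result of agent.get(url) instead of the stand-in. A non-200 code together with a text/plain content type would support the guess that the server is returning a plaintext error page.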
Getting a 416 error when trying to GET a website with HTTParty. Works just fine in the browser.
I have never gotten this error before, so I went online and found this:
It occurs when the server is unable to fulfill the request. This may
be, for example, because the client asked for the 800th-900th bytes of
a document, but the document is only 200 bytes long.
The request includes a Range request-header field, and not any of the
range-specifier values in this field overlaps the current extent of
the selected resource, and also the request does not include an
If-Range request-header field.
I am wondering if anyone has gotten a 416 with HTTParty before and if there is a way to prevent this from happening. Thanks
Example website where error occurs:
http://www.bizjournals.com/jacksonville/blog/morning-edition/2014/07/teens-make-up-less-of-summer-workforce-than-ever.html
It appears that bizjournals is able to detect you are a bot (not accessing in the browser) and therefore returns a 416.
irb(main):005:0> HTTParty.get('http://www.bizjournals.com/jacksonville/blog/morning-edition/2014/07/teens-make-up-less-of-summer-workforce-than-ever.html').body
=> "........As you were browsing <strong>http://www.bizjournals.com</strong> something about your browser made us think you were a bot. There are a few reasons this might happen........"
You could either ask bizjournals to allow you to make requests or try to change the headers to make bizjournals think you are not a bot.
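As a rough sketch of the second option, the snippet below builds a GET request carrying browser-like headers. It uses Ruby's standard Net::HTTP so that it runs without any gem. The header values and the helper names browser_headers and browser_like_request are assumptions for illustration, and there is no guarantee that these headers are what the site checks.

```ruby
require 'net/http'
require 'uri'

# Hypothetical header set resembling what a desktop browser sends.
def browser_headers
  {
    'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' \
                    '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Accept' => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.9'
  }
end

# Build (but do not send) a GET request with the given headers.
def browser_like_request(url, headers = browser_headers)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  headers.each { |name, value| request[name] = value }
  [uri, request]
end

uri, request = browser_like_request('http://www.bizjournals.com/')
puts "#{uri.host}: #{request['User-Agent']}"

# Sending is left commented out so the sketch has no network side effects:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# puts response.code
```

HTTParty accepts the same kind of hash through its headers option, for example HTTParty.get(url, headers: browser_headers).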
I have an application which relies on can.route to capture the hash change when the user clicks on a link.
The href of the link has the pattern '#!'.
Once the change is captured by the can.route utility, I see the hash in the browser change to #!&.
This causes an additional entry in the browser history stack.
Has anyone faced a similar issue?
Appreciate your help.
I can't provide a fix, as there is no code showing how the route is configured.
It looks like you are only adding additional parameters to the route. To confirm, please execute can.route.attr(); in your browser's developer console.
If everything is configured properly, you should get something like this for the URL http://localhost/example#!currentRoutePage
---> can.route.attr();
Object {route: "currentRoutePage"}
It looks like in your case the URL is http://localhost/example#!&view=currentRoutePage, so route is empty in the object
---> can.route.attr();
Object {view: "currentRoutePage", route: ""}
If this doesn't help much, please share the url you are seeing in the browser and the route configuration for the same.
I am developing a component in Joomla 2.5. My component sends a request to some URL and gets the response object. If I pass a wrong URL, Joomla takes me to the default error page, Error : 500 - No response code found. If a user installs my component and mistakenly enters a wrong URL, I want it to show a custom error message/page that is more meaningful to a non-programmer, rather than taking the user to the default error page. Is there some way to add this type of functionality in Joomla without editing the template/error.php file in core?
You should have an error.php file in your template; if you don't, add one and make it look the way you want. Also remember that when you turn debugging off you won't get the stack trace etc.
However, error 500 indicates something different from a URL that does not exist ("wrong URL"), which would be a 404. A 500 is an internal server error, and you need to check your logs to figure out what is causing it.
I am trying to debug my routes because they are not working as intended. When I use the profiler, I can see the URI string, which is basically the second part of the URL in the browser address bar, and the CLASS/METHOD, but these are always those of the 404 page that I am being redirected to. How can I get the class, method and arguments/parameters of the primary route that were attempted before I was sent to the 404?
E.g.
$route['en/catalog/(.+)/(.+)'] = "ccatalog/index/$1/$2";
Something has gone wrong and I get redirected to the 404, but I want to see which class (most likely "ccatalog" here), which method (hopefully "index") and which arguments ($1, $2) were used.
Thank you in advance to anyone who could help me with my problem!
I don't see a reason for your route to not work.
Check by directly opening your_path/ccatalog/index/whatever/whatever in the browser.
If it gives you a 404, it means the problem is with your controller, maybe the controller or function naming.
If it is working fine, then you may be able to use a pre_system hook to figure out the parameter values.
You may also consider hacking around with the routing files in the core (making sure you change them back) to figure out what the real issue is.
Actually, this turned out to be really easy:
$this->uri->rsegment(1);
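To make clear what the routed segments contain, here is a small sketch in Ruby (not CodeIgniter code) that imitates how a regex route such as the one in the question is rewritten and then split into class, method and arguments. The function name resolve_route and the sample URI are hypothetical; the route pattern and target are the ones from the question.

```ruby
# Rewrite a URI string using regex routes, then split the result into
# class, method and arguments, similar in spirit to routed URI segments.
def resolve_route(uri_string, routes)
  routes.each do |pattern, target|
    match = /\A#{pattern}\z/.match(uri_string)
    next unless match

    # Replace $1, $2, ... in the target with the captured groups.
    routed = target.gsub(/\$(\d+)/) { match[Regexp.last_match(1).to_i] }
    klass, method, *args = routed.split('/')
    return { class: klass, method: method, args: args }
  end
  nil # no route matched
end

routes = { 'en/catalog/(.+)/(.+)' => 'ccatalog/index/$1/$2' }
result = resolve_route('en/catalog/shoes/red', routes)
puts "class=#{result[:class]} method=#{result[:method]} args=#{result[:args].join(',')}"
```

In this picture, the first routed segment corresponds to the class, the second to the method, and the remaining ones to the arguments.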
I use the following code to pass data into a website:
require "net/http"
params = {"message"=>"some message", "to"=>"someone"}
Net::HTTP.post_form(URI.parse("http://example.com/m/send"),params)
When I inspect the web page, the form action is http://example.com/m/send and I can post the data using the site itself without any problem.
However, with the code above I keep getting HTTP 404 and my data is not passed to the database.
When I request the page with the GET method, I get HTTP 405 (Method Not Allowed), which means the page exists but does not accept GET.
Since the url is valid, what would prevent the data being posted? And how can I fix that?
I could not solve the problem using the Net::HTTP library alone; however, the Mechanize gem, as the Tin Man suggested in the comments, solves it and successfully posts the data to the server.
It is also more flexible and makes following redirects easier. Hence, if anyone runs into this problem like I did, I recommend using the Mechanize gem.
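For reference, a minimal sketch of what such a post might look like. The helper send_message and the stub agent are hypothetical and exist only so the snippet runs without the gem or a network connection; the one Mechanize call relied on is agent.post(url, params), and the URL and parameters are those from the question.

```ruby
# `agent` is expected to behave like Mechanize.new: it must respond to
# #post(url, params) and return an object that responds to #code.
def send_message(agent, url, params)
  page = agent.post(url, params)
  warn "Unexpected status #{page.code}" unless page.code.to_s == '200'
  page
end

# Hypothetical stub standing in for a Mechanize agent.
stub_page = Struct.new(:code)
stub_agent = Object.new
stub_agent.define_singleton_method(:post) { |_url, _params| stub_page.new('200') }

page = send_message(stub_agent, 'http://example.com/m/send',
                    { 'message' => 'some message', 'to' => 'someone' })
puts page.code

# With the gem installed, the real call would be along these lines:
#   require 'mechanize'
#   send_message(Mechanize.new, 'http://example.com/m/send',
#                { 'message' => 'some message', 'to' => 'someone' })
```

Passing the agent in as an argument keeps the helper easy to try out with a stub before pointing it at the real site.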