I'm trying to write a simple crawler that fills in two input fields. The page has an img element. In Chrome's developer tools I can see that the img has a src attribute, but after fetching the page with Mechanize the src attribute is gone. How do I get around this?
Code:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Windows Chrome'
page = agent.get('https://ercdmd.ru/?gpay')
form = page.forms.first
form.gpay_abon = '00-0000000000'
captcha = page.at('#img_captcha')
pp captcha
Output:
#(Element:0x15e90ec {
name = "img",
attributes = [ #(Attr:0x15e8c14 { name = "id", value = "img_captcha" })]
})
My idea is to fetch an invoice on request from a Telegram bot. Since there is a captcha, I thought I could read the captcha image's src with Mechanize and send that image through Telegram. Then I would type in the digits I see in the image and send them back to Mechanize to fill in the second input field. But now I am stuck.
Is there another way to get the invoice from that source?
Looking at that page, the captcha URL would be:
captcha_url = "https://ercdmd.ru/captcha.php?time=#{Time.now.to_i}000"
Give that a try and see if it works.
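For illustration, here is a rough Python equivalent of that URL construction. The `captcha_url` helper name is made up for this sketch, and the idea that the `time` parameter is just the current Unix time in seconds with `000` appended comes from the Ruby line above, not from any documented API. The image would presumably also have to be fetched with the same session cookies as the form.

```python
import time


def captcha_url(now=None):
    """Build the captcha image URL with a millisecond-style timestamp.

    `now` is the Unix time in seconds; it defaults to the current time.
    """
    seconds = int(time.time() if now is None else now)
    # Appending "000" mirrors the answer: seconds padded to look like milliseconds.
    return f"https://ercdmd.ru/captcha.php?time={seconds}000"


print(captcha_url())
```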
I'm doing a personal project that involves getting data out of a website. I managed to make it log in automatically and all of that, but I have reached a point where I have to click on an image
<img src="data:image/png;base64, {{sc.PhotoLocation}}" style="width: 75px; margin-top:3px;cursor:pointer" class="ng-cloak" ng-click="sc.selectPerson(sc.person_Guid)" title="View Student Record" />
to progress to another menu of that page. After multiple Google searches and reading the documentation, I decided on using XPath:
HtmlImage image = page.<HtmlImage>getFirstByXPath("//img[@src=\'data:image/png;base64, {{sc.PhotoLocation}}\']");
page = (HtmlPage) image.click();
The problem is that I'm getting a NullPointerException out of this. What did I do wrong? Thanks.
Rest Of The Code:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("https://myWebsite");
HtmlInput username = page.getElementByName("Template1$MenuList1$TextBoxUsername");
username.setValueAttribute("myUsername");
HtmlInput password = page.getElementByName("Template1$MenuList1$TextBoxPassword");
password.setValueAttribute("myPassword");
HtmlSubmitInput loginButton = page.getElementByName("Template1$MenuList1$ButtonLogin");
page = loginButton.click();
webClient.waitForBackgroundJavaScript(2000);
if (page.getElementByName("Template1$Control0$ButtonContinue") != null) {
    HtmlSubmitInput continueButton = page.getElementByName("Template1$Control0$ButtonContinue");
    page = continueButton.click();
    webClient.waitForBackgroundJavaScript(2000);
}
Hi @Orestesk, welcome to SO.
Your XPath is not correct.
The src attribute of your image is dynamic: src="data:image/png;base64, {{sc.PhotoLocation}}"
Here {{sc.PhotoLocation}} evaluates to some value at runtime, so the value of the src attribute changes with the value of sc.PhotoLocation.
Use some other strategy to select your image instead of relying on the full src attribute.
Or try this trick:
//img[contains(@src, 'data:image/png;base64')]
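Both suggestions can be sketched in Python on a small, made-up fragment of already-rendered markup. This is only an illustration of the selection logic, not HtmlUnit code: the sample HTML is hypothetical, and because the standard library's ElementTree supports only a subset of XPath (no `contains()`), the prefix match is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-rendered markup: the {{...}} placeholder has been
# replaced by real base64 data, so the full src value is unpredictable.
SAMPLE = """
<div>
  <img src="/static/logo.png" title="Logo" />
  <img src="data:image/png;base64, iVBORw0KGgo=" title="View Student Record" />
</div>
"""

root = ET.fromstring(SAMPLE)

# Strategy 1: select on a stable attribute instead of src.
by_title = root.findall(".//img[@title='View Student Record']")

# Strategy 2: match only the constant prefix of src.
PREFIX = "data:image/png;base64"
by_prefix = [
    img for img in root.iter("img")
    if img.get("src", "").startswith(PREFIX)
]

print(len(by_title), len(by_prefix))
```

Either way the selector no longer depends on the part of src that changes per record.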
I can't get a list of links by parsing https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/ with Nokogiri.
What am I doing wrong?
links = Nokogiri::HTML('https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/')
or
links = Nokogiri::XML('https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/')
--->
#(Document:0x3fcdda1b988c {
name = "document",
children = [
#(DTD:0x3fcdda1b5b24 { name = "html" }),
#(Element:0x3fcdda1b46fc {
name = "html",
children = [
#(Element:0x3fcdda1b0804 {
name = "body",
children = [
#(Element:0x3fcdda1ac920 {
name = "p",
children = [ #(Text "https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/")]
})]
})]
})]
})
puts links.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/</p></body></html>
=> nil
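This is not going to work. Nokogiri::HTML parses the string you pass it as markup rather than fetching the URL, which is why your document contains only the URL text. Even if you fetch the page first, the entire page is created with JavaScript: the body of the document contains just a single script tag. Open the page source or look at the raw response instead of the rendered DOM in the web inspector/developer tools: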
view-source:https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/
Nokogiri is just an HTML parser, not a browser, so it does not run JavaScript. While you could use a headless browser such as PhantomJS, you might want to look for an API that provides the data instead. A web scraper is usually the wrong answer to any question.
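As one example of the "look for an API" route: Google Cloud Storage buckets can return an XML object listing (a `ListBucketResult` document), which can be parsed without running any JavaScript. The Python sketch below assumes such a listing is available for this bucket. The sample XML, its namespace, and the `list_keys` helper are illustrative and not copied from a real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing, shaped like a bucket's XML object index.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
  <Name>chromedriver</Name>
  <Prefix>79.0.3945.36/</Prefix>
  <Contents><Key>79.0.3945.36/chromedriver_linux64.zip</Key></Contents>
  <Contents><Key>79.0.3945.36/chromedriver_mac64.zip</Key></Contents>
  <Contents><Key>79.0.3945.36/chromedriver_win32.zip</Key></Contents>
</ListBucketResult>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def list_keys(xml_text):
    """Return the text of every <Key> element, whatever the namespace is."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if local_name(el.tag) == "Key"]


for key in list_keys(SAMPLE_LISTING):
    print(key)
```

Matching on the local tag name avoids hard-coding the namespace, which is the part of this sketch most likely to differ in practice.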
I found a more interesting solution. For example:
require 'open-uri' # provides URI#read
link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage_page = 'https://chromedriver.storage.googleapis.com/'
File.new('filename.zip', 'wb') << URI.parse(chromedriver_storage_page + link_driver).read
contains("mac") can be changed to contains("linux") or contains("win"); pick whichever operating system you need.
A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download URL:
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # obtained with Capybara, trimmed to just the version
zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.new('filename.zip', 'wb') << URI.parse(link).read
It turns out that the parser, running in headless mode, can be put into a crontab task to keep the driver in step with the current browser version.
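The "is the site's version more recent than mine" check is easy to get wrong with plain string comparison, since "100" sorts before "36" as text. A minimal Python sketch of the comparison and URL assembly follows. The helper names are invented for this example, and the URL layout is taken from the snippet above.

```python
STORAGE = "https://chromedriver.storage.googleapis.com/"


def parse_version(text):
    """Turn '79.0.3945.36/' into (79, 0, 3945, 36) for numeric comparison."""
    return tuple(int(part) for part in text.strip("/").split("."))


def is_newer(candidate, installed):
    """True when `candidate` is a strictly later version than `installed`."""
    return parse_version(candidate) > parse_version(installed)


def download_url(version, archive="chromedriver_mac64.zip"):
    """Join storage root, version directory and archive name."""
    return f"{STORAGE}{version.strip('/')}/{archive}"


if is_newer("80.0.3987.16", "79.0.3945.36/"):
    print(download_url("80.0.3987.16"))
```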
I'm using Django and Google's Closure javascript library, and I want to do some form processing via AJAX.
Currently, I have a button on the page that says "Add score." When you click it, it fires off a goog.net.XhrIo request to load another URL with a form on it and displays the contents in a little popup box, via a call to loadForm().
loadForm = function(formId) {
  var form = goog.dom.getElement(formId);
  goog.style.setElementShown(goog.dom.getElement('popup-box'), true);
  goog.net.XhrIo.send(form.action, displayForm, form.method);
}
displayForm = function(e) {
  goog.dom.getElement('popup-box').innerHTML = e.target.getResponseText();
}
The Django form that gets loaded is a very basic model form, with a simple "score" attribute that gets validated against a number range. Here's the code I have to process the form submission:
def Score(request):
    obj = ScoreModel.objects.get(pk=request.POST['obj_id'])
    form = ScoreForm(request.POST, instance=obj)
    if form.is_valid():
        form.save()
        messages.success(request, 'Score saved!')
        return shortcuts.redirect('index')
    else:
        # Re-render the form, including its validation errors.
        context_vars = {'score': obj, 'form': form}
        return shortcuts.render_to_response(
            'score_form.html', context_vars,
            context_instance=context.RequestContext(request))
This would all work fine if the form to enter the score itself was just displayed on the page, but because it is an AJAX popup, it doesn't work properly. If I just do a simple form submission (via HTML submit button), it works fine if the data is valid. But if the data isn't valid, instead of displaying the form with errors in the popup, it just loads only the text that would've been displayed in the popup - the form with errors - in the main browser window rather than in the popup.
Conversely, if I submit the form via my loadForm() JS method above, it works perfectly fine if the form is invalid (and displays the invalid form in the popup box), but doesn't work if the form is valid (because the main index page ends up getting displayed in my popup's innerHTML).
I can't seem to figure out how to get the code to work in both scenarios. So, how can I have my cake and eat it too? :)
This is kind of a strange issue, so if I didn't explain it well enough, let me know and I'll try to clarify. Thanks in advance.
I got it to work. The basic trick was, if the form submission was successful, instead of returning a redirect I returned a basic response object with a redirect status code and the URL to redirect to. Then I modified my displayForm() to look for that and redirect if it was found.
Here's the modified code from the Score() function:
if form.is_valid():
    form.save()
    messages.success(request, 'Score saved!')
    redirect = shortcuts.redirect('index')
    return http.HttpResponse(content=redirect['Location'],
                             status=redirect.status_code)
And the modified displayForm():
var displayForm = function(e) {
  var responseText = e.target.getResponseText();
  if (e.target.getStatus() == 302) {
    // If this was a redirect form submission, the response text will be the URL
    // to redirect to.
    window.location.href = responseText;
  } else {
    // Regular form submission. Show the response text.
    goog.dom.getElement('popup-box').innerHTML = responseText;
  }
};
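The contract between the two pieces of code can be modelled without Django or Closure. The Python sketch below is framework-free and uses invented function names; it only shows the decision logic. The point of sending the URL in the body is that no Location header is involved, so the browser's XHR layer has nothing to follow on its own and the script gets to see the 302 status.

```python
def ajax_form_response(is_valid, redirect_url, form_html):
    """Server side: return (status, body) for an AJAX form post."""
    if is_valid:
        # Success: redirect status code, with the target URL as the body.
        return 302, redirect_url
    # Failure: normal response carrying the form with its error messages.
    return 200, form_html


def client_action(status, body):
    """Client side: decide what to do with the response."""
    if status == 302:
        return ("navigate", body)  # like window.location.href = body
    return ("render", body)        # like popup.innerHTML = body


print(client_action(*ajax_form_response(True, "/index/", "<form>...</form>")))
print(client_action(*ajax_form_response(False, "/index/", "<form>...</form>")))
```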
I'm trying to use Mechanize to login and crawl a site.
For some reason, I can't seem to get the login function to work. Any ideas?
This is my code:
require 'nokogiri'
require 'open-uri'
require 'mechanize'
a = Mechanize.new
a.get('https://jackthreads.com/')
form = a.page.form_with(:class => 'jt-form')
form.field_with(:name => "email").value = "email"
form.field_with(:name => "password21").value = "password"
page = a.submit(form, form.buttons.first)
The action on the form is set to "#", so your submit is being ignored. The POST call is actually being made to https://www.jackthreads.com/login?method=ajax via AJAX. Perhaps if you update the form's action attribute with Mechanize before submitting, it will do the trick.
For what it's worth, I figured this out with the Chrome Web Inspector. After seeing the value was set to "#", I went to the network tab, filtered by XHR, then tried submitting something.
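An alternative to patching the form's action is to build the POST to the AJAX endpoint directly. The Python sketch below only constructs the request and does not send it. The endpoint and the field names `email` and `password21` come from the question and answer above; the helper name is made up, and a real login would presumably also need the session cookies from the first page load and possibly extra headers or fields.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "https://www.jackthreads.com/login?method=ajax"


def build_login_request(email, password):
    """Build (but do not send) a form-encoded POST to the AJAX login URL."""
    body = urlencode({"email": email, "password21": password}).encode("ascii")
    return Request(LOGIN_URL, data=body, method="POST")


req = build_login_request("user@example.com", "secret")
print(req.get_method(), req.full_url)
print(req.data)
```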
I wrote some code.
I can save an image in a BlobProperty.
But I cannot load the image into an HTML page.
source code:
class Product(db.Model):
    image = db.BlobProperty()
...
class add:
    productImage = self.request.get('image')
    product.image = db.Blob(productImage)
    product.put()
But when I wrote {{product.image}} in the HTML template, the page showed raw binary garbage (a run of � replacement characters) instead of the image.
What should I do if I want to load the image from the datastore?
I use an auxiliary view:
def serve_image(request, image):
    if image == "None":
        image = ""
    response = HttpResponse(image)
    response['Content-Type'] = "image/png"
    response['Cache-Control'] = "max-age=7200"
    return response
and in the model:
def get_image_path(self):
    # This returns the url of serve_image, with the argument of image's pk.
    # Something like /main/serve_image/1231234dfg22; this url will return a
    # response image with the blob
    return reverse("main.views.serve_image", args=[str(self.pk)])
and just use {{ model.get_image_path }} instead.
(this is django-nonrel, but I guess you could figure out what it does)
Also, there is a post here about this; you should check it out.
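A different technique from the auxiliary view above, shown only as an alternative: for small images, the blob bytes can be base64-encoded and embedded straight into the img tag as a data URI, which avoids the second request. The helper name is hypothetical and the sketch assumes the blob really holds PNG data.

```python
import base64


def to_data_uri(blob, mime_type="image/png"):
    """Encode raw image bytes as a data URI usable in <img src="...">."""
    encoded = base64.b64encode(blob).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG file signature stands in for real image bytes here.
png_signature = b"\x89PNG\r\n\x1a\n"
print(to_data_uri(png_signature))
```

The template would then use the returned string as the src value. For larger images a separate view with cache headers, as in the answer, is usually the better fit.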