I can't parse the page and get links Nokogiri - ruby

I can't get a list of links from https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/ by parsing it with Nokogiri.
What am I doing wrong?
links = Nokogiri::HTML('https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/')
or
links = Nokogiri::XML('https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/')
--->
#(Document:0x3fcdda1b988c {
name = "document",
children = [
#(DTD:0x3fcdda1b5b24 { name = "html" }),
#(Element:0x3fcdda1b46fc {
name = "html",
children = [
#(Element:0x3fcdda1b0804 {
name = "body",
children = [
#(Element:0x3fcdda1ac920 {
name = "p",
children = [ #(Text "https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/")]
})]
})]
})]
})
puts links.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/</p></body></html>
=> nil

This is not going to work, for two reasons. First, Nokogiri::HTML parses the string it is given and does not fetch URLs, which is why your output is a document containing the URL itself as text; you would have to download the page first (for example with open-uri) and pass the response body to Nokogiri. Second, even then the entire page is created with JavaScript: the body of the document contains just a single script tag. Open the page source or look at the raw response instead of the rendered DOM in the web inspector/developer tools.
view-source:https://chromedriver.storage.googleapis.com/index.html?path=79.0.3945.36/
Nokogiri is just an HTML parser, not a browser, and thus does not run JavaScript. While you could use a headless browser like PhantomJS, you might want to look for an API that provides the data instead. A web scraper is usually the wrong answer to any question.
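To illustrate the "look for an API" suggestion, here is a minimal sketch that extracts download links from an XML listing rather than from rendered HTML. It assumes the storage bucket returns an XML listing with Contents/Key elements; that endpoint and the abridged SAMPLE_LISTING below are assumptions for illustration, not something confirmed above. The helper name zip_keys is hypothetical, and REXML from the standard distribution is used so the sketch does not depend on Nokogiri.

```ruby
require 'rexml/document'

# Abridged, hypothetical sample of the XML a bucket listing is assumed to return.
SAMPLE_LISTING = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
    <Name>chromedriver</Name>
    <Contents><Key>79.0.3945.36/chromedriver_linux64.zip</Key></Contents>
    <Contents><Key>79.0.3945.36/chromedriver_mac64.zip</Key></Contents>
    <Contents><Key>79.0.3945.36/notes.txt</Key></Contents>
  </ListBucketResult>
XML

BASE_URL = 'https://chromedriver.storage.googleapis.com/'

# Returns the object keys in the listing that end in ".zip".
def zip_keys(listing_xml)
  doc = REXML::Document.new(listing_xml)
  keys = []
  # Walk the root's child elements and collect the text of every Contents/Key.
  doc.root.each_element do |contents|
    next unless contents.name == 'Contents'
    contents.each_element do |child|
      keys << child.text if child.name == 'Key'
    end
  end
  keys.select { |key| key.end_with?('.zip') }
end

links = zip_keys(SAMPLE_LISTING).map { |key| BASE_URL + key }
puts links
```

In a real script the XML string would come from an HTTP request (for instance via open-uri) instead of the embedded sample.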

I found a more interesting solution. For example:
require 'open-uri'
# page is presumably a Capybara session, so page.source is the HTML after JavaScript has run
link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")')['href']
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
File.binwrite('filename.zip', URI.parse(chromedriver_storage + link_driver).read)
contains("mac") can be changed to contains("linux") or contains("win"); choose whichever operating system you need.
A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download link:
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # obtained with Capybara, trimmed to just the version
zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.binwrite('filename.zip', URI.parse(link).read)
It turns out that the parser, running in headless mode, can be put into a crontab task to update the driver for the current browser version.
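The "is the site's version more recent than mine" check could be sketched as follows. Comparing version strings directly gives wrong answers (as strings, '9.0' sorts after '10.0'), so this sketch uses Gem::Version, which ships with Ruby. The helper name update_url and the older version number are made up for illustration.

```ruby
require 'rubygems' # provides Gem::Version; already loaded in a normal Ruby run

CHROMEDRIVER_STORAGE = 'https://chromedriver.storage.googleapis.com/'

# Returns the download URL when `latest` is newer than `installed`, otherwise nil.
def update_url(installed, latest, zip = 'chromedriver_mac64.zip')
  return nil unless Gem::Version.new(latest) > Gem::Version.new(installed)

  "#{CHROMEDRIVER_STORAGE}#{latest}/#{zip}"
end

puts update_url('78.0.3904.105', '79.0.3945.36')
# prints https://chromedriver.storage.googleapis.com/79.0.3945.36/chromedriver_mac64.zip
p update_url('79.0.3945.36', '79.0.3945.36')
# prints nil
```

A cron job would then download only when update_url returns a URL.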

Related

Why does my header end up way outside the page with TuesPechkin?

I convert the document like this:
document.Objects.Clear();
document.GlobalSettings.PaperSize = PaperKind.A4;
document.Objects.Add(new ObjectSettings
{
HtmlText = xml,
HeaderSettings = new HeaderSettings { HtmlUrl = headerPath, RightText = "[page]/[sitepages]", ContentSpacing = 10 },
FooterSettings = new FooterSettings { HtmlUrl = headerPath, RightText = "[page]/[sitepages]" },
});
and the HTML is visible in the footer, but in the header it's way outside the page. It looks like it tries to put the header on the previous page; that's how far outside it is.
OK, the answer was simple but not obvious. The header HTML is a bit more sensitive, so it needed a <!doctype html> as the first line of the file.

How to get captcha img src with Ruby and Mechanize?

I'm trying to write a simple crawler that fills in 2 input fields. The page has an img element. In Chrome's developer tools I can see that the img has a src attribute, but after fetching the page with Mechanize the src attribute is gone. How do I get around this?
Code:
require 'mechanize'
agent = Mechanize.new
agent.user_agent_alias = 'Windows Chrome'
page = agent.get('https://ercdmd.ru/?gpay')
form = page.forms.first
form.gpay_abon = '00-0000000000'
captcha = page.at('#img_captcha')
pp captcha
Output:
#(Element:0x15e90ec {
name = "img",
attributes = [ #(Attr:0x15e8c14 { name = "id", value = "img_captcha" })]
})
My idea is to get an invoice by a query through a Telegram bot. Since there is a captcha, I thought I could read the captcha image src with Mechanize and send that image through Telegram. Then I would type in the digits I see on the image and send them back to Mechanize to fill in the second input field. But now I am stuck.
Is there another way to get an invoice from that source?
Looking at that page, the captcha URL would be:
captcha_url = "https://ercdmd.ru/captcha.php?time=#{Time.now.to_i}000"
Give that a try and see if it works.
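The trailing 000 turns the Unix timestamp in seconds into a millisecond-style value, presumably as a cache buster. A small sketch that wraps this in a helper (the name captcha_url and the constant are made up here) so the timestamp can be injected:

```ruby
CAPTCHA_ENDPOINT = 'https://ercdmd.ru/captcha.php'

# Builds the captcha URL; appending "000" to the seconds gives a
# millisecond-style timestamp like the one in the answer above.
def captcha_url(now = Time.now)
  "#{CAPTCHA_ENDPOINT}?time=#{now.to_i}000"
end

puts captcha_url(Time.at(1_600_000_000))
# prints https://ercdmd.ru/captcha.php?time=1600000000000
```

If this works, the image should probably be fetched with the same Mechanize agent that loaded the form, so that the captcha belongs to the same session; that is an assumption about how the site ties captchas to sessions.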

Create a child page using ruby in confluence

I want to create a new page in the Confluence wiki using a Ruby script. The script will run once a month and create a new page every time. I know how to create a new page, but I can't tell Confluence that the created page should be a child of another page. Here's what I tried:
server = Confluence::Server.new(url)
server.login(confluence['username'], confluence['password'])
template = load_file(template_filename)
erb = ERB.new(template)
page = {
"space" => "lxeng",
"title" => "Release 2015.08",
"ancestors" => [{"id" => 49776851 }],
"content" => erb.result,
}
server.storePage(page)
The code above works perfectly, except that it doesn't create the page under the page with the ID 49776851.
I know that you can use ancestors in other languages, but I can't figure out how it works in Ruby. How can I specify where the new page should be placed?
Any help would be much appreciated.

Manually control <head> markup in Joomla

Is there a way to manually configure the contents of the <head> section of the site in Joomla 3.1? I want to use the templating system for the entire markup of the page, including everything between <html></html>.
I just read this: http://forum.joomla.org/viewtopic.php?f=466&t=230787 and I am astonished at the response. Surely this is template/data separation 101. Has this been fixed in the latest Joomla release?
If you are developing a template and need the head section separated from the Joomla libraries and core files, here is how it works.
Normally the head section is included like this:
<jdoc:include type="head" />
It loads its content from libraries\joomla\document\html\renderer\head.php
If you want to override the content of the head, you can write a module for the task.
Just create a module and include it instead of this head tag. Make sure it has all the code required to work with the $document class; otherwise you will lose a lot of Joomla features that depend on the document class.
As explained by the answer from Jobin, normally, you would include the head data by using the <jdoc:include type="head" /> tag, but if you want more control over this, you can use the JDocument.
Example code in your template's PHP:
$doc = JFactory::getDocument();
$my_head_data = $doc->getHeadData();
This will give you an array of the data that JDocument would normally print, so that you can completely choose what to print and how.
To make jQuery load from CDN and get it on top of the script list, I made a little patch just after the $doc = JFactory::getDocument(); that manipulates the header array directly inside the $this object as follows:
$my_jquery = "//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js";
$my_jquery_ui = "//ajax.googleapis.com/ajax/libs/jqueryui/1.11.2/jquery-ui.min.js";
$my_jquery_cx = $this->baseurl."/media/jui/js/jquery-noconflict.js";
foreach($this->_scripts as $k=>$v) {
    // put own jquery, jquery-ui and jquery-noconflict on top of the list
    // (strpos can return 0 for a match at the start, so compare against false)
    if( strpos($k,'jquery.min.js') !== false) {
        unset($this->_scripts[$k]);
        // each union prepends, so the final order is jquery, jquery-ui, noconflict
        $r = array( $my_jquery_cx => $v);
        $this->_scripts = $r + $this->_scripts;
        $r = array( $my_jquery_ui => $v);
        $this->_scripts = $r + $this->_scripts;
        $r = array( $my_jquery => $v);
        $this->_scripts = $r + $this->_scripts;
    }
    else if( strpos($k,'jquery.ui.min.js') !== false) {
        unset($this->_scripts[$k]);
    }
    else if( strpos($k,'jquery-noconflict.js') !== false) {
        unset($this->_scripts[$k]);
    }
}
Replace $my_jquery_xxx with editable config parameters in your templateDetails.xml file

jquery - load all text files and display them

I have been using jQuery for a pretty long time, but I never learned AJAX, so here I am.
We have this code:
$('body').load('hello.txt');
Simple enough. Now let's say I have multiple text files (I don't know their names) that I want to load.
Can I do that?
Maybe I need to loop over all the text files and load them somehow?
Thanks in advance.
Assuming you have the text files on the server in a specific location, you can do this:
HTML markup:
<div id="fileList">
the list of files will be loaded here so that the user can select which one to load
</div>
<div id="file-content">
the content of the selected file will be loaded here
</div>
jQuery part:
$.ajax({
url : "FileServer/GetFileNames", // this is just a url that is responsible to return files list
success : function(data){
//here a JSON data including filenames expected
$.each(data,function(i,item){
var $fileHolder = $("<div></div>");
$fileHolder.attr("filename",item.filename).click(function(){
$("#file-content").load($(this).attr("filename"));
}).html(item.filename).appendTo("#fileList");
});
}
});
Expected JSON structure:
[
  {
    "filename": "text1.txt"
  },
  {
    "filename": "text2.txt"
  },
  {
    "filename": "text3.txt"
  }
]
Implementing the file listing on the server side is up to you.
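As one possible shape for that server-side piece, here is a minimal Ruby sketch that builds the JSON above from the .txt files in a directory. The function name text_file_list_json is made up, the choice of Ruby is arbitrary, and wiring it to a URL such as FileServer/GetFileNames is left to whatever framework the server uses.

```ruby
require 'json'
require 'tmpdir'

# Returns a JSON array of { "filename": ... } objects, one per .txt file in dir.
def text_file_list_json(dir)
  names = Dir.glob(File.join(dir, '*.txt')).map { |path| File.basename(path) }.sort
  JSON.generate(names.map { |name| { 'filename' => name } })
end

# Demonstration against a throwaway directory.
Dir.mktmpdir do |dir|
  %w[text2.txt text1.txt notes.md].each do |name|
    File.write(File.join(dir, name), 'hello')
  end
  puts text_file_list_json(dir)
  # prints [{"filename":"text1.txt"},{"filename":"text2.txt"}]
end
```

Returning only base names keeps the server's directory layout out of the response; the client then requests each file relative to wherever the files are served from.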
JavaScript does not have access to the local file system for obvious
security reasons. This is not possible.
Unless you are trying to loop through files on your server, in which
case you wouldn't want to use jQuery anyway but something like ASP.NET
or PHP or whatever framework you are using.
Foreach file in directory jQuery
UPDATE
Try this out
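$.ajax({
    url: "http://homepage/folder",
    success: function (txt) {
        // The request is asynchronous, so the listing must be processed here,
        // inside the callback, not after the $.ajax call.
        var files = txt.split('<A href="');
        var fList = [];
        $.each(files, function (i, chunk) {
            if (chunk.indexOf('.txt') > -1) {
                // Keep only the URL: everything before the href's closing quote.
                fList.push(chunk.split('">')[0]);
            }
        });
        // Append each file's text; calling .load() repeatedly on one element
        // would overwrite the previous file.
        $.each(fList, function (i, url) {
            $.get(url, function (text) {
                $('#idLoadHere').append($('<pre>').text(text));
            });
        });
    }
});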
Run an FTP list command (there are various ways to do so; WebSockets is one).
A simpler, more common, and more secure solution is to list the files on the server side and "cook" the HTML (that is, embed the file listing in it).
You can emit the listing as raw HTML or put it in a var statement for JavaScript to use, for example.
See the following answer:
https://stackoverflow.com/a/30949072/257319
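A minimal sketch of "cooking" the listing into HTML on the server, written in Ruby purely for illustration: the function name file_list_html is made up, and the id fileList is borrowed from the markup earlier in this thread. File names are HTML-escaped before being embedded.

```ruby
require 'erb'

# Renders the given file names as an HTML list of links.
def file_list_html(filenames)
  items = filenames.map do |name|
    escaped = ERB::Util.html_escape(name)
    %(<li><a href="#{escaped}">#{escaped}</a></li>)
  end
  "<ul id=\"fileList\">\n#{items.join("\n")}\n</ul>"
end

puts file_list_html(['text1.txt', 'a&b.txt'])
```

The page can then attach click handlers to these links and load the selected file, with no separate request needed to discover the file names.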

Resources