Ruby script for posting comments

I have been trying to write a script that may help me comment from the command line. (The sole reason I want to do this is that it's vacation time here and I want to kill time.)
I often visit and post on this site, so I am starting with this site only.
For example, to comment on this post I used the following script:
require "uri"
require 'net/http'
def comment()
response = Net::HTTP.post_form(URI.parse("http://www.geeksforgeeks.org/wp-comments-post.php"),{'author'=>"pikachu",'email'=>"saurabh8c#gmail.com",'url'=>"geekinessthecoolway.blogspot.com",'submit'=>"Have Your Say",'comment_post_ID'=>"18215",'comment_parent'=>"0",'akismet_comment_nonce'=>"70e83407c8",'bb2_screener_'=>"1330701851 117.199.148.101",'comment'=>"How can we generalize this for a n-ary tree?"})
return response.body
end
puts comment()
Obviously the values are not hardcoded in the real script, but for the sake of clarity and to keep to the objective of this post I am hardcoding them here.
Besides the regular fields that appear on the form, I found the values for the hidden fields with Wireshark when I posted a comment the normal way. I can't figure out what I am missing. Maybe some JS event?
Edit:
As a few people suggested using mechanize, I switched to Python. Now my updated code looks like:
import sys
import mechanize

uri = "http://www.geeksforgeeks.org/"
request = mechanize.Request(mechanize.urljoin(uri, "archives/18215"))
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()

form = forms[0]
print form

control = form.find_control("comment")
#control = form.find_control("bb2_screener")
print control.disabled
# ...or readonly
print control.readonly
# readonly and disabled attributes can be assigned to
#control.disabled = False
form.set_all_readonly(False)

form["author"] = "Bulbasaur"
form["email"] = "ashKetchup#gmail.com"
form["url"] = "9gag.com"
form["comment"] = "Y u no put a captcha?"
form["submit"] = "Have Your Say"
form["comment_post_ID"] = "18215"
form["comment_parent"] = "0"
form["akismet_comment_nonce"] = "d48e588090"
#form["bb2_screener_"] = "1330787192 117.199.144.174"

request2 = form.click()
print request2

try:
    response2 = mechanize.urlopen(request2)
except mechanize.HTTPError, response2:
    pass

# headers
for name, value in response2.info().items():
    if name != "date":
        print "%s: %s" % (name.title(), value)
print response2.read()  # body
response2.close()
Now the server returns me this. Going through the HTML code of the original page, I found out there is one more field, bb2_screener, that I need to fill in if I want to look like a browser to the server. But the problem is that the field is not written inside the <form> tag, so mechanize won't treat it as a field.

Assuming you have all the params correct, you're still missing the session information that the site stores in a cookie. Consider using something like mechanize, which will deal with the cookies for you. It's also more natural in that you tell it which fields to fill in with which data. If that still doesn't work, you can always use a jackhammer like Selenium, but then technically you're driving a real browser.
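For the Ruby side, a minimal untested sketch with the Mechanize gem could look like the following. The form lookup by action and the field names are assumptions based on the parameters captured above; the point is that fetching the page first means the cookie and the in-form hidden fields are handled for you:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.geeksforgeeks.org/archives/18215')

# Assumption: the comment form is the one posting to wp-comments-post.php
form = page.form_with(action: /wp-comments-post/)
form.author  = 'pikachu'
form.email   = 'saurabh8c#gmail.com'
form.comment = 'How can we generalize this for a n-ary tree?'

# bb2_screener_ is injected by JavaScript outside the form, so add it by hand
form.add_field!('bb2_screener_', '1330701851 117.199.148.101')

# The nonce, comment_post_ID and the session cookie come from the fetched page
result = agent.submit(form, form.buttons.first)
puts result.body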

Related

Using JobQueue to continuously refresh a message

I'm building a Telegram bot that uses ConversationHandler to prompt the user for a few parameters and settings about how the bot should behave. This information is stored in some global variables since it needs to be available and editable by different functions inside the program. Every global variable is a dictionary in which each user is associated with its own value. Here's an example:
language = {123456: 'English', 789012: 'Italian'}
where 123456 and 789012 are user ids obtained from update.message.from_user.id inside each function.
After all the required information has been received and stored, the bot should send a message containing a text fetched from a web page; the text on the web page is constantly refreshed, so I want the message to be edited every 60 seconds and updated with the new text, until the user sends the command /stop.
The first solution that came to my mind in order to achieve this was something like
info_message = bot.sendMessage(update.message.chat_id, text = "This message will be updated...")
...
def update_message(bot, update):
    while True:
        url = "http://example.com/etc/" + language[update.message.from_user.id]
        result = requests.get(url).content
        bot.editMessageText(result, chat_id = update.message.chat_id, message_id = info_message.message_id)
        time.sleep(60)
Of course that wouldn't work at all, and it is a really bad idea. I found out that the JobQueue extension would be what I need. However, there is something I can't figure out.
With JobQueue I would have to set up a callback function for my job. In my case, the function would be
def update_message(bot, job):
    url = "http://example.com/etc/" + language[update.message.from_user.id]
    result = requests.get(url).content
    bot.editMessageText(result, chat_id = update.message.chat_id, message_id = info_message.message_id)
and it would be called every 60 seconds. However, this wouldn't work either. Indeed, the update parameter is needed inside the function in order to fetch the page according to the user settings and to send the message to the correct chat_id. I'd need to pass that parameter to the function along with bot and job, but that doesn't seem to be possible.
Otherwise I would have to make update a global variable, but I thought there must be a better solution. Any thoughts? Thanks.
I had the same issue. A little digging into the docs revealed that you can pass job objects a context parameter which can then be accessed by the callback function as job.context.
context (Optional[object]) – Additional data needed for the callback function. Can be accessed through job.context in the callback. Defaults to None
import requests
from telegram.ext import Job

language = {123456: 'English', 789012: 'Italian'}

def update_message(bot, job):
    context = job.context
    url = "http://example.com/etc/" + language[context["from_user_id"]]
    result = requests.get(url).content
    bot.editMessageText(result,
                        chat_id=context["chat_id"],
                        message_id=context["message_id"])

j = updater.job_queue
context = {"chat_id": 456754, "from_user_id": 123456, "message_id": 111213}
update_job = Job(update_message, 60, repeat=True, context=context)
j.put(update_job, next_t=0.0)

Using Sinatra to Parse JSON data from url

I'm using Sinatra to complete a task.
I need to:
Parse the data of a JSON object from a URL,
Single out one of the attributes of the JSON data and store it as a variable,
Run some arithmetic on the variable,
Return the result as a new variable,
then post this to a new URL as a new JSON object.
I have seen bits and pieces of information all over, including information on parsing JSON data in Ruby and on open-uri, but I believe it would be very valuable to have someone break this down step by step, as most similar solutions are either outdated or steeply complex.
Thanks in advance.
Here's a simple guide. I've done the same task recently.
Let's use this JSON (put it in a file called 'simple.json'):
{
  "name": "obscurite",
  "favorites": {
    "icecream": [
      "chocolate",
      "pistachio"
    ],
    "cars": [
      "ferrari",
      "porsche",
      "lamborghini"
    ]
  },
  "location": "NYC",
  "age": 100
}
Parse the data of a JSON object from a URL.
Step 1 is to add support for JSON parsing:
require 'json'
Step 2 is to load in the JSON data from our new .json file:
json_file = File.read('simple.json')
json_data = JSON.parse(json_file)
Single out one of the attributes of the JSON data and store it as a variable
Our data is a Hash on the outside (curly braces with key/value pairs). The value of 'favorites' is itself a hash, and its values ('icecream' and 'cars') are lists (Arrays in Ruby). So what we have is a hash containing a hash whose values are arrays.
Let's pull out my location:
puts json_data['location'] # NYC
That was easy. It was just a top level key/value. Let's go deeper and pull out my favorite icecream(s):
puts json_data['favorites']['icecream'] # chocolate pistachio
Now only my second favorite car:
puts json_data['favorites']['cars'][1] # porsche
Run some arithmetic on the variable
Step 3. Let's get my age and cut it down by 50 years. Being 100 is tough!
new_age = json_data['age'] / 2
puts new_age
Return the result as a new variable
Step 4. Let's put the new age back into the JSON:
json_data['age'] = new_age
puts json_data['age'] # 50
then post this to a new URL as a new JSON object.
Step 5. Add the ability for your program to do an HTTP POST. Add this up at top:
require 'net/http'
and then you can post anywhere you want. I found a fake web service you could use, if you just want to make sure the request got there.
# use this guy's fake web service page as a test. handy!
uri = URI.parse("http://jsonplaceholder.typicode.com/posts")
header = {'Content-Type' => 'text/json'}
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Post.new(uri.request_uri, header)
request.body = json_data.to_json
response = http.request(request)
# Did we get something back?
puts response.body
On Linux or Mac you can open a localhost port and listen as a test:
nc -4 -k -l -v localhost 1234
To POST to this port change the uri to:
uri = URI.parse("http://localhost:1234")
Hope this helps. Let me know if you get stuck and I'll try to lend a hand. I'm not a Ruby expert, but I wanted to help a fellow explorer. Good luck.

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and built a parser based on Nokogiri.
Basically everything works fine.
I could now add a Ruby class each time somebody wants to add a webpage with a different layout/style.
Instead, I thought about an approach where the user specifies the elements where content is stored using XPath, and this is stored as a sort of recipe for the webpage.
Example: the user wants to scrape a table structure, extracting the rows as hashes (column name => cell content).
I was thinking about writing a Ruby function for extracting this generic table information once:
# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html file as a string
# xpath_table - XPath to the html table holding the data to be extracted
def basic_table(html, xpath_table)
  xpath_headers = "#{xpath_table}/thead/tr/th"
  html_doc = Nokogiri::HTML(html)

  row_headers = html_doc.xpath(xpath_headers).map do |column|
    column.inner_text
  end

  row_contents = Array.new
  # double quotes are needed here so that #{xpath_table} is interpolated
  table_rows = html_doc.xpath("#{xpath_table}/tbody/tr")
  table_rows.each do |table_row|
    cells = table_row.xpath('td').map do |cell|
      cell.inner_text
    end

    row_content_hash = Hash.new
    cells.each_with_index do |cell_string, column_index|
      row_content_hash[row_headers[column_index]] = cell_string
    end
    row_contents << row_content_hash
  end

  return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[@id="grid"]/table[@id="displayGrid"]'/>
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know I can use basic_table to extract the content of the table referenced by the XPath.
This way the user can specify simple recipe scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes, only the recipe script would need to be changed.
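For illustration, the dispatch side could stay tiny. This is only a sketch: RECIPE_HANDLERS and apply_recipe are made-up names, and it assumes each recipe element is named after the extraction function it refers to, as in the example above:

require 'nokogiri'

# Map recipe element names to extraction functions (a made-up registry)
RECIPE_HANDLERS = {
  'basic_table' => method(:basic_table)
}

# Parse a recipe and call the extraction function it references
def apply_recipe(html, recipe_xml)
  node = Nokogiri::XML(recipe_xml).root
  handler = RECIPE_HANDLERS.fetch(node.name)
  handler.call(html, node['xpath'])
end

# html is the page's source string, as passed to basic_table above
rows = apply_recipe(html, File.read('my_site.recipe.xml'))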
I was wondering how someone else would approach this. Rules/rule engines come to mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Posting data on website using Mechanize Nokogiri Selenium

I need to post data on a website through a program.
To achieve this I am using Mechanize, Nokogiri and Selenium.
Here's my code:
def aeiexport
  # First, Mechanize submits the form to identify yourself on the website
  agent = Mechanize.new
  agent.get("https://www.glou.com")
  form_login_AEI = agent.page.forms.first
  form_login_AEI.util_vlogin = "42"
  form_login_AEI.util_vpassword = "666"

  # This is supposed to submit the form, I think
  page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)

  # To be able to scrape the page you end up on after submitting the form
  body = page_compet_list.body
  html_body = Nokogiri::HTML(body)

  # tds gives back an array of td
  tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")

  # Checking my array of td with some condition
  tds.each do |td|
    link = td.children.first # select the first child
    # Only consider the html part of the link; if it matches, follow the previous link
    if link.inner_html == "2015 32 92 0076 012"
      previous_td = td.previous
      previous_url = previous_td.children.first['href']

      # Following the link contained in previous_url
      page_selected_compet = agent.get(previous_url)

      # To be able to scrape the page I end up on
      body = page_selected_compet.body
      html_body = Nokogiri::HTML(body)
      joueur_access = html_body.search('#tabs0head2 a')
      # Clicking on the link
      joueur_access.click

      rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
      pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence.first['href']

      # Following pure_link_rechercher_par_numéro_de_licence
      page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
      body_submit_licence = page_submit_licence.body
      html_body = Nokogiri::HTML(body_submit_licence)

      # Posting my data in the right field
      form.field_with(:name => 'lic_cno[0]').value = "9511681"
    end
  end
end
1) So far, what do you think about this code? Do you think there is an error in there?
2) This part is the one I am really not sure about: I have put my data in the right field, but now I need to submit it. The problem is that the button I need to click looks like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function onclick. I am trying to use Selenium to trigger the click event. Then I end up on another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me whether Selenium will enable me to do what I need to do, and if so, how I can do it?
At a glance, your code could use less indentation and more whitespace/empty lines to separate the internal logic of aeiexport (which should be renamed to aei_export, since Ruby uses snake case for method names; you can find more recommendations on how to style Ruby code here).
Besides the style of your code, an error I found at the beginning of your method is the use of an undefined variable page when defining form_login_AEI.
For your second question: I'm not familiar with Selenium; however, since it drives a real web browser, it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (i.e. in Firebug), understand what the JavaScript on the page does, and then use Mechanize to follow the link manually.
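If you do go the Selenium route, note that :value is not a locator strategy in the Ruby selenium-webdriver bindings, which may be why the clicks above never fired; matching the value attribute through XPath is one way around that. A rough, untested sketch (the URL is a placeholder):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get "https://www.glou.com" # placeholder URL

# Locate the button by its value attribute via XPath, since :value is not a locator
driver.find_element(:xpath, "//input[@value='Rechercher']").click

# Wait for the next page to render before clicking further
wait = Selenium::WebDriver::Wait.new(timeout: 10)
wait.until { driver.find_element(:name, 'sel').displayed? }
driver.find_element(:name, 'sel').click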

YQL Yahoo Finance Scraper on XML in Ruby

I am using a YQL query (the standard example query, with GOOG, YHOO, MSFT and AAPL) to generate XML for all of the available fields. I wanted to scrape the YQL site for the XML output once it is generated, using a Ruby script, so that I could run it over and over again for different stocks and store the data somewhere. I haven't finished my script yet, but what I have just doesn't seem to run. Here is the code:
yahoo_finance_scrape.rb
require 'rubygems'
require 'nokogiri'
require 'restclient'
PAGE_URL = "http://developer.yahoo.com/yql/console/"
yql_query = 'use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") '
if page = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
  puts "YQL query: #{yql_query}, is valid"
  xml_output = Nokogiri::HTML(page)
  lines = xml_output.css('#container #layout-doc #yui-gen3000008 #yui-gen3000009 #yui_3_11_0_3_1393417778356_354
                          #yui-gen3000015 #yui-gen3000016 div#yui_3_11_0_2_1393417778356_10 #centerBottomView
                          #outputContainer div#output #outputTabContent #formattedView #viewContent #prexml')
  lines.each do |line|
    puts line.css('span').map{|span| span.text}.join(' ')
    sleep 0.03
  end
end
When I run the program, it only prints
"YQL query: use "http://github.com/spullara/yql-tables/raw/d60732fd4fbe72e5d5bd2994ff27cf58ba4d3f84/yahoo/finance/yahoo.finance.quotes.xml"
as quotes; select * from quotes where symbol in ("YHOO","AAPL","GOOG","MSFT") , is valid"
and then just stops. Oh, I am using that GitHub URL because yahoo.finance.quotes was not working, and someone else on Stack Overflow suggested using it.
If you want to check the CSS selectors, just go to http://developer.yahoo.com/yql/console/, enter my query, and inspect the element on the output. I would post it here, but I don't know how.
The output is just the content of your yql_query var, so this does not help much.
You probably should not put the "use xxxx as quotes" as a string in your code.
Check out what "someone else" had in mind.
The RestClient.post() method returns a response object. With all HTTP operations, always check the response.code; otherwise you won't know about errors.
response = RestClient.post(...)
puts "HTTP Response code: #{response.code}"
if response.code == 200
  page = response.to_str
  # ...
end
According to the Nokogiri website, the xml_output.css() method filters like a CSS selector. If you have, for example, "#container #layout-doc", this means "filter elements with the id 'layout-doc' inside elements with the id 'container'", and so on. Is this really what you intend to do? If yes, the last "#prexml" should be enough and much less error-prone, since ids should normally be unique.
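Putting those two points together, a minimal sketch of the fetch-and-parse part could look like this. It reuses PAGE_URL and yql_query from your script and assumes the rendered XML is actually present under #prexml in the raw response; given the auto-generated "yui-gen" ids, it is worth verifying that the console doesn't build that output with JavaScript, in which case it will never appear in what RestClient sees:

require 'nokogiri'
require 'restclient'

response = RestClient.post(PAGE_URL, {'name' => yql_query, 'submit' => 'Test'})
puts "HTTP Response code: #{response.code}"

if response.code == 200
  doc = Nokogiri::HTML(response.to_str)
  # Select by the one stable id instead of the long auto-generated chain
  doc.css('#prexml span').each { |span| puts span.text }
end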
