Ruby RSS/Atom creation - including content

Ruby RSS/Atom creation - including content - ruby

I am creating an Atom feed using ruby's stdlib rss library. This library is essentially undocumented , but I have it working using the example provided on this page:
require 'rss'
rss = RSS::Maker.make("atom") do |m|
m.channel.author = "Steve Wattam"
m.channel.updated = Time.now
m.channel.about = "http://stephenwattam.com/blog/"
m.channel.title = "Steve W's Blog"
storage.posts.each do |p|
m.items.new_item do |item|
item.link = p.link
item.title = p.title
item.updated = p.edited
item.pubDate = p.date
item.summary = p.summary
end
end
end
This works fine. I am unable, however, to add a content element. There is no such thing as item.content=, and I can't seem to find any example code online---a browse of the source indicates that content is stored in the item (docs here), but I lack the knowledge to tease it out.
Does anyone know how I might go about adding a content element?
Incidentally, I'm aware other libraries exist to do this, but would ideally like to get this working without requiring any gems.

By digging through the source of the library, I've discovered that item.content yields an object of type RSS::Maker::Atom::Feed::Items::Item::Content. It's possible to set the content on that object:
item.content.content = 'text to set as content'
This object also responds to #xml_content.
Hope this helps someone!

Related

Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title

What I'm trying to do
I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.
The problems I'm having
I've gotten this to work but I can't figure out two problems in the process_words method:
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokigiri, even though that seems to be the tool of choice here.
If a post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?
The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.
Example result
We have a blog with 3 posts:
Hobbies
Food
Bicycles
And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:
I love mountain biking and bicycles in general.
The plugin would process that sentence and output it as:
I love mountain biking and bicycles in general.
My current code (UPDATED 1)
# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'
module Jekyll
# Replace the first occurance of each post title in the content with the post's title hyperlink
module HyperlinkFirstWordOccurance
POST_CONTENT_CLASS = "page__content"
BODY_START_TAG = "<body"
ASIDE_START_TAG = "<aside"
OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!
class << self
# Public: Processes the content and updates the
# first occurance of each word that also has a post
# of the same title, into a hyperlink.
#
# content - the document or page to be processes.
def process(content)
#title = content.data['title']
#posts = content.site.posts
content.output = if content.output.include? BODY_START_TAG
process_html(content)
else
process_words(content.output)
end
end
# Public: Determines if the content should be processed.
#
# doc - the document being processes.
def processable?(doc)
(doc.is_a?(Jekyll::Page) || doc.write?) &&
doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
end
private
# Private: Processes html content which has a body opening tag.
#
# content - html to be processes.
def process_html(content)
content.output = if content.output.include? ASIDE_START_TAG
head, opener, tail = content.output.partition(CLOSING_ASIDE_TAG_REGEX)
else
head, opener, tail = content.output.partition(POST_CONTENT_CLASS)
end
body_content, *rest = tail.partition("</body>")
processed_markup = process_words(body_content)
content.output = String.new(head) << opener << processed_markup << rest.join
end
# Private: Processes each word of the content and makes
# the first occurance of each word that also has a post
# of the same title, into a hyperlink.
#
# html = the html which includes all the content.
def process_words(html)
page_content = html
#posts.docs.each do |post|
post_title = post.data['title'] || post.name
post_title_lowercase = post_title.downcase
if post_title != #title
if page_content.include?(" " + post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + " ") ||
page_content.include?(post_title_lowercase + ",") ||
page_content.include?(post_title_lowercase + ".")
page_content = page_content.sub(post_title_lowercase, "#{ post_title.downcase }")
elsif page_content.include?(" " + post_title + " ") ||
page_content.include?(post_title + " ") ||
page_content.include?(post_title + ",") ||
page_content.include?(post_title + ".")
page_content = page_content.sub(post_title, "#{ post_title }")
end
end
end
page_content
end
end
end
end
Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
# code to call after Jekyll renders a post
Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end
Update 1
Updated my code with #Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.
Also improved checking and replacing the correct term.
PS: The code base example I started with working on my plugin was #Keith Mifsud's jekyll-target-blank plugin

this code looks very familiar :) I suggest you look into the Rspecs test file to test against your issues: https://github.com/keithmifsud/jekyll-target-blank
I'll try to answer your questions, sorry I couldn't test these myself the time of writing.
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokigiri, even though that seems to be the tool of choice here.
Your requirements here are:
1) Ignore content outside the <body></body> tags.
This seems to already be implemented in the process_html() method. This method is stating the only process the body_content and it should work as it is. Have you got tests for it? How are you debugging it? The same string splitting works in my plugin. I.e. only content inside the body is processed.
2) Ignore content inside the Table of Contents (TOC).
I suggest you extend the process_html() method by further splitting the body_content variable. Search for content in between the opening and closing tags of your TOC (by id, css class etc..) and exclude it, then add it back in it's position before or after process_words string.
3) Whether to use the Nokigiri plugin?
This plugin is great for parsing html. I think you are parsing strings and then creating html. So vanilla Ruby and the URI plugin should suffice. You can still use it if you want but it won't be any faster then splitting strings in ruby.
If a post's URL is not at post.data['url'], where is it?
I think you should a have method to get all all post titles and then match the "words" against the array. You can get all the posts collection from the doc itself doc.site.posts and foreach post return the title. The the process_words() method can check each work to see if it matched an item from the array. But what if the title is made of more than one word?
Also, is there a more efficient, cleaner way to do this?
So far so good. I'll start with getting the issues fixed and then refactor for speed and coding standards.
Again I suggest you use testing to help you with this.
Let me know if I can help more :)

Concept for recipe-based parsing of webpages needed

I'm working on a web-scraping solution that grabs totally different webpages and lets the user define rules/scripts in order to extract information from the page.
I started scraping from a single domain and build a parser based on Nokogiri.
Basically everything works fine.
I could now add a ruby class each time somebody wants to add a webpage with a different layout/style.
Instead I thought about using an approach where the user specifies elements where content is stored using xpath and storing this as a sort of recipe for this webpage.
Example: The user wants to scrape a table-structure extracting the rows using a hash (column-name => cell-content)
I was thinking about writing a ruby function for extraction of this generic table information once:
# extracts a table's rows as an array of hashes (column_name => cell content)
# html - the html-file as a string
# xpath_table - specifies the html table as xpath which hold the data to be extracted
def basic_table(html, xpath_table)
xpath_headers = "#{xpath_table}/thead/tr/th"
html_doc = Nokogiri::HTML(html)
html_doc = Nokogiri::HTML(html)
row_headers = html_doc.xpath(xpath_headers)
row_headers = row_headers.map do |column|
column.inner_text
end
row_contents = Array.new
table_rows = html_doc.xpath('#{xpath_table}/tbody/tr')
table_rows.each do |table_row|
cells = table_row.xpath('td')
cells = cells.map do |cell|
cell.inner_text
end
row_content_hash = Hash.new
cells.each_with_index do |cell_string, column_index|
row_content_hash[row_headers[column_index]] = cell_string
end
row_contents << [row_content_hash]
end
return row_contents
end
The user could now specify a website-recipe-file like this:
<basic_table xpath='//div[#id="grid"]/table[#id="displayGrid"]'
The function basic_table is referenced here, so that by parsing the website-recipe-file I would know that I can use the function basic_table to extract the content from the table referenced by the xPath.
This way the user can specify simple recipe-scripts and only has to dive into writing actual code if he needs a new way of extracting information.
The code would not change every time a new webpage needs to be parsed.
Whenever the structure of a webpage changes only the recipe-script would need to be changed.
I was thinking that someone might be able to tell me how he would approach this. Rules/rule engines pop into my mind, but I'm not sure if that really is the solution to my problem.
Somehow I have the feeling that I don't want to "invent" my own solution to handle this problem.
Does anybody have a suggestion?
J.

Posting data on website using Mechanize Nokogiri Selenium

I need to post data on a website through a program.
To achieve this I am using Mechanize Nokogiri and Selenium.
Here's my code :
def aeiexport
# first Mechanize is submitting the form to identify yourself on the website
agent = Mechanize.new
agent.get("https://www.glou.com")
form_login_AEI = agent.page.forms.first
form_login_AEI.util_vlogin = "42"
form_login_AEI.util_vpassword = "666"
# this is suppose to submit the form I think
page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
#to be able to scrap the page you end up on after submitting form
body = page_compet_list.body
html_body = Nokogiri::HTML(body)
#tds give back an array of td
tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
# Checking my array of td with some condition
tds.each do |td|
link = td.children.first # Select the first children
if link.html = "2015 32 92 0076 012"
# Only consider the html part of the link, if matched follow the previous link
previous_td = td.previous
previous_url = previous_td.children.first.href
#following the link contained in previous_url
page_selected_compet = agent.get(previous_url)
# to be able to scrap the page I end up on
body = page_selected_compet.body
html_body = Nokogiri::HTML(body)
joueur_access = html_body.search('#tabs0head2 a')
# clicking on the link
joueur_access.click
rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
#following pure_link_rechercher_par_numéro_de_licence
page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
body_submit_licence = page_submit_licence.body
html_body = Nokogiri::HTML(body_submit_licence)
#posting my data in the right field
form.field_with(:name => 'lic_cno[0]') == "9511681"
1) So far what do you think about this code, Do you think there is an error in there
2) This part is the one I am really not sure about : I have posted my data in the right field but now I need to submit it. The problem is that the button I need to click is like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
it triggers a javascript function onclick. I am triying Selenium to trigger the click event. Then I end up on another page, where I need to click a few more times.. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me if selenium will enable me to do what I need to do. If can I do it ?

At a glance your code can use less indentation and more white space/empty lines to separate the internal logic of AEIexport (which should be changed to aei_export since Ruby uses snake case for method names. You can find more recommendations on how to style ruby code here).
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (i.e. in Firebug) and understand what the JavaScript on the page does. Then use Mechanize to follow the link manually.

is there any plugin for complicated searching with mongoid (like meta_search for ActiveRecord)?

I tried meta_search, but after adding "include MetaSearch::Searches::ActiveRecord" into my model, it raised an error as "undefined method `joins_values'" when run "MyModel.search(params[:search])"
I think I dont need full text, so I think following gems are not suitable for my project now::
mongoid_fulltext
mongoid-sphinx
sunspot_mongoid
mongoid_search
I tried a old gem named scoped-search
I can make it work for example:
get :search do
#search = Notification.scoped_search(params[:search]
search_scope = #search.scoped
defaul_scope = current_user.notifications
result_scope = search_scope.merge defaul_scope
#notifications = result_scope
render 'notifications/search'
end
but it will be allow to call any scopes in my model.
Is there any "best practice" for doing this job ?

If you want limit the scope you want use on your scoped_search you can filter your params[:search] like :
def limit_scope_search
params[:search].select{|k,v| [:my_scope, :other_scope_authorized].include?(k) }
end

How do you combine PDFs in ruby?

This was asked in 2008. Hopefully there's a better answer now.
How can you combine PDFs in ruby?
I'm using the pdf-stamper gem to fill out a form in a PDF. I'd like to take n PDFs, fill out a form in each of them, and save the result as an n-page document.
Can you do this with a native library like prawn? Can you do this with rjb and iText? pdf-stamper is a wrapper on iText.
I'd like to avoid using two libraries (i.e. pdftk and iText), if possible.

As of 2013 you can use Prawn to merge pdfs. Gist: https://gist.github.com/4512859
class PdfMerger
def merge(pdf_paths, destination)
first_pdf_path = pdf_paths.delete_at(0)
Prawn::Document.generate(destination, :template => first_pdf_path) do |pdf|
pdf_paths.each do |pdf_path|
pdf.go_to_page(pdf.page_count)
template_page_count = count_pdf_pages(pdf_path)
(1..template_page_count).each do |template_page_number|
pdf.start_new_page(:template => pdf_path, :template_page => template_page_number)
end
end
end
end
private
def count_pdf_pages(pdf_file_path)
pdf = Prawn::Document.new(:template => pdf_file_path)
pdf.page_count
end
end

After a long search for a pure Ruby solution, I ended up writing code from scratch to parse and combine/merge PDF files.
(I feel it is such a mess with the current tools - I wanted something native but they all seem to have different issues and dependencies... even Prawn dropped the template support they use to have)
I posted the gem online and you can find it at GitHub as well.
you can install it with:
gem install combine_pdf
It's very easy to use (with or without saving the PDF data to a file).
For example, here is a "one-liner":
(CombinePDF.load("file1.pdf") << CombinePDF.load("file2.pdf") << CombinePDF.load("file3.pdf")).save("out.pdf")
If you find any issues, please let me know and I will work on a fix.

Use ghostscript to combine PDFs:
options = "-q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite"
system "gs #{options} -sOutputFile=result.pdf file1.pdf file2.pdf"

I wrote a ruby gem to do this — PDF::Merger. It uses iText. Here's how you use it:
pdf = PDF::Merger.new
pdf.add_file "foo.pdf"
pdf.add_file "bar.pdf"
pdf.save_as "combined.pdf"

Haven't seen great options in Ruby- I got best results shelling out to pdftk:
system "pdftk #{file_1} multistamp #{file_2} output #{file_combined}"

We're closer than we were in 2008, but not quite there yet.
The latest dev version of Prawn lets you use an existing PDF as a template, but not use a template over and over as you add more pages.

Via iText, this will work... though you should flatten the forms before you merge them to avoid field name conflicts. That or rename the fields one page at a time.
Within PDF, fields with the same name share a value. This is usually not the desired behavior, though it comes in handy from time to time.
Something along the lines of (in java):
PdfCopy mergedPDF = new PdfCopy( new Document(), new FileOutputStream( outPath );
for (String path : paths ) {
PdfReader reader = new PdfReader( path );
ByteArrayOutputStream curFormOut = new ByteArrayOutputStream();
PdfStamper stamper = new PdfStamper( reader, curFormOut );
stamper.setField( name, value ); // ad nauseum
stamper.setFlattening(true); // flattening setting only takes effect during close()
stamper.close();
byte curFormBytes = curFormOut.toByteArray();
PdfReader combineMe = new PdfReader( curFormBytes );
int pages = combineMe .getNumberOfPages();
for (int i = 1; i <= pages; ++i) { // "1" is the first page
mergedForms.addPage( mergedForms.getImportedPage( combineMe, i );
}
}
mergedForms.close();

If you want to add any template (created by macOS Pages or Google Docs) using the combine_pdf gem then you can try with this:
final_pdf = CombinePDF.new
company_template = CombinePDF.load(template_file.pdf).pages[0]
pdf = CombinePDF.load (content_file.pdf)
pdf.pages.each {|page| final_pdf << (company_template << page)}
final_pdf.save "final_document.pdf"

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio