How to split a string when scraping in Ruby: undefined method error - ruby

I am scraping the website https://www.bananatic.com/de/forum/games/. I want to extract only the year of the dates.
require 'nokogiri'
require 'open-uri'
require 'pp'
unless File.readable?('data.html')
  url = 'https://www.bananatic.com/de/forum/games/'
  data = URI.open(url).read
  File.open('data.html', 'wb') { |f| f << data }
end
data = File.read('data.html')
document = Nokogiri::HTML(data)
links3 = document.css('.topics ul li div')
re = links3.map do |lk3|
  name = lk3.css('.name').children.text.strip.split("\n")[2]
end
date = ' '
size_dates = re.length
(0..size_dates).each do |i|
  unless i.nil?
    date = re[i]
    print date
  end
end
Running this prints the dates, apparently as Strings in the following format:
day.month.year, hour:minutes
I only need the year. I tried to split the string, but I get an error.

The problem shows up if you look at the output of this block:
re = links3.map do |lk3|
  lk3.css('.name').children.text.strip.split("\n")[2]
end
You will see:
[" 07.08.2016, 13:47", nil, nil, nil, nil, " 06.08.2016, 9:24", nil, nil, nil, nil,...]
Calling a String method such as split on one of those nil entries raises the undefined method error. You can solve the immediate issue by adding .compact to the end or by switching map to filter_map.
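A minimal sketch of that fix, run on a hard-coded sample instead of the live page. The RAW array and the extract_years helper are made up for illustration, and filter_map needs Ruby 2.7 or later.

```ruby
# Hypothetical sample mimicking the mixed output of the map block above.
RAW = [" 07.08.2016, 13:47", nil, nil, " 06.08.2016, 9:24", nil].freeze

# Illustrative helper (not part of the original code): skip the nil
# entries and return the four-digit year of every remaining string.
def extract_years(entries)
  entries.filter_map { |text| text && text[/\d{4}/] }
end

p extract_years(RAW)

# The same idea with compact and split, closer to the approach in the question.
p RAW.compact.map { |text| text.strip.split(',').first.split('.').last }
```

Both lines print the years as strings; the nil entries are dropped before any String method is called on them.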
That being said here is another way to solve your issue:
You can get just the year from that text on that page using the following:
require 'nokogiri'
require 'open-uri'
url = "https://www.bananatic.com/de/forum/games/"
doc = Nokogiri::HTML(URI.open(url))
doc
  .xpath('//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]')
  .map { |node| node.to_s[/\d{4}/] }
#=> ["2016", "2016", "2022", "2022", "2022", "2021", "2022", "2017", "2022", "2021", "2019", "2016", "2021", "2021", "2021", "2021", "2020", "2021", "2017", "2021"]
The two parts are:
//div[@class="name"]/text()[string-length(normalize-space(.)) > 0] - the XPath, which finds all divs with the class "name" and then selects those of their text nodes that are non-empty once whitespace is trimmed.
.map {|node| node.to_s[/\d{4}/]} - maps these into an array by slicing each String with a regex for 4 contiguous digits.
If you would like the XPath to be as specific as the selector in your post, you can use:
'//div[@class="topics"]/ul/li//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]'

You could also use a regex to pull out the year once you have the list.
This works as long as the dates follow the pattern you showed, because the year is then the only run of exactly four digits.
Example:
17.01.2023, 17:40
Here \b\d{4}\b matches 2023.
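As a rough sketch of how that pattern could be applied in Ruby (the year_of helper name is only illustrative):

```ruby
# Illustrative helper: return the first standalone four-digit number in a
# date string such as "17.01.2023, 17:40", or nil when there is none.
def year_of(date_text)
  date_text[/\b\d{4}\b/]
end

p year_of("17.01.2023, 17:40")  # prints "2023"
p year_of("no date here")       # prints nil
```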


Unscrambling a string given the number of splits and words that the sentence can be comprised of

I'm working on a problem in which I'm given a string that has been scrambled. The scrambling works like this:
An original string is chopped into substrings at random positions and a random number of times.
Each substring is then moved around randomly to form a new string.
I'm also given a dictionary of words that are possible words in the string.
Finally, I'm given the number of splits that were made in the string.
The example I was given is this:
dictionary = ["world", "hello"]
scrambled_string = "rldhello wo"
splits = 1
The expected output of my program would be the original string, in this case:
"hello world"
Suppose the initial string
"hello my name is Sean"
with
splits = 2
yields
["hel", "lo my name ", "is Sean"]
and those three pieces are shuffled to form the following array:
["lo my name ", "hel", "is Sean"]
and then the elements of this array are joined to form:
scrambled = "lo my name helis Sean"
Also suppose:
dictionary = ["hello", "Sean", "the", "name", "of", "my", "cat", "is", "Sugar"]
First convert dictionary to a set to speed lookups.
require 'set'
dict_set = dictionary.to_set
#=> #<Set: {"hello", "Sean", "the", "name", "of", "my", "cat", "is", "Sugar"}>
Next I will create a helper method.
def indices_to_ranges(indices, last_index)
  [-1, *indices, last_index].each_cons(2).map { |i,j| i+1..j }
end
Suppose we split scrambled twice (because splits #=> 2), specifically after the 'y' and the 'h':
indices = [scrambled.index('y'), scrambled.index('h')]
#=> [4, 11]
Inside indices_to_ranges the list of boundaries always starts with -1 and ends with scrambled.size-1, so the first range begins at index 0 and the last range ends at the final character.
We may then use indices_to_ranges to convert these indices to ranges of indices of characters in scrambled:
ranges = indices_to_ranges(indices, scrambled.size-1)
#=> [0..4, 5..11, 12..20]
a = ranges.map { |r| scrambled[r] }
#=> ["lo my", " name h", "elis Sean"]
We could of course combine these two steps:
a = indices_to_ranges(indices, scrambled.size-1).map { |r| scrambled[r] }
#=> ["lo my", " name h", "elis Sean"]
Next I will permute the values of a. For each permutation I will join the elements to form a string, then split the string on single spaces to form an array of words. If all of those words are in the dictionary we may claim success and are finished. Otherwise, a different array indices will be constructed and we try again, continuing until success is realized or all possible arrays indices have been considered. We can put all this in the following method.
def unscramble(scrambled, dict_set, splits)
  last_index = scrambled.size-1
  (0..scrambled.size-2).to_a.combination(splits).each do |indices|
    indices_to_ranges(indices, last_index).
      map { |r| scrambled[r] }.
      permutation.each do |arr|
        next if arr[0][0] == ' ' || arr[-1][-1] == ' '
        words = arr.join.split(' ')
        return words if words.all? { |word| dict_set.include?(word) }
      end
  end
end
Let's try it.
# original string: "hello my name is Sean"
scrambled = "lo my name helis Sean"
splits = 4
unscramble(scrambled, dict_set, splits)
#=> ["my", "name", "hello", "is", "Sean"]
See Array#combination and Array#permutation.
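For convenience, here is a self-contained sketch of the same approach applied to the one-split example from the question. It repeats the two methods above with one addition that is not in the answer: an explicit nil return when no arrangement consists solely of dictionary words.

```ruby
require 'set'

# Turn split positions into the ranges of the pieces they delimit.
def indices_to_ranges(indices, last_index)
  [-1, *indices, last_index].each_cons(2).map { |i, j| (i + 1)..j }
end

# Try every choice of split positions and every ordering of the resulting
# pieces; return the words of the first arrangement found in the dictionary.
def unscramble(scrambled, dict_set, splits)
  last_index = scrambled.size - 1
  (0..scrambled.size - 2).to_a.combination(splits).each do |indices|
    pieces = indices_to_ranges(indices, last_index).map { |r| scrambled[r] }
    pieces.permutation.each do |arr|
      # A sentence cannot start or end with a space.
      next if arr[0][0] == ' ' || arr[-1][-1] == ' '
      words = arr.join.split(' ')
      return words if words.all? { |word| dict_set.include?(word) }
    end
  end
  nil # added for this sketch: nothing matched
end

p unscramble("rldhello wo", Set["world", "hello"], 1)  # prints ["hello", "world"]
```

Because the method returns the first arrangement whose words are all in the dictionary, longer inputs may come back with the right words in a different order from the original sentence.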
bonkers answer (not quite perfect yet ... trouble with single chars):
#
# spaces appear to be important!
@check = {}
@ordered = []
def previous_words(word)
  @check.select{|y,z| z[:previous] == word}.map do |nw,z|
    @ordered << nw
    previous_words(nw)
  end
end
def in_word(dictionary, string)
  # check each word in the dictionary to see if the string is contained in one of them
  dictionary.each do |word|
    if word.include?(string)
      return word
    end
  end
  return nil
end
letters=scrambled.split("")
previous=nil
substr=""
letters.each do |l|
  if in_word(dictionary, substr+l)
    substr+= l
  elsif (l==" ")
    word=in_word(dictionary, substr)
    @check[word]={found: 1}
    @check[word][:previous] = previous if previous
    substr=""
    previous=word
  else
    word=in_word(dictionary, substr)
    @check[word]={found: 1}
    @check[word][:previous] = previous if previous
    substr=l
    previous=nil
  end
end
word=in_word(dictionary, substr)
@check[word]={found: 1}
@check[word][:previous] = previous if previous
@check.select{|y,z| z[:previous].nil?}.map do |w,z|
  @ordered << w
  previous_words(w)
end
pp @ordered
output:
dictionary = ["world", "hello"]
scrambled = "rldhello wo"
... my code here ...
2.5.8 :817 > @ordered
=> ["hello", "world"]
dictionary = ["hello", "my", "name", "is", "Sean"]
scrambled = "me is Shelleano my na"
... my code here ...
2.5.8 :879 > @ordered
=> ["Sean", "hello", "my", "name", "is"]

manipulating csv with ruby

I have a CSV from which I've removed the irrelevant data.
Now I need to split "Name and surname" into 2 columns by space, ignoring a 3rd column in case there are 3 names. Then I need to swap the order of the columns "Name and surname" and "Phone" (phone first) and write the result to a file without the headers. I've never actually learned Ruby, but I played with Python 10 years ago. Can you help me? This is what I have managed so far:
E.g.
require 'csv'
csv_table = CSV.read(ARGV[0], :headers => true)
keep = ["Name and surname", "Phone", "Email"]
new_csv_table = csv_table.by_col!.delete_if do |column_name,column_values|
  !keep.include? column_name
end
new_csv_table.to_csv
Begin by creating a CSV file.
str =<<~END
  Name and surname,Phone,Email
  John Doe,250-256-3145,John@Doe.com
  Marsha Magpie,250-256-3154,Marsha@Magpie.com
END
File.write('t_in.csv', str)
#=> 109
Initially, let's read the file, add two columns, "Name" and "Surname", and optionally delete the column, "Name and surname", without regard to column order.
First read the file into a CSV::Table object.
require 'csv'
tbl = CSV.read('t_in.csv', headers: true)
#=> #<CSV::Table mode:col_or_row row_count:3>
Add the new columns.
tbl.each do |row|
  row["Name"], row["Surname"] = row["Name and surname"].split
end
#=> #<CSV::Table mode:col_or_row row_count:3>
Note that if row["Name and surname"] had equaled "John Paul Jones", we would have obtained row["Name"] #=> "John" and row["Surname"] #=> "Paul".
If the column "Name and surname" is no longer required we can delete it.
tbl.delete("Name and surname")
#=> ["John Doe", "Marsha Magpie"]
Write tbl to a new CSV file.
CSV.open('t_out.csv', "w") do |csv|
  csv << tbl.headers
  tbl.each { |row| csv << row }
end
#=> #<CSV::Table mode:col_or_row row_count:3>
Let's see what was written.
puts File.read('t_out.csv')
displays
Phone,Email,Name,Surname
250-256-3145,John@Doe.com,John,Doe
250-256-3154,Marsha@Magpie.com,Marsha,Magpie
Now let's rearrange the order of the columns.
header_order = ["Phone", "Name", "Surname", "Email"]
CSV.open('t_out.csv', "w") do |csv|
  csv << header_order
  tbl.each { |row| csv << header_order.map { |header| row[header] } }
end
#=> #<CSV::Table mode:col_or_row row_count:3>
puts File.read('t_out.csv')
displays
Phone,Name,Surname,Email
250-256-3145,John,Doe,John@Doe.com
250-256-3154,Marsha,Magpie,Marsha@Magpie.com
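Since the question asks for output without the header line, the steps can also be combined into one pass that never writes the headers. This is only a sketch under a few assumptions: it works on CSV text in memory rather than on files, the phone_first helper name and the example.com addresses are made up, and only the first two words of the name are kept.

```ruby
require 'csv'

# Illustrative helper: take CSV text with a header row and return CSV text
# with the columns Phone, Name, Surname, Email and no header line.
def phone_first(csv_text)
  CSV.generate do |out|
    CSV.parse(csv_text, headers: true).each do |row|
      # Parallel assignment keeps the first two words and drops any third.
      name, surname = row["Name and surname"].split
      out << [row["Phone"], name, surname, row["Email"]]
    end
  end
end

sample = <<~END
  Name and surname,Phone,Email
  John Doe,250-256-3145,john@example.com
  John Paul Jones,250-256-3154,jpj@example.com
END

puts phone_first(sample)
```

To write the result to disk, File.write('t_out.csv', phone_first(File.read('t_in.csv'))) would be one option.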

Storing in CSV file - ruby separator

I am trying to store the results from my scraping exercise in a CSV file.
The current CSV file gives me the following output :
Name of Movie 1
Rating 1
Name of Movie 2
Rating 2
I would like to get the following output :
Name of Movie 1 Rating 1
Name of Movie 2 Rating 2
Here is my code. I guess it has to do with the row / column separator:
require 'open-uri'
require 'nokogiri'
require 'csv'
array = []
for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    array << element.search('.no_underline').inner_text
    element.search('.note').each do |data|
      array << data.inner_text
    end
  end
end
puts array
csv_options = { row_sep: ',', force_quotes: true, quote_char: '"' }
filepath = 'allocine.csv'
CSV.open(filepath, 'wb', csv_options) do |csv|
  array.each { |item| csv << [item] }
end
I think the problem here is that you are not pushing the elements correctly into your array variable. Basically, your array ends up looking like this:
['Movie 1 Title', 'Movie 1 rating', 'Movie 2 Title', 'Movie 2 rating', ...]
What you actually want is an array of arrays, like so:
[
  ['Movie 1 Title', 'Movie 1 rating'],
  ['Movie 2 Title', 'Movie 2 rating'],
  ...
]
And once your array is correctly set, you don't even need to specify a row separator in your CSV options.
The following should do the trick:
require 'open-uri'
require 'nokogiri'
require 'csv'
array = []
# pages 1 through 10, as in the original loop
(1..10).each do |i|
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
  html_file = URI.open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    title = element.search('.no_underline').inner_text.strip
    notes = element.search('.note').map { |note| note.inner_text }
    # one row per movie: [title, rating, rating, ...]
    array << [title, notes].flatten
  end
end
puts array
filepath = 'allocine.csv'
csv_options = { force_quotes: true, quote_char: '"' }
# the options hash must be passed as keyword arguments on Ruby 3.0+
CSV.open(filepath, 'w', **csv_options) do |csv|
  array.each do |item|
    csv << item
  end
end
(I also took the liberty of changing your for loop to an each over the range, which is more Ruby-like ;) )
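If the flat array from the question has already been built, it can also be regrouped afterwards. A small sketch, assuming every title is followed by exactly one rating; the rows_to_csv helper name is hypothetical, and CSV.generate is used so the example runs without touching the network or the file system.

```ruby
require 'csv'

# Illustrative helper: pair up a flat [title, rating, title, rating, ...]
# array and render one CSV line per pair.
def rows_to_csv(flat)
  CSV.generate(force_quotes: true) do |csv|
    flat.each_slice(2) { |title, rating| csv << [title, rating] }
  end
end

puts rows_to_csv(["Name of Movie 1", "Rating 1", "Name of Movie 2", "Rating 2"])
```

Each movie now ends up on its own line with the title and the rating in separate columns, using the default row separator.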

Scraping the web : data separator needed

I am trying to scrape the allocine website as an exercise and my output is the following:
Movie Name
Rating 1 Rating 2
Example :
Coco
4,14,6
Forrest Gump
2,64,6
It should instead be:
Movie Name
Rating 1
Rating 2
Hope you can help me !
require 'open-uri'
require 'nokogiri'
require 'csv'
array = []
for i in 1..10
  url = "http://www.allocine.fr/film/meilleurs//?page=#{i}"
  html_file = open(url).read
  html_doc = Nokogiri::HTML(html_file)
  html_doc.search('.img_side_content').each do |element|
    array << element.search('.no_underline').inner_text
    array << element.search('.note').inner_text
  end
end
puts array
csv_options = { col_sep: ',', force_quotes: true, quote_char: '"' }
filepath = 'allocine.csv'
CSV.open(filepath, 'wb', csv_options) do |csv|
  array.each { |item| csv << [item] }
end
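You are calling inner_text on the whole set of nodes returned by search('.note'), which concatenates the text of every matching node. That is why the two ratings run together in the console (4,1 and 4,6 become 4,14,6).
Instead, iterate over the nodes with each and push every rating into your array separately:
element.search('.note').each do |data|
  array << data.inner_text
end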

Assign hash key and value from string using split

I have a few strings that I am retrieving from a file birthdays.txt. An example of a string is below:
Christopher Alexander, Oct 4, 1936
I would like to separate the strings and let variable name be a hash key and birthdate the hash value. Here is my code:
birthdays = {}
File.read('birthdays.txt').each_line do |line|
  line = line.chomp
  name, birthdate = line.split(/\s*,\s*/).first
  birthdays = {"#{name}" => "#{birthdate}"}
  puts birthdays
end
I managed to assign name to the key. However, birthdate returns "".
File.new('birthdays.txt').each.with_object({}) do |line, birthdays|
  birthdays.store(*line.chomp.split(/\s*,\s*/, 2))
  puts birthdays
end
I feel like some of the other solutions are overthinking this a bit. All you need to do is split each line into two parts, the part before the first comma and the part after, which you can do with line.split(/,\s*/, 2), then call to_h on the resulting array of arrays:
data = <<END
Christopher Alexander, Oct 4, 1936
Winston Churchill, Nov 30, 1874
Max Headroom, Apr 4, 1985
END
data.each_line.map do |line|
  line.chomp.split(/,\s*/, 2)
end.to_h
# => { "Christopher Alexander" => "Oct 4, 1936",
#      "Winston Churchill" => "Nov 30, 1874",
#      "Max Headroom" => "Apr 4, 1985" }
(You will, of course, want to replace data with your File object.)
birthdays = Hash.new
File.read('birthdays.txt').each_line do |line|
  line = line.chomp
  name, birthdate = line.split(/\s*,\s*/, 2)
  birthdays[name] = birthdate
  puts birthdays
end
Using @Jordan's data:
data.each_line.with_object({}) do |line, h|
  name, _, bdate = line.chomp.partition(/,\s*/)
  h[name] = bdate
end
#=> {"Christopher Alexander"=>"Oct 4, 1936",
# "Winston Churchill"=>"Nov 30, 1874",
# "Max Headroom"=>"Apr 4, 1985"}
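None of the answers spells out why the code in the question produced an empty birth date, so here is a short sketch comparing the two calls on a single line. The parse_birthday helper name is only illustrative.

```ruby
LINE = "Christopher Alexander, Oct 4, 1936"

# The question's call: .first returns a single String, so the parallel
# assignment gives the whole name to `name` and nil to `birthdate`.
name, birthdate = LINE.split(/\s*,\s*/).first
p name       # prints "Christopher Alexander"
p birthdate  # prints nil, which string interpolation turns into ""

# Illustrative helper: a limit of 2 splits at the first comma only, so the
# comma inside the date survives.
def parse_birthday(line)
  line.chomp.split(/\s*,\s*/, 2)
end

p parse_birthday(LINE)  # prints ["Christopher Alexander", "Oct 4, 1936"]
```

Without the limit, split would also cut the date at its own comma, leaving "Oct 4" and "1936" as separate elements.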
