How to avoid interval with Mechanize - ruby

I'm trying to scrape Craigslist with Mechanize. I wrote this:
require 'mechanize'
a = Mechanize.new
page = a.get("http://paris.craigslist.fr/search/apa")
i = 0
list_per_page = 99
while i <= list_per_page do
title = page.search(".hdrlnk")[i].text
price = page.search(".price")[i].text
puts title
puts price
puts "-----------"
i+=1
end
It works, but when a listing doesn't have a price there is an interval. I think it's because I use search()[i], but I don't know what to do to avoid the interval. Any ideas?
Edit:
On Craigslist there is:
listing_title1 -> $100
listing_title2 -> $200
listing_title3 ->
listing_title4 -> $60
listing_title5 -> $150
My output CSV displays:
listing_title1 -> $100
listing_title2 -> $200
listing_title3 -> $60
listing_title4 -> $150
listing_title5 -> $300
$300 is listing_title6

If by 'interval' you mean the blank line that is printed when the listing doesn't have a price, you could fix this by making the puts conditional:
puts price unless price.empty?
Edit
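If I understand right, your hdrlnk and price entries are getting out of sync with each other. This happens because page.search(".price") only returns the price elements that actually exist: a listing without a price contributes nothing to that list, so every later price moves up by one position relative to the titles.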
The best way to get around this is to find a container that includes both price and hdrlnk and iterate over those instead of over the hdrlnk and price entries separately. On this page that would be the .row which contains all the info for each search result. So something like this would work:
page.search(".row").each do |row|
title = row.search(".hdrlnk").first
price = row.search(".price").first
puts title.text if title
puts price.text if price
puts "------------"
end
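To see why indexing two separate result lists drifts while iterating over a shared container does not, here is a minimal sketch in plain Ruby. It does not call Mechanize: the rows array of hashes is a hypothetical stand-in for the parsed .row elements, with a nil price where the listing has none.

```ruby
# Hypothetical stand-in for the parsed ".row" elements: one hash per listing,
# with :price set to nil when the listing has no ".price" element.
rows = [
  { title: "listing_title1", price: "$100" },
  { title: "listing_title2", price: "$200" },
  { title: "listing_title3", price: nil },
  { title: "listing_title4", price: "$60" },
  { title: "listing_title5", price: "$150" }
]

# Two independent lists, like page.search(".hdrlnk") and page.search(".price"):
# a missing price is simply absent, so the price list is one element shorter.
titles = rows.map { |row| row[:title] }
prices = rows.map { |row| row[:price] }.compact

# Pairing by index shifts every price after the gap up by one position.
misaligned = titles.each_with_index.map { |title, i| [title, prices[i]] }

# Pairing inside each row keeps title and price together.
aligned = rows.map { |row| [row[:title], row[:price]] }

misaligned.each { |title, price| puts "#{title} -> #{price}" }
puts "-----------"
aligned.each { |title, price| puts "#{title} -> #{price}" }
```

The first loop reproduces the shifted output from the question; the second keeps listing_title3 without a price and leaves the later prices where they belong.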

I know you've already accepted an answer and that's fine, but I wanted to introduce the concept of next which is a more powerful solution than putting if <thing> checks all over.
Your method could look like this:
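i = 0
while i <= list_per_page do
  # keep the nodes themselves; calling .text on a missing node would raise
  title = page.search(".hdrlnk")[i]
  price = page.search(".price")[i]
  i += 1 # increment before `next`, otherwise the loop would never advance
  # skip to the next iteration if any of the vars are nil
  next unless [title, price].all?
  # ... the rest of the code, using title.text and price.text
end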
By the way, I think your usage of the term 'interval' is a bit misleading. I think of an interval as a special kind of loop which runs on a specified time interval, e.g. every second or minute. It's probably clearer to use the terms loop or iteration in this case.

Related

Separate characters and numbers following specific rules

I am trying to distinguish flight numbers.
Example:
flightno = "FR556"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # FR
second = split_data[2] # 556
I then go on to query the database to find an airline based on the FR in this example and apply some logic with the result which is Ryanair.
My problem is when the flight number might be:
flightno = "U21920"
split_data = flightno.upcase.match(/([A-Za-z]+)(\d+)/)
first = split_data[1] # U
second = split_data[2] # 21920
I basically want first to be U2, not just U. It is used to search the database of airlines by IATA code, which in this case is U2.
EDIT
In the interest of clarity: I made some mistakes in terminology when asking my question. Due to the complexities of booking reference numbers, the input is taken from whatever the passenger provides. For an easyJet flight, for example, the passenger may input EZY1920 or U21920. The airline may provide either form, so the passenger doesn't really know which kind of code they have.
"EZY" = ICAO
"U2" = IATA
I take the input from the user and try to separate the ICAO or IATA code from the flight number "1920", but there is no way of determining that without searching the database or asking the user to enter the parts separately, which I feel is cumbersome from a user experience point of view.
Using a regex to separate characters from numbers works until the user inputs an IATA code as part of their flight number (the passenger won't know the difference), and as you can see in the example above, this confuses the regex.
The trouble is I can't think of any other pattern in flight numbers. They always start with at least two characters, made up of just letters or a mixture of a letter and a number, and that prefix can be 3 characters long. The numeric part can be as short as 1 digit but can also be as long as 4, and it is always digits.
EDIT
As has been mentioned in the comments, there is no fixed size. However, one thing that is always true (at least so far) is that the first character will always be a letter, regardless of whether the code is ICAO or IATA.
After considering everybody's input so far, I'm wondering if searching the database and returning airlines with an IATA or ICAO code that matches the first two characters provided by the user (U2), (FR), (EZ) might be one way to go. However, this is subject to obvious problems should an ICAO or IATA code be released that matches another airline, for example "EZY" and "EZT". This is not future proof and I'm looking for better Ruby or regex solutions.
Appreciate your input.
EDIT
I have answered my own question below. While other answers provide a solution for handling some conditions, they would fall down if the flight number began with a number, so I worked out a crass but to date stable way to analyse the string for digits and then work out from that whether it is an ICAO or IATA code.
A solution I can think of is to match your given flight number against a complete list of ICAO/IATA codes: https://raw.githubusercontent.com/datasets/airport-codes/master/data/airport-codes.csv
Spending some time with Google might give you a more appropriate list.
Then use the first three characters (if that is the maximum) of your flight number to find a match within the ICAO codes. If you find one, you will know where to separate your string.
Here is a minimal, ugly example that should set you on the right track. Feel free to update!
ICAOCODES = %w(FR DEU U21) # grab your data here
def retrieve_flight_information(flightnumber)
ICAOCODES.each do |icao|
co = flightnumber.match(icao).to_s
if co.length > 0
# airline
puts co
# flight number
puts flightnumber.gsub(co,'')
end
end
end
retrieve_flight_information("FR556")
#=> FR
#=> 556
retrieve_flight_information("U21214123")
#=> U21
#=> 214123
The biggest flaw lies in using .gsub() as it might mess up your flightnumber in case it looks like this: "FR21413FR2"
However, you will find plenty of solutions to this problem on Stack Overflow.
As mentioned in the comments, a list of icao codes is not what you are looking for. But what is relevant here, is that you somehow need a list of strings that you can securely compare against.
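One way around the .gsub() flaw mentioned above is to match the code only at the start of the string and slice it off by length. The sketch below is an illustration rather than the answer's own code: the KNOWN_CODES list and the split_by_known_code name are hypothetical, and a real list would come from your airline data.

```ruby
# Hypothetical list of airline codes; in practice load this from your own data.
KNOWN_CODES = %w[FR U2 EZY].freeze

# Returns [code, rest] when the flight number starts with a known code, else nil.
def split_by_known_code(flight_number, codes = KNOWN_CODES)
  string = flight_number.strip.upcase
  # Try longer codes first so a 3-character code wins over a 2-character prefix of it.
  code = codes.sort_by { |c| -c.length }.find { |c| string.start_with?(c) }
  return nil unless code

  # Slice by length instead of gsub, so later repeats of the code are left alone.
  [code, string[code.length..-1]]
end

p split_by_known_code("FR556")
p split_by_known_code("FR21413FR2")
p split_by_known_code("u21920")
```

Because only the prefix is removed, an input such as "FR21413FR2" keeps its trailing "FR2" intact.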
I have a fairly crass solution that seems to be working in all the scenarios I can throw at it to date. I wanted to make it available to anybody else who might find it useful.
The general rule of thumb for flight codes/numbers seems to be:
IATA: two characters made up of any combination of letters and digits
ICAO: three characters made up of letters only (to date)
With that in mind we should be able to work out if we need to search the database by IATA or ICAO depending on the condition of the first three characters.
First we take the flight number and convert to uppercase
string = "U21920".upcase
Next we analyse the first three characters to check for any numbers.
first_three = string[0,3] # => U21
Is there a digit in first_three?
if first_three =~ /\d/ # => 1 (index of the first digit, which is truthy)
  iata = first_three[0,2] # => "U2"; a digit was found, so drop the last character
  # Now we go to the database searching IATA (U2)
  search = Airline.where('iata LIKE ?', "#{iata}%") # a "starts with" search, just in case
Otherwise, if no digit is found in the first three characters:
else
  icao = string.match(/([A-Za-z]+)(\d+)/)
  search = Airline.where('icao LIKE ?', "#{icao[1]}%")
end
This seems to work for the random flight numbers I've tested it with today from a few of the major airport live departure/arrival boards. It's an interesting problem because some airlines issue tickets with either an ICAO or an IATA code as part of the flight number, which means passengers won't know the difference. On top of that, some airports present flight information in their own format. So, assuming the structure of ICAO and IATA codes doesn't change, the above should work.
Here is an example script you can run
test.rb
puts "What is your flight number?"
string = gets.chomp.upcase # chomp drops the trailing newline from the input
first_three = string[0,3]
puts "Taking first three from #{string} is #{first_three}"
if first_three =~ /\d/ # Calling String's =~ method.
puts "The String #{first_three} DOES have a number in it."
iata = first_three[0,2]
search = Airline.where('iata LIKE ?', "#{iata}%")
puts "Searching Airlines starting with IATA #{iata} = #{search.count}"
puts "Found #{search.first.name} from IATA #{iata}"
else
puts "The String #{first_three} does not have a number in it."
icao = string.match(/([A-Za-z]+)(\d+)/)
search = Airline.where('icao LIKE ?', "#{icao[1]}%")
puts "Searching Airlines starting with ICAO #{icao[1]} = #{search.count}"
puts "Found #{search.first.name} from ICAO #{icao[1]}"
end
Airline
Airline(id: integer, name: string, iata: string, icao: string, created_at: datetime, updated_at: datetime )
Stick this in your lib folder and run:
rails runner lib/test.rb
Obviously you can remove all of the puts statements to get straight to the result. I'm using rails runner to include access to my Airline model when running the script.
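The same rule of thumb can be tried without a database. The following is a rough sketch under the assumptions stated above (an IATA code is two characters that may include a digit, an ICAO code is letters only); the split_flight_number name and the hash keys are made up for illustration and are not part of the Airline model.

```ruby
# Splits raw user input into an airline code and a flight number.
# Returns nil when the input does not look like a flight number at all.
def split_flight_number(input)
  string = input.strip.upcase
  first_three = string[0, 3]

  if first_three =~ /\d/
    # A digit within the first three characters: treat the first two as IATA.
    { type: :iata, code: string[0, 2], number: string[2..-1] }
  else
    # Letters only up front: treat the leading letters as ICAO.
    match = string.match(/\A([A-Z]+)(\d+)\z/)
    return nil unless match

    { type: :icao, code: match[1], number: match[2] }
  end
end

p split_flight_number("U21920")
p split_flight_number("ezy1920")
p split_flight_number("FR556")
```

The returned :type tells the caller which column to search, so the database lookup can stay in one place.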

Scraping tracklist

I'm trying to scrape a tracklist from a website. My relevant code is:
page.css('ol').each do |line|
subarray = line.text.strip.split(" - ")
end
This makes the array take the first artist into the first index (as I want), but adds the track and the artist of track two into the second index like this:
subarray[0] = Rick Wilhite
subarray[1] = Magic Water [Still Music]
Edward
subarray[2] = Into A Better Future [Giegling]
Kassem Mosse
subarray[3] = Zolarem [Mikrodisko Recordings]
After Hours
I included the nested tag so my code reads:
page.css('ol li').each do |line|
subarray = line.text.strip.split(" - ")
end
but this only seems to leave subarray[0] displaying "Klara Lewis" and subarray[1] displaying "Shine [Editions Mego]", which is the last track on the tracklist. All other index values are blank.
A further complication is that I would like to remove the record label from what will end up being the track value. I believe the correct regular expression is \[[\d\D]*?\], but I'm under the impression that this needs to be applied before the data goes into the array to avoid complications involved in iterating over arrays. I tried passing it as a second delimiter to split (along with ' - ') which didn't work, and I also attempted to test it by changing my code to:
page.css('ol').each do |line|
subarray = line.text.strip.split("\[[\d\D]*?\]")
end
but that also appears not to work. Can anyone help me on this or give me the right pointers?
Here's what's happening:
page.css('ol') gives you the entire <ol> with every one of the <li> tags:
<ol>
<li>Rick Wilhite...</li>
<li>Edward...</li>
...
<li>Klara Lewis...</li>
</ol>
When that one big chunk enters the .each loop, you're only running through the loop once. So when you apply the .split(" - ") method, subarray is filled once with all of the text, split at every " - ".
On the other hand, page.css('ol li') gives you each individual <li>, like this:
<li>Rick Wilhite...</li>
<li>Edward...</li>
...
<li>Klara Lewis...</li>
This time, you're running through the loop 17 times, once for each <li> tag. The first time through, .split(" - ") is applied to the text and stored in the subarray variable. The problem is that the next time through the loop, subarray is overwritten with the split text of the second <li>. So after the final time through, the only contents of the subarray variable is the split text of the final <li>: "Klara Lewis" and "Shine [Editions Mego]".
I think you've gotten the general idea of how to scrape from a website, but I recommend building your script more incrementally so you understand exactly what you're doing in each step. For example, use puts to check what page.css('ol') gives you and how it differs from page.css('ol li'). What happens when it goes through a loop? What do you get when you apply .split()? Building more slowly and exploring around to make sure you understand what you're doing will help you avoid hitting dead ends. Hope that helps!
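As a rough sketch of one way to keep every track instead of only the last one, the result of each iteration can be collected with map rather than assigned to the same variable. To stay runnable without a network request, the li_texts array below is a hypothetical stand-in for page.css('ol li').map(&:text), and the label-stripping regex is a simplified variant of the one in the question.

```ruby
# Hypothetical stand-in for page.css('ol li').map(&:text)
li_texts = [
  "Rick Wilhite - Magic Water [Still Music]",
  "Edward - Into A Better Future [Giegling]",
  "Klara Lewis - Shine [Editions Mego]"
]

# map returns a new array with one entry per <li>, so nothing is overwritten.
tracklist = li_texts.map do |text|
  # Split on the first " - " only, in case a track title contains one too.
  artist, track = text.strip.split(" - ", 2)
  # Drop a trailing "[Label]" from the track title.
  { artist: artist, track: track.to_s.sub(/\s*\[[^\]]*\]\s*\z/, "") }
end

tracklist.each { |entry| puts "#{entry[:artist]}: #{entry[:track]}" }
```

Note that the pattern is passed as a Regexp literal. A pattern written inside double quotes, as in the question's last attempt, is treated by split as a plain string delimiter.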

Ruby Search Array And Replace String

My question is, how can I search through an array and replace the string at the current index of the search without knowing what the indexed array string contains?
The code below searches through an AJAX file hosted on the internet, finds the inventory and goes through each weapon in it, adding the ID to a string (so I can check whether that weapon has been seen before). It then adds a second value after the ID: the number of times it occurs in the inventory. After I have checked all the weapons in the inventory, it goes through all of the IDs added to the string and displays them along with the number of occurrences. This is so I know how many of each weapon I have.
This is an example of what I have:
strList = ""
inventory.each do |inv|
amount = 1
exists = false
ids = strList.split(',')
ids.each do |ind|
if (inv['id'] == ind.split('/').first) then
exists = true
amount = ind.split('/').first.to_i
amount += 1
ind = "#{inv['id']}/#{amount.to_s}" # This doesn't seem to work as expected.
end
end
if (exists == true) then
ids.push("#{inv['id']}/#{amount.to_s}")
strList = ids.join(",")
end
end
strList.split(",").each do |item|
puts "#{item.split('/').first} (#{item.split('/').last})"
end
Here is an idea of what code I expected (pseudo-code):
inventory = get_inventory()
drawn_inv = ""
loop.inventory do |inv|
if (inv['id'].occurred_before?)
inv['id'].count += 1
end
end loop
loop.inventory do |inv|
drawn_inv.add(inv['id'] + "/" + inv['id'].count)
end loop
loop.drawn_inv do |inv|
puts "#{inv}"
end loop
Any help on how to replace that line is appreciated!
EDIT: Sorry for not providing more information on my code. I skipped the less important part at the bottom of the code and displayed commented code instead of actual code; I'll add that now.
EDIT #2: I'll update my description of what it does and what I'm expecting as a result.
EDIT #3: Added pseudo-code.
Thanks in advance,
SteTrezla
You want #each_with_index: http://ruby-doc.org/core-2.2.0/Enumerable.html#method-i-each_with_index
You may also want to look at #gsub since it takes a block. You may not need to split this string into an array at all. Basically something like strList.gsub(...){ |match| #...your block }
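For context, assigning to the block variable (ind = ...) only rebinds a local name and leaves the array untouched, which is why an index is needed to write back. Below is a small sketch of that, followed by an alternative that counts with a Hash instead of a delimited string. The Hash approach is a swapped-in technique, not what the answer proposes, and the sample inventory data is hypothetical.

```ruby
# Hypothetical inventory: each item is a hash with an "id" key.
inventory = [{ "id" => "101" }, { "id" => "205" }, { "id" => "101" }]

# Writing back through the index changes the array itself.
ids = ["101/1", "205/1"]
ids.each_with_index do |entry, i|
  id, amount = entry.split("/")
  ids[i] = "#{id}/#{amount.to_i + 1}" if id == "101"
end

# Alternative: tally occurrences in a Hash whose default value is 0.
counts = inventory.each_with_object(Hash.new(0)) do |inv, tally|
  tally[inv["id"]] += 1
end

counts.each { |id, amount| puts "#{id} (#{amount})" }
```

With the Hash there is no string to split and rejoin on every pass, and each ID appears once with its count.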

How to iterate only through unique combinations of multiple objects?

The title is a bit of a doozy.
I'm working on a project where users can make bids. The resulting items can be won exclusively or split between up to 3 users. One user can put in an exclusive bid of $20, and another 3 users can all agree to do a 3-way split and each pay only $10, resulting in $30 and beating the first bidder.
I need to run through a list of possibly a dozen different bidders who agreed to the 3-way split to determine the winning trio:
Rza => $20 # loses
ODB + Gza => $25 # loses
InspectahDeck + Ghostface + ODB => $50 # wins
Alternatively
Rza => $100,000 # wins
ODB + Gza => $25 # loses
InspectahDeck + Ghostface + ODB => $50 # loses
All I have is an array of Bid objects, belonging to a variety of users. My goal is to see all possible combinations of those who wish to split with others and see who comes out on top.
I tried to do something like:
bids.each do |bid1|
bids.each do |bid2|
bids.each do |bid3|
# Fill a hash here, but only if the permutation of the bids is unique
end
end
end
I'm having a hard time with this since it seems horribly inefficient and produces tons of duplicates, sometimes with the same bid appearing twice. I'd like some help or tips to point me in the right direction.
I'm really stumped.
Thanks in advance.
PS: Another tricky detail: Each bidder can have multiple bids set. So the same guy can have 1 exclusive, 1 2-way and 1 3-way.
Suppose you have something like this:
class Bid
attr_accessor :user # link to the user
attr_accessor :price # dollar amount
attr_accessor :way # 1 means 1-way, 2 means 2-way, 3 means 3-way
end
Get the highest bids of each kind (Bid does not define <=>, so compare by price explicitly):
best_1_way = bids.select{|bid| bid.way == 1}.max_by(&:price)
best_2_ways = bids.select{|bid| bid.way == 2}.sort_by(&:price).last(2)
best_3_ways = bids.select{|bid| bid.way == 3}.sort_by(&:price).last(3)
Get the total prices:
total_1_way_price = best_1_way.price
total_2_ways_price = best_2_ways.map(&:price).inject(&:+)
total_3_ways_price = best_3_ways.map(&:price).inject(&:+)
Compare these three items, and you get your winner.
If you have a lot of bids and want to optimize:
all_1_ways, all_2_ways, all_3_ways =
bids.group_by{|bid| bid.way }.values_at(1,2,3)
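Put together, the approach above might look like the following sketch. It uses a Struct in place of the Bid class and invented sample amounts, and it adds one assumption the answer does not state: a user may count only once per group, so only each user's best bid of a given kind is considered.

```ruby
Bid = Struct.new(:user, :price, :way)

# Hypothetical sample data, loosely following the question's example.
bids = [
  Bid.new("Rza", 20, 1),
  Bid.new("ODB", 15, 2), Bid.new("Gza", 10, 2),
  Bid.new("InspectahDeck", 20, 3), Bid.new("Ghostface", 20, 3),
  Bid.new("ODB", 10, 3), Bid.new("Raekwon", 5, 3)
]

by_way = bids.group_by(&:way)

totals = (1..3).each_with_object({}) do |way, result|
  candidates = by_way.fetch(way, [])
  # Assumption: keep only each user's highest bid of this kind.
  best_per_user = candidates.group_by(&:user).values.map { |list| list.max_by(&:price) }
  top = best_per_user.max_by(way, &:price)
  next if top.size < way # not enough bidders to form this kind of split

  result[way] = { users: top.map(&:user), total: top.sum(&:price) }
end

winning_way, winner = totals.max_by { |_way, info| info[:total] }
puts "#{winning_way}-way wins with $#{winner[:total]}: #{winner[:users].join(' + ')}"
```

This avoids the triple nested loop entirely: each group is scanned once, and only the three totals are compared at the end.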

How to match between two arrays and update one based on criteria

I'm trying to match two supplier CSVs and update one based on the results of the other; for example, if the price is different, update one file with the matching item from the other. If the product is in the first CSV but not in the other, add it. Once the data set is adjusted, I'll write it back to the CSV, which I'm OK with. Each supplier file is about 9000 lines long. Sample data from the two puts lines in the code are:
#<struct RecordBUY item_type=nil, buy_product_id="1000", product_name="Plastic Jeweled Crown", product_type=nil, product_code_SKU="105238", option_set=nil, duplicate={"1000"=>["105238"]}, brand_name="Rubies Costumes", prod_desc="This plastic crown has six large jewel stones accross the top. Adjustable headband. (Colors of the jewel stones may vary, our choice please.)", cost_price="$3.76", prod_weight="00.14", prod_width="5.75", prod_height="0.5", prod_depth="23.5", prod_category="Hats, Wigs & Masks", prod_upn="082686025935", prod_size="One Size", prod_color="Gold">
#<struct BCRecord item_type="Product", bc_product_id="620", product_name="Dollar Ring", product_type=nil, product_code_SKU="109624", option_set=nil, duplicate=nil, brand_name="Rubies Costumes", prod_desc="Ring has three large glittery Dollar Signs '$' that extend over your fingers.", cost_price="3.20", prod_weight="0.7200", prod_width="4.0000", prod_height="1.0000", prod_depth="7.0000", prod_category="Accessories & Makeup", prod_upn="82686006996", prod_size=nil, prod_color=nil, option_set=nil, price="5.60", allow_purchases=[21]>
I read the CSV data into arrays of the respective structs, but I don't know how to do the searching and updating efficiently. I haven't come across the techniques for avoiding inefficient approaches (nor do I know whether a brute-force approach on 9k lines is actually bad or just frowned upon). What I have is:
puts records[0]
puts recordsBC[1]
#start script
records.each do | buyline |
recordsBC.each do | bcline |
if bcline.product_code_SKU == buyline.product_code_SKU
##update pricing (brute force);
#bcline.price = buyline.cost_price * 1.75 #this fails with undefined method `price=' for #<Record:0x007fbb9088b960>
bcline.cost_price = buyline.cost_price
end
##if product is in BC currently, but not in buy - needs to be marked as inactive in BC
if bcline.product_code_SKU.include? buyline.product_code_SKU
#bcline.allow_purchases = "N" # this fails with undefined method `allow_purchases=' for #<Record:0x007fb2878822c8>
end
#if product is in Buy but not in BC then add it into BC
if buyline.product_code_SKU.include? bcline.product_code_SKU
recordsBC.push buyline
end
end
end
I can't figure out a better way, nor understand why I'm getting the undefined method errors on some but not all lines. I'm not after complete answers, just enough to figure out the rest of the solution.
I'd start by reducing the number of iterations. At the moment you are iterating through all of recordsBC for each buyline. So I'd start with:
records.each do | buyline |
record_subset = recordsBC.select{|r|!(r.product_code_SKU.split & buyline.product_code_SKU.split).empty?}
record_subset.each do |bcline|
.....
end
end
That should mean you only iterate through bcline items that have a matching product_code_SKU. You may have to modify the split as your example doesn't show how multiple SKUs are separated (e.g. '123 456', '123,456', or '123/456')
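If each record carries a single SKU, another option is to index one list by SKU in a Hash so every lookup is a single step instead of a scan. This is a different technique from the select shown above, sketched with cut-down hypothetical structs that include writable price and allow_purchases members; the real structs have many more fields, and the prices arrive as strings in the sample data.

```ruby
# Cut-down, hypothetical versions of the two record types.
BuyRecord = Struct.new(:product_code_SKU, :cost_price)
BCRecord  = Struct.new(:product_code_SKU, :cost_price, :price, :allow_purchases)

records = [BuyRecord.new("105238", "$3.76"), BuyRecord.new("999999", "$1.00")]
recordsBC = [
  BCRecord.new("105238", "3.20", "5.60", "Y"),
  BCRecord.new("109624", "3.20", "5.60", "Y")
]

# Build the lookup once: SKU => buy record.
buy_by_sku = records.each_with_object({}) { |r, index| index[r.product_code_SKU] = r }

bc_skus = {}
recordsBC.each do |bcline|
  bc_skus[bcline.product_code_SKU] = true
  buyline = buy_by_sku[bcline.product_code_SKU]
  if buyline
    # Prices are strings such as "$3.76", so convert before doing arithmetic.
    cost = buyline.cost_price.delete("$").to_f
    bcline.cost_price = format("%.2f", cost)
    bcline.price = format("%.2f", cost * 1.75)
  else
    # In BC but not in the buy file: mark as inactive.
    bcline.allow_purchases = "N"
  end
end

# In the buy file but not yet in BC: candidates to add.
new_records = records.reject { |r| bc_skus.key?(r.product_code_SKU) }
```

Collecting new_records separately, rather than pushing onto recordsBC inside the loop, also avoids growing the array while it is being iterated.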

Resources