I'm trying to scrape a tracklist from a website. My relevant code is:
page.css('ol').each do |line|
subarray = line.text.strip.split(" - ")
end
This makes the array take the first artist into the first index (as I want), but adds the track and the artist of track two into the second index like this:
subarray[0] = Rick Wilhite
subarray[1] = Magic Water [Still Music]
Edward
subarray[2] = Into A Better Future [Giegling]
Kassem Mosse
subarray[3] = Zolarem [Mikrodisko Recordings]
After Hours
I included the nested tag so my code reads:
page.css('ol li').each do |line|
subarray = line.text.strip.split(" - ")
end
but this only seems to leave subarray[0] displaying "Klara Lewis" and subarray[1] displaying "Shine [Editions Mego]", which is the last track on the tracklist. All other index values are blank.
A further complication is that I would like to remove the record label from what will end up being the track value. I believe the correct regular expression is \[[\d\D]*?\], but I'm under the impression that this needs to be applied before the data goes into the array to avoid complications involved in iterating over arrays. I tried passing it as a second delimiter to split (along with ' - ') which didn't work, and I also attempted to test it by changing my code to:
page.css('ol').each do |line|
subarray = line.text.strip.split("\[[\d\D]*?\]")
end
but that also appears not to work. Can anyone help me on this or give me the right pointers?
Here's what's happening:
page.css('ol') gives you the entire <ol> with every one of the <li> tags:
<ol>
<li>Rick Wilhite...</li>
<li>Edward...</li>
...
<li>Klara Lewis...</li>
</ol>
When that one big chunk enters the .each loop, you're only running through the loop once. So when you apply the .split(" - ") method, subarray will be filled once with all the text separated by -.
On the other hand, page.css('ol li') gives you each individual <li>, like this:
<li>Rick Wilhite...</li>
<li>Edward...</li>
...
<li>Klara Lewis...</li>
This time, you're running through the loop 17 times, once for each <li> tag. The first time through, .split(" - ") is applied to the text and stored in the subarray variable. The problem is that the next time through the loop, subarray is overwritten with the split text of the second <li>. So after the final time through, the only contents of the subarray variable is the split text of the final <li>: "Klara Lewis" and "Shine [Editions Mego]".
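One way to keep every row instead of overwriting it is to collect the results into an array. The following is only a sketch on plain strings: li_texts is a hypothetical stand-in for what page.css('ol li').map(&:text) might return, and the label-stripping pattern assumes each label sits in square brackets at the end of the line.

```ruby
# Hypothetical sample standing in for page.css('ol li').map(&:text).
li_texts = [
  "Rick Wilhite - Magic Water [Still Music]",
  "Edward - Into A Better Future [Giegling]",
  "Klara Lewis - Shine [Editions Mego]"
]

# map returns one [artist, track] pair per <li>, so nothing is overwritten.
tracklist = li_texts.map do |text|
  # Limit the split to 2 parts in case a track title itself contains " - ".
  artist, track = text.strip.split(" - ", 2)
  # Remove a trailing "[Label]" (assumed format) from the track title.
  [artist, track.to_s.sub(/\s*\[.*?\]\s*\z/, "")]
end

tracklist.each { |artist, track| puts "#{artist}: #{track}" }
```

Stripping the label per string, before the pair is stored, avoids a second pass over the finished array.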
I think you've gotten the general idea of how to scrape from a website, but I recommend building your script more incrementally so you understand exactly what you're doing in each step. For example, use puts to check what page.css('ol') gives you and how it differs from page.css('ol li'). What happens when it goes through a loop? What do you get when you apply .split()? Building more slowly and exploring around to make sure you understand what you're doing will help you avoid hitting dead ends. Hope that helps!
It runs correctly, but it should have around 500 matches and it only has around 50, and I don't know why!
This is a problem for my computer science class that I am having issues with.
We had to write a function that checks a list for duplicates. I got that part working, but then we had to apply it to the birthday paradox (more info here: http://en.wikipedia.org/wiki/Birthday_problem). That's where I am running into a problem: my teacher said that the total number of matches should be around 500, or 50%, but for me it's only around 50-70, or 5%.
duplicateNumber=0
import random
def has_duplicates(listToCheck):
    for i in listToCheck:
        x=listToCheck.index(i)
        del listToCheck[x]
        if i in listToCheck:
            return True
        else:
            return False
listA=[1,2,3,4]
listB=[1,2,3,1]
#print has_duplicates(listA)
#print has_duplicates(listB)
for i in range(0,1000):
    birthdayList=[]
    for i in range(0,23):
        birthday=random.randint(1,365)
        birthdayList.append(birthday)
    x= has_duplicates(birthdayList)
    if x==True:
        duplicateNumber+=1
    else:
        pass
print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with atleast one match. The approximate probibilatiy is", round(((duplicateNumber/1000)*100),3),"%"
This code gave me a result in line with what you were expecting:
import random
duplicateNumber=0
def has_duplicates(listToCheck):
    # A set keeps only unique values, so a shorter set means a repeat.
    number_set = set(listToCheck)
    # Compare the lengths with !=; "is not" tests identity, not equality.
    return len(number_set) != len(listToCheck)
for i in range(0,1000):
    birthdayList=[]
    for j in range(0,23):
        birthday=random.randint(1,365)
        birthdayList.append(birthday)
    if has_duplicates(birthdayList):
        duplicateNumber+=1
# Divide by 1000.0 so Python 2 does not truncate the result to an integer.
print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with at least one match. The approximate probability is", round(((duplicateNumber/1000.0)*100),3),"%"
The first change I made was tidying up the indices you were using in those nested for loops. You'll see I changed the second one to j, as they were previously both i.
The big one, though, was to the has_duplicates function. Your original version returned during the first pass of its loop, so it only ever checked whether the first birthday was repeated, which is why the rate came out near 5%. The basic principle in the new version is that creating a set out of the incoming list gets the unique values in the list. By comparing the number of items in number_set to the number in listToCheck, we can judge whether there are any duplicates or not.
Here is what you are looking for. As this is not standard practice (to just throw code at a new user), I apologize if this offends any other users. However, I believe that showing the OP a correct way to write a program does us all a favor, especially if the lack of documentation would otherwise follow the user further into their career.
Thus, please take a careful look at the code, and fill in the blanks. Look up the Python documentation (as dry as it is), and try to understand the things that you don't get right away. Even if you understand something just by the name, it would still be wise to see what is actually happening when some built-in method is being used.
Last, but not least, take a look at this code, and take a look at your code. Note the differences, and keep trying to write your code from scratch (without looking at mine), and if it messes up, see where you went wrong, and start over. This sort of practice is key if you wish to succeed later on in programming!
import random

def same_birthdays():
    '''
    This is a program that does ________. It is really important
    that we tell readers of this code what it does, so that the
    reader doesn't have to piece all of the puzzles together,
    while the key is right there, in the mind of the programmer.
    '''
    count = 0
    #Count is going to store the number of times that we have the same birthdays
    timesToRun = 1000 #timesToRun should probably be in a parameter
    #timesToRun is clearly defined in its name as well. Further elaboration
    #on its purpose is not necessary.
    for i in range(0,timesToRun):
        birthdayList = []
        for j in range(0,23):
            random_birthday = random.randint(1,365)
            birthdayList.append(random_birthday)
        birthdayList = sorted(birthdayList) #sorting for easier matching
        #If we really want to, we could provide a check in the above nested
        #for loop to check right away if there is a duplicate.
        #But again, we are here
        for j in range(0, len(birthdayList)-1):
            if (birthdayList[j] == birthdayList[j+1]):
                count+=1
                break #leaving this nested for-loop
    return count
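If you wish to find the percent, then replace the above return statement with the following (the 100.0 forces floating-point division, which plain count/timesToRun would not do in Python 2):
    return 100.0 * count / timesToRun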
Here's a solution that doesn't use set(). It also takes a different approach with the array, so that each index represents a day of the year. I also removed the has_duplicates() function.
import random
sim_total=0
birthdayList=[]
#initialize an array of 0's representing each calendar day
for i in range(365):
    birthdayList.append(0)
for i in range(0,1000):
    first_dup=True
    #reset the day counters before each simulation
    for n in range(365):
        birthdayList[n]=0
    for b in range(0, 23):
        r = random.randint(0,364)
        birthdayList[r]+=1
        #count each simulation at most once, on its first duplicate
        if (birthdayList[r] > 1) and (first_dup==True):
            sim_total+=1
            first_dup=False
avg = float(sim_total) / 1000 * 100
print "after 1000 simulations with 23 students there were", sim_total,"simulations with at least one duplicate. The approximate probability is", round(avg,3),"%"
I'm trying to display total calls from a twilio object as well as unique calls.
The total calls is simple enough:
# set up a client to talk to the Twilio REST API
@sub_account_client = Twilio::REST::Client.new(@account_sid, @auth_token)
@subaccount = @sub_account_client.account
@calls = @subaccount.calls
@total_calls = @calls.list.count
However, I'm really struggling to figure out how to display unique calls (people sometimes call back from the same number, and I only want to count calls from the same number once). I'm thinking this needs a pretty simple method or two, but I've burnt quite a few hours trying to figure it out (still a Ruby noob).
Currently I've been working it in the console as follows:
@sub_account_client = Twilio::REST::Client.new(@account_sid, @auth_token)
@subaccount = @sub_account_client.account
@subaccount.calls.list({}).each do |call|
  #"from" returns the phone number that called
  print call.from
end
This returns the following strings:
+13304833615+13304833615+13304833615+13304833615+13304567890+13304833615+13304833615+13304833615
There are only two unique numbers there so I'd like to be able to return '2' for this.
Calling class on that output shows strings. I've used "insert" to add a space then have done a split(" ") to turn them into arrays but the output is the following:
[+13304833615][+13304833615][+13304833615][+13304833615][+13304567890][+13304833615][+13304833615][+13304833615]
I can't call 'uniq' on that and I've tried to 'flatten' as well.
Please enlighten me! Thanks!
If what you have is a string that you want to manipulate, the following works:
%{+13304833615+13304833615+13304833615+13304833615+13304567890+13304833615+13304833615+13304833615}.split("+").uniq.reject { |x| x.empty? }.count
=> 2
However, this is better:
@subaccount.calls.list({}).map(&:from).uniq.count
Can you build an array directly instead of converting it into a string first? Try something like this, perhaps:
@calllist = []
@subaccount.calls.list({}).each do |call|
  #"from" returns the phone number that called
  @calllist.push call.from
end
You should then be able to call uniq on @calllist to shorten it to the unique members.
Edit: What type of object is @subaccount.calls.list anyway?
uniq should work for creating a unique list of strings. I think you may be getting confused by unrelated things. You don't want .split: that turns a single string into an array of word strings (by default it splits on whitespace), so it has turned each single number string into an array containing only that number. You may also have been confused by running your each call in the irb console, which returns the full array that was iterated over, even if your inner loop did the right thing. Try the following:
unique_numbers = @subaccount.calls.list({}).map {|call| call.from }.uniq
puts unique_numbers.inspect
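To see the difference between the two approaches without a Twilio account, here is a small sketch. The Call struct and the sample numbers are hypothetical stand-ins for the objects the real client returns; only the from attribute is assumed.

```ruby
# Hypothetical stand-in for the call objects returned by the Twilio client.
Call = Struct.new(:from)
calls = %w[+13304833615 +13304833615 +13304567890 +13304833615].map { |n| Call.new(n) }

# split on a single number just wraps it in a one-element array,
# which is why uniq had nothing to deduplicate in the question.
wrapped = calls.first.from.split(" ")

# map gathers every number into ONE array, so uniq can deduplicate it.
unique_numbers = calls.map(&:from).uniq
puts unique_numbers.length
```

The key point is that uniq needs a single array holding all the numbers, not one small array per call.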
For a project that I am working on for school, one of the parts of the project asks us to take a collection of all the Federalist papers and run it through a program that essentially splits up the text and writes new files (per different Federalist paper).
The logic I decided to go with is to run a search, and every time the search is positive for "Federalist No." it would save into a new file everything until the next "Federalist No".
This is the algorithm that I have so far:
file_name = "Federalist"
section_number = "1"
new_text = File.open(file_name + section_number, 'w')
i = 0
n= 1
while i < l.length
if (l[i]!= "federalist") and (l[i+1]!= "No")
new_text.puts l[i]
i = i + i
else
new_text.close
section_number = (section_number.to_i +1).to_s
new_text = File.open(file_name + section_number, "w")
new_text.puts(l[i])
new_text.puts(l[i+1])
i=i+2
end
end
After debugging the code as much as I could (I am a beginner at Ruby), the problem that I run into now is that because the while function always holds true, it never proceeds to the else command.
In terms of going about this in a different way, my TA suggested the following:
Put the entire text in one string by looping through the array(l) and adding each line to the one big string each time.
Split the string using the split method and the key word "FEDERALIST No." This will create an array with each element being one section of the text:
arrayName = bigString.split("FEDERALIST No.")
You can then loop through this new array to create files for each element using a similar method you use in your program.
But as simple as it may sound, I'm having an extremely difficult time putting even that code together.
i = i + i
i starts at 0, and 0 gets added to it, which gives 0 again. So i never advances and always stays less than l.length (as long as l is not empty), which is why the loop never ends.
Since this is a school assignment, I hesitate to give you a straight-up answer. That's really not what SO is for, and I'm glad that you haven't solicited a full solution either.
So I'll direct you to some useful methods in Ruby instead that could help.
In Array: .join, .each or .map
In String: .split
Fyi, your TA's suggestion is far simpler than the algorithm you've decided to embark on... although technically, it is not wrong. Merely more complex.
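For orientation only, here is a rough sketch of how the methods above could fit together along the lines of the TA's suggestion. The sample lines, the temporary output directory, and the "Federalist1", "Federalist2" file names are assumptions for illustration; the real input array l and the exact heading text may differ.

```ruby
require "tmpdir"

# Hypothetical sample lines standing in for the array l from the question.
lines = [
  "FEDERALIST No. 1",
  "General Introduction",
  "FEDERALIST No. 2",
  "Concerning Dangers from Foreign Force and Influence"
]

# Step 1: join every line into one big string.
big_string = lines.join("\n")

# Step 2: split on the heading. The first element is whatever came before
# the first heading (an empty string here), so it is dropped.
sections = big_string.split("FEDERALIST No.").drop(1)

# Step 3: write one file per section into a temporary directory.
out_dir = Dir.mktmpdir
paths = sections.each_with_index.map do |body, index|
  path = File.join(out_dir, "Federalist#{index + 1}")
  # split removed the heading text, so put it back in front of the body.
  File.write(path, "FEDERALIST No." + body)
  path
end

puts paths
```

Note that split discards the delimiter, which is why the heading is added back before writing.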
I am building a "find and replace" button for my application, using GTK and Ruby. I can already count how many times the searched word occurs. I also want to select the matches of the searched word and mark all of them. Here is some of my code:
def search(ent, txtvu)
start = txtvu.buffer.start_iter
first, last = start.forward_search(ent.text, Gtk::TextIter::SEARCH_TEXT_ONLY, nil)
count = 0
while (first)
mark = start.buffer.create_mark(nil, first, false)
txtvu.scroll_mark_onscreen(mark)
txtvu.buffer.delete_mark(mark)
txtvu.buffer.select_range(first, last)
start.forward_char
first, last = start.forward_search(ent.text, Gtk::TextIter::SEARCH_TEXT_ONLY, nil)
start = first
count += 1
end
count tells me how many matches there are. My code doesn't work. :( Why? I want to mark all of the matched words.
If I understand you correctly, you want to highlight all found words, not just one. In that case, select_range is not the function to call, because it will change the selection to the current word, and GtkTextView selection is single and contiguous.
Instead, create a highlight tag and apply it to every match. For example:
# create the "highlight" tag (run this only once)
txtvu.buffer.create_tag("highlight", "background" => "yellow")
# ... later, in the loop:
txtvu.buffer.apply_tag("highlight", first, last)
Your matches will all appear highlighted.
This is an exercise from a class on CodeLesson.com: Write a program that will accept a list of words from a user. These can either be one per line or all on a single line and delimited in some way (with commas perhaps). Then print out every combination of two words. For example, if a user were to type in book,bus,car,plane, then the output would be something like:
bookbook bookbus bookcar bookplane busbook busbus buscar busplane
carbook carbus carcar carplane planebook planebus planecar planeplane
If you want a kickstart then use the built in Array#repeated_permutation. For doing it yourself, think of a way to loop the array; inside that loop, loop again.
You need an algorithm to obtain "permutations with repetition". Google it and you'll find many pages with explanations and algorithms. Since it's a learning assignment I won't provide an actual implementation :) but see here for instance:
Permutation with repetition without allocate memory
Now, you say none of your ideas are working. Perhaps if you add some of them to your question you can get specific tips on why they're not working and how to get your algorithm to work.
As a first idea, you could loop over each element in the array of words, and then loop over each word again inside the loop:
# Ask the user for a comma-separated input.
input = gets
# Split the input into an array, and map
# each element of the array to the same
# element, but with surrounding whitespace
# removed.
words = input.split(',').map { |w| w.strip }
# Iterate over each word.
words.each do |w1|
  # For each word, iterate over
  # all words once again. Pairs of
  # the same word (such as bookbook)
  # are kept, as in the example output.
  words.each do |w2|
    puts w1 + w2
  end
end
However, there is a more concise way of saying "loop through the array, and loop through the array again inside each loop": it's called repeated permutation. The method Array#repeated_permutation allows you to do this. It takes as a parameter the length of the permutation (in our case, the length is two: we iterate once over the array, and then once again inside each loop). Here's how that would look:
input = gets
words = input.split(',').map { |w| w.strip }
words.repeated_permutation(2) do |w1, w2|
  puts w1 + w2
end
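As a quick check against the sample from the exercise, the following sketch hard-codes the input instead of reading it with gets. The expected output in the exercise includes same-word pairs such as bookbook, so nothing is skipped here; the row layout via each_slice is just an assumption about how to print the result.

```ruby
# Fixed input from the exercise, used in place of gets for a repeatable run.
words = "book,bus,car,plane".split(",").map(&:strip)

# Every ordered pair of words, including a word paired with itself.
pairs = words.repeated_permutation(2).map { |w1, w2| w1 + w2 }

# Print one row per first word, mirroring the layout in the exercise.
pairs.each_slice(words.length) { |row| puts row.join(" ") }
```

With four words this yields 4 * 4 = 16 pairs, starting at bookbook and ending at planeplane.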
Hope this helps.