I'm writing a web scraping script in Ruby that opens a used car website, searches for a make/model of car, loops over the pages of results, and then scrapes the data on each page.
The problem I'm having is that I don't necessarily know the max # of pages at the beginning, and only as I iterate closer to the last few known pages does the pagination increase and reveal more pages.
I've defined cleanpages as an array and populated it with what I know are the available pages when first opening the site. Then I use cleanpages.each do to iterate over those "pages". Each time I'm on a new page I add all known pages back into cleanpages and then run cleanpages.uniq to remove duplicates. The problem seems to be that cleanpages.each do only iterates as many times as its original length.
Can I make it so that within the each do loop, I increase the number of times it will iterate?
Rather than using Array#each, try using your array as a queue. The general idea is:
queue = initial_pages
while queue.any?
  page = queue.shift
  new_pages = process(page)
  queue.concat(get_unprocessed_pages(new_pages))
end
The idea here is that you just keep taking items from the head of your queue until it's empty. You can push new items into the end of the queue during processing and they'll be processed correctly.
You'll want to be sure to remove pages from new_pages which are already in the queue or were already processed.
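For illustration, here is a minimal, self-contained sketch of that queue-plus-"seen set" bookkeeping (written in Java for concreteness; fetchNewPages is a hypothetical stand-in for whatever your scraper extracts from each results page):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class PageCrawler {
    // Hypothetical placeholder for the scraping step: visit `page` and return the page links found there.
    static List<String> fetchNewPages(String page) {
        return List.of();
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(List.of("page1", "page2"));
        Set<String> seen = new HashSet<>(queue); // every page ever enqueued, processed or not

        while (!queue.isEmpty()) {
            String page = queue.poll();      // take from the head of the queue
            for (String next : fetchNewPages(page)) {
                if (seen.add(next)) {        // add() returns false for pages we've already seen
                    queue.add(next);         // enqueue only genuinely new pages
                }
            }
        }
    }
}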
You could also keep your array data structure and manually track a pointer to the current element in your list. This has the advantage of maintaining a full list of "seen" pages, so you can remove them from your new_pages list before appending anything remaining to the list:
index = 0
queue = initial_pages
loop do
  page = queue[index]
  break if page.nil?
  index += 1
  new_pages = get_new_pages(page) - queue
  queue.concat(new_pages)
end
Desired Outcome: I would like to write a test that clicks each of these "X" delete icons. I would like to do this until there are no images left.
Use Case:
I reload the list of images each time, to ensure I backfill up to 15.
The user has 18 images
On the first page I show 15
When the user deletes 1 image, I reload the images so there are 15 on page 1 again, and now only 2 images on page 2.
Error:
Because the images are reloaded, the .each functionality from Cypress breaks with the following error message:
cy.click() failed because the page updated as a result of this
command, but you tried to continue the command chain. The subject is
no longer attached to the DOM, and Cypress cannot requery the page
after commands such as cy.click().
Cypress Code Implementation:
cy.get('[datacy="deleteImageIconX"]').each(($el) => cy.wrap($el).click());
What can I do to run a successful test that meets my use-case?
I've seen this before; this answer helped me tremendously: waiting-for-dom-to-load-cypress.
The trick is not to iterate over the elements, but rather to loop over the total count (18) and confirm each deletion within the loop.
for (let i = 0; i < 18; i++) {
  cy.get('[datacy="deleteImageIconX"]')
    .first()
    .click()
    .should('not.exist');
}
This strategy eliminates the problems:
trying to work with stale element references (detached)
iterating too fast (confirm current action completes first)
page count being less than total count (page adjusts between iterations)
The last iteration of my "while(res.next())" loop takes 1.5 seconds, whereas the previous iterations only take around 0.003 milliseconds. I'm very confused.
I'm getting a ResultSet from my SQLite database with a common JDBC query. The ResultSet is not huge: it's always 4 columns wide and between 0 and 100 rows long. I then want to save the ResultSet data in POJOs and store those POJOs in an ArrayList. I use the usual iteration method where you simply set up a "while(res.next())" loop and iterate over the entire ResultSet. I noticed a bottleneck in my code and could narrow it down to the while(res.next()) loop. I started debugging and measuring execution times in my code, and it turns out that the last .next() call, which should return false so that the loop stops, takes a very long time.
ArrayList<Trade> trades = new ArrayList<>();
try
{
    Statement statement = connector.getConnection().createStatement();
    statement.setFetchSize(1337);
    ResultSet res = statement.executeQuery("SELECT * FROM trades WHERE timestamp >= 0 AND timestamp < 1337;");
    res.setFetchSize(1337);
    int numberOfTrades = 0;
    long startTime = System.nanoTime();
    while(res.next())
    {
        long addObjectTime = System.nanoTime();
        trades.add(new Trade(res.getLong("id"), res.getLong("timestamp"), res.getDouble("price"), res.getDouble("amount")));
        numberOfTrades++;
        System.out.println("Added trade to list within "+(System.nanoTime()-addObjectTime)+"ns.");
    }
    System.out.println("List-creation completed within "+(System.nanoTime()-startTime)+"ns.");
    System.out.println("Number of trades: "+numberOfTrades);
} catch (SQLException e)
{
    e.printStackTrace();
}
That's my code. As you can see, I already tried playing around with various fetch sizes, as other people suggested in performance threads concerning the .next() method. I tried everything I could, but the outcome still looks like this:
Added trade to list within 46000ns.
Added trade to list within 3200ns.
Added trade to list within 2400ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 4500ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 3100ns.
Added trade to list within 2400ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 2400ns.
Added trade to list within 2400ns.
Added trade to list within 2300ns.
Added trade to list within 2200ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 2300ns.
Added trade to list within 11100ns.
List-creation completed within 1548543200ns.
Number of trades: 22
Adding a POJO with the data from the ResultSet to my ArrayList usually takes between 2 and 5 microseconds. So all in all, the loop shouldn't take much longer than all the execution times for adding trades combined, right? In my example that would be 0.1073 milliseconds. Instead the loop takes more than 1.5 seconds in total, which is roughly 10,000x what I'd expect. I actually have zero clue what's happening here. And this is a severe problem for my program, because this code fragment is executed about 100,000 times, so 150,000 seconds would be about 40 hours of performance loss :(
I actually solved the problem, although at first I was not sure why. The database I was querying had many millions of entries and was always accessed through a single column (timestamp). I indexed timestamp and the performance issues vanished completely. The likely explanation for the odd .next() behaviour: without an index, SQLite steps through the table row by row, so the final .next() call, the one that returns false, has to scan every remaining row just to confirm there are no further matches, while each earlier call stops as soon as it finds the next matching row. With the index on timestamp, that final scan ends immediately.
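For reference, here is a minimal sketch of creating that index over plain JDBC, assuming the table and column names from the query above (trades, timestamp); the JDBC URL and the index name are placeholders you'd adapt to your setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateTradesIndex {
    public static void main(String[] args) {
        // Placeholder URL; point it at your actual SQLite database file.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:trades.db");
             Statement stmt = conn.createStatement()) {
            // One-time, idempotent index on the column used in the WHERE clause.
            stmt.executeUpdate("CREATE INDEX IF NOT EXISTS idx_trades_timestamp ON trades(timestamp)");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}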
I have a table in which data can be refreshed by selecting some filter checkboxes. One or more checkboxes can be selected and after each is selected a spinner is displayed on the page. Subsequent filters can only be selected once the previous selection has refreshed the table. The issue I am facing is that I keep getting StaleElementException intermittently.
This is what I do in capybara -
visit('/table-page') # table with default values is displayed
# select all filters one by one. Wait for spinner to disappear after each selection
filters.each {|filter| check(filter); has_no_css?('.loading-overlay', wait: 15)}
# get table data as array of arrays. Added *minimum* so it waits for table
all('tbody tr', minimum: 1).map { |row| row.all('th,td').map(&:text) }
I am struggling to understand why I am seeing StaleElementException. AFAIK Capybara uses synchronize to reload the node when the text method is called on a given node. It also sometimes happens that the table returns stale data (i.e. the data from before the last filter update).
The use of all or first disables reloading of any elements returned (if you use find, the element is reloadable since the query used to locate it is fully known). This means that if the page changes at all while the last line of your code is running, you'll end up with the StaleElement errors. This is possible in your code because has_no_css? can run before the overlay appears. One solution is to use has_css? with a short wait time to detect the overlay before checking that it disappears. The has_xxx? methods just return true/false and don't raise errors, so in the worst case has_css? misses the appearance/disappearance of the overlay completely and basically devolves into a sleep for the specified wait time.
visit('/table-page') # table with default values is displayed
# select all filters one by one. Wait for spinner to disappear after each selection
filters.each do |filter|
  check(filter)
  has_css?('.loading-overlay', wait: 1)
  assert_no_selector('.loading-overlay', wait: 15)
end
# get table data as array of arrays. Added *minimum* so it waits for table
all('tbody tr', minimum: 1).map { |row| row.all('th,td').map(&:text) }
I am trying to get to grips with Perl. I am writing a few scripts as a disk scheduling simulator: FCFS, SSTF, SCAN, and LOOK.
I have one array with a list of block requests and another to act as the buffer. First I will copy over the first request, then I need to work out the time it takes to get from the first to the second block.
The buffer reads in blocks at 1 per ms; seek, search, and access time are all 1 ms to make the calculations a bit easier; the simulator always starts on block 1, track 1.
http://postimg.org/image/d9osb8tkj/
So if the first block is 5, the search time will be 3 ms to traverse to the start of the 5th block, the seek time will be zero as it's on the same track, and the access time to read the block will always be 1 ms. This means that the time for this request will be 4 ms, so the simulator will read the next 4 requests into the buffer. In first come first served, this will just be the order in which the requests are served.
So if the next request to serve is 12, the arm is at the end of the 5th block, so it will take 2 ms to get to the right track, then 1 ms to get to the start of the 12th block, and another 1 ms to access it.
I was just wondering if anyone could give me some idea how I could express this as an algorithm. Just some pointers would be much appreciated.
Write a class HardDiskSim::Abstract with three subs: seek_time(), spin_time(), and read_time().
Write a subclass of HardDiskSim::Abstract for each different set of values/logic for the three methods.
For example:
package HardDiskSim::Simple;
use base qw(HardDiskSim::Abstract);

our $SECTORS_PER_TRACK = 5;
our $SEEK_TIME_PER_TRACK = 1;

sub read_time { return 1 }

sub seek_time {
    my ($self, $block) = @_;
    # Blocks are numbered from 1, so blocks 1..5 sit on track 0, 6..10 on track 1, and so on.
    # (A fuller version would seek relative to the current arm position rather than track 0.)
    my $tracks_to_seek = int(($block - 1) / $SECTORS_PER_TRACK);
    return $tracks_to_seek * $SEEK_TIME_PER_TRACK;
}

sub spin_time {
    # compute head position at end of seek using seek time and RPM of disk
    # compute number of sectors to spin past using computed head position
    # return number_of_sectors_to_spin_past * time_per_sector
}
I had the fun of writing this kind of code in Fortran, for a class, back in 1985.
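If it helps to see the overall shape in another language, here is a rough Java sketch of the same abstract-class idea plus a tiny FCFS driver loop; all class names, constants, and the simplified geometry are illustrative assumptions rather than part of the recipe above:

import java.util.List;

abstract class AbstractDisk {
    protected int currentTrack = 0; // arm position; "block 1, track 1" maps to track index 0

    abstract int trackOf(int block);   // which track a block lives on
    abstract int seekTime(int block);  // ms to move the arm to that track
    abstract int spinTime(int block);  // ms to rotate to the start of the block
    abstract int readTime();           // ms to read one block

    // Total cost of serving one request; afterwards the arm sits on the block's track.
    int serviceTime(int block) {
        int time = seekTime(block) + spinTime(block) + readTime();
        currentTrack = trackOf(block);
        return time;
    }
}

class SimpleDisk extends AbstractDisk {
    static final int SECTORS_PER_TRACK = 5;
    static final int SEEK_TIME_PER_TRACK = 1;

    @Override int trackOf(int block) { return (block - 1) / SECTORS_PER_TRACK; } // blocks are 1-based
    @Override int seekTime(int block) { return Math.abs(trackOf(block) - currentTrack) * SEEK_TIME_PER_TRACK; }
    @Override int spinTime(int block) { return 1; } // placeholder: ignores the real rotational position
    @Override int readTime() { return 1; }
}

public class FcfsDemo {
    public static void main(String[] args) {
        AbstractDisk disk = new SimpleDisk();
        int total = 0;
        // FCFS simply serves the requests in arrival order and sums the per-request times.
        for (int block : List.of(5, 12, 3)) {
            total += disk.serviceTime(block);
        }
        System.out.println("Total time: " + total + " ms");
    }
}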
I'm trying to make an AJAX autocomplete search box that of course uses SQL (minimum 3 characters), and I have a SQL view of the relevant fields already set up and indexed in the db. The CPU still spikes when searching, which I expected, as it runs a query for every character. I want to use the Zend shm cache to speed up results and reduce CPU usage. The results are stored in an array which is to be cached like this:
while ($row = db2_fetch_row($stmt)) {
    $fSearch[trim($row[0]).trim($row[1])] = array(/* array built here */);
}
if (zend_shm_cache_store('fSearch', $fSearch, 10 * 60) === false) {
    error_log('Failed to store search cache!');
}
Of course there's actual data inside the array instead of the comment; I just shortened the code for simplicity. $row[0] and $row[1] together form the PK, and this has been tested and works properly. It's the zend_shm_cache_store call that fails: the error log gets flooded with 'Failed to store search cache!'. I read that zend_shm_cache_store can store any array that can be serialized. How can I tell if my data is serialized or can be serialized? Are there any other potential causes? I did make a test page that only stored a string, and that was successful, so I know caching is on.
Solved: the cache size was too small for the array. I increased the cache size and it worked fine. Sorry for the trouble.