My loop in scrapy is not running sequentially - for-loop

I am scraping a sequence of urls. The code works, but Scrapy is not parsing the urls in sequential order. E.g. although I am trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, etc.
It parses all the urls, but when a specific url does not exist (e.g. example.com/unit.aspx?b_id=10) Firefox shows me the result of my previous request. Since I want to make sure that I don't have duplicates, I need to ensure that the loop parses the urls sequentially and not "at will".
I tried "for n in range(1,101)" and also a "while bID<100"; the result is the same (see below).
Thanks in advance!
def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID = 0
        #for n in range(1,100,1):
        while bID < 100:
            bID = bID + 1
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % (bID)
            request = Request(url=startURL, dont_filter=True, callback=self.parse_add_tables,
                              meta={'bID': bID, 'metaItems': []})
            # print self.metabID
            yield request  # Request(url=startURL, dont_filter=True, callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

You could try something like this. I'm not sure if it's fit for purpose, since I haven't seen the rest of the spider code, but here you go:
# create a list of urls to be parsed, in reverse order (so we can easily pop items off)
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1', dont_filter=True,
                       callback=self.parse_add_tables, meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    # parsing code here
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # parse b_id back out of the url instead of taking only the last character,
        # so two- and three-digit ids work too
        return Request(url=next_url, dont_filter=True, callback=self.parse_add_tables,
                       meta={'bID': int(next_url.split('=')[-1]), 'metaItems': []})
    return items

You can use the priority attribute on the Request object. Scrapy crawls in DFO (depth-first order) by default, but it does not guarantee that the urls are visited in the order they were yielded within your parse callback. Instead of yielding Request objects, you can also keep a list of Requests and pop the next one off until the list is empty, as in the answer above. For more info, see Scrapy Crawl URLs in Order.
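For illustration, here is a minimal sketch of the priority approach (untested, and it assumes the same Request import, login check and parse_add_tables callback as in the question). Giving each request a decreasing priority makes the scheduler dispatch b_id=1 first, b_id=2 second, and so on, although with concurrent downloads the responses can still arrive out of order:

def check_login_response(self, response):
    if "Welcome!" in response.body:
        self.initialized()
        for bID in range(1, 101):
            # Higher priority values are scheduled first, so negate bID
            # to have b_id=1 dispatched before b_id=2, and so on.
            yield Request(
                url='https://www.example.com/units.aspx?b_id=%d' % bID,
                dont_filter=True,
                callback=self.parse_add_tables,
                priority=-bID,
                meta={'bID': bID, 'metaItems': []},
            )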


Displaying JSON output from an API call in Ruby using VSCode

For context, I'm someone with zero experience in Ruby - I just asked my Senior Dev to copy-paste me some of his Ruby code so I could try to work with some APIs that he ended up putting off because he was too busy.
So I'm using a gem called zoho_hub, which is a wrapper for the Zoho APIs (https://github.com/rikas/zoho_hub/blob/master/README.md).
My IDE is VSCode.
I execute the entire length of the code, and I'm faced with this:
[Done] exited with code=0 in 1.26 seconds
The API is supposed to return a paginated list of records, but I don't see anything outputted in VSCode, despite the fact that no error is being reflected. The last 2 lines of my code are:
ZohoHub.connection.get 'Leads'
p "testing"
I use the dummy string "testing" to make sure that it's being executed up till the very end, and it does get printed.
This has been baffling me for hours now - is my response actually being outputted somewhere, and I just can't see it??
Ruby does not print anything unless you tell it to. For debugging there is a pretty-printing method available called pp, which is decent for printing structured data.
In this case, if you want to output the records that your get method returns, you would do:
pp ZohoHub.connection.get 'Leads'
To get the next page, you can look at the source code, where you will see that the get method takes an additional Hash parameter:
def get(path, params = {})
Then you have to read the Zoho API documentation for this endpoint, and you will see that a specific page is requested using the page param.
Therefore we can finally piece it together:
pp ZohoHub.connection.get('Leads', page: NNN)
Where NNN is the number of the page you want to request.

Querying Twilio calls list resource doesn't paginate the results using Ruby or PHP

According to Twilio's documentation here regarding "paging":
The list returned to you includes paging information. If you plan on requesting more records than will fit on a single page, you may want to use the provided nextpageuri rather than incrementing through the pages by page number.
It then gives an example:
# Initialize Twilio Client
@client = Twilio::REST::Client.new(account_sid, auth_token)
@client.calls.list
  .each do |call|
    puts call.direction
  end
However, doing this just returns an array of all calls, there isn't any paging information or limiting of results or any "pages".
For my purposes I'm actually filtering the query like this:
@calls = @client.calls.list(
  start_time_after: @time,
  start_time_before: @another_time
)
Because my date filter range is a one-month period and there are currently about 4.5k calls to retrieve, it's taking quite a while to process (and sometimes it just never finishes).
I'm using the Twilio helper library Ruby gem "twilio-ruby" and running Ruby 2.5.
I've also tried using PHP with the respective Twilio helper library and have found the same result.
Using curl, however, does work and gives paging information; it's also incredibly fast compared to using the helper libraries.
Twilio developer evangelist here.
list will paginate through, loading all the resources it can.
There are other calls that will stream the API in a lazier fashion, if that is more useful for your use case. For example, you can use each and it will load the calls lazily until they have run out.
@client.calls.each(
  start_time_after: @time,
  start_time_before: @another_time
) do |call|
  puts call.direction
end
If you do want to paginate manually yourself, you can use the page method to get a CallPage object and iterate from there.
page = @client.calls.page(
  start_time_after: @time,
  start_time_before: @another_time
)
while !page.nil? do
  page.each { |call| puts call.direction }
  page = page.next_page
end
Let me know if that helps at all.

Skip part of execution on specific page

I have a plug that checks for data on every page, kind of like authentication. If they don't have at least one entry of data, they get redirected to a page to add an entry of the data. The problem is that it will keep redirecting them even if they are on that page. I still need to check the data on every page, but skip the redirect if it's on a specific page. I've tried pipelines, but it seemed overkill for what I need since I just need to skip one piece of the execution (the redirect). What's the best way to skip the redirect for a certain controller / action?
Plug
def add_default_team(conn, user, opts) do
  repo = Keyword.fetch!(opts, :repo)
  default_team = user.default_team_id

  if default_team do
    assign conn, :current_team, repo.get!(Team, default_team)
  else
    conn
    |> redirect(to: Helpers.team_path(conn, :new))
    |> halt()
  end
end
The route I'm trying to avoid redirecting on would be the team_path(conn, :new)
You can check the value of conn.request_path and skip processing if it matches a certain string, by adding this clause above the existing one:
def add_default_team(%Plug.Conn{request_path: "/team/new"} = conn, _user, _opts), do: conn
It would be better to use path_info though, since it removes consecutive and trailing slashes:
def add_default_team(%Plug.Conn{path_info: ["team", "new"]} = conn, _user, _opts), do: conn

How to not show extracted links and scraped items?

Newbie here, running Scrapy on Windows. How do I avoid showing the extracted links and crawled items in the command window? I found comments in the "parse" section of this link: http://doc.scrapy.org/en/latest/topics/commands.html, but I'm not sure if it's relevant and, if so, how to apply it. Here is more detail with part of the code, starting from my second Ajax request (in the first Ajax request, the callback function is "first_json_response"):
def first_json_response(self, response):
    try:
        data = json.loads(response.body)
        meta = {'results': data['results']}
        yield Request(url=url, callback=self.second_json_response,
                      headers={'x-requested-with': 'XMLHttpRequest'}, meta=meta)

def second_json_response(self, response):
    meta = response.meta
    try:
        data2 = json.loads(response.body)
        ...
The "second_json_response" is to retrieve the response from the requested result in first_json_response, as well as to load the new requested data. "meta" and "data" are then both used to define items that need to be crawled. Currently, the meta and links are shown in the windows terminal where I submitted my code. I guess it is taking up some extra time for computer to show them on the screen, and thus want them to disappear. I hope by running scrapy on a kinda-of batch mode will speed up my lengthy crawling process.
Thanks! I really appreciate your comment and suggestion!
From the Scrapy documentation:
"You can set the log level using the --loglevel/-L command line option, or using the LOG_LEVEL setting."
So append --loglevel=ERROR to your scrapy crawl command. That should make all the info disappear from your command line, but I don't think it will speed things up much.
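For example (a minimal illustration; "myspider" is a placeholder for your own spider name):

# On the command line:
#     scrapy crawl myspider --loglevel=ERROR
# or, equivalently, in your project's settings.py:
LOG_LEVEL = 'ERROR'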
In your pipelines.py file, try using something like:
import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
This way, when you yield an item from your spider class, it will be written to items.jl.
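Note that Scrapy only calls the pipeline if it is enabled in settings.py. A sketch, assuming your project module is called "myproject" (adjust the path to your actual project; in older Scrapy versions ITEM_PIPELINES was a plain list of class paths instead of a dict):

# settings.py: enable the pipeline (the number controls the order pipelines run in)
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}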
Hope that helps.

Ruby tweetstream stop unexpectedly

I use the tweetstream gem to get sample tweets from the Twitter Streaming API:
TweetStream.configure do |config|
  config.username = 'my_username'
  config.password = 'my_password'
  config.auth_method = :basic
end

@client = TweetStream::Client.new
@client.sample do |status|
  puts "#{status.text}"
end
However, this script stops printing out tweets after about 100 tweets (the script itself keeps running). What could be the problem?
The Twitter Search API sets certain limits for things that look arbitrary from the outside; from the docs:
GET statuses/:id/retweeted_by Show user objects of up to 100 members who retweeted the status.
From the gem, the code for the method is:
# Returns a random sample of all public statuses. The default access level
# provides a small proportion of the Firehose. The "Gardenhose" access
# level provides a proportion more suitable for data mining and
# research applications that desire a larger proportion to be statistically
# significant sample.
def sample(query_parameters = {}, &block)
  start('statuses/sample', query_parameters, &block)
end
I checked the API docs but don't see an entry for 'statuses/sample'; looking at the limit above, I'm assuming you've reached 100 of whatever statuses/xxx is being accessed.
Also, correct me if I'm wrong, but I believe Twitter no longer accepts basic auth and you must use OAuth. If that is so, then you are effectively unauthenticated, and the API will also limit you in other ways; see https://dev.twitter.com/docs/rate-limiting
Hope that helps.
OK, I made a mistake there: I was looking at the Search API when I should have been looking at the Streaming API (my apologies), but it's possible that some of the things I mentioned could be the cause of your problems, so I'll leave this up. Twitter has definitely moved away from basic auth, so I'd try resolving that first; see:
https://dev.twitter.com/docs/auth/oauth/faq
