Python Selenium web scraping multiple pages (increment page number in URL) - XPath

I have little to no programming experience and started learning Python two weeks ago. Getting it running under Windows was a pain (environment variables and similar things I had never heard of until two weeks ago).
I am using Selenium to try to scrape information from the web.
Basically, the URLs of the pages (which are mostly JavaScript) change incrementally,
e.g.
http://sssss.proxy.sssss/sample#/detail/0/1
http://sssss.proxy.sssss/sample#/detail/0/2
http://sssss.proxy.sssss/sample#/detail/1/1
http://sssss.proxy.sssss/sample#/detail/1/2
http://sssss.proxy.sssss/sample#/detail/2/1
http://sssss.proxy.sssss/sample#/detail/2/2
http://sssss.proxy.sssss/sample#/detail/50/1
http://sssss.proxy.sssss/sample#/detail/50/2
I want to programmatically and systematically scrape specific content (found by XPath) from each page and its subpages. However, I don't know how a loop (e.g. for i in a range) works in this case, because every URL has to be fetched with the browser web driver's get. Does that mean the loop will make the browser open every single page, or will it happen within the Python shell, as with the requests package?
There are methods for scraping content from a URL that is fixed, but in my case the URL changes, so I assume it has to be done differently.
Please enlighten me.
With thanks,
Iverson
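The loop asked about above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the URL pattern is taken from the examples in the question, the ranges 0 to 50 and subpages 1 and 2 are read off those examples, and the XPath in main() is a hypothetical placeholder. One browser instance is reused, and each driver.get() navigates that same browser to the next URL.

```python
from itertools import product

# URL pattern copied from the question; the {page}/{sub} placeholders are ours.
URL_TEMPLATE = "http://sssss.proxy.sssss/sample#/detail/{page}/{sub}"


def detail_urls(last_page=50, subpages=(1, 2)):
    """Yield .../detail/0/1, .../detail/0/2, ... up to .../detail/<last_page>/2."""
    for page, sub in product(range(last_page + 1), subpages):
        yield URL_TEMPLATE.format(page=page, sub=sub)


def scrape_pages(driver, urls, xpath):
    """Visit each URL in the same browser and collect the matching text."""
    results = {}
    for url in urls:
        driver.get(url)  # the one browser navigates; no new browser per page
        # "xpath" is the value of Selenium's By.XPATH locator constant.
        elements = driver.find_elements("xpath", xpath)
        results[url] = [element.text for element in elements]
    return results


def main():
    # Imported here so the helpers above can be tried without a browser.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.implicitly_wait(15)  # give the JavaScript time to render elements
    try:
        # Hypothetical XPath; replace it with the one for the content you need.
        data = scrape_pages(driver, detail_urls(), "//div[@class='content']")
        for url, texts in data.items():
            print(url, texts)
    finally:
        driver.quit()

# Call main() to run this against a real browser.
```

Because only the part after the # changes between these URLs, the page may update in place rather than reload, so an explicit wait for the new content might be needed before reading the elements.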

Related

Scrape a website using Selenium and Tor with python 3 on Windows 10

I know that there are many threads talking about this, and I've tried many of the suggested solutions, but nothing seems to work. I'm going to be very specific so that you can help me!
I'm trying to scrape a website using Selenium in Python 3 on Windows 10. The website blocks me after a certain number of requests, so what I've read is that if I use Tor with the Selenium web driver I can ask Tor for a new identity (which means a different IP) every given number of requests.
The following code lets me do the scraping I want, using the Tor Firefox profile in the Tor Browser folder. The only thing missing from this code is that I haven't been able to request a new identity (new IP).
from selenium import webdriver

profiler = webdriver.FirefoxProfile(r"C:\Users\Samir\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default")
profiler.set_preference("network.proxy.type", 1)
profiler.set_preference("network.proxy.socks", "127.0.0.1")
profiler.set_preference("network.proxy.socks_port", 9050)
driver = webdriver.Firefox(firefox_profile=profiler)
driver.implicitly_wait(15)
driver.get("The URL I want to scrape")
# Extract whatever information I want from the URL
I tried to use the Stem library to get a new identity, but this does not seem to work with the Tor Firefox profile from the previous code. However, it works fine if I just open the browser by double-clicking the Tor Browser shortcut icon that is created when I install Tor.
# This is the code that gets a new identity using Stem. As I said,
# this does not work with the Selenium Firefox profile for Tor.
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
Okay, so to wrap up: is there a way to get a new IP with the previous code I showed? Or what can I do to achieve what I want using Python 3, Selenium and Tor on Windows 10, plus any other library or whatever else is necessary?
If you have questions or need more information to help me, just let me know.
Thanks a lot!!
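The "new identity every specific number of requests" idea described above could be sketched like this. It is only an illustration under assumptions: it presumes a tor process whose control port is 9051 (as in the Stem snippet) and that the Selenium browser is routed through that same tor process. It does not diagnose why the original combination failed. The function names and the `every` and `extract` parameters are hypothetical.

```python
import time


def renew_tor_identity(control_port=9051, password=None):
    """Ask a running tor process for a new identity via its control port."""
    # Imported here so the rotation loop below can be read without Stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)
        # tor rate-limits NEWNYM; pause until it would accept the next signal.
        time.sleep(controller.get_newnym_wait())


def scrape_with_rotation(driver, urls, extract, renew=renew_tor_identity, every=20):
    """Load each URL, call extract(driver), and call renew() every `every` pages."""
    results = []
    for count, url in enumerate(urls, start=1):
        driver.get(url)
        results.append(extract(driver))
        if count % every == 0:
            renew()
    return results
```

Passing the renew step in as a function keeps the loop independent of how the identity is actually changed, so it can be swapped for another approach if the control port is not reachable.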

Is there a way to ignore page load when running Selenium Cucumber

Hi, is there any way to ignore the page load when running Selenium Cucumber? It always fails my test, and I just want to check whether the content is present or not.
Please don't say add sleep.
The issue I'm having is that the content is present, but the test always waits for the page to be fully loaded, and sometimes it gets stuck on an API call to a third-party company.

Here are some approaches you could try:
Change your driver and use webkit. Set up webkit so that it does not load external links. See http://robots.thoughtbot.com/speed-up-javascript-capybara-specs-by-blacklisting-urls
Make sure you understand and use the has_no methods if you are testing that something is not present, e.g. use
expect(page).to have_no_css '.test' # fast
rather than
expect(page).to !have_css('.test') # slow will always wait until timeout
Change the default timeout to something shorter (perhaps only for this scenario, using a tag)

Struts2 and Ajax (displaying a dynamic value on a JSP)

I'm pretty new to Struts2 and Ajax. I have a drop-down menu in a JSP, say first.jsp. When the user selects a choice from the drop-down menu, I call a method of the Action class, say Method1. In this method I fetch some values from the DB (say a, b, c) and one value from Java memory, say d. Then I forward to second.jsp and display all the parameters (a, b, c and d) in tabular format.
Now the problem is that the parameter d is dynamic: it is updated by some other application, and if it changes I have to show the new value on the JSP without any user action.
One solution I use is a periodic refresh in second.jsp, so that every 10 seconds Method1 is called again, fetches the values (a, b, c) from the DB and the updated value of d from Java memory, and displays them in second.jsp. But in this case I am unnecessarily retrieving values from the DB when my purpose is just to get the value d from memory. This works, but it makes my application slower.
Can anybody suggest some other solution? Can I do it using Ajax, and if so, how?
Any other advice? Any help is appreciated. I can try to be more clear if needed. I'm out of ideas on this problem, even though it sounds like a classic; I have spent hours playing around with it but have got nowhere.

Okay... What you're asking is a little fuzzy so let me rephrase:
You have a user (USER1) who opens a web page and sees some data.
You have a second user (USER2) (who may be an application) who is able to set a value from time to time.
When USER2 updates that value you want USER1 to see it change in their open browser window?
If this is the case you need to understand basic ajax. For that get these demo applications working:
This example uses dojo and perhaps the S2 ajax tag lib; I don't remember. I prefer not to use the ajax tags (as they are deprecated) and prefer jquery for ajax:
http://struts.apache.org/2.x/docs/struts-2-spring-2-jpa-ajax.html
This example here shows a very similar application but using jquery, no tag library, upgraded to Spring 3, it still needs polish:
http://www.kenmcwilliams.com/Downloads/
Now that you know how to get data via ajax, look at the request with firebug. You'll see that the request is just like a typical function call, the browser keeps waiting for the data to come back.
What you do is simply not return from the action until new data is provided. This is called long polling; see: http://en.wikipedia.org/wiki/Comet_%28programming%29#Ajax_with_long_polling
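To make the "do not return until new data is provided" idea concrete, here is a small language-neutral sketch written in Python rather than as a Struts2 action. It only illustrates the waiting logic on the server side; the class name, method names and the version counter are hypothetical, and a real action would call something like wait_for_change from its request handler.

```python
import threading


class LatestValue:
    """Holds a value (like d) and lets request handlers block until it changes."""

    def __init__(self, value=None):
        self._cond = threading.Condition()
        self._value = value
        self._version = 0  # incremented on every update

    def set(self, value):
        """Called by the producer (USER2) when the value changes."""
        with self._cond:
            self._value = value
            self._version += 1
            self._cond.notify_all()  # wake every waiting request

    def wait_for_change(self, seen_version, timeout=30.0):
        """Block until the version differs from seen_version or timeout expires.

        Returns (value, version); the client sends the version back with its
        next request so it only wakes up for updates it has not seen yet.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: self._version != seen_version, timeout=timeout
            )
            return self._value, self._version
```

The timeout keeps a request from hanging forever; when it expires the client simply receives the unchanged value and version and asks again.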
If you have not written a simple chat program using just terminal windows, I recommend you do so. You'll need two windows per client (client-send and client-receive windows) and a server program. I remember hacking one together in a few hours using Thinking In Java 2nd Edition (later editions took out the networking section, if I remember correctly). Anyway, understanding client-server interaction and long polling will let you get things working. It would be fun to extend the simple terminal-based chat application to an S2 ajax chat application. It would make an awesome tutorial! PS: This is just an application of the producer/consumer problem (if you understand that, then I guess you don't need to do the fun exercise).
The interfaces would look very pretty if the server was managed by spring. I know there must be nice servers already written but I am not familiar with any, but would love to hear of one.

How do I log in to a site remotely using a script?

I'm trying to write a script to automate a repetitive task I currently do manually. I would like to log in to a site remotely and inspect the returned page.
I've been doing this by sending a direct POST request (the site is PHP, I'm pretty sure it's Joomla) with my login details and data for the other fields of the form from the front page, but I'm getting either sockaddrinfo errors from the Net::HTTP Ruby library when I try an HTTP.post() (with data as a param=val1&param2=val2 string), or a rejected redirect to the home page if I use HTTP.post_form (using a Hash).
I'm willing to do this in any language, really, I just picked Ruby since it's a favorite for quick scripting. If you have any ideas in bash, Python, etc. I'd be happy to try it.
I've tried variations on some examples, to no avail. Have any of you tried something like this with success? Any stumbling blocks we beginners run into frequently?
Thanks for your time ^_^
-Paul

Try mechanize:
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html

Have a look at mechanize (Python) which is written with your problem in mind:
import re
from mechanize import Browser
br = Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
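For comparison, the direct POST described in the question can be sketched with only the Python standard library. This is a rough sketch under assumptions: the field names used in the test or in your own call (for example "username" and "passwd") are hypothetical, and the real form may contain additional hidden fields that have to be read from the front page first. The cookie-aware opener keeps the session cookie between the request for the form and the login POST.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(login_url, fields):
    """Encode the form fields as param=val1&param2=val2 and wrap them in a POST."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


def login(opener, login_url, fields):
    """Send the login form and return the HTML of the page that comes back."""
    with opener.open(build_login_request(login_url, fields)) as response:
        return response.read().decode("utf-8", errors="replace")
```

Fetching the front page with the same opener before calling login() gives the site a chance to set its session cookie, which a plain one-off POST would not carry.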

How can a bookmarklet access a Firefox extension (or vice versa)

I have written a Firefox extension that catches when a particular URL is entered and does some stuff. My main app launches Firefox with this URL. The URL contains sensitive information so I don't want it being stored in the history.
I'm concerned about the case where the extension is not installed. If it's not installed and Firefox gets launched with the sensitive URL, it will get stored in history and there's nothing I can do about it. So my idea is to use a bookmarklet.
I will launch Firefox with "javascript:window.location.href='pleaseinstallthisplugin.html'; sensitiveinfo='blahblah'".
If the extension is not installed they will get redirected to a page that tells them to install it and the sensitive info won't get stored in the history. If the extension IS installed it will grab the information in the sensitiveinfo variable and do its thing.
My question is, can the bookmarklet call a method in the extension to pass the sensitive info (and if so, how) or can the extension catch when javascript is being called in the bookmarklet?
How can a bookmarklet and Firefox extension communicate?
p.s. The alternative means of getting around this situation would be for my main app to launch Firefox and communicate with the extension using sockets but I am loath to do that because I've run into too many issues over the years with users with crazy firewalls blocking socket communication. I'd like to do everything without sockets if possible.

As far as I know, bookmarklets can never access chrome files (extensions).

Bookmarklets are executed in the scope of the current document, which is almost always a content document. However, if you are passing it in via the command line, it seems to work:
/Applications/Namoroka.app/Contents/MacOS/firefox-bin javascript:alert\(Components\)
Accessing Components would throw if it was not allowed, but the alert displays the proper object.

You could use unsafeWindow to inject a global. You can add a mere property so that your bookmarklet only needs to detect whether the global is defined or not, but you should know that, as far as I know, there is no way to prohibit sites in a non-bookmarklet context from also sniffing for this same global (since it may be a privacy concern to some that sites can detect whether they are using the extension). I have confirmed in my own add-on which injects a global in a manner similar to that below that it does work in a bookmarklet as well as regular site context.
If you register an nsIObserver, e.g., where content-document-global-created is the topic, and then unwrap the subject, you can inject your global (see this if you need to inject something more sophisticated like an object with methods).
Here is some (untested) code which should do the trick:
var observerService = Cc['@mozilla.org/observer-service;1'].getService(Ci.nsIObserverService);
observerService.addObserver({observe: function (subject, topic, data) {
    // Unwrap the new content window and expose a flag the bookmarklet can detect.
    var unsafeWindow = XPCNativeWrapper.unwrap(subject);
    unsafeWindow.myGlobal = true;
}}, 'content-document-global-created', false);
See this and this if you want an apparently easier way in an SDK add-on (not sure whether SDK postMessage communication would work as an alternative but with the apparently same concern that this would be exposed to non-bookmarklet contexts (i.e., regular websites) as well).
