Selenium search in google, then scan page if keyword exists - windows

1. I'm using Selenium to search for "sage release dates" in Google.
2. Then I want to scan the entire results page to see whether my search phrase "release date" exists anywhere in the results.
I'm reusing this search-pattern code from a previous project of mine, but that one used urllib, so I had to adjust the code slightly. It doesn't do what I want, and I'm stuck. Can somebody point me in the right direction?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re

# Version Alpha 3
#_______________________________________________________________________________

browser = webdriver.Chrome(executable_path=r"C:\Selenium_Drivers\chromedriver.exe")
browser.get('http://www.google.com')
input_element = browser.find_element_by_name('q')
input_element.send_keys('sage release dates')
# input_element.send_keys('Wolters Kluwer release dates')
input_element.submit()

'''
RESULTS_LOCATOR = '//div/h3/a'
WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.XPATH, RESULTS_LOCATOR)))
page1_results = browser.find_elements(By.XPATH, RESULTS_LOCATOR)
'''

page1_results = browser.find_elements_by_class_name('med')
for item in page1_results:
    print(item.text)

#..................................................
keywords = ['release date']
# sequence = page1_results.decode('utf-8', 'ignore')
sequence = page1_results
for k in keywords:
    pattern = '(?i)' + k
    keyword = re.search(pattern, str(sequence))
    if keyword:
        # print(keyword.group(0))
        print('k-1')
        print(k)
        print(keyword)
    else:
        print('k-2')
        print('-')
        print(k)
        print(keyword)
#..................................................
# browser.quit()

You can create a smarter XPath to check whether the search results contain elements with the keyword text ('sage release dates'). For example, check whether the entire results page has elements with any of the following texts:
result elements with text 'sage'
result elements with text 'sage release'
result elements with text 'release dates'
This way you can broaden your search. You can also modify the XPath if you don't want the additional filters.
If you want results that contain the text 'sage release dates', use this XPath:
//*[contains(text(), 'sage release dates')]
If you want results with the text 'release dates' only, use this XPath:
//*[contains(text(), 'release dates')]
Sample code snippet in Python (note that you need to instantiate the driver before using it):
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('http://www.google.com')
elem = driver.find_element_by_name("q")
elem.send_keys("sage release dates")
elem.submit()
allResults = driver.find_elements_by_xpath("//*[contains(text(), 'sage release dates') or contains(text(), 'sage') or contains(text(), 'release') or contains(text(), 'sage release')]")
releaseDateResults = driver.find_elements_by_xpath("//*[contains(text(), 'release date')]")
print(len(allResults))
print(len(releaseDateResults))
driver.quit()
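If the goal is just to test whether the phrase appears anywhere on the rendered results page, another option is to skip XPath and search the visible page text directly. A minimal sketch, assuming Selenium 3.x (where the find_element_by_* helpers still exist) and chromedriver on your PATH:

import re
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.google.com')
box = browser.find_element_by_name('q')
box.send_keys('sage release dates')
box.submit()

# The <body> text is the visible text of the whole results page.
page_text = browser.find_element_by_tag_name('body').text

for k in ['release date']:
    if re.search('(?i)' + re.escape(k), page_text):
        print(k, 'found')
    else:
        print(k, 'not found')

browser.quit()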

Related

How to calculate shap values for ADABoost model?

I am running 3 different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these 3 models.
I managed to use SHAP for GB and RF, but not for AdaBoost, where I get the following error:
Exception                                 Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model, data=explain_data.head(1000), model_output='probability')

/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
    110         self.feature_perturbation = feature_perturbation
    111         self.expected_value = None
--> 112         self.model = TreeEnsemble(model, self.data, self.data_missing)
    113
    114         if feature_perturbation not in feature_perturbation_codes:

/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
    752             self.tree_output = "probability"
    753         else:
--> 754             raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
    755
    756         # build a dense numpy version of all the tree objects

Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this link on GitHub that states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the TreeEnsemble constructor similar to the one for gradient boosting.
But I really don't know how to implement it, since I'm quite new to this.
I had the same problem, and what I did was modify the file mentioned in the GitHub issue you are quoting.
In my case I use Windows, so the file is at C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can also double-click the error message and the file will be opened.
The next step is to add another elif, as the answer in the GitHub issue says. In my case I did it from line 404, as follows:
1) Modify the source code.
...
    self.objective = objective_name_map.get(model.criterion, None)
    self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"):  # From this line on I modified the code
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line gets the decision criterion, for example gini.
    self.tree_output = "probability"  # This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"):  # TODO: add unit test for this case
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that for the other models, the shap code needs the 'criterion' attribute, which the AdaBoost classifier doesn't expose directly. In this case the attribute is obtained from the "weak" classifiers the AdaBoost model was trained with, which is why I added model.base_estimator_.criterion.
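To see the attribute difference this note describes, here is a quick illustrative check (my addition, not part of the patch; it assumes an older scikit-learn where base_estimator_ still exists):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)
ada = AdaBoostClassifier().fit(X, y)

print(hasattr(ada, 'criterion'))      # False: the ensemble has no direct attribute
print(ada.base_estimator_.criterion)  # 'gini': taken from the weak-learner template
print(ada.estimators_[0].criterion)   # the fitted weak learners expose it too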
Finally you have to import the library again, train your model, and get the shap values. I leave an example:
2) Import the library again and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
Which generates a SHAP summary bar plot.
3) Get your new results.
It seems that the shap package has been updated and still does not support the AdaBoostClassifier. Based on the previous answer, I've modified it to work with the current shap/explainers/tree.py file, at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and GitHub issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weighted_boosting.AdaBoostClassifier"]):
    assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
    self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
    self.input_dtype = np.float32
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line gets the decision criterion, for example gini.
    self.tree_output = "probability"  # This is the last line added
I'm also working on tests so this can be added to the package :)

xpath could not recognize predicate for a tag

I am trying to use Scrapy XPath to scrape a page, but it seems it cannot capture tags with predicates when I use a for loop:
# This package will contain the spiders of your Scrapy project
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json

class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[#name ='MTG_CLASSNAME$10']/text()"))
            n += 1
            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)
            class_name = response.xpath("//a[#name = classname]/text()")
I am looking for a tags whose name attribute is "MTG_CLASSNAME$" + str(n), with n being 0, 1, 2, ..., but I am getting empty output from my XPath query. Not sure why...
PS.
I am basically trying to scrape courses and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes
with filter applied: Kingsborough CC, fall 18, BIO
Thanks!
Well... I visited the website you put in the question description, used element inspection, searched for "MTG_CLASSNAME", and got 0 matches...
So I will give you some tools:
In your settings.py set that:
LOG_FILE = "log.txt"
LOG_STDOUT=True
then print the response body (response.body) where appropriate (at the top of the parse_page function in this case) and search for it in log.txt.
Check there whether what you are looking for is present.
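A minimal sketch of that debugging step (my addition; the local file path is the one from the question):

import scrapy

class DebugSpider(scrapy.Spider):
    name = "debug-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        # Dump the raw HTML Scrapy actually received; with LOG_STDOUT = True
        # this ends up in log.txt, where you can search for MTG_CLASSNAME.
        print(response.body)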
If it is, use https://www.freeformatter.com/xpath-tester.html (or similar) to check your XPath statement.
In addition, change for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]"):
to for section in response.xpath("//a[contains(#name,'MTG_CLASS_NBR')]").extract():; this will raise an error once you actually get the data you are looking for.
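As an aside (my addition, not part of the original answer): the question's predicates use #name, which is not valid XPath 1.0 and is likely what the title's "could not recognize predicate" error refers to; the attribute axis is written @name. The last line of the question's code also passes the Python variable name literally inside the XPath string. A hedged sketch of both fixes, runnable against a tiny inline document:

from scrapy.selector import Selector

doc = '<a name="MTG_CLASSNAME$0">BIO 101</a>'
sel = Selector(text=doc)

n = 0
classname = "MTG_CLASSNAME$" + str(n)
# Interpolate the Python variable into the XPath string; note @name,
# the attribute axis, rather than #name or a bare identifier.
result = sel.xpath("//a[@name = '{}']/text()".format(classname)).extract_first()
print(result)  # BIO 101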

Reading Keystrokes and Placing into Textbox

I am a teacher who is writing a program to read 8-digit ID barcodes for students who are late to school. I am an experienced programmer, but new to Python and very new to Tkinter (about 36 hours of experience). I have made heavy use of this site so far, but I have been unable to find the answer to this question:
How can I read exactly 8 digits and display those 8 digits in a textbox immediately? I can do 7, but can't seem to get to 8; sometimes I get nothing in the text box. I have used Entry and bind, and everything works OK except that I can't consistently get the keys read in the bind event placed into the textbox. The ID always seems to be correct when I print it, but it is not correct in the textbox: the tkinter window shows only 7 digits, or nothing, in the text box upon completion.
Here is a snippet of my code, that deals with the GUI
from tkinter import *
from collections import Counter
import time

i = 0

class studentNumGUI():
    def __init__(self, master):
        master.title("Student ID Reader")
        self.idScanned = StringVar()
        localTime = time.asctime(time.localtime(time.time()))
        self.lblTime = Label(master, text=localTime)
        self.lblTime.pack()
        self.lbl = Label(master, text="Enter Student ID:")
        self.lbl.pack()
        self.idScanned.set("")
        self.idScan = Entry(master, textvariable=self.idScanned, width=12)
        self.idScan.pack()
        self.frame = Frame(width=400, height=400)
        self.frame.pack()
        self.frame.focus()
        self.frame.bind('<Key>', self.key)

    def key(self, event):
        global i
        self.frame.focus()
        self.idScan.insert(END, event.char)
        print(repr(event.char), " was pressed")  # just to make sure that my keystrokes are accepted
        if (i < 7):
            i += 1
        else:
            # put my other python function calls here once I fix my problem
            self.frame.after(2000)
            # self.idScan.delete(0, END)  # Then go blank for the next ID to be read
            i = 0

root = Tk()
nameGUI = studentNumGUI(root)
root.mainloop()
You are doing some unusual things in order to place text inside the Entry field based on keypresses. I've changed your code so that it sets focus on the Entry widget and checks the contents of the Entry field each time a key is pressed (while the Entry has focus). It gets the contents of the Entry field and checks whether the length is less than 8; if it is 8 (or greater), it clears the box.
How does this work for you?
I've left the commented-out code in place.
from tkinter import *
from collections import Counter
import time

class studentNumGUI():
    def __init__(self, master):
        master.title("Student ID Reader")
        self.idScanned = StringVar()
        localTime = time.asctime(time.localtime(time.time()))
        self.lblTime = Label(master, text=localTime)
        self.lblTime.pack()
        self.lbl = Label(master, text="Enter Student ID:")
        self.lbl.pack()
        self.idScanned.set("")
        self.idScan = Entry(master, textvariable=self.idScanned, width=12)
        self.idScan.pack()
        self.idScan.focus_set()
        self.frame = Frame(width=400, height=400)
        self.frame.pack()
        #self.frame.focus()
        #self.frame.bind('<Key>', self.key)
        self.idScan.bind('<Key>', self.key)

    def key(self, event):
        #self.frame.focus()
        #self.idScan.insert(END, event.char)
        print(repr(event.char), " was pressed")  # just to make sure that my keystrokes are accepted
        if len(self.idScanned.get()) < 8:
            pass
        else:
            # put my other python function calls here once I fix my problem
            self.idScan.delete(0, END)  # Then go blank for the next ID to be read
            #self.frame.after(2000)

root = Tk()
nameGUI = studentNumGUI(root)
root.mainloop()
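One subtlety worth knowing (my addition, not part of the original answer): a <Key> binding on an Entry fires before Tkinter's default class binding inserts the character, so get() still returns the old contents at that moment, which is one way to end up one character behind. Binding <KeyRelease> instead fires after the insertion. A minimal sketch:

from tkinter import *

root = Tk()
var = StringVar()
entry = Entry(root, textvariable=var, width=12)
entry.pack()
entry.focus_set()

def on_key_release(event):
    # <KeyRelease> fires after the character has been inserted,
    # so get() already includes the key just typed.
    if len(var.get()) >= 8:
        print("ID read:", var.get())
        entry.delete(0, END)  # clear for the next scan

entry.bind('<KeyRelease>', on_key_release)
root.mainloop()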

Scrapy: How to get a correct selector

I would like to select the following text:
Bold normal Italist
I need to select and get: Bold normal Italist.
The html is:
<a><strong>Bold</strong> normal <i>Italist</i></a>
However, a/text() yields only
normal
Does anyone know a fix? I'm testing Bing crawling, and the bold text is in a different position depending on the query.
You can use a//text() instead of a/text() to get all text items.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector

doc = """
<a><strong>Bold</strong> normal <i>Italist</i></a>
"""
sel = Selector(text=doc, type="html")

result = sel.xpath('//a/text()').extract()
print(result)
# >>> [' normal ']

result = ''.join(sel.xpath('//a//text()').extract())
print(result)
# >>> Bold normal Italist
You can also try
string(a)
or
normalize-space(a)
either of which returns Bold normal Italist
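A quick check of that second answer with Scrapy's Selector (my addition; assumes the same <a> wrapper as above):

from scrapy.selector import Selector

doc = '<a><strong>Bold</strong> normal <i>Italist</i></a>'
sel = Selector(text=doc, type="html")

# string(//a) and normalize-space(//a) both flatten the element's text;
# normalize-space also collapses and trims whitespace.
print(sel.xpath('string(//a)').extract_first())           # Bold normal Italist
print(sel.xpath('normalize-space(//a)').extract_first())  # Bold normal Italist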

Pygal bar chart says "No Data"

I am trying to create a bar graph in pygal that uses the GitHub API to chart the most popular Python projects by stars. I posted my code below, but I cannot figure out why my graph keeps saying "No Data". Any suggestions? Thanks!
import requests
import pygal
from pygal.style import LightColorizedStyle as LCS, LightenStyle as LS

url = 'https://api.github.com/search/repositories?q=language:python&sort=stars'
r = requests.get(url)
print("Status code:", r.status_code)

response_dict = r.json()
print('Total repositories:', response_dict['total_count'])
repo_dicts = response_dict['items']

names, stars = [], []
for repo_dict in repo_dicts:
    names.append(repo_dict['name'])
    stars.append(repo_dict['stargazers_count'])

my_style = LS('#333366', base_style=LCS)
chart = pygal.Bar(style=my_style, x_label_rotation=45, show_legend=False)
chart.title = 'Most Starred Python Projects on GitHub'
chart.x_labels = names
chart.add = ('', stars)
chart.render_to_file('python_repos.svg')
On the second-to-last line of your code, chart.add = ('', stars), there should not be an '=' equals sign; it should be chart.add('', stars). Then the code should work! :)
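To make the difference concrete, here is a tiny self-contained sketch (my addition; the labels and values are made up):

import pygal

chart = pygal.Bar()
chart.title = 'Most Starred Python Projects on GitHub'
chart.x_labels = ['repo-a', 'repo-b']  # hypothetical names
chart.add('', [100, 50])               # correct: call the method to add a data series
# chart.add = ('', [100, 50])          # wrong: overwrites the method, so the chart has no data
chart.render_to_file('python_repos.svg')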
