Viewing a table on a web page in Firefox, this is the XPath selector it gives me:
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tbody/x:tr[2]/x:td[2]/x:a
So I remove /x:tbody, because that was added by Firefox. But how do I generalise this to get all the links in the table that share the same base XPath? The only obvious difference is that tr increases by 1 for each link in the table.
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[3]/x:td[2]/x:a
If there are successive tables of links on the page, the only difference to me appears to be that the div index increases from 1 to 2. So a link in the second table is:
id('ls-page')/x:div[5]/x:div[2]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
/x:div[5]/x:div[1]
becomes
/x:div[5]/x:div[2]
1) Is there a method or process to use to generalise an XPath selector?
2) For each table, do I have to create two separate generalised functions, one to retrieve the tables and one to retrieve the links from each table?
Note I am referring to this site: live nrl stats. I have been reading the Scrapy and BeautifulSoup documentation, but am open to any suggestions regarding tooling as I am just learning.
XPath is a query language; I don't know of any automated means of generalising queries. It's something you have to work out for yourself based on the document structure.
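In this case the manual generalisation is to drop the positional predicates that pin the path down to one cell and to anchor the query on something stable instead, such as the table's class attribute (the tablel class below is taken from the page's markup, as used in the working example that follows):
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
becomes
//table[@class="tablel"]/tr/td[2]/a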
My preferred library is lxml.etree. Here's a simple working example of a query that should return all of the match links.
I've saved the html to the working directory to avoid hitting the website frequently while testing.
from lxml import etree
import os

local_file = 'season2012.html'
url = "http://live.nrlstats.com/nrl/season2012.html"

# cache the page locally so repeated test runs don't hit the website
if not os.path.exists(local_file):
    from urllib2 import urlopen
    data = urlopen(url).read()
    with open(local_file, 'w') as f:
        f.write(data)
else:
    with open(local_file, 'r') as f:
        data = f.read()

doc = etree.HTML(data)

# every link in the second cell of each row of every table with class "tablel"
for link in doc.xpath('//table[@class="tablel"]/tr/td[2]/a'):
    print "%s\t%s" % (link.attrib['href'], link.text)
Yielding:
/matches/nrl/match15300.html Melbourne v Newcastle
/matches/nrl/match15291.html Brisbane v St George Illawarra
/matches/nrl/match15313.html Penrith v Cronulla
/matches/nrl/match15312.html Parramatta v Manly
/matches/nrl/match15311.html Sydney Roosters v Warriors
[truncated]
I'd suggest working with the ElementTree object (doc in this example) in the interactive Python interpreter to test your queries, and having a look at other XPath questions and answers on SO for working query examples to aid your learning.
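As for your second question, you don't necessarily need two separate functions: iterate over the table elements first and then run a relative XPath query on each one. A minimal sketch, reusing the doc object from the example above:

for i, table in enumerate(doc.xpath('//table[@class="tablel"]')):
    print("Table %d" % (i + 1))
    for link in table.xpath('tr/td[2]/a'):
        print("  %s\t%s" % (link.attrib['href'], link.text))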
I used au.affiliation_history to get the affiliation history from a list of author IDs. It worked great, but now I am trying to pair each affiliation in that history with the years in which the researcher served at the corresponding institution.
However, I cannot find a way to do this. Is it possible? If so, can you please give me a hint or an idea of how I can achieve this?
Unfortunately, the information Scopus shares on an author profile on scopus.com is not the same as what it shares via the Author Retrieval API. I think the only way to get yearly affiliations is to extract them from the publications that you get from the Scopus Search API.
from collections import defaultdict
from pybliometrics.scopus import ScopusSearch

AUTHOR = "7004212771"

q = f"AU-ID({AUTHOR})"
s = ScopusSearch(q)

yearly_affs = defaultdict(lambda: list())
for pub in s.results:
    year = pub.coverDate[:4]
    auth_idx = pub.author_ids.split(";").index(AUTHOR)
    affs = pub.author_afids.split(";")[auth_idx].split("-")
    yearly_affs[year].extend(affs)
yearly_affs then contains, for each year, a list of all affiliations recorded in that year's publications.
Naturally, the list will contain duplicates. If you don't like that, use set() and update() instead.
The .split("-") part for affs handles multiple affiliations (when the researcher reports more than one affiliation on a paper). You might want to use only the first reported affiliation instead; then use [0] at the end of that line and append() on the following line.
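For illustration, the same loop with both tweaks applied (a set per year for de-duplication and only the first reported affiliation per paper):

yearly_affs = defaultdict(set)
for pub in s.results:
    year = pub.coverDate[:4]
    auth_idx = pub.author_ids.split(";").index(AUTHOR)
    first_aff = pub.author_afids.split(";")[auth_idx].split("-")[0]
    yearly_affs[year].add(first_aff)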
Also, there will likely be gaps. I recommend turning yearly_affs into a pandas DataFrame, selecting the main affiliation for each year, and then filling the gaps forward or backward.
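A rough sketch of that post-processing (one possible way, starting from the list-based yearly_affs of the first snippet; it uses a Series rather than a full DataFrame and simply takes the most frequently recorded affiliation per year as the main one):

from collections import Counter
import pandas as pd

# most frequently recorded affiliation per year
main_aff = pd.Series({year: Counter(affs).most_common(1)[0][0]
                      for year, affs in yearly_affs.items()},
                     name="affiliation").sort_index()

# cover the full year range, then fill gaps forward and backward
all_years = [str(y) for y in range(int(main_aff.index.min()),
                                   int(main_aff.index.max()) + 1)]
main_aff = main_aff.reindex(all_years).ffill().bfill()
print(main_aff)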
I have a few intents in my training set (nlu_data.md file) with a sufficient number of training examples under each intent.
The following is an example:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
I have added multiple sentences like this.
At the time of testing, all sentences in the training file work fine. But if an input query has a spelling mistake, e.g. hotol/hetel/hotele for the keyword hotel, then Rasa NLU is unable to extract it as an entity.
I want to resolve this issue.
I am allowed to change only the training data, and I am also restricted from writing any custom component for this.
To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please
Once you've added enough examples, the model should be able to generalise from the sentence structure.
If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be part of the default pipeline described on this page.
One thing I would highly suggest is using look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when the entity exists in your look-up table (searching for typo variations of those entities). There's a whole blog post about it on the Rasa blog.
There's a working implementation of fuzzywuzzy as a custom component:
import json
import os

from fuzzywuzzy import process
# the Component base class import depends on your Rasa version, e.g.:
# from rasa.nlu.components import Component


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):
        entities = list(message.get('entities') or [])

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')
        for token in tokens:
            # STOP_WORDS is just a dictionary of stop words from NLTK
            if token.text not in STOP_WORDS:
                fuzzy_results = process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value'] if isinstance(a, dict) else a,
                    limit=10)
                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)
I didn't implement it myself; it was implemented and validated here: Rasa forum.
Then you just add it to your NLU pipeline in the config.yml file.
It's a strange request that they ask you not to change the code or write custom components.
The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:
##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please
This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. Generate misspelled words (typos).
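Purely for illustration, a rough sketch of what such a generator could look like (the function and its substitution rules are made up for this example, not taken from any particular tool):

import random

def typo_variants(word, n=5):
    # generate up to n simple typo variants (swap, drop or double one character)
    variants = set()
    attempts = 0
    while len(variants) < n and attempts < 100:
        attempts += 1
        i = random.randrange(len(word) - 1)
        kind = random.choice(["swap", "drop", "double"])
        if kind == "swap":
            variant = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        elif kind == "drop":
            variant = word[:i] + word[i + 1:]
        else:
            variant = word[:i] + word[i] + word[i:]
        if variant != word:
            variants.add(variant)
    return variants

# emit extra training lines, mapping each typo back to the canonical entity value
for variant in typo_variants("hotel"):
    print(f"- looking for a [{variant}](place:hotel) in Mumbai")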
First of all, add samples for the most common typos for your entities, as advised here.
Beyond this, you need a spellchecker.
I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Otherwise, dealing with only training data is not feasible: you can't create samples for every possible typo.
Using fuzzywuzzy is one of the ways; generally it is slow, and it doesn't solve all the issues.
Universal Encoder is another solution.
There should be more options for spell correction, but you will need to write code in any case.
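As a sketch of what that code could look like (assuming the pyspellchecker package here; any other spellchecker would slot in the same way), a preprocessing step could normalise the text before it reaches the NLU pipeline:

from spellchecker import SpellChecker  # assumption: pyspellchecker is installed

spell = SpellChecker()

def correct_text(text):
    # correct each token independently; crude, but enough to map "hotol" -> "hotel"
    return " ".join(spell.correction(word) or word for word in text.split())

print(correct_text("find a good hotol for me in Mumbai"))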
I'm trying to scrape the match statistics of a football game played yesterday, at the following url:
https://www.flashscore.com/match/8S0QVm38/#match-statistics;0
I've written code just to have WebDriver select the stats I want and print them for me, so I can then see what I want to use. My code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get("https://www.flashscore.com/match/8S0QVm38/#match-statistics;0")
print(browser.find_elements_by_class_name("statText--homeValue"))
A list of elements is printed out and, to be honest, I don't know if this is what I was looking for, because what is returned doesn't show anything I can identify with what I'm looking at in the developer tools.
I'm trying to get all the numbers under statistics like Possession and shots on target, but print returns a list of xpaths like this, where the session is the same but the element is always different:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="c53f5f3e-2c89-b34c-a639-ab50fbbf0c33")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="3e422b45-e26d-de44-8994-5f9788462ec4")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="9e110a54-4ecb-fb4b-9d8f-ccd1b210409d")>, <
Anyone know why this is and what I can do to get the actual numbers?
What you're getting are not XPaths, but a list of WebElement objects. To get the text from each, try
print([node.text for node in browser.find_elements_by_class_name("statText--homeValue")])
You have printed the element objects instead of their actual contents. For that you have to use .text on each element, like:
elements = browser.find_elements_by_class_name("statText--homeValue")
for element in elements:
    print(element.text)
You can also opt for the list comprehension approach shown in Andersson's answer.
Hope this helps! Cheers!
I am using PRAW to scrape data off of Reddit. I am using the .search method to run very specific searches. I can easily print the title of a submission if the keyword is in the title, but if the keyword is only in the text of the submission nothing pops up. Here is the code I have so far.
import praw
reddit = praw.Reddit(----------)
alls = reddit.subreddit("all")
for submission in alls.search("Yoa ming", sort = comment, limit = 5):
    print(submission.title)
When I run this code I get:
Yoa Ming next to Elephant!
Obama's Yoa Ming impression
i used to yoa ming... until i took an arrow to the knee
Could someone make a rage face out of our dearest Yoa Ming? I think it would compliment his first one so well!!!
If you search Yoa Ming on Reddit, there are posts that don't contain "Yoa Ming" in the title but do contain it in the text, and those are the posts I want.
Thanks.
You might need to update the version of PRAW you are using. Using v6.3.1 yields the expected outcome and includes submissions that have the keyword in the body but not in the title.
Also, the sort=comment parameter should be sort='comments'. Using an invalid value for sort will not throw an error but it will fall back to the default value, which may be why you are seeing different search results between your script and the website.
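So the search call would look like this (a sketch based on your snippet, with only the sort value changed):

for submission in alls.search("Yoa ming", sort='comments', limit=5):
    print(submission.title)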
We have a post-analysis requirement: for a specific post, we need to return a list of the posts most related to it, where the logic is comparing the count of common tags between posts. For example:
postA = {"author": "abc",
         "title": "blah blah",
         "tags": ["japan", "japanese style", "england"],
         }
there may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
so basically, postB gets a count of 2 and postC gets a count of 1 when compared to the tags in postA. postD gets 0 and will not be included in the result.
My understanding for now is to use map/reduce to produce the result. I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way, like a custom sorting function, to work it out? I'm currently using pymongo as I'm a Python developer.
You should create an index on tags:
db.posts.ensure_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({'_id': {'$ne': postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by the size of the tag intersection in Python:
key = lambda post: len([tag for tag in post['tags'] if tag in postA['tags']])
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately there's no way to sort and limit by the size of the intersection using Mongo itself.