Scrapy response.xpath not returning anything for a query - shell

I am using the Scrapy shell to extract some text data. Here are the commands I gave in the shell:
$ scrapy shell "http://jobs.parklandcareers.com/dallas/nursing/jobid6541851-nurse-resident-cardiopulmonary-icu-feb2015-nurse-residency-requires-contract-jobs"
>>> response.xpath('//*[@id="jobDesc"]/span[1]/text()')
[<Selector xpath='//*[@id="jobDesc"]/span[1]/text()' data=u'Dallas, TX'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[2]')
[<Selector xpath='//*[@id="jobDesc"]/span[2]/p/text()[2]' data=u'Responsible for attending assigned nursi'>]
>>> response.xpath('//*[@id="jobDesc"]/span[2]/p/text()[preceding-sibling::*="Education"][following-sibling::*="Certification"]')
[]
The third command is not returning any data. I was trying to extract the data between two keywords in that command. Where am I wrong?

//*[@id="jobDesc"]/span[2]/p/text() would return you a list of text nodes. You can filter the relevant nodes in Python. Here's how you can get the text between the "Education/Experience:" and "Certification/Registration/Licensure:" paragraphs:
>>> result = response.xpath('//*[@id="jobDesc"]/span[2]/p/text()').extract()
>>> start = result.index('Education/Experience:')
>>> end = result.index('Certification/Registration/Licensure:')
>>> print ''.join(result[start+1:end])
- Must be a graduate from an accredited school of Nursing.
UPD (regarding an additional question in comments):
>>> response.xpath('//*[@id="jobDesc"]/span[3]/text()').re('Job ID: (\d+)')
[u'143112']

Try:
substring-before(
    substring-after(//*[@id="jobDesc"]/span[2]/p/text(), 'Education'),
    'Certification')
Note: I couldn't test it.
The idea is that you cannot use preceding-sibling and following-sibling here, because you are looking within the same text node. You have to extract the part of the text that you want using substring-before() and substring-after().
By combining those two functions, you select what is in between.
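A minimal, untested sketch of how the combined expression might be evaluated back in the Scrapy shell (it assumes the whole description sits inside span[2], so string() is used to flatten it to a single string first):
>>> response.xpath('substring-before(substring-after(string(//*[@id="jobDesc"]/span[2]), "Education"), "Certification")').extract()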

Related

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON.
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like mapping objects, and in fact is valid JSON, something like this should do if you don't mind using Python (which comes with JSON support):
import json

def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input as the string s and parse it as JSON into the dictionary d. Then we return a formatted string with the double-quoted id and hwVersion keys, each followed by a colon and the double-quoted value of the corresponding key from the parsed dict.
We can try this with these test input strings and print the results:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
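For example, a minimal sketch that applies get_id_hw() to every line of a file (the file name data.txt is an assumption):
with open("data.txt") as f:      # assumed input file, one JSON object per line
    for line in f:
        line = line.strip()
        if line:                 # skip blank lines
            print(get_id_hw(line))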
If you really wanted to use awk, you could, but it's not the most robust or suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known, and assuming it can be in any order, we use one regex to extract id and another to get hwVersion, then we print them in the given format. If the values could be something other than the decimal digits in your example, the [0-9]+ bit would need to reflect that.
And just for the fun of it (this preserves the order of the entries in the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

How do I create an XPath query to extract a substring of text?

I am trying to create an XPath so that it only returns the order number instead of the whole line. Please see the attached screenshot.
What you want is the substring-after() function -
fn:substring-after(string1,string2)
Returns the remainder of string1 after string2 occurs in it
Example: substring-after('12/10','/')
Result: '10'
For your situation -
substring-after(string(//p[contains(text(), "Your order # is")]), ": ")
To test this, I modified the DOM on this page to include a "Order Number: ####" string.
You could also just use your normal XPath selector to get the complete text, i.e. "Your order # is: 123456", and then perform a regex on the string, as mentioned in Get numbers from string with regex.
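A minimal sketch of that regex approach in Python (the literal text is just the example string from above):
import re

text = "Your order # is: 123456"           # example string from the answer above
match = re.search(r'order # is:?\s*(\d+)', text)
if match:
    print(match.group(1))                   # -> 123456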

How to produce xpath link with characters from [a-z]

I am scraping data from a website and I need to iterate over pages, but instead of a counter they have an alphabetical index:
http://funny2.com/jokesb.htm
http://funny2.com/jokesc.htm
...
But I can't figure out how to include the [a-z] iterator. I tried
http://funny2.com/jokes^[a-z]+$.htm
which didn't work.
XPath doesn't support regular expressions. However, as Scrapy is built atop lxml, it supports some EXSLT extensions, in particular the re extension. You can use EXSLT functions by prefixing them with the corresponding namespace, like this:
response.xpath('//a[re:test(@href, "jokes[a-z]+\.htm")]/@href')
Docs: https://doc.scrapy.org/en/latest/topics/selectors.html?highlight=selector#using-exslt-extensions
If you just need to extract the links, use LinkExtractor with a regexp:
LinkExtractor(allow=r'/jokes[a-z]+\.htm').extract_links(response)
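A minimal sketch of how that might look inside a spider callback (the import path follows current Scrapy; parse_page is a hypothetical callback):
import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    for link in LinkExtractor(allow=r'/jokes[a-z]+\.htm').extract_links(response):
        # follow each matching joke page
        yield scrapy.Request(link.url, callback=self.parse_page)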
You can iterate through every letter in the alphabet and format that letter into some url template:
from string import ascii_lowercase
# ascii_lowercase == 'abcdefghijklmnopqrstuvwxyz'

for char in ascii_lowercase:
    url = "http://funny2.com/jokes{}.htm".format(char)
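For instance, a sketch of building a spider's start_urls list that way:
start_urls = ["http://funny2.com/jokes{}.htm".format(char) for char in ascii_lowercase]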
In a Scrapy context, you need a way to increment the character in the URL. You can find the current character with a regex, figure out the next character in the alphabet, and put it into the current URL, something like:
import re
from string import ascii_lowercase
from scrapy import Request

def parse(self, response):
    current_char = re.findall(r'jokes(\w)\.htm', response.url)[0]
    next_index = ascii_lowercase.index(current_char) + 1   # will raise IndexError after 'z'
    next_char = ascii_lowercase[next_index]
    next_url = re.sub(r'jokes\w\.htm', 'jokes{}.htm'.format(next_char), response.url)
    yield Request(next_url, self.parse2)

Is there an easy way to find a substring that matches a pattern within a string and extract it?

I know this is open-ended, but I'm not sure how to go about it.
Say I have the string "FDBFBDFLDJVHVBDVBD" and want to find every sub-string that starts with something like "BDF" and ends with either "EFG" or "EDS", is there an easy way to do this?
You can use re.finditer
>>> import re
>>> s = "FDBFBDFLDJVHVBDVBDBDFEFGEDS"
>>> print [s[a.start(): a.end()] for a in re.finditer('BDF', s)]
['BDF', 'BDF']
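To cover the start/end requirement with the same approach, the pattern can carry the end markers too (a sketch reusing s from above):
>>> [m.group(0) for m in re.finditer(r'BDF.*?E(?:FG|DS)', s)]
['BDFLDJVHVBDVBDBDFEFG']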
find every sub-string that starts with something like "BDF" and ends with either "EFG" or "EDS"
It is a job for a regular expression. To extract all such substrings as a list:
import re
substrings = re.findall(r'BDF.*?E(?:FG|DS)', text)
If a substring might contain newlines then pass flags=re.DOTALL.
Example:
>>> re.findall(r'BDF.*?E(?:FG|DS)', "FDBFBDFLDJVHVBDVBDBDFEFGEDS")
['BDFLDJVHVBDVBDBDFEFG']
.*? is not greedy and therefore the shortest substrings are selected. Remove the ? to get the longest match instead.
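For illustration, the greedy variant on the same string swallows everything up to the last possible ending:
>>> re.findall(r'BDF.*E(?:FG|DS)', "FDBFBDFLDJVHVBDVBDBDFEFGEDS")
['BDFLDJVHVBDVBDBDFEFGEDS']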
Seeing as there is no regex expert here yet, I will propose this solution (BTW I added "BDFEFGEDS" to the end of your string so it would give some results):
import re

s = "FDBFBDFLDJVHVBDVBDBDFEFGEDS"
endings = ['EFG', 'EDS']
matches = []
for ending in endings:
    match = re.findall(r'(?=(BDF.*{0}))'.format(ending), s)
    matches.extend(match)
print matches
giving the result:
['BDFLDJVHVBDVBDBDFEFG', 'BDFEFG', 'BDFLDJVHVBDVBDBDFEFGEDS', 'BDFEFGEDS']

XPath 2.0: reference earlier context in another part of the XPath expression

In an XPath I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the unique values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and perform another XPath on all the distinct values: count how many fields start with each of them and retrieve the ones whose count is bigger than 1.
These could be fields whose content is equal to that particular value, or that start with that value followed by " (". The problem is that in the second part of that XPath I would have to refer to the context of that part itself and to the former context at the same time.
In the following XPath I will, instead of using "." as the context, use c_outer and c_inner:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both for obvious reasons. But how could I reference a particular, or the current distinct value from the outer expression within the inner expression?
Would that even be possible?
XQuery can do it, e.g.:
for $s in distinct-values(
        //field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s
