Scrapy: How to get a correct selector - xpath

I would like to select the following text:
Bold normal Italist
I need to select and get: Bold normal Italist.
The HTML is:
<a><strong>Bold</strong> normal <i>Italist</i></a>
However, a/text() yields only
normal
Does anyone know a fix? I'm testing Bing crawling, and the bold text is in a different position depending on the query.

You can use a//text() instead of a/text() to get all text items.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
doc = """
<a><strong>Bold</strong> normal <i>Italist</i></a>
"""
sel = Selector(text=doc, type="html")
result = sel.xpath('//a/text()').extract()
print result
# >>> [u' normal ']
result = u''.join(sel.xpath('//a//text()').extract())
print result
# >>> Bold normal Italist

You can also try
string(a)
or
normalize-space(a)
both of which return Bold normal Italist. (Scrapy's selectors use XPath 1.0, where string() and normalize-space() are called as functions around the path rather than as a path step.)
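A minimal sketch with Scrapy's Selector, assuming the same wrapped HTML as above (//a is used here because we select from the document root):
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
doc = '<a><strong>Bold</strong> normal <i>Italist</i></a>'
sel = Selector(text=doc, type="html")
# string(//a) returns the string value of the first <a> element
print(sel.xpath('string(//a)').extract())
# >>> [u'Bold normal Italist']
# normalize-space(//a) additionally collapses runs of whitespace
print(sel.xpath('normalize-space(//a)').extract())
# >>> [u'Bold normal Italist']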

Related

Selenium search in google, then scan page if keyword exists

1. I'm using Selenium to search for "sage release dates" in Google.
2. Then I want to scan the entire results page to see whether my search word "release date" exists in the results.
I'm reusing this search-pattern code from a previous project of mine, but that one used urllib, so I had to adjust the search-pattern code slightly. It doesn't do what I want and I'm stuck. Can somebody point me in the right direction?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
# Version Alpha 3
#_______________________________________________________________________________
browser = webdriver.Chrome(executable_path=r"C:\Selenium_Drivers\chromedriver.exe")
browser.get('http://www.google.com')
input_element = browser.find_element_by_name('q')
input_element.send_keys('sage release dates')
# input_element.send_keys('Wolters Kluwer release dates')
input_element.submit()
'''
RESULTS_LOCATOR = '//div/h3/a'
WebDriverWait(browser, 10).until(
    EC.visibility_of_element_located((By.XPATH, RESULTS_LOCATOR)))
page1_results = browser.find_elements(By.XPATH, RESULTS_LOCATOR)
'''
page1_results = browser.find_elements_by_class_name('med')
for item in page1_results:
    print(item.text)
#..................................................
keywords = ['release date']
# sequence = page1_results.decode('utf-8', 'ignore')
sequence = page1_results
for k in keywords:
    pattern = '(?i)' + k
    keyword = re.search(pattern, str(sequence))
    if keyword:
        # print(keyword.group(0))
        print('k-1')
        print(k)
        print(keyword)
    else:
        print('k-2')
        print('-')
        print(k)
        print(keyword)
#..................................................
# browser.quit()
You can simply create a smarter XPath to check whether the results page contains elements with the keyword ('sage release dates') text. For example, check whether the results page has any of the following:
result elements with text 'sage'
result elements with text 'sage release'
result elements with text 'release dates'
This way you can broaden your search; modify the XPath if you don't want these additional filters.
If you want results that contain the text 'sage release dates', use the XPath below:
//*[contains(text(), 'sage release dates')]
If you want results containing the text 'release dates' only, use this XPath:
//*[contains(text(), 'release dates')]
Sample code snippet in Python:
from selenium import webdriver

driver = webdriver.Chrome()  # instantiate the driver before using it
driver.get('http://www.google.com')

elem = driver.find_element_by_name("q")
elem.send_keys("sage release dates")
elem.submit()

allResults = driver.find_elements_by_xpath("//*[contains(text(), 'sage release dates') or contains(text(), 'sage') or contains(text(), 'release') or contains(text(), 'sage release')]")
releaseDateResults = driver.find_elements_by_xpath("//*[contains(text(), 'release date')]")

print(len(allResults))
print(len(releaseDateResults))

driver.quit()
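The result elements load asynchronously, so it can help to wait for at least one match before counting them. A small sketch under that assumption, reusing the same XPath as above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://www.google.com')

elem = driver.find_element_by_name("q")
elem.send_keys("sage release dates")
elem.submit()

# wait up to 10 seconds for at least one element containing the keyword
# (raises TimeoutException if the keyword never appears)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'release date')]")))

matches = driver.find_elements(By.XPATH, "//*[contains(text(), 'release date')]")
print(len(matches))

driver.quit()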

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/

def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end

str = 'S\X2\00E9\X0\jour/Cuisine'

puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
In case anyone is interested, here is Python code I wrote that decodes three of the IFC encodings: \X\, \X2\ and \S\.
import re

def decodeIfc(txt):
    # In a regex, "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt

def decodeIfcX2(match):
    # X2 encodes characters as a multiple of 4 hexadecimal digits (4 per character)
    return ''.join(list(map(lambda x: chr(int(x, 16)), re.findall('([0-9A-F]{4})', match.group(1)))))

def decodeIfcS(match):
    return chr(ord(match.group(1)) + 128)

def decodeIfcX(match):
    # Sometimes IFC files were made with an old Mac... which uses MacRoman encoding
    num = int(match.group(1), 16)
    if (num <= 127) | (num >= 160):
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")
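A quick check with the sample string from the question (a minimal sketch; the raw line read from the .ifc file contains literal backslashes, hence the raw string):
raw = r"'S\X2\00E9\X0\jour/Cuisine'"
print(decodeIfc(raw))
# >>> 'Séjour/Cuisine'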

Geany: Syntax highlighting for custom filetype for SOME words

Geany is a simple, fast and yet powerful text editor.
It has quite strong support for syntax highlighting for almost all kinds
of programming languages.
I was wondering how to create customized syntax highlighting for a program I use called "Phosim", whose files have the extension .cat.
So far I have done this:
First I created the filetype extension configuration file: ~/.config/geany/filetype_extensions.conf
Its contents look like this:
[Extensions]
Gnuplot=*.gp;*.gnu;*.plt;
Galfit=*.gal;
Phosim=*.cat;
[Groups]
Script=Gnuplot;Galfit;Phosim;
Here, I am trying to apply custom highlighting to the Gnuplot, Galfit, and Phosim filetypes. For Gnuplot and Galfit it works fine, but for Phosim I ran into problems.
Then I created the file definition configuration file: ~/.config/geany/filedefs/filetypes.Phosim.conf
Its contents look like this:
# Author : Bhishan Poudel
# Date : May 24, 2016
# Version : 1.0
[styling]
# Edit these in the colorscheme .conf file instead
default=default
comment=comment_line
function=keyword_1
variable=string_1,bold
label=label
userdefined=string_2
number=number_2
[keywords]
# all items must be in one line separated by space
variables=object Unrefracted_RA_deg SIM_SEED none
functions=
lables=10
userdefined=angle 30 Angle_RA 20.0 none
numbers=0 1 2 3 4 5 6 7 8 9
[lexer_properties]
nsis.uservars=1
nsis.ignorecase=1
[settings]
# default extension used when saving files
extension=cat
# single comments, like # in this file
comment_single=#
# multiline comments
#comment_open=
#comment_close=
# This setting works only for single line comments
comment_use_indent=true
# context action command (please see Geany's main documentation for details)
context_action_cmd=
# lexer filetype should be an existing lexer that does not use lexer_filetype itself
lexer_filetype=NSIS
[build-menu]
EX_00_LB=Execute
EX_00_CM=
EX_00_WD=
FT_00_LB=
FT_00_CM=
FT_00_WD=
FT_02_LB=
FT_02_CM=
FT_02_WD=
Now my example.cat looks like this:
# example.cat
angle 30
Angle_RA 20.0
object none
# Till now,
# Words highlighted : angle 30 object none
# Words not highlighted: Angle_RA 20.0
# I like them also to be highlighted!
I got syntax highlighting for only two words, viz. object and none.
I tried setting the styling to Fortran, since it uses uppercase letters, but that did not work either.
How can we get syntax highlighting for variable names which contain uppercase letters, lowercase letters, and underscores?
For example:
I got syntax highlighting for the words: object none.
But I did not get syntax highlighting for: Angle_RA 20.0
Also, my digits 0,1,...,9 are highlighted, but decimals are not. How can we highlight decimals too?
For example:
I got syntax highlighting for: 1 1000, but did not get syntax highlighting for: 49552.3 180.0
Some useful links:
Make Geany recognize additional file extensions
Custom syntax highlighting in Geany
http://www.geany.org/manual/current/index.html#custom-filetypes
http://www.geany.org/manual/#lexer-filetype
Instead of creating a new file definition file, I added the file extensions to the Python filetype, and that worked for me.
For example, I wanted custom highlighting for files with the extension .icat (if you are interested, this is the instance catalog file for the Phosim software in astronomy).
Drawback: the additional words are also highlighted in Python scripts (.py, .pyc, .ipy).
Note: if anybody posts a solution that works with a new file definition file (~/.config/geany/filedefs/filetypes.Phosim.conf), I would heartily welcome it.
My example.pcat file looks like this:
# example.pcat
Unrefracted_RA_deg 0
Unrefracted_Dec_deg 0
Unrefracted_Azimuth 0
Unrefracted_Altitude 89
Slalib_date 1994/7/19/0.298822999997
Opsim_rotskypos 0
Opsim_rottelpos 0
Opsim_moondec -90
Opsim_moonra 180
Opsim_expmjd 49552.3
Opsim_moonalt -90
Opsim_sunalt -90
Opsim_filter 2
Opsim_dist2moon 180.0
Opsim_moonphase 10.0
Opsim_obshistid 99999999
Opsim_rawseeing 0.65
SIM_SEED 1000
SIM_MINSOURCE 1
SIM_TELCONFIG 0
SIM_CAMCONFIG 1
SIM_VISTIME 15000.0
SIM_NSNAP 1
object 0 0.0 0.0 20 ../sky/sed_flat.txt 0 0 0 0 0 0 bhishan.fits 0.09 0.0 none
I want Geany to highlight all the first words in yellow, numbers in magenta, and the word 'none' in blue.
First I created (or edited, if it already existed) the file:
~/.config/geany/filetype_extensions.conf
and added the following to it:
[Extensions]
Gnuplot=*.gp;*.gnu;*.plt;
Galfit=*.gal;
Phosim=*.pcat;
Python=*.py;*.pyc;*.ipy;*.icat;*.pcat
[Groups]
Script=Gnuplot;Galfit;Phosim;Python;
Then I added the additional keywords to the already existing keywords of the Python filetype.
For this I created (or edited, if it already existed) the file:
~/.config/geany/filedefs/filetypes.python
Now the file ~/.config/geany/filedefs/filetypes.python looks like this:
# Author : Bhishan Poudel
# Date : June 9, 2016
# Version : 1.0
# File : Filetype for both python and phosim_instance_catalogs
[styling]
default=default
commentline=comment_line
number=number_1
string=string_1
character=character
word=keyword_1
triple=string_2
tripledouble=string_2
classname=type
defname=function
operator=operator
identifier=identifier_1
commentblock=comment
stringeol=string_eol
word2=keyword_2
decorator=decorator
[keywords]
# all items must be in one line
primary=and as assert break class continue def del elif else except exec finally for from global if import in is lambda not or pass print raise return try while with yield False None True Words_after_this_are_for_Phosim_pcat_files Unrefracted_RA_deg Unrefracted_Dec_deg Unrefracted_Azimuth Unrefracted_Altitude Slalib_date Opsim_moondec Opsim_rotskypos Opsim_rottelpos Opsim_moondec Opsim_moonra Opsim_expmjd Opsim_moonalt Opsim_sunalt Opsim_filter Opsim_dist2moon Opsim_moonphase Opsim_obshistid Opsim_rawseeing SIM_SEED SIM_MINSOURCE SIM_TELCONFIG SIM_CAMCONFIG SIM_VISTIME SIM_NSNAP object
identifiers=ArithmeticError AssertionError AttributeError BaseException BufferError BytesWarning DeprecationWarning EOFError Ellipsis EnvironmentError Exception FileNotFoundError FloatingPointError FutureWarning GeneratorExit IOError ImportError ImportWarning IndentationError IndexError KeyError KeyboardInterrupt LookupError MemoryError NameError NotImplemented NotImplementedError OSError OverflowError PendingDeprecationWarning ReferenceError RuntimeError RuntimeWarning StandardError StopIteration SyntaxError SyntaxWarning SystemError SystemExit TabError TypeError UnboundLocalError UnicodeDecodeError UnicodeEncodeError UnicodeError UnicodeTranslateError UnicodeWarning UserWarning ValueError Warning ZeroDivisionError __debug__ __doc__ __import__ __name__ __package__ abs all any apply basestring bin bool buffer bytearray bytes callable chr classmethod cmp coerce compile complex copyright credits delattr dict dir divmod enumerate eval execfile exit file filter float format frozenset getattr globals hasattr hash help hex id input int intern isinstance issubclass iter len license list locals long map max memoryview min next object oct open ord pow print property quit range raw_input reduce reload repr reversed round set setattr slice sorted staticmethod str sum super tuple type unichr unicode vars xrange zip array arange Catagorical cStringIO DataFramedate_range genfromtxt linspace loadtxt matplotlib none numpy np pandas pd plot plt pyplot savefig scipy Series sp StringIO
[lexer_properties]
fold.comment.python=1
fold.quotes.python=1
[settings]
# default extension used when saving files
extension=py
# the following characters are these which a "word" can contains, see documentation
wordchars=_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
# MIME type
mime_type=text/x-python
comment_single=#
comment_open="""
comment_close="""
comment_use_indent=true
# context action command (please see Geany's main documentation for details)
context_action_cmd=
[indentation]
width=4
# 0 is spaces, 1 is tabs, 2 is tab & spaces
type=0
[build_settings]
# %f will be replaced by the complete filename
# %e will be replaced by the filename without extension
# (use only one of it at one time)
compiler=python -m py_compile "%f"
run_cmd=python "%f"
[build-menu]
FT_00_LB=Execute
FT_00_CM=python %f
FT_00_WD=
FT_01_LB=
FT_01_CM=
FT_01_WD=
FT_02_LB=
FT_02_CM=
FT_02_WD=
EX_00_LB=Execute
EX_00_CM=clear; python %f
EX_00_WD=
error_regex=([^:]+):([0-9]+):([0-9:]+)? .*
EX_01_LB=
EX_01_CM=
EX_01_WD=
Now, after restarting Geany, I can see all the first words in yellow, the numbers in another color, and the word 'none' in blue.

XPath - extracting text between two nodes

I'm encountering a problem with my XPath query. I have to parse a div which is divided into an unknown number of "sections". Each of these is separated by an h5 with the section name. The list of possible section titles is known, and each of them can occur only once. Additionally, each section can contain some br tags. So, let's say I want to extract the text under "SecondHeader".
HTML
<div class="some-class">
<h5>FirstHeader</h5>
text1
<h5>SecondHeader</h5>
text2a<br>
text2b
<h5>ThirdHeader</h5>
text3a<br>
text3b<br>
text3c<br>
<h5>FourthHeader</h5>
text4
</div>
Expected result (for SecondHeader)
['text2a', 'text2b']
Query #1
//text()[following-sibling::h5/text()='ThirdHeader']
Result #1
['text1', 'text2a', 'text2b']
It's obviously a bit too much, so I've decided to restrict the result to the content between the selected header and the header before it.
Query #2
//text()[following-sibling::h5/text()='ThirdHeader' and preceding-sibling::h5/text()='SecondHeader']
Result #2
['text2a', 'text2b']
The yielded results meet expectations. However, this can't be used - I don't know whether SecondHeader/ThirdHeader will exist in the parsed page or not. The query needs to use only one section title.
Query #3
//text()[following-sibling::h5/text()='ThirdHeader' and not[preceding-sibling::h5/text()='ThirdHeader']]
Result #3
[]
Could you please tell me what I am doing wrong? I've tested it in Google Chrome.
If all h5 elements and text nodes are siblings, and you need to group by section, a possible option is simply to select text nodes by the count of h5 elements that come before them.
Example using lxml (in Python)
>>> import lxml.html
>>> s = '''
... <div class="some-class">
... <h5>FirstHeader</h5>
... text1
... <h5>SecondHeader</h5>
... text2a<br>
... text2b
... <h5>ThirdHeader</h5>
... text3a<br>
... text3b<br>
... text3c<br>
... <h5>FourthHeader</h5>
... text4
... </div>'''
>>> doc = lxml.html.fromstring(s)
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=1)
['\n text1\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=2)
['\n text2a', '\n text2b\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=3)
['\n text3a', '\n text3b', '\n text3c', '\n ']
>>> doc.xpath("//text()[count(preceding-sibling::h5)=$count]", count=4)
['\n text4\n']
>>>
You should be able to just test the first preceding sibling h5...
//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]
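A quick check of that expression with lxml (a minimal sketch reusing the HTML string s parsed in the example above; whitespace is stripped for readability):
>>> import lxml.html
>>> doc = lxml.html.fromstring(s)  # s is the same HTML string as above
>>> nodes = doc.xpath("//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]")
>>> [t.strip() for t in nodes]
['text2a', 'text2b']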

Capybara, rspec- How to find text anywhere on page

There are multiple ways to find it but I want to do this in a specific manner. Here it is-
To get an element with some text in it, my framework creates an xpath in this manner-
#xpath = "//h1[contains(text(), '[the-text-i-am-searching-for]')]"
Then it executes-
find(:xpath, @xpath).visible?
Now, in a similar format, I want to create an XPath which just looks for text anywhere on the page and can then be used in find(:xpath, @xpath).visible? to return true or false.
To give a little more context:
My HTML paragraph looks something like this-
<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>
but if I try to find it using find(:xpath, @xpath) where my xpath is
@xpath = "//p[contains(text(), '[the-text-i-am-searching-for]')]"
it fails.
Try replacing "//p[contains(text(), '[the-text-i-am-searching-for]')]" with "//p[contains(., '[the-text-i-am-searching-for]')]"
I don't know your environment but in Python with lxml it works:
>>> import lxml.etree
>>> doc = lxml.etree.HTML("""<blink><p>some text here <b><u>some bold and underlined text here</u></b> again some text <a>Learn more</a> [the-text-i-am-searching-for]</p></blink>""")
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
>>> doc.xpath('//p[contains(., "[the-text-i-am-searching-for]")]')
[<Element p at 0x1c1b9b0>]
>>>
The context node . will be converted to a string to match the signature boolean contains(string, string) (http://www.w3.org/TR/xpath/#section-String-Functions). With contains(text(), ...), on the other hand, the node-set returned by text() is converted using only its first text node, 'some text here ', which is why the original query matches nothing (see the //p/text() variations below).
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>>
Consider these variations
>>> doc.xpath('//p')
[<Element p at 0x1c1b9b0>]
>>> doc.xpath('//p/*')
[<Element b at 0x1e34b90>, <Element a at 0x1e34af0>]
>>> doc.xpath('string(//p)')
'some text here some bold and underlined text here again some text Learn more [the-text-i-am-searching-for]'
>>> doc.xpath('//p/text()')
['some text here ', ' again some text ', ' [the-text-i-am-searching-for]']
>>> doc.xpath('string(//p/text())')
'some text here '
>>> doc.xpath('//p/text()[3]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p/text()[contains(., "[the-text-i-am-searching-for]")]')
[' [the-text-i-am-searching-for]']
>>> doc.xpath('//p[contains(text(), "[the-text-i-am-searching-for]")]')
[]
