I'm trying to place some whitespace at the start of a string, like so:
- sbSecId: 4
  title: ' VideoJS'
  link: /examples/video/instream/videojs/pb-ve-videojs.html
  isLastSubSectionItem: 0
  isHeader: 0
  isSectionHeader: 0
  sectionTitle:
  subgroup: 1
This is for a site being generated by Jekyll. I'm using Liquid to build an array from the YAML file, looping through the array and displaying the title key's value like so:
{{thisSubItem.title}}
Despite having the key's value in quotes, the whitespace is being deleted. Is this a Jekyll thing? How can I get the whitespace to be retained?
This is not a Jekyll thing; it is HTML rendering that collapses unnecessary whitespace.
Here you can use a CSS rule:
<span style="white-space: pre;">{{thisSubItem.title}}</span>
Or replace the spaces with non-breaking spaces:
{% assign preserved_ws = thisSubItem.title | replace: " ", "&nbsp;" %}
{{ preserved_ws }}
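To check that the space really survives YAML parsing and is only lost at render time, here is a minimal Ruby sketch. It assumes a trimmed-down copy of the data from the question; the variable names are illustrative and not part of Jekyll.

```ruby
require 'yaml'

# A shortened copy of the entry from the question (assumed, for illustration).
data = <<~YAML
  - sbSecId: 4
    title: ' VideoJS'
YAML

# The quoted scalar keeps its leading space after parsing.
title = YAML.safe_load(data).first['title']

# Same idea as the Liquid replace filter: swap spaces for non-breaking spaces.
preserved = title.gsub(' ', '&nbsp;')

puts title.inspect
puts preserved
```

If the parsed title still starts with a space, the loss happens in the browser, which is why the CSS or non-breaking space approaches help.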
Nevertheless, if it's only a presentation matter, you should get rid of the spaces and use CSS margins.
Instead, you could leave the spaces out of the string and have it in the layout or wherever the value is being rendered. You could remove the whitespace from the string and wouldn't need to worry about remembering to keep the whitespace consistent.
{{thisSubItem.title}}
I am trying to sanitize Solr search results, because they have HTML tags inside:
ActionController::Base.helpers.sanitize( result_string )
It is easy to sanitize a string that is not highlighted, like: I know <ul><li>ruby</li> <li>rails</li></ul>.
But when the results are highlighted, I have additional important tags inside, <em> and </em>:
I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.
So, when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags. And it is bad :)
How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?
I found the way, but it's slow and not pretty:
string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
  string.gsub!( "<<em>#{tag}</em>>", '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end
string = ActionController::Base.helpers.sanitize string, tags: %w(em)
How can I optimize it, or do it with some better solution?
For example, by writing a regex that removes HTML tags but keeps <em> and </em>.
Please help, thanks.
You could call gsub! to discard all tags and keep only the <em> tags that stand alone, that is, the ones that are not nested inside another HTML tag.
result_string.gsub!(/(<\/?[^e][^m]>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')
would do the trick
To explain:
# first group (<\/?[^e][^m]>)
# find html tags with a two-character name, such as <ul> or <li>, but not <em> or </em>
# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>> or <<em>ul</em>>
# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>> or </<em>ul</em>>
# and gsub replaces all of this with empty string
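Note that [^e][^m] only matches tag names that are exactly two characters long, so <p> or <span> would slip through. A slightly more general sketch, assuming tag names consist of word characters only, could use a negative lookahead instead. The constant and method names here are made up for illustration, and this is not a replacement for a real sanitizer.

```ruby
# Tags whose name was wrapped by the highlighter: <<em>ul</em>> or </<em>ul</em>>
HIGHLIGHTED_TAG = %r{</?<em>\w+</em>>}

# Any ordinary tag except <em> and </em> (the lookahead rejects the name "em").
PLAIN_TAG = %r{</?(?!em\b)\w+[^>]*>}

# Hypothetical helper: removes both kinds of tags and leaves <em>...</em> intact.
def strip_tags_keep_em(text)
  text.gsub(Regexp.union(HIGHLIGHTED_TAG, PLAIN_TAG), '')
end

highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.'
plain = 'I know <ul><li>ruby</li> <li>rails</li></ul>.'

puts strip_tags_keep_em(highlighted)
puts strip_tags_keep_em(plain)
```

The result can still be passed through sanitize with tags: %w(em) afterwards as a safety net.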
I think you can use sanitize:
Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>
So, something like that should work:
sanitize result_string, tags: %w(em)
With an additional parameter to sanitize, you can specify which tags are allowed.
In your example, try:
ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )
It should do the trick
I've been working with Nokogiri for a couple of days and I absolutely adore it. Everything was working brilliantly until I got a requirement to scrape a website that uses the data-reactid javascript attribute tag. The problem is that Nokogiri seems to be getting confused with the attribute id format this website is using (several periods, some dollar signs and some other invalid xml/css characters):
An example of what I need to scrape would be:
<td data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1">94.280</td>
I need the value (94.280) inside of the attribute with an id of ".3.3.1:$contract_23.$=1$dataRow:0.1"
which usually in nokogiri we would select by doing something like:
doc.css("type[attributename=attributeid]")
in my example it would be:
doc.css("td[data-reactid=.3.3.1:$contract_23.$=1$dataRow:0.1]")
but no matter what I do to escape the invalid characters, it keeps telling me there is an invalid character after my equals sign:
Error message for code above:
nokogiri-1.4.3.1/lib/nokogiri/css/parser.rb:78:in `on_error': unexpected '.3' after 'equal'
I've tried:
a) Getting my string defined as a variable and forced into a string
b) Escaping it with backslashes (.3.[...])
c) Prefixing it with a hash (#.3.3[...])
d) Escaping it using cgi escapedString
e) Placing it inside '%{ }' eg '%{.3.3[...]}'
No matter what I do, I keep getting the same message, except for option e, which gives me an altogether different error message:
: no .<digit> floating literal anymore; put 0 before dot
Can you guys help me get the right value with such an oddly-named attribute?
You didn't show how you are parsing your document, but if I parse it as HTML and then use single quotes around the attribute value in the css selector, I can get the tag:
require 'nokogiri'
html = <<END_OF_HTML
<td data-reactid="hello">10</td>
<td data-reactid=".3.3.1:$contract_23.$=1$dataRow:0.1">94.280</td>
<td data-reactid="goodbye">20</td>
END_OF_HTML
html_doc = Nokogiri::HTML(html)
html_doc.css("td[data-reactid='.3.3.1:$contract_23.$=1$dataRow:0.1']").each do |tag|
  puts tag.text
end
--output:--
94.280
Check out the Mothereffing Unquoted Attribute Value Validator via this SO post:
CSS attribute selectors: The rules on quotes (", ' or none?)
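If the attribute value comes from a variable, the quoting can be done in one place. Below is a small sketch of a hypothetical helper (it is not part of Nokogiri) that wraps the value in single quotes and backslash-escapes any quote or backslash inside it, following the usual CSS string rules. Whether a given selector parser honours those escapes is an assumption to verify; for the value in the question no escaping is needed at all.

```ruby
# Hypothetical helper, not a Nokogiri API: builds tag[attr='value'] with the
# value quoted so that dots, colons and dollar signs are treated as plain text.
def quoted_attr_selector(tag, attr, value)
  escaped = value.gsub(/[\\']/) { |char| "\\#{char}" }
  "#{tag}[#{attr}='#{escaped}']"
end

react_id = '.3.3.1:$contract_23.$=1$dataRow:0.1'
selector = quoted_attr_selector('td', 'data-reactid', react_id)

puts selector
```

The resulting string is what you would pass to doc.css.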
I have asked a similar question before but this one is slightly different
I have content with this sort of links in:
Professor Steve Jackson
[UPDATE]
And this is how I read it:
content = doc.xpath("/wcm:root/wcm:element[@name='Body']").inner_text
The link has two pairs of double quotes after the href=.
I am trying to strip out the tag and retrieve only the text like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
.each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try to do the same for the link that has two pairs of double quotes, it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
.each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Anyone know how I can overcome this issue?
I can suggest two ways to do it, depending on whether every <a> tag has an href enclosed in two pairs of double quotes or only the ssLINK one does.
Assume
output = []
input_text = 'Professor Steve Jackson'
1) If only the ssLINK <a> tags have an href with "", then just do
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
  output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the <a> tags have an href with "", then you can try this
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
  output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then also the output will be ["Professor Steve Jackson"]
Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.
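As a rough sketch of the regex route, here is a Ruby version. The original link markup was stripped from the question, so the sample string below is an assumption about what it looked like (an href that opens with two double quotes followed by ssLINK); the constant and method names are illustrative too.

```ruby
# Assumed shape of the broken link: <a href=""ssLINK/...."">text</a>
DOUBLE_QUOTED_LINK = %r{<a\s+href=""ssLINK[^>]*>(.*?)</a>}m

# Hypothetical helper: replaces each such link with its inner text.
def unwrap_sslinks(html)
  html.gsub(DOUBLE_QUOTED_LINK) { Regexp.last_match(1) }
end

sample = 'See <a href=""ssLINK/steve-jackson"">Professor Steve Jackson</a> for details.'

puts unwrap_sslinks(sample)
```

This only works while the attribute value never contains a > character, which is one of the things to confirm about the syntax.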
I have a text similar to this:
<p>some text ...</p><p>The post text... appeared first on some another text.</p>
I need to remove everything from <p>The post, so the results would be:
<p>some text ...</p>
I am trying to do that this way:
text.sub!(/^<p>The post/, '')
But it returns just an empty string... how to fix that?
Your regex is incorrect. It matches every <p>The post that is in the beginning of the string. You want the opposite: match from its position to the end of the string. Check this out.
s = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'
s.sub(/<p>The\spost.*$/, '') # => "<p>some text ...</p>"
You have specified ^, which matches the beginning of a string. You should do
text.sub!(/<p>The post.*$/, '')
Play with this in http://rubular.com/r/c91EbHN0Af
'^' matches the beginning of the string, so the pattern never reaches the second paragraph. Try doing
text.sub!(/<p>The post/, '')
EDIT just read it more carefully...
text.sub!(/<p>The post.*$/, '')
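The difference between the anchored and unanchored patterns can be seen in a short sketch. It uses \z with the m flag as one possible way to cut through to the very end of the string even if the text spans several lines; that choice is an addition here, not something the answers above rely on.

```ruby
text = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'

# With ^ the pattern must match at the start of a line, so nothing is replaced
# and sub! returns nil.
anchored_result = text.dup.sub!(/^<p>The post/, '')

# Without the anchor, everything from "<p>The post" to the end is removed.
trimmed = text.sub(/<p>The post.*\z/m, '')

puts anchored_result.inspect
puts trimmed
```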
I have this HTML:
<tr class="even expanded first">
  <td class="score-time status">
    <a href="/matches/2012/08/02/europe/uefa-cup/">
      16 : 00
    </a>
  </td>
</tr>
I want to extract the (16 : 00) string without the extra whitespace. Is this possible?
I. Use this single XPath expression:
translate(normalize-space(/tr/td/a), ' ', '')
Explanation:
normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.
translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
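To make the two steps concrete outside XPath, here is a Ruby sketch that mimics them on a plain string. The normalize_space method is a hand-written stand-in for the XPath function, and the sample text assumes the indentation shown in the HTML above.

```ruby
# The four characters XPath treats as whitespace: space, tab, CR, NL.
XPATH_WHITESPACE = /[ \t\r\n]+/

# Stand-in for XPath normalize-space(): trims the ends and collapses inner runs.
def normalize_space(text)
  text.split(XPATH_WHITESPACE).reject(&:empty?).join(' ')
end

raw = "\n      16 : 00\n    "

normalized = normalize_space(raw)  # like normalize-space(/tr/td/a)
compact = normalized.delete(' ')   # like translate(..., ' ', '')

puts normalized
puts compact
```

If the spaces around the colon should stay, stopping after the normalize step is enough.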
II. Alternatively:
translate(/tr/td/a, ' &#9;&#10;&#13;', '')
Please try the below xpath expression :
//td[@class='score-time status']/a[normalize-space() = '16 : 00']
You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
I came across this thread when I was having my own issue similar to above.
HTML
<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
<a href="/nsomar/OAStackView/releases/tag/1.0.1">
1.0.1
</a>
XPath start command
tree.xpath('//div[@class="d-flex"]/h4/a/text()')
However this grabbed random whitespace and gave me the output of:
['\n ', '\n 1.0.1\n ']
Using normalize-space, it removed the first blank space node and left me with just what I wanted
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')
['\n 1.0.1\n ']
I could then grab the first element of the list, and use strip() to remove any further whitespace
XPath final command
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()
Which left me with exactly what I required:
1.0.1
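The same clean-up can be done after the query instead of inside it. As a sketch in Ruby, assuming the text nodes have already been extracted into an array of strings like the first output above:

```ruby
# Text nodes as returned by the first query (whitespace-only node included).
text_nodes = ["\n ", "\n 1.0.1\n "]

# Strip every node, then drop the ones that were whitespace only.
versions = text_nodes.map(&:strip).reject(&:empty?)

puts versions.first
```

This keeps the XPath simple and leaves the string handling to the host language.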
You can check whether text() nodes are empty:
/path/text()[not(.='')]
It may be useful with axes like following-sibling:: if there are no containers, or with child::.
You can use string() or the regular-expression functions of XPath 2.0, such as matches() and replace().
NOTE: some comments say that XPath cannot do string manipulation. Even if it's not really designed for that, you can do basic things with contains(), starts-with() and, in XPath 2.0, replace().
If you want to check whitespace nodes it's much harder, as you will generally have a node list as the result set, and most XPath functions, like matches() or replace(), operate on only one node.
You can separate node selection and string manipulation.
So you may use XPath to retrieve a container, or a list of text nodes, and then process it with another language (Java, PHP, Python or Perl, for instance).