How to correctly have multi line yaml strings? - syntax

Newlines in a multi-line string do not seem to work for me:
Something like:
intro: |
  We are happy that you are interested in
  and
  more
and + more needs to be on a newline but it fails.
intro: |
| We are happy that you are interested in
| and
| more
or
intro: |
We are happy that you are interested in \n
and
more <2 spaces >
another one
All fail.
How to correctly have multiline in a yaml text block?
I use this in HAML view in rails app like
= t("mailer.beta_welcome.intro")
But no newlines are printed this way. Do I need to output it differently, with raw or something?

Your first example works fine
foo.yml
intro: |
  We are happy that you are interested in
  and
  more
foo.rb
require 'yaml'
puts YAML.load_file('foo.yml').inspect
Output
{"intro"=>"We are happy that you are interested in\nand \nmore\n"}

Late answer for Googlers:
It looks like you were trying to output it as HTML, which means it was indeed outputting the newlines if you were to inspect the page. HTML largely ignores whitespace, however, so your newlines and spaces were being converted into just a space by the HTML renderer.
According to the simple_format docs, simple_format applies a few simple formatting rules to text output in order to render it closer to what the plaintext output would be - significantly, it converts newlines to <br/> tags.
So your problem had nothing to do with YAML, which was performing as expected. It was actually because of how HTML works, which is also as expected. simple_format fixed it because it took your string from YAML with newlines and converted it to a string with <br/> tags so that the newlines actually showed up in the HTML, which is what you wanted in the first place.
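The same effect can be sketched outside Rails. The snippet below is only a rough Python analogue, not the simple_format implementation: newlines_to_br is a hypothetical helper name, and it only escapes the text and turns each newline into a <br /> tag.

```python
import html


def newlines_to_br(text):
    """Escape text for HTML, then turn every newline into a <br /> tag.

    Hypothetical helper for illustration; it is not Rails' simple_format.
    """
    return html.escape(text).replace("\n", "<br />\n")


# The string a YAML literal block (intro: |) yields, newlines included.
intro = "We are happy that you are interested in\nand\nmore\n"

# Printed raw, a browser would collapse the newlines into spaces;
# after conversion, each line break is an explicit tag.
print(newlines_to_br(intro))
```

The point is only that the newlines are already present in the string; they have to be turned into markup before a browser will show them.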

Ugh.. after digging more on different keywords I found that
= simple_format(t("mailer.beta_welcome.intro"))
does the trick. It seems clumsy, but I see no other workaround for now.

You can put your string in single quotes; that worked for me:
intro: 'We are happy that you are interested in
and
more'

Related

Scraping specific hyperlinks from a website using bash

I have a website containing several dozen hyperlinks in the following format :
<a href=/news/detail/1/hyperlink>textvalue</a>
I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.
The output should be in the following format :
textvalue
/news/detail/1/hyperlink
First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that awk/regex are not HTML parsers. And they are right, and you should give some thought to what they say. Realistically, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed to read the file input_file.html, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
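If the one-link-per-line assumption does not hold, a real HTML parser is the safer route. The following is a minimal sketch using Python's standard html.parser module; the class name NewsLinkParser and the function extract_links are made up for this example, and nested or malformed anchors are not handled.

```python
from html.parser import HTMLParser


class NewsLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html_text, prefix="/news/detail/1/"):
    parser = NewsLinkParser(prefix)
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<p><a href=/news/detail/1/hyperlink>textvalue</a> <a href="/other">x</a></p>'
for text, href in extract_links(sample):
    print(text)
    print(href)
```

Unlike the sed version, this does not care how the links are spread across lines, and it accepts both quoted and unquoted href values.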

Preserving whitespace / line breaks with REXML

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will be reading this file and depending on those newlines to properly parse the content. Generally in this situation i'd use a CDATA to preserve the whitespace, but this isn't an option as the code that parses this data later is not expecting one - it's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that off should solve your problem.

Find all occurrences of text in different casing

We have an acronym which has specific casing. Business now wants us to find all occurrences where the casing is wrong and fix it.
Example of correct casing: HtMl
The search operation would then need to return all occurrences of HTML, html, Html, HtML etc. So I could then examine each case manually to see if it's really our acronym.
I was thinking Regular Expressions but I'm unsure how to write one that would exclude the correct case. Something like: \b((H|h)(T|t)(M|m)(L|l))&(~HTML)\b. Only & as AND doesn't exist (or does it?).
Solved using bash script:
echo "Hello, I'm not HtmL, HTML or html, but not HtMl." | grep -o "[Hh][Tt][Mm][Ll]" | grep -v "HtMl"
The "exception" is in the "grep -v" part.
You could convert the text to lowercase, then find occurrences of the word (lowercased, too) in the lowercased text. Now, wherever you found it in the lowercased version, replace it in the original text.
But now that I think this over, using regular expression is much simpler. Not much to add here, but if you have many such replacements to do, here's a little Python script that should generate (and apply) those regular expressions for you.
import re
def replaceAllVariants(acronym, text):
    regex = "".join("[%s%s]" % (c.lower(), c.upper()) for c in acronym)
    return re.sub(regex, acronym, text)
# usage
text = replaceAllVariants("HTML", "Bla bla html HTML HtMl hTMl foo bar.")
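For the original goal of only listing the wrongly cased occurrences for manual review, the missing "AND NOT" can be expressed with a negative lookahead. This is a sketch under a few assumptions: find_wrong_casings is a hypothetical name, the acronym is a whole word, and the scoped (?i:...) flag requires Python 3.6 or later.

```python
import re


def find_wrong_casings(acronym, text):
    """Return every whole-word occurrence of acronym in any casing except the correct one."""
    escaped = re.escape(acronym)
    # (?!...) is case-sensitive and rejects the correctly cased form;
    # (?i:...) then matches the same letters in any casing.
    pattern = r"\b(?!%s\b)(?i:%s)\b" % (escaped, escaped)
    return re.findall(pattern, text)


print(find_wrong_casings("HtMl", "Hello, I'm not HtmL, HTML or html, but not HtMl."))
```

Because the exclusion lives inside the pattern, no second filtering pass is needed.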

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/(.)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python solution. It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), throws away the ASCII characters (0-127), sorts what is left and then joins it back together with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
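The one-liners above are Python 2 (unicode, xrange, print statement). A rough Python 3 equivalent might look like the sketch below; the function names are made up for this example, and 'inputfile' is the same placeholder path as above.

```python
def unique_non_ascii(text):
    """Return the sorted list of distinct characters with a code point above 127."""
    return sorted({ch for ch in text if ord(ch) > 127})


def unique_non_ascii_in_file(path):
    # Decode the file as UTF-8 explicitly instead of relying on the locale default.
    with open(path, encoding="utf-8") as handle:
        return unique_non_ascii(handle.read())


# Example with an in-memory string instead of a file.
print("\n".join(unique_non_ascii("/\u02c8bi\u02d0r\u026a/ beery")))
```

For a real document, unique_non_ascii_in_file('inputfile') gives the list to work through when building the search and replace table.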

How can I strip whitespace and newlines with Mako templates? My 12362 line HTML file is killing IE

I'm using the Mako template system in my Pylons website and am having some issues with stripping whitespace.
My reasoning for stripping whitespace is the generated HTML file comes out as 12363 lines of code. This, I assume, is why Internet Explorer is hanging when it tries to load it.
I want to be able to have my HTML file look nice and neat so I can make changes to it with ease and have the generated output look as ugly and messy as required to cut down on filesize and memory usage.
The Mako documentation http://www.makotemplates.org/docs/filtering.html says you can use the trim flag but that doesn't seem to work. Example code:
<div id="content">
${next.body() | trim}
</div>
The only way I've been able to strip the newlines is to add a \ (backslash) to the end of each line. This is rather annoying when coding the views and I'd prefer to have a centralized solution.
How do I remove the whitespace/newlines ?
I don't believe IE hanging is due to the file size. You have probably just found a combination of markup that hits an IE bug that causes it to freeze. Try trimming down your document until the freeze no longer happens to isolate the offending markup pattern (and then avoid it), or try changing the markup until this no longer happens.
Another good thing to do would be to run your page through an HTML validator and fix any issues reported.
I'm guessing that the trim filter is treating your html as a single string, and only stripping the leading and trailing whitespace characters. You want it to strip whitespace from every line. I would create a filter and iterate over each line.
<%!
    def better_trim(html):
        # Iterate over lines (not characters) and strip each one.
        clean_html = ''
        for line in html.splitlines():
            clean_html += line.strip()
        return clean_html
%>
<div id="content">
${next.body() | better_trim}
</div>
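The difference between trimming the whole string and trimming every line can be checked in plain Python, outside Mako. This sketch assumes the trim filter behaves like str.strip(), as guessed above; strip_each_line is a hypothetical name.

```python
def strip_each_line(html):
    """Strip leading and trailing whitespace from every line, then concatenate the lines."""
    return "".join(line.strip() for line in html.splitlines())


rendered = "\n  <div id=\"content\">\n      <p>Hello</p>\n  </div>\n"

# Whole-string strip: only the outer whitespace goes, inner indentation stays.
print(repr(rendered.strip()))

# Per-line strip: indentation and newlines inside the markup are removed too.
print(repr(strip_each_line(rendered)))
```

Note that concatenating lines with no separator also glues together text that was split across lines, so joining with a single space may be the safer choice for prose-heavy pages.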
You could create your own simple WSGI Middleware to strip whitespace from your templates after they've been rendered (possibly using the methods described in this answer).
Below is a quick example which might help you get started. Note: I wrote this code without testing or even running it; it may work first time, but it probably won't suit your needs so you'll need to expand on this yourself.
class HTMLMinifyMiddleware(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        resp_body = self.app(environ, start_response)
        # The response may be any iterable, so build a new list instead of
        # assigning into it; each chunk has its whitespace runs collapsed.
        return [' '.join(part.split()) for part in resp_body]
