Find all occurrences of text in different casing - full-text-search

We have an acronym which has specific casing. Business now wants us to find all occurrences where the casing is wrong and fix it.
Example of correct casing: HtMl
The search operation would then need to return all occurrences of HTML, html, Html, HtML etc. So I could then examine each case manually to see if it's really our acronym.
I was thinking Regular Expressions but I'm unsure how to write one that would exclude the correct case. Something like: \b((H|h)(T|t)(M|m)(L|l))&(~HTML)\b. Only & as AND doesn't exist (or does it?).

Solved using bash script:
echo "Hello, I'm not HtmL, HTML or html, but not HtMl." | grep -o "[Hh][Tt][Mm][Ll]" | grep -v "HtMl"
The "exception" is in the "grep -v" part.
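To the "does AND exist?" part of the question: a negative lookahead gives roughly that effect inside a single regex. Below is a minimal Python sketch, assuming Python 3.6+ for the scoped `(?i:...)` group; the function name `find_wrong_casings` is made up for illustration. It reports every casing of the acronym except the exact one.

```python
import re

def find_wrong_casings(acronym, text):
    """Return (offset, match) pairs for every casing of `acronym` except the exact one."""
    esc = re.escape(acronym)
    # (?!HtMl\b) is checked case-sensitively and rejects the correct spelling;
    # (?i:html) then matches the same letters in any casing.
    pattern = re.compile(r"\b(?!%s\b)(?i:%s)\b" % (esc, esc))
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

sample = "Hello, I'm not HtmL, HTML or html, but not HtMl."
for offset, found in find_wrong_casings("HtMl", sample):
    print(offset, found)
```

The offsets make it easier to jump to each hit when reviewing them manually.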

You could convert the text to lowercase, then find occurrences of the word (lowercased, too) in the lowercased text. Then, wherever you found it in the lowercased version, replace it in the original text.
But now that I think this over, using a regular expression is much simpler. Not much to add here, but if you have many such replacements to do, here's a little Python script that should generate (and apply) those regular expressions for you.
import re

def replaceAllVariants(acronym, text):
    # Build one character class per letter, e.g. "HTML" -> [hH][tT][mM][lL]
    regex = "".join("[%s%s]" % (c.lower(), c.upper()) for c in acronym)
    return re.sub(regex, acronym, text)

# usage
text = replaceAllVariants("HTML", "Bla bla html HTML HtMl hTMl foo bar.")

Related

Scraping specific hyperlinks from a website using bash

I have a website containing several dozen hyperlinks in the following format :
<a href=/news/detail/1/hyperlink>textvalue</a>
I want to get all hyperlinks, and their text values, where the hyperlink begins with /news/detail/1/.
The output should be in the following format :
textvalue
/news/detail/1/hyperlink
First of all, people are going to come in here (possibly talking about someone named Cthulhu) and tell you that awk/regex are not HTML parsers. And they are right, and you should give some thought to what they say. Realistically, you can very often get away with something like this:
sed -n 's/^.*<a\s\+href\=\([^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
This tells sed to read the file input_file.html, find lines that match the regex, replace them with the sections you specified for the output, and discard everything else. The result will print to the terminal.
This also assumes that the file is formatted such that each instance of <a href=/news/detail/1/hyperlink>textvalue</a> is on a separate line. The regex could easily be modified to accommodate different formatting, if needed.
If all of the links you want happen to start with /news/detail/1/, this will probably work:
sed -n 's/^.*<a\s\+href\=\(\/news\/detail\/1\/[^>]\+\)>\([^<]\+\)<\/a>.*$/\2\n\1/p' input_file.html
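If the one-link-per-line assumption does not hold, a parser-based version may be more robust. Here is a rough sketch using Python's standard `html.parser` (not bash, so treat it as an alternative rather than the approach above); the class name `NewsLinkParser` and the helper `extract_links` are made up for illustration.

```python
from html.parser import HTMLParser

class NewsLinkParser(HTMLParser):
    """Collect (text, href) for <a> tags whose href starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the matching <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def extract_links(html, prefix="/news/detail/1/"):
    parser = NewsLinkParser(prefix)
    parser.feed(html)
    parser.close()
    return parser.links

page = '<p><a href=/news/detail/1/hyperlink>textvalue</a> <a href="/about">About</a></p>'
for text, href in extract_links(page):
    print(text)
    print(href)
```

This prints the text value first and the hyperlink second, matching the requested output format, and it does not care how the links are spread across lines.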

How to correctly have multi line yaml strings?

Newlines in a multi-line string do not seem to work for me.
Something like:
intro: |
  We are happy that you are interested in
  and
  more
Both "and" and "more" need to be on a new line, but it fails.
intro: |
  | We are happy that you are interested in
  | and
  | more
or
intro: |
  We are happy that you are interested in \n
  and
  more <2 spaces >
  another one
All fail.
How do I correctly write a multi-line string in a YAML text block?
I use this in a HAML view in a Rails app like this:
= t("mailer.beta_welcome.intro")
But no newlines are printed this way. Do I need to output it differently, with raw or something?
Your first example works fine
foo.yml
intro: |
  We are happy that you are interested in
  and
  more
foo.rb
require 'yaml'
puts YAML.load_file('foo.yml').inspect
Output
{"intro"=>"We are happy that you are interested in\nand \nmore\n"}
Late answer for Googlers:
It looks like you were trying to output it as HTML, which means it was indeed outputting the newlines if you were to inspect the page. HTML largely ignores whitespace, however, so your newlines and spaces were being converted into just a space by the HTML renderer.
According to the simple_format docs, simple_format applies a few simple formatting rules to text output in order to render it closer to what the plaintext output would be - significantly, it converts newlines to <br/> tags.
So your problem had nothing to do with YAML, which was performing as expected. It was actually because of how HTML works, which is also as expected. simple_format fixed it because it took your string from YAML with newlines and converted it to a string with <br/> tags so that the newlines actually showed up in the HTML, which is what you wanted in the first place.
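To make the newline-to-`<br />` step concrete, here is a rough Python imitation of the idea. It is not Rails code and not the exact output of simple_format; the function name `newlines_to_html` is hypothetical.

```python
import html
import re

def newlines_to_html(text):
    """Turn blank-line-separated blocks into <p> and single newlines into <br />."""
    paragraphs = re.split(r"\n{2,}", text.strip())
    rendered = []
    for paragraph in paragraphs:
        escaped = html.escape(paragraph)  # keep literal <, > and & visible as text
        rendered.append("<p>" + escaped.replace("\n", "<br />\n") + "</p>")
    return "\n\n".join(rendered)

intro = "We are happy that you are interested in\nand\nmore\n"
print(newlines_to_html(intro))
```

The string loaded from YAML already contains the newlines; only this conversion makes them visible once a browser renders the HTML.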
Ugh... after digging more with different keywords I found that
= simple_format(t("mailer.beta_welcome.intro"))
does the trick. This seems stupid, but I see no workaround for now.
You can put your string in single quotes; that worked for me:
intro: 'We are happy that you are interested in
  and
  more'

Regex Markdown Header

I'm trying to create a (Ruby) regular expression which checks for multiple conditions. I use this regex to replace the content of my object. My regex is close to finished, except for two problems I'm facing with regard to Markdown.
First off, headers are giving me trouble. For example, I don't want to replace the word "Hi" with "Hello" if "Hi" is in a header.
Hi John <== # should not change
==================
Text: Hi, how are you? <== # Should be: Hello, how are you? after substitution
Or:
#### Hi Peter <== # should not change
Text: Hi, how are you? <== # Should be: Hello, how are you? after substitution
Question: How can I escape markdown headers within my regex? I've tried negative lookbehind and lookahead assertions, but to no avail.
My second problem should be quite easy, but somehow I'm struggling. If a word is in italics (_hi_), I want to find and replace it without changing the underscores. I can find the word with this regex:
\b[_]*hi[_]*\b
Question 2: But if I replace the match, I also replace the underscores. Is there a way to detect and replace only the word itself, while still using word boundaries?
Code Example
@website.autolinks.all.each do |autolink|
  autolink.name # for example returns "Iphone5"
  autolink.url  # for example returns "http://www.apple.com"
  regex = /\b(?<!##\s)(?<![\d.\[])([_]*)#{autolink.name}([_]*)(?![\d'"<\/a>])\b/
  if @permalink.blog_entry.content.match(regex)
    @permalink.blog_entry.content.gsub!(regex, "[#{autolink.name}](#{autolink.url})")
  end
end
Example text
Iphone5
==============
Iphone5 is the best mobile phone there is, even though the people at Samsung probably think, or perhaps only hope that their Samsung Galaxy S3 is better.
#### Samsung Galaxy S3?
Yes, that's the name of the newest Samsung phone.
After conversion this becomes text with HTML tags, but my regex runs before the Markdown converter, so at that point the content still uses Markdown syntax.
Regexes work best when they do one clear thing. If you have multiple conditions, your code should usually reflect that by dividing the processing into steps.
In this case, you have two clear steps:
1. Use a simple regex or other logic to skip over the header portion of the message.
2. Once you know you are in the content, use another regex to process the content.
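The two steps above could look roughly like this. It is a sketch in Python rather than Ruby, and it assumes only two header styles: lines starting with # and lines underlined with === or ---. The function name `replace_outside_headers` is made up for illustration.

```python
import re

# A setext underline: a line made only of "=" or only of "-"
SETEXT_UNDERLINE = re.compile(r"^(=+|-+)\s*$")

def replace_outside_headers(markdown, pattern, replacement):
    lines = markdown.split("\n")
    out = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        is_atx = line.lstrip().startswith("#")
        is_setext = bool(line.strip()) and bool(SETEXT_UNDERLINE.match(next_line))
        if is_atx or is_setext:
            out.append(line)  # step 1: header lines pass through untouched
        else:
            out.append(re.sub(pattern, replacement, line))  # step 2: content only
    return "\n".join(out)

text = "Hi John\n=======\nHi, how are you?\n#### Hi Peter\nHi again"
print(replace_outside_headers(text, r"\bHi\b", "Hello"))
```

Keeping the header check outside the regex means the replacement pattern itself can stay as simple as \bHi\b.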
I've found a solution:
regex = /(?<!##\s)(?<![\d.\[a-z])#{autolink.name}(?![\d'"a-z<\/a>])(?!.*\n(==|--))/i
if @permalink.blog_entry.content.match(regex)
  @permalink.blog_entry.content.gsub!(regex, "[\\0](#{autolink.url})")
end

Find URLs in text and wrap in anchor tag

I'm basically writing my own Markdown parser. I want to detect a URL in a string and wrap it with an anchor tag if it's a valid URL. For example:
string = 'here is a link: http://google.com'
# if string matches regex (which it does)
# should return:
'here is a link: <a href="http://google.com">http://google.com</a>'
# but this would remain unchanged:
string = 'here is a link: google.com'
How can I achieve this?
Bonus points if you can point me to the code in an existing Ruby markdown parser that I can use as an example.
In general: use a regular expression to find URLs and wrap them in your HTML:
urls = %r{(?:https?|ftp|mailto)://\S+}i
html = str.gsub urls, '<a href="\0">\0</a>'
Note that this particular solution will turn this text:
See more at http://www.google.com.
…into…
See more at <a href="http://www.google.com.">http://www.google.com.</a>
So you may want to play with the regex a bit to figure out where the URL should really end.
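One simple heuristic for deciding where the URL ends is to trim trailing punctuation from each match before wrapping it. A minimal Python sketch follows (not Ruby). The set of trimmed characters is an assumption, and the function name `autolink` is made up for illustration.

```python
import html
import re

URL_RE = re.compile(r"(?:https?|ftp)://[^\s<>\"]+", re.IGNORECASE)
TRAILING = ".,;:!?)"  # assumed to be sentence punctuation, not part of the URL

def autolink(text):
    def wrap(match):
        url = match.group(0)
        trimmed = url.rstrip(TRAILING)
        tail = url[len(trimmed):]  # punctuation that stays outside the link
        href = html.escape(trimmed, quote=True)
        return '<a href="%s">%s</a>%s' % (href, href, tail)
    return URL_RE.sub(wrap, text)

print(autolink("See more at http://www.google.com."))
print(autolink("here is a link: google.com"))
```

Trimming ")" will cut URLs that legitimately end in a parenthesis, so adjust the set to your content. The sketch also does not escape the text around the URL.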
You can use this jQuery plugin:
http://www.jquery.gr/linker/

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
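For comparison, the same whole-match versus capture-group distinction in Python (a sketch only; the helper name `title_contents` is made up): group 0 is the entire match including the tags, group 1 is just what the parentheses captured.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

def title_contents(page):
    """Return what the first capture group matched, or None if there is no title."""
    match = TITLE_RE.search(page)
    return match.group(1) if match else None

page = "<title><h2>foobar</h2></title>"
print(TITLE_RE.search(page).group(0))  # whole match, <title> tags included
print(title_contents(page))            # only the captured contents
```

Note that the captured contents still include any inner tags, which is where a real HTML parser becomes the better tool.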
