What the heck are these characters? - utf-8

I recently read this post on Stack Overflow:
RegEx match open tags except XHTML self-contained tags
The top reply contains text which appears to 'bleed':
ea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
..
Looking at these individually, they look like single characters. How are they created? How can I find more information about them? For example, the "A" character:
A̡͊͠͝
WTF is that?

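Those are Unicode combining characters: each one is a base letter followed by one or more combining marks (diacritics), which the renderer stacks above, below, or through the base letter.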
http://es.wikipedia.org/wiki/Unicode
and
http://en.wikipedia.org/wiki/Combining_character
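To see what such a "single character" is made of, you can list its code points. The following is a minimal Python sketch using only the standard unicodedata module. The sample string is built from escapes for illustration (an assumption, not copied from the post), and the helper names describe and strip_marks are made up for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, is_combining_mark) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch).startswith("M"),  # Mn/Mc/Me are mark categories
        )
        for ch in text
    ]

def strip_marks(text):
    """Drop every combining mark, keeping only the base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )

# Illustrative sample: the letter A followed by four combining marks.
sample = "A\u0321\u034A\u0360\u035D"

for row in describe(sample):
    print(row)

print(len(sample))          # number of code points, not of visible glyphs
print(strip_marks(sample))  # the base letter on its own
```

Running describe on text pasted from the post should show the same pattern: one ordinary letter followed by a run of code points whose category starts with M.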

Related

Writing Capybara expectations to verify phone numbers

I'm using AWS Textract to pull information from PDF documents. After the scanned text is returned from AWS and persisted to a var, I'm doing this:
phone_number = '(555) 123-4567'
scanned_pdf_text.should have_text phone_number
But this fails about 20% of the time because of the non-deterministic way that AWS is returning the scanned PDF text. On occasion, the phone numbers can appear either of these two ways:
(555)123-4567 or (555) 123-4567
Some of this scanned text is very large, and I'd prefer not to go through the exercise of sanitizing the text coming back if I can avoid it (I'm also not good at regex usage). I also think using or logic to handle both cases seems to be a little heavy handed just to check text that is so similar (and clearly near-identical to the human eye).
Is there an rspec matcher that'll allow me to check on this text? I'm also using Capybara.default_normalize_ws = true but that doesn't seem to help in this case.
Assuming scanned_pdf_text is a string and the only differences you're seeing are in whitespace, you can strip the whitespace and compare:
scanned_pdf_text.gsub(/\s+/, '').should eq('(555)123-4567') # exact
scanned_pdf_text.gsub(/\s+/, '').should include('(555)123-4567') # partial; match() would treat the string as a regex, with the parentheses as a group
scanned_pdf_text.gsub(/\s+/, '').should have_text('(555)123-4567') # partial
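The same idea, stripping whitespace from both sides before comparing, can be sketched outside RSpec. This is a small Python illustration and not a Capybara or RSpec API; the function name contains_ignoring_whitespace and the sample strings are assumptions made for the example.

```python
import re

def squash(text):
    """Remove every run of whitespace (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)

def contains_ignoring_whitespace(haystack, needle):
    """True if needle occurs in haystack once whitespace is ignored on both sides."""
    return squash(needle) in squash(haystack)

phone_number = "(555) 123-4567"

# Hypothetical scan results showing the two spacings described in the question.
scan_a = "Call us at (555)123-4567 during office hours."
scan_b = "Call us at (555) 123-4567 during office hours."

print(contains_ignoring_whitespace(scan_a, phone_number))
print(contains_ignoring_whitespace(scan_b, phone_number))
```

Because the expected value is squashed as well, the test can keep the phone number in whichever spacing is most readable.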

To build a flow using Power Automate to download linked csv report in gmail

I'm trying to create a flow using Power Automate (which I'm quite new to) that gets the link/URL from an email I receive daily, downloads the .csv file that clicking the link would normally fetch, and saves the file to a given local folder.
An example of the email I get:
Screenshot of the email I get daily
I searched the Power Automate Community and found an insightful post and answer (LINK) that almost solved it. However, after I followed the steps and built the flow, it kept failing at the Compose step.
Screenshot of the Flow & Error Message
Expression used:
substring(body('Html_to_text'),add(indexOf(body('Html_to_text'),'here'),5),sub(indexOf(body('Html_to_text'),'Name'),5))
Seems the expression couldn't really get the URL/Link? I'm not sure and searched but couldn't find any more posts that can help.
Please kindly share all insights on approaches or workarounds that you think may help me solve the problem and truly thanks!
We need to break down the function here, which takes 3 pieces of information:
substring(1 text to search, 2 starting position of the text you want, 3 length of text)
For example, if you were trying to return an unknown number from the text dog 4567 bird
Our function would have 3 parts.
body('Html_to_text'), this bit gets the text we are searching for
add(indexOf(body('Html_to_text'),'dog'),4), this bit finds the position in the text 4 characters after the start of the word dog (3 letters for dog + the space)
sub(sub(indexOf(body('Html_to_text'),'bird'),1),add(indexOf(body('Html_to_text'),'dog'),4)), I've changed the structure of your code here because this part needs to return the length of the URL, not the ending position. So here, we take the position where the URL ends (the position of the word bird minus 1 for the space before it) and subtract the position where the URL starts (the position of the word dog plus 4) to get the length. For dog 4567 bird that is (9 - 1) - (0 + 4) = 4.
In your HTML to text output, you need to check what the text looks like, pick a word that comes before the URL starts and a word that comes after the URL ends, and count the exact number of characters between each word and the URL. You can then put those words and counts into your code.
More generally, when you have a complicated problem that you need to troubleshoot, you can break it down into steps. For example, rather than putting that big mess of code into a single block, you can put each chunk of the code in its own Compose, and then use one final Compose to bring them all together. That way, when you run it, you can see what each bit returns or where it fails, and experiment from there to discover what is wrong.
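The start and length arithmetic is easy to check outside Power Automate. Below is a minimal Python sketch that mimics substring(text, start, length) and zero-based indexOf on the dog 4567 bird example. The helper names and the gap parameters are assumptions for illustration, not Power Automate functions.

```python
def substring(text, start, length):
    """Mimic substring(text, startIndex, length): the third argument is a length."""
    return text[start:start + length]

def extract_between(text, before, after, gap_before=1, gap_after=1):
    """Return the text between two marker words.

    gap_before / gap_after are the number of characters (here, one space each)
    that separate the markers from the wanted text.
    """
    start = text.index(before) + len(before) + gap_before  # first wanted character
    end = text.index(after) - gap_after                    # one past the last wanted character
    return substring(text, start, end - start)             # length = end - start

text = "dog 4567 bird"

print(text.index("dog"), text.index("bird"))  # marker positions, zero-based
print(extract_between(text, "dog", "bird"))
```

Printing the intermediate positions first, as the separate Compose steps would, makes an off-by-one in either gap obvious before the final substring call.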

Whenever specific button clicked it does something in Pascal

Hello.
Can anyone help me scan for key presses continuously, so that pressing a specific key does something even while I'm typing? I want to fill in a record with about 9 attributes, but be able to close it when I'm, say, at the fourth one. I tried some readkey stuff:
procedure searching();
var
  p: char;
  search: string = '';
begin
  repeat
    p := readkey();      // readkey comes from the crt unit
    write(p);
    search += p;         // p is already a char, so chr() is not needed
  until (p = #27) or (p = #13);
  if (p = #27) then menu()
  else
  ...
but the problem was that once a character was written it was not possible to erase it, and backspacing and typing again filled my search string with characters I did not want in there. I couldn't find a topic about this for Pascal, so I'm trying here. Please don't flame me for my English; it may also be the reason I can't find anything. I hope you get what I meant. Waiting for an answer. Thanks, Maroš.
but the problem was that once a character was written it was not possible to erase it
Why not? Just handle #8 (backspace) and remove the last character from the search string. You can use either System.Delete (deleting the last character) or System.SetLength (setting the length to the current length - 1).
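The buffer logic is independent of Pascal, so here is a rough Python sketch of it: the keys are fed in as a string instead of coming from readkey, and the function name read_search is made up for the example. Backspace (#8) drops the last character, while Esc (#27) and Enter (#13) end the loop without being stored.

```python
BACKSPACE = "\x08"  # #8 in Pascal
ENTER = "\r"        # #13
ESCAPE = "\x1b"     # #27

def read_search(keys):
    """Consume key presses and return (search_text, terminating_key).

    keys stands in for repeated readkey calls; terminating_key is None
    if the input ran out before Enter or Esc was pressed.
    """
    search = ""
    for key in keys:
        if key in (ENTER, ESCAPE):
            return search, key
        if key == BACKSPACE:
            search = search[:-1]  # same effect as SetLength(search, Length(search) - 1)
        else:
            search += key
    return search, None

text, last_key = read_search("recx" + BACKSPACE + "ord" + ENTER)
print(text, last_key == ENTER)
```

In the Pascal loop the same branch would also need to erase the character on screen, for example by writing backspace, space, backspace.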

Find HTML Tags in Properties

My current issue is to find HTML tags inside property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by @jcr:score
It looks like there is a problem with the chars < and >, because this query finds everything that has strong in its property. It finds <strong>Some Text</strong> but also This is a strong man.
The Query Builder API didn't help me either.
Is it possible to solve this with an XPath or SQL query, or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
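To illustrate the difference between searching for a fixed string and actually parsing the markup, here is a small Python sketch using the standard html.parser module. It is not JCR or AEM code; the class name TagFinder and the function contains_tag are assumptions for the example, and it would have to run over property values you have already fetched.

```python
from html.parser import HTMLParser

class TagFinder(HTMLParser):
    """Record whether a given element appears as a real start tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == self.tag:
            self.found = True

def contains_tag(value, tag="strong"):
    """True if value contains an actual <tag> element, not just the word."""
    finder = TagFinder(tag)
    finder.feed(value)
    finder.close()
    return finder.found

print(contains_tag("<strong>Some Text</strong>"))
print(contains_tag("This is a strong man"))
```

A check like this also catches variants such as <STRONG class="x">, which a like '%<strong>%' pattern would miss.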
If you have a specific jcr:primaryType and know the targeted properties, you can do something like this:
select * from nt:unstructured where text like '%<strong>%'
I tested it, but you need to know the properties you are interested in.
This is jcr-sql syntax
Start using predicates like a champ; this way all of this will make sense to you!
HTML Encode &lt;strong&gt;
HTML Decimal &#60;strong&#62;
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%&lt;strong&gt;%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(@text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored on simple properties on the node in question. The query will then be a lot simpler with just a property = value query, than lots of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with an OSGi service, that service can accurately save these properties for you. Every time the HTML is changed, the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.

Nokogiri How can I extract text from HTML with correct spacing?

I'm trying to extract the text for a document to index it for search. The below mostly works except various words and punctuation run together. When it removes tags, I need to replace them with spaces so I do not get this issue. I have been trying to figure out the most efficient way to do this but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/,' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach probably is something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers are discussing inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So to replace script tags with spaces do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when they render around a <script> tag, so while useful for text extraction, this is not necessarily the 'correct' thing to do.
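For comparison, the approach of putting whitespace at block boundaries while extracting text can be sketched without Nokogiri. The following Python example uses the standard html.parser module; the set of tags treated as block-level and the names TextExtractor and extract_text are assumptions chosen for illustration, not a complete list or an existing API.

```python
import re
from html.parser import HTMLParser

# Assumed, deliberately short lists; extend them for real documents.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "td", "tr"}
SKIPPED_TAGS = {"script", "style"}

class TextExtractor(HTMLParser):
    """Collect text, adding a space at block boundaries and skipping scripts."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html):
    """Return the visible text with runs of whitespace collapsed to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

html = "<p>it was “Bob Griffin.”</p><p>After Griffin's run</p><script>var x = 1;</script>"
print(extract_text(html))
```

Inline elements such as <a> or <em> are left out of BLOCK_TAGS on purpose, so words that are only partly wrapped in a tag are not split.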

Resources