extract a substring from clob in oracle - oracle

I have a clob with data
<?xml version='1.0' encoding='UTF-8'?><root available-locales="en_US" default-locale="en_US"><static-content language-id="en_US"><![CDATA[<script type="text/javascript">
function change_case()
{
alert("here...");
document.form1.type.value=document.form1.type.value.toLowerCase();
}
</script>
<form name=form1 method=post action=''''>
<input type=text name=type value=''Enter USER ID'' onBlur="change_case();">
<input type=submit value=Submit> </form>
</form>]]></static-content></root>
I want to extract the line with the onblur attribute, in this case:
<input type=text name=type value=''Enter USER ID'' onblur="change_case();">

Tom Kyte explains how to get a VARCHAR2 from a CLOB in SQL or PL/SQL code:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::NO::P11_QUESTION_ID:367980988799
Once you have a VARCHAR2, you can use the SUBSTR or REGEXP_SUBSTR function to extract the line:
http://docs.oracle.com/cd/B14117_01/server.101/b10759/functions147.htm#i87066
http://docs.oracle.com/cd/B14117_01/server.101/b10759/functions116.htm
If you want to do this in SQL, you can write a query like this:
select col1, col2, func1(dbms_lob.substr( t.col_clob, 4000, 1 )) from table1 t
Inside the PL/SQL function func1 you can then process the input string with SUBSTR or any other function. Note that dbms_lob.substr( t.col_clob, 4000, 1 ) passes only the first 4000 characters of the CLOB.
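The regular-expression idea can be prototyped outside the database before porting it to REGEXP_SUBSTR. The following is a minimal Python sketch, assuming the CLOB content has already been read into a string; the pattern and the names sample, pattern and tag are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical sample: a fragment of the CLOB content from the question.
sample = """<form name=form1 method=post action=''''>
<input type=text name=type value=''Enter USER ID'' onBlur="change_case();">
<input type=submit value=Submit> </form>
"""

# Match one whole tag that contains an onblur attribute, ignoring case.
# [^<>]* keeps the match inside a single <...> pair.
pattern = re.compile(r"<[^<>]*\bonblur\s*=[^<>]*>", re.IGNORECASE)

match = pattern.search(sample)
tag = match.group(0) if match else None
print(tag)
```

A pattern of this shape assumes the attribute values contain no < or > characters; markup that violates this would need a real parser.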

Subdivide your problem. You want to extract a line of text from your CLOB which contains a particular substring. I can think of two possible interpretations of your requirements:
Option 1.
1. Split the CLOB into a series of lines, e.g. split it by newline/carriage return characters, if that's really what you meant by "line".
2. Check each line to see if it includes the substring, e.g. onblur. If it does, you have found your line.
Option 2.
If you don't actually mean the line, but you want the <script>...</script> html fragment, you can use similar logic:
1. Search for the first occurrence of <script>.
2. Search for the next occurrence of </script> after that point. Extract the substring from <script> to </script>.
3. Search the substring for onblur. If it is found, return the substring. Otherwise, find the next occurrence of <script> and repeat from step 2.
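Option 1 can be sketched in a few lines. This is a minimal Python illustration of the split-and-scan logic, assuming the CLOB has already been fetched into a string; find_line_containing and sample are hypothetical names, and the same loop could be written in PL/SQL with INSTR and SUBSTR.

```python
def find_line_containing(text, needle):
    """Return the first line of text that contains needle (case-insensitive), or None."""
    needle = needle.lower()
    for line in text.splitlines():  # splits on \n, \r\n and \r
        if needle in line.lower():
            return line.strip()
    return None


# Hypothetical sample: a fragment of the CLOB content from the question.
sample = """<form name=form1 method=post action=''''>
<input type=text name=type value=''Enter USER ID'' onBlur="change_case();">
<input type=submit value=Submit> </form>
"""

line = find_line_containing(sample, "onblur")
print(line)
```

The comparison is case-insensitive because the data uses onBlur while the search term is onblur.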

Related

Oracle PL/SQL regex for sanitize input from textArea

I have a basic editor which allows the user to enter notes. I am using the https://quilljs.com/ API for the editor. The content of the editor will be saved in a database, but before persisting the data I want to sanitize the HTML content in Oracle PL/SQL to remove all possible JavaScript events. I have not been able to come up with a regular expression that sanitizes the HTML content before saving.
Example: <p>This is www.test.com</p><p>ffffff</p><p><br></p><p><br></p><p>Review at <a href="http://www.1159pm.com" rel="noopener noreferrer" target="_blank" onclick="alert()" ondblclick="alert()" onmouseover="alert()" onkeypress="alert()">www.1159PM.com</a> </p><p>fffffff</p>'
Expected Result: <p>This is www.test.com</p><p>ffffff</p><p><br></p><p><br></p><p>Review at www.1159PM.com </p><p>fffffff</p>'
All JS events should be removed, along with any other scripts and styles. Please help me with an Oracle regex to solve this problem.
If your HTML is restricted to valid XHTML (i.e. it has a single root element and each of the opened tags is closed) then you can use:
-- Delete the listed event-handler attributes from a copy of the document, then store the result
INSERT INTO table_name (value) VALUES (
  XMLQuery(
    'copy $i := $p1
     modify (delete nodes ($i//@onclick, $i//@ondblclick, $i//@onmouseover, $i//@onkeypress))
     return $i'
    PASSING XMLTYPE('<html><p>This is www.test.com</p><p>ffffff</p><p><br /></p><p><br /></p><p>Review at www.1159PM.com </p><p>fffffff</p></html>')
      AS "p1"
    RETURNING CONTENT
  ).getClobVal()
)
Given the table:
CREATE TABLE table_name (value CLOB);
the inserted value is:
VALUE
<html><p>This is www.test.com</p><p>ffffff</p><p><br/></p><p><br/></p><p>Review at www.1159PM.com</p><p>fffffff</p></html>
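The same attribute-deletion idea can be illustrated outside Oracle. The sketch below uses Python's standard xml.etree.ElementTree and, like the query above, assumes the input is well-formed XHTML. It removes every attribute whose name starts with "on", which is broader than the four attributes named in the query and is an assumption of this sketch; strip_event_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def strip_event_attributes(xhtml):
    """Parse well-formed XHTML and drop every attribute whose name starts with 'on'."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        # Copy the attribute names first, because the dict is modified in the loop.
        for name in list(element.attrib):
            if name.lower().startswith("on"):
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")


cleaned = strip_event_attributes(
    '<html><p>Review at <a href="http://www.1159pm.com" onclick="alert()" '
    'onmouseover="alert()">www.1159PM.com</a></p></html>'
)
print(cleaned)
```

This only handles event-handler attributes. It does not remove script or style elements or javascript: URLs, so it should not be read as a complete sanitizer.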

XPath: How to grab multiple strings when doing a string, substring, or another function on text() nodes

I want to use XPath to grab a list of modified strings via the text() function
Example code:
<div>
<p>
Monday 2/4/13
</p>
<p>
Tuesday 2/5/13
</p>
</div>
Now in this example, if I wanted to grab an array of the text between the tags, I'd write an expression such as .//div/p/text(). However, if I wanted to grab only the dates, I could use the substring-after function, but the expression substring-after(.//div/p/text(), ' ') only returns one element. How do I write this expression to get all the text elements?
In XPath 2.0, you can use the function as the last step of the path, applied to each text():
//div/p/substring-after(text(), ' ')
In XPath 1.0, that cannot be achieved with only one expression because:
the substring-after() function takes a string as first parameter, not a node-set
a function cannot be specified as a location step (as the 2.0 example above does).
So, in 1.0, your best bet is something like the following, which you'd have to extend with one substring-after() call per node (notice also that it returns just a single string):
concat(substring-after(//div/p[1]/text(), ' '),
' ',
substring-after(//div/p[2]/text(), ' '))
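When only XPath 1.0 is available, a common workaround is to select the nodes with XPath and apply the string function in the host language. Here is a minimal Python sketch using the standard xml.etree.ElementTree module, which supports a limited XPath subset; the variable names are illustrative, and the text is trimmed first, which is a choice of this sketch rather than XPath behavior.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question.
doc = ET.fromstring("""
<div>
<p>
Monday 2/4/13
</p>
<p>
Tuesday 2/5/13
</p>
</div>
""")

dates = []
for p in doc.findall("./p"):          # select every <p> child of the root <div>
    text = (p.text or "").strip()     # trim the surrounding newlines
    _, _, after = text.partition(" ") # equivalent of substring-after(text, ' ')
    dates.append(after)

print(dates)
```

Unlike the concat() expression, this returns one value per node and does not need to know the number of nodes in advance.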

Escaping double quotes and commas to generate CSV

Here is the data that I need to output as CSV.
I am building a CSV string which I will need to import into another system.
Basically, I append a comma between each field that I query from the DB.
Column1 :
Testing
Column 2:
<p class="MsoNormal" style=""><b><span style="font-size: 10.0pt; ">This is just a test, test2, test3</span></b><span style="font-size: 10.0pt; "></span></p>
Column 3:
Blah Blah
Now I am facing the problem of retaining double quotes and commas (I need to keep this text in its HTML form).
I tried adding a double quote at the start and end of the column 2 data, but that doesn't work.
Any suggestions?
String has the instance method escapeCSV, this should be what you need.
If you need something different, you could always use replace to replace any characters you want to escape with escapeCharacter+originalCharacter, e.g. (" => \").
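For comparison, many CSV consumers follow the RFC 4180 convention: a field containing commas or double quotes is wrapped in double quotes, and each embedded double quote is doubled rather than backslash-escaped. The sketch below shows that convention with Python's standard csv module; it assumes the target system accepts RFC 4180 style quoting, and the row values are shortened from the question's data.

```python
import csv
import io

# Shortened version of the three columns from the question.
row = [
    "Testing",
    '<p class="MsoNormal" style=""><b>This is just a test, test2, test3</b></p>',
    "Blah Blah",
]

# Write one CSV line; fields with commas or quotes are quoted automatically,
# and embedded double quotes are doubled.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back recovers the original fields.
parsed = next(csv.reader(io.StringIO(line)))
```

Round-tripping the line through a reader is a quick way to check that the commas and quotes inside the HTML survive intact.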

Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search-engine using HtmlAgilityPack and Xpath with C# (.NET 4).
I want to find every node containing a user-defined search word, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to point me in the right direction would be much appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
In XPath, to find elements whose text contains a specific keyword, use the following form ("keyword" is the word you want to search for):
//*[text()[contains(., 'keyword')]]
Use the same form in C#, where keyword is a string variable:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword.
Case-insensitive solution:
var xpathForFindText =
"//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result=doc.DocumentNode.SelectNodes(xpathForFindText);
Note:
lowerFocusKwd must not contain a single quote ('), because the resulting XPath would be malformed.
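One way around the quoting restriction is to build the string literal with care. XPath 1.0 has no escape character inside string literals, so a value containing both quote types has to be assembled with concat(). The following Python sketch shows the idea; xpath_literal is a hypothetical helper name, and the same logic ports directly to C#.

```python
def xpath_literal(value):
    """Return an XPath 1.0 expression that evaluates to the given string."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote types present: split on single quotes and glue the pieces
    # back together with concat(), inserting "'" between them.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'" + part + "'" for part in parts) + ")"


keyword = "fool's"
expression = "//*[text()[contains(., " + xpath_literal(keyword) + ")]]"
print(expression)
```

Passing user input through a helper like this also avoids XPath injection through the search word.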

use YQL with substring-before in xpath

I am trying to get the string before '--' within a paragraph in an HTML page using XPath and send it to YQL.
For example, I want to get the date from the following article:
<div>
<p>Date --- the body of the article</p>
</div>
I tried this query in yql:
select * from html where url="article url" and xpath="//div/p/text()/[substring-before(.,'--')]"
but it does not work.
How can I get the date of the article, which is before the '--'?
You can simply use:
substring-before(//div/p,'--')
Use:
substring-before(/div/p/text(), '--')
This XPath expression evaluates to the string immediately preceding '--' in the first text node in the XML document, that is a child of a p that is a child of the div top element.
In case you want to get this value for every such text node, you have to use an expression like:
substring-before((//div/p/text())[$k], '--')
and evaluate this expression $N times, for $k = 1,2, ..., $N
where $N is count(//div/p/text())
Note: avoid the // XPath pseudo-operator whenever the structure of the XML document is statically known. Using // usually results in significant inefficiency (O(N^2)), which is felt especially painfully on big XML documents.
