I'm trying to create a basic script that extracts some text from a webpage, but when I save it I get a syntax error that I don't understand (please see screenshot).
The second set statement (role1Result) works fine.
I'm a bit of a newbie at this, so any help is really appreciated.
Here's the relevant bit of code:
on extractText(SearchText, startText, endText) -- header inferred from "end extractText" and the calls below
    set tid to AppleScript's text item delimiters -- save them for later.
    set AppleScript's text item delimiters to startText -- find the first one.
    set liste to text items of SearchText
    set AppleScript's text item delimiters to endText -- find the end one.
    set extracts to {}
    repeat with subText in liste
        if subText contains endText then
            copy text item 1 of subText to end of extracts
        end if
    end repeat
    set AppleScript's text item delimiters to tid -- back to original values.
    return extracts
end extractText
--- roles ---
set role0Result to extractText(input0, " <dd class="result-lockup__highlight-keyword">
<span data-anonymize="job-title" class="t-14 t-bold">", "</span>
<span>
at
")
set role1Result to extractText(input1, " <dd class=\"result-lockup__highlight-keyword\">
<span class=\"t-14 t-bold\">", "</span>
<span>
")
Within a string literal you have to escape every double quote with a backslash, as the working role1Result call already does:
set role0Result to extractText(input0, " <dd class=\"result-lockup__highlight-keyword\">
<span data-anonymize=\"job-title\" class=\"t-14 t-bold\">", "</span>
<span>
at
")
Related
I have an old ASP page that contains a form, but for some reason part of the code doesn't work anymore.
I know the code used to work because the form content is written to a database, and that database contains values from earlier submissions.
A short description of my page:
I have a [select multiple] field, and when you select one (or more) options I create another input field (a new field for every selected option). The ID of this field is FSRow_ + an edited value of the selected option (no spaces, commas, ...).
When I submit the form I want to read the content of my new fields so I can write this information to my database.
An example:
Suppose I have two selected options, to make it easy: "OptionA" and "OptionB", then I create two new fields (from the source code):
<input name="FSRow_OptionA" id="FSRow_OptionA" type="text">
<input name="FSRow_OptionB" id="FSRow_OptionB" type="text">
When I submit my form, I have this code:
if request.form()
    ...
    getSelectedValues = split(request("OptionList"), ",")
    for each li in getSelectedValues
        li = replace(li, "'", "''")
        li = replace(li, "||", ",")
        fsrow = trim(request("FSRow_" & replace(replace(replace(replace(replace(trim(li), " ", ""), "&", ""), "+", ""), ",", ""), "|", "")))
        if fsrow <> "" then
            [write to DB with dynamic field]
        else
            [only write selected item]
        end if
    next
When I look in my database, I only have the information for the selected items (so the code is going into my else branch), not the content of my dynamic field. So I suppose the variable "fsrow" is always empty, even if my new field contains some text.
My questions:
How can I get the content of my dynamic input field?
Any idea what could be wrong? It was working in July 2017 and I didn't change anything.
[Not needed anymore, found the solution] I tried to show an alert or write to the console to debug, but this isn't working. How can I debug inside my "for each" loop?
Edit/Update 1:
As commented by Ricardo Pontual, I can use "Response.Write" followed by "Response.End".
If I do this:
Response.Write(li)
Response.Write("## FSRow_" & li & " ## ")
Response.Write("->" & request("FSRow_" & li))
I get this result:
OptionA## FSRow_OptionA ## ->
But in my form/input field I wrote "test" in the field with id "FSRow_OptionA"...
Edit/Update 2:
With a non-dynamic input field I can perfectly read and write the value in my database.
So it's really a problem with the dynamic field...
Found the problem. This was very old code, written years ago by someone else, and the problem was the position of <form> and </form>.
Old code (NOT WORKING)
<table>
<form>
<tr>
...
</tr>
</form>
</table>
Changed the code to this:
<form>
<table>
<tr>
...
</tr>
</table>
</form>
I changed the order of <table> and <form> and it works now!
I've got many HTML files in a folder. For each file I want to replace en dashes and em dashes with a line feed or paragraph mark, but only within a specific HTML class.
For example, I would like to find/replace only text in class "Center".
Original:
class=Center <p class="Center">« Sentence1 — Sentence2 – Sentence3</p>
class=Aligned <p class="Aligned">«Other Sentence4 — Other Sentence5 – OtherSentence6«</p>
Desired result:
<p class="Center">« Sentence1 </p><p></p><p> Sentence2 </p><p></p><p> Sentence3«</p>
<p class="Aligned">«Other Sentence4 — Other Sentence5 – OtherSentence6«</p>
So far I'm using this solution by Helen: https://stackoverflow.com/a/1758239/5471234
But using strText = Replace(strText, "–", "</p><p></p><p>") performs the find/replace across the whole text.
How can I limit it to class=Center? Is there any way to use RegEx, and/or the HTML object's .innerText, to grab only a specific class?
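One way to think about the regex route is as two steps: first match each whole paragraph of the wanted class, then replace dashes only inside the matched text. The sketch below shows that idea in Ruby rather than VBScript, purely as an illustration. The names `CENTER_PARAGRAPH`, `DASHES`, and `split_center_paragraphs` are hypothetical, and it assumes the attribute is written exactly as class="Center" and that paragraphs are not nested.

```ruby
# Matches one <p class="Center"> ... </p> paragraph (assumed not nested).
CENTER_PARAGRAPH = %r{(<p class="Center">)(.*?)(</p>)}m
# En dash (U+2013) and em dash (U+2014).
DASHES = /[\u2013\u2014]/

# Replace dashes with paragraph breaks, but only inside Center paragraphs.
def split_center_paragraphs(html)
  html.gsub(CENTER_PARAGRAPH) do
    # Copy the captures first; the inner gsub below overwrites $1..$3.
    open_tag, body, close_tag = $1, $2, $3
    open_tag + body.gsub(DASHES, "</p><p></p><p>") + close_tag
  end
end

sample = "<p class=\"Center\">A \u2014 B \u2013 C</p>\n" \
         "<p class=\"Aligned\">D \u2014 E</p>"
puts split_center_paragraphs(sample)
```

Paragraphs with any other class pass through unchanged because the outer pattern never matches them.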
I am completely new to AppleScript. I think this is the simplest script you can imagine, but I still can't get it working.
What I want to do is:
1. Get the HTML code from the page
2. Get the name from between the <h3> tags
3. Get the column names from between the <strong> tags
4. Get the column values from between <li><strong>any value</strong> and </li>
5. Create an Excel file with a first column "Name" holding the value from step 2, and further columns titled with the names from step 3 and filled with the values from step 4.
The code:
<pre>
<div>
<div>
<h3>NAME</h3>
</div>
<div>
<ul class="circle">
<li><strong>Admin: </strong>Name</li>
<li><strong>Phone </strong>+XX XX XXX XXX</li>
<li><strong>Email: </strong>email#email.com</li>
</ul>
</div>
<div>
<ul>
<li><strong></strong></li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
</ul>
</div>
</div>
</pre>
You can search for substrings in AppleScript like this:
set AppleScript's text item delimiters to "<strong>"
Then you can refer to each delimited item (what's between each delimiter) with text item # (where # is a number), or get the full list of delimited items with every text item.
By doing this, you can slice your text, get the text item, set the delimiters again to refine what you got, get the next text item you need from that, etc until you have the substring you want. You can make it more efficient by putting it into a subroutine (function).
When AppleScript's text item delimiters are set, they will also be inserted between list elements when you convert a list of strings into a string via as string. This also allows you to do a wholesale find/replace operation easily by getting your list of text items, changing the delimiters, and then rejoining them with as string.
It's good practice to always set AppleScript's text item delimiters back to "" when you're done using some other value. (Some people consider it better practice to save them in a variable before changing them, e.g. set oldDelims to AppleScript's text item delimiters, and then restore them from that variable, but that's not my personal style.)
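The split-and-rejoin idea described above is not specific to AppleScript. As a rough illustration, here is the same pattern in Ruby, where `split` plays the role of "text items" and `join` plays the role of converting the list back with new delimiters. The helper names `extract_between` and `replace_all` are hypothetical, and the sample string is taken from the HTML in the question.

```ruby
# Build a regex that matches the delimiter literally. This avoids the
# special whitespace mode Ruby uses when split is given a single space.
def literal(delimiter)
  Regexp.new(Regexp.escape(delimiter))
end

# Return every substring found between start_text and end_text.
def extract_between(search_text, start_text, end_text)
  # The piece before the first start marker cannot contain a match.
  pieces = search_text.split(literal(start_text), -1).drop(1)
  pieces.select { |piece| piece.include?(end_text) }
        .map { |piece| piece.split(literal(end_text), -1).first }
end

# Find/replace by splitting on one delimiter and joining with another.
def replace_all(text, find, replacement)
  text.split(literal(find), -1).join(replacement)
end

html = "<li><strong>Admin: </strong>Name</li>" \
       "<li><strong>Phone </strong>+XX XX XXX XXX</li>"
p extract_between(html, "<strong>", "</strong>")
p replace_all("a-b-c", "-", "+")
```

Ruby has no global delimiter state, so there is nothing to restore afterwards; in AppleScript the restore step still matters.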
How can I recursively capture all the text with formatting tags using Nokogiri?
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
For example, I would like to capture:
"This is text in the TD with <strong> strong </strong> tags"
"This is a child node. with <b> bold </b> tags"
"another line of text to a link "
"This is text inside a div <em>inside<em> another div inside a paragraph tag"
I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.
ADDED DETAIL: Sanitize looks like an interesting gem; I'm reading about it now. However, here is some added info that might clarify what I need to do.
I need to traverse each node, get the text, process it, and put it back. So I would grab the text from div#1, "This is text in the TD with strong tags", and modify it to something like "This is the modified text in the TD with strong tags". Then go to the next tag in div#1 and get its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags", and put it back. Then go to div#2, grab the text "another line of text to a link ", modify it to "another line of modified text to a link ", and put it back. Finally go to the next node, the paragraph tag inside div#2, grab its text, and modify it to "This is modified text inside a div inside another div inside a paragraph tag".
So after everything is processed, the new HTML should look like this:
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a modified child node. with <b> bold </b> tags</p>
<div id=2>
"another line of modified text to a link "
<p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
Below is my quasi-code. I'm really stuck on two parts. First, grabbing just the text with its formatting: Sanitize helps with this, but it grabs all tags, whereas I need to preserve the formatting (including spaces, etc.) of just the text without grabbing the unrelated child tags. Second, traversing down only the children that belong directly to the full text.
#Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # grab full text (full sentences and paragraphs) with formatting tags
  # currently, I have no way to grab just the text with formatting and not the other tags
  modified_text = processing_code(i.full_text_w_formating())
  i.full_text_w_formating = modified_text
end

def processing_code(string)
  # code to process string (not relevant for this example)
  return modified_string
end

# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map{ |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end
I'd use two tactics: Nokogiri to extract the content you want, then a blacklist/whitelist tool to strip the tags you don't want or keep the ones you do.
require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
'
doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html
will capture the contents of <div id="1"> as an HTML string:
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id="2">
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
</div>
</strong></strong>
The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.
Passing html_fragment to the Sanitize gem:
doc = Sanitize.clean(
html_fragment,
:elements => %w[ a b em strong ],
:attributes => {
'a' => %w[ href ],
},
)
The returned text looks like:
This is text in the TD with <strong> strong <strong> tags
This is a child node. with <b> bold </b> tags
"another line of text to a link "
This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>
</strong></strong>
Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.
I'm doing a conversion between two software packages that both use XML, so the actual conversion part is fairly straightforward: adding text here, removing some there, converting a few pieces of information. I'm using VBScript under WSH.
The only issue I'm still having is the darn &#13; character. Because it's treated as an HTML character reference, it isn't detectable as a string, even though it is a string...
I've tried both strText = Replace(strText, "&#13;", "") and a regex with Regex.pattern = "&#13;"; neither works. I also tried replacing Chr(13) and vbCr. Nothing seems to detect the actual string itself rather than the character it creates.
Code Snippet from incoming file:
<p>If necessary, [clip].</p>
<ul><li>
<p>In the <strong>Document </strong>properties dialog box, [clip].</p>
</li>
</ul></li>
<li>
<p>Click <strong>OK</strong>.</p>
</li>
</ol><p><span>To add or edit an advanced paper handling operation: </span></p>
<ol><li>
<p>To add an operation, [clip] </p></li></ol>
I'm surprised strText = Replace(strText, "&#13;", "") doesn't work, and the regex should be OK too.
Can you try setting these options:
Regex.IgnoreCase = True
Regex.Global = True
I used this test page, and just setting the pattern to "&#13;" worked fine:
http://www.regular-expressions.info/vbscriptexample.html
This only works in IE, by the way.
A workaround to all of this is to use: regexp.pattern = ".;" , which of course will also detect other instances of HTML codes in that format - but in my case this works fine.
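To make the trade-off between a narrow and a broad pattern concrete, here is a small Ruby sketch (Ruby only for illustration; the thread itself uses VBScript). It assumes the troublesome sequence is stored in the file as the literal characters of a numeric character reference such as `&#13;`, which is an assumption, and the names `CR_REFERENCE`, `ANY_NUMERIC_REFERENCE`, and `strip_references` are hypothetical.

```ruby
# Assumption: the file contains the reference as literal text ("&#13;"),
# not as an actual carriage-return character.
raw = "<p>Click <strong>OK</strong>.</p>&#13;\n<li>&#13;\n<p>Save &amp; close &#169;</p>"

# Narrow: only the decimal or hex reference for carriage return.
CR_REFERENCE = /&#(?:0*13|x0*d);/i

# Broad: any numeric character reference, decimal or hex.
ANY_NUMERIC_REFERENCE = /&#(?:\d+|x\h+);/i

# Delete every match of the given pattern from the text.
def strip_references(text, pattern)
  text.gsub(pattern, "")
end

narrow = strip_references(raw, CR_REFERENCE)
broad  = strip_references(raw, ANY_NUMERIC_REFERENCE)
puts narrow
puts broad
```

The narrow pattern leaves other references such as `&#169;` alone, while the broad one removes them too; named entities like `&amp;` survive both.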