Need to extract text between <h3>, <li> and <strong> using AppleScript

I am completely new to AppleScript. I think this is the simplest script you can imagine, but I still can't get it working.
What I want to do is:
1. Get the HTML code from the page
2. Get the name from between the <h3> tags
3. Get the column names from between the <strong> tags
4. Get the values for the columns from between <li><strong>any value</strong> and </li>
5. Create an Excel file with a first column "Name" holding the value from step 2, and further columns with the titles from step 3 and their values from step 4.
The code:
<pre>
<div>
<div>
<h3>NAME</h3>
</div>
<div>
<ul class="circle">
<li><strong>Admin: </strong>Name</li>
<li><strong>Phone </strong>+XX XX XXX XXX</li>
<li><strong>Email: </strong>email@email.com</li>
</ul>
</div>
<div>
<ul>
<li><strong></strong></li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
<li><strong>Title: </strong>value</li>
</ul>
</div>
</div>
</pre>

You can search for substrings in AppleScript as such:
set AppleScript's text item delimiters to "<strong>"
Then you can refer to each delimited item (what's between each delimiter) with text item # (where # is a number), or get the full list of delimited items with every text item.
By doing this, you can slice your text, get the text item, set the delimiters again to refine what you got, get the next text item you need from that, and so on until you have the substring you want. You can make it more efficient by putting it into a subroutine (handler).
When AppleScript's text item delimiters are set, they are also inserted between list elements when you convert a list of strings into a string via as string. This also allows you to do a wholesale find/replace operation easily: get your list of text items, change the delimiters, and then rejoin them with as string.
It's good practice to always set AppleScript's text item delimiters back to "" when you're done with them being something else. (Some people consider it better practice to save them in a variable before changing them, e.g. set oldDelims to AppleScript's text item delimiters, and then restore them from that variable, but that's not my personal style.)
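For readers who want to see the slice-and-refine idea in isolation, here is a rough Python analogue. It is not AppleScript: `str.split` stands in for "text items of" and `str.join` for rejoining with "as string". The sample line is taken from the question; the variable names are only illustrative.

```python
# Python analogue of AppleScript's text item delimiters (not AppleScript itself).
line = "<li><strong>Admin: </strong>Name</li>"

# Step 1: split on the opening tag and keep the second item (what follows it).
after_open = line.split("<strong>")[1]

# Step 2: change the "delimiter" to the closing tag to separate label and rest.
label, rest = after_open.split("</strong>")

# Step 3: refine once more to cut off the trailing </li>.
value = rest.split("</li>")[0]

# Find/replace: split on one delimiter, then rejoin with another.
replaced = "<b>".join(line.split("<strong>"))

print(label.strip(" :"), "->", value)
print(replaced)
```

Each split plays the part of one "set delimiters, take text item N" step, so the same sequence should translate back to AppleScript fairly directly.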

Related

Thymeleaf: Check if list contains a string containing a substring

I have a combined check that needs to happen in Thymeleaf:
List contains an item - can be done as th:if="${#lists.contains(data, '...')}" if you know the exact string
Item contains a substring - When iterating, can be done as th:each="item : *{data}" th:if="${#strings.contains(item,'(')}" e.g. to check for the substring "(" among the items of the list
I need to display a UL tag if the list contains an item containing the substring "(". No iteration, just this combined condition. How do I achieve that in one line?
<ul th:if="..."> <!-- This must be a combined check, no iteration. I don't even want to output the UL if not satisfied -->
</ul>
You can accomplish this with collection selection. Just test if the list size is greater than zero. Something like this will work:
<ul th:if="${#lists.size(data.?[#strings.contains(#this,'(')]) GT 0}">
</ul>
Assume you have a list like this, for testing:
List<String> data = Stream.of("abc", "d(ef", "ghi")
.collect(Collectors.toList());
You can use the following:
<ul th:if="${#strings.contains( #strings.listJoin(data,'') ,'(')}">
bazinga
</ul>
This first concatenates each item in the list into a single string.
It then checks to see if that string contains any ( characters.
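The two answers use slightly different logic, which a small Python sketch can make explicit. Python is used here only to illustrate the conditions, not as a replacement for the Thymeleaf expressions, and the function names are made up. For a single-character needle such as "(" both checks agree, but joining with an empty separator can report a match for a longer substring that only exists across the boundary of two items.

```python
data = ["abc", "d(ef", "ghi"]


def any_item_contains(items, needle):
    """Mirrors the collection-selection answer: filter the items, then test for a non-empty result."""
    return len([item for item in items if needle in item]) > 0


def joined_contains(items, needle):
    """Mirrors the listJoin answer: concatenate everything, then search once."""
    return needle in "".join(items)


# Both approaches agree for the "(" case from the question.
same_for_paren = any_item_contains(data, "(") == joined_contains(data, "(")

# "cd" is in no single item, but appears once "abc" and "d(ef" are glued together.
spans_items = (any_item_contains(data, "cd"), joined_contains(data, "cd"))

print(same_for_paren, spans_items)
```

If the substring can be longer than one character, the per-item check is the safer of the two; alternatively, joining with a separator that cannot occur in the data avoids the boundary problem.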

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
if omitting the condition normalize-space() = "Text" is a clue.
normalize-space() returns ... Text Here ... for the first li
in the posted XML and ... Text ... for the second (or some other
content in place of ... from div/svg or div/small) causing
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
data-testid="menu-item" matches both outer li elements, while the text content you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and its inner strong element's text value, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested both expressions, and they work correctly.
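To see why the whole-element comparison fails while the test on the strong child succeeds, here is a small sketch using Python's standard xml.etree.ElementTree on the markup from the question. ElementTree only supports a subset of XPath (there is no normalize-space()), so the `normalize_space` helper below is a hand-written stand-in and the text comparison is done in Python rather than inside the XPath expression.

```python
import xml.etree.ElementTree as ET

MENU = """<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>"""


def normalize_space(text):
    """Stand-in for XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(text.split())


root = ET.fromstring(MENU)
items = root.findall(".//li[@data-testid='menu-item']")

# String value of each whole li: it includes the svg and small content too,
# so it can never equal "Text" on its own.
whole = [normalize_space("".join(li.itertext())) for li in items]

# Comparing only the strong child isolates the label.
matches = [
    li for li in items
    if normalize_space(li.findtext("div/strong", default="")) == "Text"
]

print(whole)
print(len(matches))
```

Here the `...` placeholders from the question are treated as literal text; with real svg or small content the string value of each li would differ, but it would still not be exactly "Text".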

AppleScript - Syntax Error trying to extract text

I'm trying to create a basic script that extracts some text from a webpage, but when I save I'm getting a Syntax error that I don't understand...
... please see screenshot.
The second Set (Role1Result) is working fine.
I'm a bit of a newbie at this, so any help really appreciated.
Here's the relevant bit of code pasted:
on extractText(SearchText, startText, endText)
set tid to AppleScript's text item delimiters -- save them for later.
set AppleScript's text item delimiters to startText -- find the first one.
set liste to text items of SearchText
set AppleScript's text item delimiters to endText -- find the end one.
set extracts to {}
repeat with subText in liste
if subText contains endText then
copy text item 1 of subText to end of extracts
end if
end repeat
set AppleScript's text item delimiters to tid -- back to original values.
return extracts
end extractText
--- roles ---
set role0Result to extractText(input0, " <dd class="result-lockup__highlight-keyword">
<span data-anonymize="job-title" class="t-14 t-bold">", "</span>
<span>
at
")
set role1Result to extractText(input1, " <dd class=\"result-lockup__highlight-keyword\">
<span class=\"t-14 t-bold\">", "</span>
<span>
")
Within a string literal you have to escape every double quote with a backslash:
set role0Result to extractText(input0, " <dd class=\"result-lockup__highlight-keyword\">
<span data-anonymize=\"job-title\" class=\"t-14 t-bold\">", "</span>
<span>
at
")

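As a cross-check of the handler's logic and of the escaping rule, here is a rough Python counterpart. It is only a sketch: the `page` markup is invented for the example, and unlike the handler in the question it skips the chunk before the first start marker.

```python
def extract_text(search_text, start_text, end_text):
    """Collect the text found between each start_text / end_text pair."""
    extracts = []
    # Everything before the first start marker is ignored.
    for chunk in search_text.split(start_text)[1:]:
        if end_text in chunk:
            extracts.append(chunk.split(end_text)[0])
    return extracts


# Double quotes inside a double-quoted literal are escaped with a backslash,
# just as in the corrected AppleScript call.
page = "<dd class=\"kw\"><span class=\"t-14 t-bold\">Engineer</span></dd>"
roles = extract_text(page, "<span class=\"t-14 t-bold\">", "</span>")

print(roles)
```

Leaving out a single backslash ends the string literal early, which is the kind of mistake that produces the syntax error described in the question.
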
How to extract inner text of multiple Paragraph tags which are nested within an anchor tag

Here is the code:
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
Using XPath I want to extract the inner text of all the Paragraph tags which are contained inside the anchor tag as a single text entity.
The final result I want is
string letterBody = document.DocumentNode.SelectSingleNode("//XPATH QUERY").InnerText;
where letterBody="Dear Sir, This is with...................Regards."
You just need to get the <a> element; its InnerText gives you all the text nodes under <a>.
So your XPath would be //a[@id='Letter1'] or just //a.
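The same idea (select the anchor, then read all of its descendant text at once) can be sketched with Python's standard xml.etree.ElementTree. This is an illustration only, not the HtmlAgilityPack API from the question; the wrapping <div> is an assumption added so that the anchor is not the document root.

```python
import xml.etree.ElementTree as ET

DOC = """<div>
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
</div>"""

root = ET.fromstring(DOC)

# Select the anchor by its id attribute, wherever it sits in the tree.
anchor = root.find(".//a[@id='Letter1']")

# itertext() walks every text node below the anchor in document order;
# split/join collapses the line breaks between paragraphs into single spaces.
letter_body = " ".join("".join(anchor.itertext()).split())

print(letter_body)
```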

Xquery to extract text in html

I am working on extracting text out of HTML documents and storing it in a database. I am using the Web-Harvest tool for extracting the content. However, I am kind of stuck at one point. Inside Web-Harvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text "Hello world2" that is between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a element with a name attribute set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (aka the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace in the above expression $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
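The distinction behind all of these answers (text inside an a element versus the text node that follows it) can be illustrated with Python's standard xml.etree.ElementTree, where the two are exposed as `text` and `tail`. This is only a sketch of the concept in another tool, not XQuery or Web-Harvest, and it covers the simple case where a single text node follows the element.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("""<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>""")

# Select the anchor by attribute value rather than by position.
hw2 = td.find("a[@name='hw2']")

inside = hw2.text          # text inside the element (what a/text() returns)
after = hw2.tail.strip()   # text between this element and the next sibling

print(inside, "|", after)
```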
