I am getting data from a broken RSS feed that gives me the wrong link. I wanted to fix this link, so I wrote this regular expression:
<link.*>(.*)&.*tid(.*)</link>
and the link could be like:
www.somedomain.com/?value=50&burrrdurrrr;tid=120
But the real working link is in this form:
www.somedomain.com/?value=50&tid=120
What I'm asking is: my measure currently looks like this:
[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=2 ;now I only get www.somedomain.com/?value=50
Substitute=#SubstituteFeed#
How am I supposed to concatenate the captured strings back together to complete the URL?
I'm guessing that rather than &burrrdurrrr;, the link actually contains &amp;, which is how you have to write & in an HTML or XML file.
If that's the case, you just need to set the DecodeCharacterReference option, as described in this handy-looking tutorial. Another option mentioned there is Substitute, which would be able to strip it out even if it really was &burrrdurrrr;.
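For illustration, here is a rough sketch of how the pair of measures might look with DecodeCharacterReference turned on; the parent measure's URL and regex are placeholders, and the commented-out Substitute line is the fallback for the case where the feed really does contain the literal junk:

[Feed]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=http://www.somedomain.com/feed
; capture the whole link text in one group instead of splitting it at the &
RegExp="(?siU)<link.*>(.*)</link>"
; decode &amp; and other character references in the captured strings
DecodeCharacterReference=1

[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=1
; fallback: strip the junk out of the captured string if it really is there
; Substitute="burrrdurrrr;":""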
None of this is a particularly sensible way of dealing with HTML or XML - a much better approach would be a plugin which actually parsed the document structure and let you reference nodes using XPath or CSS rules - but you work with what you've got, I guess. (I've never heard of this "Rainmeter" before, despite its claim to be "the best known and most popular desktop customization program for Windows"; maybe because nobody else calls their program that, instead almost universally using the word "widget"?)
I've got a rather large asciidoc document that I translate dynamically to PDF for our developer guide. Since the doc often refers to Java classes that are documented in our JavaDoc, we converted those references into links directly in the docs, e.g.:
In this block we create a new
https://www.codenameone.com/javadoc/com/codename1/ui/Form.html[Form]
named `hi`.
This works rather well for the most part and looks great in HTML as every reference to a class leads directly to its JavaDoc making the reference/guide process much simpler.
However, when we generate a PDF, some pages end up with a long list of repeated link footnotes at the bottom.
Normally I wouldn't mind a lot of footnotes or even repeats from a previous page. However, in this case the link to Container appears 3 times.
I could remove some of the links, but I'd rather not, since they make a lot of sense in the web version. And since I have no idea where the page breaks will land, I'd rather not do it by hand.
This looks to me like a bug somewhere; if the link is the same, the footnote for it should only be generated once.
I'm fine with removing all link footnotes in the document if that is the price to pay, although I'd rather be able to do this on a case-by-case basis so that some links would remain printable.
Adding these two parameters in fo-pdf.xsl removes the footnotes:
<xsl:param name="ulink.footnotes" select="0"></xsl:param>
<xsl:param name="ulink.show" select="0"></xsl:param>
The first parameter disables footnotes, which makes the URLs re-appear inline.
The second parameter removes the URLs from the text. Links remain active and clickable.
Setting either parameter to a non-zero value turns the corresponding behavior back on.
Source:
http://docbook.sourceforge.net/release/xsl/1.78.1/doc/fo/ulink.show.html
We were looking for something similar in a slightly different situation and didn't find a solution. We ended up writing a processor that just stripped away some of the links, e.g. every link to the same URL within a section that started with '==='.
Not an ideal situation, but as far as I know it's the only way.
I am using NSXMLParser to parse HTML from web sites. The test site is under my control, but in operation the sites will not be.
The problem is when the parser encounters JavaScript which contains "bad" characters, for example JavaScript containing if(screen.width<=521). The problem is the < in the code. I can see the problem but am unsure if there is any good way round it. (NSXMLParser reports NSXMLParserErrorDomain error 68, and I can see why: it is treating the <= as the start of a new tag, but = is not a valid tag-name character...) But then what would I do with, e.g., if(var<20)?
I'm actually not interested in the specific content, so I could do things like a global replace/removal of e.g. "<=" and ">=" (etc.), but in some regards that seems a bit of a mess, as I was using NSXMLParser to avoid having to start messing around with the content. If substitution is the best way forward, I can envisage "<=" and ">=", but are there any other sequences I should include?
I am new to Cocoa so may easily have missed something obvious - in which case many apologies. I did see that others have found similar problems but could not get a good way forward from the questions.
I am handling the error OK (in a tidy manner) but it is preventing my app from doing what it is meant to do - i.e. I need to avoid the error rather than handle it.
Background: the application is doing a "before" and "after" comparison on the HTML and looking for changes. I could swap "<=" for something really weird, then swap it back when necessary. I could even check the data for the replacement content first to eliminate possible ambiguities (e.g. find a UID sequence not in the downloaded page, replace "<=" with the UID sequence, parse the page, and, if need be, replace the UID with "<=" again; ditto for ">=").
(I have looked at e.g. libtidy or libxml2, but cannot find easy documentation and am wary about launching down such a route if it will not solve the issues.)
NSXMLParser, as its name implies, is not meant for parsing HTML. XML is much stricter than HTML, and the errors you've encountered are certainly not the only ones that are possible with real-world HTML. There are HTML documents that are also valid XML, but that is the exception, rather than the norm.
I would suggest using a proper HTML parser instead, such as this one, which is an Objective-C wrapper around libxml's HTML parsing functions.
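To make that concrete, here is a minimal sketch of feeding the downloaded bytes straight to libxml2's HTML parser (link the binary against libxml2); the function name, the NSData source, and the XPath query are illustrative assumptions rather than code from that wrapper:

#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>
#import <Foundation/Foundation.h>

// Parse real-world HTML leniently and pull out every href attribute.
// htmlData is assumed to hold the downloaded page bytes.
static NSArray *LinksInHTMLData(NSData *htmlData)
{
    NSMutableArray *links = [NSMutableArray array];
    // HTML_PARSE_RECOVER makes libxml tolerate things like if(screen.width<=521)
    htmlDocPtr doc = htmlReadMemory((const char *)[htmlData bytes], (int)[htmlData length], "", NULL,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) return links;

    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr found = xmlXPathEvalExpression((const xmlChar *)"//a/@href", ctx);
    if (found != NULL && found->nodesetval != NULL) {
        for (int i = 0; i < found->nodesetval->nodeNr; i++) {
            xmlChar *value = xmlNodeGetContent(found->nodesetval->nodeTab[i]);
            if (value != NULL) {
                [links addObject:[NSString stringWithUTF8String:(const char *)value]];
                xmlFree(value);
            }
        }
    }
    xmlXPathFreeObject(found);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return links;
}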
So what I would like to do is scrape this site: http://boxerbiography.blogspot.com/
and create one HTML page that I can either print or send to my Kindle.
I am thinking of using Hpricot, but am not too sure how to proceed.
How do I set it up so it recursively checks each link, gets the HTML, either stores it in a variable or dumps it to the main HTML page and then goes back to the table of contents and keeps doing that?
You don't have to tell me EXACTLY how to do it, but just the theory behind how I might want to approach it.
Do I literally have to look at the source of one of the articles (which is EXTREMELY ugly, btw), e.g. view-source:http://boxerbiography.blogspot.com/2006/12/10-progamer-lim-yohwan-e-sports-icon.html, and manually program the script to extract text between certain tags (e.g. h3, p, etc.)?
If I take that approach, I'll have to look at the individual source of each chapter/article and repeat the process for each one. That kinda defeats the purpose of writing a script to do it, no?
Ideally I would like a script that can tell the difference between JS/other code and the actual 'text', and dump just the text (formatted with the proper headings and such).
Would really appreciate some guidance.
Thanks.
I'd recommend using Nokogiri instead of Hpricot. It's more robust, uses fewer resources, has fewer bugs, is easier to use, and is faster.
I did some extensive scraping for work at one time and had to switch to Nokogiri, because Hpricot would crash inexplicably on some pages.
Check this RailsCast:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
and:
http://nokogiri.org/
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html
http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/
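In that spirit, a rough sketch of the whole scrape with Nokogiri; the CSS selectors 'h3.post-title a' and 'div.post-body' are guesses at the Blogspot template, so inspect the real markup and adjust them:

require 'open-uri'
require 'nokogiri'

index_url = 'http://boxerbiography.blogspot.com/'
book = ['<html><body>']

# Collect every article link from the table-of-contents page.
index = Nokogiri::HTML(URI.open(index_url))
article_urls = index.css('h3.post-title a').map { |a| a['href'] }

article_urls.each do |url|
  page = Nokogiri::HTML(URI.open(url))
  # Keep only the title and body of each post; scripts and sidebars are ignored.
  title = page.at_css('h3.post-title')
  body  = page.at_css('div.post-body')
  book << "<h1>#{title.text.strip}</h1>" if title
  book << body.to_html if body
end

book << '</body></html>'
File.write('book.html', book.join("\n"))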
What's a good way to parse HTML in AppleScript?
I haven't dabbled in AppleScript in quite some time, and even when I did it was very minimal and uninvolved, so I don't really think naturally in the language quite yet. But I need to do some string manipulation and parse some HTML (basically some simple screen scraping).
Naturally, I'd like to avoid common pitfalls of HTML parsing. However, this is a temporary script and doesn't need to be particularly robust or supportable. I really just need to scrape specific substrings (from a known starting substring to the next known character) into a file.
I've done plenty of string manipulation in C# and similar languages, but AppleScript is an interesting change of pace to say the least. Can somebody point me to some good resources (Google searches on this subject seem to have a high noise-to-signal ratio), or help me out with some sample code snippets?
The ultimate goal of what I'm doing is to take a pre-determined list of pages, open each one in Safari (I'm doing everything through tell application "Safari"), parse out links which fit a certain pattern, and store all of those links in a file. Then go through that file, open each of those links, parse out more links which fit another pattern, and store all of those links in a file.
(The site is actually owned by someone we're working with, so don't worry about me violating any terms of service or anything like that. But for reasons outside the scope of this question, I'm doing some page scraping in AppleScript.)
I can't say enough good things about Matt Neuburg's AppleScript: the Definitive Guide. Without a doubt the most complete documentation of AppleScript ever done. Matt's also one of my favorite tech writers.
I would also check out this article. It contains a tutorial on how to do this; the example provided there parses HTML data from only one source, but I think it's worth looking at.
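For the quick-and-dirty substring route, here is a sketch that leans entirely on text item delimiters against Safari's page source; the href marker, the URL prefix, and the output file name are assumptions you would adapt to the actual pages:

-- grab the raw HTML of the frontmost Safari page
tell application "Safari" to set pageSource to source of front document

set matchingLinks to {}
set AppleScript's text item delimiters to "href=\""
set chunks to text items of pageSource
set AppleScript's text item delimiters to "\""
repeat with i from 2 to count of chunks
	-- everything up to the next quote is the link target
	set theLink to text item 1 of (item i of chunks)
	if theLink starts with "http://example.com/articles/" then
		set end of matchingLinks to theLink
	end if
end repeat
set AppleScript's text item delimiters to ""

-- write one link per line to a file on the Desktop
set outFile to (path to desktop as text) & "links.txt"
set fileRef to open for access file outFile with write permission
set eof of fileRef to 0
repeat with aLink in matchingLinks
	write ((contents of aLink) & linefeed) to fileRef
end repeat
close access fileRef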
What are the best algorithms for recognizing structured data on an HTML page?
For example, Google will recognize a home or company address in an email and offer a map of that address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the markup proper—and not just the text from the page—I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or I didn't have the necessary data to look for the first two.
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
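To make the markup-based route concrete, a small sketch with BeautifulSoup (the bs4 package is assumed) that pulls <address> elements and adr-microformat nodes out of a page; the sample HTML is purely illustrative:

from bs4 import BeautifulSoup

def candidate_addresses(html):
    """Return text that is explicitly marked up as an address."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    # 1. The semantic <address> element: the lowest of low-hanging fruit.
    for node in soup.find_all("address"):
        found.append(node.get_text(" ", strip=True))
    # 2. The adr microformat: class="adr" wrapping parts like street-address.
    for node in soup.find_all(class_="adr"):
        found.append(node.get_text(" ", strip=True))
    return found

html = '<address>221B Baker Street, London</address>' \
       '<div class="adr"><span class="street-address">1600 Amphitheatre Pkwy</span>, ' \
       '<span class="locality">Mountain View</span></div>'
print(candidate_addresses(html))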
I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. If you do use a regular expression, apply it to the text of the HTML elements BeautifulSoup grabs, not to the raw markup.
If you do it with your own regexes, you not only have to worry about finding the data you require, you have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route, your best bet is probably to check out the source code of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of address formats, you can only guess whether a string is an address or not with an expression like "(number), (name) Street|Boulevard|Main", etc.
You could consider looking into some Firefox extensions which aim to map addresses found in text, to see how they work.
You can check this USA address extraction example: http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
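As a very rough illustration of that guess-only pattern (US-centric, and it will both miss real addresses and match non-addresses, so treat hits only as candidates for a later lookup):

import re

# house number, a few capitalized name words, then a common street-type suffix
ADDRESS_GUESS = re.compile(
    r"\b\d{1,5}\s+"
    r"(?:[A-Z][a-z]+\s){1,4}"
    r"(?:Street|St|Avenue|Ave|Boulevard|Blvd|Road|Rd|Lane|Ln|Drive|Dr)\b"
)

text = "Visit us at 1600 Pennsylvania Avenue or 221 Fake Boulevard tomorrow."
print(ADDRESS_GUESS.findall(text))  # candidate strings only; still need verification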
It depends on your requirements.
For email addresses and contact details, a regex is more than enough.
For postal addresses, a regex alone will not help; think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information such as paragraphs, get the contents by using the tags.