Follow all links in a JSON-LD API (HATEOAS)

Say I want to consume an API that returns JSON-LD and follow all the links. (I'm experimenting with the Hydra API-Demo, but it should work with all JSON-LD APIs, not only Hydra-based ones. Any good APIs out there that I should try?)
So I want to follow all the links, but my environment doesn't have native RDF support. Presumably I should first parse the response with one of the JSON-LD libraries and get it into expanded form with jsonld.expand(). Then I just grab all the values under the key @id. Is that the recommended way to do it, or am I missing some edge cases?

The purpose of the expansion API is to produce regular, context-free output (expanded form) for algorithmic processing, which is exactly what it sounds like you want to do. So, yes, you've got the right approach; you shouldn't be missing any edge cases as I've understood you. After you have JSON-LD in expanded form, you can easily follow the @ids (and, if you also need to do some kind of analysis of the vocabulary/ontology, you can follow the properties, which will then be fully expanded URLs).
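To make that concrete, here is a minimal sketch in Java using the jsonld-java port (the same flow applies with jsonld.js); the sample document is made up, and I'm assuming the library's JsonLdProcessor.expand and JsonUtils.fromString entry points. In expanded form every link ends up as a map entry under the @id key, so a recursive walk finds them all.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import com.github.jsonldjava.core.JsonLdProcessor;
    import com.github.jsonldjava.utils.JsonUtils;

    public class LinkCollector {

        // Recursively collect every value stored under the @id key.
        static void collectIds(Object node, Set<String> ids) {
            if (node instanceof Map) {
                Map<?, ?> map = (Map<?, ?>) node;
                Object id = map.get("@id");
                if (id instanceof String) {
                    ids.add((String) id);
                }
                for (Object value : map.values()) {
                    collectIds(value, ids);
                }
            } else if (node instanceof List) {
                for (Object item : (List<?>) node) {
                    collectIds(item, ids);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical response body; any JSON-LD document works here.
            String json = "{\"@context\": {\"related\": {\"@id\": \"http://example.org/related\", \"@type\": \"@id\"}},"
                    + " \"@id\": \"http://example.org/doc\","
                    + " \"related\": \"http://example.org/other\"}";
            Object expanded = JsonLdProcessor.expand(JsonUtils.fromString(json));
            Set<String> ids = new LinkedHashSet<>();
            collectIds(expanded, ids);
            ids.forEach(System.out::println);  // the links to follow
        }
    }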

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but would like to query vector data for a large area. Thus I need to limit the results I get by filtering the request on tags.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tags/values when querying for a way, but I don't know which tags/values exist. Is there a list of the most common ones?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo there are currently 75 380 856 different tags. I'm pretty sure you are not interested in most of them. Likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a general overview, take a look at the map features page. Are you interested in streets? Then look at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.
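As a rough sketch of what the equivalent request looks like against Overpass API: the endpoint below is the public overpass-api.de instance, the bounding box and the highway filter are example values, and Overpass QL takes coordinates in south,west,north,east order.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class OverpassExample {
        public static void main(String[] args) throws Exception {
            // Rough Overpass QL equivalent of the XAPI request
            // way[highway=*][bbox=...]: all highway ways in a bounding box.
            String query = "[out:json];way[\"highway\"](50.6,7.0,50.8,7.3);out geom;";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://overpass-api.de/api/interpreter"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "data=" + URLEncoder.encode(query, StandardCharsets.UTF_8)))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }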

i18n server-side vs. client-side

Seeking some advice on two approaches to internationalization & localization. I have a web app using Spring MVC and Dojo, and I would like to support multiple languages. So, I could:
Use <spring:message> to generate the appropriate text on the server side using a properties file.
Use dojo/i18n to select the appropriate text on the client side using a js file.
And of course any combination of the two is also an option.
So, what are the pros and cons of each approach? When would you use one vs. the other?
The combination of these two approaches is the only reasonable answer.
Basically, you should try to stick to server-side, and only do client-side when it is really necessary (when there is no other way, e.g. you have some dynamically created controls).
Pros and cons? The main con of client-side string externalization is that you won't be able to translate everything correctly, because of context: the same English term may need to be translated differently depending on where it appears.
At the same time, you will often need to format messages (add parameters to your message tag), which in regular Java you would do by calling MessageFormat.format(). Theoretically, you could do that on the client side, but this is risky, to say the least. You won't have access to the original message parts (like dates, some data sources, whatever) and it might hurt translation correctness.
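For illustration, a minimal server-side sketch with java.text.MessageFormat; the German pattern is invented and stands in for a value that would normally be loaded from a locale-specific properties file:

    import java.text.MessageFormat;
    import java.util.Date;
    import java.util.Locale;

    public class MessageDemo {
        public static void main(String[] args) {
            // In practice the pattern comes from messages_de.properties etc.;
            // the locale-aware formatter handles the date and number, not the client.
            String pattern = "Am {0,date,long} wurden {1} Artikel versandt.";
            MessageFormat format = new MessageFormat(pattern, Locale.GERMANY);
            System.out.println(format.format(new Object[] {new Date(), 1250}));
        }
    }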
Formatting dates, numbers, etc. is more painful on the client side. It is possible with Dojo or jQuery Globalize, but the results might not be as good as they should be. But Spring has problems formatting dates anyway (there is no way to give a locale-specific date/time pattern; you may only choose from short, medium, long, and full, which to me is completely useless).
Another issue might be handling plural forms (non-English ones). Believe it or not, languages may have more than one plural form (depending on the quantity), and because of that translations might differ. I don't think Dojo handles it at all (however, I might be mistaken; some time has passed since I evaluated it). Spring won't handle it either, but you may build a custom solution based on ICU's PluralRules (or PluralFormat, if you're hardy enough to learn its formatting syntax and want to kill the translators at the same time).
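A small sketch of the PluralRules route mentioned above, using ICU4J (Polish is just an example of a language with more than two plural categories):

    import com.ibm.icu.text.PluralRules;
    import com.ibm.icu.util.ULocale;

    public class PluralDemo {
        public static void main(String[] args) {
            // Polish distinguishes "one", "few", and "many"; English only "one"/"other".
            PluralRules polish = PluralRules.forLocale(new ULocale("pl"));
            for (int n : new int[] {1, 2, 5, 22}) {
                // select() returns the CLDR plural category for the quantity,
                // which picks the right translated variant of the message.
                // Prints: 1 -> one, 2 -> few, 5 -> many, 22 -> few.
                System.out.println(n + " -> " + polish.select(n));
            }
        }
    }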
To cut a long story short, doing i18n correctly is far from easy, and you'll get better support on the server side.
BTW, I remember Dojo as quite "heavy"; the library itself was over 1 MB... It might take a while to load, and your application might seem slow compared to others... That was one of the reasons I recommended Globalize rather than Dojo for our projects. It might not have as many features, but at least it seems lightweight.

Algorithms recognizing physical address on a webpage

What are the best algorithms for recognizing structured data on an HTML page?
For example, Google will recognize a home or company address in an email, and offers a map of that address.
A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.
If you have the proper markup, and not just the text from the page, I second the Beautiful Soup suggestion. In particular, the address tag should provide the lowest of low-hanging fruit, and the adr microformat is also worth looking into (a sketch of both follows below). I'd only fall back to regexes if those two didn't pull enough info or I didn't have the necessary data to look for them.
If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.
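For a Java stack, a hedged analogue of the markup-based approach can be sketched with jsoup (the HTML snippet is invented; the selectors target the address element and the adr microformat class):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class AddressScan {
        public static void main(String[] args) {
            String html = "<body><address>1 Example Street, Springfield</address>"
                    + "<div class=\"adr\"><span class=\"street-address\">2 Demo Road</span></div></body>";
            Document doc = Jsoup.parse(html);
            // The <address> element is the lowest of low-hanging fruit.
            for (Element e : doc.select("address")) {
                System.out.println("address tag: " + e.text());
            }
            // The adr microformat marks addresses up with class names.
            for (Element e : doc.select(".adr")) {
                System.out.println("adr microformat: " + e.text());
            }
        }
    }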
I'd guess that Google takes a two-step approach to the problem (at least, that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up each string and see whether they get a match. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.
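A rough sketch of that two-step idea; the pattern is deliberately loose and US-centric, and the geocoder check is a stub, since the confirmation step depends on whichever map database you have access to:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AddressCandidates {
        // Step 1: an over-matching pattern: a house number, a few capitalized
        // words, then a street designator. False positives are expected.
        private static final Pattern CANDIDATE = Pattern.compile(
                "\\b\\d{1,5}\\s+(?:[A-Z][a-z]+\\s+){1,3}(?:Street|St|Avenue|Ave|Boulevard|Blvd|Road|Rd)\\b");

        // Step 2 (stub): a real implementation would query a map/geocoding
        // database and keep only candidates that resolve to a location.
        static boolean confirmedByGeocoder(String candidate) {
            return true; // placeholder
        }

        public static void main(String[] args) {
            String text = "Visit us at 221 Baker Street or 1600 Pennsylvania Avenue.";
            Matcher m = CANDIDATE.matcher(text);
            while (m.find()) {
                if (confirmedByGeocoder(m.group())) {
                    System.out.println("candidate: " + m.group());
                }
            }
        }
    }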
Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
Do not use regular expressions to parse HTML. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. Even if you do use a regular expression, apply it to the text of the elements BeautifulSoup grabs, not to the raw markup.
If you do it with your own regexes, you not only have to worry about finding the data you require, you have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.
What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most websites that I've seen do a pretty bad job handling all but the simplest URLs.
If you want to go down the regexp route, your best bet is probably to check out the source code of
http://metacpan.org/pod/Regexp::Common::URI::http
Again, regular expressions should do the trick.
Because of the wide variety of addresses, you can only guess whether a string is an address or not, using an expression like "(number), (name) Street|Boulevard|Main", etc.
You could consider looking into some Firefox extensions that aim to map addresses found in text, to see how they work.
You can check this USA address extraction example: http://code.google.com/p/graph-expression/wiki/USAAddressExtraction
It depends on your requirements.
For email and contact details, a regex is more than enough.
For addresses, a regex alone will not help; think about NLP (NER) and POS tagging.
For finding people-related information, you can't do anything without NER.
If you need information like paragraphs, get the contents by using tags.

What are the URL parameter naming conventions or standards to follow?

Are there any naming conventions or standards for URL parameters to be followed? I generally use camel casing, like userId or itemNumber. As I am about to start a new project, I searched for guidance on this and could not find anything. I am not looking at this from the perspective of a language or framework, but more as a general web standard.
I recommend reading Cool URIs Don't Change by Tim Berners-Lee for insight into this question. If you're using parameters in your URI, it might be better to rewrite them to reflect what the data actually means.
So instead of having the following:
/index.jsp?isbn=1234567890
/author-details.jsp?isbn=1234567890
/related.jsp?isbn=1234567890
You'd have
/isbn/1234567890/index
/isbn/1234567890/author-details
/isbn/1234567890/related
It creates a more obvious data structure, and means that if you change the platform architecture, your URIs don't change. Without the above structure,
/index.jsp?isbn=1234567890
becomes
/index.aspx?isbn=1234567890
which means all the links on your site are now broken.
In general, you should only use query strings when the user could reasonably expect the data they're retrieving to be generated, e.g. with a search. If you're using a query string to retrieve an unchanging resource from a database, then use URL-rewriting.
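As a sketch of what such rewriting might look like server-side, here is a hypothetical servlet filter (real projects usually use mod_rewrite, Tuckey's UrlRewriteFilter, or the framework's own routing; this assumes a Servlet 4.0+ container, where Filter.init and destroy have default implementations):

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;

    // Forwards /isbn/1234567890/index to /index.jsp?isbn=1234567890 so the
    // public URI stays stable even if the implementing page changes.
    public class IsbnRewriteFilter implements Filter {
        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String path = ((HttpServletRequest) req).getServletPath();
            String[] parts = path.split("/"); // "", "isbn", "1234567890", "index"
            if (parts.length == 4 && "isbn".equals(parts[1])) {
                req.getRequestDispatcher("/" + parts[3] + ".jsp?isbn=" + parts[2])
                        .forward(req, res);
                return;
            }
            chain.doFilter(req, res);
        }
    }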
There are no standards that I'm aware of. Just be mindful of IE's URL length limit of 2,083 characters.
The standard for URIs is defined by RFC 2396 (since superseded by RFC 3986).
Anything after the standardized portion of the URL is left to you.
You probably only want to follow a particular convention on your parameters based on the framework you use.
Most of the time you wouldn't even really care, because these are not under your control, but when they are, you probably want to at least be consistent and try to generate user-friendly bits:
that are short,
that are easy to remember, if they are meant to be directly accessible by users,
that are case-insensitive (which may be hard depending on the server OS),
and that follow SEO guidelines and best practices, which may help you a lot.
I would say that cleanliness and user-friendliness are laudable goals to strive for when presenting URLs.
StackOverflow does a fairly good job of it.
I use lowercase. Depending on the technology you use, the query string is either treated as case-sensitive (e.g. PHP) or not (e.g. ASP). Using lowercase avoids possible confusion.
Like the other answers, I haven't heard of any conventions.
The only "standard" I would adhere to is the search-engine-friendly practice of using a URL rewriter.
There are no standards that I know of, and case shouldn't matter.
However, within your application (website), you should stick to your own standards, for your own sanity if nothing else.

HTTP response splitting

I'm trying to handle this possible exploit and am wondering what the best way to do it is. Should I use Apache's Commons Validator and create a list of known allowed symbols, and use that?
From the Wikipedia article:
The generic solution is to URL-encode strings before inclusion into HTTP headers such as Location or Set-Cookie.
Typical examples of sanitization include casting to integer, or aggressive regular expression replacement. It is worth noting that although this is not a PHP specific problem, the PHP interpreter contains protection against this attack since version 4.4.2 and 5.1.2.
Edit
I'm tied into using JSPs with Java actions!
There don't appear to be any JSP-based protections for this attack vector; many descriptions on the web assume ASP or PHP, but this link describes a fairly platform-neutral way to approach the problem (JSP is used as an incidental example in it).
Basically, your first step is to identify the potentially hazardous characters (CRs, LFs, etc.) and then remove them. I'm afraid this is about as robust a solution as you can hope for!
Solution
Validate input. Remove CRs and LFs (and all other hazardous characters) before embedding data into any HTTP response headers, particularly when setting cookies and redirecting. It is possible to use third party products to defend against CR/LF injection, and to test for existence of such security holes before application deployment.
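In JSP/servlet terms the fix is small. A minimal sketch (the class and method names are made up; the same stripping applies to anything that ends up in a Location or Set-Cookie header):

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    public final class SafeHeaders {
        // Strip CR and LF so attacker-controlled input cannot terminate the
        // header and inject additional headers or a fake response body.
        static String sanitize(String value) {
            return value == null ? null : value.replaceAll("[\\r\\n]", "");
        }

        // Example use: sanitize anything placed in a redirect Location header.
        static void safeRedirect(HttpServletResponse response, String location)
                throws IOException {
            response.sendRedirect(sanitize(location));
        }
    }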
Use PHP? ;)
According to Wikipedia and the PHP CHANGELOG, PHP's had protection against it in PHP4 since 4.4.2 and PHP5 since 5.1.2.
I've only skimmed it, but this might help. The examples are written in JSP.
OK, well, casting to an int is not much use when reading strings, and using a regex in every action that receives input from the browser could be messy. I'm looking for a more robust solution.
