Is ISBN/ISSN + EPUB CFI good enough to link to an EPUB paragraph from the outside world?

To link to mybook.epub,
http://idpf.org/epub/linking/cfi/epub-cfi.html#gloss-cfi-pub suggests:
book.epub#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
I'm thinking of using the ISBN/ISSN for books instead, e.g.
isbn-12#epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
Is there a better, standard way of linking? I can't see one in the EPUB specs.

As far as I know, there isn't a standard way of linking.
Moreover, your ISBN idea has several issues:
Different editions of the same book (a different author credit, digital vs. paper version, pocket vs. full size, re-editions, several books of the same trilogy grouped into a single volume, etc.) will have different ISBNs.
Some Amazon Kindle-published books won't even have an ISBN, only an ASIN.
Books published before 2007 have a 10-digit ISBN and books from 2007 onward a 13-digit ISBN, so re-editions of old books might carry an ISBN in the new format.
Non-officially published books (self-published, i.e. available on the net, self-printed, etc.) won't have an ISBN either. Magazines have an ISSN instead, and research papers have a DOI, which is a more complex scheme.
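Whatever identifier you settle on, the mechanics of composing and splitting such a link are simple. Here is a minimal Python sketch of the identifier-plus-CFI-fragment idea from the question; the urn:isbn identifier in the example is a made-up illustration, not something the EPUB spec prescribes:

import re

def make_link(identifier, cfi):
    # Compose <identifier>#epubcfi(<cfi>). The identifier could be a file name,
    # a urn:isbn:... string, or whatever scheme you settle on (hypothetical choice).
    return "{}#epubcfi({})".format(identifier, cfi)

def split_link(link):
    # Split a link back into (identifier, cfi); raise if it doesn't look like one.
    identifier, sep, rest = link.partition("#epubcfi(")
    if not sep or not rest.endswith(")"):
        raise ValueError("not an epubcfi link: {!r}".format(link))
    return identifier, rest[:-1]

# Example using the CFI from the question and a fictitious ISBN URN.
link = make_link("urn:isbn:9780000000000",
                 "/6/4[chap01ref]!/4[body01]/10[para05]/3:10")
print(link)
print(split_link(link))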

Related

Generate EDGAR FTP File Path List

I'm brand new to programming (though I'm willing to learn), so apologies in advance for my very basic question.
The SEC makes all of their filings available via FTP, and eventually I would like to download a subset of these files in bulk. However, before creating such a script, I need to generate a list of the locations of these files, which follow this format:
/edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
51143 = the company ID, and I already accessed the list of company IDs I need via FTP
000005114313000007/0000051143-13-000007 = the report ID, aka "accession number"
I'm struggling with how to figure this out as the documentation is fairly light. If I already have the 000005114313000007/0000051143-13-000007 (what the SEC calls the "accession number") then it's pretty straightforward. But I'm looking for ~45k entries and would obviously need to generate these automatically for a given CIK ID (which I already have).
Is there an automated way to achieve this?
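For the "straightforward" half of the problem, here is a minimal Python sketch of building an index path from a CIK and a dashed accession number, assuming the two-part format shown above (the dashed number with dashes stripped is the folder name):

def edgar_index_path(cik, accession_number):
    # Build an index path like
    # /edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm
    # from a CIK and a dashed accession number.
    folder = accession_number.replace("-", "")   # 0000051143-13-000007 -> 000005114313000007
    return "/edgar/data/{}/{}/{}-index.htm".format(int(cik), folder, accession_number)

print(edgar_index_path("51143", "0000051143-13-000007"))
# /edgar/data/51143/000005114313000007/0000051143-13-000007-index.htm

The harder part is getting the accession numbers for each CIK in the first place, which is what the answer below addresses.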
Welcome to SO.
I'm currently scraping the same site, so I'll explain what I've done so far. What I am assuming is that you'll have the CIK numbers of the companies you're looking to scrape. If you search the company's CIK, you'll get a list of all of the files that are available for the company in question. Let's use Apple as an example (since they have a TON of files):
Link to Apple's Filings
From here you can set a search filter. The document you linked was a 10-Q, so let's use that. If you filter on 10-Q, you'll have a list of all of the 10-Q documents. You'll notice that the URL changes slightly to accommodate the filter.
You can use Python and its web scraping libraries to take that URL and scrape all of the URLs of the documents in the table on that page. For each of these links you can scrape whatever links or information you want off the page. I personally use BeautifulSoup4, but lxml is another choice for web scraping, should you choose Python as your programming language. I would recommend using Python, as it's fairly easy to learn the basics and some intermediate programming constructs.
Past that, the project is yours. Good luck; I've posted some links below to get you started. I'm only allowed to post two links since I'm new to the site, so I'll give you the Beautiful Soup link:
Beautiful Soup Home Page
If you choose to use Python and are new to the language, check out the Codecademy Python course, and don't forget to look at lxml, as some people prefer it over BeautifulSoup (some people also use both in conjunction, so it's all a matter of personal preference).
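As a rough illustration of the approach above, here is a short Python sketch using requests and BeautifulSoup. The browse-edgar URL, its parameters, and the "documentsbutton" anchor id are assumptions based on how the public company-search page is laid out; verify them against the current site (and its scraping policy) before running this at volume:

import requests
from bs4 import BeautifulSoup

BASE = "https://www.sec.gov"

def filing_index_links(cik, form_type="10-Q"):
    # Assumed URL shape for the company-search page, filtered by form type.
    url = (BASE + "/cgi-bin/browse-edgar"
           "?action=getcompany&CIK={}&type={}&dateb=&owner=include&count=40"
           .format(cik, form_type))
    html = requests.get(url, headers={"User-Agent": "your-name you@example.com"}).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: each filing row links to its "-index" page via an anchor
    # with id="documentsbutton"; collect those hrefs.
    return [BASE + a["href"] for a in soup.find_all("a", id="documentsbutton")]

# Hypothetical usage with Apple's CIK (0000320193):
for link in filing_index_links("0000320193"):
    print(link)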

Parsing data and POS with Treetop vs. Stanford NLP

I'm trying to parse event (concerts, movies, etc. etc.) data in Ruby and can't decide on what tool to use.
I thought the stanford parser was the way to go initially, but then heard of treetop.
I'm struggling with both. Getting the Stanford Parser to work with Ruby on Windows has taken more than two days of searching, and I've hit no end of errors just getting it installed.
Treetop installed with no problem, but the documentation is very limited, and from what I can gather Treetop is better at dealing with grammar structure than with the actual content, though maybe I'm just not completely understanding Treetop's capabilities.
One of the nice things (I think) is that I have a large database/corpus(?) of band and movie names, and only a fairly limited set of fields that I'm looking to retrieve.
For instance one listing is
The Tragically Hip with Guest Hey Rosetta!, Friday Jul 15th, 7:30pm, Deer Lake Park
Another listing is
07/08/11 - Tacoma Dome, New Kids on the Block & Backstreet Boys w/ Matthew Morrison, 7:30pm, Tacoma, WA
With each listing I'm trying to grab a rather specific group of details, being who/what, date, time, city, venue.
Seeing as I already have a dataset of band names, and a list of city names should be fairly easy to get, it should be 'fairly' easy to pick out the other details. I'm just not sure which tool I should dedicate my time to, or whether there is a better way to do this.
Any suggestions?
No, Treetop is meant for parsing more structured languages (like computer languages). For natural language processing (NLP), you'd be better off using the Stanford Parser or something like it. Have a look at this blog entry about NLP in combination with Ruby:
http://mendicantbug.com/2009/09/13/nlp-resources-for-ruby/
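That said, if you stick with the dictionary-plus-patterns idea from the question rather than a full parser, the extraction can stay quite simple. A rough sketch follows, written in Python for brevity (the same idea translates directly to Ruby); the band list, city list and regexes are purely illustrative assumptions standing in for your real database:

import re

# Illustrative stand-ins for your database of band and city names.
KNOWN_ACTS = {"The Tragically Hip", "Hey Rosetta!", "New Kids on the Block",
              "Backstreet Boys", "Matthew Morrison"}
KNOWN_CITIES = {"Tacoma", "Vancouver"}

TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s*(?:am|pm)\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b(?:\d{2}/\d{2}/\d{2}|"
                     r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\w*\s+\w+\s+\d{1,2}\w*)\b")

def extract(listing):
    # Pull out whichever known acts/cities appear, plus the first date and time patterns.
    date_m = DATE_RE.search(listing)
    time_m = TIME_RE.search(listing)
    return {
        "acts": [act for act in KNOWN_ACTS if act in listing],
        "city": next((c for c in KNOWN_CITIES if c in listing), None),
        "date": date_m.group(0) if date_m else None,
        "time": time_m.group(0) if time_m else None,
    }

print(extract("The Tragically Hip with Guest Hey Rosetta!, Friday Jul 15th, 7:30pm, Deer Lake Park"))
print(extract("07/08/11 - Tacoma Dome, New Kids on the Block & Backstreet Boys w/ Matthew Morrison, 7:30pm, Tacoma, WA"))

Venues would need their own list or a positional heuristic, but this shows why a full NLP parser may be overkill for listings this regular.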

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing a part-of-speech & morphological analysis project for Japanese sentences. Each sentence will have its own web page. To make this page more visual, I want to show one picture which is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), relevant pictures would be pictures of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: use 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
Have a couple of questions though:
1) Which is better (richer corpus & more powerful search) for searching nouns in Japanese: the Google Images API, Bing Images API, Flickr API, etc.?
2) How do you select the most important noun from the sentence to use as the image search query, without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to identify は=subject properly (as I guess you know, は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it works about as well as could be expected, except where none of those particles are used, which is fairly rare.
Then there are sentences like the one below, where you might consider falling back to で and the noun before it when there is no を or は. Notice that here the word 人 (people) really doesn't tell you anything about what's being said; without parsing the context properly, you don't even know whether the noun means one person or many people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
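Something along these lines, sketched in Python (the (surface, part-of-speech) token format is an assumption; your Java tagger's output will look different):

PRIORITY = ["は", "が", "を", "で"]   # particles, in descending priority

def keyword_noun(tokens):
    # Return the noun immediately before the highest-priority particle found.
    for particle in PRIORITY:
        for i, (surface, pos) in enumerate(tokens):
            if surface == particle and i > 0 and tokens[i - 1][1] == "noun":
                return tokens[i - 1][0]
    # Fallback: first noun in the sentence, if any.
    return next((s for s, pos in tokens if pos == "noun"), None)

# 私 は 学生 です ("I am a student") -> returns 私, though 学生 would arguably make
# the better picture; that kind of tweak is exactly what you'd layer on top of this.
tokens = [("私", "noun"), ("は", "particle"), ("学生", "noun"), ("です", "copula")]
print(keyword_noun(tokens))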
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.

Which ISO format should I use to store a user's language code?

Should I use ISO 639-1 (2-letter abbreviation) or ISO 639-2 (3-letter abbreviation) to store a user's language code? Both are official standards, but which is the de facto standard in the development community? I think ISO 639-1 would be easier to remember, and is probably more popular for that reason, but that's just a guess.
The site I'm building will have a separate site for the US, Brazil, Russia, China, & the UK.
http://en.wikipedia.org/wiki/ISO_639
You should use IETF language tags because they are already used for HTTP/HTML/XML and many other technologies. They are based on several standards, including the ISO 639 collection (yes, language, region and culture selection is not so simple to define).
I wrote a more detailed article regarding proper language code selection and usage. The idea is to use the simplest/shortest ISO 639-1 codes and to specify more only for special cases. The article lists codes for the ~30 most used languages, with reasons why I consider one alternative better than another.
In case you want to skip reading the entire article here is a short list of language codes (not to be confused with country codes): ar, cs, da, de, el, en, en-gb, es, fr, fi, he, hu, it, ja, ko, nb, nl, pl, pt, pt-pt, ro, ru, sv, tr, uk, zh, zh-hant
The following points may not be obvious but should be borne in mind:
en is used for en-us (American English), while British English uses en-gb
pt is used for pt-br, and not pt-pt, which has far fewer speakers
zh is used instead of zh-hans, zh-CN, etc.
zh-hant (Traditional Chinese) is used instead of more specific codes like zh-hant-TW or zh-TW
You can find more explanations inside the article.
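As an illustration of that "shortest code unless it's a special case" idea, here is a small Python sketch. It only restates the rules listed above; the mapping of zh-TW/zh-HK to zh-hant is an assumption on my part, and a real site would extend the table to whatever it actually supports:

def normalize(tag):
    # Reduce an IETF language tag (e.g. from the browser) to a supported site code.
    tag = tag.strip().lower()
    if tag.startswith("zh-hant") or tag in ("zh-tw", "zh-hk"):
        return "zh-hant"              # assumption: Traditional-script regions -> zh-hant
    if tag in ("en-gb", "pt-pt"):
        return tag                    # kept distinct per the list above
    return tag.split("-")[0]          # pt-br -> pt, en-us -> en, zh-cn -> zh

for t in ["en-US", "en-GB", "pt-BR", "pt-PT", "zh-CN", "zh-TW", "ru"]:
    print(t, "->", normalize(t))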
I would go with a derivative of ISO 639. Specifically, I like to use this: http://en.wikipedia.org/wiki/IETF_language_tag
I'm no expert, but every site I've ever seen uses ISO 639-1, including the current site I'm working on.
It works for us!
I've only ever seen 2-character language codes in use - so I'd recommend going with them unless your work involves delving into linguistics in some way. If all you're doing is customizing the browsing experience for the world at large, you won't need the extra repertoire offered by 3-character codes.
ISO 639-1 alpha-2 codes are used pretty much universally.
They are used, for example, in HTTP content negotiation. If you ever wondered how an international website can automatically show you its homepage in your native language, that's how it works. (Although it's sometimes kind of annoying. I, for example, often get shown the default Apache homepage in German, because the webmaster turned on content negotiation but only put in content for English.)
Most web browsers use them directly in their settings dialog box.
Most operating systems use them in their settings dialog boxes or configuration files.
Wikipedia uses them in their server names for the different language versions.
In other words: if your users aren't native English speakers, they will probably already have encountered them when configuring their software, because otherwise they wouldn't be able to use their computers.
The other members of the ISO 639 family are mostly of interest to linguists. Unless you expect Jesus Christ himself (ISO 639-2 alpha-3 code arc) to visit your website, or maybe Klingons (tlh), ISO 639-1 has more languages than you can ever hope to support.
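To see how these codes actually reach your application, here is a minimal sketch of reading an Accept-Language header by hand (for illustration only; the header value is made up, and in practice your web framework already parses this for you):

def parse_accept_language(header):
    # Parse e.g. 'de-DE,de;q=0.9,en;q=0.7' into tags sorted by preference.
    langs = []
    for part in header.split(","):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        q = 1.0
        for param in pieces[1:]:
            if param.strip().startswith("q="):
                try:
                    q = float(param.strip()[2:])
                except ValueError:
                    pass
        if tag:
            langs.append((q, tag))
    return [tag for q, tag in sorted(langs, key=lambda x: -x[0])]

print(parse_accept_language("de-DE,de;q=0.9,en;q=0.7"))   # ['de-de', 'de', 'en']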

Internationalization of a VB 6 Application

Has anyone internationalized a VB 6 application?
Any helpful resources or tips/tricks you can offer?
Get hold of Michael Kaplan's book Internationalization With Visual Basic (maybe secondhand). It's a goldmine of useful information. I have some peeves with the editing - the index is awful and the chapter order is a bit random - but it's still excellent. There are some free sample chapters on the book's website.
If you are not already familiar with Unimess - the appalling mishmash that is VB6 Unicode support - do read Chapter 6, which is one of the free chapters. Cyberactivex.com also has a good tutorial on the subject.
Finally, do read the International Issues section in the VB6 manual. It's not exhaustive but it's worth reading.
EDIT: see this answer for a programming-language-neutral discussion of internationalisation - nearly all of it is relevant to VB6. The VB6 Format function is useful for regionally aware display of numbers, currencies, dates and times; CDbl, CDate, etc. are useful for converting strings back to the intrinsic types.

Resources