Does anyone know why Google recommends Microdata over Microformats and RDFa?

I have researched a lot about the mark-up formats Microdata, Microformats and RDFa. Google recommends Microdata over the other two mark-ups and I want to know why. Reading a ton of documents and studying these mark-ups has left me more clueless than before I started. Does anyone know why Google recommends this format? Is it something to do with HTML5?
Here is a link to the site where I got the information from Google.
Thank you.

Because of the many (syntax) errors found in RDFa usage. One of the leaders of Schema.org explains the reason they chose Microdata:
(...) the error rate (i.e., webmasters marking up their pages to say X when they really meant to say Y) was about 3 times as much [with RDFa, ed.] as it was for other formats (which include microformats, sitemaps, Google shopping feeds, etc.). (...) More than 40% of the errors had to do with the confusion between rel and property. (...) We really don't want to get into whether there is a distinction between rel and property at a theoretical level. We also understand that there are some corner cases which lead the authors of RDFa to make this distinction. But the bottom line remains that as long as the error rate in RDFa usage does not go down dramatically, it is not a viable option for us. (...)
Source

I think this is simply because schema.org is Google's own initiative (they created it together with Microsoft and Yahoo). See http://en.wikipedia.org/wiki/Schema.org.

Here is another perspective: http://manu.sporny.org/2012/microdata-cr/
Interesting how we all used to loathe Microsoft for perverting standards. Guess Google is now stepping into their shoes.

Related

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of the HTML markup of a page.
I need to determine whether there is an article on the page and, if there is, parse the title, body and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether the page is an article, and parsing only the needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome mobile does: when a page is not optimised for mobile, it proposes to show the page in an adaptive mode (just the title and main content).
If you are using Python, some libraries to look at are:
Scrapy, which crawls websites and can extract some of the results, and
BeautifulSoup, which is more geared towards the extraction part itself (see the sketch below).
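As a rough illustration of the extraction part, here is a minimal sketch using requests and BeautifulSoup. The og:title and <article> heuristics are common conventions rather than guarantees, so per-site fallbacks will be needed:

import requests
from bs4 import BeautifulSoup

def extract_article(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    # Title: prefer Open Graph metadata, fall back to the <title> element.
    og = soup.find("meta", attrs={"property": "og:title"})
    title = og.get("content") if og else (soup.title.string if soup.title else None)
    # Body: many sites wrap the main text in an <article> element.
    article = soup.find("article")
    body = article.get_text(" ", strip=True) if article else None
    # Tags: often exposed as rel="tag" links ("~=" matches one token of rel).
    tags = [a.get_text(strip=True) for a in soup.select('a[rel~="tag"]')]
    return {"title": title, "body": body, "tags": tags}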
It is possible to request a particular version of a website (e.g. for Chrome, Safari, mobile, old-school systems) by sending custom headers from your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in Scrapy here.
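For example, here is a small sketch of sending a mobile User-Agent from a Scrapy spider so the server returns its mobile (often simpler) markup; the UA string and URL are placeholders, not recommendations:

import scrapy

MOBILE_UA = ("Mozilla/5.0 (Linux; Android 12) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/110.0 Mobile Safari/537.36")

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/some-article"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # Override the request headers so the site serves its mobile view.
            yield scrapy.Request(url, headers={"User-Agent": MOBILE_UA})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}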
I do not know of any more specialised tools. Your tasks are more analytical and are typically not performed with models that estimate, e.g., what content is where on a webpage. This might be an interesting research direction, though: to see if you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is to say that creating a single scraper that works for any website (containing your article type) is not usually possible. People create websites differently, however they see fit, which means they also change them. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model: a typical LSTM, for example, carries information forward in its internal memory, but that memory is limited in practice. In your case, you might be looking for the start and end HTML tags of an article, to then extract just that part. These tags could be thousands of words apart, which is something a standard LSTM might not be fit to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. E.g., you could make it a "question-answer" problem by saying: I have this HTML, where is the article content? If that sounds OK for your use case, have a look here for some model-based approaches.
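To make that question-answer framing concrete, here is a hedged sketch using the Hugging Face transformers QA pipeline. Feeding it raw page text and this question wording is my own assumption, not something from the linked resources, and extractive QA returns a short span, so long pages would need chunking:

from transformers import pipeline

# Loads a default extractive question-answering model.
qa = pipeline("question-answering")

page_text = "..."  # visible text pulled out of the HTML beforehand
result = qa(question="What is the article content?", context=page_text)
print(result["answer"], result["score"])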

Microdata for a dictionary: can I use Yandex?

I'd like to use Microdata/Microformats/etc. for the part of my website that is an online dictionary. Basically I just want to tag words and definitions to help search engines grab the most important data on every page belonging to the dictionary, and maybe have Google use them as "rich snippets" on results pages.
The main problem is that it's hard to find a dedicated vocabulary for words and definitions (no problem for recipes, movies and hotels, though), and I'm not sure whether I should use the http://schema.org/Article tree for my lexicographic work. (To my mind, it makes sense to tag something only when it's specific enough.)
I have found something interesting at Yandex, for words and encyclopedias, and I want to ask what to do with it. See here:
https://yandex.ru/support/webmaster/microdata/what-is-microdata.xml?lang=en
https://yandex.com/support/webmaster/microdata/term-definition-markup.xml
It looks like it is very close to my request. But I'm sorry, I don't know what Yandex is... will it work with Google?
I'm asking here whether that page from Yandex is a working model, whether it is still in use, and what the pros and cons are. Will Google be able to use the specific vocabulary from Yandex and understand my Yandex-tagged data? Is it worth using that vocabulary for an online dictionary, or have I missed something else of better use?
(http://webmaster.yandex.ru/vocabularies/term-def.xml, which should be the vocabulary URL, gives me a 404.)
One more question, please: am I allowed to write (duplicate) the most important data in the header, something like the following? (I believe I am, because Google's microdata testing tool proves able to extract the data from that code.)
<html itemscope itemtype="http://webmaster.yandex.ru/vocabularies/term-def.xml">
<meta itemprop="term" content="My term" />
<meta itemprop="definition" content="My definition" />
Just to mention that I was interested in, though not happy with, these related discussions:
https://webmasters.stackexchange.com/questions/55073/what-meta-tag-or-structured-data-should-i-use-for-a-dictionary-web-application
schema.org and an online dictionary
Yandex is Russia's version of Google, and typically they both recognize and honor each other's search engine result implementations.
The articles you are referencing are incredibly outdated; I recommend seeking out fresher sources, preferably ones where the term being defined uses the proper HTML element.
Here's the Yandex URL that is 404ing; the Wayback Machine is your friend!
Back to fresher documentation/resources: in this case the correct element, as of 2016-10-05, is the <dfn> element. I know you want added semantics, but semantic HTML is the proper place to start. I'd follow that up by marking the entire dictionary up as a definition list, placing each term, wrapped in the <dfn> element, into a <dt>, and the definitions of that term into the corresponding <dd>s.
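A minimal sketch of that structure (the id value is made up for the example):

<dl>
  <dt><dfn id="example-term">example term</dfn></dt>
  <dd>First sense of the term.</dd>
  <dd>Second sense of the term.</dd>
</dl>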
I wouldn't waste time trying to find the perfect ontology here; implement the rel="tag" Microformat on all of the definitions, and you can always come back and add a more desired one later.
I've written a blog post about this, but a much more valuable resource is HTML5 Doctor's Glossary implementation. More importantly, view the source: view-source:http://html5doctor.com/element-index/ (why Stack Overflow doesn't recognize the 'view-source' scheme is beyond me).
More References/Resources:
Microformats Definition Examples has some very interesting ideas/code snippets
Utilizing the Underused but Semantically Awesome Definition List (written prior to HTML5's redefinition of <dl>, but still relevant)

Joomla model associations

I have been working with CakePHP, which has model associations such as hasMany, belongsTo, etc.
Now I am working with Joomla and need this type of functionality.
All I need is a pointer in the right direction, such as a link describing it, as I cannot seem to find anything about it on Google.
take care,
lee
There is a world of difference between the two, and it goes far beyond a simple answer on SO. The API docs are quite extensive, and there are many decent examples. Developing a Model-View-Controller (MVC) Component for Joomla! 2.5 will likely also give you some good reference. Additionally, with 9000+ open source extensions available, it's usually quite easy to find one that comes close to what you want to do, in order to have some sample code to work from.

Algorithm to find out if a website is a blog?

This is a creative one :-)
I'll be receiving a list of hundreds of new URLs regularly and want to find out whether they link to a blog or not; between 80% and 95% accuracy would be sufficient.
Obviously I need to analyze the HTML of the page - but how exactly would you approach this (e.g. meta tags, structural analysis, pattern matching, machine learning ...)?
I would look at the generator <meta> tag for known blog editors. For example, here's how it looks for WordPress:
<meta name="generator" content="WordPress.com" />
Building on Darin's solution, I would look for the generator <meta> tag for known blog editors and combine it with a lookup table of common hosting sites, e.g. WordPress.com, Blogspot.com, LiveJournal.com, and so forth (see the sketch below). That should give you 80-95% in the near term, though it won't be robust enough for an ongoing process over an extended period of time.
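A rough sketch of those two checks, assuming requests and beautifulsoup4 are available; both lookup tables are illustrative and far from exhaustive:

from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

KNOWN_HOSTS = ("wordpress.com", "blogspot.com", "livejournal.com")
KNOWN_GENERATORS = ("wordpress", "blogger", "typepad", "movable type")

def looks_like_blog(url):
    # Check 1: is it hosted on a well-known blogging service?
    host = urlparse(url).netloc.lower()
    if any(host.endswith(h) for h in KNOWN_HOSTS):
        return True
    # Check 2: does the generator <meta> tag name a known blog editor?
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    gen = soup.find("meta", attrs={"name": "generator"})
    content = (gen.get("content") or "").lower() if gen else ""
    return any(g in content for g in KNOWN_GENERATORS)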
An extended solution is much harder, given the amorphous definition of the term "blog". In which case, you'll want to consider breaking the list down into its hosting site and defining characteristics and create hard and fast rules on what constitutes a blog:
Is it hosted by a blogging service provider?
Is it listed in a blog aggregator, such as Technorati?
Does it include blog-like services, such as user-generated articles, tags, and the ability to comment?
Does it provide meta information that I can use to easily identify it as a blog?
Does it otherwise identify itself as a blog, via the inclusion of the term "blog" or some other criteria?
I can easily see a neural network constructed to determine whether a page is a blog or not, but this severely oversteps the bounds of your requirements. I'd say start simple, then extend your solution relative to the proposed lifetime of your system.
The above suggestions are good and will probably work if you're aiming for 80-90% accuracy.
I would go one step further and look for an .xml RSS feed advertised in either a <meta> tag or a <link>. Then check the feed to see if there are any comment tags, since there are feeds for other purposes too. I would omit this check for certain blog platforms that don't give you a feed, such as Tumblr.
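A rough sketch of that feed check, assuming requests, beautifulsoup4, and lxml are installed; it only looks for RSS 2.0's per-item <comments> element and ignores Atom, so treat it as a starting point:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def has_commentable_feed(page_url):
    page = BeautifulSoup(requests.get(page_url, timeout=10).text, "html.parser")
    # Feed auto-discovery: blogs usually advertise their feed in the <head>.
    link = page.find("link", attrs={"type": "application/rss+xml"})
    if link is None or not link.get("href"):
        return False
    feed_xml = requests.get(urljoin(page_url, link["href"]), timeout=10).text
    feed = BeautifulSoup(feed_xml, "xml")  # the "xml" parser needs lxml
    # In RSS 2.0, a per-item <comments> element points at the post's comment
    # page, which is a strong hint that the site is a blog.
    return feed.find("comments") is not None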

Inline data representation

I would like to represent data in a way that gives an overview but allows users to drill down in an inline fashion: so if you had a grouping of, say, 6 objects, the user could expand the data and it would show the 6 objects immediately below it, before any more high-level data.
It would appear that the MSHFlexGrid gives this ability, but I can't find any information about actually using it, or what its limitations are (can you have differing numbers of fields, can they have different spacing, what about column headers, indentation at the start, etc.).
I found this site, but the images are broken (in IE8 and FF3.5). Google searches show people just using the flat data representation, but nothing using the hierarchical properties. Does anyone know any good tutorials or forums with a good discussion of the pitfalls?
Due to the lack of information about using it, I am thinking of coding my own version, but if anyone has done work in this area I haven't found it; I would have thought it would be a natural wish for data representation. If someone has coded a version of this (in any language) then I wouldn't mind reading about it; maybe my idea of how to do it wouldn't be the best way.
You might want to check out vbAccelerator. He has a Multi-Column Treeview control that sounds like what you may be looking for. He gives you the source and has some pretty decent samples.
The MSHFlexGrid reference pages and the "using the MSHFlexGrid" topic in the Visual Basic manual?
Sorry if you've already looked at these!
