Is there an OSM XAPI tag/value list? - filter

I'm new to OSM querying, but I would like to query vector data for a large area, so I need to limit the results by filtering the request on tags.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tag/values when querying for a way, but I don't know which tags/values exist. Is there a list of the most common ones?

You are approaching your problem from the wrong direction. The number of different tags is almost unlimited: according to taginfo, there are currently 75,380,856 different tags. I'm pretty sure you are not interested in most of them, and likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a generic overview, take a look at the map features page. Are you interested in streets? Then look at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.
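For illustration, here is a minimal sketch of an Overpass API query from Python, assuming the requests library and the public endpoint at overpass-api.de; the highway=residential filter and the bounding box values are only placeholders:

    # Fetch all ways tagged highway=residential inside a bounding box
    # given as (south, west, north, east) -- example values only.
    import requests

    query = """
    [out:json];
    way[highway=residential](50.6,7.0,50.8,7.3);
    out geom;
    """

    response = requests.post("https://overpass-api.de/api/interpreter",
                             data={"data": query})
    response.raise_for_status()

    for way in response.json()["elements"]:
        print(way["id"], way.get("tags", {}))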

Related

Applying different parsefilters to each domain in the same topology

I am trying to crawl different websites (e-commerce websites) and extract specific information from the pages of each website (i.e. product price, quantity, date of publication, etc.).
My question is: how to configure the parsing since each website has a different HTML layout which means I need different Xpaths for the same item depending on the website? Can we add multiple parser bolts in the topology for each website? If yes, how can we assign different parsefilters.json files to each parser bolt?
You need #586. At the moment there is no way to do it other than putting all your XPath expressions, regardless of the site you want to use them on, into parsefilters.json.
You can't assign different parsefilters.json to the various instances of a bolt.
UPDATE: you could, however, have multiple XpathFilters sections within parsefilters.json. Each could cover a specific source, but there is currently no way of constraining which source a parse filter gets applied to. You could extend XPathFilter so that it takes some extra config, e.g. a regular expression a URL must match in order for the filter to be applied. That would work quite nicely, I think.
I've recently added JsoupFilters, which will be in the next release. These should be useful for your use case, but they still don't solve the issue that you need an implementation of the filter that organizes the resources per host. It shouldn't be too hard to implement, taking the URL filter one as an example, and it would also make a very nice contribution to the project.

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of the HTML markup of a page.
I need to determine whether there is an article on the page and, if there is, parse the text of its title, body, and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether the page is an article, and parsing only the needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome mobile does: when a page is not optimised for mobile, it proposes to show the page in an adaptive mode (just the title and main content).
If you are using Python, some libraries to look at are:
scrapy, which crawls pages and can extract some of the results, and
BeautifulSoup, which is more geared towards the extraction part itself.
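As a rough illustration of the extraction side, here is a minimal BeautifulSoup sketch; it assumes the page marks its content with <article> and <h1> tags, which many sites do not, and the URL is a placeholder:

    # Pull the title and main text from a page, assuming <h1>/<article>
    # markers; real sites vary, which is exactly the problem discussed below.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/some-article").text
    soup = BeautifulSoup(html, "html.parser")

    title = soup.find("h1")
    article = soup.find("article")

    print(title.get_text(strip=True) if title else "no <h1> found")
    print(article.get_text(" ", strip=True)[:500] if article else "no <article> found")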
It is possible to request a particular version of a website (e.g. for Chrome, Safari, mobile, old-school systems) by setting a custom header in your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in scrapy here.
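For example, here is a hedged sketch of sending a mobile User-Agent from a scrapy spider, so the server returns its simpler mobile markup; the UA string and URL are placeholders:

    import scrapy

    class MobileSpider(scrapy.Spider):
        name = "mobile"

        def start_requests(self):
            # A mobile User-Agent header nudges many sites to serve the
            # leaner mobile layout; the string here is only an example.
            yield scrapy.Request(
                "https://example.com/some-article",
                headers={"User-Agent": "Mozilla/5.0 (Linux; Android 10) Mobile"},
                callback=self.parse,
            )

        def parse(self, response):
            yield {"title": response.css("h1::text").get()}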
I do not know of any more specialised tools. Your tasks are more analytical, and are typically not performed with models that estimate, e.g., what content is where on a webpage. This might be an interesting research direction though: to see if you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is to say that creating a single scraper that works for any website (containing your article type) is not usually possible. People create websites differently, however they see fit, which means they also change them. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then, if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far to look back into the past; this is stored internally within its memory. In your case, you might be looking for the start and end HTML tags of an article, in order to extract just that part. These tags could be thousands of words apart, which is something a standard LSTM might not be able to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. E.g., you could make it a "question-answer" problem by saying: I have this HTML, where is the article content? If that sounds OK for your use case, have a look here for some model-based approaches.
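As one hedged illustration of that question-answer framing, using the HuggingFace transformers library (just one possible toolkit) with its default extractive QA model:

    # Extractive QA over page text; plain text extracted from the HTML
    # beforehand usually works better as context than raw markup.
    from transformers import pipeline

    qa = pipeline("question-answering")

    page_text = ("Home | About | Login. How to brew coffee: grind the beans, "
                 "add hot water, wait four minutes. Copyright 2024. Privacy policy.")
    result = qa(question="How do you brew coffee?", context=page_text)
    print(result["answer"], result["score"])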

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with Elasticsearch yet, but from what I have read I think it suits most of my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into the search engine and offer a front-end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages of one host often contain only a little unique content; a large part of each page is common. Footer, header, menu, etc. are the same on every page.
Does Elasticsearch have some built-in intelligence that can filter out the common parts and only search the real content?
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy you need to adopt here is to create a unique key per document. Taking a checksum using SHA-1 or a similar algorithm will do the job of generating the unique key. Make this the document ID, so that only one copy of a page exists at any point in time. Then use the _create API to index if you don't want new duplicates to be indexed (more efficient); if instead you want the new version to become the document, use normal indexing.
In case you need to modify the original document when a duplicate is discovered, use upsert.
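A sketch of that approach with the official elasticsearch Python client (assuming the 8.x client; the index and field names are placeholders):

    import hashlib
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    page = {"url": "https://example.com/a", "content": "scraped page text"}

    # The SHA-1 of the content becomes the document ID, so identical pages
    # collapse onto a single document.
    doc_id = hashlib.sha1(page["content"].encode("utf-8")).hexdigest()

    # op_type="create" fails with a 409 conflict instead of indexing a
    # duplicate; a plain es.index() call would overwrite instead.
    es.index(index="pages", id=doc_id, document=page, op_type="create")

    # To modify the original document when a duplicate shows up, an upsert
    # does the job:
    # es.update(index="pages", id=doc_id, doc=page, doc_as_upsert=True)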
I have explained a great deal of this in this blog.

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOC files) I want to classify, so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the documents) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the static DocumentCategorizerME.train() function to train a model on a collection of documents, where each document requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
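OpenNLP's doccat API is Java; purely to illustrate the same train-then-classify flow in Python, here is a swapped-in sketch using scikit-learn (not the OpenNLP API), with toy training data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # A small manually tagged sample per category, as described above.
    texts = ["invoice total due payment", "meeting notes agenda action items"]
    tags = ["finance", "meetings"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, tags)

    print(model.predict(["please see the attached payment invoice"]))
    # The analogue of writing the trained model to a file:
    # from joblib import dump; dump(model, "doccat.joblib")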
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.

Inline data representation

I would like to represent data in a way that gives an overview but allows the user to drill down inline: so if you had a grouping of, say, 6 objects, the user could expand the data and it would show the 6 objects immediately below it, before any more high-level data.
It would appear that the MSHFlexGrid gives this ability, but I can't find any information about actually using it, or what its limitations are (can you have a differing number of fields, can they have different spacing, what about column headers, indentation at the start, etc.).
I found this site, but the images are broken (in IE8 and FF3.5). Google searches show people just using the flat data representation, but nothing using the hierarchical properties. Does anyone know any good tutorials or forums with a good discussion of the pitfalls?
Due to the lack of information about using it, I am thinking of coding my own version, but if anyone has done work in this area I haven't found it; I would have thought it would be a natural wish for data representation. If someone has coded a version of this (in any language) then I wouldn't mind reading about it; maybe my idea of how to do it wouldn't be the best way.
You might want to check out vbAccelerator. He has a Multi-Column Treeview control that sounds like what you may be looking for. He gives you the source and has some pretty decent samples.
The MSHFlexGrid reference pages and the "using the MSHFlexGrid" topic in the Visual Basic manual?
Sorry if you've already looked at these!
