Skip common/duplicate parts while indexing web pages with ElasticSearch - elasticsearch

I don't have any experience with ElasticSearch yet, but from what I read I think it suits most my needs. I have a web scraper which scrapes pages of certain domains.
I want to feed these pages into SE and offer a front end interface to search the scraped content. I'm building some sort of vertical search engine.
But as we all know, web pages of one host often only contain a little bit of unique content, a great part of the pages are common. Footer, header, menu etc. are the same on every page.
Does ElasticSearch have some build in intelligence that can filter out the common parts and only search the real content??

It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.

The strategy you need to adopt here would be to create a unique key per document. Taking checksum using sha1 or similar algorithm will do the job for getting the unique key. Make this the document ID so that only one page occurs at all point of time. Again use _create API to index if you dont want new duplicates to be indexed ( More efficient ) , and in case you want the new ones to be the document use normal indexing.
In case you need to modify the orginal document in case of disocvery of duplicate document , use upser.
I have explained a great deal of this in this blog.

Related

When is it appropriate to use tensor-based DBs (e.g Marqo) instead of keyword based (e.g Elasticsearch)?

I'm implementing end user search for an e-commerce website. The catalogue contains images, text, prices of different items. LLMs are all the hype at the moment but I'm not sure how well proven the performance is in comparison to keyword based for e-commerce.
I've tried tensor based search and it appears to perform well but its hard to benchmark search against relevance so I'm not sure of putting it into production.
What frameworks are people using to determine when you use tensor/vector based search vs keyword based?

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but would like to query vector data for a large area. Thus I need to limit the results I would like to get by tagging the request.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tag/values when querying for a way. Though I don't know which tags/values exist. Is there a list listing the most common of them?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo there are currently 75 380 856 different tags. I'm pretty sure you are not interested in most of them. Likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a generic overview take a look at the map features. Are you interested in streets? Then visit at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.

Kofax Seperate Main Invoice from Supporting Document without using Seperator sheet

When a batch gets created documents should get separated automatically without using separator sheet or Barcode separator.
How can I classify documents for Invoice and supporting document.
In our project we get many invoices with supporting document so the scanning person has to insert the separator sheets manually, so to avoid this we want to automatically classify the supporting documents.
In general the concept would be that you would enable separation in the project and then train your classes with examples to be used for the layout or content classifiers.
However, as I'm sure you've seen, the obstacle with invoices is that they are different enough between vendors that it would not reliably classify all to an Invoice class. Similarly with "Supporting Documents" which are likely to be very different from each other, so unfortunately there isn't a completely easy answer without separator sheets (or barcode stickers affixed to supporting docs).
What you might want to do is write code in the one of the separation events like Document_AfterSeparate event. Despite the name, the document has not yet been split at this point, but the classifiers have run. See Scripting Help topic "Server Script Events Sequence > Document Separation > Standard Document Separation" for more detail. Setting the SplitPage property on the CDocPage (pXDoc.CDoc.Pages.ItemByIndex(lPage).SplitPage) will allow you to use your own logic to determine which pages to separate.
For example if you know that you will always have single page invoices, you can split on the first page and classify accordingly. Or you can try to search for something that indicates the end of the invoice like "Total" or other characteristics. There is an example of how you can use locators to help separation in the Scripting Help topic "Script Samples > Use Locator Results for Standard Document Separation". The example uses a Barcode Locator, but the same concept works if you wanted to try it with a Format Locator or anything else.
Without Separator sheets you will need a smart classification software like Kofax Transformation Module (KTM). Its kind of expensive. you will need to verify the cost saving and ROI.

How does Market Samurai and Long Tail Pro handle retrieving the top 10 Google search results for a keyword?

I'm curious to know how Market Samurai, Long Tail Pro and other software handle retrieving the top 10 Google search results and not running into limits. It appears that these software packages use the users own Google account. Google Custom Search limits users to 100 queries per day (the free limit) but people tend to do keyword research on hundreds or even thousands of keywords per day and don't pay any additional amounts to Google.
Are they paying extra for this service, are they using a different API (perhaps the Adwords API?) or are they scraping the Google search results page (violation of TOS)? Really would like to know! Thanks.
i have done this in one of my project (in java).
this is very simple, in java there is one library call JSoup by using this library you can send get request to google, for example:
https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=<your url encoded search term>
this will return you an HTML code of google search result with your own term.
using Jsoup u can find specific HTML tag with specific class or id. this concept helps you to extract url link, title and description from google search result.
for working example check here, in that example you can extract google serach result links with custom search term.
i hope this will help you.

Classify documents with tags

I have a huge amount of documents (mainly pdfs and doc's) I want to classify, so I can search over them according to certain tags. These tags could either be of my own (I put the tags to the document) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train a collection of documents, where each requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.

Resources