Design Question for Notification System

Design Question for Notification System - algorithm

The original post was posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: The notification system purpose is to get user notified (via email for now) when content of the site has changed or updated, or new posting is made. This could be treated as a notification system where people define a rule or keyword for 3rd party site and notification system goes out crawle 3rd party site and crate search inverted indexes. Then a new link or document show up for user defined keyword or rule (more explanation at bottom regarding use case),
For clarified used case: Let suppose I am craigslist user and looking for used vehicle. I define a rule “Honda accord”, “year “ 1996 and price range from “$2000 to $3000”.
For above use case to work what is best approach and how can I leverage on open source technology such as Apache Lucent, Apache Solr and Apache Nutch, and Apache Hadoop to solve this use case.
You can thing of building search engine and with rule and keyword notification system. I just need some pointers and help on how to integrate these open source package to solve use case ?
Any help and pointer will be appreciated. We need three important components are :
1) Web Crawler
2) Index Creator
3) Rule or keyword Mather
Any help will be greatly appreciated. I was referring this wiki which integrates Nutch and Solr together for above purpose http://wiki.apache.org/nutch/RunningNutchAndSolr

Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say less than 100 sites requiring spidering per day), you could probably get along with one of the many open-source web spiders including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
For simple filtering, regular expressions would suffice but for more complex tasks requiring knowledge of HTML layout (extract the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
While any database technology can be used to store results extracted from retrieved documents, using an engine optimized for text like Apache SOLR will allow you to easily expand your search criteria as your needs dictate. Since SOLR supports the attachment of and search for metadata associated with each document, it would be a good choice. You'll also need to provide custom software here to automate this step.
Once you've selected a list of candidate results from SOLR, any scripting language could be used to template them into one or more emails and would also inject them into your mail transport agent (MTA). This also requires custom software to automate this process (and if required, to inject user-specific data into each message).

You should probably look at Google's Custom Search API also before diving into crawling the web yourself. This way, google can help you with returning keyword based search results, which you could later filter in your application based on your additional algorithms/rules etc, and make the whole thing work.

Related

Hosted full-text search options - IndexTank vs Solr vs Lucene

I am building an app using Ruby on Rails on Heroku and am confused about which full-text search option I should proceed with. A few things I care about:
Real-Time search: I am building a dynamic user-generated website.
Understands Rails Models: I would like to restrict search results based on who the user is (so, I don't really want "just" a site-wide search)
Additionally, something that is easy to configure on Heroku with Rails would be a bonus.
Heroku currently provides three options for full-text search: FlyingSphinx, Searchify IndexTank and WebSolr. Can anyone outline the pro's and cons of each.
Based on my research, it seems that a lot of people have been happy with IndexTank. In particular, this blog post by Gautam Rege briefly outlines his experience with the three options and how he prefers IndexTank.
However, after LinkedIn's acquisition of IndexTank, some key components of IndexTank were open-sourced and the IndexTank service discontinued. It seems that Searchify is one of the first few (if not, currently, the only) replacement for IndexTank. Does anyone have any experience using this? How does Searchify compare to IndexTank and the other two options - WebSolr and FlyingSphinx?

I'll address your question with regards to Searchify/IndexTank:
Searchify has true real-time indexing. The millisecond you add a document, it becomes searchable. No need to commit or reindex.
There is a Ruby client library for Searchify, here are the docs & download links: http://www.searchify.com/documentation/ruby-client
There is also a nice 3rd party client by kidpollo called Tanker that some Ruby folks prefer: https://github.com/kidpollo/tanker

Modern reporting solutions for distributed data systems

We've built a SAAS solution, which has a Frontend in PHP/MySQL. The solution uses our in-house "Backend" API to manage user transactions (financial-ish type of stuff). So basically, some of our data is in the "Frontend" database, while all transactional data is in the "Backend" database.
When it comes to reporting, the Frontend requests transactional reports from the Backend, augments it with Frontend data (user attributes, etc), and draws the report. Usually it's slow and cumbersome to create a new report, and they lack robust features like sorting & filtering. This is partially because there is no single data-source for all the info. Also, we are constantly being asked to provide "adhoc" reporting capabilities - the type of thing that is complex, and has the potential to bring a server to its knees if you aren't careful.
I think we're at the point where we need to invest in a Reporting system, which would be responsible for combining data dumps from Frontend/Backend, and would allow a non-developer to create new reports. One thing that would be important to us is to provide as seamless of an interface as possible to the reports via our Frontend. That might mean the Reporting system exposes web widgets, or perhaps has a web interface that can be accessed with SSO between our system and the Reporting system. In a nutshell, we aren't looking for a dinosaur, we need something modern. Hosted solutions are preferred, but we'd consider something we need to run ourselves. Looking for advice. Thanks!
EDIT: A hosted solution might not work for us. We are located in Canada, and many customers have policies about having data reside in the US (Patriot Act).

Have a look at myDBR reporting solution. Reports are built using stored procedures, so anyone familiar with SQL will be able to create reports. There is also a built in wizard to get you started quickly. It is also very easy to link reports to each other allowing for easy drill-down style reports.
The solution is very reasonably priced at 129 EUR (~ 170 USD) and can be installed in minutes on any standard web server (PHP being to only requirement).
myDBR can be easily integrated into your existing web-pages via the built-in SSO and styled via CSS to match your sites overall look and feel.

Performance logging/monitoring API/product

I'm not sure how to categorize this question, so let me just explain what I would like and hopefully it will make sense.
I'm after a product (with an API) which I can send different numbers to with tags, and it will take care of all the monitoring/logging stuff.
So for example, say I have a program that downloads a file from a website every 10 seconds. I would like to monitor how long each of these downloads is taking. It is quite easy in my application to time how long it takes. I would now like to send this number and tag (e.g., tag='download time', value = '1.234') to a 3rd party product. The 3rd party product will now store this value/tag for me. The product will have a website I can go to, and configure a bunch of things. So in this example, I could setup an alert like "if 'download time' > 5 send me an email". I could also visit a website, and view a graph of the logged values and maybe some random statistics (e.g., how often the value has been in the warning/error zone).
I think that's about it. Sure it wouldn't be too hard to do this myself, but I'm no web designer and it'd end up looking pretty ugly. The more user friendly this kind of product is the more willing users will be to look at the data and actually monitor stuff.
Does such a service exist?
EDIT: Products similar to this: http://dashboard.kpilibrary.com/. This is pretty much exactly what I was after, but am still searching around.

There are many monitoring tools out there. Nagios or RHQ (http://rhq-project.org/) come to mind. Most of the tools work a little different: rather than throwing stuff at them, they have plugins that actively go out and do something to do the measuring. In your example, the plugin would download the file and then report the measurement data to the central server, which can then show you graphs or run alerts on it.

On Windows, you can use this:
http://technet.microsoft.com/en-us/library/cc771692%28WS.10%29.aspx
(Windows Performance Monitor)
It pretty much does what you are looking for:
Passively collects performance data (E.g. CPU Usage)
Can be fed App specific performance metrics (E.g. download time)
Can alert you on various thresholds
Has a reporting interface for analyzing metrics
EDIT : http://technet.microsoft.com/en-us/library/cc749249.aspx , more documentation on this.

This answer is specific to Windows.
If you are looking to analyze events from various systems and you also what the opportunity to create your own events you should consider ETW.
The ETW system allows you to consume data events from any number of sub-systems. You can look at an exhaustive list of built in providers by running the following command:
logman query providers
The beauty of ETW is that you also have the opportunity to create your own providers and push your own data into the resulting report. This is a high-performance logging mechanism and is used by Windows itself for many performance investigations.
The resulting report will be an ETL file. This is a standard file that can be viewed using xPerf, ships with Windows SDK, or the build-in ETL analyzer, tracerpt.exe.

Document Conversion Realtime - Implementation Questions

We have a need to convert MS Office documents to PDF real time when someone provides a link to a document after checking whether user is authorized to view the document or not for an intranet portal. We also need to cache the documents based on the last modified date of the document, we should not convert the document again if another user requests the same document and the document content is not modified since it was last converted.
I have some basic questions on how we can implement this - and would like to check if anyone has previous experience or thoughts how they see this implemented?
For example, if we choose J2EE as the technology, and choose one of the open source Java libraries for PDF conversion; I have following questions.
If there is a 100 MB document - we would need to download entire document from the system where the document is hosted before we start converting the document. This approach may have major concerns on the response time given that this needs to be real time viewing. Is there an option to read first page of a document without downloading entire document so that we can convert document page by page?
How can we cache a document? I do not think we can either store the document in server or database. The reason is this could lead to anyone who is having access to either database or server - can access document content. Any thoughts?
Or do you suggest any out of the box product to do this instead of custom development?
Thanks

I work for a company that creates a product that does exactly what you are trying to do using Java / .NET Web service calls, so let me see if I can answer your questions without bias.
The whole document will need to be downloaded as it will need to be interpreted before PDF Conversion (e.g. for page numbering purposes) can take place. I am sure you are just giving an example, but 100MB is very large for an MS-Office document, although we do see it from time to time.
You can implement caching based on your exact security requirements. If you don't want to store the converted files in a (secured) DB or file system then perhaps you want to store them on a different server behind a firewall. Depending on the number of documents and size you anticipate you may want to cache them in memory. I am sure there are many J2EE caching libraries available, I know there are plenty in .NET. Just keep the most frequently requested documents in your cache.
Depending on your budget you may go for an out of the box product (hint hint :-). I know there are free libraries available for Java that leverage Open Office, but you get the same formatting limitations when opening MS-Office Files in OO. Be careful when trying to do your own MS-Office integration / automation. It is possible to make it reliable and scalable (we did), but it takes a long time and a lot of work.
I hope this helps.

Reporting vs. Coding - thoughts?

Recently I had a project in which I had to get some data from particular software system to a portlet. The software used a database, and I spent a fair bit of time modeling the data I wanted and then creating a web service so that my portlet could grab the information.
Then it suddenly struck me that I was wasting my time. I grabbed BIRT, tossed it into a portlet, and then just wrote some reports that directly grabbed the necessary data from the database. I was done in an afternoon.
I understand that reporting is a one way street, but this got me thinking. Reporting tools can be very effective for creating reports (duh) from your actual data, but when you're doing this you're bypassing your model which except in simple cases is not a direct representation of your data as it exists in your database.
If you're writing a data-intensive application and require the ability to perform non-trivial reporting, do you bypass your application and use something like BIRT or Crystal Reports? How do you manage these tools as part of your overall process? Do you consider the reports you write as being part of your application and treat them as such? A report is a view and a model and a controller (if you will) all in one big mess, how do you deal with and interpret and plan for that?
Revised question: it's possible and even common that a report will perform some business calculations that in a perfect world you would like to have contained in your application. This can lead to a mismatch of information given back to the user. On the other hand, reporting tools make it so easy to gather and display information that it's hard to take a purist's approach and do everything from within the application. Are there any good techniques for ensuring that the data in your reports matches the data that you might be showing in the regular GUI?

I see reporting as simply another view on the data, not a view/model/controller in one (well, maybe a view and controller in one).
We have our reports (built in sql 2008 reporting services) consume a service in our application layer to get data (keeping with our standard, that data access is in a repository). These functions could do a simple query or handle very complex processing that would be a nightmare in your reporting evironment or a stored procedure. In practice, we find this takes no longer than coding up some one-off stored procedure that will, as your system grows and grows, become a nightmare to maintain.
Treating reporting as simply a one-off or not integrating into your application design is a huge mistake.

Reporting is crucial. Reporting is mostly crucial to share values collected in one system to external users, e.g. users not directly using the system (eg management for sales figures). So reporting is a lot more than just displaying facts and figures and is something central to almost every system that drives a commercial.
At least the more advanced systems allow you to enhance them: with your own reusable "controls". Even a way back can be implemented - if you just use the correct plugins. Once I wrote a system to send emails out of a report, because the system did not allow for change. It worked - though it was not meant to be used that way ;)
Reports make a good part of the application, and you gain a lot freedom if you make reports changeable for your customers. Sometimes you come up with more possibilities than you thought of when you built the system in the first place.
So yes, for me reporting is part of the system.

Reports are part of your app but because they are generally something a user will have strong ideas about than, say, your data capture UI, I'd sacrifice purity for convenience/speed of delivery and get back to "real" coding... :-)
As soon as you've done a report, users want another one or change the colour or optional grouping or more filtering or... something that takes you away from whizzier stuff... so I don't bust a gut maintaining purity.

This is a fine line indeed. You don't want to spend too much time building reports (that users want you to change all the time anyway) but you don't want to duplicate logic by putting business logic into your reports! With our reporting products at Data Dynamimcs I think we have reached a happy medium between these two tradeoffs.
By using the ObjectDataProvider (see links below for more info) you can bind the report directly to business objects (plain old objects) so you don't have to bypass your business layer for getting data. At the same time we provide a way to reference and use functions from other libraries in your report. This way if you have some code configured already to do some business logic calculations you can reuse those functions directly within your report. You can see an example of this in the links below too.
Binding to Objects for your Data (see "Object Provider" section): http://www.datadynamics.com/Help/ddReports/ddrconDataSetAndObjectDataSource.html
Adding Custom Code to your reports Walkthrough: http://www.datadynamics.com/Help/ddReports/ddrwlkCustomCode.html
Using Custom Assemblies (referencing shared libraries/dlls from your report): http://www.datadynamics.com/Help/ddReports/ddrconCustomCode.html, and http://www.datadynamics.com/Help/ddReports/ddrtskCreatingAnInstanceMethod.html
Scott Willeke
Data Dynamics / GrapeCity

The way I've always worked with reports is to consider part reports as part of the code-base, and stored in the source along with the application. In some contexts, reports are more important than the application, in that management makes business decisions off of report data, having the wrong information can cause them to cancel a product line, cancel a campaign, or fire a sales person. Obviously, this depends highly on your management and your application.
Regarding keeping your model consistent, this is a bit trickier question. One way to ensure consistent model between reports and your application is to use stored procedures (or views) to retrieve data, depending on your application's architecture.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio