Web analytics software that can also analyze existing log archives

I am looking for a web analytics solution that can also help me analyze my existing log files. We are moving from Sawmill to another solution. I explored Google Urchin, but it has some limitations when analyzing custom existing logs.
I am currently exploring WebTrends, but I am not sure whether it supports custom log analysis.
Any ideas?

WebTrends does not support custom log formats. The best approach with WebTrends would probably be to use the Data Collector and/or the Data Collector API. Alternatively, you could put your specific information within the URL, just as the Data Collector would. Here is the link to the API: http://product.webtrends.com/dcapi/sw/index.html
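To illustrate the "put your information in the URL" idea, here is a minimal Python sketch that appends custom key/value pairs to a tracking request URL. The hostname, path, and parameter names are invented for the example and are not part of the WebTrends API; consult the Data Collector documentation linked above for the real parameter names.

```python
from urllib.parse import urlencode

# Hypothetical tracking endpoint -- replace with your real Data Collector URL.
BASE_URL = "http://statse.example.com/dcs.gif"

def build_tracking_url(page, custom_fields):
    """Build a tracking-pixel style URL that carries custom data as query parameters."""
    params = {"dcsuri": page}          # page being tracked (parameter name is illustrative)
    params.update(custom_fields)       # your own key/value pairs
    return f"{BASE_URL}?{urlencode(params)}"

print(build_tracking_url("/checkout", {"order_total": "59.90", "campaign": "spring"}))
```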

Related

How do I instrument my code for Splunk metrics?

I'm brand new to Splunk, having worked exclusively with Prometheus before. The one obvious thing I can't see from looking at the Splunk website is how, in my code, I create/expose a metric: whether I must provide an HTTP endpoint for consumption, call into some API to push values, etc. Further, I cannot see which languages Splunk provides libraries for in order to aid instrumentation; I cannot see where all this low-level stuff is documented!
Can anyone help me understand how Splunk works, particularly how it compares to Prometheus?
Usually, programs write their normal log files and Splunk ingests those files so they can be searched and data can be extracted from them.
There are other ways to get data into Splunk, though. See https://dev.splunk.com/enterprise/reference for the SDKs available in a few languages.
You could write your metrics to collectd and then send them to Splunk. See https://splunkonbigdata.com/2020/05/09/metrics-data-collection-via-collectd-part-2/
You could write your metrics directly to Splunk using their HTTP Event Collector (HEC). See https://dev.splunk.com/enterprise/docs/devtools/httpeventcollector/
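For example, if you go the HEC route, pushing a value boils down to an authenticated HTTP POST. Here is a minimal Python sketch using the requests library; the host, port, and token are placeholders for your own HEC configuration, and the event fields are just an example shape.

```python
import requests

# Placeholder values -- substitute your own Splunk host, HEC port, and token.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

def send_metric(name, value):
    """Push a single metric-style event to Splunk via the HTTP Event Collector."""
    payload = {
        "event": {"metric_name": name, "value": value},
        "sourcetype": "_json",
    }
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json=payload,
        verify=False,  # only for self-signed test instances; verify certificates in production
        timeout=5,
    )
    resp.raise_for_status()

send_metric("queue_depth", 42)
```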

What would be the advantages of using ELK for log management over a simple python logging + existing database log table combo?

Assuming I have many Python processes running on an automation server such as Jenkins, let's say I want to use Python's native logging module and, other than writing to the Jenkins console or to a log file, I want to store & centralize the logs somewhere.
I thought of using ELK for that, but then I realized that I could just as well create a dedicated log table in an existing database (I'm using Redshift), use something like Grafana for log dashboards/visualization, and save myself the trouble of deploying a new system (most of the people on my team are familiar with Redshift but not with Elasticsearch).
Although it sounds straightforward, I feel like I'm not looking at the big picture and that I would be missing some powerful capabilities that components like Logstash were written for in the first place. What would these capabilities be, and how would it be advantageous to use ELK instead of my solution?
Thank you!
I have implemented a full ELK stack in my company in the past year.
The project was huge and took a lot of time to properly implement. The advantages of using ELK and not implementing our own centralized logging solution would be:
No need to reinvent the wheel: there is already a product that does exactly this (and the installation is extremely easy).
It is battle-tested and can handle a huge volume of logs in a short time.
As your business and product grow and shift, you will need to parse more logs with different structures, which would mean schema changes in a self-built system. Logstash gives you endless possibilities for filtering and parsing those newly formatted logs (a structured-logging sketch at the end of this answer shows one way to keep that parsing trivial).
It has cluster and HA capabilities, and you can scale your logging system both vertically and horizontally.
It is very easy to maintain and change over time.
It can send output to a variety of products, including Zabbix, Grafana, Elasticsearch, and many more.
Kibana gives you the ability to view the logs, build graphs and dashboards, set up alerts, and more.
The options with ELK are really endless, and the more I work with it, the more ways I find it can help me: not just viewing logs from distributed remote systems, but also security alerts, SLA graphs, and many other insights.
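As a concrete illustration of the parsing point above, here is a minimal sketch (Python standard library only) that emits log records as JSON lines; Filebeat or Logstash can then pick up the file and index each field in Elasticsearch without schema changes on your side. The file path and the custom "job" field are just examples.

```python
import json
import logging

class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "job": getattr(record, "job", None),  # example of a custom field
        })

handler = logging.FileHandler("/var/log/myapp/jobs.jsonl")  # path is illustrative
handler.setFormatter(JsonLineFormatter())

logger = logging.getLogger("automation")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("nightly sync finished", extra={"job": "sync-products"})
```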

How do I get from "Big Data" to a webpage?

I've spent a lot of time reading and watching videos of people talking about how they use tools designed for handling huge datasets and real-time processing in their architectures. And while I understand what it is that tools like Hadoop/Cassandra/Kafka etc do, no one seems to explain how the data gets from these large processing tools to rendering something on a client/webpage.
From what I understand of big data tools, you can't build your application the same way you would a standard web app querying MySQL, which makes sense given the size of the data that flows through these tools. However, for all this talk of "real-time data analytics", I cannot find any explanation of how the actual analytics get put in front of someone as a chart, table, etc.
"...explain how the data gets from these large processing tools to rendering something on a client/webpage."
With respect to this, one approach would be to process the big data using Spark or Hadoop and store the results in an RDBMS. Then have your web app pull data from the RDBMS to render charts, tables, etc. I can provide examples of what I have done myself if you need more information.
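As a rough sketch of that pattern, a scheduled Spark job might aggregate the raw events and write a small summary table into Postgres, which the web app then queries like any other SQL table. The paths, table names, and credentials below are placeholders, and it assumes the Postgres JDBC driver is available on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-summary").getOrCreate()

# Raw event data produced by the big-data pipeline (path is illustrative).
events = spark.read.parquet("hdfs:///data/events/2024-01-01/")

# Boil terabytes of events down to a few thousand summary rows.
summary = events.groupBy("page", "country").count()

# Write the small result set into an ordinary RDBMS table for the web app to query.
(summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
    .option("dbtable", "daily_page_counts")
    .option("user", "reporter")
    .option("password", "secret")
    .mode("overwrite")
    .save())
```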
Impala supports ODBC/JDBC interfaces, so you could actually hook a web app up to it the same way you would with MySQL.
Other things you might want to check out are HBase, Kudu, and Solr. In some real-time architectures the data ends up in one of those, and all of them have some sort of API that you can use in your web app to access their data.
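To give a feel for the Solr option, it exposes a plain HTTP query API, so a web backend can fetch precomputed results with a simple GET. A minimal sketch, where the host, collection, and field names are assumptions for the example:

```python
import requests

# Query a Solr collection's standard /select handler (host and collection are placeholders).
resp = requests.get(
    "http://solr.example.com:8983/solr/pageviews/select",
    params={"q": "country:DE", "rows": 10, "wt": "json"},
    timeout=5,
)
resp.raise_for_status()

for doc in resp.json()["response"]["docs"]:
    print(doc.get("page"), doc.get("count"))
```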
If you want a simple solution for realtime data processing and analytics, check out the new Stride API, which enables developers to collect, process, and analyze streaming data and then either visualize summary data in Stride or push processed data out to applications in realtime. This is a very easy way to build the kind of realtime reporting dashboards and monitoring / alerting systems you described above.
Take a look at the Stride API technical docs for examples and more info on how to implement this.

Design Question for Notification System

This was originally posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: the purpose of the notification system is to notify users (via email for now) when a site's content has changed or been updated, or a new posting has been made. It can be treated as a system where people define a rule or keyword for a third-party site, and the notification system goes out, crawls that site, and creates inverted search indexes. A notification is then triggered when a new link or document shows up for the user-defined keyword or rule (more explanation on the use case below).
As a clarified use case: suppose I am a Craigslist user looking for a used vehicle. I define a rule: "Honda Accord", year 1996, and a price range of $2000 to $3000.
For the above use case to work, what is the best approach, and how can I leverage open-source technology such as Apache Lucene, Apache Solr, Apache Nutch, and Apache Hadoop to solve it?
You can think of it as building a search engine with a rule and keyword notification system on top. I just need some pointers and help on how to integrate these open-source packages to solve the use case.
Any help and pointers will be appreciated. The three important components we need are:
1) Web Crawler
2) Index Creator
3) Rule or keyword matcher
Any help will be greatly appreciated. I was referring to this wiki, which integrates Nutch and Solr for the above purpose: http://wiki.apache.org/nutch/RunningNutchAndSolr
Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say, fewer than 100 sites requiring spidering per day), you could probably get by with one of the many open-source web spiders, including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project's scope is larger than this, you may need to "roll your own" spidering solution (custom software); typically this would be designed as a distributed, parallel architecture.
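If you do script the small-scale case yourself, the scheduling-and-fetching part can start out very simply. A rough sketch, where the site list and crawl interval are placeholders:

```python
import time
import requests

# Sites to re-visit periodically (placeholders).
SITES = ["https://example.org/listings", "https://example.net/ads"]
CRAWL_INTERVAL_SECONDS = 3600  # once an hour

def fetch(url):
    """Fetch one page, returning its HTML or None on failure."""
    try:
        resp = requests.get(url, timeout=10, headers={"User-Agent": "notify-bot/0.1"})
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"fetch failed for {url}: {exc}")  # a monitoring hook would go here
        return None

while True:
    for url in SITES:
        html = fetch(url)
        if html is not None:
            pass  # hand the page off to the filtering step
    time.sleep(CRAWL_INTERVAL_SECONDS)
```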
For simple filtering, regular expressions would suffice, but for more complex tasks requiring knowledge of HTML layout (e.g., extract the textual content of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
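As a small illustration of that kind of layout-aware extraction, here is a sketch using BeautifulSoup (an assumption; any HTML/XHTML parser would do) to pull the text of the fifth list item inside the fourth table on a page:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def fifth_item_of_fourth_table(html):
    """Return the text of the fifth <li> inside the fourth <table>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if len(tables) < 4:
        return None
    items = tables[3].find_all("li")
    if len(items) < 5:
        return None
    return items[4].get_text(strip=True)

html = (
    "<table></table><table></table><table></table>"
    "<table><tr><td><ul>"
    + "".join(f"<li>item {i}</li>" for i in range(1, 7))
    + "</ul></td></tr></table>"
)
print(fifth_item_of_fourth_table(html))  # -> "item 5"
```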
While any database technology can be used to store results extracted from retrieved documents, using an engine optimized for text like Apache SOLR will allow you to easily expand your search criteria as your needs dictate. Since SOLR supports the attachment of and search for metadata associated with each document, it would be a good choice. You'll also need to provide custom software here to automate this step.
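For the storage step, Solr's JSON update handler makes indexing an extracted result a straightforward HTTP POST. A minimal sketch; the Solr URL, core name, and field names are assumptions, and the core's schema would need to accommodate these fields:

```python
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/listings/update"  # core name is illustrative

def index_result(doc):
    """Add one extracted result to Solr and commit it immediately."""
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "true"},
        json=[doc],  # the update handler accepts a JSON array of documents
        timeout=5,
    )
    resp.raise_for_status()

index_result({
    "id": "craigslist-12345",
    "title": "1996 Honda Accord",
    "price": 2500,
    "url": "https://example.org/listing/12345",
    "fetched_at": "2011-05-16T10:00:00Z",
})
```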
Once you've selected a list of candidate results from SOLR, any scripting language could be used to template them into one or more emails and inject them into your mail transfer agent (MTA). This also requires custom software to automate the process (and, if required, to inject user-specific data into each message).
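For the templating and delivery steps, Python's standard email and smtplib modules are enough to get started; the SMTP host, addresses, and message layout below are placeholders:

```python
import smtplib
from email.message import EmailMessage

def send_digest(to_addr, matches):
    """Format matched listings into a plain-text email and hand it to the local MTA."""
    body = "New matches for your saved search:\n\n"
    body += "\n".join(f"- {m['title']} ({m['price']}) {m['url']}" for m in matches)

    msg = EmailMessage()
    msg["Subject"] = "Your saved-search results"
    msg["From"] = "alerts@example.com"
    msg["To"] = to_addr
    msg.set_content(body)

    with smtplib.SMTP("localhost") as smtp:  # assumes an MTA listening locally
        smtp.send_message(msg)

send_digest("buyer@example.com", [
    {"title": "1996 Honda Accord", "price": "$2,500", "url": "https://example.org/listing/12345"},
])
```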
You should probably also look at Google's Custom Search API before diving into crawling the web yourself. That way, Google can help you with returning keyword-based search results, which you could then filter in your application based on your additional algorithms/rules, etc., and make the whole thing work.
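If you go that route, the Custom Search JSON API is just an HTTP GET carrying your API key, search-engine ID, and query; a minimal sketch with placeholder credentials:

```python
import requests

API_KEY = "YOUR_API_KEY"          # placeholder
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder custom search engine ID

def search(query):
    """Run a query through Google's Custom Search JSON API and return result links."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(search('"Honda Accord" 1996 site:craigslist.org'))
```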

Need Reporting System for Restful API Data

Is there a SaaS tool that will let me interface with an XML-based RESTful API and do advanced reporting on it? We have a basic report-generating system in our application, but need a more advanced solution for some of our customers...
You can find information on GoodData's REST APIs, integration with Talend, Java sample code, and other tips and tricks on our support forum; this is a good place to start: http://support.gooddata.com/forums/46715/entries/77166
Feel free to email support#gooddata.com if you want some help.
Thanks
-Sam [sam#gooddata.com]
What about something like GoodData or Zoho Reports?
Are you wanting to have their reports inside your SaaS app? Or is it ok for them to provide the dashboarding?
I'm not sure I understand what you want to achieve. I assume that you have a REST API-based application and want to analyze its traffic. Pure Apache log analysis doesn't work here, as you need more API-level analysis (analyzing your application's events).
I think we do something similar: our application produces an audit log and an error log that we load into GoodData and analyze there.
Let me know if you are interested in more details.
