How to store Heroku logs for data science purposes?

We can see how to view heroku logs, as well as how to write the last n lines as a text file.
Is there any established pattern for sensible and easy log storage, (potentially ETL), and analysis?
At least, this would involve:
storing logs
moving logs (e.g. via an ETL) to somewhere they can be analysed en masse (e.g. AWS S3 or GCP GCS)
Is there any established pattern to achieve this?
Background
Why would anyone want logs en masse? In case it's relevant, a specific task I'm trying to achieve is to use Bayesian inference on web logs to answer questions like: "if a person clicked on A, B and C, then they're x% likely to click on D" (so as to better understand which other pages a user may be interested in, and therefore suggest more relevant pages to the user). This is all pretty straightforward in Python or R. But obviously one needs access to the logs (all the logs) before such data science can be carried out.
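To make the kind of estimate concrete, here is a minimal sketch in Python. It assumes the logs have already been reduced to one set of visited pages per session (the `sessions` data and the `prob_click` helper are made up for illustration), and it uses a simple smoothed frequency estimate with a Beta(1, 1) prior rather than a full Bayesian model.

```python
from typing import Iterable, Set


def prob_click(sessions: Iterable[Set[str]], seen: Set[str], target: str,
               prior_yes: float = 1.0, prior_no: float = 1.0) -> float:
    """Posterior mean of P(target clicked | every page in `seen` clicked)."""
    # Keep only the sessions that contain all of the pages in `seen`.
    matching = [s for s in sessions if seen <= s]
    hits = sum(1 for s in matching if target in s)
    # Beta prior pseudo-counts keep the estimate sane when data is sparse.
    return (hits + prior_yes) / (len(matching) + prior_yes + prior_no)


# Hypothetical sessions, each reduced to the set of pages visited.
sessions = [
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "D"},
]
p = prob_click(sessions, {"A", "B", "C"}, "D")
```

Three sessions contain A, B and C, and two of those also contain D, so the smoothed estimate is (2 + 1) / (3 + 2).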
What I know so far
Heroku provides several logging addons

Really the best solution is probably to set up the Heroku app to also pipe your logs into an S3 bucket or something like that. Though perhaps you want to set it up so it only sends the log data you are actually interested in. Even better if you can get something that does this for you.
Looks like PaperTrail at least allows this. Here is the current documentation link:
https://documentation.solarwinds.com/en/Success_Center/papertrail/Content/kb/how-it-works/automatic-s3-archive-export.htm?cshid=pt-how-it-works-automatic-s3-archive-export
Using an outside service might get rather costly, though, depending on the volume of logs you need to handle. Otherwise, you may just need to roll your own solution (or better yet, look for gems that can help).
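If you do roll your own, a first step could be to keep only the lines you care about before shipping them anywhere. Here is a minimal sketch in Python, assuming router lines look like the sample below (a timestamp, then `heroku[router]:`, then key=value pairs). The sample line and the `parse_router_line` helper are made up for illustration, and the upload to S3 itself is left out.

```python
import re
import shlex
from typing import Dict, Optional

# Assumed line shape: "<timestamp> <source>[<dyno>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<source>\w+)\[(?P<dyno>[^\]]+)\]: (?P<msg>.*)$"
)


def parse_router_line(line: str) -> Optional[Dict[str, str]]:
    """Return the key=value fields of a router line, or None for other lines."""
    m = LINE_RE.match(line.strip())
    if m is None or m.group("dyno") != "router":
        return None
    fields = {"ts": m.group("ts")}
    # shlex.split keeps quoted values such as path="/a b" in one token.
    # It raises ValueError on unbalanced quotes, which is not handled here.
    for token in shlex.split(m.group("msg")):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Made-up sample line in the assumed format.
sample = (
    '2012-10-11T03:47:20+00:00 heroku[router]: at=info method=GET '
    'path="/posts" host=myapp.herokuapp.com fwd="204.204.204.204" '
    'dyno=web.1 connect=1ms service=18ms status=200 bytes=975'
)
record = parse_router_line(sample)
```

Each parsed record is a plain dict, so it can be written out as JSON lines or CSV and loaded straight into Python or R later.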

Related

Assisted manual inspection of log files

Rookie question here. I've been inspecting a lot of log files to try to pinpoint errors in an application. Specifically I'm trying to compare success scenarios with failure scenarios... but due to the volume of logs it's difficult to identify which log messages are "good" and which ones are "bad". (I'm not the developer, so I can't just change the logs.)
Ideally I'd love a tool where I could...
Load a log file and manually specify which entries are associated with a success scenario
The rest of the logfile would then be formatted according to whether a given line appears in the "success" scenario or not. (Ideally it could do fuzzy matching, so an exact match (sans timestamp, of course) could be one color and a close match (different value) could be another.)
This would make it easy to skim through the failure scenarios and identify messages associated with the failure condition. Think of it like a smart diff.
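To illustrate the idea, here is a minimal sketch in Python using the standard library's `difflib`. It assumes every line starts with an ISO-like timestamp, and the sample log lines, the `classify` helper and the 0.8 similarity threshold are all made up for illustration.

```python
import difflib
import re
from typing import Iterable, List, Tuple

# Assumption: lines start with a timestamp like "2024-01-01 12:00:00,123 ".
TIMESTAMP_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?\s*"
)


def normalize(line: str) -> str:
    """Strip the leading timestamp so identical messages compare equal."""
    return TIMESTAMP_RE.sub("", line.strip())


def classify(success_lines: Iterable[str], other_lines: Iterable[str],
             threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Label each line as 'exact', 'close' or 'new' relative to the success run."""
    known = {normalize(line) for line in success_lines}
    result = []
    for line in other_lines:
        body = normalize(line)
        if body in known:
            label = "exact"
        elif any(difflib.SequenceMatcher(None, body, k).ratio() >= threshold
                 for k in known):
            label = "close"
        else:
            label = "new"
        result.append((label, line))
    return result


success = [
    "2024-01-01 12:00:00 Connected to db host=alpha",
    "2024-01-01 12:00:01 Request finished in 12 ms",
]
failure = [
    "2024-01-02 09:00:00 Connected to db host=alpha",
    "2024-01-02 09:00:01 Request finished in 48 ms",
    "2024-01-02 09:00:02 NullPointerException in OrderService",
]
labels = [label for label, _ in classify(success, failure)]
```

The labels could then drive the coloring. Comparing every line against every known line is quadratic, so this only suits quick analysis of moderately sized files.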
Most of the tools I've looked at (e.g., Splunk, OtrosLogViewer) seem to be focused on automated server-side deployments. While that could work, I'd love something lighter weight for quick analysis.
Does anything like this exist? Any pointers welcome/appreciated.

How can one detect changes in a directory across program executions?

I am making a protocol, client and server which provide file transfer functionality similar to FTP (among other features). One difference between my protocol and FTP is that I would like to store a copy of the remote server's directory structure in a local cache. The server will only be running on Windows (written in C++) so any applicable Win32 API calls would be appreciated (if any). When initially connected, the client requests the immediate children (both files and directories, just like "ls" or "dir" with no options), then when a user navigates into a directory, this step repeats with the new parent like you might expect.
Of course, most of the time, if the same directory of a given server is requested twice by a client, the directory's contents will be the same. Therefore I would like to cache the results of each directory listing on the client. I would like a simple way of implementing this, but it would need to take into account expiring cache entries because of file/directory access and modification time and name changes, which is the tricky part. I would ideally like something which would enable almost instant directory listings by the client, with something like a hash which takes into account not only file contents, but also changes in subdirectories' contents' filenames, data, modification and access dates, etc.
This is NOT something that could completely rely on FileSystemWatcher (or similar) objects because it would need to maintain this cache even if the program is only run occasionally. Of course these would be nice to help maintain the cache, but that's only part of the problem.
My best(?) idea so far is using FindFirstFile() and FindNextFile(), and sorting (somehow), concatenating and hashing values found in the WIN32_FIND_DATA structs (with file contents maybe), and using that as a token for expiration (just to indicate change in any of these fields). Then I would have one of these tokens for each directory. When a directory is requested, the server would hash everything and compare that to the cached hash provided by the client, and if it's different, return the normal data, otherwise an HTTP 304 equivalent. Is there a less elaborate way of doing something like this? Does "directory last modified date" take into account every one of its subdirectories' files' modification dates under all circumstances? I'm sure that the built-in Windows indexing service has something like this but ideally I wouldn't need to rely on it.
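For what it's worth, the per-directory token idea can be sketched in a few lines. This is Python rather than Win32 C++, purely to show the shape of it; `directory_token` is a hypothetical helper. It hashes only the names, types, sizes and modification times of the immediate children, and leaves out file contents and access times.

```python
import hashlib
import os
import tempfile


def directory_token(path: str) -> str:
    """Hash names, types, sizes and mtimes of a directory's immediate children."""
    digest = hashlib.sha256()
    with os.scandir(path) as it:
        # Sort by name so the token does not depend on enumeration order.
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        info = entry.stat(follow_symlinks=False)
        kind = "d" if entry.is_dir(follow_symlinks=False) else "f"
        record = f"{kind}\0{entry.name}\0{info.st_size}\0{info.st_mtime_ns}\n"
        digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()


# Usage: the token changes once a child is added to the directory.
with tempfile.TemporaryDirectory() as root:
    before = directory_token(root)
    with open(os.path.join(root, "a.txt"), "w") as fh:
        fh.write("x")
    after = directory_token(root)
changed = before != after
```

Because only immediate children are hashed, a change deep inside a subdirectory is not guaranteed to alter the parent's token, which fits the one-token-per-directory scheme described above.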
Because this service is for file sharing, something involving hashes would be especially nice so that I could automatically and efficiently find other people who are sharing a given file, but that's less of a concern than hosing the disk during the hash calculation.
I'm wondering what others who are more experienced than I am with programming would do to solve this problem (rsync and subversion have solved similar problems but not identical).
You're asking a lot of a File System Implementation of Very Little Brain (with apologies to A. A. Milne).
This is actually well-trodden ground and you'd do well to look at the existing literature on distributed filesystems. AFS comes to mind as an example of a very well studied approach.
I doubt you'll be able to come up with something useful and accurate without doing some serious homework. Put another way, 'twould be folly to ignore all the prior art.

Techniques to reduce data harvesting from AJAX/JSON services

I was wondering if anyone had come across any techniques to reduce the chances of data exposed through JSON type services on the server (intended to supply AJAX functions) from being harvested by external agents.
It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source.
Has anybody implemented a clever technique here?
Whatever the method, it should still allow a genuine AJAX function to consume the data.
Note that I'm not really talking about protecting 'sensitive' information here, the odd record leaking out is not a problem. Rather I am thinking about stopping a situation where the whole DB is hoovered up by bots (either in one go, or gradually over time).
Thanks.
First, I would like to be clear on this:
"It seems to me that the problem is not so difficult if you had say a Flash client consuming the data. Then you could send encrypted data to the client, which would know how to decrypt it. The same method seems impossible with AJAX though, due to the open nature of the Javascript source."
It will be pretty obvious that the information is being sent encrypted to the Flash client, and it won't be that hard for an attacker to find out from your compiled Flash program what is being used to decrypt it, replicate that, and get all the data.
If the data does happen to have the value you think it has, you can count on the above.
If this is public information, embrace that & don't combat it - instead find ways to capitalize on it.
If this is information that you are only exposing to a set of users, make sure you have the corresponding authentication / secure communication. Track usage as others have said, and have measures that act on it.
The first thing to prevent bots from stealing your data is not technological, it's legal. First, make sure you have the right language in your site's Terms of Use that what you're trying to prevent is actually disallowed and defensible from a legal standpoint. Second, make sure you design your technical strategy with legal issues in mind. For example, in the US, if you put data behind an authentication barrier and an attacker steals it, it's likely a violation of the DMCA law. Third, find a lawyer who can advise you on IP and DMCA issues... nice folks on StackOverflow aren't enough. :-)
Now, about the technology:
A reasonable solution is to require that users be authenticated before they can get access to your sensitive Ajax calls. This allows you to simply monitor per-user usage of your Ajax calls and (manually or automatically) cancel the account of any user who makes too many requests in a particular time period. (or too many total requests, if you're trying to defend against a trickle approach).
This approach of course is vulnerable to sophisticated bots who automatically sign up new "users", but with a reasonably good CAPTCHA implementation, it's quite hard to build this kind of bot. (see "circumvention" section at http://en.wikipedia.org/wiki/CAPTCHA)
If you are trying to protect public data (no authentication) then your options are much more limited. As other answers noted, you can try IP-address-based limits (and run afoul of large corporate proxy users) but sophisticated attackers can get around this by distributing the load. There's also likely sophisticated software which watches things like request timing, request patterns, etc. and tries to spot bots. Poker sites, for example, spend a lot of time on this. But don't expect these kinds of systems to be cheap. One easy thing you can do is to mine your web logs (e.g. using Splunk) and find the top N IP addresses hitting your site, and then do a reverse-IP lookup on them. Some will be legitimate corporate or ISP proxies. But if you recognize a competitor's domain name among the list, you can block their domain or follow up with your lawyers.
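If you don't have Splunk handy, the top-N part is easy to approximate by hand. A minimal sketch in Python, assuming an access log where the client IP is the first whitespace-separated field (as in Common Log Format); the sample lines use documentation-range addresses and are made up:

```python
import socket
from collections import Counter
from typing import Iterable, List, Tuple


def top_ips(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Count requests per client IP, taken from the first field of each line."""
    counts = Counter(line.split(None, 1)[0] for line in lines if line.strip())
    return counts.most_common(n)


def reverse_lookup(ip: str) -> str:
    """Best-effort reverse DNS lookup; needs network access."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return "(no reverse DNS entry)"


# Made-up sample lines.
sample = [
    '203.0.113.9 - - [10/Oct/2010:13:55:36 +0000] "GET /data?id=1 HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2010:13:55:37 +0000] "GET /data?id=2 HTTP/1.1" 200 512',
    '203.0.113.9 - - [10/Oct/2010:13:55:38 +0000] "GET /data?id=3 HTTP/1.1" 200 512',
    '192.0.2.1 - - [10/Oct/2010:13:55:39 +0000] "GET / HTTP/1.1" 200 1024',
    '203.0.113.9 - - [10/Oct/2010:13:55:40 +0000] "GET /data?id=4 HTTP/1.1" 200 512',
    '198.51.100.7 - - [10/Oct/2010:13:55:41 +0000] "GET /data?id=5 HTTP/1.1" 200 512',
]
busiest = top_ips(sample, n=2)
```

Calling `reverse_lookup` on each of the busiest addresses then gives the host names to eyeball for corporate proxies or a competitor's domain.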
In addition to pre-theft defense, you might also want to think about inserting a "honey pot": deliberately fake information that you can track later. This is how, for example, map makers catch plagiarism: they insert a fake street in their maps and see which other maps show the same fake street. While this doesn't prevent determined folks from sucking out all your data, it does let you find out later who's re-using your data. This can be done by embedding unique text strings in your text output, and then searching for those strings on Google later (assuming your data is re-usable on another public website). If your data is HTML or images, you can include an image which points back to your site, and you can track who is downloading it, and look for patterns you can use to bust the freeloaders.
Note that the javascript encryption approach noted in one of the other answers won't work for non-authenticated sessions-- an attacker can simply download the javascript and run it just like a regular browser would. Moral of the story: public data is essentially indefensible. If you want to keep data protected, put it behind an authentication barrier.
This is obvious, but if your data is publicly searchable by search engines, you'll both need a non-AJAX solution for them (Google won't read your ajax data!) and you'll want to mark those pages NOARCHIVE so your data doesn't show up in Google's cache. You'll also probably want a white list of search engine crawler IP addresses which you allow into your search-engine-crawlable pages (you can work with Google, Bing, Yahoo, etc. to get these), otherwise malicious bots could simply impersonate Google and get your data.
In conclusion, I want to echo @kdgregory above: make sure that the threat is real enough that it's worth the effort required. Many companies overestimate the interest that other people (both legitimate customers and nefarious actors) have in their business. It might be that yours is an oddball case where you have particularly important data, it's particularly valuable to obtain, it must be publicly accessible without authentication, and your legal recourses will be limited if someone steals your data. But all of those together make for an admittedly unusual case.
P.S. Here is another way to think about this problem, which may or may not apply in your case. Sometimes it's easier to change how your data works so that securing it becomes unnecessary. For example, can you tie your data in some way to a service on your site, so that the data isn't very useful unless it's being used in conjunction with your code? Or can you embed advertising in it, so that wherever it's shown you get paid? And so on. I don't know if any of these mitigations apply to your case, but many businesses have found ways to give stuff away for free on the Internet (and encourage rather than prevent wide re-distribution) and still make money, so a hybrid free/pay strategy may (or may not) be possible in your case.
If you have an internal Memcached box, you could consider using a technique where you create an entry for each IP that hits your server with an hour expiration. Then increment that value each time the IP hits your AJAX endpoint. If the value gets over a particular threshold, fry the connection. If the value expires in Memcached, you know it isn't getting "hoovered away".
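A minimal sketch of that counter logic in Python, with a plain in-process dict standing in for Memcached (so this is an assumption-laden stand-in, not a Memcached client). The `RateLimiter` class, the limit and the window are hypothetical; with a real Memcached box the same idea would map onto an entry created with an expiry and then incremented.

```python
import time
from typing import Callable, Dict, Tuple


class RateLimiter:
    """Per-IP hit counter whose entries expire after a fixed window."""

    def __init__(self, limit: int, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # ip -> (expires_at, count); a dict stands in for the cache here.
        self._entries: Dict[str, Tuple[float, int]] = {}

    def allow(self, ip: str) -> bool:
        """Record one hit and report whether the IP is still under the limit."""
        now = self.clock()
        entry = self._entries.get(ip)
        if entry is None or now >= entry[0]:
            # No entry, or it expired: start a fresh window for this IP.
            entry = (now + self.window, 0)
        expires_at, count = entry
        count += 1
        self._entries[ip] = (expires_at, count)
        return count <= self.limit


# Usage with a fake clock so the expiry can be shown without waiting.
fake_now = [0.0]
limiter = RateLimiter(limit=2, window_seconds=10, clock=lambda: fake_now[0])
first, second, third = (limiter.allow("203.0.113.9") for _ in range(3))
fake_now[0] = 11.0
after_expiry = limiter.allow("203.0.113.9")
```

The window here is fixed from the first hit, the way an entry with a one-hour expiry would behave, so a client can burst right after each expiry.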
This isn't a concrete answer with a proof of concept, but maybe a starting point for you. You could create a javascript function that provides encryption/decryption functions. The javascript would need to be built dynamically, and you would include an encryption key that is unique to the session. On the server side, you'd have an encryption service that uses the key from the session to encrypt your JSON before delivering it.
This would at least prevent someone who is listening to your web traffic from pulling information out of your database.
I'm with kdgregory though, it sounds like your data is too open.
Some techniques are listed in Further thoughts on hindering screen scraping.
If you use PHP, Bad behavior is a nice tool to help. If you don't use PHP, it can give some ideas on how to filter (see How it works page).
Incredibill's blog is giving nice tips, lists of User-agents/IP ranges to block, etc...
Here are a variety of suggestions:
Issue tokens required for redemption along with each AJAX request. Expire the tokens.
Track how many queries are coming from each client, and throttle excessive usage based on expected normal usage of your site.
Look for patterns in usage such as sequential queries, spikes in requests, or queries that occur faster than a human could conduct.
Check user-agents. Many bots don't completely replicate the user agent info of a browser, and you can eliminate programmatic scraping of your data using this method.
Change the front-end component of your website to redirect to a captcha (or some other human verifying mechanism) once a request threshold is exceeded.
Modify your logic so the response data is returned in a few different ways to complicate the code required to parse.
Obfuscate your client-side javascript.
Block IPs of offending clients.
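As an illustration of the first suggestion (expiring tokens), here is a minimal sketch in Python using an HMAC signature from the standard library. The token layout, the `issue_token` and `verify_token` helpers and the placeholder secret are assumptions for illustration only.

```python
import hashlib
import hmac
import time
from typing import Optional

# Placeholder: in practice this would be a random secret kept on the server.
SECRET_KEY = b"replace-with-a-random-server-side-secret"


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(session_id: str, ttl_seconds: int = 300,
                now: Optional[float] = None) -> str:
    """Build a token of the form '<session>:<expiry>:<signature>'."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{session_id}:{expires}"
    return f"{payload}:{_sign(payload)}"


def verify_token(token: str, session_id: str,
                 now: Optional[float] = None) -> bool:
    """Accept only unexpired tokens signed by us for this session."""
    try:
        token_session, expires_text, signature = token.rsplit(":", 2)
        expires = int(expires_text)
    except ValueError:
        return False
    expected = _sign(f"{token_session}:{expires_text}")
    current = time.time() if now is None else now
    # compare_digest avoids leaking how many characters matched.
    return (hmac.compare_digest(signature.encode(), expected.encode())
            and token_session == session_id
            and current < expires)


# Usage with explicit timestamps so the expiry is easy to follow.
token = issue_token("session-1", ttl_seconds=60, now=1000)
valid_now = verify_token(token, "session-1", now=1030)
valid_later = verify_token(token, "session-1", now=1061)
```

The page that legitimately hosts the AJAX call would embed a fresh token, and the endpoint would reject requests without a valid one. A scraper can still load the page first to obtain a token, so this raises the cost rather than closing the hole.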
Bots usually don't parse Javascript, so your ajax code won't be executed automatically. And even if they do, bots usually don't maintain sessions/cookies either. Knowing that, you could reject the request if it is invoked without a valid session/cookie (which is obviously set on the server side beforehand by the request on the parent page).
This does not protect you from human hazard though. The safest way is to restrict access to users with a login/password. If that is not your intent, well, then you have to live with the fact that it's a public application. You could of course scan logs and maintain blacklists of IP addresses and user agents, but that is going to extremes.

Fast Text Search Over Logs

Here's the problem I'm having, I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.
The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is unfeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without taking up twice the space a log already takes up.
Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.
Edit:
Grep won't work as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program into it.
The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.
Edit 2:
Some of these suggestions are great, but I have to reiterate that I can't integrate another application, it's part of the contract. But to answer some questions, the data in the logs varies from either received messages in a health-care specific format or messages relating to these. I'm looking to rely on an index because while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before even recording it. Unless some debug logging options are turned on, more than half of the log messages are ignored.
The search basically goes like this: A user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for ajax). Usually they'll want to search for messages with some information in them, maybe a patient id, or some string they've sent, and so they can enter the string into the search. The search gets sent asynchronously and the custom web server linearly searches through the logs 1MB at a time for some results. This process can take a very long time when the logs get big. And it's what I'm trying to optimize.
grep usually works pretty well for me with big logs (sometimes 12G+). You can find a version for windows here as well.
Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.
If you can identify the "words" in the text you want to index, just build a large hash table of the words which maps a hash of the word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each location to confirm the search term falls there, rather than just a word with a matching hash.
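A minimal sketch of that idea, in Python for brevity (the same structure carries over to a C++ hash map). It assumes "words" are runs of letters, digits and underscores, and it indexes line numbers in an in-memory list rather than byte offsets in files; `build_index`, `search` and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

WORD_RE = re.compile(r"\w+")


def build_index(lines: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each lowercased word to the set of line numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for line_no, line in enumerate(lines):
        for word in WORD_RE.findall(line.lower()):
            index[word].add(line_no)
    return index


def search(index: Dict[str, Set[int]], lines: List[str], query: str) -> List[int]:
    """Use the index to find candidate lines, then confirm the full query."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return []
    # Only lines containing every query word can possibly match.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    needle = query.lower()
    return sorted(n for n in candidates if needle in lines[n].lower())


# Made-up sample log lines.
lines = [
    "MSH received patient id 12345",
    "ACK sent for message 9",
    "patient id 777 rejected",
]
index = build_index(lines)
hits = search(index, lines, "patient id")
```

One limitation of a word index: a query that is only part of a word (say "pat" for "patient") finds nothing, so substring search would need something like an n-gram index instead.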
Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?
You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there, Lucene seems to be very popular. Check these two questions for some more suggestions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
More details on the kind of search you're performing could definitely help. Why, in particular do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever even recorded?
How long are these searches taking now?
You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?
Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.

What logging implementation do you prefer?

I'm about to implement a logging class in C++ and am trying to decide how to do it. I'm curious to know what kind of different logging implementations there are out there.
For example, I've used logging with "levels" in Python. Where you filter out log events that are lower than a certain threshold. It also includes logging "names" where you can filter out events via a hierarchy, for example "app.apples.*" will not be displayed but "app.bananas.*" will be.
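For reference, the Python behaviour described above looks roughly like this with the standard `logging` module. The logger names follow the example in the text, and the `ListHandler` class is a small helper written just for this sketch so the output can be inspected.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in a list instead of printing them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(
            f"{record.name}:{record.levelname}:{record.getMessage()}"
        )


handler = ListHandler()
app = logging.getLogger("app")
app.setLevel(logging.INFO)   # threshold inherited by all "app.*" loggers
app.addHandler(handler)
app.propagate = False        # keep the example away from the root logger

# Silence the whole "app.apples" subtree by raising its threshold.
logging.getLogger("app.apples").setLevel(logging.CRITICAL + 1)

logging.getLogger("app.apples.core").warning("hidden by the name filter")
logging.getLogger("app.bananas.peel").info("shown")
logging.getLogger("app.bananas.peel").debug("hidden by the level filter")
```

Only the "shown" message reaches the handler: child loggers inherit the nearest configured ancestor's level, which is what makes the hierarchy filter work.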
I've had thoughts about using "tags", but unsure of the implementation. I've seen games use "bits" for compactness.
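One way the "bits" version of tags could look, sketched in Python with `enum.IntFlag` (in C++ this would be an integer bitmask). The tag names and the `TaggedLogger` class are made up for illustration.

```python
from enum import IntFlag
from typing import List


class Tag(IntFlag):
    """Each tag occupies one bit, so a set of tags fits in a single integer."""
    NETWORK = 1
    RENDER = 2
    AUDIO = 4
    AI = 8


class TaggedLogger:
    def __init__(self, enabled: Tag) -> None:
        self.enabled = enabled
        self.lines: List[str] = []

    def log(self, tags: Tag, message: str) -> None:
        # One bitwise AND decides whether any of the event's tags is enabled.
        if tags & self.enabled:
            self.lines.append(message)


log = TaggedLogger(enabled=Tag.NETWORK | Tag.AI)
log.log(Tag.NETWORK, "socket opened")
log.log(Tag.RENDER, "frame drawn")
log.log(Tag.RENDER | Tag.AI, "pathfinding overlay")
```

An event passes if it shares at least one bit with the enabled mask. Requiring all tags instead would be `(tags & enabled) == tags`.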
So my questions:
What implementations have you created or used before?
What do you think the advantages and disadvantages of them are?
I'd read this post by Jeff Atwood
It's about the overflow of Logging and how to avoid it.
There are lots of links on the Log4J wikipedia page.
One of our applications uses Registry entries to dynamically control logging/tracing during production execution.
For example:
if (Logger.TraceOptionIsEnabled(TraceOption.PLCF_ShowConfig)) {...whatever
When executed at run-time, if registry value PLCF_ShowConfig is true, the call returns true, and whatever is executed.
Quite handy.
Jeff Atwood had a pretty interesting blog entry about logging. The ultimate message of it was that logging is generally unnecessary (to some extent).
Logging generally doesn't scale well (too much data on high traffic systems).
I think the best point of it is that you generally don't need it. It's easier to trace through your code by hand to understand what values are being assigned to things than it is to sift through lots of log files.
It's just information overload.
Now the same can't be said for single user applications. For things like media encoding or general OS usage, it can be nice to have a log for small apps because debug info is useful (to me) in this situation. If you're burning a DVD and something goes wrong, looking at log info can be very helpful to troubleshoot with if you understand the log output.
I think having a few levels would help for the user, such as:
No logs
Basic logging for general user feedback
Highly technical data for a developer or tech-support person to interpret
Depending on the situation, it may be useful to store ALL log data and only display to the user the basic info, or perhaps giving the option to see all detailed data.
It all depends on the domain.

Resources