PostgreSQL config settings on dynamically created EC2 instances - amazon-ec2

Let me start by saying that I think there is a better way of doing things than I'm doing now... so, please don't post comments and answers saying that I should be using a different technology, etc. I have a "reasonably" specific question.
A little background:
Basically, I have a system where I'm processing a lot of varied, but fairly structured, data feeds each day (CSV files). It's a fairly generic ETL type of system. I started off writing Python scripts to do it all in memory. But I found that I was writing a lot of code to check and enforce rules that could easily be described by a db schema. So I've got a series of SQS queues (one for each source) that hold file locations (on S3) to process, and a PostgreSQL db script that does the loading. Hacky? Yes, probably. But, in a way, it's pretty easy to just define all of your rules in PostgreSQL. At least for me, with approx 15 years of RDBMS experience (what's that old saying about when you only have a hammer, everything looks like a nail?).
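Roughly, each worker does something like the sketch below (the queue, bucket, and table names are made up for illustration; the real ones vary per feed):

    import boto
    import boto.sqs
    import psycopg2

    sqs = boto.sqs.connect_to_region("us-east-1")
    queue = sqs.get_queue("feed-alpha")           # one queue per source
    s3 = boto.connect_s3()
    db = psycopg2.connect("dbname=etl user=etl")  # the instance-local PostgreSQL

    while True:
        messages = queue.get_messages(num_messages=1)
        if not messages:
            break
        msg = messages[0]
        s3_path = msg.get_body()                  # e.g. "incoming/feed-alpha/2013-05-01.csv"
        local_path = "/tmp/feed.csv"
        s3.get_bucket("my-feed-bucket").get_key(s3_path).get_contents_to_filename(local_path)

        cur = db.cursor()
        with open(local_path) as f:
            # the schema (types, constraints, FKs) does the rule checking on load
            cur.copy_expert("COPY staging.feed_alpha FROM STDIN WITH CSV HEADER", f)
        db.commit()
        queue.delete_message(msg)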
So, it all works pretty well. But when creating EC2 instances, I have a choice of an image_id and a type/size. I have my base "PostgreSQL worker image" that I use, but it's really geared for one size (micro).
But now I'm thinking about playing around and seeing what kind of gains I could get if I went with small or medium. My initial thought was that I would just create separate image_ids, each with postgres conf settings geared to that size. But that seems a bit messy (then again, the whole thing is a bit messy and hacky).
Given what I have in place, is there a better way to accomplish this than just separate AMIs?
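To make the question concrete: the kind of thing I'd be baking into each AMI (or, alternatively, generating at boot from the instance metadata) is just a handful of memory settings. A rough sketch of the boot-time version, with guessed rather than tuned numbers, and assuming postgresql.conf includes a file named instance_tuning.conf:

    import urllib.request

    # the EC2 instance metadata service, reachable only from inside the instance
    instance_type = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-type", timeout=2
    ).read().decode().strip()

    # rough per-size guesses; the real values would need proper benchmarking
    SETTINGS = {
        "t1.micro":  {"shared_buffers": "64MB",  "work_mem": "2MB",  "maintenance_work_mem": "32MB"},
        "m1.small":  {"shared_buffers": "256MB", "work_mem": "8MB",  "maintenance_work_mem": "64MB"},
        "m1.medium": {"shared_buffers": "768MB", "work_mem": "16MB", "maintenance_work_mem": "128MB"},
    }
    conf = SETTINGS.get(instance_type, SETTINGS["t1.micro"])

    # assumes postgresql.conf ends with:  include 'instance_tuning.conf'
    with open("/etc/postgresql/9.1/main/instance_tuning.conf", "w") as f:
        for key, value in sorted(conf.items()):
            f.write("%s = %s\n" % (key, value))
    # then restart: sudo service postgresql restart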
Final notes:
My AMIs are all PostgreSQL 9.1 and Ubuntu 12.04. And the DBs are just temporary storage. They only exist for the 15 or 20 minutes they are needed to load/process/output the data.
If you feel like this question could be better answered on the Stack Exchange DBA site, then please feel free to add a comment. I usually start with StackOverflow because it's a bigger community and one I feel more at home in. I'm much more of a developer than a DBA.

Related

ETL vs Workflow Management: which to apply? Can they be used interchangeably?

I am in the process of setting up a data pipeline for a client. I've spent a number of years being on the analysis side of things but now I am working with a small shop that only really has a production environment.
The first thing we did was to create a replicated instance of production but I would like to apply a sort of data warehouse mentality to make the analysis portion easier.
My question comes down to: what tool should I use, and why? I have been looking at solutions like Talend for ETL but am also very interested in Airflow. The problem is that I'm not quite sure which suits my needs better. I would like to monitor and create jobs easily (I write Python pretty fluently, so Airflow job creation isn't an issue) but also be able to transform the data as it comes in.
Any suggestions are much appreciated
Please consider that the open source version of Talend (Talend Open Studio) does not provide any monitoring/scheduling capabilities. It is only a "code generator". The more sophisticated infrastructure is part of the enterprise editions.
For anyone who sees this: four years later, what we have done is leverage Airflow for scheduling, Fivetran and/or Stitch for extraction and loading, and dbt for transformations.
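To give a flavour of the Airflow piece of that stack: the scheduling side can be as small as a DAG like the sketch below, which just runs the dbt steps after the overnight loads land (the schedule, paths, and task names are placeholders, and it assumes the dbt CLI is available on the worker):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Extraction/loading is handled by Fivetran or Stitch on their own schedules,
    # so this DAG only runs the dbt transformations afterwards.
    with DAG(
        dag_id="daily_dbt_run",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 6 * * *",   # after the overnight loads land
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run --profiles-dir /opt/dbt")
        dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test --profiles-dir /opt/dbt")
        dbt_run >> dbt_test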

SSIS vs Pentaho

Has anyone used both of these and can provide a good comparison? I am doing a school project, so the cost of SSIS isn't an issue as we already have the license for it.
Background on what's going on: I will be downloading about 10 years of patent information into flat files. The result will be 2,080 delimited files. I want a way to load them into MS SQL Server all at once. Then I want to be able to append additional files into the DB as they are released.
Speed of the software doesn't bother me much, as I can just let it run overnight. I am just looking for something with some flexibility and, more importantly, something that is fairly easy to use. I have never done a project like this before and will be learning how to do this from the boards.
THANKS!
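For scale, the load/append step described here is small enough to sketch in a few lines of Python with pyodbc, whichever tool ends up orchestrating it (the server, database, table, delimiter, and paths below are assumptions):

    import glob
    import pyodbc  # assumes pyodbc and an ODBC driver for SQL Server are installed

    # connection details are placeholders
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=Patents;Trusted_Connection=yes;",
        autocommit=True,
    )
    cursor = conn.cursor()

    # BULK INSERT reads each file from the SQL Server machine's own filesystem,
    # so these paths must be visible to the server, not just to this script.
    for path in sorted(glob.glob(r"C:\patent_files\*.txt")):
        cursor.execute(
            "BULK INSERT dbo.PatentRaw FROM '%s' "
            "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n', FIRSTROW = 2)" % path
        )
        print("loaded", path)

    conn.close()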
Have used both in real live projects. I do prefer Pentaho (PDI) over SSIS because of its ease of use and flexibility.
Do read a little on the subject before you start using it. There are a couple of excellent books on Kettle (PDI), or you could just read the Getting Started material in the Help menu of PDI. The forum is a good place if you are stuck, or ##pentaho on IRC.
What also helps a lot are the Samples that you can find in the Welcome Screen.
I hope you enjoy it, I know I still do. Have been using it since 2006 and am always pissed when I have to use SSIS on some project :-)
PS: use the jTDS JDBC driver to connect to a SQL Server db, it will save you some headaches.
Hope this helps,
Bart
After spending a couple days developing an ETL package in PDI and SSIS I feel confident in saying that PDI is definitely more user friendly. The user interface alone is much cleaner and seems to flow in a manner that is very intuitive and as such easy to use.

Why would a long Rake task just stop, then start again?

I have a complex legacy data migration problem. MS Access data going into MySQL. I'm using a Rake task. There's a lot of data and it requires a lot of transforming and examining. The Rake task is hundreds of lines across about 12 files. The whole thing takes about two hours to run. It has to run on Windows (I'm using XP VMware VM hosted on an OS X Leopard system) because the Ruby libraries that can talk to MS Access only work on Windows.
I'm finding that sometimes, not every time, I'll start the task and come back later and it will be stalled. No error message. I put numerous print statements in it, so you should see lots of reporting going by, but there's just the last thing it printed, sitting there: "11% done" or whatever.
I hit Ctrl-C, and instead of going back to the command prompt, the task starts up again where it left off; reported output starts going by again.
I'm sorry for the abstract question, but I'm hoping someone might have an idea or notion of what's happening. Maybe suggestions for troubleshooting.
Well, if the Access side seems to be freezing, consider shoving the data into MySQL first and see if that eliminates the problem. In other words, the data has to go over eventually, so you might as well move it into that system from the get-go. There are a number of utilities around that allow you to move the data into MySQL (or just export the Access data to CSV files).
So, you're not doing data transformations during that transfer while you move it into MySQL (there's no labor or programming cost here… just transfer the data).
Once you have the data in MySQL, your code is doing the migration (transformation) of data from one MySQL database (or table) to another. And you're out of a VM environment and running things natively (faster performance, likely more stable).
So, I would vote to get your data over into MySQL... then you're down to a single platform.
The fewer systems involved, the less chance of problems.

Distributing Video on a LAN to alternate Locations - Can the browser detect this?

I'm the administrator for a company intranet and I'd like to start producing videos. However, we have a very small bandwidth tunnel between our locations, and I'd like to avoid hogging it with multiple users streaming videos over it.
I'd like to synchronize the files to servers at each of the locations. Then I'd like the browser (or the intranet) to detect which site I'm at. From there, I'd like it to request the video from the closest location.
I've never done this, and was wondering if there is already a solution out there for this. It looks like Hadoop may do this, but I guess I'd like to hear that from someone using it before I commit to learning it.
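For what it's worth, the "detect which site I'm at" part doesn't need anything exotic: a lookup from the client's subnet to the local mirror is enough, along the lines of the sketch below (the subnets and hostnames are invented):

    import ipaddress

    # hypothetical mapping of office subnets to their local video mirrors
    SITE_MIRRORS = {
        ipaddress.ip_network("10.1.0.0/16"): "http://videos.site-a.example.local",
        ipaddress.ip_network("10.2.0.0/16"): "http://videos.site-b.example.local",
    }
    DEFAULT_MIRROR = "http://videos.hq.example.local"

    def mirror_for(client_ip):
        """Return the base URL of the video server at the client's own site."""
        addr = ipaddress.ip_address(client_ip)
        for subnet, base_url in SITE_MIRRORS.items():
            if addr in subnet:
                return base_url
        return DEFAULT_MIRROR

    # the intranet page handler would then build the video URL like:
    #   src = mirror_for(request_remote_addr) + "/training/intro.mp4"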
I don't know how to achieve exactly what you want, but Hadoop does not do what you desire.
Hadoop provides an infrastructure to process large amounts of data (e.g. log file analysis) in a distributed environment (cluster), but the machines in the cluster are usually connected with high-speed communication links (maybe even in the same server rack).
So I can answer the last part of your question and tell you that Hadoop is not a good fit for your type of problem. You might still want to learn what Hadoop gives you, so that you may be able to use it in another scenario, though.
You might also want to check out Server Fault for an answer to your problem, as it seems you are looking for a more system-administration-oriented answer than a programming one.

What is the most challenging development environment you've ever had to work in and what did you do to get around the limitations?

By 'challenging development environment' I don't mean you're on a small boat that's rocking up and down and someone is holding a gun to your head. I mean, are the tools at your disposal making the problem difficult?
Development is typically a cycle of code, run, observe the result, repeat. In some environments this is a very quick and painless process, but in others it's very difficult. We end up using little tricks to help us observe the result and run the code faster.
I was thinking of this because I just started using SSIS (an ETL tool included with SQL Server 2005/8). It's been quite challenging for me to make progress, mainly because there's no guidance on what all the dialogs mean and also because the errors are very cryptic and most of the time don't really tell you what the problem is.
I think the easiest environment I've had to work in was VB6 because there you can edit code while the application is running and it will continue running with your new code! You don't even have to run it again. This can save you a lot of time. Surprisingly, in Netbeans with Java code, you can do the same. It steps out of the method and re-runs the method with the new code.
In SQL Server 2000 when there is an error in a trigger you get no stack trace, which can make it really tricky to locate where the problem occurred since an insert can have a cascading effect and trigger many triggers. In Oracle you get a very nice little stack trace with line numbers so resolving the problem is very easy.
Some of the things that I see really help in locating problems:
Good error messages when things go wrong.
Providing a stack trace when a problem occurs.
A debug environment where you can pause, output the values of variables, and step through to follow the execution path.
A graphical debug environment that shows the code as it's running.
A screen that can show the current values of variables so you can print to them.
Ability to turn on debug logging on a production system.
What is the worst you've seen and what can be done to get around these limitations?
EDIT
I really didn't intend for this to be flame bait. I'm really just looking for ideas to improve systems so that if I'm creating something I'll think about these things and not contribute to people's problems. I'm also looking for creative ways around these limitations that I can use if I find myself in this position.
I was working on making modifications to Magento for a client. There is very little information on how the Magento system is organized. There are hundreds of folders and files, and there are at least a thousand view files. There was little support available from Magento forums, and I suspect the main reason for this lack of information is because the creators of Magento want you to pay them to become a certified Magento developer. Also, at that time last year there was no StackOverflow :)
My first task was to figure out how the database schema worked and which table stored some attributes I was looking for. There are over 300 tables in Magento, and I couldn't find out how the SQL queries were being done. So I had just one option...
I exported the entire database (300+ tables, and at least 20,000 lines of SQL) into a .sql file using phpMyAdmin, and I 'committed' this file into the Subversion repository. Then, I made some changes to the database using the Magento administration panel and re-downloaded the .sql file. Then I ran a DIFF using TortoiseSVN and scrolled through the 20k+ line file to find which lines had changed, LOL. As crazy as it sounds, it did work, and I was able to figure out which tables I needed to access.
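(If I were doing it again, I'd automate that last step with a small script instead of scrolling through the diff by hand; roughly something like the sketch below, with the dump file names made up:)

    import difflib

    # two full dumps, exported before and after clicking around the admin panel
    with open("dump_before.sql") as f:
        before = f.readlines()
    with open("dump_after.sql") as f:
        after = f.readlines()

    changed_tables = set()
    for line in difflib.unified_diff(before, after, lineterm=""):
        if line.startswith(("---", "+++", "@@")):
            continue  # skip the diff headers
        if line.startswith(("+", "-")) and "INSERT INTO" in line:
            # crude, but enough to see which tables the admin-panel change touched
            table = line.split("INSERT INTO", 1)[1].split()[0].strip("`")
            changed_tables.add(table)

    for table in sorted(changed_tables):
        print(table)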
My 2nd problem was, because of the crazy directory structure, I had to ftp to about 3 folders at the same time for trivial changes. So I had to keep 3 windows of my ftp program open, switch between them and ftp each time.
The 3rd problem was figuring out how the url mapping worked and where some of the code I wanted was being stored. Here, by sheer luck, I managed to find the Model class I was looking for.
Mostly by sheer luck and other similar crazy adventures I managed to work my way through and complete the project. Since then, StackOverflow has started, and thanks to a helpful answer to this bounty question I was finally able to get enough information about Magento to do future projects in a less crazy manner (hopefully).
Try keypunching your card deck in Fortran, complete with IBM JCL (Job Control Language), handing it in at the data center window, coming back the next morning and getting an inch-thick stack of printer paper with the hex dump of your crash, and a list of the charges to your account.
Grows hair on your fingernails.
I guess that was an improvement on the prior method of sitting at the console, toggling switches and reading the lights.
Occam on a 400x transputer network. As there was only one transputer that could output to the console, debugging was a nightmare. I had to build a test harness on a Sun network.
I took a class once, that was loosely based on SICP, except it was taught in Dylan rather than Scheme. Actually, it was in the old Dylan syntax, the prefix one that was based on Scheme. But because there were no interpreters for that old version of Dylan, the professor wrote one. In Java. As an applet. Which meant that it had no access to the filesystem; you had to write all of your code in a separate text editor, and then paste it into the Dylan interpreter. Oh, and it had no debugging facilities, of course. And being a Dylan interpreter written in Java, and this was back in 2000, it was ungodly slow.
Print statement debugging, lots of copying and pasting, and an awful lot of cursing at the interpreter were involved.
Back in the '90s, I was developing applications in Clipper, a compilable dBase-like language. I don't remember if it came with a debugger; we often used a third-party debugger called 'Mr Debug' (really!). Although Clipper was fast, some of our more intensive routines were written in C. If you prayed to the correct gods and uttered the necessary incantations, you could use Microsoft's CodeView debugger to debug the C code. But usually not for more than a few minutes, as the program usually didn't like to spend much time running under CodeView (usually memory problems).
I had a series of makefile switches that I used to stub out the sections of code that I didn't need to debug at the time. My debugging environment was very sparse so there was as much free memory as possible for the program. I also think I drank a lot more back then...
Some years ago I reverse engineered game copy protections. Because the protections were written in C or C++, they were fairly easy to disassemble, and it was easy to understand what was going on. But in some cases it got hairy when the copy protection took a detour into the kernel to obfuscate what was happening. A few of them also started to use custom-made virtual machines to make the problem less understandable. I spent hours writing hooks and debuggers to be able to trace into them. The environment really rewarded a competitive and innovative mind. I had everything at my disposal save time. Mistakes caused reboots and gave very little feedback about what went wrong. I realized thinking before acting is often a better solution.
Today I despise debuggers. If the problem is in code visible to me, I find it easiest to use verbose logging. (Sometimes the error is in not understanding the interface/environment; then debuggers are good.) I have also realized time is of the essence. You need a good working environment with the ability to test your code instantly. If your compiler takes 15 seconds, your environment takes 20 seconds to update, or your caches take 5 minutes to clear, find another way to test your code. Progress keeps me motivated, and without a good working environment I get bored, angry and frustrated.
The last job I had, I was a Sitecore developer. Bug fixing can be very painful when the bug only occurs on the client's system, they do not have Visual Studio installed, remote debugging is turned off, and the problem only happens on the production server (not the staging server).
The worst in recent memory was developing SSRS reports using Dundas controls. We were doing quite a bit with the grids which required coding. The pain was the bugginess of the controls, and the lack of debugging support.
I never got around the limitations, but just worked through them.
