Schema deployment management for Athena - continuous-integration

To apply DevOps principles to data (ugh, DataOps!), things like continuous deployment need to be considered.
Hence tools like dbDeploy exist. However, dbDeploy seems to have been orphaned and is no longer maintained. In the past I've used this tool again and again, but I don't see much support for it any more, and I'm not sure why.
So I'm wondering: what do people use to manage and version their schemas? In particular I'm looking for something that will work with Athena (which has a JDBC driver, so in theory any JDBC-compliant tool should do).
I know one answer may be to switch mindset and use the AWS Glue crawlers instead. But do people actually do that, or are the crawlers more for POC/quick-start situations? I'm pretty sure you'll always want to override some of the decisions a crawler makes, so how is that handled?
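For context, the sort of thing I'd otherwise end up hand-rolling is a simple migration runner that applies versioned DDL scripts to Athena in order. A rough sketch with boto3 (bucket, database, and folder names are made up):

```python
# Rough sketch: apply versioned DDL scripts to Athena in order.
# Assumes boto3 credentials are configured; all names are illustrative.
import time
from pathlib import Path

import boto3

athena = boto3.client("athena", region_name="eu-west-1")

def run_ddl(sql: str) -> None:
    """Submit one DDL statement to Athena and wait for it to finish."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/ddl/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"DDL statement finished with state {state}")

# Migration scripts named 001_create_events.sql, 002_add_column.sql, ...
for script in sorted(Path("migrations").glob("*.sql")):
    run_ddl(script.read_text())
```

This works, but it's exactly the bookkeeping (which scripts have already run, in what order, against which environment) that I'd hoped a maintained tool would own.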

Related

Data Transformations in Snowflake - Views, Tools, etc.?

We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems: ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of that methodology, most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I've just been listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it suggests that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the 'T' part of things!
Just wondering what people's experience is here:
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third-party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only?
- Use a third-party tool?
I'm specifically interested in whether you do this to create some kind of time series in Snowflake from a non-time-series source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially without knowing your use case. For what it's worth, here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want to go for something more sustainable; this approach can become prohibitive in terms of cost and the complexity of your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there like Stitch, Fivetran, etc. If you don't mind having on-premise technologies, why not stick with SSIS, or use something like Apache Airflow (which requires up-skilling)?
Snowflake will not help you with the E of ELT; you will need a third-party tool, such as SSIS, to manage the extraction of data from your other systems. It will help with the L part: for this you can use Snowpipe or COPY commands, which are available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
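To give a rough feel for the L step, loading already-staged files with a COPY command via the Python connector looks something like the sketch below. Connection details, stage, and table names are placeholders, not anything from your environment:

```python
# Rough sketch of the "L" in ELT: copy staged files into a raw table;
# the "T" then happens downstream in SQL (views, tasks, a tool, etc.).
# Requires the snowflake-connector-python package; all names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="loader",
    password="...",
    warehouse="LOAD_WH",
    database="RAW",
    schema="SALES",
)
try:
    cur = conn.cursor()
    # Files were previously landed in the stage (e.g. an external S3 stage).
    cur.execute("""
        COPY INTO raw_orders
        FROM @orders_stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
finally:
    conn.close()
```

Snowpipe automates essentially the same COPY on a schedule/event basis, which is why it's the usual choice for continuous loading.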
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, against which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need to use an additional tool. It also makes managing the code easier than something like SSIS. For instance, we can search for code using the Snowflake interface or our version control tool instead of having to open individual SSIS packages.
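To make the "views as the T layer" idea concrete, a transformation in that style might look roughly like this, continuing the earlier connector sketch (it assumes an open connection `conn`; every schema, table, and column name here is invented for illustration):

```python
# Illustrative only: a "T" step expressed as a layered view.
# Assumes an open connection `conn` as in the earlier sketch; names invented.
cur = conn.cursor()
cur.execute("""
    CREATE OR REPLACE VIEW analytics.sales.v_daily_orders AS
    SELECT
        o.order_date::DATE AS order_day,
        c.region,
        COUNT(*)           AS order_count,
        SUM(o.amount)      AS total_amount
    FROM raw.sales.raw_orders o
    JOIN raw.sales.raw_customers c
      ON c.customer_id = o.customer_id
    GROUP BY 1, 2
""")
```

Reporting tools then query the view directly; the trade-off, as noted above, is that compute cost and view complexity grow as the model deepens.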

NoSQL for multi-site archival logging with full-text search

I'm looking at building a somewhat complex log handling system to replace an old ad-hoc setup and could use a bit of advice. I'm pretty familiar with SQL databases and networking, but am very new to NoSQL stores, which seem to be the key to solving this mess. Note that we have a very good team, but a limited licensing budget, so free/open-source options are vastly preferred. (That said, availability of support if something goes pear-shaped would be nice.)
Requirements:
Archive (test) logs generated in the several GB/day range at multiple sites around the world.
Provide near-instantaneous full-text search of those logs at each site for debugging purposes.
Push that archived data back to a central location (though a replica at each site would be absolutely okay).
Provide for analytics of that data back at the central location.
Constraints:
The sites have fairly crap Internet connections for the moment (high latency and fairly low bandwidth). Much of the data is generated during the day and a good portion of the sync would have to lag behind and finish overnight each day.
Sites MUST be able to function if the WAN goes completely off-line.
Extras
The log data is (as usual) highly compressible. Any solution that compresses data in transit from node to node across the WAN is preferred.
Many log files are related to each other in multi-level hierarchies, and that relationship is very important and must be maintained!
Sites will generally not modify the same data or modify it again once stored. This is all archival for the most part.
We can either stream as the logs are generated or push blocks of logs. Streaming is preferred, as it would simplify things considerably.
Options I'm aware of:
Local MySQL and folder structure for logging and local configuration management.
This is what we have now and it's running, but not a long-term solution by any means.
Elasticsearch
I've read that Elasticsearch would probably be really good for this, though from what I understand it doesn't support multi-site deployments out of the box. (I've sketched the kind of stream-append usage I have in mind after this list.)
Cassandra
This seems to have built-in multi-site support, but I'm not exactly familiar with the data-model. Is this a good choice for something like this, or will I hate myself if I give it a try?
CouchDB
This is a document store that seems(?) like a good match for log data, but again doesn't appear to have multi-site support.
Apache Kafka
I read up on this, but I haven't quite wrapped my head around it yet...
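For what it's worth, the "stream as logs are generated" part looks simple enough with the official Elasticsearch Python client (8.x-style API). A rough sketch, with the index name, field names, and URL all made up:

```python
# Sketch: stream-append log events into Elasticsearch as they are generated.
# Index name, field names, and the local URL are illustrative only.
# Requires the official "elasticsearch" Python client (8.x-style API).
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def append_log_line(site: str, run_id: str, line: str) -> None:
    """Index one log line as its own document the moment it arrives."""
    now = datetime.now(timezone.utc)
    es.index(
        index=f"testlogs-{now:%Y.%m.%d}",   # daily indices for easy retention
        document={
            "@timestamp": now.isoformat(),
            "site": site,
            "run_id": run_id,   # ties the line back to its parent test run
            "message": line,
        },
    )

append_log_line("site-eu-1", "run-42", "DUT power-on self test passed")
```

What I can't tell is whether that pattern survives the multi-site and flaky-WAN constraints above, hence the questions below.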
Questions:
Do any of these actually let you stream-append logs or are they best suited to dumping completed files in?
Is there a solution I'm missing that might be better?
Any recommendations on multi-site with some of the options that don't support multi-site by themselves?
Interesting links:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://blog.cloudera.com/blog/2015/07/deploying-apache-kafka-a-practical-faq/
https://www.elastic.co/blog/scaling_elasticsearch_across_data_centers_with_kafka
https://kafka.apache.org/08/ops.html
https://github.com/Stratio/cassandra-lucene-index
I may be a bit biased, since Couchbase is my employer, but this sounds like the kind of problem that XDCR (Cross Datacenter Replication) was made to solve.
You could stand up a cluster on multiple geographical sites (Couchbase calls these "datacenters") and then XDCR would automatically replicate (bidirectionally) the data between sites. If I understand your requirements correctly, this sounds like just what you need.

What is a good choice for full-text indexing when developing an OS X application?

Hi,
I'm implementing an IMAP client as a Mac OS X application using MacRuby.
For the sake of offline availability, I want to allow full-text indexing and attribute-based indexing of all messages. Attributes include common e-mail fields like from:, to:, etc.
This would allow for advanced results sprinkled with faceting, analytic calculations and such.
Now I'm unsure about the choices and good practices when it comes to integrating such a search feature. I have a strong web-development background, so my intuitive approach would be to set up a Solr server and start feeding it data. This might work in theory, as I could write an agent that manages the Solr instance for my application in the background. But to me, this approach seems like an infrastructure hassle.
On the other side, I've read about people using the FTS3 functionality in SQLite. This approach is easily accessible via Core Data. I haven't used SQLite's FTS3, but I don't think it is as powerful as Solr can be.
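From what I've read, the SQLite side boils down to a virtual table plus MATCH queries, roughly like this (sketched in Python purely to show the syntax; in the app it would go through SQLite/Core Data, and whether the FTS module is compiled in depends on the SQLite build):

```python
# Sketch of SQLite full-text search: virtual table + MATCH queries.
# Python's sqlite3 is used only to illustrate the syntax; field names invented.
# FTS3/FTS4 availability depends on how the linked SQLite was compiled.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE messages USING fts4(
        sender, recipient, subject, body
    )
""")
conn.execute(
    "INSERT INTO messages (sender, recipient, subject, body) VALUES (?, ?, ?, ?)",
    ("alice@example.com", "bob@example.com", "Quarterly report",
     "Numbers attached, see invoice totals."),
)

# Column-scoped terms give the attribute search (from:, to:, subject:, ...);
# bare terms search all indexed columns.
for row in conn.execute(
    "SELECT sender, subject FROM messages "
    "WHERE messages MATCH 'subject:report invoice'"
):
    print(row)
```

So the plumbing is trivial compared to Solr, but I'm unsure whether ranking, faceting, and analytics would hold up.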
What is your weapon of choice for a use case like mine?
I'm mainly interested in solutions that are actually in use by Objective-C/Cocoa/MacRuby developers.
If you're going to develop the app with Ruby, give Picky a try. It is very simple to use.
There is an Objective-C Lucene port
http://svn.gna.org/viewcvs/etoile/trunk/Etoile/Frameworks/LuceneKit/
I have not used it, but in your situation I'd at least check it out. In my experience, SQL-based full-text search can't compete with Lucene, but I haven't tried SQLite for this.
EDIT: just noticed the Ruby tag -- this started out as a port of Lucene:
https://github.com/dbalmain/ferret

Appfabric Caching: Configuration Provider as single point of failure

After doing some initial research into using AppFabric for caching, my understanding is that the configuration provider for the cluster is a single point of failure, as mentioned here:
MSDN
I want to use AppFabric just for distributed caching, particularly for the tagging features. What are the options to avoid having the configuration provider as this failure point? I thought of two, but I'm not sure whether one is better or if there are other options.
(1) Create my own caching service configuration provider. I'm guessing this is possible(?), but I'm not sure how to go about it. I'd probably make a provider that fetches the XML file from S3, since I'm already using AWS.
(2) Configure each cache as a single-node cluster and then create a proxy client that uses the individual nodes as a distributed cache, a la a memcached-style client.
Thoughts or recommendations, or anything else I should consider in making this decision?
Yes, it is a single point of failure.
Microsoft's recommended solutions seem to be:
(SQL Server provider) Use SQL Server clustering. In my limited experience of it, using SQL Server clustering for this is probably a case of 'the cure is worse than the disease', i.e. it brings a lot of pain. Unless you've already got a SQL Server cluster available, avoid it!
(XML provider) Use Windows Server clustering. I have even less knowledge of this than of SQL clustering, so I can't say how well (or otherwise) it might work. It doesn't strike me as a trivial thing to do, though.
You can create your own configuration provider by implementing the ICustomProvider interface and making some registry entries. Using AWS seems like a really good idea to make the config provider resilient, I'd be interested to see how you got on with this.
Creating a proxy client seems to me like you'd be making a lot of work for yourself, at that point it feels like you'd be more fighting against AppFabric rather than working with it.
We also tried AppFabric, but it gave us a fair few headaches; for one, there's no API access, which made it very difficult to keep our current unit-testing strategy. We have since moved to NCache, which is a better option for us than AppFabric: NCache provides a tagging feature and does not have a single point of failure.

What is the risk if I use aspnet_membership?

I have some doubts about aspnet_membership. The database team at my company doesn't let me use aspnet_membership, because the feature uses stored procedures to install things that break company policy. They said this feature could introduce some risk, but I don't know what risk I'd actually be taking on if I used it.
Does anyone have any ideas, or know the reasoning behind this?
Your "database team" is obviously paranoid. What sort of security risk do they anticipate if the aspnet_regsql.exe tool creates a database? I can understand that many Administrators feel uneasy using wizards because they want to know exactly what is going on behind the scenes. In this case, the command line tool allows you a high degree of customizability and should be preferred over the wizard mode.
Perhaps you and your database team should read up on the published implementation of the tool, rather than being obstinate about which you are ignorant. If you still find "Creating the Application Services Database for SQL Server" to be overkill, it is quite possible to create your own MembershipProviders and PersonalizationProviders and incorporate them with your own database structure.
I know this is an old question, but (as most good questions do) it made me think.
I can see the perspective of your database team in certain scenarios:
You are subject to compliance regulations that require complete separation of duties and thorough step-by-step documentation of every change made to an environment. This means that you can't just run any utility (no matter how well-documented or tested) on your own; you must hand it off to another team for approval, and that team must understand (in theory) every potential impact.
Perhaps there is a very granular object by object security policy in place, and your database team feels that auto-generated table structures would violate this.
Not a security breach, but depending on naming standards you have in place, the tables/procedures created by the tool may violate those conventions.
Perhaps you work in finance or healthcare and there is a specific set of security requirements that must be met. Either through ignorance or research, your database team believes that the membership database will not meet these requirements.
I've worked with many companies that have draconian security policies that often cause a great deal of extra time to be spent. Conversely, the consequences of not following those policies can have such dire consequences (multi-million dollar fines here in the US) that caution/paranoia usually wins out.
In summary: if the ASP.Net authentication/authorization scheme is a good fit, then invest the time in educating the database team on its internals so they can be comfortable with it. Listen to their objections; DBAs can be over-protective of their environments, but they often are responsible for key assets of a company.
As @Cerebrus pointed out, there are many different ways to use ASP.Net security. Try to leverage the parts that make sense. If your database team still doesn't "get it" (or raises valid objections), there are plenty of other ways to implement robust security in ASP.Net; they just require more effort.

Resources