Costs for setting up a data integration tool for Salesforce - ETL

I'm writing a report and thought you guys could help by providing me with the cost of company support for setting up and training a client on a data integrator for Salesforce. E.g., if someone wants to use Salesforce but first needs a tool to consolidate and transfer data from back-office systems into Salesforce, how much would that support service cost?

Salesforce actually comes with a very good integration tool called Data Loader. It can be run as an interactive application under Windows or Macintosh, or it can be run as a command-line tool on Windows, Mac or Linux.
In interactive mode, it can import & export CSV files.
In batch mode it can also read data from, and write data to, a database.
For example, I have a Linux server where a daily cron job activates the Data Loader which runs several jobs. Some of these jobs run SQL against a database and upload the resulting data into Salesforce. Other jobs extract from Salesforce (using their SOQL query language, which is SQL-like) and store the information into a database.
Data Loader has a bit of a learning curve for batch mode (mostly around creating some XML configuration files), but the Interactive mode is very easy to use.
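For a feel of what batch mode involves, here is a minimal, hypothetical sketch of the kind of SQL one of those nightly jobs might run against a back-office database before Data Loader uploads the result into Salesforce; the schema, columns, and Salesforce field names are purely illustrative.

```sql
-- Hypothetical nightly extract: pull recently changed customers from a
-- back-office table, aliased to the Salesforce fields they will map to.
-- (Date arithmetic syntax varies by database.)
SELECT customer_id   AS "External_Id__c",
       company_name  AS "Name",
       billing_city  AS "BillingCity"
FROM   back_office.customers
WHERE  updated_at >= CURRENT_DATE - 1;
```

The credentials, the operation (insert, upsert, extract), and the field mappings live in Data Loader's XML configuration files, which is where most of the batch-mode learning curve sits.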
So, to answer your question... If it's a one-time data load, just run the interactive version and it's easy. If you want regularly updated data, then use the batch mode. Support costs for operating the integration are really all in the setup. Once it's running, there shouldn't be any ongoing costs unless the data structures change and you want to change the data being transferred. Better yet, if the system is set up by somebody who has done it before, you'll avoid a big learning curve.
If you want a figure to put into your report, then allow three days for the initial integration (which allows for the learning curve) and then half a day for each additional one. That's generous, but it provides extra time to debug problems.

To some degree, it depends on two factors:
Where is the data's source of truth?
How often do you want to sync the data?
If the answers are "it's a weird place and I only need to sync it once," then you probably want to figure out how to get it in CSV form and then use tools built into Salesforce to import it.
However, if the data lives in a database or data warehouse (Postgres, MySQL, Mongo, Redshift, Snowflake, BigQuery, etc.), and especially if you want to keep Salesforce continuously up to date with that source of truth, then you could look into so-called "Reverse ETL" tools made for this purpose.
Costs depend on the tool chosen and the data volumes and other factors, but here are some options:
Grouparoo is an open source Reverse ETL tool. You can host it yourself for free. Paid plans start at $150/month.
Census is a SaaS Reverse ETL tool. Paid plans start at $300/month.
Hightouch is a SaaS Reverse ETL tool. Paid plans start at $350/month.

Related

Data Transformations in Snowflake - View, Tools etc?

We're considering Snowflake and want to understand how we could use it, and possibly other tools, to overcome one of our main problems - ETL! We currently use a legacy DWH with an ETL process consisting of SSIS and some views. This has all the common pitfalls of this methodology - most notably that it takes ages!
I was under the assumption that we'd move to an ELT model in Snowflake, so I started to research tools to do the 'T' part of it. However, I've just been listening to this podcast: https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/
And it's suggesting that just slapping a SQL view over something and exposing it in, say, Power BI or Tableau is enough for the T part of things!...
Just wondering what people's experience was here?
- Do you do transformations just by writing a view in Snowflake?
- Do you use a third party tool specifically to address this need?
Secondary to this, for the Extraction and Loading, do you:
- Do this using Snowflake only
- Use a third party tool
I'm specifically interested if you do this to create some kind of timeseries in Snowflake from a non timeseries source. That's something we'd be keen to do.
This question is hard to answer without sounding opinionated, especially not knowing your use case. For what it's worth here is what I think:
Don't stick views on top of your tables and expose them to a reporting tool unless you have a very, very simple setup. If you're considering a tool like Snowflake then you will probably want to go for something more sustainable; this approach can become prohibitive in terms of cost and the complexity of your views.
Use a third-party tool to manage your ELT process. Your choice of tool will depend on your internal skills and cloud strategy; have a look at the tools out there like Stitch, Fivetran, etc. If you don't mind having on-premises technologies, why not stick with SSIS, or use something like Apache Airflow (which requires up-skilling).
Snowflake will not help you with the E of ELT; you will need a third-party tool, such as SSIS, to manage the extraction of data from your other systems. It will help with the L part: for this you can use Snowpipe or COPY commands, which are available within the Snowflake ecosystem. Snowflake will also help you share your data with external parties, which is really nice.
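As a rough illustration of that L step, a COPY command that loads staged files into a raw table might look something like the following; the stage, table, and file-format details are placeholders.

```sql
-- Hypothetical bulk load from a Snowflake stage into a raw landing table.
COPY INTO raw.orders
FROM @ingest_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';
```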
My organization has created a fairly complicated dimensional model in Snowflake using layers of SQL views, against which we can point our reporting tools. We use a separate replication tool for extraction from source systems and loading into Snowflake. Using views simplifies our approach in that we don't need to use an additional tool. It also makes managing the code easier than something like SSIS. For instance, we can search for code using the Snowflake interface or our version control tool instead of having to open individual SSIS packages.
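To make the view-based approach concrete, and to touch on the time-series question above, here is a minimal sketch of a transformation view that rolls a non-timeseries table up into a daily series; the schema and column names are invented for illustration.

```sql
-- Hypothetical transformation view: turn individual order rows into a
-- daily time series that a reporting tool can point at directly.
CREATE OR REPLACE VIEW reporting.daily_order_totals AS
SELECT
    DATE_TRUNC('day', created_at) AS order_date,
    COUNT(*)                      AS order_count,
    SUM(amount)                   AS total_amount
FROM raw.orders
GROUP BY DATE_TRUNC('day', created_at);
```

Views like this are cheap to version control, but as the earlier answer notes, deep stacks of them can become costly to run and hard to reason about.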

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through a few technologies where I can automate or set the flow of data between systems, and found that there are a couple of them, i.e. Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use-cases where they can be used. I am new to this, and if anyone could explain it to me a bit it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of StreamSets is relatively limited, so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats and helpful in-line documentation, and more. Almost all other systems by comparison have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive, change-by-change feedback that is so vital to quickly building new flows or to safely and efficiently correcting or improving the handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when it is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working on a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows and let things run, and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow and replay the object, then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out-of-the-box experience offers very powerful performance, even on really modest hardware or virtual environments. This is because of the flowfile and content repository design, which gives us both the high performance and the transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write-ahead-log implementation, and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes), or we can transform data by simply reading from the original and writing out a new version. Again, very efficient. Couple that with the provenance capability I mentioned a moment ago and it provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like the size of the data involved. The NiFi API was built to honor that fact, so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written, so again you get really helpful information for establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally, and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the StreamSets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms, but the reality is that when data is in the flow it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the StreamSets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally into some special NiFi format, nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that, because it would mean that even the most trivial of cases would have problematic performance implications; luckily NiFi does not have that problem. Further, had we gone that route, handling diverse datasets like media (images, video, audio, and more) would be difficult, but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to check out:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions, the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant of NiFi called Hortonworks DataFlow (HDF).
While both have a lot of similarities, such as a web-based UI, and both are used for ingesting data, there are a few key differences. They also both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. What this means is that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, and also in a clustered mode, but in that case it runs atop Spark on YARN/Mesos, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDF) is more mature and StreamSets is more lightweight.
Both are easy to use, and both have strong capabilities.
They both have companies behind them, Hortonworks and Cloudera.
Obviously there are more contributors working on NiFi than on StreamSets, and of course NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc., but the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract: https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer: https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have." which imposes a bit of restriction if you want to deploy your dataflows further toward the Edge where the Devices that are generating the data live. Apache MiniFi, a sub-project of NiFi can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Measuring Application Performance

I was wondering if there is a tool to keep track of application performance. What I have in mind is a tool that will listen for updates and register performance metrics published by an application, e.g. the time to serve a request or the time a certain operation took to finish. This tool would then aggregate the data and measure performance trends.
If you want to measure your application from outside, then you can use RRDtool to collect the data.
You can use slamd for webapps written in Java.
For Django, use hotshot.
Search for "profiler" plus your language or framework.
Take a look at HP SiteScope. Its ability to drive the system with a web user script, to monitor the metrics on the backend (even to the extent of creating custom shell scripts and database queries), plus the ability to add logic for reports and alerts against these combined data sets, appears to be what you need.
Other mechanisms that you might consider would be a roll-your-own service using cURL to push information in, queries to the systems involved to pull metrics or database information, and then your own interface for alerting and reporting.
Then it becomes a cost question: can you build that level of functionality for less money than purchasing an already existing solution on the open market?
Ref:
HP SiteScope Wiki Page

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services or just using database stored procedures. Obviously, performance and stability is a major issue that I am dealing with. Even when the endpoints are performing at their best, for large sets of data, I easily see calls that take 10s of seconds.
So, I am trying to improve the performance of my application by prefetching the data and storing it locally, so that at least the read operations are fast.
While my application is the major consumer and producer of data, some of the data can also change from outside my application, which I have no control over. If I use caching, I would never know when to invalidate the cache when such data changes from outside my application.
So I think my only option is to have a job scheduler running that continually updates the database. I could prioritize the users based on how often they log in and use the application.
I am talking about 50 thousand users and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the scheduler becoming a single point of failure?
I am just looking for something that doesn't require high maintenance and speeds up at least some of the less complicated subsystems - if not most. Any suggestions?
This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need realtime access to the most up to date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
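To give a flavor of what that periodic refresh could look like in SQL, here is a minimal, hypothetical sketch: a scheduled job lands the latest rows pulled from one slow external source in a staging table and then upserts them into the local read table (all table and column names are invented).

```sql
-- Hypothetical hourly refresh: upsert rows fetched from a slow external
-- endpoint (already landed in a staging table) into the local read store.
MERGE INTO local_cache.accounts tgt
USING staging.accounts_latest   src
   ON (tgt.account_id = src.account_id)
WHEN MATCHED THEN UPDATE SET
     tgt.balance    = src.balance,
     tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (account_id, balance, updated_at)
     VALUES (src.account_id, src.balance, src.updated_at);
```

Reads then hit the local table directly, so they no longer depend on the external endpoint's response time.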
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008

How do I deploy an Oracle database?

I have an ASP.NET application that connects to an Oracle or a SQL Server database. An installer has been developed to install a fresh database to an existing SQL Server using SQL commands such as "restore database...", which simply restores a ".bak" file that we keep under source control.
I'm very new to Oracle and our application has only recently been ported to be compatible with 10g.
We are currently using the "exp.exe" tool to generate a ".dmp" file and then using the "imp.exe" tool to import it into a developer's box.
How would you go about creating an "Oracle Database Installer"?
Would you create the database using script files and then populate the database with required default data?
Would you run the "imp.exe" tool behind the scenes?
Do we need to provide a clean interface for system administrators so that they can just select the destination server and be done, or should we just provide them with the ".dmp" file? What are the best practices?
Thanks.
The question is -- what do your customers know about Oracle?
Nothing? You should probably rethink this position. Oracle is very large and complex. If you assume your customers know nothing, you'll then start providing tutorials and help that's inappropriate.
Minimally Competent? If they're competent, they know enough to run imp by themselves. Also, they know enough to run a script that executes SQL.
Actual DBA's? Most organizations that can afford Oracle can afford real DBA's. Real DBA's can cope with a lot of things -- they do not need much hand-holding. Some of them like to assign storage parameters according to their shop standards.
You should provide a script with reasonable defaults. You should define your script in a way that someone can easily find all of your storage parameters and tweak them if necessary.
Your initial data can be via export/import or via a script. I prefer a script.
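As a loose illustration of such a script, here is a hypothetical fragment with the storage parameters kept obvious so a DBA can find and override them; all names and sizes are placeholders, not recommendations.

```sql
-- Hypothetical installation fragment; storage parameters are deliberately
-- visible so a DBA can adjust them to shop standards.
CREATE TABLESPACE app_data
  DATAFILE 'app_data01.dbf' SIZE 500M AUTOEXTEND ON NEXT 100M MAXSIZE 4G;

CREATE USER app_owner IDENTIFIED BY "change_me"
  DEFAULT TABLESPACE app_data
  QUOTA UNLIMITED ON app_data;

GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW, CREATE SEQUENCE TO app_owner;

CREATE TABLE app_owner.claims (
  claim_id   NUMBER        PRIMARY KEY,
  status     VARCHAR2(20)  NOT NULL,
  created_at DATE          DEFAULT SYSDATE
) TABLESPACE app_data;
```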
I have done this repeatedly from both sides (consumer and provider) as a DBA, developer, and architect.
As a provider, one of my grand accomplishments (in 1996) was the creation of an installation CD for a commercial insurance claims management software product targeted to the largest insurance carriers (a multi-million dollar item). That installation CD installed the Oracle 7.2 RDBMS engine, the FileNet optical storage system (scans paper documents and creates cataloged binary versions), and our custom claim-processing application (built in VB 4.0), all integrated and ready to run. As part of the installation process, the user could skip the Oracle software installation or customize it, and the user could customize/override the database configuration in all of its major details (database, schemas, tablespaces, sizes, disks, etc.).
I also provided the field service for this product, which included traveling to the client site as necessary. I tested the installation CD literally hundreds of times under every imaginable scenario that I could replicate, and we NEVER had a field failure that required even a phone call, let alone a trip (I did travel on four occasions, but for pre-sales stuff instead).
More recently (2007), I scripted the creation of an Oracle 10g database for an internal system at a megacorp. In production, the database was sized at 8 TB, mostly for a single transaction table with high data volume. In test, the database was sized around 1 TB for a modest server. In development, the database was sized around 100 MB to run on my laptop. The EXACT SAME SCRIPTS created all three environments, and I could extend them to handle a new environment/machine in about five minutes. This database involved extreme performance tuning, so customization of all pertinent characteristics was absolutely crucial.
Back to the insurance claims processing product--let me please add that I was originally hired to lead its conversion from a SQL Server database to an Oracle database. That conversion was identified as a business necessity because most potential clients did not view a SQL-Server-based product as a professional, serious solution. That is not quite as common today, but it still applies in general: a software product has a better chance of market penetration if it can accommodate multiple database options as preferred by the target customers (especially enterprise-class customers).
Likewise, the installation CD was also viewed as an essential element. However, that situation and many more have revealed to me that most "real" DBAs will not accept an import-based database installation. As a DBA and architect, I know that I definitely will not for the same reasons.
Simply put, an import-based database installation gives the customer almost no control over the resulting database. It is opaque to the customer, leaving them questioning what it did. It forces the customer to expend massive efforts to attempt to exercise what little control they can. It is notoriously fragile and error-prone (Oracle imports are well known for ownership and permission problems, constraint problems, etc.). Weighing all those impacts, an import-based database installation is unprofessional--it does not put the customers' needs first.
Scripting the database installation provides the right kind of transparency, configurability, selective repeatability, and overall customer control that professionalism demands. It also encourages you to properly understand the impacts of your database design decisions in a way that an import does not.
Best wishes.
Personally I favour SQL scripts for database creation and data loads where possible. I tend to use PL/SQL Developer; it has some good options to generate scripts from an existing database. Once you have these, you can run the scripts using sqlplus or any application code that can execute arbitrary SQL (e.g. JDBC with Java). Toad is the more common (and more expensive) tool for Oracle development.
The only limitation of a SQL export is that it can't export CLOB/BLOB fields. If you have those, you either need to do them separately (as a PL/SQL export) or do the whole thing as a PL/SQL export. There's no drama with this, except that the file is effectively a binary export (extension .pde) and is more limited in how you can execute it.
The other big advantage of SQL source files is they can be version controlled easily. It's really handy to be able to create a database environment by running one or two scripts.
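For example, a minimal driver script (file names are hypothetical) that builds an environment in a single sqlplus run might look like this:

```sql
REM Hypothetical install driver, run as: sqlplus app_owner/password@DEV @install.sql
WHENEVER SQLERROR EXIT SQL.SQLCODE
REM Schema objects: tables, indexes, constraints
@create_schema.sql
REM Reporting views
@create_views.sql
REM Required default/reference data
@load_reference_data.sql
COMMIT;
EXIT;
```

Keeping each piece in its own file also makes diffs in version control easy to review.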
The import and export tools for Oracle I think are more applicable for backup and restore operations.
Now, as for delivering that to a customer, from your comments it seems that you'll be giving this to DBAs. Pretty much any Oracle installation will have DBAs involved. They will be fine with SQL scripts to create the schema and do the data load. They will be doing a lot of site-specific configuration (eg tuning the SGA, temp tablespaces, # of concurrent connections, etc based on expected load).
You, as the vendor, can give guidance on any relevant configuration, and you may get involved in support and possibly installation, but ultimately it's up to them to figure out what works for them. Oracle runs on a large number of operating systems and hardware variants, with infinite variations in network topology and firewall configuration. You can't factor all of these into an installer or even a set of instructions (other than the guidelines mentioned previously).
The last time I was involved in the creation of an (Oracle) db (for a reasonably large company with in-house DBAs), the DBAs wanted to know things like:
what we wanted to call the db,
what tablespaces we would need, and an estimate of how much data would be in each one
how many users would be connecting.
(From memory) they set up the db and tablespaces, then we provided a combination of simple scripts that they could run (or clear instructions if a task wasn't easy to automate)
As I say, this was for an in-house app, so your mileage may vary, but in my case they wanted all instructions clearly spelt out so that (a) there was no possibility of a misunderstanding leading to the wrong thing being done, and (b) there was no culpability on their part if something didn't work ("we were just following the instructions").
