Does DigibeeCTL return a sql statement in the flow spec? - ipaas

I need to make a major analysis in the queries my pipelines are currently making against my data store.
I am looking for extract all the statements from all of the pipelines, 40+ with about 100 statements.
Is the anyway to retrieve this information in bulk? Any other possibility?
Thanks!

Related

Impala query with LIMIT 0

Being production support team member, I investigate issues with various Impala queries and while researching on an issue , I see a team submits an Impala query with LIMIT 0 which obviously do not return any rows and then again without LIMIT 0 which gives them result. I guess they submit these queries from IBM Datastage. Before I question them why they do so.. wanted to check what could be a reason for someone to run with LIMIT 0. Is it just to check syntax or connection with Impala? I see a similar question discussed here in context of SQL but thought to ask anyway in Impala perspective. Thanks Neel
I think you are partially correct.
Pls note, limit will process all the data and then apply limit clause.
LIMIT 0 is mostly used to -
to check if syntax of SQL is correct. But impala do fetch all the records before applying limit. so SQL is completely validated. Some system may use this to check out the sql they generated automatically before actually applying it in server.
limit fetching lots of rows from a huge table or a data set every time you run a SQL.
sometime you want to create an empty table using structure of some other tables but do not want to copy store format, configurations etc.
dont want to burden the hue/any interface that is interacting with impala. All data will be processed but will not be returned.
performance test - this will somewhat give you an idea of run time of SQL. i used the word somewhat because its not actual time to complete but estimated time to complete a SQL.

How much of Talend functionality is translated in SQL-Query and how much in Java?

I am facing an internship and they asked me to learn how to use talend ETL.
I did it, not so difficult.
One of the extra-tasks that have been assigned to me is to verify how much of the operations I set on the design workspace is executed in java and what is done through the use of queries.
I've set up a simple Join using the TMap component and I monitored the SQLdatabase through the use of SQL Profiler. the result is that only the essential create/drop and the select/insert of the table is done via sql while every other thing like the actual join is made "Java" side.
As long as it is an simple operation like join, wouldn't it be convenient to execute it through a query without having to bother java to perform it?
For those who also know SAP, in terms of performance is there so much difference between Talend and SAP?
Only operations in tDB components (create,select,insert, etc) are actually done through SQL. All operations done in other talend components (tMap, tFilter, aggregate, etc) are done through java.
Indeed you'll have better performances doing operations SQL-side. You then have to find the right balance between an "all-in-sql" type of job and an "all-java" one. (it could be harder for a talend developer to debug operations if all the sql part is done through a unique query inside a single component...).
You could definitely have your joins inside a tDBInput component, and output the result in a single output flow.
You can also check ELT* components : they let you use SQL-engine instead of java-engine to perform all operations (join,aggregate,filter) while using a talend interface.

Which Nifi processor to use for RDBMS Extract

i will explain my use case to understand which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving 5-10 tables in joins etc with multiple causes. Have around 20-30 such statements overall.
All these extract queries might be required to run multiple times a day with varying frequencies each day. It depends on how many times we receive data from source system or other cases.
We are planning to use Kafka to publish a message to let Nifi workflow know whenever a RDBMS table is updated and flow needs to be triggered (i can't just trigger Nifi flow based on "incremental" column value, there might only be all row update scenarios and we might not create new rows in tables).
How should i go about designing my Nifi. There are ExecuteSQL/GenerateTableFetch/ExecuteSQLRecord/QueryDatabaseTable all sorts of components available. Which one is going to fit my requirement best?
Thanks!
I am suggesting that you use ExecuteSQL. You can set query from attribute or compose it using attribute. Easiest way is to create json and then parse that json and create attributes. Check this example, here is sql created from file you can adjust it to create it from kafka link

GraphQL and nested resources would make unnecessary calls?

I read GraphQL specs and could not find a way to avoid 1 + N * number_of_nested calls, am I missing something?
i.e. a query has a type client which has nested orders and addresses, if there are 10 clients it will do 1 call for the 10 clients + 10 calls for each client.orders + 10 calls for each client.addresses.
Is there a way to avoid this? Not that it is not the same as caching an UUID of something, those are all different values and if you GraphQL points to a database which can make joins, it would be pretty bad on it because you could do 3 queries for any number of clients.
I ask this because I wanted to integrate GraphQL with an API that can fetch nested resources in an efficient way and if there was a way to solve the whole graph before resolving it would be nice to try to put some nested stuff in just one call.
Or I got it wrong and GraphQL is meant to be used only with microservices?
This is one of the difficulties of GraphQL's "resolver architecture". You must avoid incurring a ton of network latency by doing a lot of I/O in each resolver. Apps using a SQL DBMS will often grapple with the N + 1 problem at first. You need to use some batching and/or caching techniques to get around this.
If you are using Node.js on the server, I have two tools to recommend:
DataLoader - A database-agnostic tool for batching resolvers for each field and caching individual records.
Join Monster - A SQL-tailored tool that reads each query and your schema and compiles a SQL query for you. It leverages JOINs and DataLoader-style batching to fetch the data from your tables in a few (or a single) SQL queries.
I consider, that you're talking about using GraphQL with SQL database backend. The standard itself is database agnostic, and it doesn't care, how are you going to work out the problems of possible N+1 SELECT issues in your code. That being said, the specific server-side implementations of GraphQL server introduce many different ways of mitigating that problem:
AFAIK, Ruby implementation is able to to make use of Active Record and gems such as bullet to apply horizontal batching of executed database calls.
JavaScript implementation may make use of DataLoader library, which have similar techinque of batching series of executed promises together. You can see it in action here.
Elixir and Python implementations have concept of runtime info about executed subqueries, that can be used to determine which data will be further needed in order to execute GraphQL query, and potentially prefetch it.
F# implementation works similar to Elixir, but plugin itself can perform live analysis of execution tree to better describe, which fields can be potentially used in code, allowing for easier split of GraphQL domain model from database model.
Many implementations (i.e. PostGraph) tie underlying database model directly into GraphQL schema. In this case GQL query is often translated directly into database query language.

how to reduce the database's pressure

I have a database(sql server 2005),now there are about 100000 records in the table called users, when I do query use linq to sql, it is very slower and slower.how can I do some operate to improve the speed?
Analyse your query and add some indexes to your table may help.
To get a more specific answer post more specific information (table stucture, indexes you have, the sql code L2S generates, ...)
You could (in order of preference)
Save your query as a stored procedure
Add indexes to your users
table, for what you are querying for/sorting for
Analyze your query
(if it is complicated), see if there's a less-resource-intensive way
of doing it. There are graphical query analyzers to help you.
As a last resort, not use LINQ, but instead ADO.NET Entity Framework, it's significantly faster. But you'll only see performance improvements for crazy stuff, and only if you've already done all of the above.
Use stored procedures and then use linq to sql to get the desired rows, this will give performance.
The best tools at your disposal for analyzing your database access and seeing what needs to be optimized are:
SQL Server Profiler
Graphical Execution Plans
The first one will allow you to see the exact queries being sent to your database from your application, which is especially useful if it turns out that your application is chattier than you think. The second one will allow you to take those queries and see exactly what the SQL server is doing with them.
In the graphical execution plan, look for steps which use a lot of CPU and paths which transfer a lot of records. Those are what you'll want to optimize. It's possible that you're doing a table scan somewhere, which is slow, or maybe joining on many more records than you need somewhere, which is slow, etc.

Resources