Optimizing multiple calls to Datastore (Joining multiple entities?) - go

Currently our system is designed in a very ad-hoc manner. There are cases where we have datastore entities designed as
NameSpace: ProjectName
Kind: <SpecificUseCaseLikeSQLTables>
Then there are cases where we have defined our entities as
Namespace: <SomeKeyWhichUniquelyDefineAnObject>
Kind: SpecificUseCaseLikeSQLTables
Now we are in a situation where a single call from a user takes around 10 seconds to respond. I am looking into that function, and it looks like we end up fetching multiple entities for one specific use case. Right now I am trying to see how many of those entities can be fetched only once (i.e., if an entity has not changed, it should be passed down to nested functions rather than being fetched again). Besides that, I am wondering whether there is a way to issue a single query to Datastore that fetches data from multiple namespaces/kinds (as described above).
In layman's terms, I am asking: is there a concept of joins in Datastore? Or an alternative to it?

Joins are not supported in the GAE Datastore. For how relationships between entities are modeled instead, you could check the following documentation (http://code.google.com/appengine/docs/java/datastore/jdo/relationships.html).
If you are looking for RDBMS style databases you can try using Cloud SQL (https://developers.google.com/cloud-sql/docs/introduction).
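Without joins, the other idea from the question still applies: fetch each entity at most once per request and hand it down to nested functions. Here is a minimal sketch of that, assuming the existing datastore lookup can be wrapped in a function. The names `entityKey`, `fetchFunc`, `requestCache` and `countingFetcher` are hypothetical and not part of any Datastore API, and the fake fetcher only counts calls.

```go
package main

import "fmt"

// entityKey identifies an entity by namespace, kind and name.
// It is a hypothetical stand-in for a real datastore key.
type entityKey struct {
	Namespace, Kind, Name string
}

// fetchFunc is an assumed signature for the lookup the application already has.
type fetchFunc func(k entityKey) (string, error)

// requestCache memoizes lookups for the lifetime of one user request.
type requestCache struct {
	fetch fetchFunc
	seen  map[entityKey]string
}

func newRequestCache(fetch fetchFunc) *requestCache {
	return &requestCache{fetch: fetch, seen: make(map[entityKey]string)}
}

// Get returns the cached entity if present, otherwise fetches and stores it.
func (c *requestCache) Get(k entityKey) (string, error) {
	if v, ok := c.seen[k]; ok {
		return v, nil
	}
	v, err := c.fetch(k)
	if err != nil {
		return "", err
	}
	c.seen[k] = v
	return v, nil
}

// countingFetcher returns a fake fetcher and a pointer to its call counter.
func countingFetcher() (fetchFunc, *int) {
	calls := 0
	return func(k entityKey) (string, error) {
		calls++
		return k.Kind + "/" + k.Name, nil
	}, &calls
}

func main() {
	fetch, calls := countingFetcher()
	cache := newRequestCache(fetch)
	k := entityKey{Namespace: "ProjectName", Kind: "User", Name: "42"}
	for i := 0; i < 3; i++ {
		if _, err := cache.Get(k); err != nil {
			fmt.Println("fetch failed:", err)
			return
		}
	}
	fmt.Println("datastore calls:", *calls) // three Gets, one underlying fetch
}
```

The cache is created per request and passed to nested functions in place of the raw lookup. As written it is not safe for concurrent use; a mutex would be needed if several goroutines share it.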

Related

Apollo GraphQL DataLoader DynamoDb

I'm new to GraphQL and am reading about the N+1 issue and the dataloader pattern to increase performance. I'm looking at starting a new GraphQL project with DynamoDB for the database. I've done some initial research and found a couple of small NPM packages for dataloader and DynamoDB, but they do not seem to be actively supported. So it seems to me, from my initial research, that DynamoDB may not be the best choice for supporting an Apollo GraphQL app.
Is it possible to implement dataloader pattern against DynamoDb database?
Dataloader doesn't care what kind of database you have. All that really matters is that there's some way to batch up your operations.
For example, for fetching a single entity by its ID, with SQL you'd have some query that's a bit like this:
select * from product where id = SOME_ID_1
The batch equivalent of this might be an in query as follows:
select * from product where id in (SOME_ID_1, SOME_ID_2, SOME_ID_3)
The actual mechanism for single vs batch querying varies depending on what database you're using. It may not always be possible, but it usually is. A quick search shows that DynamoDB has BatchGetItem, which might be what you need.
Batching up queries that take additional parameters (such as pagination, or complex filtering) can be more challenging and may or may not be worth investing the effort. But batching anything that looks like "get X by ID" is always worth it.
In terms of finding libraries that support Dataloader and DynamoDB in particular, I wouldn't worry about it. You don't need this level of tooling. As long as there's some way of constructing the database query, and you can put it inside a function that takes an array of IDs and returns a result in the right shape, you can do it -- and this usually isn't complicated enough to justify adding another library.
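The "function that takes an array of IDs and returns a result in the right shape" can be sketched as follows. This is an illustration in Go, not the DataLoader or DynamoDB API. `Product`, `batchFetch`, `batchLoad` and the in-memory `store` map are assumed names, and the map stands in for one batched read such as an IN query.

```go
package main

import "fmt"

// Product is a hypothetical record type.
type Product struct {
	ID   string
	Name string
}

// store stands in for the database in this sketch.
var store = map[string]Product{
	"p1": {ID: "p1", Name: "Kettle"},
	"p2": {ID: "p2", Name: "Toaster"},
}

// batchFetch simulates a single batched read. Like many databases, it
// returns rows in no guaranteed order and silently omits missing IDs.
func batchFetch(ids []string) []Product {
	var rows []Product
	for i := len(ids) - 1; i >= 0; i-- { // reversed to mimic arbitrary order
		if p, ok := store[ids[i]]; ok {
			rows = append(rows, p)
		}
	}
	return rows
}

// batchLoad returns one result per requested ID, in request order,
// with nil marking IDs that were not found.
func batchLoad(ids []string) []*Product {
	byID := make(map[string]*Product)
	for _, p := range batchFetch(ids) {
		p := p // copy so each pointer refers to its own value
		byID[p.ID] = &p
	}
	out := make([]*Product, len(ids))
	for i, id := range ids {
		out[i] = byID[id]
	}
	return out
}

func main() {
	for i, p := range batchLoad([]string{"p2", "missing", "p1"}) {
		if p == nil {
			fmt.Println(i, "not found")
			continue
		}
		fmt.Println(i, p.Name)
	}
}
```

The re-ordering step in `batchLoad` is the "right shape" part: the caller gets exactly one slot per requested ID, regardless of the order or completeness of what the database returned.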

Apache Nifi - Federated Search

My team’s been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets, which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool and then push this result into a database, which is then queried to feed an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed this result into an Elasticsearch instance, which the application's API would query, using the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing a script for ExecuteScript is the way to go, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

GraphQL and nested resources would make unnecessary calls?

I read GraphQL specs and could not find a way to avoid 1 + N * number_of_nested calls, am I missing something?
i.e. a query has a type client which has nested orders and addresses, if there are 10 clients it will do 1 call for the 10 clients + 10 calls for each client.orders + 10 calls for each client.addresses.
Is there a way to avoid this? Note that it is not the same as caching a UUID of something; those are all different values. If your GraphQL server points to a database that can do joins, this would be pretty bad for it, because you could do 3 queries for any number of clients.
I ask this because I wanted to integrate GraphQL with an API that can fetch nested resources in an efficient way and if there was a way to solve the whole graph before resolving it would be nice to try to put some nested stuff in just one call.
Or did I get it wrong, and GraphQL is meant to be used only with microservices?
This is one of the difficulties of GraphQL's "resolver architecture". You must avoid incurring a ton of network latency by doing a lot of I/O in each resolver. Apps using a SQL DBMS will often grapple with the N + 1 problem at first. You need to use some batching and/or caching techniques to get around this.
If you are using Node.js on the server, I have two tools to recommend:
DataLoader - A database-agnostic tool for batching resolvers for each field and caching individual records.
Join Monster - A SQL-tailored tool that reads each query and your schema and compiles a SQL query for you. It leverages JOINs and DataLoader-style batching to fetch the data from your tables in a few (or a single) SQL queries.
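To make the batching idea concrete for the clients/orders example from the question: rather than one orders lookup per client, collect the client IDs, fetch all their orders in one call, and group the rows by client. The sketch below is in Go and uses hypothetical names (`Order`, `fetchOrdersForClients`, `ordersByClient`), with the in-memory `allOrders` slice standing in for a table. It is not the API of DataLoader or Join Monster.

```go
package main

import "fmt"

// Order is a hypothetical child record that belongs to a client.
type Order struct {
	ID       int
	ClientID int
}

// allOrders stands in for the orders table.
var allOrders = []Order{{1, 10}, {2, 11}, {3, 10}, {4, 12}}

// queryCount tracks how many simulated queries were issued.
var queryCount int

// fetchOrdersForClients simulates one "WHERE client_id IN (...)" query.
func fetchOrdersForClients(clientIDs []int) []Order {
	queryCount++
	wanted := make(map[int]bool, len(clientIDs))
	for _, id := range clientIDs {
		wanted[id] = true
	}
	var rows []Order
	for _, o := range allOrders {
		if wanted[o.ClientID] {
			rows = append(rows, o)
		}
	}
	return rows
}

// ordersByClient resolves the orders of every client with a single query
// and groups the rows by client ID.
func ordersByClient(clientIDs []int) map[int][]Order {
	grouped := make(map[int][]Order, len(clientIDs))
	for _, o := range fetchOrdersForClients(clientIDs) {
		grouped[o.ClientID] = append(grouped[o.ClientID], o)
	}
	return grouped
}

func main() {
	grouped := ordersByClient([]int{10, 11, 12})
	fmt.Println(len(grouped[10]), len(grouped[11]), len(grouped[12]))
	fmt.Println("queries issued:", queryCount)
}
```

Applied to both nested fields, this gives one query for the clients, one for all orders and one for all addresses, which is the "3 queries for any number of clients" the question hopes for.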
I assume you're talking about using GraphQL with a SQL database backend. The standard itself is database agnostic, and it doesn't care how you work out possible N+1 SELECT issues in your code. That being said, specific server-side implementations of GraphQL offer different ways of mitigating the problem:
AFAIK, the Ruby implementation is able to make use of Active Record and gems such as bullet to apply horizontal batching of executed database calls.
The JavaScript implementation may make use of the DataLoader library, which has a similar technique of batching a series of executed promises together. You can see it in action here.
The Elixir and Python implementations have a concept of runtime info about executed subqueries, which can be used to determine which data will be needed further on in order to execute the GraphQL query, and potentially prefetch it.
The F# implementation works similarly to Elixir, but the plugin itself can perform live analysis of the execution tree to better describe which fields can potentially be used in code, allowing for an easier split of the GraphQL domain model from the database model.
Many implementations (e.g. PostGraph) tie the underlying database model directly into the GraphQL schema. In this case the GQL query is often translated directly into the database query language.

ActiveRecord (CDbCriteria) vs QueryBuilder?

I have to make some filters, such as get persons who are in a given department, and I was wondering about the best way to do it.
Some of them are going to require the join of multiple tables.
Does anyone know about the main differences between CDbCriteria and Query Builder? I would particularly like to know about the compatibility with databases.
I found this in the Yii documentation about Query Builder:
It offers certain degree of DB abstraction, which simplifies migration to different DB platforms.
Is it the same for the CDbCriteria objects? Is it better?
The concept of CDbCriteria is used when working with Yii's active record (AR) abstraction (which is usually all of the time). AR requires that you have created models for the various tables in your database.
Query builder is a very different way to access the database; in effect it is a structured wrapper that allows you to programmatically construct an SQL query instead of just writing it out as a string (as an added bonus it also offers a degree of database abstraction, as you mention).
In a typical application there would be little to no need to use query builder because AR already provides a great deal of functionality and it also offers the same degree of database abstraction.
In some cases you might want to run a very specific type of query that is not convenient or performant to issue through AR. You then have two options:
If the query is fixed or almost fixed then you can simply issue it through DAO; in fact the query builder documentation mentions that "if your queries are simple, it is easier and faster to directly write SQL statements".
If the query needs to be dynamically constructed then query builder becomes a good fit for the job.
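As an illustration of what "dynamically constructed" means in the second option, the sketch below assembles a parameterized SQL string from whichever optional filters are set. It is written in Go and is not Yii's Query Builder API. `OrderFilter`, `buildOrderQuery`, and the `orders` table and column names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// OrderFilter holds optional filters; zero values mean "not set".
type OrderFilter struct {
	CustomerID int
	Status     string
}

// buildOrderQuery returns a SQL string with ? placeholders plus the
// matching argument list, adding a condition only for each filter set.
func buildOrderQuery(f OrderFilter) (string, []interface{}) {
	var conds []string
	var args []interface{}
	if f.CustomerID != 0 {
		conds = append(conds, "customer_id = ?")
		args = append(args, f.CustomerID)
	}
	if f.Status != "" {
		conds = append(conds, "status = ?")
		args = append(args, f.Status)
	}
	query := "SELECT id, total FROM orders"
	if len(conds) > 0 {
		query += " WHERE " + strings.Join(conds, " AND ")
	}
	return query, args
}

func main() {
	q, args := buildOrderQuery(OrderFilter{CustomerID: 7, Status: "open"})
	fmt.Println(q)
	fmt.Println(args...)
}
```

Values travel as placeholder arguments rather than being concatenated into the string, so the shape of the query can vary at runtime without opening it to SQL injection.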
So as you can see, query builder is not all that useful most of the time. Only if you want to write very customized and at the same time dynamically constructed queries does it make sense to use it.
The example feature that you mention can and should be implemented using AR.

ColdFusion improving performance of queries within loops

I've got a database setup that is a bit on the complicated side, with several many-many tables.
I'm trying to generate an XML document from this data. There's a bit of checking, like if a name is not defined in one language, try to get the name from another language (instead of showing null).
The problem I have is that there are a lot of queries within loops.
Are there any guidelines for this, like what stuff to stay away from and what to use, to improve the performance?
cfoutput cfloop cfquery ?
If the looping logic is basically doing data processing, eg: based on the values from the first query, deciding what to go back to the database with for the next query, the best thing you can do for performance is to take all that logic out of your CF code, and put it into the DB. Use the DB for data processing, use CF for handling the data once it's been processed, and converting it into output.
The only time CF should be doing data manipulation is if you need to process data from differing sources: eg the database, some remote service, the file system, a different database, etc. Basically only if the database can't do the data processing itself should you be involving ColdFusion.
Regarding "like if a name is not defined in one language try to get the name from another language (instead of showing null)":
You should be able to do this in your query. Pretty much every db out there has a coalesce function. They all support case constructs as well. You just have to pick the most appropriate method for your situation.
