I am currently working on a project involving GraphQL, and I was wondering whether retrieving every field of a given type in a query takes significantly more time than retrieving only some of them, or whether the difference is negligible.
Here is an example:
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
  totalComments
  totalCollects
  totalFollows
  totalRevenue {
    ...Erc20AmountFields
  }
}
vs
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
}
Thanks in advance!
The answer depends heavily on the backend implementation. Let's look at the three stages the data goes through and how each one can impact response time.
1. Data fetching from the source
First, the GraphQL server has to fetch the data from the database or another data source. Some data sources allow you to specify which fields you want to receive. If the GraphQL service is optimised to fetch only the data needed, some time can be saved here. In my experience, it is often not worth the effort, and it is much easier to just fetch all fields that could be needed for an object type. Some GraphQL implementations do this automatically, e.g. Hasura, Postgraphile, or Pothos with the Prisma plugin. What can be more expensive is resolving relationships between entities: often, the GraphQL implementation has to do another roundtrip to the data source for each related entity (the N+1 problem), unless the lookups are batched.
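To make that last point concrete, here is a minimal sketch of batching those relationship lookups (assuming a Node.js/TypeScript server and the dataloader package; the query, table and field names are made up):

import DataLoader from "dataloader";

// Hypothetical database helper; stands in for whichever client you actually use.
declare const db: { query: (sql: string, params: unknown[]) => Promise<any[]> };

// Batch every totalRevenue lookup in a request into one query to the data source,
// instead of paying one extra roundtrip per parent object.
const revenueLoader = new DataLoader(async (statsIds: readonly string[]) => {
  const rows = await db.query(
    "SELECT stats_id, asset, value FROM revenue WHERE stats_id = ANY($1)",
    [statsIds]
  );
  const byId = new Map<string, unknown>();
  for (const row of rows) byId.set(row.stats_id, row);
  // DataLoader expects results in the same order as the requested keys.
  return statsIds.map((id) => byId.get(id) ?? null);
});

const resolvers = {
  GlobalProtocolStats: {
    // Only runs when the query actually selects totalRevenue.
    totalRevenue: (parent: { id: string }) => revenueLoader.load(parent.id),
  },
};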
2. Data transformation and business logic
Sometimes, the data has to be transformed before it is returned from the resolver. The resolver model allows this business logic to be called conditionally. Leaving out a field will skip its resolver. In my experience, most business logic is incredibly fast and does not really impact response time.
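As a tiny illustration (the derived field and its logic are made up), a resolver like this only does its work when the query selects the field:

const resolvers = {
  GlobalProtocolStats: {
    // Hypothetical derived field: the calculation below only happens
    // if the query actually selects engagementRate.
    engagementRate: (parent: { totalPosts: number; totalComments: number }) =>
      parent.totalPosts === 0 ? 0 : parent.totalComments / parent.totalPosts,
  },
};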
3. Data serialisation and network
Once all the data is ready on the server side, it has to be serialised to JSON and sent to the client. Serialising large amounts of data can be expensive, especially because GraphQL responses are hard to serialise as a stream. Sending the data to the client can also take a while if the connection is slow or the payload is large. This was one of the motivations for GraphQL: allowing the client to select only the required fields and reduce unused data transfer.
Summary
As you can see, the response time is mostly related to the amount of data returned from the API and the network connection. Depending on the implementation, real savings are only made on the network, but more advanced implementations can drastically reduce the work done on the server as well.
Related
So I have been following Ben Awad's 14 hr GraphQL and React series and I came across field resolvers, and I have been a little confused about their use case. I understand that field resolvers can be used to modify certain data on the server. In this particular case, he uses one to slice the text of huge user posts, so I am guessing that fetching large chunks of data becomes time consuming as the data grows, and field resolvers can be used to trim the data that is returned and take load off the server. Am I right? Because this makes sense to me and this is how I understand it. Is there more to what they offer, or is this it?
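The pattern I am describing looks roughly like this (a simplified resolver map, not the exact code from the series):

const resolvers = {
  Post: {
    // Return only a short slice of a potentially huge text column,
    // so list views don't pull the full body of every post.
    textSnippet: (post: { text: string }) => post.text.slice(0, 50),
  },
};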
Background information
We sell an API to users, that analyzes and presents corporate financial-portfolio data derived from public records.
We have an "analytical data warehouse" that contains all the raw data used to calculate the financial portfolios. This data warehouse is fed by an ETL pipeline, and so isn't "owned" by our API server per se. (E.g. the API server only has read-only permissions to the analytical data warehouse; the schema migrations for the data in the data warehouse live alongside the ETL pipeline rather than alongside the API server; etc.)
We also have a small document store (actually a Redis instance with persistence configured) that is owned by the API layer. The API layer runs various jobs to write into this store, and then queries data back as needed. You can think of this store as a shared persistent cache of various bits of the API layer's in-memory state. The API layer stores things like API-key blacklists in here.
Problem statement
All our input data is denominated in USD, and our calculations occur in USD. However, we give our customers the query-time option to convert the response just-in-time to another currency. We do this by having the API layer run a background job to scrape exchange-rate data, and then cache it in the document store. Individual API-layer nodes then do (in-memory-cached-with-TTL) fetches from this exchange-rates key in the store, whenever a query result needs to be translated into a specific currency.
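Concretely, each API node does roughly the following (a simplified sketch; the key name, TTL and ioredis client are just for illustration):

import Redis from "ioredis";

const redis = new Redis(); // the shared document store
const TTL_MS = 60_000;     // per-node in-memory cache TTL
let cached: { rates: Record<string, number>; fetchedAt: number } | null = null;

// Returns the USD -> other-currency rates written by the background scrape job,
// hitting the document store at most once per TTL window on this node.
async function getExchangeRates(): Promise<Record<string, number>> {
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) return cached.rates;
  const raw = await redis.get("exchange_rates");
  const rates: Record<string, number> = raw ? JSON.parse(raw) : {};
  cached = { rates, fetchedAt: Date.now() };
  return rates;
}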
At first, we thought that this unit conversion wasn't really "about" our data, just about the API's UX, and so we thought this was entirely an API-layer concern, where it made sense to store the exchange-rates data into our document store.
(Also, we noticed that, by not pre-converting our DB results into a specific currency on the DB side, the calculated results of a query for a particular portfolio became more cache-friendly; the way we're doing things, we can cache and reuse the portfolio query results between queries, even if the queries want the results in different currencies.)
But recently we've been expanding into allowing partner clients to also execute complex data-science/Business Intelligence queries directly against our analytical data warehouse. And it turns out that they will also, often, need to do final exchange-rate conversions in their BI queries as well, despite there being no API layer involved here.
It seems like, to serve the needs of BI querying, the exchange-rate data "should" actually live in the analytical data warehouse alongside the financial data; and the ETL pipeline "should" be responsible for doing the API scraping required to fetch and feed in the exchange-rate data.
But this feels wrong: the exchange-rate data has a different lifecycle and integrity constraints than our financial data. The exchange rates are dirty and ephemeral point-in-time samples attained by scraping, whereas the financial data is a reliable historical event stream. The exchange rates get constantly updated/overwritten, while the financial data is append-only. Etc.
What is the best practice for serving the needs of analytical queries that need to access backend "application state" for "query result presentation" needs like this? Or am I wrong in thinking of this exchange-rate data as "application state" in the first place?
What I find interesting about your scenario is when the exchange rate data is applicable.
In the case of the API, it's all about the realtime value in the other currency and it makes sense to have the most recent value in your API app scope (Redis).
However, I assume your analytical data warehouse has tables with purchases that were made at a certain time. In those cases, the current exchange rate is not really relevant to the value of the transaction.
This might mean that you want to store the exchange rate history in your warehouse or expand the "purchases" table to store the values in all the currencies at that moment.
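As a sketch of what I mean (column names are purely illustrative, and this is TypeScript shorthand for a table design rather than a prescription), either of these shapes lets you reproduce the value of a transaction at the moment it happened:

// Option 1: keep a rate history and join it to purchases on purchase time.
interface ExchangeRateSample {
  currency: string;   // e.g. "EUR"
  rateToUsd: number;  // point-in-time scraped sample
  effectiveAt: Date;  // when this sample was valid
}

// Option 2: denormalise the converted amounts onto the purchase row itself,
// captured with the rates that applied at purchase time.
interface PurchaseRow {
  id: string;
  purchasedAt: Date;
  amountUsd: number;
  amountEur: number;
  amountGbp: number;
}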
Can someone explain the benefits of using GraphQL on your Magento/Magento 2 site?
Is it really faster than a normal query or using collections? Because from what I see, you still have to set/fetch all the data in the resolver that was declared in schema.graphql so that it will be available on every request.
Is it faster because each set of data is cached by GraphQL, or is there some logic behind it that makes it faster?
Like when you just need the name and description of a product, you would just call getCollection()->addAttributeToSelect(['name', 'description'])->addAttributeToFilter('entity_id', $id)->getFirstItem() inside your block, whereas in a GraphQL request the data is fetched via the resolver, in which all the data of the product is fetched.
Regarding performance, GraphQL is normally faster than the equivalent REST API if the client needs to get a graph of data, and assuming the GraphQL API is implemented correctly. Otherwise, it may easily lead to the N+1 loading problem, which will make the API slow.
It is faster mainly because the client only needs to send one request to get the whole graph of data, while with a REST API the client needs to send many HTTP requests separately to get the same graph of data. The number of network round trips is reduced to one in the GraphQL case, and hence it is faster. (Of course, this assumes there is no single equivalent REST API to get this graph of data. 😉)
In other words, if you only get a single record, you will not find much performance difference between the REST API and the GraphQL API.
But besides performance, what a GraphQL API offers is letting the client get the exact fields and the exact graph of data that it wants, which is difficult to achieve with a REST API.
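As a rough illustration of the round-trip difference (the URLs and field names here are simplified and not the real Magento schema):

// REST: assembling one product "graph" can take several round trips.
const product = await fetch("/rest/V1/products/SKU-42").then((r) => r.json());
const reviews = await fetch("/rest/V1/products/SKU-42/reviews").then((r) => r.json());
const related = await fetch("/rest/V1/products/SKU-42/related").then((r) => r.json());

// GraphQL: one round trip, and the client selects only the fields it needs.
const { data } = await fetch("/graphql", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: `{ product(sku: "SKU-42") { name description reviews { summary } relatedProducts { name } } }`,
  }),
}).then((r) => r.json());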
We use one endpoint that returns a massive amount of data, and sometimes the page takes 5-10s to load. We don't have control over the backend API.
Is there a way to reduce the size that's going to be downloaded from the API?
We have already enabled compression.
I heard GraphQL lets you define a data schema before querying it. Would GraphQL help in this case?
GraphQL could help, assuming:
Your existing API request is doing a lot of overfetching, and you don't actually need a good chunk of the data being returned
You have the resources to set up an additional GraphQL server to serve as a proxy to the REST endpoint
The REST endpoint response can be modeled as a GraphQL schema (this might be difficult or outright impossible if the object keys in the returned JSON are subject to change)
The response from the REST endpoint can be cached
The extra latency introduced by adding the GraphQL server as an intermediary is sufficiently offset by the reduction in response size
Your GraphQL server would have to expose a query that would make a request to the REST endpoint and then cache the response server-side. The cached response would be used for subsequent queries to the server until it expires or is invalidated. Caching the response is key; otherwise, simply proxying the request through the GraphQL server will make all your queries slower, since getting the data from the REST endpoint to the server will itself take approximately the same amount of time as your request does currently.
GraphQL can then be used to cut down the size of your response in two ways:
By not requesting certain fields that aren't needed by your client (or omitting these fields from your schema altogether)
By introducing pagination. If the reason for the bloated size of your response is the sheer number of records returned, you can add pagination to your schema and return smaller chunks of the total list of records one at a time.
Note: the latter can be a significant optimization, but can also be tricky if your cache is frequently invalidated
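To make that concrete, here is a minimal sketch of such a caching proxy with pagination (the package choice, upstream URL, TTL and field names are all assumptions, not a prescribed implementation):

import { createServer } from "node:http";
import { createSchema, createYoga } from "graphql-yoga";

// Cache the slow, large upstream REST response server-side with a TTL.
let cache: { data: unknown; fetchedAt: number } | null = null;
const TTL_MS = 5 * 60_000;

async function getUpstream(): Promise<unknown> {
  if (cache && Date.now() - cache.fetchedAt < TTL_MS) return cache.data;
  const data = await fetch("https://upstream.example.com/big-endpoint").then((r) => r.json());
  cache = { data, fetchedAt: Date.now() };
  return data;
}

const schema = createSchema({
  typeDefs: `
    type Item { id: ID! name: String }
    type Query { items(limit: Int = 50, offset: Int = 0): [Item!]! }
  `,
  resolvers: {
    Query: {
      // Clients only download the fields and the page of records they ask for.
      items: async (_: unknown, args: { limit: number; offset: number }) => {
        const all = (await getUpstream()) as { id: string; name: string }[];
        return all.slice(args.offset, args.offset + args.limit);
      },
    },
  },
});

createServer(createYoga({ schema })).listen(4000);

The important part is that the upstream request happens at most once per TTL window; everything the client saves comes from selecting fewer fields and smaller pages of the cached data.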
In an application we have to send a sensory data stream from multiple clients to a central server over the internet. One obvious solution is to use MOMs (Message Oriented Middleware) such as Kafka, but I recently learned that we can do this with database synchronization tools such as Oracle materialized views.
The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of our application), but what are the pros and cons of it in our application? Which one is better for sending a sensory data stream from multiple (~100) clients to a server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail, consider an application in which many (about 100) clients have to send streaming data (1MB of data per minute) to a central server over the internet. The data are needed on the server for online monitoring, analysis, and some computation such as machine learning and data mining tasks.
My question is about the difference between a db-to-db connection and streaming solutions such as Kafka for transferring data from the clients to the server.
Prologue
I'm going to try and break your question down in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write, so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partly held back by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over a complex question before you post it may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is a description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients' data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails.
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make, I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and the number of single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now that we've decided you have a single data pipeline, your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours, but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive, and you're pushing roughly 140GB/day into yours (100 clients at 1MB per minute is about 144GB/day, or roughly 50TB/year - quite a bit). I don't know if you're actually storing all the raw data, but at those volumes it's more likely that you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database?
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
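If you do end up picking Kafka for that middle layer, the shape of it is roughly this (a sketch using the kafkajs client; topic, broker and group names are illustrative):

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "sensor-pipeline", brokers: ["central-broker:9092"] });

// Client side: each of the ~100 clients produces to the same topic.
export async function sendReading(clientId: string, payload: Buffer): Promise<void> {
  const producer = kafka.producer();
  await producer.connect(); // in practice you'd keep one long-lived producer per client
  await producer.send({
    topic: "sensor-readings",
    messages: [{ key: clientId, value: payload }],
  });
  await producer.disconnect();
}

// Central side: a single consumer group drains the topic into the raw data persistence.
export async function runIngest(
  writeRaw: (clientId: string, value: Buffer) => Promise<void>
): Promise<void> {
  const consumer = kafka.consumer({ groupId: "raw-persistence-writer" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["sensor-readings"] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      await writeRaw(message.key?.toString() ?? "unknown", message.value ?? Buffer.alloc(0));
    },
  });
}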
One final point about this: we've defined a data pipeline, a single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients? No need to do anything. Want to change the technology behind part of the system? As long as the interface remains the same, there's no issue. Want to send data elsewhere? No problem; it's all in the raw data persistence layer.