We use New Relic to gather performance information from our production environment and we have added some custom instrumentation. In the Web Transactions screens, we can see which transactions use most time and we can even drill down into the specific traces of the slowest transactions. This is all working fine. However, the slowest transactions are not always representative for the operation as a whole. They are often edge cases (cache expired, warming requests after an update, etc...).
I would be interested to see the very same data that we can see in the Trace Details in a more aggregate way. Preferably also in the hierarchical way that is used in Trace Details (although this will not always be possible, as multiple instances may have different traces). Is the Breakdown Table on the overview page for one Web Transaction type actually what I am looking for? I am not sure. What does that show exactly?
The Breakdown Table in New Relic's Web Transactions tab is designed to give you an aggregate of performance data along with historical comparisons. This may not provide the specific level of detail you're looking for.
New Relic has a new feature available for the Python and Java agents called X-Ray Sessions. After you start an x-ray session, New Relic will collect up to 100 transaction traces and a thread profile for your transaction. Collection automatically stops at 100 traces or 24 hours, whichever comes first. The results are displayed in a hierarchical waterfall chart like transaction traces, but the data is aggregated. Here is an overview:
https://newrelic.com/docs/transactions-dashboards/xray-sessions
While I can't say if or when this feature will be rolled out to the other language agents, I suggest keeping an eye the following for updates:
https://newrelic.com/docs/features/new-noteworthy
Related
Our scenario: we receive bulks of messages from Kafka and write them to the DB after certain processing. Currently we achieve DB-write rates (in our company network) of up to 300..310 thousand records/min. But my colleagues want more (500K-600K/min.)
The affected Java application has a functional layer (a "business facade" so to speak), underneath we have classes of persistence layer, which write records grouped to individual tables into the DB as bulk inserts/updates. Whereas a bulk insert/update has been implemented as a #Transactional(REQUIRED) - i.e. default setting. Therefore, a received group of Kafka messages often means more than 1 database transaction.
I know that a DB-commit is expensive in terms of performance. I used the following settings when configuring our Spring-based data sources:
useConfigs=maxPerformance
rewriteBatchedStatements=true
prepStmtCacheSize=256
prepStmtCacheSqlLimit=2048
This did improve performance, but not to the desired benchmark of 500K-600K DB-writes/min.
Question to you, colleagues: is it OK from the standpoint of software architecture and for performance increase to annotate our "business facade" class as #Transactional(REQUIRED) and the DB layer classes as #Transactional(SUPPORTS). Thus, I want to have only one transaction per group/bulk of Kafka messages and thereby increase the DB-write rates by avoiding "excessive" commits.
Personally, I'm a bit hesitant about this change. On the one hand, I'm breaking here the boundaries of the areas of responsibility of the individual classes/layers: business logic "high-level" classes should know nothing about transaction management and the persistence layer classes should treat DB transactions as their core task. On the other hand, unwanted "cross-dependencies" arise: i.e. if an update for a table XYZ fails, then a rollback is also made for another table ABC, although everything ran smoothly there (remember all tables are getting now updates and inserts within one transaction!).
What do you think about this potential change in the transaction management? How can you fine-tune a spring-boot application to achieve higher write rates (configuration or maybe implementation changes)?
The company that I work at uses a microservices architecture with the 'database per service' pattern. This pattern makes it harder to query based on data from multiple services, since each service has its own database. Imagine a service for managing your products and one for managing stock. You would have to somehow combine the data from both services to query for products based on stock.
I know that event sourcing and API composition are potential solutions to the problem, but I was wondering if it is possible to continuously replicate specific tables from the product and stock databases based on database transaction logs. Wouldn't this be much simpler than say implementing an event based solution like event sourcing? One service that I am working with contains a lot of domain events, which would make implementing and maintaining event-based solution rather complex.
Another reason for why I am considering to look at the problem from a different angle is that there is a lot of data. In-memory joins with say API composition will most likely be slow.
To sum it all up, I would like to know if it is possible to continuously replicate specific tables from different databases into one database.
The technologies that my company uses are primarily Spring Framework and PostgreSQL.
I would step back and ask why you have microservices (including why you have multiple databases). This is because it's quite easy to make choices that are superficially easy but which achieve that ease by negating the reason you had the microservices to begin with, and in such a situation, it may in fact be easier to just not do microservices.
For example, you might be doing microservices because you want to be able to have the team maintaining your product service be able to make changes without coordinating with the stock service or vice versa. By setting up a direct replication of a table from service A's database into service B's database, you essentially require many changes service A might want to make to that table to be coordinated with service B. It's perhaps less operationally coupled than unifying the services into a monolith, but in terms of developer velocity, you're giving up a fair amount.
Alternatively, if the rationale is to allow one service to be down (failures, maintenance, releases: doesn't matter) without taking the others down, a replication which guarantees strong consistency implies that taking service B's database down prevents service A from updating its database (because if you allowed service A to update its database in that situation, you couldn't have strong consistency).
Rather than direct replication, it might make sense to use change data capture (e.g. with Debezium) to publish a stream of changes from the transaction logs (e.g. to Kafka). The critical difference from logical replication is that the consumer can, for instance, choose to ignore updates to columns it doesn't care about: the stock service might include details like where things are stocked in a warehouse, for instance, which is data you don't need for answering a query like "show me the products in this category which are in stock". This can be a nice middle ground between going full event-sourcing and other approaches.
I'm endeavoring to develop an application that uses Oracle as the database back-end. The application will calculate several statistics from the various tables in the database. The front-end will most likely be a web application and this front-end will display various charts and calculated statistics. Now, I imagine that it would be more efficient to perform the calculations in the database rather than in the service layer because said calculations would need to be performed for every web request. That being the case, I'm not sure which mechanism to use. (e.g. stored procedure, function, view) To illustrate what I'm going for, suppose I want to keep statistics of student grades for many students. I would like to have a web interface that lets me view those statistics on student-by-student basis and also an all-inclusive basis. Some of the stats are dependent on aggregates (e.g. average, min, max) of all of the student grades and some stats are dependent only on an individual student. In this situation, every time a record is added or updated, the aggregates would have to be recalculated. So I am speculating that if I had a special table that held all of the calculated values I need and a trigger(s) to recalculate everything when a record is added/updated then all I would need to do from a web request point-of-view is have the service layer pull the desired values from this special table. I'm just not sure if this is the best way to go or not so I am asking the community for any input/advice. Note: Although I'm using Oracle, I'm open to using PostgreSQL or mySQL.
Thanks in advance
The scenario you are describing would be ideal for using materialized views. They can be designed to refresh automatically (and incrementally) every time the source data is updated by your application. The calculations would be built in to the view definition. No triggers required, and likely no stored procedures unless your calculations involve multiple steps. Check here: https://oracle-base.com/articles/misc/materialized-views and here: https://medium.com/oracledevs/lightning-fast-sql-with-real-time-materialized-views-12-things-developers-will-love-about-oracle-54bcc9eac358 for more info.
In an application we have to send sensory data stream from multiple clients to a central server over internet. One obvious solution is to use MOMs (Message Oriented Middlewares) such as Kafka, but I recently learned that we can do this with data base synchronization tools such as oracle Materialized View.
The later approach works in some application (sending data from a central server to multiple clients, inverse directin of our application), but what is the pros and cons of it in our application? Which one is better for sending sensory data stream from multiple (~100) clients to server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail consider an application in which many (about 100) clients have to send streaming data (1MB data per minute) to a central server over internet. The data are needed in server for the sake of online monitoring, analysis and some computation such as machine learning and data mining tasks.
My question is about the difference between db-to-db connection and streaming solutions such as kafka for trasfering data from clients to server.
Prologue
I'm going to try and break your question down into in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that's been partially stopped by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over your question before posting a complex question may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients, at that amount we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database etc. This is description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
Consequences of changes
Now we've decided that you have a single data pipeline your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from our raw data persistence. This means that the choices are now yours but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive and you're pushing 140GB/day into yours (that's 50TB/year by the way, quite a bit). I don't know if you're actually storing all the raw data but at those volumes it's less likely that you are - you're only storing the aggregations
I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't then consider getting some to prevent madness supporting everything
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database.
If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.
One final point about this; we've defined a data pipeline. A single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients, no need to do anything. Want to change the technology behind part of the system, as long as the interface remains the same there's no issue. Want to send data elsewhere, no problem, it's all in the raw data persistence layer.
I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, e.g. they have the lifetime of my servers.
When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to a) load initial data, and b) get changes that happened since the initial load. There is a time window where changes can be lost: between initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people to do lower network load in that case is they put a proxy node on the same machine as their app server, and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and because it takes a lot of CPU/memory load off of their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.