If you had to write code that takes messages from a message queue and updates a table in a database, how would you structure it in a good OO way? The messages are XML data, one node per row in the table. Rows in the table can be inserted, updated or deleted.

I don't believe you've provided enough information for a good answer. What do the messages look like? Do they vary in contents/type, or are they all just "messages"? Do they interact with each other, or is this just a data format conversion? One of the keys to OO development is to realize that the "find the nouns-n-verbs" game (which is as much as you've described) rarely leads to the best solution. It certainly won't be the worst, but you'll end up with data aggregation and a bunch of procedural code.
Procedural code isn't bad, though. Why does it need to be OO? Does the problem itself require polymorphism and data hiding? Is there any complex behavior that you are trying to model? There's no shame in using a non-OO solution, when the problem is simple.

Normally with OO implementations of message queues you write the classes that represent the individual message types yourself. To the extent that the message types you expect to receive derive from one another, this gives you the class hierarchy for the messages.
With configuration-based persistence frameworks you can just set up persistence for these classes directly.
Then there's one or more classes that listen to the message queue and just persist the messages, probably just one. It doesn't have to be more elaborate than that.

The best way of building OO code when doing messaging or dealing with any kind of middleware is to hide the middleware APIs from your code and just deal with business logic.
Then you just need to define what your Data Transfer Objects look like; how you want to encode things on the wire in XML / JSON / whatever.
The great thing about this approach is your code is now totally middleware agnostic - you could swap out your message queue and use a database or JavaSpace or in-memory SEDA or files or any other communication protocol or middleware API.
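A minimal sketch of that separation, applied to the original question. The XML layout (a `row` element with `op` and `id` attributes) and every class name here are assumptions made for illustration; the database table is replaced by an in-memory map so the example stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical DTO: one XML node describing a change to one table row.
final class RowChange {
    enum Op { INSERT, UPDATE, DELETE }

    final Op op;
    final String id;
    final String value;

    RowChange(Op op, String id, String value) {
        this.op = op;
        this.id = id;
        this.value = value;
    }
}

// The wire format lives here only. Element and attribute names are assumptions.
final class RowChangeXmlCodec {
    static List<RowChange> decode(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList rows = doc.getElementsByTagName("row");
            List<RowChange> changes = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Element row = (Element) rows.item(i);
                RowChange.Op op = RowChange.Op.valueOf(row.getAttribute("op").toUpperCase(Locale.ROOT));
                changes.add(new RowChange(op, row.getAttribute("id"), row.getTextContent()));
            }
            return changes;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot decode row changes", e);
        }
    }
}

// Business logic: knows nothing about queues or XML.
final class TableUpdater {
    private final Map<String, String> table = new TreeMap<>(); // stands in for the database table

    void apply(List<RowChange> changes) {
        for (RowChange change : changes) {
            switch (change.op) {
                case INSERT:
                case UPDATE:
                    table.put(change.id, change.value);
                    break;
                case DELETE:
                    table.remove(change.id);
                    break;
            }
        }
    }

    Map<String, String> snapshot() {
        return new TreeMap<>(table);
    }
}
```

A queue listener would then only call `updater.apply(RowChangeXmlCodec.decode(body))`; swapping the queue or the encoding touches the listener or the codec, not `TableUpdater`.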


Shall I use a DTO or not?

I'm building a web application with Spring, and I'm at the point where I have an Entity, a Repository, a RestController, and I can access endpoints in my browser.
I'm now trying to return JSON data to the browser, and I'm seeing all of this stuff about DTOs in various guides.
Do I really need a DTO? Can't I just put the serialization logic on the entity itself?
I think this is a somewhat debatable question, where the short answer would be:
It depends.
A slightly longer answer:
Plenty of people, in plenty of cases, prefer one approach (using DTOs) over the other (using bare entities), and vice versa; however, there is no single source of truth on which is better.
It very much depends on the requirements, the architectural approach you decide to stick with, personal preference, and other project-specific details.
Some even claim that DTO is an anti-pattern; some love using them; some think that data refinement/adjustment should happen on the consumer/client side (for various reasons, one of which can be No Policy for API changes).
That being said, YES, you can simply return the @Entity instance (or list of entities) right from your controller, and there is no problem with this approach. I would even say that this does not necessarily violate anything from SOLID or Clean Code principles. Again, it depends on what you use the response for, what representation of the data you need, what the capacity and purpose of the object in question should be, and so on.
A DTO is generally good practice in the following scenarios:
When you want to aggregate the data for your object from different sources, i.e. you want to put some object transformation logic between the Persistence Layer and the Business (or Web) Layer:
Imagine you fetch a List<Employee> from your database; however, from a third-party web service you also receive some complementary data for each Employee object, which you have to aggregate into the Employee objects (aggregate, run some calculation, and so on; the point is that you want to combine data from different sources). This is a good case for the DTO pattern. It is reusable, it conforms to the Single-Responsibility Principle, and it is well segregated from the other layers;
When you don't necessarily combine data received from different sources, but you want to modify the entity you will be returning:
Imagine you have a very big Entity (with a lot of fields), and the client that calls the corresponding endpoint (a front-end application, a mobile app, or any other client) has no need to receive this huge entity (or list of entities). If you still send the original, unchanged entity despite the client's requirement, you will consume more network bandwidth than necessary, performance will suffer, and you will generally waste computing resources for no good reason. In this case, you might want to transform your original Entity into the DTO the client needs (with only the required fields). Here, you might even want to implement different DTO classes for one entity, for different consumers/clients.
However, if you are sure that your table/relation representations (instances of @Entity classes) are exactly what the client needs, I see no need to introduce DTOs.
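The second scenario can be sketched in plain Java as follows. `Employee` stands in for a large @Entity class; its fields, and the `EmployeeSummaryDto` name, are hypothetical and chosen only to show the trimming step.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a large entity class; the fields are hypothetical.
final class Employee {
    final long id;
    final String name;
    final String email;
    final String passwordHash;  // must never reach the client
    final String internalNotes; // not needed by the client

    Employee(long id, String name, String email, String passwordHash, String internalNotes) {
        this.id = id;
        this.name = name;
        this.email = email;
        this.passwordHash = passwordHash;
        this.internalNotes = internalNotes;
    }
}

// DTO carrying only the fields one particular client needs.
final class EmployeeSummaryDto {
    final long id;
    final String name;

    EmployeeSummaryDto(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // Copies the required fields and drops everything else.
    static EmployeeSummaryDto from(Employee employee) {
        return new EmployeeSummaryDto(employee.id, employee.name);
    }

    static List<EmployeeSummaryDto> fromAll(List<Employee> employees) {
        List<EmployeeSummaryDto> result = new ArrayList<>();
        for (Employee employee : employees) {
            result.add(from(employee));
        }
        return result;
    }
}
```

A controller would return `EmployeeSummaryDto.fromAll(...)` instead of the entities, so the serialized response cannot contain `passwordHash` even if the entity later gains more fields.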
Further supporting the idea that an @Entity can be returned to the presentation layer without a DTO:
Java Persistence with Hibernate, Second Edition, §3.3.2, even states explicitly that:
You can reuse persistent classes outside the context of persistence, in unit tests or in the presentation layer, for example. You can create instances in any runtime environment with the regular Java new operator, preserving testability and reusability;
Hibernate entities do not need to be explicitly Serializable;
In general, it’s up to you to decide. If your application is relatively simple, you don’t expose any sensitive information, and the response is unambiguous for the client, there is nothing criminal in returning the whole entity. If your client expects a small slice of the entity, e.g. only 2-3 fields out of a 30-field entity, then it makes sense to do the translation or to consider a different protocol such as GraphQL.
Ideally, you should not expose the entity.
It is good design to convert your entity to a DTO before passing it to the web layer.
These days RestJpacontrollers are also available.
But again, which one to use varies from application to application.
If your application only needs read-only operations, then it makes sense to use RestJpacontrollers and to use the entity at the web layer.
If instead the application modifies data frequently, the better option is a DTO, used at the UI layer.
Another case is when multiple requests would be required to bring the data for a particular task. Here the data can be combined in a DTO so that a single request brings all the required data.
We can combine the data of multiple entities into one DTO.
This DTO can be used for the front end or in the REST API.
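As an illustration of combining several entities into one DTO, here is a small sketch. `Customer`, `Order` and `CustomerOverviewDto` are hypothetical names, not types from any framework.

```java
import java.util.List;

// Two hypothetical entities that would otherwise need two separate requests.
final class Customer {
    final long id;
    final String name;

    Customer(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class Order {
    final long customerId;
    final long totalCents;

    Order(long customerId, long totalCents) {
        this.customerId = customerId;
        this.totalCents = totalCents;
    }
}

// One DTO that carries everything the screen needs in a single response.
final class CustomerOverviewDto {
    final String customerName;
    final int orderCount;
    final long totalSpentCents;

    CustomerOverviewDto(String customerName, int orderCount, long totalSpentCents) {
        this.customerName = customerName;
        this.orderCount = orderCount;
        this.totalSpentCents = totalSpentCents;
    }

    // Combines the customer with the orders that belong to it.
    static CustomerOverviewDto of(Customer customer, List<Order> allOrders) {
        int count = 0;
        long total = 0;
        for (Order order : allOrders) {
            if (order.customerId == customer.id) {
                count++;
                total += order.totalCents;
            }
        }
        return new CustomerOverviewDto(customer.name, count, total);
    }
}
```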
Do I really need a DTO? Can't I just put the serialization logic on the entity itself?
I'd say you don't, but it is better to use them, according to the SOLID principles, namely the single responsibility principle. Entities belong to the ORM and should be used to interact with the database, not serialized and passed to other layers.

CQRS commands and GraphQL mutations

I've just learned about CQRS, and I would like to combine it in a project with a GraphQL-based API. However, in order to do that, a question has come to my mind: according to CQRS, commands must not return anything after execution. However, according to GraphQL conventions, mutations have to return the updated state of the entity.
How should I deal with that? Are CQRS and GraphQL incompatible? The only solution that comes to my mind is, in order to resolve the mutation, to first execute a command and then a query to get the response object. Is there anything better than that? It doesn't look very efficient to me...
How should I deal with that?
Real answer? Ignore the "have to not return anything" constraint; the underlying assumptions behind that constraint don't hold, so you shouldn't be leaning too hard on it.
How exactly to do that is going to depend on your design.
For example, if you are updating the domain model in the same process that handles the HTTP request, then it is a perfectly reasonable thing to (a) save the domain model, (b) run your view projection on the copy of the model that you just saved, and (c) return the view.
In other words, the information goes through exactly the same transformations it would "normally", except that we perform those transformations synchronously, rather than asynchronously.
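A minimal sketch of those three steps, assuming a single process and an in-memory map in place of a real repository. `Account`, `AccountView` and `DepositMutationResolver` are hypothetical names and do not come from any GraphQL or CQRS library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical write-side model.
final class Account {
    final String id;
    final long balanceCents;

    Account(String id, long balanceCents) {
        this.id = id;
        this.balanceCents = balanceCents;
    }

    Account deposit(long cents) {
        return new Account(id, balanceCents + cents);
    }
}

// Hypothetical read-side view returned by the mutation.
final class AccountView {
    final String id;
    final String displayBalance;

    AccountView(String id, String displayBalance) {
        this.id = id;
        this.displayBalance = displayBalance;
    }
}

final class AccountViewProjection {
    // The same projection the read side would otherwise run asynchronously.
    static AccountView project(Account account) {
        String display = String.format(Locale.ROOT, "%d.%02d",
                account.balanceCents / 100, account.balanceCents % 100);
        return new AccountView(account.id, display);
    }
}

final class DepositMutationResolver {
    private final Map<String, Account> store = new HashMap<>(); // stands in for the repository

    void open(String id) {
        store.put(id, new Account(id, 0));
    }

    AccountView deposit(String id, long cents) {
        Account updated = store.get(id).deposit(cents); // apply the command to the model
        store.put(id, updated);                         // (a) save the domain model
        return AccountViewProjection.project(updated);  // (b) project the saved copy, (c) return the view
    }
}
```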
If the model is updated in a different process, then things get trickier, since more message passing is required, and you may need to deal with timeouts. For instance, you can imagine a solution where you send the command, and then poll the "read side" until that model is updated to reflect your changes.
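The polling variant could look roughly like this. It assumes the command's result tells the caller which version the read model must reach, and that the read side can report its current version; both are assumptions of this sketch, as is the `ReadSidePoller` name.

```java
import java.util.function.LongSupplier;

final class ReadSidePoller {
    // Polls readVersion until it reaches expectedVersion or the timeout elapses.
    // Returns true if the read side caught up, false on timeout or interruption.
    static boolean awaitVersion(LongSupplier readVersion, long expectedVersion,
                                long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (readVersion.getAsLong() >= expectedVersion) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

The mutation resolver would send the command, call `awaitVersion`, and then either query the read side or report a timeout to the client.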
It's all trade offs, and those trade-offs are an inevitable consequence of choosing a distributed architecture. We don't choose CQRS because it makes everything better, we choose CQRS because it makes some things better, other things worse, and we are in a context where the things it makes better are more important than the things it makes worse.
I am considering something similar, i.e. using GraphQL predominantly for interfacing with the read-side of a system based on CQRS.
On the write-side, however, I am considering using a Web or REST API that has one end-point that accepts commands.
Remember, in CQRS you don't directly mutate entities but submit a command signalling your intent / desire to do something.
Alternatively, thinking out loud here, it may be possible to use mutations in GraphQL to create commands and track their status using subscriptions.

Is Event sourcing using Database CDC considered good architecture?

When we talk about sourcing events, we have a simple dual write architecture where we can write to database and then write the events to a queue like Kafka. Other downstream systems can read those events and act on/use them accordingly.
But the problem occurs when trying to keep the DB and the events in sync, since the ordering of these events is required to make sense of them.
To solve this problem, people recommend using the database commit log as the source of events, and there are tools built around it such as Airbnb's Spinal Tap, Red Hat's Debezium, Oracle's GoldenGate, etc. This solves the problems of consistency, ordering guarantees and so on.
But the problem with using the database commit log as the event source is that we are tightly coupled to the DB schema. The DB schema of a microservice is exposed, and any breaking change to it, such as a datatype change or a column rename, can break the downstream systems.
So is using the DB CDC as an event source a good idea?
Extending Constantin's answer:
Transaction log tailing/mining should be hidden from others.
It is not strictly an event stream, as you should not access it directly from other services. It is generally used when gradually transitioning a legacy system to a microservices-based one. The flow could look like this:
Service A commits a transaction to the DB
A framework or service polls the commit log and maps new commits to Kafka as events
Service B is subscribed to a Kafka stream and consumes events from there, not from the DB
Longer story:
Service B doesn't see that your event originated from the DB, nor does it access the DB directly. The commit data should be projected into an event. If you change the DB, you should only modify your projection rule to map commits in the new schema to the "old" event format, so consumers need not be changed. (I am not familiar with Debezium, or whether it can do this projection.)
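Such a projection rule might be sketched like this. The captured row is modeled as a plain map, and the column names (`email` before a rename, `email_address` after it) are invented for the example; this is not how any particular CDC tool represents changes.

```java
import java.util.Map;

// Stable, published event format. Consumers depend only on this.
final class CustomerEmailChanged {
    final String customerId;
    final String email;

    CustomerEmailChanged(String customerId, String email) {
        this.customerId = customerId;
        this.email = email;
    }
}

final class CommitLogProjection {
    // Maps one captured row image to the published event.
    // Hypothetical schema change: the column "email" was renamed to "email_address".
    static CustomerEmailChanged project(Map<String, Object> row) {
        Object email = row.containsKey("email_address")
                ? row.get("email_address") // new schema
                : row.get("email");        // old schema
        return new CustomerEmailChanged(String.valueOf(row.get("id")), String.valueOf(email));
    }
}
```

Only `project` changes when the table changes; the event that Service B consumes keeps its shape.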
Your events should be idempotent, as publishing an event and committing a transaction atomically is a problem in a distributed scenario, and tools will guarantee at-least-once delivery with exactly-once processing semantics at best, and the exactly-once part is rarer. This is because the event origin (the transaction log) is not the same as the stream that will be accessed by other services, i.e. it is distributed. And this is still the producer part; the same problem exists with the Kafka->consumer channel, but for a different reason. Also, Kafka will not behave like an event store, so what you have achieved is a message queue.
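One common way to make a consumer tolerate at-least-once delivery is to remember which event ids it has already applied. The sketch below assumes every event carries a unique id and keeps the set in memory; a real consumer would have to persist the ids together with the state change. The class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class IdempotentBalanceConsumer {
    private final Set<String> processedEventIds = new HashSet<>();
    private long balance;

    // Applies the event once. Returns false if this id was seen before,
    // so a redelivered event leaves the balance unchanged.
    boolean handle(String eventId, long amount) {
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        balance += amount;
        return true;
    }

    long balance() {
        return balance;
    }
}
```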
I recommend using a dedicated event-store instead if possible, like Greg Young's: This solves the problem by integrating an event-store and message-broker into a single solution. By storing an event (in JSON) to a stream, you also "publish" it, as consumers are subscribed to this stream. If you want to further decouple the services, you can write projections that map events from one stream to another stream. Your event consuming should be idempotent with this too, but you get an event store that is partitioned by aggregates and is pretty fast to read.
If you want to store the data in the SQL DB too, then listen to these events and insert/update the tables based on them; just do not use your SQL DB as your event store, because it will be hard to implement it right (failure-proof).
For the ordering part: reading events from one stream will be ordered. Projections that aggregate multiple event streams can only guarantee ordering between events originating from the same stream. It is usually more than enough. (By the way, you could reorder the messages based on some field on the consumer side if necessary.)
If you are using Event sourcing:
Then the coupling should not exist. The Event store is generic, it doesn't care about the internal state of your Aggregates. You are in the worst case coupled with the internal structure of the Event store itself but this is not specific to a particular Microservice.
If you are not using Event sourcing:
In this case there is a coupling between the internal structure of the Aggregates and the CDC component (which captures the data change and publishes the event to a message queue or similar). In order to limit the effects of this coupling to the Microservice itself, the CDC component should be part of it. In this way, when the internal structure of the Aggregates in the Microservice changes, the CDC component is changed too and the outside world doesn't notice. Both changes are deployed at the same time.
So is using the DB CDC as an event source a good idea?
"Is it a good idea?" is a question that is going to depend on your context, the costs and benefits of the different trade offs that you need to make.
That said, it's not an idea that is consistent with the heritage of event sourcing as I learned it.
Event sourcing - the idea that our book of record is a ledger of state changes - has been around a long long time. After all, when we talk about "ledger", we are in fact alluding to those documents written centuries ago that kept track of commerce.
But a lot of the discussion of event sourcing in software is heavily influenced by domain driven design; DDD advocates (among other things) aligning your code concepts with the concepts in the domain you are modeling.
So here's the problem: unless you are in some extreme edge case, your database is probably some general purpose application that you are customizing/configuring to meet your needs. Change data capture is going to be limited by the fact that it is implemented using general purpose mechanisms. So the events that are produced are going to look like general purpose patch documents (here's the diff between before and after).
But if we are trying to align our events with our domain concepts (i.e., what does this change to our persisted state mean), then patch documents are a step in the wrong direction.
For example, our domain might have multiple "events" that make changes to the same, or very similar, sets of fields in our model. Trying to rediscover the motivation for a change by reverse engineering the diff is kind of a dumb problem to have; especially when we have already fought with the same sort of problem learning user interface design.
In some domains, a general purpose change is good enough. In some contexts, a general purpose change is good enough for now. Horses for courses.
But it's not really the sort of implementation that the "event sourcing" community is talking about.
Besides the CDC component side that Constantin Galbenu mentioned, you can also do it on the event storage side, for example with the Kafka Streams API.
What is the Kafka Streams API? Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams into output streams.
After transforming the detailed data into abstract data, your DB schema is bound only to the transformation, which removes the tight coupling between the DB and the subscribers.
If your data schema needs to change a lot, maybe you should add a new topic for it.

"ObjectMessage usage is generally discouraged", what to use instead?

The ActiveMQ docs state:
Although ObjectMessage usage is generally discouraged, as it
introduces coupling of class paths between producers and consumers,
ActiveMQ supports them as part of the JMS specification
Having not had much experience with message busses, I have been approaching them as conceptually similar to SOAP web services, where you specify the service interface contract for consumers, who then construct equivalent class proxies.
What I am trying to achieve is:
Publishers in some way indicate the schema of the message
Subscribers in some way know the schema of the message
ObjectMessage solves this problem, although not in the nicest way given the noted classpath coupling. As far as I can see the other message types provide minimal guidance to the consumer as to the expected message format (e.g. consumers would have to assume that a MapMessage contained certain keys with certain value types).
Is there another reasonable way to accomplish this, or is this not even something I should be pursuing?
Since the idea is for publishers/subscribers to know about the schema, the first step is definitely to give the payload a structure using JSON/protobuf. (Not a big fan of XML personally.) We then pass the data as either a TextMessage or a BytesMessage.
There are a couple of ways for publishers/subscribers to communicate the schema:
The subscriber knows about the schema via the publisher's javadoc or sample invocations. (Sounds fine for simple use cases.)
Have a centralized config that the publisher publishes the schema to and the subscriber picks it up from. This config could live in a database or in an application that serves configurations. An effective implementation would ensure that neither publisher nor subscriber breaks if there are modifications.
Advantages of this approach over the ObjectMessage approach:
No tight coupling of payload (i.e. jar upgrades, attribute changes, etc.)
Significant performance improvement - Here's an example where a Java class with a string and an int takes 3.7 times more than directly storing the int and string as bytes.
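The size difference is easy to observe with the JDK alone. This sketch compares standard Java serialization (which is what an ObjectMessage body typically relies on) with writing the two fields directly, as one might for a BytesMessage body. The `Reading` class is hypothetical, and the ratio it yields depends on class and field names, so it will not necessarily match the 3.7 figure quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical payload with one string and one int.
final class Reading implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sensor;
    final int level;

    Reading(String sensor, int level) {
        this.sensor = sensor;
        this.level = level;
    }
}

final class PayloadSizes {
    // Bytes produced by standard Java serialization, including class metadata.
    static int javaSerializedSize(Reading reading) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(reading);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes produced by writing only the two field values.
    static int rawFieldSize(Reading reading) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(reading.sensor); // 2-byte length prefix + encoded characters
                out.writeInt(reading.level);  // 4 bytes
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The raw form is smaller because it omits the class descriptor, but it also means the consumer must learn the field order from somewhere else, which is exactly the schema-sharing problem discussed above.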

ETL, Esper or Drools?

The question environment relates to JavaEE, Spring
I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store.
So for example if I have property1 that can have two text fields FillLevel1 and FillLevel2, I could receive messages on TCP which have both fill levels specified in text as F1=100;F2=90
Later I could add another field, say FillLevel3, when I start receiving messages F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of the system modeler.
My question is: what do you think is better to use for parsing and storing the messages? ETL (using Pentaho, which is used in another system), where you store the raw message and use a task executor to consume them one by one and store the transformed messages as per your rules?
One could use Esper or Drools to do the same thing, storing rules and executing them with a timer, but I am not sure how dynamic you can get with making rules (they have to be made by the end user in a running system, preferably in the most user-friendly way, i.e. no scripts or code, only a GUI).
The end user should be able to change the parse rules. It is also possible that the end user might want to change the archived data as well (for example, in the example above, if a new FillLevel value is added, one would like to put FillLevel=-99 in the previous records to make the data consistent).
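To make the message format and the -99 placeholder concrete, here is a small plain-Java sketch of the parsing step. It is not an ETL, Esper or Drools rule; the `FillLevelParser` name and the idea of passing the expected keys as a list are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FillLevelParser {
    static final int MISSING = -99; // placeholder value taken from the example above

    // Parses a message such as "F1=100;F2=90" and returns one value per expected key.
    // Keys the message does not carry are filled with MISSING; keys that are not
    // expected are ignored.
    static Map<String, Integer> parse(String message, List<String> expectedKeys) {
        Map<String, Integer> parsed = new LinkedHashMap<>();
        for (String pair : message.split(";")) {
            if (pair.trim().isEmpty()) {
                continue;
            }
            String[] keyValue = pair.split("=", 2);
            if (keyValue.length != 2) {
                throw new IllegalArgumentException("Malformed pair: " + pair);
            }
            parsed.put(keyValue[0].trim(), Integer.parseInt(keyValue[1].trim()));
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String key : expectedKeys) {
            result.put(key, parsed.getOrDefault(key, MISSING));
        }
        return result;
    }
}
```

Adding FillLevel3 then amounts to adding "F3" to the expected keys; older messages that lack it come out with -99.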
Please ask for explanations, I have the feeling that I need to revise this question a bit.
Well, Esper is a great CEP engine, but Drools has its own implementation, Drools Fusion, which integrates really well with jBPM. That would be a good choice.
