My team has been thrown into the deep end and asked to build a federated customer search over a variety of large datasets, each holding different data about each individual, with no matching identifiers between them. I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database. That database would then be queried to feed an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database after the first flow:-

Then https://github.com/dedupeio/dedupe is run over this database table, which adds cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into the Elasticsearch instance, which the application's API would query, using the cluster id to link the duplicates.
A couple of questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, which I'd need because the databases will be getting constantly updated. I'd be really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).

For de-duplicating...
You will probably need to write a custom processor or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you would write a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
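Whichever route you take, the step in front of Elasticsearch has to fold the records that share a cluster id into one searchable document. Here is a minimal sketch of that folding in plain Python. The field names (cluster_id, name, email, source) are made up for illustration, and the "first non-empty value wins" rule is just one possible merge policy, not something dedupe does for you.

```python
from collections import defaultdict

def group_by_cluster(records, cluster_key="cluster_id"):
    """Fold rows that share a cluster id into one document per customer."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record[cluster_key]].append(record)

    documents = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        merged = {}
        for member in members:
            for field, value in member.items():
                # Keep the first non-empty value seen for each field.
                if field == cluster_key or value in (None, "") or field in merged:
                    continue
                merged[field] = value
        documents.append({"cluster_id": cluster_id, "merged": merged, "sources": members})
    return documents

# Hypothetical rows as they might look after the dedupe step.
rows = [
    {"cluster_id": 1, "name": "J. Smith", "email": None, "source": "crm"},
    {"cluster_id": 1, "name": "John Smith", "email": "js@example.com", "source": "billing"},
    {"cluster_id": 2, "name": "Ann Lee", "email": "ann@example.com", "source": "crm"},
]
docs = group_by_cluster(rows)
```

Keeping the original rows under "sources" means the application can still show where each piece of data came from.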


I will explain my use case so it is clear which DB extract utility I should use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving 5-10 tables in joins etc. with multiple clauses. I have around 20-30 such statements overall.
All these extract queries might need to run multiple times a day, with frequencies that vary from day to day. It depends on how many times we receive data from the source system, among other cases.
We are planning to use Kafka to publish a message that lets the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered. (I can't just trigger the NiFi flow based on an "incremental" column value: there may be scenarios where every row is updated and no new rows are created.)
How should I go about designing my NiFi flow? There are ExecuteSQL, GenerateTableFetch, ExecuteSQLRecord, QueryDatabaseTable, all sorts of components available. Which one is going to fit my requirement best?
I suggest that you use ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to send JSON, then parse that JSON and create attributes from it. Check this example, in which the SQL is created from a file; you can adjust it to create the SQL from Kafka instead.
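To make the "parse JSON into attributes" idea concrete, here is a rough sketch in plain Python of what that mapping could look like. The message shape ({"extract": ..., "params": [...]}), the query registry and the attribute names (extract.name, extract.sql, extract.param.N) are all assumptions for illustration; inside NiFi you would do the equivalent with the JSON-handling processors and then reference the attribute from ExecuteSQL.

```python
import json

# Hypothetical registry: extract name -> SQL text for that extract.
EXTRACT_QUERIES = {
    "daily_orders": (
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id WHERE o.region = ?"
    ),
}

def attributes_from_message(raw_message):
    """Turn a JSON trigger message into flat, string-valued attributes."""
    payload = json.loads(raw_message)
    extract = payload["extract"]
    if extract not in EXTRACT_QUERIES:
        raise KeyError(f"unknown extract: {extract}")
    attributes = {"extract.name": extract, "extract.sql": EXTRACT_QUERIES[extract]}
    # Flow file attributes are strings, so every parameter is stringified.
    for position, value in enumerate(payload.get("params", []), start=1):
        attributes[f"extract.param.{position}"] = str(value)
    return attributes

attrs = attributes_from_message('{"extract": "daily_orders", "params": ["EU"]}')
```

Having the message name the extract, rather than carry raw SQL, keeps the 20-30 statements in one place and stops a producer from running arbitrary queries.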

I set up Elasticsearch Service and DynamoDb stream as described in this blog post.
Now I need to add pre-existing data from DynamoDB to Elasticsearch.
I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it.
What is the best option in this case to add pre-existing data?
Populating existing items into Elasticsearch is not straightforward, since a DynamoDB stream only fires for item changes, not for existing records.
Here are a few approaches with pros and cons.
1. Scan all the existing items from DynamoDB and send them to Elasticsearch
We can scan all the existing items and run Python code hosted on an EC2 machine to send the data to ES.
Pros:
a. Simple solution, nothing much required.
Cons:
a. Cannot be run in a Lambda function, since the job may time out if there are too many records.
b. This approach is more of a one-time thing and cannot be used for incremental changes (let's say we want to keep updating ES as the DynamoDB data changes).
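For the scan approach, the two fiddly parts are following the scan pagination and shaping the payload for the Elasticsearch _bulk endpoint. Below is a sketch of both using only the standard library. scan_all expects any callable shaped like boto3's Table.scan (returns "Items" and, while more pages remain, "LastEvaluatedKey"); the fake_scan function, the index name "items" and the "id" key are stand-ins so the sketch runs without AWS access. Actually POSTing each body to your ES domain is left out.

```python
import json

def scan_all(scan_page):
    """Yield every item, following LastEvaluatedKey across pages."""
    kwargs = {}
    while True:
        page = scan_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key

def bulk_bodies(items, index, id_field, batch_size=500):
    """Yield newline-delimited request bodies for the _bulk endpoint."""
    lines = []
    for item in items:
        # Each document is an action line followed by the source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": item[id_field]}}))
        lines.append(json.dumps(item))
        if len(lines) >= 2 * batch_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"

# Stand-in for table.scan: two pages, keyed by the start key's id.
PAGES = {
    None: {"Items": [{"id": "a"}, {"id": "b"}], "LastEvaluatedKey": {"id": "b"}},
    "b": {"Items": [{"id": "c"}]},
}

def fake_scan(ExclusiveStartKey=None):
    return PAGES[ExclusiveStartKey["id"] if ExclusiveStartKey else None]

bodies = list(bulk_bodies(scan_all(fake_scan), index="items", id_field="id", batch_size=2))
```

One thing to watch with real data: the boto3 resource API returns numbers as Decimal, which json.dumps will not serialize without a custom default.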
2. Use DynamoDB streams
We can enable DynamoDB streams and build the pipeline as explained here.
Now we can update some flag on the existing items so that all the records flow through the pipeline and the data goes to ES.
Pros:
a. The pipeline can be used for incremental DynamoDB changes.
b. No code duplication or one-time effort. Every time we need to update an item in ES, we update the item and it gets indexed in ES.
c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)
Cons:
a. Changing prod data can be a dangerous thing and may not be allowed, depending on the use case.
3. Use a temporary table with a stream
This is a slight modification of the above approach.
Instead of changing items in the prod table, we can create a temporary table and enable a stream on it, reusing the pipeline mentioned in the 2nd approach. Then we copy the items from the prod table to the temporary table. The data will flow through the existing pipeline and get indexed in ES.
Pros:
a. No prod data change is required, and this pipeline can be used for incremental changes as well.
b. Same as approach 2.
Cons:
a. Copying data from one table to another may take a lot of time, depending on data size.
b. Copying data from one table to another is a one-time script, and hence has maintainability issues.
Feel free to edit or suggest other approaches in the comments.
Here is what we came up with, using a three-value status column:
0 = Not indexed
1 = Indexed
2 = Updated
There will be 2 jobs...
Job 1 will select the top X records where status = 0 and pop them onto a queue like RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and pop them onto RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course we may need an intermediate status for "queued" so that neither job picks up the same record again, and a job should not start again if its previous run hasn't completed. The chances of a queued record being updated are slim to none, since updates only happen at end of day, usually the next day.
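Here is a small sketch of that claim-then-mark cycle, following the job descriptions above (status goes to 1 after indexing, 2 when the stored proc updates a row). It uses SQLite from the Python standard library purely so it runs anywhere; on SQL Server the LIMIT would be a TOP, and the table name "records" and the extra QUEUED = 3 status are assumptions.

```python
import sqlite3

NOT_INDEXED, INDEXED, UPDATED, QUEUED = 0, 1, 2, 3  # QUEUED is an assumed extra status

def claim_batch(conn, from_status, limit):
    """Select up to `limit` rows in `from_status` and mark them QUEUED."""
    rows = conn.execute(
        "SELECT id, body FROM records WHERE status = ? ORDER BY id LIMIT ?",
        (from_status, limit),
    ).fetchall()
    with conn:  # commit the status change as one transaction
        conn.executemany(
            "UPDATE records SET status = ? WHERE id = ?",
            [(QUEUED, row_id) for row_id, _ in rows],
        )
    return rows

def mark_indexed(conn, ids):
    """Called by the consumer after a successful bulk insert into ES."""
    with conn:
        # Only rows still QUEUED are marked; a row updated in the meantime
        # keeps status 2 and is picked up again by Job 2.
        conn.executemany(
            "UPDATE records SET status = ? WHERE id = ? AND status = ?",
            [(INDEXED, row_id, QUEUED) for row_id in ids],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, status INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO records (body) VALUES (?)", [("a",), ("b",), ("c",)])
conn.commit()

batch = claim_batch(conn, NOT_INDEXED, limit=2)        # Job 1
mark_indexed(conn, [row_id for row_id, _ in batch])    # consumer, after the bulk insert
statuses = [s for (s,) in conn.execute("SELECT status FROM records ORDER BY id")]
```

Job 2 is the same call with from_status=UPDATED. The sketch relies on the rule above that a job never overlaps with its own previous run.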
So I know there are rivers (but they are being deprecated and are probably not as flexible as ETL).
I would like to bulk insert records from my SQL Server to Elasticsearch.
I could write a scheduled batch job of some sort; whether it is ETL or any other tool doesn't matter.
A query like select from table where id > lastIdInsertedToElasticSearch would load the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in SQL Server? What would be a good pattern to track updated records in SQL Server and then push them to ES? I know ES has document versions when putting the same id, but I can't seem to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. by some codebase in your control), it would be better for that same codebase to update the documents in Elasticsearch. Maybe not directly, but by notifying some other service or with the help of queues, so that you don't waste time when responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
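A stripped-down sketch of that write-then-notify pattern, using the standard library's queue.Queue in place of RabbitMQ and a plain dict in place of the Elasticsearch index. Both stand-ins, and the function names, are made up for illustration; the point is only that the consumer replaces the whole document under the same id as the SQL row.

```python
import json
import queue

def publish_change(outbox, row):
    """Called by the application right after it writes `row` to SQL."""
    outbox.put(json.dumps(row))

def drain(outbox, index):
    """Consumer side: replace the whole document, keyed by the shared id."""
    handled = 0
    while True:
        try:
            message = outbox.get_nowait()
        except queue.Empty:
            return handled
        row = json.loads(message)
        index[row["id"]] = row  # full replacement, not a partial update
        handled += 1

outbox = queue.Queue()   # stand-in for the RabbitMQ queue
index = {}               # stand-in for the Elasticsearch index

publish_change(outbox, {"id": 7, "title": "draft"})
publish_change(outbox, {"id": 7, "title": "final"})
handled = drain(outbox, index)
```

Because each message carries the full row, replaying a message is harmless: the document simply ends up in the same state again.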
Another case is where a few types of the data we search on come from some 3rd party service which exposes the data over its HTTP API. The data creation is in our control but we don't have an automated mechanism for updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have had to tune the cron's schedule because we also have a limited quota of API queries. But in this case, our data is not really updated that much per day, so this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
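To give an idea of the delta-with-reset behaviour, here is a sketch in Python against SQLite. This is not the project's actual stored procedure; it only mimics the idea with a stored watermark. The table names (customers, sync_state) and the integer version column are assumptions. On SQL Server a rowversion column could play the role of version.

```python
import sqlite3

def fetch_delta(conn, reset=False):
    """Return rows changed since the stored watermark, then advance it."""
    with conn:
        if reset:
            # Resetting the watermark makes the next fetch return everything.
            conn.execute("UPDATE sync_state SET last_version = 0")
        (last_version,) = conn.execute("SELECT last_version FROM sync_state").fetchone()
        rows = conn.execute(
            "SELECT id, name, version FROM customers WHERE version > ? ORDER BY version",
            (last_version,),
        ).fetchall()
        if rows:
            conn.execute("UPDATE sync_state SET last_version = ?", (rows[-1][2],))
    return rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, version INTEGER);
    CREATE TABLE sync_state (last_version INTEGER);
    INSERT INTO sync_state VALUES (0);
    INSERT INTO customers VALUES (1, 'Ann', 1), (2, 'Bob', 2);
""")

first = fetch_delta(conn)                   # initial load: both rows
conn.execute("UPDATE customers SET name = 'Robert', version = 3 WHERE id = 2")
second = fetch_delta(conn)                  # only the row changed since then
everything = fetch_delta(conn, reset=True)  # reindex: all rows again
```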
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

I am building an application that requires a lot of data constantly being extracted from a local MongoDB to be put into Neo4j. Seeing as I am also having many users access the Neo4j database, from both a Django webserver and other places, I decided on using the REST interface for Neo4j.
The problem I am having is that, even with batch insertion, the Neo4j server is active over 50% of the time just trying to insert all the data from MongoDB. As far as I can see, there might be some waiting time because of the HTTP requests. I have been trying to tweak this but have only gotten so far.
The question is: if I write a Java plugin (http://docs.neo4j.org/chunked/stable/server-plugins.html) that can handle inserting the MongoDB extractions directly, will I then bypass the REST API? Or will the Java plugin commands just be converted to regular REST API requests? Furthermore, will there be a performance boost from using the plugin?
The last question is how to optimize the speed of the REST API (so far I am performing around 1500 read/write operations, which include many "get_or_create_in_index" operations). Is there a sweet spot where the number of queries appended to one HTTP request will keep Neo4j busy until the next HTTP request arrives?
I am using Neo4j version 2.0
The data that I am extracting consists of Bluetooth observations, where the phone that is running the app I created scans all nearby phones. Each observation is then saved as a document in MongoDB and consists of the user's id, the time of the scan and a list of the phones/users that were seen in that scan.
In Neo4j I model all the users as nodes and I also model an observation between two users as a node so that it will look like this:
Furthermore I index all user nodes.
When moving an observation from MongoDB to Neo4j, I do the following for each document:
Check in the index whether the user doing the scan already has a node assigned; if not, create one.
Then, for each observed user in the scan:
A) Check in the index whether the observed user has a node; if not, create one.
B) Create an observation node and relationships between the users and the observation node, if these don't already exist.
C) Make a relationship between the observation node and a timeline node (the timeline just consists of a tree of nodes so that I can quickly find observations at a certain time).
As can be seen, I am doing quite a few lookups in the user index (3), some normal reads (2-3) and potentially many writes for each observation.
Each Bluetooth scan averages around 5-30 observations and I batch 100 scans in a single HTTP request. This means that each request usually contains 5000-10000 updates.
What version are you using?
The unmanaged extension would use the underlying Java API, so it is much faster. You can also decide on the format and protocol of the data that you push to it.
It is sensible to batch writes, so that you don't incur transaction overhead for each tiny write. E.g. aggregating 10-50k updates in one operation helps a lot.
What is the concrete shape of the updates you do? Can you edit your question to reflect that?
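On the client side, the batching itself can be as simple as the sketch below. The batch size of 10,000 is picked from the low end of the 10-50k range mentioned above and is only a starting point to tune; the range(25_000) is dummy data standing in for real updates.

```python
from itertools import islice

def batches(updates, size):
    """Yield lists of at most `size` updates, preserving order."""
    iterator = iter(updates)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Grouping many small updates means one transaction per batch
# rather than one per write.
sizes = [len(chunk) for chunk in batches(range(25_000), size=10_000)]
```

Each chunk would then be sent as a single operation, whatever protocol you settle on for the extension.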
We have different sets of data in different systems like Hadoop, Cassandra and MongoDB, but our analytics team wants to get stitched data from across these systems. For example, customer information with demographics will be in one system and their transactions will be in another. An analyst should be able to run a query such as: for US users, what was the volume of transactions? We need to develop an application that provides an easy way to interact with the different systems. What is the best way to do this?
Another requirement:
If we provide them with a custom workspace in a system like MongoDB, they can easily play with it. What is the best strategy to pull data from one system to another on demand?
Any pointers or common architectures used to solve this kind of problem would be really helpful.
I see two questions here:
How can I consolidate data from different systems into one system?
How can I create some data in Mongo for people to experiment with?
Here we go ... =)
I would pick one system and target that for consolidation. In other words, between Hadoop, Cassandra and MongoDB, which one does your team have the most experience with? Which one do you find easiest to query with? Which one do you have set up to scale well?
Each one has pros and cons with respect to scale, storage and queryability.
I would pick one and then pump all data to that system. At a recent job, that ended up being MongoDB. It was easy to move data to Mongo and it had by far the best query language. It also had a great community and setting up nodes was easier than Hadoop, etc.
Once you have solved (1), you can trim your data set and create a scaled down sandbox for people to run ad-hoc queries against. That would be my approach. You don't want to support the entire data set, because it would likely be too expensive and complicated.
If you were doing this in a relational database, I would say just run a
select top 1000 * from [table]
query on each table and use that data for people to play with.
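As a rough illustration of that sampling idea, here is a sketch using SQLite from the Python standard library, where LIMIT plays the role of TOP. The database contents and table names are made up; the same loop-over-tables shape would apply to whichever store you consolidate into.

```python
import sqlite3

def sample_tables(conn, limit=1000):
    """Return up to `limit` rows from every user table, keyed by table name."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    samples = {}
    for table in tables:
        # Table names cannot be bound as parameters, so quote them instead.
        quoted = '"' + table.replace('"', '""') + '"'
        samples[table] = conn.execute(
            f"SELECT * FROM {quoted} LIMIT ?", (limit,)
        ).fetchall()
    return samples

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'US'), (2, 'DE'), (3, 'US');
    INSERT INTO transactions VALUES (1, 1, 9.5), (2, 3, 20.0);
""")
sandbox = sample_tables(conn, limit=2)
```

A plain "first N rows" sample can break references between tables (a transaction whose customer was not sampled), so it is good enough for people to play with but not for checking totals.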
