Elasticsearch join/scripted query using the output of a subquery

I need to write a search query in Elasticsearch over data like the following:
{"id": "p1", "person": {"name": "name", "age": "12"}, "relatedTO": {"id": "p2"}}
{"id": "p2", "person": {"name": "name2", "age": "15"}, "relatedTO": {"id": "p3"}}
{"id": "p3", "person": {"name": "name3", "age": "17"}, "relatedTO": {"id": "p1"}}
Scenario: the user wants to search for people related to p2, and then use each related person to find who they are related to.
1. First find who is related to p2; answer = p1.
2. Now find people related to p1; answer = p3. (The requirement for now is to go only one level deep, so there is no need to find people related to p3.) The final result should be p2, p1, p3.
Normally we would write a nested SQL query to get such results. How do we achieve this with the Elasticsearch query language in one shot?

To do it in one shot you would need to use parent-child relationships, but I wouldn't recommend that in the first place, because it is not very performant. By the way, grandparents and grandchildren are also supported.
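For reference, a minimal sketch of what such a mapping looks like in recent Elasticsearch versions, which model parent-child with a "join" field; the index name, field name and relation names below are my own assumptions, using the official Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical index declaring a person -> related_person parent-child
# relation; parent and child documents must live in the same index.
es.indices.create(
    index="people",
    body={
        "mappings": {
            "properties": {
                "relation": {
                    "type": "join",
                    "relations": {"person": "related_person"},
                }
            }
        }
    },
)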
You could also use application-side joins, meaning you execute several queries until you get what you want, as sketched below. (Be aware that the first result sets should be very small, otherwise this can get costly.)
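A minimal sketch of the application-side join for the example above, assuming an index named "people" holding the documents from the question and the official Python client. It is not one shot, but a one-level traversal needs only two rounds of queries:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def related_to(person_id):
    # Find documents whose relatedTO.id points at person_id; depending on
    # your mapping you may need the "relatedTO.id.keyword" sub-field here.
    response = es.search(
        index="people",
        body={"query": {"term": {"relatedTO.id": person_id}}},
    )
    return [hit["_source"]["id"] for hit in response["hits"]["hits"]]

first_level = related_to("p2")                                  # ["p1"]
second_level = [r for p in first_level for r in related_to(p)]  # ["p3"]
result = ["p2"] + first_level + second_level                    # ["p2", "p1", "p3"]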
What I would really recommend is that you read the Elasticsearch documentation on modeling relationships and rethink your use case.
If you want to model relationships like Facebook or Google+, I would advise you to look at a NoSQL graph database.
Note: ideally, data in Elasticsearch is flat, which means denormalized.

Related

Is it OK to have multiple merge steps in an Excel Power Query?

I have data from multiple sources: a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create a query for each data source and then bring them together one step at a time, or, actually, two steps per source: merge, then expand to bring in the fields I want.
This doesn't feel very efficient, and I think maybe I should just be joining everything together in the Data Model. The problem when I tried that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: how can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do, but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views rather than full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns, etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter of which has a comprehensive list of best practices (you can also just search for that term; there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Apache NiFi - Federated Search

My team's been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets, which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache NiFi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push this result into a database that is then queried and fed into an Elasticsearch instance for the application's use.
So, roughly speaking, something like this:
For example's sake, the following data then exists in the result database after the first flow:

Then I would run https://github.com/dedupeio/dedupe over this database table, which would add cluster ids to aid the record linkage, e.g.:

A second flow would then query the result database and feed the results into the Elasticsearch instance used by the application's API for querying, which would use the cluster id to link the duplicates.
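Roughly, the dedupe step itself might look like the sketch below (dedupe 2.x API assumed; the field names and sample records are made up for illustration):

import dedupe

# Records pulled from the merged result table, keyed by row id.
data = {
    1: {"name": "John Doe", "phone": "555-0100"},
    2: {"name": "Jon Doe", "phone": "555-0100"},
}

fields = [
    {"field": "name", "type": "String"},
    {"field": "phone", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactively label candidate pairs
deduper.train()

# Each cluster is (record_ids, confidence_scores); the cluster id would be
# written back onto those rows to link the duplicates.
for cluster_id, (record_ids, scores) in enumerate(deduper.partition(data, 0.5)):
    print(cluster_id, record_ids)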
A couple of questions:
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question: how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, even though the databases will be getting constantly updated, which I'd need to handle. So I'm really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you'd write a script for ExecuteScript (a skeleton is sketched below), unless there is a Java equivalent.
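A minimal ExecuteScript (Jython) skeleton to show the shape of such a script. Note that the dedupeio library itself is CPython with native extensions, so it will not load under NiFi's Jython engine; the naive key-based de-dup below is only a stand-in, with hypothetical field names:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class DedupeCallback(StreamCallback):
    def process(self, inputStream, outputStream):
        # Assumes the flowfile content is a JSON array of records.
        records = json.loads(IOUtils.toString(inputStream, StandardCharsets.UTF_8))
        seen, unique = set(), []
        for record in records:
            key = (record.get("name"), record.get("phone"))  # hypothetical fields
            if key not in seen:
                seen.add(key)
                unique.append(record)
        outputStream.write(json.dumps(unique).encode("utf-8"))

# "session" and REL_SUCCESS are provided by the ExecuteScript processor.
flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, DedupeCallback())
    session.transfer(flowFile, REL_SUCCESS)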
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Caching? Large Query performance for multiple, optional filters

I'm trying to figure out my options for a large query which takes a somewhat long but sensible amount of time considering what it does. It has many joins and can be filtered by up to a predefined number of parameters. Some of these parameter values are predefined (select box) while some are free-form text boxes (unfortunately LIKE with prefixed and suffixed wildcards). The data sets returned are large, and the filter options are very likely to be changed frequently. The order of the result set is also controlled by the user. Additionally, user access must be restricted to only the results the user is authorized to see. This authorization is handled as part of a baseline WHERE clause applied regardless of the chosen filters.
I'm not really looking for query optimization advice, as I've already reviewed the query and examined/optimized the query plan as much as I can given my requirements. I'm more interested in alternative solutions intended for after the query has been optimized. Outside of trying to break the query up into separate smaller bits (which unfortunately is not an acceptable solution), I can only think of two options, but I don't think they are a good fit for this situation:
Caching first came to mind, but I don't think it is viable given how much the filters will vary and how large the returned datasets are.
From my research, options such as Elasticsearch and Solr would not be the right fit either, as the data can be manipulated by multiple programs and these data stores would quickly become outdated.
Are there other options to improve the perceived performance of a search feature with these requirements?
You don't provide enough information about your tables and queries for a concrete solution.
As mentioned in a comment by @jmarkmurphy, DB2 on IBM i does its own "caching". I agree that it's unlikely you'd be able to improve upon it when dealing with large and varied result sets, but you need to make sure you're using what's provided by IBM. For example, if using SQL embedded in RPGLE, make sure you don't have SET OPTION CLOSQLCSR=*ENDMOD. Also check the QAQQINI settings you're using.
You've mentioned using Visual Explain and building some of the requested indexes. That's a good start. But as the queries are run in production, keep an eye on the plan cache, index usage and the advised indexes.
Lastly, you mentioned that you're seeing full table scans due to the use of LIKE '%SOMETHING%'. Again, without details of the columns and data involved, it's a guess as to what may be useful. As suggested in my comment, OmniFind for IBM i may be an improvement.
However, OmniFind is NOT an improved LIKE. OmniFind is designed to handle linguistic searches. From the article i Can … Find a Needle in a Haystack using OmniFind Text Search Server for DB2 for i:
SELECT story_id FROM story_library.story_table
WHERE CONTAINS(story_doc, 'blind mouse') = 1;
This query result will include matches that we’d expect from a typical search engine. The search is case insensitive, and linguistic variations on the search words will be matched. In other words, the previous query will indicate a match for documents that contain “Blind Mice.” In a similar manner, a search for “bad wolves” would return documents that contained “the Big Bad Wolf.”

Adding Advanced Search in ASP.NET MVC 3 / .NET

In a website I am working on, there is an advanced search form with several fields, some of them dynamic, showing or hiding depending on what is selected on the search form.
The data is expected to be big in the database, and records are spread over several tables in a very normalized fashion.
Is there a recommendation on using a third-party search engine, SQL Server full-text search, Lucene.NET, etc., rather than plain SELECT/JOIN queries?
Thank you
Thinking a little outside the box here:
Check out CSLA.NET; using this framework you can create business objects and "denormalise" your search algorithm.
Either way, be sure the database has proper indexes in place for better performance.
On the frontend you're going to need some JavaScript to map which top-level fields show which sub-level fields. It's pretty straightforward.
For the actual search, I would recommend some flavor of Lucene.
You have the option of Lucene.NET, the .NET flavor of Lucene that Stack Overflow uses; Solr, which is arguably easier to set up and get running than Lucene; or the newest kid on the block, Elasticsearch, which aims to be schema-free and infinitely scalable simply by dropping more instances into the cluster.
I have only used Solr myself, and it has a nice .NET client (SolrNet).
First, index the database fields that are important and heavily used.
For the search itself, it is better to use full-text search; when I tried it, the results were very different from when I didn't use full text.
It is also better to put the SELECT/JOIN query in a stored procedure and call that from your program.

What is the easiest way to save a LINQ query for later use?

I have a request for a feature to be able to save a user's search for later.
Right now I'm building LINQ statements on the fly based on what the user has specified.
So I started wondering, is there an easy way for me to simply take the query that the user built, and persist it somewhere, preferably my database, so that I can retrieve it later?
Is there some way of persisting the query as XML or perhaps JSON, and then reconstituting the query later?
Never done this before, but I've had this idea:
Rather than having the query run against your database directly, if you were to have it run against an OData endpoint, you could conceivably extract the URL that is generated as the query string and save that URL for later use. Since OData already has a well-thought-out spec, you would be able to profit from other people's labor; a sketch of the idea follows below.
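For illustration (in Python for brevity, with a made-up endpoint and fields), the saved search would just be the OData query URL itself, stored as text and replayed later:

import requests

# Hypothetical OData query URL captured from the user's search.
saved_search = (
    "https://example.com/odata/Orders"
    "?$filter=Status eq 'Open' and Total gt 100"
    "&$orderby=CreatedDate desc&$top=50"
)

# Store saved_search as plain text in the database; later, re-run it:
response = requests.get(saved_search, headers={"Accept": "application/json"})
rows = response.json()["value"]  # OData v4 wraps result rows in "value"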
I'd go with a domain-specific object here even if such goodies did exist: what happens when you save serialized LINQ queries and your underlying model changes, invalidating everyone's saved queries? Using your own data format should shield you from this to some extent.
Take a look at the Expression class. It will allow you to pre-compile a query, although whether persisting this to the DB for later use would improve performance is questionable.
I'm writing this as I watch this presentation at PDC10. Just after the 1-hour mark, he shows how he's built a JSON serializer for expression trees. You might find that interesting.
