What is the difference between file and allFile in the GraphQL query in Gatsby?

In my GraphiQL explorer, it appears that I can query file or allFile, but I don't really understand what the difference between the two is. In fact, every query appears to be "duplicated" in this manner. Can someone explain, or point me to some documentation that explains the difference and when I should use one over the other?

allFile (notice the all prefix) returns the collection of every node of type File that Gatsby has inferred through gatsby-source-filesystem, while file returns a single node of that type, which you normally select with a filter.
In other words, allFile exposes all the files in your project (at least the ones Gatsby is aware of) as an array of nodes, while file points to one specific node. To pick that node out from the rest, you usually have to filter for it.
The same reasoning applies to allMdx and mdx, allSite and site, etc.
For example, take allMdx (nodes created by gatsby-plugin-mdx): let's say you have a blog project where you store your posts in MDX files. In your gatsby-node.js you query allMdx to get all posts, then loop through the results and create a page for each post (using the createPage method). In the post template (the template that each specific post will use), instead of querying allMdx, you use mdx, because it points to a single mdx node.
Of course, you can still use allMdx in your template, but it's not the optimal query there, since you already have one specific mdx node that is your post, filtered by some unique value such as the slug or the id.
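The flow described above can be sketched roughly as follows. This is a minimal sketch, not a definitive setup: the template path, the /posts/ URL prefix and the use of the slug field are assumptions (slug is only exposed by some gatsby-plugin-mdx versions), so adjust them to your project.

```javascript
// gatsby-node.js (sketch): create one page per MDX post.
const path = require("path");

// Hypothetical template location; adjust to your project layout.
const postTemplate = path.resolve("./src/templates/post.js");

const createPages = async ({ graphql, actions }) => {
  // allMdx returns the whole collection, which suits the "loop over every post" step.
  const result = await graphql(`
    query {
      allMdx {
        nodes {
          id
          slug
        }
      }
    }
  `);
  if (result.errors) {
    throw new Error("allMdx query failed");
  }

  result.data.allMdx.nodes.forEach((node) => {
    actions.createPage({
      path: `/posts/${node.slug}`, // assumed URL scheme
      component: postTemplate,
      context: { id: node.id }, // available as $id in the template query
    });
  });
};

exports.createPages = createPages;

// Query for the post template: mdx returns the single node matching the filter.
// In a real template this string would be wrapped in Gatsby's graphql tag and exported.
const postQuery = `
  query ($id: String!) {
    mdx(id: { eq: $id }) {
      id
      body
    }
  }
`;
```

The page context carries the id, so the template query can filter mdx down to exactly one node instead of fetching the whole allMdx collection again.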

Related

How do I query nodes that are missing a child of a specific type?

I'm new to graphql, and trying to understand how I might fill this use case.
I have thousands of nodes of a specific type/schema.
Some of these nodes have children, some of them don't.
I'd like to query all the nodes, and return only the ones that don't have children.
This might get more specific in the future, where I'd like to query only nodes that don't have children of a specific type.
Is that even possible?
I've seen plenty of query examples that show how to select children nodes, or nested nodes + fields, or nodes with specific values. It's an easy thing with SQL, I'm just having trouble understanding how it's done with graphql.
Thoughts?
As Daniel Rearden said, there is no built-in way in GraphQL to filter or sort the results of a query; any filtering has to be offered by the API's own schema. We have a few filters in our Gentics Mesh GraphQL API, but it is currently not possible to create a filter involving another list of items (children in your case).
I've added your case to the issue on GitHub: https://github.com/gentics/mesh/issues/27
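Until the API offers such a filter, one possible workaround is to request each node together with its children and filter in application code. The sketch below assumes a hypothetical response shape (a children array on each node, a type field on each child); it is not the actual Gentics Mesh schema, and with thousands of nodes it means transferring all of them.

```javascript
// Hypothetical response shape: each node has a `children` array,
// and each child exposes its type under `type`.
function withoutChildren(nodes, type) {
  return nodes.filter((node) => {
    const children = node.children || [];
    // No type given: keep nodes with no children at all.
    // Type given: keep nodes that have no child of that type.
    return type === undefined
      ? children.length === 0
      : !children.some((child) => child.type === type);
  });
}

const nodes = [
  { id: "n1", children: [] },
  { id: "n2", children: [{ id: "c1", type: "image" }] },
  { id: "n3", children: [{ id: "c2", type: "folder" }] },
];

console.log(withoutChildren(nodes).map((n) => n.id));          // [ 'n1' ]
console.log(withoutChildren(nodes, "image").map((n) => n.id)); // [ 'n1', 'n3' ]
```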

elasticsearch copy field when indexing

I would like to create a one-to-many relationship for the purpose of aggregations.
The "join" will be according to a field called "common_id":
When I create the first document belonging to a group, I would like to use its flakeId (its _id) as the common_id.
When adding other documents belonging to the same group, I would like to explicitly set the common_id to the same value as the first document I added. This can be done by my app, since my application will know the common_id of the first element.
My problem is with the first document:
How can I tell Elasticsearch to copy the _id into common_id in a single call? (I know I can do it using an update script, or using two calls, one for index and one for update, but this requires two requests instead of one.)
I would like a simple syntax for this.
thanks

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string to match parent entries. Parent-child relations are described only through a field containing the ids of each entry's ancestors.
The question is the following: if both the search string and the parent search string are provided, is there a way to know, before executing the queries, which query should be executed first in order to return results faster?
For example, it could be that a specific parent search results in only 2 docs/parent entries, and then we can fetch all children matching the search string. In that case we should execute the parent query first and then the entry query.
One option would be to get the count of both queries and then execute first the one with the smallest count, but isn't this solution worse, since the queries are going to be executed twice? Once for the count and once for the actual query.
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and child entries have the exact same structure and reside in the same index. We cannot have different indices since the parent-child relation can be nested 10 levels deep, so an entry can be both a parent and a child. An entry looks more or less like:
{
  id: "e32452365321",
  name: "name",
  type: "type",
  ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you to upgrade your Elasticsearch version, if possible. A lot has changed since 1.7 and, to be honest, I can't tell whether everything in the article linked below is valid for such an old version (probably it isn't).
But to your actual question: hopefully I am understanding you correctly, but you are trying to estimate how costly a query is for Elasticsearch? Well, you don't have to. If you provide all 'queries' in one combined (compound) query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating the score takes time. So if sorting is not based on the Elasticsearch _score, you want to use boolean filter queries. The same applies if you want to sort only by the _score of parent matches: in that case you could put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self-referential parent-child relations are unfortunately not supported by Elasticsearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-side joins.
So yes, in general you want to send the second query with the fewest possible ids/terms. Getting counts for both queries first is not as bad as you might think, because the results are most likely still cached, but does it actually help? If you go from child to parent, you would have to count the distinct ancestors (field values), not the number of matching documents.
I would argue, that the most expensive operation is very often fetching result source from disk. So whichever way you go, you probably should only fetch what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or, fetch only the ancestors field of child matches, and use an id filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience in comparing speed of those approaches. My guess would be, that an id filter might be faster in general. But that's just a guess...
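As a rough illustration of the first option (parent first), the two request bodies could be built like this. This is only a sketch: it assumes the Elasticsearch 1.x query DSL (the filtered query was replaced by bool/filter in later versions), assumes the ancestors field is indexed so that each id is a single term, and the function names and the size cap are hypothetical.

```javascript
// Step 1: match the parents and return only their ids.
// `_source: false` avoids loading the stored documents.
function parentIdsRequest(parentSearchString, size) {
  return {
    size, // hypothetical cap; the default page size would truncate the id list
    _source: false,
    query: { query_string: { query: parentSearchString } },
  };
}

// Step 2: match the children, restricted to the parent ids from step 1.
// The terms filter is not scored, so it only narrows the candidate set.
function childrenRequest(searchString, parentIds) {
  return {
    query: {
      filtered: {
        query: { query_string: { query: searchString } },
        filter: { terms: { ancestors: parentIds } },
      },
    },
  };
}

const step1 = parentIdsRequest("name:parentTest AND NOT type:BLOCK", 1000);
// In a real application these ids would come from the `_id` of each hit in step 1.
const step2 = childrenRequest("type:BLOCK AND name:test", ["id1", "id2"]);
console.log(JSON.stringify(step1));
console.log(JSON.stringify(step2));
```

The child-first variant is symmetric: fetch only the ancestors field in the first request and use an ids filter in the second.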

How to implement Tag search?

I've designed a news hub system which reads RSS links and stores whole news items in the database. Now I want to implement a search system using tags. Each news item has its own tags. There are lots of ways to implement this, but I don't know which is the most common one with the best performance. Currently I'm using Elasticsearch as the database with a multiple-keyword search. Which of these is best?
1- store the tags in a list, or in a string with a separator, and search among them?
2- work like a relational system: have a table of tags and a table of news tags with one record per tag of a news item (so 5 records for a news item with 5 tags)
3- another algorithm which I don't know
Seems like you want something like the inverted index
This is an index, that for each term (hashtag in your case) holds a list of document ids which contain this hashtag.
For example, if you have 3 documents: d1,d2,d3 with the hash tags:
d1: #tag1, #tag2
d2: #tag3
d3: #tag3, #tag2
The inverted index will be:
#tag1: d1
#tag2: d1,d3
#tag3: d2,d3
With the inverted index it is fairly easy to find all documents that contain a certain term (hashtag in your case), by simply going over the list that is attached to this term.
This data structure is also very efficient for union (OR queries) and intersection (AND queries).
It is very popular in information retrieval for full-text search and is also often used in semi-structured search.
For more information, you can read about information retrieval in general. Manning et al.'s Introduction to Information Retrieval presents this data structure in the book's first chapter.
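To make the idea concrete, here is a small in-memory sketch of an inverted index over the three example documents, with OR and AND lookups. It only illustrates the data structure; Elasticsearch maintains its own index internally, and the function names here are made up for the example.

```javascript
// Build an inverted index: tag -> Set of ids of the documents carrying that tag.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const [docId, tags] of Object.entries(docs)) {
    for (const tag of tags) {
      if (!index.has(tag)) index.set(tag, new Set());
      index.get(tag).add(docId);
    }
  }
  return index;
}

// OR query: documents that carry at least one of the tags.
function union(index, tags) {
  const result = new Set();
  for (const tag of tags) {
    for (const docId of index.get(tag) || []) result.add(docId);
  }
  return [...result].sort();
}

// AND query: documents that carry every one of the tags.
function intersection(index, tags) {
  if (tags.length === 0) return [];
  const [first, ...rest] = tags.map((tag) => index.get(tag) || new Set());
  return [...first].filter((docId) => rest.every((set) => set.has(docId))).sort();
}

const docs = { d1: ["#tag1", "#tag2"], d2: ["#tag3"], d3: ["#tag3", "#tag2"] };
const index = buildInvertedIndex(docs);

console.log([...index.get("#tag2")]);                 // [ 'd1', 'd3' ]
console.log(union(index, ["#tag1", "#tag3"]));        // [ 'd1', 'd2', 'd3' ]
console.log(intersection(index, ["#tag2", "#tag3"])); // [ 'd3' ]
```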
ElasticSearch will handle that very well and you have multiple ways of implementing that behavior.
What you want is a parent child relationship between a news article (parent) and its tags (children).
Depending on whether you need to update the hashtags after indexing your news articles or not, you could go with storing them in the news article or as separate documents pointing to the news article document as their parent.
See more details here: http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
You mentioned a choice between storing the tags as a list or as a comma-separated string. Go with the list, as that is more idiomatic and Elasticsearch handles JSON arrays natively (with a string, you would have to analyze it and turn it into a list of tokens anyway).

Passing parameters to a couchbase view

I'm looking to search for a particular JSON document in a bucket and I don't know its document ID, all I know is the value of one of the sub-keys. I've looked through the API documentation but still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey but this feels wrong - could you still end up with multiple documents being returned? Would be nice to be able to pass a parameter to the view that an exact match could be performed on. Also how would we handle multi-dimensional searches? i.e. name and category.
I'd like to do as much of the filtering as possible on the couchbase instance and ideally narrow it down to one record rather than having to filter when it comes back to the App Tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with couchbase to filter but if I've read the docs correctly this filtering is still done client-side but at least if we could narrow down the returned dataset to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
So you are correct on one point: you need to create a view (an index, indeed) to be able to query on the content of the JSON document.
So in your case you have to create a view with this kind of code:
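function (doc, meta) {
  // Typing your documents is just a good practice; "yourtype" is a placeholder.
  if (doc.type == "yourtype") {
    // The key is the value you will query on; no value is needed in the index.
    emit(doc.name, null);
  }
}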
So this will create an index, distributed across all the nodes of your cluster, that you can now use in your application. You can point to a specific value using the "key" parameter.
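As a hedged illustration of the "key" parameter, the sketch below builds the REST URL for an exact-match view query. The host, bucket, design document and view names are placeholders, and the helper function is made up for this example. For a search on two dimensions, one common approach is to emit a composite key such as [doc.name, doc.category] from the map function and pass an array as the key.

```javascript
// Build a view query URL that asks for an exact key match.
function viewUrl({ host, bucket, designDoc, view, key }) {
  // View keys are JSON values, so a string key is sent with its quotes,
  // and a composite key is sent as a JSON array.
  const encodedKey = encodeURIComponent(JSON.stringify(key));
  return `http://${host}:8092/${bucket}/_design/${designDoc}/_view/${view}?key=${encodedKey}`;
}

// Exact match on the name emitted by the view.
const byName = viewUrl({
  host: "localhost",
  bucket: "default",
  designDoc: "docs",
  view: "by_name",
  key: "some-name",
});

// Exact match on a composite [name, category] key (requires a view that emits that pair).
const byNameAndCategory = viewUrl({
  host: "localhost",
  bucket: "default",
  designDoc: "docs",
  view: "by_name_and_category",
  key: ["some-name", "books"],
});

console.log(byName);
console.log(byNameAndCategory);
```

An exact key can still return several rows if several documents share the same name, so the application should be prepared for more than one result.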
