GraphQL - limiting the number of subqueries to prevent Batching Attack

I want to understand if there is a mechanism to limit the number of subqueries within the GraphQL query to mitigate against the GraphQL batching attack. (It's possible to send more than one mutation request per HTTP request because of the GraphQL batching feature)
Eg.
{
  first: changeTheNumber(newNumber: 1) {
    theNumber
  }
  second: changeTheNumber(newNumber: 1) {
    theNumber
  }
  third: changeTheNumber(newNumber: 1) {
    theNumber
  }
}
I'm using graphql-java-kickstart.

In graphql-java there are two instrumentations that can check the "depth" or "complexity" of your query:
MaxQueryDepthInstrumentation
MaxQueryComplexityInstrumentation
The first one checks the depth of a query (how many levels are requested), the second one counts the fields. You can configure the expected max depth/complexity, and if a query is deeper/more complex than your configured number it is rejected.
You can customize the behaviour of the MaxQueryComplexityInstrumentation so that some fields count as "more complex" than others (for example you could say a plain string field is less complex than a field that requires its own database request when processed).
Here is an example that uses a custom directive (Complexity) in a schema description to determine the complexity of a field.
If you only want to avoid that a concrete field is requested more than once, you could write your own Instrumentation or use the DataFetchingEnvironment in your resolver function to count the occurrences of that field in the current query (getSelectionSet() gives access to all fields contained in the current query).
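For example, with plain graphql-java both instrumentations can be registered when the GraphQL object is built. This is only a minimal sketch: the limits of 10 and 50 are arbitrary, and the schema parameter stands in for however you already construct your GraphQLSchema. With graphql-java-kickstart's Spring Boot starter, exposing the Instrumentation as a bean should, as far as I know, be enough for it to be picked up, but check your version's docs.

import java.util.Arrays;

import graphql.GraphQL;
import graphql.analysis.MaxQueryComplexityInstrumentation;
import graphql.analysis.MaxQueryDepthInstrumentation;
import graphql.execution.instrumentation.ChainedInstrumentation;
import graphql.schema.GraphQLSchema;

public class ProtectedGraphQLFactory {

    // 'schema' is whatever GraphQLSchema your application already builds
    public static GraphQL create(GraphQLSchema schema) {
        return GraphQL.newGraphQL(schema)
                .instrumentation(new ChainedInstrumentation(Arrays.asList(
                        new MaxQueryDepthInstrumentation(10),      // reject queries nested deeper than 10 levels
                        new MaxQueryComplexityInstrumentation(50)  // reject queries with more than 50 fields (each field counts 1 by default)
                )))
                .build();
    }
}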

Related

How to retrieve all documents (size greater than 10000) in an elasticsearch index

I am trying to get all documents in an index. I tried the following:
1) Getting the total number of records first and then setting the /_search?size= parameter - doesn't work, as the size parameter is restricted to 10000.
2) Paginating by making multiple calls with the parameters '?size=1000&from=9000' - worked while 'from' was < 9000, but once it exceeds 9000 I again get this size restriction error:
"Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting"
So how can I retrieve all documents in the index? I read some answers suggesting to use the scroll api, and even the documentation states -
"While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database."
But I couldn't find any sample query to get all records in a single request.
I have a total of 388794 documents in the index.
Also note, this is a one time call so I am not worried about performance concerns.
Figured out the solution - the scroll API is the proper way to do it. Here's how it works:
In the first call to fetch the documents, a size (say 1000) can be provided, along with the scroll parameter specifying how long the search context should be kept alive (here 1m).
POST /index/type/_search?scroll=1m
{
  "size": 1000,
  "query": {....
  }
}
For all subsequent calls we can use the scroll_id returned in the response of the first call to get the next chunk of records.
POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "DnF1ZXJ5VGhIOLSJJKSVNNZZND344D123RRRBNMBBNNN==="
}
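If you would rather drive the whole loop from code instead of raw HTTP, the same pattern can be sketched with the Elasticsearch Java high-level REST client. This is a hedged sketch, not an official recipe: the index name "index", the page size of 1000, and the localhost:9200 host are placeholders, and the RestHighLevelClient API shown is the 6.x/7.x one (in newer clients TimeValue lives in org.elasticsearch.core), so adjust for your client version.

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ScrollAllDocuments {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // first call: open a scroll context that stays alive for 1 minute
            SearchRequest searchRequest = new SearchRequest("index");
            searchRequest.scroll(TimeValue.timeValueMinutes(1L));
            searchRequest.source(new SearchSourceBuilder()
                    .size(1000)
                    .query(QueryBuilders.matchAllQuery()));

            SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();
            SearchHit[] hits = response.getHits().getHits();

            while (hits != null && hits.length > 0) {
                // process the current chunk of (up to) 1000 hits here

                // subsequent calls: pass the scroll_id back to get the next chunk
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(TimeValue.timeValueMinutes(1L));
                response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
                hits = response.getHits().getHits();
            }

            // free the search context once everything has been read
            ClearScrollRequest clearScroll = new ClearScrollRequest();
            clearScroll.addScrollId(scrollId);
            client.clearScroll(clearScroll, RequestOptions.DEFAULT);
        }
    }
}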

GraphQL + Relay Connection Optimization

Using Relay + GraphQL (graphql-relay-js) connections and trying to determine the best way to optimize queries to the data source etc.
Everything is working, though inefficient when connection results are sliced. In the below query example, the resolver on item will obtain 200+ records for sale 727506341339, when in reality we only need 1 to be returned.
I should note that in order to fulfill this request we actually make two db queries:
1. Obtain all items ids associated with a sale
2. Obtain item data for each item id.
In testing and reviewing of the graphql-relay-js src, it looks like the slice happens on the final connection resolver.
Is there a method provided, short of nesting connections or mutating the sliced results of connectionFromArray, that would allow us to slice the results provided to the connection (item ids) and then, in the connection resolver, fetch the item details for only the already-sliced id result set? This would optimize the second query so we would only need to query for 1 item's details, not all items'.
Obviously we can implement something custom or nest connections, though it seems this is something that would be available, so I feel like I am missing something here...
Example Query:
query ItemBySaleQuery {
  viewer {
    item(sale: 727506341339) {
      items(first: 1) {
        edges {
          node {
            dateDisplay,
            title
          }
        }
      }
    }
  }
}
Unfortunately the solution is not documented in the graphql-relay-js lib...
Connections can use resolveNode functions to work directly on an edge node. Example: https://github.com/graphql/graphql-relay-js/blob/997e06993ed04bfc38ef4809a645d12c27c321b8/src/connection/tests/connection.js#L64

Sorting by a non-key (arbitrary) field in CouchDB

I have a fairly large CouchDB database (approximately 3 million documents). I have various view functions returning slices of the data that can't be modified (or at least, should only be modified as a last resort).
I need the ability to sort on an arbitrary field for reporting purposes. For smaller DBs, I return the entire object, json_parse it in our PHP backend, then sort there. However, we're often getting Out Of Memory errors when doing this on our largest DBs.
After some research, I'm leaning towards accessing a sort key (via URL parameter) in a list function and doing the sort there. This is an idea I've stolen from here. Excerpt:
function(head, req) {
  var row
  var rows = []
  while (row = getRow()) {
    rows.push(row)
  }
  rows.sort(function(a, b) {
    return b.value - a.value
  })
  send(JSON.stringify({"rows": rows}))
}
It seems to be working for smaller DBs, but it still needs a lot of work to be production ready.
Is this:
a) a good solution?
b) going to work with 3, 5, or 10 million rows?
You can't avoid loading everything into memory by using a list function. So with enough data, eventually, you'll get an out of memory error, just as you're getting with PHP.
If you can live within the memory constraints, it's a reasonable solution, with some advantages.
Otherwise, investigate using something like lucene, elasticsearch, or Cloudant Search (clouseau & dreyfus).
In our environment, we have more than 5 million records. Our CouchDB is designed such that each and every document has some specific fields which distinguish it from other categories of documents.
For example, there are a number of documents with the field DocumentType "User" or DocumentType "XXX".
This DocumentType field allows us to sort the various documents based on their categories.
So if you have 3 million docs and around 10 categories, each category will have about 300k docs.
Now you can design the system such that you always pass the DocId you need to Couch. That way it will be faster.
So the view's map function can be like:
function(doc) {
  if (doc.DocumentType === 'XXX' && doc._id) {
    emit(doc.FieldYouWant, doc._id)
  }
}
This is how our backend is designed in production.

Give advantage to search by phrase in sort SOLR

Search query which I send to SOLR is:
?q=iphone 4s&sort=sold desc
By default the search works great, but the problem appears when I want to sort results by some field, e.g. sold (the number of products sold).
SOLR finds all the results which have: (iphone 4s) or (iphone) or (4s)
So, when I apply the sort by the field 'sold', the first result is "iPhone 3GS...", which is a problem.
I need the results matching the phrase ("iphone 4s") first and then the rest of the results - all sorted by sold.
So, the questions are:
Is it possible to have query like this, and how?
q=iphone 4s&sort={some algorithm for phrase results first} desc, sold desc
Or, can I perform this by setting up query analyzer and how?
At the moment this is solved by sending 2 requests to SOLR:
first with the phrase "iphone 4s" and, if this returns 0 results,
I perform a second request without the phrase - only: iphone 4s.
If sorting by score, id, or field values is not sufficient, Lucene lets you implement a custom sorting mechanism by providing your own subclass of the FieldComparatorSource abstract base class.
Within that custom sort logic, you can implement whatever realizes your requirements.
Example Java code:
if (modelNum1.equals(modelNum2)) {
    // return based on number of units sold
} else {
    // ALWAYS return a value such that the preferred model beats others
}
DISCLAIMER: This may lead to maintenance problems as you will have to change the logic when a new phone model arrives.
Steps:
1) The Sort object accepts a FieldComparatorSource instance during instantiation.
2) Extend FieldComparatorSource.
3) Load the required field information that participates in sorting via the FieldCache, in setNextReader().
4) Override FieldComparatorSource.newComparator() to return your custom FieldComparator.
5) In the method FieldComparator.compare(slot1DocId, slot2DocId), you may include your custom logic by accessing the corresponding field information, via the loaded FieldCache, using the doc ids passed in.
Incorporating Lucene code into Solr as a plug-in should not trouble you.
EDIT:
You cannot use a space in that function; the term must be a single token (no spaces).
As of Solr3.1, sorting can also be done on arbitrary function queries
(as in FunctionQuery) that produce a single value per document.
So, I will use the function termfreq in the sort.
termfreq(field,term) returns the number of times the term appears in
the field for that document.
Search query will be
q=iphone 4s&sort=termfreq(product_name,"iphone 4s") desc, sold desc
Note: The termfreq function is available from Solr 4.0 onwards.

Elastic Search limit results

In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
  "query": {
    "fuzzy_like_this_field": {
      "body": {
        "like_text": "REALLY LONG (snip) TEXT HERE",
        "max_query_terms": 1,
        "min_similarity": 0.95,
        "ignore_tf": true
      }
    }
  }
}
When I run this search, it takes a few seconds, whereas mysql can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only returns 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is being returned after the query is done processing. This means the "size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct that the query is being run in its entirety. Queries by default return data sorted by score, so your query is going to score each document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you behaviour similar to what you're looking for:
"A limit filter limits the number of documents (per shard) to execute on."
To replicate mysql field='foo', try using a term filter. You should use filters when you don't care about scoring; they are faster and cacheable.
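As a rough illustration of the term-filter approach with the Java high-level REST client (a hedged sketch: "table", "field", and "foo" simply mirror the MySQL example; the old filtered/limit filters are expressed here as a bool query filter clause plus size, and terminate_after is an optional per-shard early-termination hint rather than anything from the answer above):

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class TermFilterLimit {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            SearchSourceBuilder source = new SearchSourceBuilder()
                    // filter context: exact term match, no scoring, cacheable
                    .query(QueryBuilders.boolQuery()
                            .filter(QueryBuilders.termQuery("field", "foo")))
                    .size(5)             // return at most 5 hits, like LIMIT 5
                    .terminateAfter(5);  // stop collecting on each shard after 5 matching docs

            SearchResponse response = client.search(
                    new SearchRequest("table").source(source), RequestOptions.DEFAULT);
            System.out.println(response.getHits().getTotalHits());
        }
    }
}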
