How can I add heterogeneous data to Elasticsearch? - elasticsearch

I am trying to add heterogeneous data (i.e. of different "types") to Elasticsearch. Each (top-level) object contains a user's settings for an application. A simplified example is:
{
'name':'test',
'settings': [
{
'key':'color',
'value':'blue'
},
{
'key':'isTestingMode',
'value':true
},
{
'key':'visibleColumns',
'value': [
'column1',
'column3',
'column4',
]
},
...
...
}
When I try to add this, the POST fails with a MapperParsingException. Searching around, it seems that this is because the 'value' field has different types.
Is there any way to just store arbitrary data like this?

This is not possible.
A mapping is defined per field, and mappings are not array-aware.
This means that you can keep settings.value as a string or as an array, but not both.
An easy tweak would be to define every value as an array:
{
'name':'test',
'settings': [
{
'key':'color',
'value': [ 'blue' ]
},
{
'key':'isTestingMode',
'value': [ true ]
},
{
'key':'visibleColumns',
'value': [
'column1',
'column3',
'column4',
]
},
...
...
}
If that is not acceptable, another idea would be to apply a source transform that performs this normalization on the settings.value field before it is indexed. This way, the source is kept as it is and you still get what you want.
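The normalization described above could also be done on the client before the document is sent. The following is a minimal Python sketch, not an Elasticsearch feature: the function names are hypothetical, and it assumes that turning every value into a list of strings is acceptable for your searches.

```python
import copy


def _to_text(item):
    # Render booleans the JSON way ("true"/"false") rather than Python's "True".
    if isinstance(item, bool):
        return "true" if item else "false"
    return str(item)


def normalize_settings(doc):
    """Return a copy of doc in which every settings[].value is a list of strings."""
    normalized = copy.deepcopy(doc)
    for setting in normalized.get("settings", []):
        if "value" not in setting:
            continue
        value = setting["value"]
        # Wrap scalars so that every value has the same shape (a list).
        items = value if isinstance(value, list) else [value]
        # Assumption: stringifying is acceptable, so all items share one type.
        setting["value"] = [_to_text(item) for item in items]
    return normalized


doc = {
    "name": "test",
    "settings": [
        {"key": "color", "value": "blue"},
        {"key": "isTestingMode", "value": True},
        {"key": "visibleColumns", "value": ["column1", "column3", "column4"]},
    ],
}
normalized = normalize_settings(doc)
```

The original document is left untouched, so it can still be stored or logged as it was received.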

Related

Query unknown data structure in GraphQL

I just started to work with GraphQL and I am setting up a server with webonyx/graphql-php at the moment. Since a GraphQL query already has to contain the resulting data structure, I am not quite sure how to get dynamic data. Assume that I query content which consists of different element types, and my final structure should look like this:
{
"data": {
"dataset": {
"uuid": "abc...",
"insertDate": "2018-05-04T12:12:12Z",
// other metadata
"content": [
{
"type": "headline",
"text": "I am a headline"
},
{
"type": "image",
"src": "http://...",
"alt": "I am an image"
},
{
"type": "review",
"rating": 3,
"comment": "I am a review"
},
{
"type": "headline",
"text": "I am another headline"
}
// other content elements
]
}
}
}
How could I write a query for this example?
{
dataset {
uuid
insertDate
content {
????
}
}
}
And what would a type definition for the content section look like, in particular with graphql-php? There is a defined set of element types (headline, image, review, and many more), but their order and number are unknown, and they have only one field, type, in common. While writing the query in my frontend, I don't know anything about the content structure. I couldn't find any similar example online, so I am not sure whether it is even possible to use GraphQL for this use case. As additional information, I always want to query the whole content section, never a single element or field.
When you're returning an array of Object types, but each individual item could be one of any number of different Object types, you can use either an Interface or a Union. We can use an Interface here since all the implementing types share a field (type).
use GraphQL\Type\Definition\InterfaceType;
use GraphQL\Type\Definition\Type;
$content = new InterfaceType([
'name' => 'Content',
'description' => 'Available content',
'fields' => [
'type' => [
'type' => Type::nonNull(Type::string()),
'description' => 'The type of content',
]
],
'resolveType' => function ($value) {
if ($value->type === 'headline') {
return MyTypes::headline();
} elseif ($value->type === 'image') {
return MyTypes::image();
} # and so on
}
]);
Types that implement the Interface need to do so explicitly in their definition:
use GraphQL\Type\Definition\ObjectType;
$headline = new ObjectType([
# other properties
'interfaces' => [
$content
]
]);
Now if you change the type of the content field to a List of content, you can query only fields specific to each implementing type by using inline fragments:
query GetDataset {
dataset {
uuid
insertDate
content {
type # this field is shared, so it doesn't need an inline fragment
... on Headline {
text
}
... on Image {
src
alt
}
# and so on
}
}
}
Please see the docs for more details.
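The if/elseif chain in resolveType can also be thought of as a lookup table from the shared type field to a concrete type. The sketch below shows that idea in plain Python without any GraphQL library; the table CONTENT_TYPES and the function name are hypothetical, and the type names are taken from the inline fragments in the query above.

```python
# Hypothetical mapping from the shared "type" field to a concrete type name.
CONTENT_TYPES = {
    "headline": "Headline",
    "image": "Image",
    "review": "Review",
}


def resolve_content_type(element):
    """Pick the concrete type name for one content element."""
    try:
        return CONTENT_TYPES[element["type"]]
    except KeyError:
        # Fail loudly instead of silently returning nothing for unknown elements.
        raise ValueError(f"Unknown content type: {element.get('type')!r}") from None


resolved = resolve_content_type({"type": "image", "src": "http://...", "alt": "I am an image"})
```

Adding a new element type then only requires a new table entry rather than another branch.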

Global secondary index: Number of projected attributes in all indexes exceeds limit of 20

I'm trying to create a GSI on a table with 30 columns (using the Ruby SDK). I use projection_type: 'ALL', but I still get the following exception:
Aws::DynamoDB::Errors::ValidationException: One or more parameter values were invalid: Number of projected attributes in all indexes exceeds limit of 20, number of projected attributes:30
As far as I read, this should only happen when using the INCLUDE projection_type:
This limit does not apply for secondary indexes with a ProjectionType of KEYS_ONLY or ALL.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-secondary-indexes
The create statement looks something like:
connection.update_table({
table_name: "my-table", # required
attribute_definitions: [
{
attribute_name: "indexDate",
attribute_type: "S",
},
{
attribute_name: "createdAt",
attribute_type: "S",
},
],
global_secondary_index_updates: [
{
create: {
index_name: "my-new-index", # required
key_schema: [
{
attribute_name: "indexDate",
key_type: "HASH",
},
{
attribute_name: "createdAt",
key_type: "RANGE",
},
],
projection: { # required
projection_type: "ALL"
},
provisioned_throughput: { # required
read_capacity_units: 10, # required
write_capacity_units: 300, # required
}
}
}
]
})
It turned out that the projected-attribute limit applies across all GSIs on the table. I had another index that caused this one to fail. I deleted that one, and then it worked.
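To see which existing index is using up the budget, the projected attributes can be summed over all indexes. The following Python sketch works on a plain dictionary shaped like the Table section of a DescribeTable response; the function name and the index name "old-index" are hypothetical, and the limit of 20 is simply the figure quoted in the error message above, so it may differ in your account or SDK version.

```python
def count_projected_attributes(table_description):
    """Return {index_name: number of explicitly projected non-key attributes}."""
    counts = {}
    for section in ("GlobalSecondaryIndexes", "LocalSecondaryIndexes"):
        for index in table_description.get(section, []):
            projection = index.get("Projection", {})
            # Only INCLUDE projections list attributes explicitly;
            # KEYS_ONLY and ALL do not count towards the limit.
            if projection.get("ProjectionType") == "INCLUDE":
                counts[index["IndexName"]] = len(projection.get("NonKeyAttributes", []))
            else:
                counts[index["IndexName"]] = 0
    return counts


# Example input: one older INCLUDE index plus the new ALL index.
table = {
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "old-index",  # hypothetical name
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": [f"attr{i}" for i in range(30)],
            },
        },
        {"IndexName": "my-new-index", "Projection": {"ProjectionType": "ALL"}},
    ]
}

LIMIT = 20  # taken from the error message; treat as an assumption
counts = count_projected_attributes(table)
over_limit = sum(counts.values()) > LIMIT
```

In this example the older index alone accounts for all 30 projected attributes, which matches the situation described above.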

How to use two conditions in one array?

I have a list of tasks stored in Mongo, like below:
{
"name": "task1",
"requiredOS": [
{
"name": "linux",
"version": [
"6.0"
]
},
{
"name": "windows",
"version": [
"2008",
"2008R2"
]
}
],
"requiredSW": [
{
"name": "MySQL",
"version": [
"1.0"
]
}
]
}
My goal is to filter the tasks by OS and software. For example, the user gives me the filter condition below:
{
"keyword": [
{
"OS": [
{
"name": "linux",
"version": [
"6.0"
]
},
{
"name": "windows",
"version": [
"2008"
]
}
]
},
{
"SW": [ ]
}
]
}
I need to find all the tasks that can run on both Windows 2008 and Linux 6.0 by searching the "requiredOS" and "requiredSW" fields. As you can see, the search condition is an array (the "OS" part). I have trouble using an array as a search condition. I expect the query to return a list of tasks that satisfy the condition.
A further challenge is that I need to integrate the query into spring-data using @Query, so the query must be parameterized.
Can anyone give me a hand?
I have tried a query, but it returns nothing. My intention is to use $all to combine the two conditions and then use $elemMatch to search the "requiredOS" field:
{"requiredOS":{"$elemMatch":{"$all":[{"name":"linux","version":"5.0"},{"name":"windows","version":"2008"}]}}}
If I understood correctly what you are trying to do, you need to use the $elemMatch operator:
http://docs.mongodb.org/manual/reference/operator/query/elemMatch/#op._S_elemMatch
Taking your example, the query should look like:
@Query("{'requiredOS':{$elemMatch:{name:'linux', version:'7.0'},$elemMatch:{name:'windows', version:'2008'}}}")
It matches the document you provided.
You basically need to translate your "parameters" into a query form that produces results, rather than passing them straight through. Here is an example "translation" where the "empty" array is considered to match "anything".
The other conditions cannot "literally" go straight through either, because in that form MongoDB treats them as an "exact match". What you want is a combination of the $elemMatch operator for the multiple array conditions and the $and operator, which combines conditions on the same property.
This is a bit longer than using $all, but it is essentially the same thing, because $all is a "shortened" form of $and, just as $in is to $or:
db.collection.find({
"$and": [
{
"requiredOS": {
"$elemMatch": {
"name": "linux",
"version": "6.0"
}
}
},
{
"requiredOS": {
"$elemMatch": {
"name": "windows",
"version": "2008"
}
}
}
]
})
So it is just a matter of transforming the properties in the request to actually match the required query form.
Building this into a query object can be done in a number of ways, such as using the Query builder:
DBObject query = new Query(
new Criteria().andOperator(
Criteria.where("requiredOS").elemMatch(
Criteria.where("name").is("linux").and("version").is("6.0")
),
Criteria.where("requiredOS").elemMatch(
Criteria.where("name").is("windows").and("version").is("2008")
)
)
).getQueryObject();
You can then pass this to a mongoOperations method, or to any other method that accepts the query object.
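The "translation" step itself can be sketched independently of the driver. The Python function below turns the filter document from the question into the $and / $elemMatch form shown above; the function name and the OS/SW to field mapping are assumptions for illustration, and it assumes that every listed version must be supported by the task.

```python
# Assumed mapping from filter keys to document fields.
FIELD_FOR_KEY = {"OS": "requiredOS", "SW": "requiredSW"}


def build_task_query(filter_condition):
    """Translate the user's filter into a MongoDB query document."""
    clauses = []
    for group in filter_condition.get("keyword", []):
        for key, requirements in group.items():
            field = FIELD_FOR_KEY[key]
            # An empty list adds no clause, so it matches anything.
            for requirement in requirements:
                for version in requirement["version"]:
                    clauses.append({
                        field: {
                            "$elemMatch": {
                                "name": requirement["name"],
                                "version": version,
                            }
                        }
                    })
    # $and must not be given an empty list, so fall back to "match all".
    return {"$and": clauses} if clauses else {}


user_filter = {
    "keyword": [
        {"OS": [
            {"name": "linux", "version": ["6.0"]},
            {"name": "windows", "version": ["2008"]},
        ]},
        {"SW": []},
    ]
}
query = build_task_query(user_filter)
```

For the filter in the question this yields the same document as the db.collection.find example, which can then be handed to whichever driver or query builder is in use.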

Query Mongo Embedded Documents with a size

I have a Ruby on Rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
"_id" : "76561198045636214",
"fragments" : [
{
"id" : 76561198045636215,
"source_id" : "source1"
},
{
"id" : 76561198045636216,
"source_id" : "source2"
},
{
"id" : 76561198045636217,
"source_id" : "source2"
}
]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works, I need to change it so that $size is greater than 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it is far too slow (it would take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
query = record.fragments.where(source_id: 'source2')
if query.count > 1
# contains duplicates, delete all but latest
query.desc(:updated_at).skip(1).delete_all
end
# needed to trigger after_save filters
record.save!
end
The problem with the current approach is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, the aggregation framework is probably the best approach. There is no direct "mongoid"-style way to write these queries, as those are geared towards the existing "rails" style of dealing with relational documents.
You can, however, access the "moped" form through the .collection accessor on your model class:
Record.collection.aggregate([
# Keep only arrays with two or more elements as candidates
{ "$match" => {
"$and" => [
{ "fragments" => { "$not" => { "$size" => 0 } } },
{ "fragments" => { "$not" => { "$size" => 1 } } }
]
}},
# Unwind the arrays to "de-normalize" as documents
{ "$unwind" => "$fragments" },
# Group back and get counts of the "key" values
{ "$group" => {
"_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
"fragments" => { "$push" => "$fragments.id" },
"count" => { "$sum" => 1 }
}},
# Match the keys found more than once
{ "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
"_id" : { "_id": "76561198045636214", "source_id": "source2" },
"fragments": ["76561198045636216","76561198045636217"],
"count": 2
}
That at least gives you something to work with when dealing with the "duplicates" here.
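To make the stages easier to follow, here is the same logic written as plain Python over in-memory dictionaries. This is only an illustration of what the pipeline computes, not a replacement for running it in MongoDB; the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_sources(records):
    """Return {(record_id, source_id): [fragment ids]} for repeated source_ids."""
    groups = defaultdict(list)
    for record in records:
        fragments = record.get("fragments", [])
        # First $match: skip arrays with fewer than two elements.
        if len(fragments) < 2:
            continue
        # $unwind + $group: collect fragment ids per (record, source_id) key.
        for fragment in fragments:
            groups[(record["_id"], fragment["source_id"])].append(fragment["id"])
    # Final $match: keep only keys seen at least twice.
    return {key: ids for key, ids in groups.items() if len(ids) >= 2}


records = [
    {
        "_id": "76561198045636214",
        "fragments": [
            {"id": 76561198045636215, "source_id": "source1"},
            {"id": 76561198045636216, "source_id": "source2"},
            {"id": 76561198045636217, "source_id": "source2"},
        ],
    }
]
duplicates = find_duplicate_sources(records)
```

Each returned list holds the ids of the fragments that share a source_id, which is the information needed to decide which ones to delete.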

How to select from array in Rethinkdb?

I have a field bidder containing arrays of objects like this (it can also be empty):
[
[
{
"date":"08/17/1999"
},
{
"time":"07:15:23"
},
{
"increase":31.5
}
],
[
{
"date":"04/01/1998"
},
{
"time":"01:06:18"
},
{
"increase":10.5
}
]
]
How can I select the increase value from the first array? That is, the output should be 31.5.
In JavaScript
r.table('test')('bidder').nth(0)('increase').run(conn, callback)
In Python and Ruby
r.table('test')['bidder'][0]['increase'].run(conn)
Edit: Queries for all documents
If you need to do more complex things than just returning a value, you can use the general form with map:
r.table('test').map(function(doc) {
return doc('bidder').nth(0)('increase')
}).run(conn, callback)
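Because each inner array holds single-key objects, picking increase means scanning the first inner array for the object that has that key. The plain-Python sketch below (not ReQL; the function name is hypothetical) shows that lookup on the sample data, including the empty case mentioned in the question.

```python
def first_increase(bidder):
    """Return the 'increase' value from the first inner array, or None."""
    if not bidder:
        # The field can be empty, so there may be no first array at all.
        return None
    for entry in bidder[0]:
        # Skip the objects that only carry "date" or "time".
        if "increase" in entry:
            return entry["increase"]
    return None


bidder = [
    [{"date": "08/17/1999"}, {"time": "07:15:23"}, {"increase": 31.5}],
    [{"date": "04/01/1998"}, {"time": "01:06:18"}, {"increase": 10.5}],
]
value = first_increase(bidder)
```

Once a document has been fetched into the application, the same function can be applied to its bidder field directly.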
