OrientDB: import edges only using the ETL tool

I have already used OETL to insert all my vertices into the graph.
Now I have a file that lists the edges in the following format:
node_1,rel_type,node_2
11000001,relation_A,10208879
11000001,relation_A,10198662
11000001,relation_B,10159927
11000001,relation_C,10165779
How can I import it using the OrientDB OETL tool?
I tried the following:
"transformers": [
  { "csv": {} },
  { "command" : {
      "command" : "create edge ${rel_type} from (select flatten(#rid) from V where node_id= ${node_1}) to (select flatten(#rid) from V where node_id = ${node_2})",
      "output" : "edge"
    }
  }
],
But this failed to work since it can't parse the values from the csv.

You must use the $input variable.
"transformers": [{
    "csv": {
      "separator": ","
    }
  },
  {
    "command" : {
      "command" : "create edge ${input.rel_type} from (select from V where node_id= ${input.node_1}) to (select from V where node_id = ${input.node_2})",
      "output" : "edge"
    }
  }
],
It works for me.
Hope it helps.
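To see what the corrected template expands to, here is a minimal Python sketch that substitutes each CSV row into the same command string. It only mimics the ${input.field} lookup for illustration; it is not how OETL itself is implemented, and the expand helper is a hypothetical name.

```python
import csv
import io
import re

# Two sample rows taken from the question.
CSV_TEXT = """node_1,rel_type,node_2
11000001,relation_A,10208879
11000001,relation_B,10159927
"""

# The command template from the answer above.
TEMPLATE = (
    "create edge ${input.rel_type} "
    "from (select from V where node_id= ${input.node_1}) "
    "to (select from V where node_id = ${input.node_2})"
)

def expand(template, row):
    """Replace every ${input.<field>} with the value of <field> in row."""
    return re.sub(r"\$\{input\.(\w+)\}", lambda m: row[m.group(1)], template)

commands = [expand(TEMPLATE, row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
for command in commands:
    print(command)
```

Each printed line is the statement that would be issued for one CSV row, which makes it easy to paste one into the console and check that both sub-selects find a vertex.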


GraphQL on clause with enum type

I have a question about GraphQL, because I do not know whether this is possible.
I have a simple schema like this:
enum Range{
D,
D_1,
D_7
}
type Data {
id: Int!
levels(range: [Range!]):[LevelEntry]
}
type LevelEntry{
range: Range!
levelData: LevelData
}
type LevelData {
range: Range!
users: Int
name: String
stairs: Int
money: Float
}
Basically, I want a query that retrieves different attributes for different entries of the levels array (through their levelData property), where the entries can be filtered by range.
For instance:
data {
"id": 1,
"levels": [
{
"range": D,
"levelData": {
"range": D,
"users": 1
}
},
{
"range": D_1,
"levelData": {
"range": D_1,
"users": 1,
"name": "somename"
}
}
]
This means I want the "range, users" properties for D and the "range, users, name" properties for D_1.
I have written an example query, but I do not know if this is possible:
query data(range: [D,D_1]){
id,
levels {
range
... on D {
range,
users
}
... on D_1 {
range,
users,
name
}
}
}
Is it possible? If so, how can I do it?

GraphQL fallback query if no results

I have the following query:
{
entity(id: "theId") {
source1: media(source: 1){
images{
src, alt
}
}
source2: media(source: 2){
images{
src, alt
}
}
}
}
That gives me a result like:
{
"entity": [
{
"source1": {
"images": [{"src": "", "alt": ""}]
},
"source2": {
"images": [{"src": "", "alt": ""}]
}
}
]
}
Is there a way to get a single result from source1 and source2, so that source1 is executed first and source2 is used as a fallback if source1 has no result?
You are querying two fields (source1, source2), so something has to come back for both of them (null being a possible option). If you want to check them in sequence, you should probably split the query in two and run them one at a time from the client.
Could you perhaps change the schema so that you query only a single source field and have the resolver (on the server) return whatever makes sense based on what is available? Like this:
{
  entity(id: "theId") {
    source: media(sourcesList: [1, 2]){
      images{
        src, alt
      }
    }
  }
}
where sourcesList is the list of sources to try, in order. The resolver (server) can then check whether source 1 is available and, if not, return source 2.
You could also add a field to let the client know which source from the proposed list was actually returned (sourceNumberReturned below would return 1 if source 1 was returned, otherwise 2).
{
  entity(id: "theId") {
    source: media(sourcesList: [1, 2]){
      images{
        src, alt
      }
      sourceNumberReturned
    }
  }
}
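The server-side fallback described above could be sketched roughly as follows. This is a plain Python function rather than code for any particular GraphQL library, and the MEDIA store, the resolve_media name, and the sample data are all hypothetical.

```python
# Hypothetical in-memory store: entity id -> source number -> media record.
MEDIA = {
    "theId": {
        1: {"images": []},  # source 1 has no result
        2: {"images": [{"src": "b.png", "alt": "from source 2"}]},
    }
}

def resolve_media(entity_id, sources_list):
    """Return media from the first source in sources_list that has images."""
    by_source = MEDIA.get(entity_id, {})
    for source in sources_list:
        media = by_source.get(source)
        if media and media["images"]:
            # Also report which source was actually used.
            return {"images": media["images"], "sourceNumberReturned": source}
    return None  # none of the listed sources had a result

result = resolve_media("theId", [1, 2])
print(result)
```

Because the order of sourcesList decides the priority, the client can change the preference per query without any change on the server.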

Elasticsearch filter where poi nearby

I want to perform the following pseudo query on Elastic:
SELECT SomeTypeOfObjects WHERE distance to
PointOfInterests < 20km AND PointOfInterests.type = 'bar';
I have the following data in Elastic:
SomeTypeOfObjects (around 10.000.000 rows)
----------
id = x
geo_location = x,x
PointOfInterests( around 400.000 rows )
---------
id = x
geo_location = x,x
type=bar, hospital, etc
Would this be possible without 2 queries or feeding the query all possible geo locations?
If you saved your lat,lon as a geo_point, you should be able to execute a geo_distance query like this:
GET /my_locations/location/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match_all" : {}
      },
      "filter" : {
        "geo_distance" : {
          "distance" : "12km",
          "pin.location" : {
            "lat" : 40,
            "lon" : -70
          }
        }
      }
    }
  }
}
On its own, though, this doesn't answer your question. I guess the best option would be to:
1. Get the coordinates of all bars
2. Use geo_distance with multiple points:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html
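The two steps can be sketched as follows, assuming the bar coordinates have already been fetched (step 1) and that the objects store their position in a geo_point field named geo_location, as in the question. A geo_distance clause takes one centre point, so this sketch combines one clause per bar in a bool/should filter. The helper name and the sample coordinates are made up, and the request body grows with the number of bars.

```python
import json

def build_nearby_query(points, distance="20km", field="geo_location"):
    """Match documents within `distance` of at least one of `points`."""
    clauses = [
        {"geo_distance": {"distance": distance, field: {"lat": lat, "lon": lon}}}
        for lat, lon in points
    ]
    return {
        "query": {
            "bool": {
                "filter": {
                    "bool": {"should": clauses, "minimum_should_match": 1}
                }
            }
        }
    }

# Made-up coordinates standing in for the result of step 1.
bars = [(52.37, 4.89), (52.09, 5.12)]
body = build_nearby_query(bars)
print(json.dumps(body, indent=2))
```

This still needs two round trips (one for the bars, one for the objects), which matches the answer's point that a single query is not enough here.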

How can I do this in painless script Elasticsearch 5.3

We're trying to replicate this ES plugin https://github.com/MLnick/elasticsearch-vector-scoring. The reason is AWS ES doesn't allow any custom plugin to be installed. The plugin is just doing dot product and cosine similarity so I'm guessing it should be really simple to replicate that in painless script. It looks like groovy scripting is deprecated in 5.0.
Here's the source code of the plugin.
/**
* @param params index that a scored are placed in this parameter. Initialize them here.
*/
@SuppressWarnings("unchecked")
private PayloadVectorScoreScript(Map<String, Object> params) {
params.entrySet();
// get field to score
field = (String) params.get("field");
// get query vector
vector = (List<Double>) params.get("vector");
// cosine flag
Object cosineParam = params.get("cosine");
if (cosineParam != null) {
cosine = (boolean) cosineParam;
}
if (field == null || vector == null) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": field or vector parameter missing!");
}
// init index
index = new ArrayList<>(vector.size());
for (int i = 0; i < vector.size(); i++) {
index.add(String.valueOf(i));
}
if (vector.size() != index.size()) {
throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": index and vector array must have same length!");
}
if (cosine) {
// compute query vector norm once
for (double v: vector) {
queryVectorNorm += Math.pow(v, 2.0);
}
}
}
@Override
public Object run() {
float score = 0;
// first, get the ShardTerms object for the field.
IndexField indexField = this.indexLookup().get(field);
double docVectorNorm = 0.0f;
for (int i = 0; i < index.size(); i++) {
// get the vector value stored in the term payload
IndexFieldTerm indexTermField = indexField.get(index.get(i), IndexLookup.FLAG_PAYLOADS);
float payload = 0f;
if (indexTermField != null) {
Iterator<TermPosition> iter = indexTermField.iterator();
if (iter.hasNext()) {
payload = iter.next().payloadAsFloat(0f);
if (cosine) {
// doc vector norm
docVectorNorm += Math.pow(payload, 2.0);
}
}
}
// dot product
score += payload * vector.get(i);
}
if (cosine) {
// cosine similarity score
if (docVectorNorm == 0 || queryVectorNorm == 0) return 0f;
return score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
} else {
// dot product score
return score;
}
}
I'm trying to start by just getting a field from the index, but I'm getting an error.
Here's the shape of my index.
I've enabled delimited_payload_filter:
"settings" : {
"analysis": {
"analyzer": {
"payload_analyzer": {
"type": "custom",
"tokenizer":"whitespace",
"filter":"delimited_payload_filter"
}
}
}
}
And I have a field called #model_factor to store a vector.
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer"
}
}
}
}
And this is the shape of the document
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1"
}
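As a reference for checking results outside Elasticsearch, the computation performed by the plugin's run() method can be mirrored in a few lines of Python. This is only a sketch: the function names are hypothetical, the query vector is a sample, and unlike the plugin (which treats a missing payload as 0) it assumes every index from 0 to n-1 is present in the field value.

```python
import math

def parse_factor(text):
    """Parse 'index|value' tokens into a list of floats ordered by index."""
    pairs = [token.split("|") for token in text.split()]
    values = {int(i): float(v) for i, v in pairs}
    return [values[i] for i in range(len(values))]

def score(doc_vector, query_vector, cosine=False):
    """Dot product, or cosine similarity when cosine=True."""
    dot = sum(d * q for d, q in zip(doc_vector, query_vector))
    if not cosine:
        return dot
    doc_norm = math.sqrt(sum(d * d for d in doc_vector))
    query_norm = math.sqrt(sum(q * q for q in query_vector))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)

doc_vector = parse_factor("0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3")
query_vector = [0.1, 2.3, -1.6, 0.7, -1.3]
print(score(doc_vector, query_vector))
print(score(doc_vector, query_vector, cosine=True))
```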
Here's how I use the script
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "def termInfo = doc['_index']['#model_factor'].get('1', 4);",
"lang": "painless",
"params": {
"field": "#model_factor",
"vector": [0.1,2.3,-1.6,0.7,-1.3],
"cosine" : true
}
}
},
"boost_mode": "replace"
}
}
}
And this is the error I got.
"failures": [
{
"shard": 2,
"index": "test",
"node": "ShL2G7B_Q_CMII5OvuFJNQ",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"caused_by": {
"type": "wrong_method_type_exception",
"reason": "wrong_method_type_exception: cannot convert MethodHandle(List,int)int to (Object,String)String"
},
"script_stack": [
"termInfo = doc['_index']['#model_factor'].get('1',4);",
" ^---- HERE"
],
"script": "def termInfo = doc['_index']['#model_factor'].get('1',4);",
"lang": "painless"
}
}
]
The question is how do I access the index field to get #model_factor in painless scripting?
Option 1
Because #model_factor is a text field, Painless can access it only if fielddata=true is set in the mapping. So the mapping should be:
{
"movies" : {
"properties" : {
"#model_factor": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "payload_analyzer",
"fielddata" : true
}
}
}
}
The field can then be accessed through doc in the scoring script:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return Double.parseDouble(doc['#model_factor'].get(1)) * params.vector[1];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
Problems with Option 1
So it is possible to access the field data value by setting fielddata=true, but in this case the value is the vector index as a term, not the vector value, which is stored in the payload. Unfortunately, it looks like there is no way to access the token payload (where the real vector value is stored) using Painless scripting and doc values. See the source code for Elasticsearch and another similar question about accessing term info.
So the answer is that it is NOT possible to access the payload using Painless scripting.
I also tried storing the vector values with a simple pattern tokenizer, but the order is not preserved when the term vector values are accessed. This is probably why the author of the plugin decided to use the position as a string term (position 0 of the vector becomes the term "0") and to look up the real vector value in the payload.
Option 2
A very simple alternative is to use n fields in each document, one per position in the vector. In your example, that gives a 5-dimensional vector with its values stored directly as doubles in v0...v4:
{
"#model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
"name": "Test 1",
"v0" : 1.2,
"v1" : 0.1,
"v2" : 0.4,
"v3" : -0.2,
"v4" : 0.3
}
The Painless script would then be:
{
"query": {
"function_score": {
"query" : {
"query_string": {
"query": "*"
}
},
"script_score": {
"script": {
"inline": "return doc['v0'].getValue() * params.vector[0];",
"lang": "painless",
"params": {
"vector": [0.1,2.3,-1.6,0.7,-1.3]
}
}
},
"boost_mode": "replace"
}
}
}
It should be easy to iterate over the input vector and look up the fields dynamically to compute the full dot product; I wrote only doc['v0'].getValue() * params.vector[0] for simplicity.
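One way the loop could look is sketched below: a small Python helper that builds the same function_score request with a Painless script that walks the whole query vector. The Painless source has not been run against a cluster, so treat it as an assumption; it relies on the v0, v1, ... field naming from this option and on the same doc[...].getValue() and params.vector[...] accessors used above. The helper name is hypothetical.

```python
import json

# Painless source: sum doc['v<i>'] * params.vector[i] over all positions.
# Untested assumption; assumes one numeric field per dimension named v<i>.
PAINLESS_DOT_PRODUCT = (
    "double total = 0; "
    "for (int i = 0; i < params.vector.size(); i++) { "
    "total += doc['v' + i].getValue() * params.vector[i]; "
    "} "
    "return total;"
)

def build_dot_product_query(vector):
    """Build a function_score body that scores by dot product with `vector`."""
    return {
        "query": {
            "function_score": {
                "query": {"query_string": {"query": "*"}},
                "script_score": {
                    "script": {
                        "inline": PAINLESS_DOT_PRODUCT,
                        "lang": "painless",
                        "params": {"vector": vector},
                    }
                },
                "boost_mode": "replace",
            }
        }
    }

body = build_dot_product_query([0.1, 2.3, -1.6, 0.7, -1.3])
print(json.dumps(body, indent=2))
```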
Problems with Option2
Option 2 is viable as long as the vector dimension stays small. I think the default Elasticsearch limit on the number of fields in an index is 1000, but it can be changed, even in an AWS environment:
curl -X PUT \
  'https://.../indexName/_settings' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "index.mapping.total_fields.limit": 2000
  }'
The speed of the script should also be tested on a large number of documents.
It may be a viable solution in re-scoring / re-ranking scenarios.
Option 3
The third option is really an experiment and, in my opinion, the most fascinating one.
It tries to exploit Elasticsearch's internal representation of the Vector Space Model. It does not use any scripting to score; instead it reuses the default similarity score based on tf/idf.
Lucene, which sits at the core of Elasticsearch, already uses a modified cosine similarity internally to score documents in its Vector Space Model representation of terms, as the practical scoring function from the TFIDFSimilarity javadoc shows:
score(q,d) = coord(q,d) * queryNorm(q) * SUM over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )
In particular, the weights of the vector representing the field are the tf/idf values of the terms of that field.
So we could index a document with term vectors, using the index of each vector position as a term. Repeating that term N times represents the value N at that position, exploiting the tf part of the scoring formula.
This means that the vector values have to be transformed and rescaled to positive integers (1, 2, 3, ...). We start from 1 so that every document contains every term, which makes it easier to exploit the formula.
For example, the vector [21, 54, 45] can be indexed as a field in a document using a simple whitespace analyzer and the following value:
{
"#model_factor" : "0<repeated 21 times> 1<repeated 54 times> 2<repeated 45 times>",
"name": "Test 1"
}
Then, to query (i.e. to calculate the dot product), we boost the individual terms that represent the index positions of the vector.
Using the same example, the input vector [45, 1, 1] is transformed into the query:
"should": [
{
"term": {
"#model_factor": {
"value": "0",
"boost": 45
}
}
},
{
"term": {
"#model_factor": "1" // boost:1 by default
}
},
{
"term": {
"#model_factor": "2" // boost:1 by default
}
}
]
norm(t,d) should be disabled in the mapping so that it is not used in the formula above. The idf part is constant across documents because all of them contain all the terms (all the vectors have the same dimension).
queryNorm(q) is the same for all documents in the formula above, so it is not a problem.
coord(q,d) is a constant because all the documents contain all the terms.
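The encoding and the boosted query can be sketched in Python as below. The helper names are hypothetical, the field name is the one used in the examples above, and the sketch only builds the strings and clauses; it says nothing about how close the resulting scores are to a true dot product.

```python
def encode_vector(vector):
    """Repeat each position's index as a term, once per unit of its value."""
    if any(not isinstance(v, int) or v < 1 for v in vector):
        raise ValueError("values must be positive integers")
    return " ".join(" ".join([str(i)] * v) for i, v in enumerate(vector))

def build_should(query_vector, field="#model_factor"):
    """One boosted term clause per position of the query vector."""
    return [
        {"term": {field: {"value": str(i), "boost": boost}}}
        for i, boost in enumerate(query_vector)
    ]

text = encode_vector([21, 54, 45])
should = build_should([45, 1, 1])
tokens = text.split()
print(tokens.count("0"), tokens.count("1"), tokens.count("2"))
print(should[0])
```

The first print shows how many times each index term occurs in the field value, which is the term frequency the scoring formula will see.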
Problems with Option 3
It needs to be tested.
It works only for vectors of positive numbers; see this question on Math Stack Exchange for making it work with negative numbers as well.
It is not exactly the same as a dot product, but it is very close when the goal is to find similar documents based on raw vectors.
Scalability with large vector dimensions can be an issue at query time, because an N-dimensional vector requires a terms query with N differently boosted terms.
I will try it in a test index and update this answer with the results.

Elasticsearch - group by element in child collection

I have documents of the following format in an Elasticsearch index:
{
"item":"Firefox",
"tags":["a","b","c"]
},
{
"item":"Chrome",
"tags":["b","c","d"]
}
I want to group by each element in the tags property, so that I get results like:
"a" = 1, "b" = 2, "c" = 2, "d" = 1
Any help or pointers would be appreciated.
If you index (write) your documents to index = x, type = y, then:
POST x/y/_search
{
  "size":0,
  "aggs":{
    "t":{
      "terms" :{
        "field" :"tags"
      }
    }
  }
}
To understand how this works, read up on Elasticsearch terms aggregations.
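As a rough illustration of what the terms aggregation computes, the same per-tag document counts can be reproduced in plain Python on the two sample documents. This only emulates the counting; it does not call Elasticsearch.

```python
from collections import Counter

docs = [
    {"item": "Firefox", "tags": ["a", "b", "c"]},
    {"item": "Chrome", "tags": ["b", "c", "d"]},
]

# One bucket per distinct tag; each document counts once per tag it contains.
counts = Counter(tag for doc in docs for tag in set(doc["tags"]))
print(dict(sorted(counts.items())))
```

This prints {'a': 1, 'b': 2, 'c': 2, 'd': 1}, the grouping asked for in the question.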
