Titan cannot find suitable index while index exists - elasticsearch

I am trying to set up a simple Titan graph on BerkeleyDB, but Titan does not use the index I created for its queries.
String INDEX_NAME = "search";
File dir = new File("C:/TEMP/titanTest1");
dir.mkdirs();
TitanFactory.Builder config = TitanFactory.build();
config.set("storage.backend", "berkeleyje");
config.set("storage.directory", dir.getAbsolutePath());
config.set("index." + INDEX_NAME + ".backend", "elasticsearch");
config.set("index." + INDEX_NAME + ".directory", new File(dir, "es").getAbsolutePath());
config.set("index." + INDEX_NAME + ".elasticsearch.local-mode", true);
config.set("index." + INDEX_NAME + ".elasticsearch.client-only", false);
config.set("query.force-index", true);
TitanGraph tg = config.open();
try {
    if (tg.getPropertyKey("name") == null) {
        TitanManagement mgmt = tg.getManagementSystem();
        VertexLabel metaclassLabel = mgmt.makeVertexLabel("metaclass").make();
        final PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
        mgmt.buildIndex("metaClassesByName", Vertex.class).addKey(name).indexOnly(metaclassLabel).buildMixedIndex(INDEX_NAME);
        mgmt.commit();
    }
    System.out.println("indexed:" + tg.getIndexedKeys(Vertex.class));
    Vertex v = tg.addVertexWithLabel("metaclass");
    v.setProperty("name", "test");
    for (Object o : tg.query().has("name").has(ImplicitKey.LABEL.getName(), "metaclass").vertices()) {
        Vertex v2 = (Vertex) o;
        System.out.println(v2);
    }
    tg.commit();
} finally {
    tg.shutdown();
}
This code prints:
indexed:[name]
Exception in thread "main" com.thinkaurelius.titan.core.TitanException: Could not find a suitable index to answer graph query and graph scans are disabled: [(name <> null AND label = metaclass)]:VERTEX
at com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx$8.execute(StandardTitanTx.java:1198)
I don't understand why Titan can't use the index I've defined. I want to list all objects that have the metaclass label. The only thing that works is to define a composite index and search for a vertex with an exact name value. Is what I want possible at all?
Thanks!

You can use a direct index query:
// tg is the TitanGraph instance from the question
for (Result<TitanVertex> res : tg.indexQuery("metaClassesByName", "v.name:*").vertices()) {
    Vertex v2 = res.getElement();
    System.out.println(v2);
}

Related

Aggregation in Logstash-ElasticSearch

I am using Logstash with the elasticsearch input and the elasticsearch output. The input and the output point to different Elasticsearch instances.
Before the data reaches the output block, I want to aggregate some documents, create a hash of the new document, and insert the nested document into Elasticsearch.
So basically I want to do some processing before the nested document is inserted into Elasticsearch. Is this possible?
input {
    # something here to get a value of variable stored in a different file
    elasticsearch {
        hosts => "abc.de.fg.hi:jklm"
        query => '{--some query---}'
    }
}
output {
    elasticsearch {
        hosts => "xxx.xx.xx.xx:yyyy"
    }
}
I'm using the "aggregate" plugin.
In my case the input comes from UDP and I filter it with "grok", but I believe you can achieve what you want by tweaking the code a bit.
Without a sample of exactly what you are trying to achieve, the best I can do is show you a sample of my code:
aggregate {
    task_id => "%{action}_%{progress}"
    code => "
        map['avg'] ||= 0;
        map['avg'] += event.get('elapsed');
        map['my_count'] ||= 0;
        map['my_count'] += 1;
        if (map['my_count'] == ${LogstashAggregationCount}) # Environment variable
            event.set('elapsedAvg', (map['avg'] / map['my_count']))
            event.set('Aggregetion', true)
            map['avg'] = 0
            map['my_count'] = 0
        end
    "
}
if (![Aggregetion]) {
    drop {}
}
Of course, you need to adapt it to your specific case. For a more in-depth explanation of my code, read here: How to Use Logstash Aggregations
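The bookkeeping inside the code block above (accumulate a sum, count events, emit the average once the count reaches a threshold, then reset) can be sketched outside Logstash. This is only an illustration in plain Java: the class name WindowedAverage and its method add are hypothetical, and windowSize stands in for the LogstashAggregationCount environment variable.

```java
import java.util.OptionalDouble;

/** Hypothetical helper mirroring the map['avg'] / map['my_count'] bookkeeping. */
class WindowedAverage {
    private final int windowSize; // plays the role of LogstashAggregationCount
    private double sum = 0;       // corresponds to map['avg']
    private int count = 0;        // corresponds to map['my_count']

    WindowedAverage(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
    }

    /** Adds one elapsed value; returns the average only when the window is full, then resets. */
    OptionalDouble add(double elapsed) {
        sum += elapsed;
        count++;
        if (count < windowSize) {
            return OptionalDouble.empty(); // corresponds to the event being dropped
        }
        double average = sum / count;
        sum = 0;
        count = 0;
        return OptionalDouble.of(average);
    }
}
```

An empty result corresponds to the events that the drop {} branch discards; only the event that completes a window carries the average.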

Browse all documents and bulk update some of them

I am using the Jest client for Elasticsearch to browse an index of documents in order to update one field. My workflow is to run an empty query with paging and check whether I can compute the extra field. If I can, I update the relevant documents in one bulk update.
Pseudo-code
private void process() {
    int from = 0
    int size = this.properties.batchSize
    boolean moreResults = true
    while (moreResults) {
        moreResults = handleBatch(from, size)
        from += size
    }
}
private boolean handleBatch(int from, int size) {
    log.info("Processing records $from to " + (from + size))
    def result = search(from, size)
    if (result.isSucceeded()) {
        // Check each element and perform an upgrade
    }
    // return true if the query returned at least one item
}
private SearchResult search(int from, int size) {
    String query =
        '{ "from": ' + from + ', ' +
        '"size": ' + size + '}'
    Search search = new Search.Builder(query)
        .addIndex("my-index")
        .addType('my-document')
        .build();
    jestClient.execute(search)
}
I don't get any error, but when I run the batch several times, it looks like it is finding "new" documents to upgrade while the total number of documents hasn't changed. I suspected that an updated document was being processed several times, which I confirmed by checking the processed IDs.
How can I run a query so that the original documents are the ones processed and any update wouldn't interfere with it?
Instead of running a normal search (i.e. using from+size), you need to run a scroll search query. The main difference is that the scroll freezes a snapshot of the documents (as of the time of the initial query) and iterates over that snapshot. Changes that happen after the first scroll query won't be considered.
Using Jest, you need to modify your code to look more like this:
// 1. Initiate the scroll request
Search search = new Search.Builder(searchSourceBuilder.toString())
    .addIndex("my-index")
    .addType("my-document")
    .addSort(new Sort("_doc"))
    .setParameter(Parameters.SIZE, size)
    .setParameter(Parameters.SCROLL, "5m")
    .build();
JestResult result = jestClient.execute(search);
// The initial response already contains the first batch of hits; process them too
// 2. Get the scroll_id to use in subsequent request
String scrollId = result.getJsonObject().get("_scroll_id").getAsString();
// 3. Issue scroll search requests until you have retrieved all results
boolean moreResults = true;
while (moreResults) {
    SearchScroll scroll = new SearchScroll.Builder(scrollId, "5m")
        .setParameter(Parameters.SIZE, size).build();
    result = jestClient.execute(scroll);
    // Each response may carry a new scroll id, so use the latest one for the next request
    scrollId = result.getJsonObject().get("_scroll_id").getAsString();
    def hits = result.getJsonObject().getAsJsonObject("hits").getAsJsonArray("hits");
    // Check each hit and perform the upgrade here
    moreResults = hits.size() > 0;
}
You need to modify your process and handleBatch methods with the above code. It should be straightforward, let me know if not.
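To see why paging over a live result set can revisit documents while a frozen snapshot cannot, here is a toy in-memory model in plain Java. It is not Elasticsearch or Jest code. It assumes, purely for illustration, that updating a document moves it to the end of the unsorted result order; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model; NOT Elasticsearch or Jest code. */
class PagingVsSnapshot {

    /** Models an update as delete-then-append, so the updated document moves to the end. */
    static void update(List<String> index, String id) {
        index.remove(id);
        index.add(id);
    }

    /** Pages over the live list with from/size, updating every document it sees. */
    static List<String> processWithFromSize(List<String> index, int size) {
        List<String> processed = new ArrayList<>();
        for (int from = 0; from < index.size(); from += size) {
            // Copy the page first, because update() reorders the live list.
            int to = Math.min(from + size, index.size());
            List<String> page = new ArrayList<>(index.subList(from, to));
            for (String id : page) {
                processed.add(id);
                update(index, id);
            }
        }
        return processed;
    }

    /** Freezes the list of ids up front (the role the scroll plays), then updates. */
    static List<String> processWithSnapshot(List<String> index) {
        List<String> snapshot = new ArrayList<>(index);
        List<String> processed = new ArrayList<>();
        for (String id : snapshot) {
            processed.add(id);
            update(index, id);
        }
        return processed;
    }
}
```

With four documents a, b, c, d and a page size of 2, the from/size variant processes a and b twice and never reaches c and d under this model, while the snapshot variant visits each document exactly once.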

Sorting for Azure DocumentDB

I want to use DocumentDB to store roughly 200,000 documents of the same type. Each document gets an integer id field, and I would like to retrieve them paged, in reverse order (highest id first).
I recently found out that DocumentDB has no sorting (see also DocumentDB - query result order). Perhaps it would be better to go for a different database (such as RavenDB); however, time is pressing and I want to avoid the cost of switching to another database.
The question:
I have been looking at implementing my own sorted index of the documents on the client side (ASP Web API 2). I was thinking of creating a SortedList of key(id) and value(document.selflink). Then I could create a Getter with parameters for count, offset and a predicate to filter the documents. Below I added a quick example.
I just have the feeling this is a bad idea; either slow, costing too many resources or can be better done another way. So I am open for implementation suggestions...
public class SortableDocumentDbRepository
{
    private SortedList _sorted = new SortedList();
    private readonly string _sortedPropertyName;

    private DocumentCollection ReadOrCreateCollection(string databaseLink) {
        DocumentCollection col = base.ReadOrCreateCollection(databaseLink);
        var docs = Client.CreateDocumentQuery(Collection.DocumentsLink)
            .AsEnumerable();
        lock (_sorted.SyncRoot) {
            foreach (Document doc in docs) {
                var propVal = doc.GetPropertyValue<string>(_sortedPropertyName);
                if (propVal != null) {
                    _sorted.Add(propVal, doc.SelfLink);
                }
            }
        }
        return col;
    }

    public List<T> GetItems<T>(int count, int offset, Expression<Func<T, bool>> predicate) {
        List<T> result = new List<T>();
        lock (_sorted.SyncRoot) {
            var values = _sorted.GetValueList();
            for (int i = offset; i < _sorted.Count; i++) {
                var queryable = predicate != null ?
                    Client.CreateDocumentQuery<T>(values[i].ToString()).Where(predicate) :
                    Client.CreateDocumentQuery<T>(values[i].ToString());
                T item = queryable.AsEnumerable().FirstOrDefault();
                if (item == null || item.Equals(default(T))) continue;
                result.Add(item);
                if (result.Count >= count) return result;
            }
        }
        return result;
    }
}
Microsoft has implemented Sorting:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-reference#bk_orderby_clause
Example: SELECT * FROM c ORDER BY c._ts DESC
As you mentioned, ORDER BY unfortunately isn't implemented yet.
Your approach looks reasonable to me.
I see you are using a predicate to narrow the query result set (pulling 200,000 records from any DB will be costly).
Since it looks like you want to order by id, you can also look into setting up a range index on id, which allows you to perform range queries (e.g. < and >) on the id and further narrow the query result set. There is also a range index included by default on the _ts (timestamp) system property of documents, which may also be helpful in this context.
See: http://azure.microsoft.com/en-us/documentation/articles/documentdb-indexing-policies/
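The client-side sorted index described in the question can be sketched with a sorted map that is walked in descending key order. The following is a minimal Java illustration of that idea only, not DocumentDB client code; the class DescendingIdIndex and its methods are hypothetical names, and the self-links are treated as opaque strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical client-side index: integer id -> document self-link, paged highest id first. */
class DescendingIdIndex {
    private final TreeMap<Integer, String> linksById = new TreeMap<>();

    synchronized void put(int id, String selfLink) {
        linksById.put(id, selfLink);
    }

    /** Returns up to count self-links, skipping the first offset entries in descending id order. */
    synchronized List<String> page(int offset, int count) {
        List<String> result = new ArrayList<>();
        if (offset < 0 || count <= 0) {
            return result;
        }
        int skipped = 0;
        for (String link : linksById.descendingMap().values()) {
            if (skipped < offset) {
                skipped++;
                continue;
            }
            result.add(link);
            if (result.size() == count) {
                break;
            }
        }
        return result;
    }
}
```

Keying the map by integer rather than by string matters for ordering: with string keys, "10" would sort before "9". Skipping offset entries is linear in the offset, so deep pages get slower; a range query on id, as suggested above, avoids that walk.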

Multidimensional data lookup

I have a collection of tuples of N values. A value may be a wildcard (matches any value), or a concrete value. What would be the best way to lookup all tuples in the collection matching a specific tuple without scanning the entire collection and testing items one by one?
E.g. 1.2.3 matches 1.*.3 and *.*.3, but not 1.2.4 or *.2.4.
What data structure am I looking for here?
I'd use a trie to implement this. Here's how I would construct the trie:
The data structure would look like:
Trie{
    Integer value
    Map<Integer, Trie> tries
}
To insert:
insert(tuple, trie){
    curTrie = trie
    foreach(number in tuple){
        nextTrie = curTrie.getTrie(number)
        //add the number to the trie if it isn't in there
        if(nextTrie == null){
            newTrie = new Trie(number)
            curTrie.setTrie(number, newTrie)
        }
        curTrie = curTrie.getTrie(number)
    }
}
To get all the tuples:
getTuples(tuple, trie){
    //base case: the whole tuple has been consumed
    if(tuple is empty)
        return {emptyTuple}
    allTuples = {}
    if(head(tuple) == "*"){
        //a wildcard matches every child of this node
        forEach((number, subTrie) in trie.tries){
            forEach(partialTuple in getTuples(restOf(tuple), subTrie)){
                allTuples.add(number + partialTuple)
            }
        }
        return allTuples
    }
    subTrie = trie.getTrie(head(tuple))
    //no stored tuple continues with this value
    if(subTrie == null)
        return allTuples
    forEach(partialTuple in getTuples(restOf(tuple), subTrie)){
        allTuples.add(head(tuple) + partialTuple)
    }
    return allTuples
}
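The pseudocode above handles a wildcard in the query. The question also describes the opposite case, where the stored tuples contain wildcards and the query is concrete (1.2.3 should find 1.*.3 and *.*.3). A minimal Java sketch of that variant follows: at each level the lookup follows both the exact child and the "*" child. The class name WildcardTrie, the use of strings as values, and the assumption that all tuples have the same length are illustrative choices, not part of the answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical trie over tuples of strings; "*" in a stored tuple matches any value. */
class WildcardTrie {
    static final String WILDCARD = "*";

    private final Map<String, WildcardTrie> children = new HashMap<>();

    /** Stores a tuple such as ["1", "*", "3"]; one trie level per position. */
    void insert(List<String> tuple) {
        WildcardTrie node = this;
        for (String value : tuple) {
            node = node.children.computeIfAbsent(value, k -> new WildcardTrie());
        }
    }

    /** Returns every stored tuple that matches the query (all tuples assumed to have equal length). */
    List<List<String>> match(List<String> query) {
        List<List<String>> results = new ArrayList<>();
        collect(query, 0, new ArrayList<>(), results);
        return results;
    }

    private void collect(List<String> query, int depth, List<String> path,
                         List<List<String>> results) {
        if (depth == query.size()) {
            results.add(new ArrayList<>(path));
            return;
        }
        String value = query.get(depth);
        // Follow the exact child and the wildcard child; avoid visiting "*" twice.
        String[] keys = WILDCARD.equals(value)
                ? new String[] {WILDCARD}
                : new String[] {value, WILDCARD};
        for (String key : keys) {
            WildcardTrie child = children.get(key);
            if (child == null) {
                continue;
            }
            path.add(key);
            child.collect(query, depth + 1, path, results);
            path.remove(path.size() - 1);
        }
    }
}
```

Each level branches at most two ways, so a lookup visits at most 2^N nodes for tuples of length N, independent of how many tuples are stored.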

How to filter child collections in Linq

I need to filter the child elements of an entity in linq using a single linq query. Is this possible?
Suppose I have two related tables, Verses and VerseTranslations. The entity created by LINQ to SQL is such that I have a Verse object that contains a child object that is a collection of VerseTranslation.
Now if I have the following LINQ query
var res = from v in dc.Verses
          where v.id == 1
          select v;
I get a collection of Verses whose id is 1 and each verse object contains all the child objects from VerseTranslations.
What I also want to do is filter that child list of Verse Translations.
So far the only way I have been able to come up with is to project into a new type, anonymous or otherwise, as follows:
var res = from v in dc.Verses
          select new myType
          {
              VerseId = v.VerseId,
              VText = v.Text,
              VerseTranslations = (from trans in v.VerseTranslations
                                   where languageId == trans.LanguageId
                                   select trans)
          };
The above code works, but I had to declare a new class for it. Is there no way to do it in such a manner that the filtering on the child table can be incorporated in the first LINQ query so that no new classes have to be declared?
Regards,
MAC
So I finally got it to work, thanks to the pointers given by Shiraz.
DataLoadOptions options = new DataLoadOptions();
options.AssociateWith<Verse>(item => item.VerseTranslation.Where(t => languageId.Contains(t.LanguageId)));
dc.LoadOptions = options;
var res = from s in dc.Verse
select s;
This does not require projection or using new extension classes.
Thanks for all your input people.
Filtering on the enclosed collection of the object,
var res = dc.Verses
.Update(v => v.VerseTranslations
= v.VerseTranslations
.Where(n => n.LanguageId == languageId));
By using extension method "Update" from HookedOnLinq
public static class UpdateExtensions {
    public delegate void Func<TArg0>(TArg0 element);

    /// <summary>
    /// Executes an Update statement block on all elements in an IEnumerable<T> sequence.
    /// </summary>
    /// <typeparam name="TSource">The source element type.</typeparam>
    /// <param name="source">The source sequence.</param>
    /// <param name="update">The update statement to execute for each element.</param>
    /// <returns>The number of records affected.</returns>
    public static int Update<TSource>(this IEnumerable<TSource> source, Func<TSource> update) {
        if (source == null) throw new ArgumentNullException("source");
        if (update == null) throw new ArgumentNullException("update");
        if (typeof(TSource).IsValueType)
            throw new NotSupportedException("value type elements are not supported by update.");
        int count = 0;
        foreach (TSource element in source) {
            update(element);
            count++;
        }
        return count;
    }
}
If this is coming from a database, you could run your first statement.
Then do a Load or Include of VerseTranslations with a Where clause.
http://msdn.microsoft.com/en-us/library/bb896249.aspx
Do you have a relationship in your model between Verse and VerseTranslations? In that case this might work:
var res= from v in
dc.Verses.Include("VerseTranslations").Where(o => languageId==o.LanguageId)
select v;
"Is there no way to do it in such a manner that the filtering on the child table can be incorporated in the first LINQ query so that no new classes have to be declared?"
Technically, the answer is no. If you're trying to return more data than a single entity object (Verse, VerseTranslation) can hold, you'll need some sort of object to "project" into. However, you can get around explicitly declaring myType by using an anonymous type:
var res = from v in dc.Verses
select new
{
Verse = v,
Translations = (from trans in v.VerseTranslations
where languageId==trans.LanguageId
select trans).ToList()
};
var first = res.First();
Console.WriteLine("Verse {0} has {1} translation(s) in language {2}.",
first.Verse.VerseId, first.Translations.Count, languageId);
The compiler will generate a class with appropriately-typed Verse and Translations properties for you. You can use these objects for just about anything as long as you don't need to refer to the type by name (to return from a named method, for example). So while you're not technically "declaring" a type, you're still using a new type that will be generated per your specification.
As far as using a single LINQ query, it all depends how you want the data structured. To me it seems like your original query makes the most sense: pair each Verse with a filtered list of translations. If you expect only a single translation per language, you could use SingleOrDefault (or FirstOrDefault) to flatten your subquery, or just use a SelectMany like this:
var res= from v in dc.Verses
from t in v.VerseTranslations.DefaultIfEmpty()
where t == null || languageId == t.LanguageId
select new { Verse = v, Translation = t };
If a Verse has multiple translations, this will return a "row" for each Verse/Translation pair. I use DefaultIfEmpty() as a left join to make sure we will get all Verses even if they're missing a translation.

Resources