Can I get all of the terms and docId lists from Elasticsearch?

How can I get all of the terms and doc lists in ES? For example, the inverted index data looks like the following:
word1: doc1,doc5,doc6...
word2: doc3,doc9,doc12...
word3: doc5,doc100...
I just want to get all of the terms and their corresponding doc lists. Is there any API I can use to do this? Thanks!

In order to retrieve this you should understand a little about how Lucene operates. In Lucene, the index is structured (as you seem to know) as Fields -> Terms -> PostingLists (represented as PostingsEnums).
To retrieve these values, you can use the following as a template for a Lucene tool (assuming you have access to the base reader, an AtomicReader):
// get every one of the fields in the reader
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    // get the Terms for the field
    TermsEnum terms = fields.terms(field).iterator(null);
    // a term is represented by a BytesRef in lucene
    // and we will iterate across all of them using
    // the TermsEnum syntax (read lucene docs for this)
    BytesRef t;
    while ((t = terms.next()) != null) {
        // get the PostingsEnum (note that this is called
        // DocsEnum in Lucene 4.X) which represents an enumeration
        // of all of the documents for the Term t
        PostingsEnum docs = terms.postings(null);
        String line = String.format("%s: ", t.utf8ToString());
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            line += String.valueOf(docs.docID());
            line += ", ";
        }
        System.out.println(line);
    }
}
I haven't actually had a chance to run this code exactly as is (I have a similar tool I've written for my specific fork of Lucene to compare Indexes), but hopefully this gives you the general knowledge of the structure of Lucene so that you can write your own tool.
The tricky part will be getting the explicit AtomicReader from your index - but I'm sure there are other StackOverflow answers to help you with that! (As a little hint you might want to look at opening your index with DirectoryReader#open(File f)#leaves())
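To make that hint a bit more concrete, here is a minimal sketch (assuming a Lucene 5.x index on disk, where AtomicReader has been renamed LeafReader; the class name and index path argument are placeholders, not anything from your setup) of opening a DirectoryReader and walking its per-segment leaves:
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class DumpIndex {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to the index directory on disk
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            // MultiFields.getFields(reader) from the loop above also accepts this composite reader,
            // or you can walk the per-segment (leaf) readers individually:
            for (LeafReaderContext leaf : reader.leaves()) {
                System.out.println("segment with " + leaf.reader().maxDoc() + " docs");
                // leaf.reader().fields() returns the Fields used in the loop above
            }
        }
    }
}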

Related

Is it possible to get data contained in another document by id, while the map function is running for some document in a Couchbase view?

I have two kinds of documents in my couchbase bucket with keys like -
product.id.1.main
product.id.2.main
product.id.3.main
and
product.id.1.extended
product.id.2.extended
product.id.3.extended
I want to write a view for documents of the first kind, such that when some conditions are matched for a document, I can emit attributes contained in that document as well as in the corresponding document of the second kind.
Something like -
function (doc, meta) {
    if (meta.id.match("product.id.*.main") && doc.attribute1.match("value1")) {
        var extendedDocId = replaceMainWithExtended(meta.id);
        emit(meta.id, doc.attribute1 + getExtendedDoc(extendedDocId).extendedAttribute1);
    }
}
I want to know how to implement this kind of function in couchbase views -
getExtendedDoc(extendedDocId).extendedAttribute1

How Should Complex ReQL Queries be Composed?

Are there any best practices or ReQL features that help with composing complex ReQL queries?
In order to illustrate this, imagine a fruits table. Each document has the following structure.
{
"id": 123,
"name": "name",
"colour": "colour",
"weight": 5
}
If we wanted to retrieve all green fruits, we might use the following query.
r
.db('db')
.table('fruits')
.filter({colour: 'green'})
However, in more complex cases, we might wish to use a variety of complex command combinations. Bespoke queries could be written for each case, but this could be difficult to maintain and could violate the Don't Repeat Yourself (DRY) principle. Instead, we might wish to write reusable components that can be chained together, allowing complex queries to be composed in a modular fashion. This might take the following form.
r
.db('db')
.table('fruits')
.custom(component)
The component could be a function which accepts the last entity in the command chain as its argument and returns something, as follows.
function component(chain)
{
    return chain
        .filter({colour: 'green'});
}
This is not so much a feature proposal as an illustration of the problem of complex queries, although such a feature does seem intuitively useful.
Personally, my own efforts in resolving this problem have involved the creation of a compose utility function. It takes an array of functions as its main argument. Each function is called, passed a part of the query chain, and is expected to return an amended version of the query chain. Once the iteration is complete, a composition of the query components is returned. This can be viewed below.
function compose(queries, parameters)
{
    if (queries.length > 1)
    {
        let composition = queries[0](parameters);
        for (let index = 1; index < queries.length; index++)
        {
            let query = queries[index];
            composition = query(composition, parameters);
        }
        return composition;
    }
    else
    {
        throw 'Must be two or more queries.';
    }
}

function startQuery()
{
    return RethinkDB;
}

function filterQuery1(query)
{
    return query.filter({name: 'Grape'});
}

function filterQuery2(query)
{
    return query.filter({colour: 'Green'});
}

function filterQuery3(query)
{
    return query.orderBy(RethinkDB.desc('created'));
}
let composition = compose([startQuery, filterQuery1, filterQuery2, filterQuery3]);
composition.run(connection);
It would be great to know whether something like this exists, whether there are best practices to handle such cases, or whether this is an area where ReQL could benefit from improvements.
The RethinkDB documentation states it clearly: all ReQL queries are chainable.
Queries are constructed by making function calls in the programming language you already know. You don’t have to concatenate strings or construct specialized JSON objects to query the database. All ReQL queries are chainable. You begin with a table and incrementally chain transformers to the end of the query using the . operator.
You do not need to build another composition layer that only obscures your code; it makes the code harder to read and is ultimately unnecessary.
The simple way is to assign the RethinkDB query and filter to variables. Any time you need to add more complex logic, add it directly to those variables, then run() the query once it is complete.
Suppose I have to search a list of products with different filter inputs and paginate the results. The following JavaScript is simple code for illustration only:
let sorterDirection = 'asc';
let sorterColumnName = 'created_date';
let buildFilter = r.row('app_id').eq(appId).and(r.row('status').eq('public'));
// if there is no condition to start with, you could use r.expr(true)
// append every filter to buildFilter when it applies
if (escapedKeyword != "") {
    buildFilter = buildFilter.and(r.row('name').default('').downcase().match(escapedKeyword));
}
// you may have other filters to add; do the same to append them to buildFilter

// start to build the query
let query = r.table('yourTableName').filter(buildFilter);
query.orderBy(r[sorterDirection](sorterColumnName))
    .slice(pageIndex * pageSize, (pageIndex * pageSize) + pageSize)
    .run();

Searching a MongoDB collection from the end (c#)

I am looking for the most efficient way to get the last elements of a fairly large (> 1 million docs) MongoDB collection.
Specifically, it is the oplog collection and I am looking for all entries after a given timestamp. It makes no sense to search the first million or so entries for a timestamp larger than the current one, since they are all definitely older because the collection is stored in its natural order.
Is there a way to tell MongoDB to search from the end of a collection?
I tried a LINQ query with Skip(N), but it's very slow. It seems to scan all documents from the beginning and simply not return the first N.
The most efficient way is probably using aggregation. If your collection is sorted, you can get the last Timestamp using this aggregation:
var group = new BsonDocument
{
    {
        "$group", new BsonDocument
        {
            { "_id", 0 },
            { "newestTimeStamp", new BsonDocument { { "$last", "$timeStamp" } } }
        }
    }
};
var pipeline = new[] { group };
var result = _dtCollection.Aggregate(pipeline);
Then you can deserialize the result into a Timestamp class. If you want to get several elements, you could create a similar expression using $match.
Also make sure to add an index to the collection on the TimeStamp field. This will probably make your LINQ-query faster if you decide to use that instead.
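As a rough sketch of both suggestions (the "timeStamp" field name, the _dtCollection reference and the timestamp value are assumptions carried over from the snippet above, and the legacy 1.x driver API used there is assumed):
// only keep oplog entries newer than the last timestamp we have seen
var lastSeen = new BsonTimestamp(1400000000, 0); // hypothetical last-seen value
var match = new BsonDocument
{
    { "$match", new BsonDocument { { "timeStamp", new BsonDocument { { "$gt", lastSeen } } } } }
};
var sort = new BsonDocument { { "$sort", new BsonDocument { { "timeStamp", 1 } } } };
var newerEntries = _dtCollection.Aggregate(new[] { match, sort });

// an ascending index on the field keeps both this pipeline and a LINQ query fast
_dtCollection.CreateIndex(IndexKeys.Ascending("timeStamp"));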

Getting the objects with similar secondary index in Riak?

Is there a way to get all the objects in key/value format that share a given secondary index value? I know we can get the list of keys for one secondary index (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the objects as well. I don't want to perform a separate query for each key to get the object details if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is of novice level.
In Riak, it's sometimes the case that the better way is to do a separate lookup for each key. Coming from other databases this seems strange, and likely inefficient; however, you may find that a query over an index followed by a bunch of single object gets is faster than a map/reduce that fetches all the objects in a single go.
Try both these approaches, and see which turns out fastest for your dataset - variables that affect this are: size of data being queried; size of each document; power of your cluster; load the cluster is under etc.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        return [v.key];
    }""")
results = query.run()

bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # .. do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj,
        } ];
    }""")
results = query.run()

Sitecore - Load-balance Lucene queries

Sitecore.NET 6.6.0 (rev. 130404)
Our production website is very search-heavy and our Lucene indexes are queried heavily throughout the day. This amounts to a considerable amount of CPU power being spent on Lucene query processing. Are there industry practices for offloading Lucene indexes and queries to a different machine? Or are there any hardware mechanisms that can be used to boost Lucene query performance?
(Our most used Lucene index contains less than 10,000 entries)
Update (more info):
Although our index contains fewer than 10,000 entries, can the CPU usage be caused by the high number of Lucene queries that are executed in parallel? We have a very complex faceted search. Initially, when users tried out various search criteria, we displayed result-count breakdowns alongside all the search options (resulting in 50-60 count queries with each search request). This caused the CPU usage to reach 90-95% during high traffic. When we removed the counts, the CPU stabilized around 20-30%.
Here are the two methods we use for querying:
public static Document[] GetLuceneDocuments(ACIndex acIndex, Query query, Sort sort = null, int maxResults = 999, bool trackScores = false, bool fillFields = true)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    if (sort == null)
    {
        sort = new Sort(new SortField(null, SortField.SCORE));
    }
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopFieldCollector collector = TopFieldCollector.create(sort, maxResults, fillFields, trackScores, false, false);
        searcher.Search(query, collector);
        TopDocs topdocs = collector.TopDocs();
        Document[] documents = new Document[topdocs.ScoreDocs.Length];
        for (int i = 0; i < topdocs.ScoreDocs.Length; i++)
        {
            documents[i] = searcher.Doc(topdocs.ScoreDocs[i].doc);
        }
        return documents;
    }
}

public static int GetSearchResultCount(ACIndex acIndex, Query query)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
        searcher.Search(query, collector);
        return collector.GetTotalHits();
    }
}
You should look into implementing Solr for your searches. While I am not an expert on the subject, Solr is Lucene-based (making the transition easier) and runs off a central server or servers, handling all your search requirements.
Solr isn't officially supported natively in versions prior to Sitecore 7 - but I have worked on a number of Sitecore 6 solutions that did use Solr.
This article should give you a head start: How to implement Solr into Sitecore
As far as industry practices go, with Sitecore, Solr is the solution to this particular problem. Depending on your solution implementation, however, it could take some doing to get up and running.
You might look at www.alpha-solutions.dk/sitecore-search-solution for a Solr on Sitecore 6 approach.
Note: I am affiliated with Alpha Solutions
Your index is small. I know there are recommendations that you rearchitect the whole solution; however, I recommend something I have done in the past that has worked well for me and does not require provisioning another server or installing another indexing tool like Elasticsearch or Solr.
First, store the fields in the index that you facet on, like below (either in configuration or using a custom crawler):
_group
_path
_creator
Manufacturer
Size
Year
... [other fields]
Create a class that represents a result
public class MyThing
{
    public string Manufacturer { get; set; }
    public string Size { get; set; }
    public int Year { get; set; }

    public MyThing(Document doc)
    {
        Manufacturer = doc.GetField("Manufacturer").Value;
        Size = doc.GetField("Size").Value;
        Year = int.Parse(doc.GetField("Year").Value);
    }
}
Then take your main search result hits, instantiate your lightweight POCO's, and do counts off of that. Voila, 1 query!
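For illustration, one way the results list used below might be built (GetLuceneDocuments and MyThing come from the code above; the acIndex and query variables and the 5000 cap are hypothetical, and System.Linq is required):
// run the main search once, then project the documents into the lightweight POCOs
Document[] docs = GetLuceneDocuments(acIndex, query, maxResults: 5000);
List<MyThing> results = docs.Select(d => new MyThing(d)).ToList();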
int countForSomething = results.Count(result=>result.Size == "XL");
NOTE: I kind of wrote this code off the top of my head, but you get the idea. I have used this process on indexes in Lucene up to 700K+ results in Sitecore without much issue. Good luck sir!
Ah! Just tackled the issue of faceted search and CPU usage myself. This is some border-line black-magic coding and some really creative caching.
We found a way to implement Solr's faceted querying into Lucene, and boy oh boy are the results stunningly fast.
Short version:
Build a static class that holds onto a dictionary. Key: unique representation of an individual filter, Value: the BitArray produced by a Lucene QueryFilter object.
var queryFilter = new QueryFilter(filterBooleanQuery);
var bits = queryFilter.Bits(indexReader);
result[filter.ID.ToString()] = bits;
Build this dictionary periodically asynchronously in the background. My index of about 80k documents only took about 15 seconds to build, but that's enough to make a lot of users angry so doing it in a non-blocking manner is crucial.
Query this dictionary using bitwise logic to find the resulting BitArray representing the hits you're looking for.
var combo = facetDictionary[thisFilter.ID.ToString()]
    .And(facetDictionary[selectedFilter.ID.ToString()]);
Long Version:
http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/
Now, our implementation was only to get the cardinality of these result sets, but theoretically you could use these bit arrays to get actual documents out of the index as well.
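For the cardinality itself, a minimal sketch (assuming combo is the System.Collections.BitArray produced by the And() call above):
int cardinality = 0;
for (int i = 0; i < combo.Length; i++)
{
    // each set bit is one document that matches every selected filter
    if (combo[i]) cardinality++;
}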
Good luck!
Upgrading to Sitecore 7 would give you the facets out of the box, abstracted in a nice LINQ API that lets you switch between Lucene and Solr (others, like Elasticsearch, are coming)...
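For illustration only, a rough sketch of what that might look like with the Sitecore 7 ContentSearch LINQ API (the index name, the TemplateName filter and the Language facet are assumptions, not taken from the question):
using System;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq;
using Sitecore.ContentSearch.SearchTypes;

// ...

using (var context = ContentSearchManager.GetIndex("sitecore_web_index").CreateSearchContext())
{
    // ask the index for facet counts instead of issuing one count query per option
    var facets = context.GetQueryable<SearchResultItem>()
        .Where(item => item.TemplateName == "Product")
        .FacetOn(item => item.Language)
        .GetFacets();

    foreach (var category in facets.Categories)
    {
        foreach (var value in category.Values)
        {
            Console.WriteLine("{0}: {1}", value.Name, value.AggregateCount);
        }
    }
}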
