Lucene scoring: get cosine similarity as scores - elasticsearch

I'm trying to solve nearest neighbor search problem.
Here is my code:
// Indexing
val analyzer = new StandardAnalyzer()
val directory = new RAMDirectory()
val config = new IndexWriterConfig(analyzer)
val iwriter = new IndexWriter(directory, config)
val queryField = "fieldname"
stringData.foreach { str =>
  val doc = new Document()
  doc.add(new TextField(queryField, str, Field.Store.YES))
  iwriter.addDocument(doc)
}
iwriter.close()
// Searching
val ireader = DirectoryReader.open(directory)
val isearcher = new IndexSearcher(ireader)
val parser = new QueryParser(queryField, analyzer)
val query = parser.parse("Some text for testing")
val hits = isearcher.search(query, 10).scoreDocs
When I look at the value of hits, I see scores greater than 1.
As far as I know, the Lucene scoring formula is:
score(q,d) = coord-factor(q,d) · query-boost(q) · cosSim(q,d) · doc-len-norm(d) · doc-boost(d)
But I want to get only the cosine similarity, in the range [0, 1], between query and document, instead of the coord-factor, doc-len-norm, and so on.
What is a possible way to achieve it?

If you have gone through the official documentation, you will realize that the rest of the terms in the score expression are important and make the scoring process more logical and coherent.
But if you still want a scoring process that uses only cosine similarity, you can write your own custom similarity class. I have used different types of similarity methods for document retrieval in a class assignment. So, in short, you can write your own similarity method and assign it to Lucene's index searcher. I am giving an example here which you can modify to accomplish what you want.
Write your custom class (apart from a trivial toString(), which SimilarityBase requires, you just need to override one method):
import org.apache.lucene.search.similarities.BasicStats;
import org.apache.lucene.search.similarities.SimilarityBase;

public class MySimilarity extends SimilarityBase {

    @Override
    protected float score(BasicStats stats, float termFreq, float docLength) {
        // log-scaled term frequency: 1 + log2(termFreq)
        double tf = 1 + (Math.log(termFreq) / Math.log(2));
        // inverse document frequency: log2((N + 1) / docFreq);
        // the 1.0 keeps the division in floating point
        double idf = Math.log((stats.getNumberOfDocuments() + 1.0) / stats.getDocFreq()) / Math.log(2);
        float dotProduct = (float) (tf * idf);
        return dotProduct;
    }

    @Override
    public String toString() {
        // SimilarityBase declares toString() abstract, so it must be implemented
        return "MySimilarity";
    }
}
Then assign your custom similarity to the index searcher for relevance calculation, as below.
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
IndexSearcher indexSearcher = new IndexSearcher(reader);
indexSearcher.setSimilarity(new MySimilarity());
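Note that SimilarityBase also controls how index-time norms are computed, so if you want indexing and searching to agree you would typically set the same similarity on the IndexWriterConfig too. A minimal sketch, assuming the analyzer and directory from the question:
IndexWriterConfig config = new IndexWriterConfig(analyzer);
// use the same similarity at index time so stored norms match the search-time model
config.setSimilarity(new MySimilarity());
IndexWriter iwriter = new IndexWriter(directory, config);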
Here, I am using the tf-idf dot product to compute the similarity between query and documents. The formula, matching the code above, is:
score(t, d) = tf(t, d) * idf(t), where tf(t, d) = 1 + log2(termFreq) and idf(t) = log2((N + 1) / docFreq)
Two things need to be mentioned here:
stats.getNumberOfDocuments() returns the total number of documents (N) in the index.
stats.getDocFreq() returns the document frequency (docFreq) of a term that appears in both the query and the document.
Lucene will now call the score() method you implemented to compute the relevance score for each of the matched terms, i.e. terms that appear in both the query and the documents.
I know this is not a straightforward answer to your question, but you can use the approach I described above in any way you want. I implemented six different scoring techniques in my homework assignment. I hope it helps you too.

Related

Can I get all of the terms and docId lists from ElasticSearch

How can I get all of the terms and doc lists in ES? For example, the inverted index data looks like the following:
word1: doc1,doc5,doc6...
word2: doc3,doc9,doc12...
word3: doc5,doc100...
I just want to get all of the terms and their corresponding doc lists. Is there any API I can do this with? Thanks!
In order to retrieve this, you should understand a little bit about how Lucene operates. In Lucene, the index is structured (as you seem to know) as Fields -> Terms -> PostingLists (represented as PostingsEnums).
To retrieve these values, you can use this as a template Lucene tool (assuming you have access to the base reader, an AtomicReader):
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// get every one of the fields in the reader
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    // get the Terms for the field
    TermsEnum terms = fields.terms(field).iterator();
    // a term is represented by a BytesRef in lucene
    // and we will iterate across all of them using
    // the TermsEnum syntax (read lucene docs for this)
    BytesRef t;
    while ((t = terms.next()) != null) {
        // get the PostingsEnum (note that this is called
        // DocsEnum in Lucene 4.X) which represents an enumeration
        // of all of the documents for the Term t
        PostingsEnum docs = terms.postings(null);
        String line = String.format("%s: ", t.utf8ToString());
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            line += String.valueOf(docs.docID());
            line += ", ";
        }
        System.out.println(line);
    }
}
I haven't actually had a chance to run this code exactly as is (I have a similar tool I've written for my specific fork of Lucene to compare indexes), but hopefully this gives you a general sense of the structure of Lucene so that you can write your own tool.
The tricky part will be getting the explicit AtomicReader from your index - but I'm sure there are other StackOverflow answers to help you with that! (As a little hint, you might want to look at opening your index with DirectoryReader#open(File f) and walking its #leaves().)
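For example, a minimal sketch of that hint (assuming Lucene 5.x, where the per-segment readers are called LeafReader; in 4.x they are the AtomicReaders mentioned above, and FSDirectory.open takes a File instead of a Path; the index path is a placeholder):
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
// leaves() exposes one context per index segment
for (LeafReaderContext context : reader.leaves()) {
    LeafReader leafReader = context.reader();
    // hand leafReader to the terms-dumping snippet above
}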

Ehcache query paged

Is there any way to paginate an Ehcache query from item X to item Y in the index?
Query query = getCache().createQuery();
Attribute<String> did = new Attribute<String>("did");
Attribute<Date> activity = new Attribute<Date>("activity");
Attribute<Double> latitude = new Attribute<Double>("latitude");
Attribute<Double> longitude = new Attribute<Double>("longitude");
query
    .addOrderBy(activity, Direction.DESCENDING)
    .includeAttribute(did)
    .includeAttribute(activity)
    .includeAttribute(latitude)
    .includeAttribute(longitude)
    .includeValues()
    .end();
Results results = query.execute();
// To do in query???
List<Result> page = results.range(range * 20, (range + 1) * 20);
I know that calling Results.range(int, int) after execute() does it, but I want to fetch only the items of interest in the query itself.
Thank you in advance.
The way you describe is the way to do it.
What you are querying for contains an orderBy clause, so the query cannot even be answered correctly without looking at all the results anyway.
Remember also that this is a cache you are dealing with: two queries at different times may return different results as a consequence of expiry or eviction taking place. In this context, trying to get a specific range of results across query executions may show duplicate or missing results.
Is this approach a good approximation to get more performance?
query
    .addOrderBy(activity, Direction.DESCENDING)
    .includeAttribute(activity)
    .includeValues()
    .maxResults((range + 1) * 20)
    .end();
List<Result> page = query.execute().range(range * 20, (range + 1) * 20);
I use an activity Date property, set when the item is pushed into the cache, to avoid the problem described by Louis Jacomet previously.

Neo4j formulating Cypherquery, performance issue, multiple startpoint

I have to execute the following Query:
#Query("START whps=node:__types__(className = 'de.adma.domain.WHProcessStep'),
csd=node:__types__(className = 'de.adma.domain.CSDocument'),
whm=node:__types__(className = 'de.adma.domain.WHMachine')
MATCH whps<-[r1:RELATES_TO]-csd<-[r2:OCCURS_IN]-whm
WHERE (whps.id IN {0}) AND (csd.id IN {1})
RETURN DISTINCT whm ")
Each of these classes (CSDocument, WHMachine, ...) has the same scaffold:
@NodeEntity
public class CSDocument {
    @GraphId
    Long nodeId;
    @Indexed(unique = true)
    String id;
    @Indexed(indexType = IndexType.FULLTEXT, indexName = "accessUri")
    String accessUri;
    // .. definition of some RelatedToVia relationships and getters/setters
}
Is the query as formulated the correct way to query neo4j?
Currently this works fine for small amounts up to ~100k Nodes/Relationships (query needs <5 seconds).
I need this for ~10 million nodes/relationships, but the query runs for several minutes.
My test environment is a VM: Xeon 2.18 GHz (hexacore), 32 GB RAM, SSD.
JVM config:
-Xmx14000m
-XX:MaxPermSize=4048m
-Xss3068m
-XX:+UseConcMarkSweepGC
I am using Neo4j embedded 1.8.1 inside an Java-Spring application.
Any ideas how I could improve the performance?
Is there another way to handle the multiple start points when using the IN statement? It seems these multiple start points slow down the queries.
Do I have to define an index?
Thanks!
Use an index lookup on your id index:
#Query("START whps=node:WHProcessStep(id = {0}),
csd=node:CSDocument(id = {1})
MATCH whps<-[:RELATES_TO]-csd<-[:OCCURS_IN]-whm
RETURN DISTINCT whm ")
If you want to pass multiple ids to the index, you unfortunately have to pass the whole index query as a parameter to your method:
#Query("START whps=node:WHProcessStep({0}),
csd=node:CSDocument({1})
MATCH whps<-[:RELATES_TO]-csd<-[:OCCURS_IN]-whm
RETURN DISTINCT whm ")
Collection<WHMachine> find(String whps, String csd);
where the two strings are: String whps = "id:(id1 id2 id3)";
Is it better if you simply drop these parts of your START clause?
csd=node:__types__(className = 'de.adma.domain.CSDocument'),
whm=node:__types__(className = 'de.adma.domain.WHMachine')
You're making a cartesian product of all of your start variables and then reducing it in the MATCH. It will be quicker to do the type checks after the match; see the sketch below.
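For illustration, dropping those start points and keeping only the first lookup might look like this (an untested sketch of the query from the question):
@Query("START whps=node:__types__(className = 'de.adma.domain.WHProcessStep')
MATCH whps<-[:RELATES_TO]-csd<-[:OCCURS_IN]-whm
WHERE (whps.id IN {0}) AND (csd.id IN {1})
RETURN DISTINCT whm ")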

Sitecore - Load-balance Lucene queries

Sitecore.NET 6.6.0 (rev. 130404)
Our production website is very search-heavy and our Lucene indexes are queried heavily throughout the day. This amounts to a considerable amount of CPU power being spent on Lucene query processing. Are there industry practices for offloading Lucene indexes and queries to a different machine? Or are there hardware mechanisms that can be used to boost Lucene query performance?
(Our most used Lucene index contains less than 10,000 entries)
Update (more info):
Although our index contains fewer than 10,000 entries, can the CPU usage be caused by the high number of Lucene queries executed in parallel? We have a very complex faceted search. Initially, when users tried out various search criteria, we displayed result-count breakdowns alongside all the search options (resulting in 50-60 count queries with each search request). This caused the CPU usage to reach 90-95% during high traffic. When we removed the counts, the CPU stabilized around 20-30%.
Here are the two methods we use for querying:
public static Document[] GetLuceneDocuments(ACIndex acIndex, Query query, Sort sort = null, int maxResults = 999, bool trackScores = false, bool fillFields = true)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    if (sort == null)
    {
        sort = new Sort(new SortField(null, SortField.SCORE));
    }
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopFieldCollector collector = TopFieldCollector.create(sort, maxResults, fillFields, trackScores, false, false);
        searcher.Search(query, collector);
        TopDocs topdocs = collector.TopDocs();
        Document[] documents = new Document[topdocs.ScoreDocs.Length];
        for (int i = 0; i < topdocs.ScoreDocs.Length; i++)
        {
            documents[i] = searcher.Doc(topdocs.ScoreDocs[i].doc);
        }
        return documents;
    }
}

public static int GetSearchResultCount(ACIndex acIndex, Query query)
{
    Index index = SearchManager.GetIndex(GetIndexName(acIndex));
    using (IndexSearchContext searchContext = index.CreateSearchContext())
    {
        Lucene.Net.Search.IndexSearcher searcher = searchContext.Searcher;
        TopScoreDocCollector collector = TopScoreDocCollector.create(1, false);
        searcher.Search(query, collector);
        return collector.GetTotalHits();
    }
}
You should look into implementing Solr for your searches. While I'm not an expert on the subject, Solr is Lucene-based (making the transition easier) and runs on a central server or servers, handling all your search requirements.
Solr isn't natively supported in versions prior to Sitecore 7 - but I have worked on a number of Sitecore 6 solutions that did use Solr.
This article should give you a head start: How to implement Solr into Sitecore
As far as industry processes go, with Sitecore, Solr is the solution to this particular problem. Depending on your solution implementation however, it could take some doing to get up and going.
You might look at www.alpha-solutions.dk/sitecore-search-solution for a Solr on Sitecore 6 approach.
Note: I am affiliated with Alpha Solutions
Your index is small. I know there are recommendations that you rearchitect the whole solution; however, I recommend something I have done in the past that has worked well for me and does not require you to provision another server or install another indexing tool like Elasticsearch or Solr.
First, store the fields in the index that you facet on, like below (either in configuration or using a custom crawler):
_group
_path
_creator
Manufacturer
Size
Year
... [other fields]
Create a class that represents a result
public class MyThing
{
    public string Manufacturer { get; set; }
    public string Size { get; set; }
    public int Year { get; set; }

    public MyThing(Document doc)
    {
        Manufacturer = doc.GetField("Manufacturer").Value;
        Size = doc.GetField("Size").Value;
        Year = int.Parse(doc.GetField("Year").Value);
    }
}
Then take your main search result hits, instantiate your lightweight POCOs, and do counts off of that. Voila, one query!
int countForSomething = results.Count(result=>result.Size == "XL");
NOTE: I kind of wrote this code off the top of my head, but you get the idea. I have used this process on Lucene indexes with 700K+ results in Sitecore without much issue. Good luck, sir!
Ah! I just tackled the issue of faceted search and CPU usage myself. This is some borderline black-magic coding and some really creative caching.
We found a way to implement Solr-style faceted querying on top of Lucene, and boy oh boy are the results stunningly fast.
Short version:
Build a static class that holds onto a dictionary. Key: a unique representation of an individual filter; Value: the BitArray produced by a Lucene QueryFilter object.
var queryFilter = new QueryFilter(filterBooleanQuery);
var bits = queryFilter.Bits(indexReader);
result[filter.ID.ToString()] = bits;
Build this dictionary periodically and asynchronously in the background. My index of about 80k documents only took about 15 seconds to build, but that's enough to make a lot of users angry, so doing it in a non-blocking manner is crucial.
Query this dictionary using bitwise logic to find the resulting BitArray representing the hits you're looking for.
// BitArray.And mutates the instance it is called on,
// so clone the cached array before combining
var combo =
    ((BitArray)facetDictionary[thisFilter.ID.ToString()].Clone())
        .And(facetDictionary[selectedFilter.ID.ToString()]);
Long Version:
http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/
Now, our implementation was only to get the cardinality of these result sets, but theoretically you could use these bit arrays to get actual documents out of the index as well.
Good luck!
Upgrading to Sitecore 7 would give you facets out of the box, abstracted in a nice LINQ API that lets you switch between Lucene and Solr (others, like Elasticsearch, are coming)...

Getting all zip codes within an n mile radius

What's the best way to get a function like the following to work:
def getNearest(zipCode, miles):
That is, given a zipcode (07024) and a radius, return all zipcodes which are within that radius?
There is a project on SourceForge that could assist with this:
http://sourceforge.net/projects/zips/
It gives you a database with zip codes and their latitude/longitude, as well as coding examples of how to calculate the distance between two sets of coordinates. There is probably a better way to do it, but you could have your function retrieve the zip code and its coordinates, and then step through each zip code in the list, adding it to the result if it falls within the number of miles specified; a sketch of that loop follows.
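For example, a minimal sketch of that loop in Java (16+), using the haversine great-circle formula; the ZipEntry record and the zips list are placeholders for whatever the database above gives you:
import java.util.ArrayList;
import java.util.List;

public class ZipRadius {
    // hypothetical holder for one row of the zip code database
    record ZipEntry(String zip, double lat, double lon) {}

    static final double EARTH_RADIUS_MILES = 3958.8;

    // great-circle distance between two lat/lon points (haversine formula)
    static double distanceMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return EARTH_RADIUS_MILES * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // collect every zip code whose center lies within the given radius
    static List<String> getNearest(ZipEntry center, List<ZipEntry> zips, double miles) {
        List<String> result = new ArrayList<>();
        for (ZipEntry z : zips) {
            if (distanceMiles(center.lat(), center.lon(), z.lat(), z.lon()) <= miles) {
                result.add(z.zip());
            }
        }
        return result;
    }
}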
If you want this to be accurate, you must start with polygon data that includes the location and shape of every zip code. I have a database like this (it used to be published by the US Census, but they no longer publish it) and have built similar things on top of it, but not that exact request.
If you don't care about being exact (which I'm guessing you don't), you can get a table of center points of zipcodes and query points ordered by great circle distance. PostGIS provides great tools for doing this, although you may construct a query against other databases that will perform similar tasks.
An alternate approach I've used is to construct a box that encompasses the circle you want, query with a BETWEEN clause on lon/lat, and then do the great-circle check in app code; the box computation is sketched below.
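A sketch of that bounding-box computation in Java (assuming roughly 69 miles per degree of latitude; a degree of longitude shrinks by cos(latitude)):
// approximate length of one degree of latitude, in miles
static final double MILES_PER_DEGREE = 69.0;

// returns {minLat, maxLat, minLon, maxLon} for a box enclosing the search circle
static double[] boundingBox(double lat, double lon, double radiusMiles) {
    double dLat = radiusMiles / MILES_PER_DEGREE;
    double dLon = radiusMiles / (MILES_PER_DEGREE * Math.cos(Math.toRadians(lat)));
    return new double[] { lat - dLat, lat + dLat, lon - dLon, lon + dLon };
}
Rows that fall inside the box (via the BETWEEN clause) are then confirmed with the exact great-circle distance in application code.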
Maybe this can help. The project is configured in kilometers, though; you can modify this in CityDAO.java:
public List<City> findCityInRange(GeoPoint geoPoint, double distance) {
    List<City> cities = new ArrayList<City>();
    QueryBuilder queryBuilder = geoDistanceQuery("geoPoint")
            .point(geoPoint.getLat(), geoPoint.getLon())
            //.distance(distance, DistanceUnit.KILOMETERS) original
            .distance(distance, DistanceUnit.MILES)
            .optimizeBbox("memory")
            .geoDistance(GeoDistance.ARC);
    SearchRequestBuilder builder = esClient.getClient()
            .prepareSearch(INDEX)
            .setTypes("city")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setScroll(new TimeValue(60000))
            .setSize(100).setExplain(true)
            .setPostFilter(queryBuilder)
            .addSort(SortBuilders.geoDistanceSort("geoPoint")
                    .order(SortOrder.ASC)
                    .point(geoPoint.getLat(), geoPoint.getLon())
                    //.unit(DistanceUnit.KILOMETERS)); Original
                    .unit(DistanceUnit.MILES));
    SearchResponse response = builder
            .execute()
            .actionGet();
    SearchHit[] hits = response.getHits().getHits();
    scroll:
    while (true) {
        for (SearchHit hit : hits) {
            Map<String, Object> result = hit.getSource();
            cities.add(mapper.convertValue(result, City.class));
        }
        response = esClient.getClient().prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
        // refresh the batch, otherwise the first page would be processed forever
        hits = response.getHits().getHits();
        if (hits.length == 0) {
            break scroll;
        }
    }
    return cities;
}
The "LocationFinder\src\main\resources\json\cities.json" file contains all cities from Belgium. You can delete or create entries if you want too. As long as you don't change the names and/or structure, no code changes are required.
Make sure to read the README https://github.com/GlennVanSchil/LocationFinder
