How to filter out "Hits" result returned by indexsearcher.search() function? - performance

How can I reduce the size of the "Hits" object returned by the IndexSearcher.search() method?
Currently I do something like:
Hits hits = indexSearch.search(query, filter, ...);
Iterator hitsIt = hits.iterator();
int newSize = 0;
while (hitsIt.hasNext()) {
    Hit currHit = (Hit) hitsIt.next();
    if (hasPermission(currHit)) {
        newSize++;
    }
}
However, this is creating a huge performance problem when the number of hits is large (like 500 or more).
I have heard of something called "HitsCollector" or maybe "Collector" which is supposed to help improve performance, but I don't have any idea how to use it.
Would appreciate it if someone could point me in the right direction.
We are using Apache Lucene for indexing within the Atlassian Confluence web application.

A Collector is just a simple callback mechanism that gets invoked for each document hit. You'd use a collector like this:
public class MyCollector extends HitCollector {
    // this is called back for every document that
    // matches, with the docid and the score
    public void collect(int doc, float score) {
        // do whatever you have to in here
    }
}
...
HitCollector collector = new MyCollector();
indexSearch.search(query, filter, collector);

For good performance you will have to index your security information along with each document. This of course depends on your security model. For example, if you can assign each document to security roles that have permissions for it, index those roles as a field on the document and check them against the current user's roles in the collector or in a filter. Also
check out this question. Yours is pretty much a duplicate of that.

Related

Is it safe to pass a Lucene Query String directly from a user into a QueryParser?

tldr: Can I securely pass a raw query string (retrieved as a URL parameter) into a Lucene QueryParser without any added input sanitization?
I'm not a security expert, but I need some advice. As the title states, is it safe to use this controller method:
@CrossOrigin(origins = "${allowed-origin}")
@GetMapping(value = "/search/{query_string}", produces = MediaType.APPLICATION_JSON_VALUE)
public List doSearch(@PathVariable("query_string") String queryString) {
    return searchQueryHandlerService.doSearch(queryString);
}
In tandem with this service method (the error handling is for testing only):
public List doSearch(String queryString) {
    LOGGER.debug("Parsing query string: " + queryString);
    try {
        Query q = new QueryParser(null, standardAnalyzer).parse(queryString);
        FullTextEntityManager manager = Search.getFullTextEntityManager(entityManager);
        FullTextQuery fullTextQuery = manager.createFullTextQuery(q, Poem.class, Book.class, Section.class);
        return fullTextQuery.getResultList();
    } catch (ParseException e) {
        LOGGER.error(e);
        return Collections.emptyList();
    }
}
With only basic input sanitization? If this isn't safe, are there measures I can take to make it safe?
Any help is greatly appreciated.
I've been looking into this on and off for the last few weeks and I cannot find any reason why it wouldn't be safe, but it's such an obscure question (in an area I'm unfamiliar with) that I may be missing some obvious, fundamental problem that anyone working in the area would see immediately.
A FullTextQuery is always read only, so you don't have to be concerned with people dropping tables or similar issues that you might have to consider when dealing with SQL injection.
But you might want to be careful if you have security restrictions on what data can be seen by your users.
The API also restricts the operation to a certain set of indexes - in your case those containing the Poem entities - so it's also not possible to break out of the chosen indexes.
But you need to consider:
whether it is OK if the user is able to somehow find a different Poem than the one you expected them to look for;
whether, if you share the same index with other entities, there might be ways to infer data about those other entities.
So to be security conscious you might want to:
index each entity type into its own index (which is the default);
enable a FullTextFilter to restrict the user query based on your custom rules;
actually check the content of each result before rendering it, so as to remove content that your other filters didn't catch.
If you are extremely paranoid, consider that any full-text index can actually reveal a bit about how frequent certain terms are in the whole index. People are normally not too concerned about this as it's extremely hard to take advantage of, and only minimal clues about the data distribution are revealed.
So back to your example: if this index just contains poems and you're OK with allowing any user to see any poem you have stored, giving away clues about which poems you make available is normally not a security concern but rather the whole point of your service.

Retrieving data from arbitrary memory addresses using VSIX

I am working on developing a debugger plugin for visual studio using VSIX. My problem is I have an array of addresses but I cannot set the IDebugMemoryBytes2 to a particular address. I use DEBUG_PROPERTY_INFO and get the array of addresses, and I also am able to set the context to the particular addresses in the array using the Add function in IDebugMemoryContext2. However, I need to use the ReadAt function to retrieve n bytes from a specified address (from IDebugMemoryBytes2).
Does anyone have any idea how to retrieve data from arbitrary addresses in memory?
I am adding more information on the same:
I am using the Microsoft Visual Studio Extensibility package to build my debugger plugin. The application I am trying to debug contains a double pointer, and I need to read the values it points to so I can process them further in my plugin. There is no way to display all of the pointed-to variables in the watch window, so I cannot get the DEBUG_PROPERTY_INFO for each of the blocks of arrays the pointer variable refers to. That is the problem I am trying to address: I have no way to read the memory pointed to by this double pointer.
Now as for the events in the debuggee process, since the plugin is for debugging variables, I put a breakpoint at a place where I know this pointer is populated and then come back to the plugin for further evaluation.
As a start, I was somehow able to get the starting address of each of the arrays. But still, I am not able to read x bytes of memory from each of these starting addresses.
i.e., for example, if I have int **ptr = // pointing to something
I have the addresses present in ptr[0], ptr[1], ptr[2], etc., but I need to go to each of these addresses and fetch the memory block it points to.
For this, after much search, I found this link: https://macropolygon.wordpress.com/2012/12/16/evaluating-debugged-process-memory-in-a-visual-studio-extension/ which seems to address exactly my issue.
So to use the expression evaluator functions, I need an IDebugStackFrame2 object to get the ExpressionContext. To get this object, I need to register for debug events in the debuggee process, in particular the breakpoint event. As described in the post, I did:
public int Event(IDebugEngine2 engine, IDebugProcess2 process,
                 IDebugProgram2 program, IDebugThread2 thread,
                 IDebugEvent2 debugEvent, ref Guid riidEvent, uint attributes)
{
    if (debugEvent is IDebugBreakpointEvent2)
    {
        this.thread = thread;
    }
    return VSConstants.S_OK;
}
And my registration is like:
private void GetCurrentThread()
{
    uint cookie;
    DBGMODE[] modeArray = new DBGMODE[1];

    // Get the Debugger service.
    debugService = Package.GetGlobalService(typeof(SVsShellDebugger)) as IVsDebugger;
    if (debugService != null)
    {
        // Register for debug events.
        // Assumes the current class implements IDebugEventCallback2.
        debugService.AdviseDebuggerEvents(this, out cookie);
        debugService.AdviseDebugEventCallback(this);

        debugService.GetMode(modeArray);
        modeArray[0] = modeArray[0] & ~DBGMODE.DBGMODE_EncMask;
        if (modeArray[0] == DBGMODE.DBGMODE_Break)
        {
            GetCurrentStackFrame();
        }
    }
}
But this doesn't seem to invoke the Event function at all and hence, I am not sure how to get the IDebugThread2 object.
I also tried the other way suggested in the same post:
namespace Microsoft.VisualStudio.Debugger.Interop.Internal
{
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown), Guid("1DA40549-8CCC-48CF-B99B-FC22FE3AFEDF")]
    public interface IDebuggerInternal11
    {
        [DispId(0x6001001f)]
        IDebugThread2 CurrentThread
        {
            [return: MarshalAs(UnmanagedType.Interface)]
            [MethodImpl(MethodImplOptions.InternalCall, MethodCodeType = MethodCodeType.Runtime)]
            get;
            [param: In, MarshalAs(UnmanagedType.Interface)]
            [MethodImpl(MethodImplOptions.InternalCall, MethodCodeType = MethodCodeType.Runtime)]
            set;
        }
    }
}
private void GetCurrentThread()
{
    debugService = Package.GetGlobalService(typeof(SVsShellDebugger)) as IVsDebugger;
    if (debugService != null)
    {
        IDebuggerInternal11 debuggerServiceInternal = (IDebuggerInternal11)debugService;
        thread = debuggerServiceInternal.CurrentThread;
        GetCurrentStackFrame();
    }
}
But in this method I think I am missing something, though I am not sure what, because after executing the line
IDebuggerInternal11 debuggerServiceInternal = (IDebuggerInternal11)debugService;
and checking the values of the debuggerServiceInternal variable, I see a System.Security.SecurityException for CurrentThread and CurrentStackFrame (and so, obviously, the next line causes a crash). I googled the error and found I was missing the ComImport attribute on the interface. So I added that, and now I get a System.AccessViolationException: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
I am new to C# programming as well and hence, it is a bit difficult to grasp many things in short duration. I am lost as to how to proceed further now.
Any help in the same or suggestions to try another way to achieve my objective will be greatly appreciated.
Thanks a lot,
Esash
After much searching, and since I am short of time and need a quick solution, for now the quickest way to solve this problem seems to be to hack the .natvis files, making them display all the elements of the pointer, and then use the same old IDebug* interface methods to access and retrieve the memory context for each of the pointer elements. But after posting the same question on the MSDN forums, I think the proper answer to this problem is the one given by Greggs:
"For reading memory, if you want a fast way to do this, you just want the raw memory, and the debug engine of the target is the normal Visual Studio native engine (in other words, you aren't creating your own debug engine), I would recommend referencing Microsoft.VisualStudio.Debugger.Engine. You can then use DkmStackFrame.ExtractFromDTEObject to get the DkmStackFrame object. This will give you the DkmProcess object and you can call DkmProcess.ReadMemory to read memory from the target."
Now, after trying a lot to understand how to implement this, I found that you can accomplish it simply by calling DkmProcess.GetProcesses() and doing a ReadMemory on the process returned.
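For anyone looking for a concrete starting point, here is a minimal sketch of that approach. It assumes you reference Microsoft.VisualStudio.Debugger.Engine, that a native debuggee is at a break state under the default Visual Studio engine, and that the first process returned is the one you want; the helper name and the exact overload choice are my own, so verify them against the Concord API documentation.

using Microsoft.VisualStudio.Debugger;

internal static class DebuggeeMemory
{
    // Read 'count' bytes of debuggee memory starting at 'address'.
    internal static byte[] Read(ulong address, int count)
    {
        // See the note below about multiple processes; here we simply take the first one.
        DkmProcess process = DkmProcess.GetProcesses()[0];

        byte[] buffer = new byte[count];
        process.ReadMemory(address, DkmReadMemoryFlags.None, buffer);
        return buffer;
    }
}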
There is a question now: what if more than one process is returned? Well, I tried attaching many processes to the current debugging process, and attaching many processes to the debuggee process as well, and found that DkmProcess.GetProcesses() returns only the one from which I regained control, not the other processes I am attached to. I am not sure this will hold in all cases, but it worked this way for me, and it might work for anyone with similar requirements.
Using the .natvis files to accomplish this means using IndexListItems for VS2013 and earlier and CustomListItems for VS2015 and later, and, to make it look prettier, the "no-derived" attribute. There is no way to make the Synthetic tag display only the base address of each variable, so the above attribute is the best way to go about it, but it is not available in VS2013 and earlier (the base address might get displayed, but for anyone who wants to go beyond just displaying contents and also access the memory context of each pointer element, the Synthetic tag is not the right tool).
I hope this helps some developer who struggled like me with the IDebug* interfaces. For reference, here is the link to the MSDN forum thread where my question was answered:
https://social.msdn.microsoft.com/Forums/en-US/030cef1c-ee79-46e9-8e40-bfc59f14cc34/how-can-i-send-a-custom-debug-event-to-my-idebugeventcallback2-handler?forum=vsdebug
Thanks.

Improve Script performance by caching Spreadsheet values

I am trying to develop a webapp using Google Apps Script to be embedded into a Google Site which simply displays the contents of a Google Sheet and filters it using some simple parameters. For the time being, at least. I may add more features later.
I got a functional app, but found that filtering could often take a while as the client sometimes had to wait up to 5 seconds for a response from the server. I decided that this was most likely due to the fact that I was loading the spreadsheet by ID using the SpreadsheetApp class every time it was called.
I decided to cache the spreadsheet values in my doGet function using the CacheService and retrieve the data from the cache each time instead.
However, for some reason this has meant that what was a 2-dimensional array is now treated as a 1-dimensional array. And so, when displaying the data in an HTML table, I end up with a single column, with each cell occupied by a single character.
This is how I have implemented the caching; as far as I can tell from the API reference I am not doing anything wrong:
function doGet() {
  CacheService.getScriptCache().put('data', SpreadsheetApp
      .openById('####')
      .getActiveSheet()
      .getDataRange()
      .getValues());
  return HtmlService
      .createTemplateFromFile('index')
      .evaluate()
      .setSandboxMode(HtmlService.SandboxMode.IFRAME);
}

function getData() {
  return CacheService.getScriptCache().get('data');
}
This is my first time developing a proper application using GAS (I have used it in Sheets before). Is there something very obvious I am missing? I didn't see any type restrictions on the CacheService reference page...
CacheService stores Strings, so objects such as your two-dimensional array will be coerced to Strings, which may not meet your needs.
Use the JSON utility to take control of the results.
myCache.put( 'tag', JSON.stringify( myObj ) );
...
var cachedObj = JSON.parse( myCache.get( 'tag' ) );
The cache expires. The put method without an expirationInSeconds parameter expires in 10 minutes. If you need your data to stay alive for more than 10 minutes, you need to specify an expirationInSeconds, and the maximum is 6 hours. So if you need data that specifically does NOT expire, the Cache might not be the best fit.
You can use Cache for something like controlling how long a user can be logged in.
You could also try using a global variable, which some people would tell you to never use. To declare a global variable, define the variable outside of any function.

Implementing thread-safe, parallel processing

I am trying to convert an existing process in a way that it supports multi-threading and concurrency to make the solution more robust and reliable.
Take the example of an emergency alert system. When a worker clocks in, a new Recipient object is created with their information and added to the Recipients collection. Conversely, when they clock out, the object is removed. And in the background, when an alert occurs, the alert engine iterates through the same list of Recipients (foreach), calling SendAlert(...) on each object.
Here are some of my requirements:
Adding a recipient should not block if an alert is in progress.
Removing a recipient should not block if an alert is in progress.
Adding or removing a recipient should not affect the list of recipients used by an in-progress alert.
I've been looking at the Task and Parallel classes as well as the BlockingCollection and ConcurrentQueue classes but am not clear what the best approach is.
Is it as simple as using a BlockingCollection? After reading a ton of documentation, I'm still not sure what happens if Add is called while I am enumerating the collection.
UPDATE
A colleague referred me to the following article which describes the ConcurrentBag class and how each operation behaves:
http://www.codethinked.com/net-40-and-system_collections_concurrent_concurrentbag
Based on the author's explanation, it appears that this collection will (almost) serve my purposes. I can do the following:
Create a new collection
var recipients = new ConcurrentBag<Recipient>();
When a worker clocks in, create a new Recipient and add it to the collection:
recipients.Add(new Recipient());
When an alert occurs, the alert engine can iterate through the collection at that time because GetEnumerator uses a snapshot of the collection items.
foreach (var recipient in recipients)
    recipient.SendAlert(...);
When a worker clocks out, remove the recipient from the collection:
???
The ConcurrentBag does not provide a way to remove a specific item. None of the concurrent classes do as far as I can tell. Am I missing something? Aside from this, ConcurrentBag does everything I need.
ConcurrentBag<T> should definitely be the best-performing class of the bunch for such a case. Enumeration works exactly as your friend describes, so it should serve the scenario you have laid out well. However, since you have to remove specific items from this set, the only type that is going to work for you is ConcurrentDictionary<K, V>. All the other types only offer a TryTake method which, in the case of ConcurrentBag<T>, is indeterminate or, in the case of ConcurrentQueue<T> or ConcurrentStack<T>, strictly ordered.
For broadcasting you would just do:
ConcurrentDictionary<string, Recipient> myConcurrentDictionary = ...;
...
foreach(Recipient recipient in myConcurrentDictionary.Values)
{
...
}
The enumerator is once again a snapshot of the dictionary in that instant.
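To make the clock-in/clock-out side concrete, a minimal sketch along those lines might look like this. The workerId key is my assumption; any unique, stable identifier for the worker will do.

ConcurrentDictionary<string, Recipient> recipients = new ConcurrentDictionary<string, Recipient>();

// Clock-in: add the recipient under the worker's unique id.
recipients.TryAdd(workerId, new Recipient());

// Clock-out: remove that specific recipient without blocking an in-progress alert.
Recipient removed;
recipients.TryRemove(workerId, out removed);

// Alert: enumerate a moment-in-time view of the current recipients.
foreach (Recipient recipient in recipients.Values)
{
    recipient.SendAlert(...);
}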
I came into work this morning to an e-mail from a friend that gives me the following two answers:
1 - With regards to how the collections in the Concurrent namespace work, most of them are designed to allow additions and subtractions from the collection without blocking and are thread-safe even when in the process of enumerating the collection items.
With a "regular" collection, getting an enumerator (via GetEnumerator) sets a "version" value that is changed by any operation that affects the collection items (such as Add, Remove or Clear). The IEnumerator implementation will compare the version set when it was created against the current version of the collection. If different, an exception is thrown and enumeration ceases.
The Concurrent collections are designed using segments that make it very easy to support multi-threading. But, in the case of enumerating, they actually create a snapshot copy of the collection at the time GetEnumerator is called, and the enumerator works against this copy. That allows changes to be made to the collection without adverse effects on the enumerator. Of course this means that the enumeration will know nothing of these changes, but it sounds like your use case allows this.
2 - As far as the specific scenario you are describing, I don't believe that a Concurrent collection is needed. You can wrap a standard collection using a ReaderWriterLock and apply the same logic as the Concurrent collections when you need to enumerate.
Here's what I suggest:
public class RecipientCollection
{
    private Collection<Recipient> _recipients = new Collection<Recipient>();
    private ReaderWriterLock _lock = new ReaderWriterLock();

    public void Add(Recipient r)
    {
        _lock.AcquireWriterLock(Timeout.Infinite);
        try
        {
            _recipients.Add(r);
        }
        finally
        {
            _lock.ReleaseWriterLock();
        }
    }

    public void Remove(Recipient r)
    {
        _lock.AcquireWriterLock(Timeout.Infinite);
        try
        {
            _recipients.Remove(r);
        }
        finally
        {
            _lock.ReleaseWriterLock();
        }
    }

    public IEnumerable<Recipient> ToEnumerable()
    {
        _lock.AcquireReaderLock(Timeout.Infinite);
        try
        {
            var list = _recipients.ToArray();
            return list;
        }
        finally
        {
            _lock.ReleaseReaderLock();
        }
    }
}
The ReaderWriterLock ensures that operations are only blocked if another operation that changes the collection's contents is in progress. As soon as that operation completes, the lock is released and the next operation can proceed.
Your alert engine would use the ToEnumerable() method to obtain a snapshot copy of the collection at that time and enumerate the copy.
Depending on how often an alert is sent and how often the collection changes, this could be an issue, but you might be able to implement some kind of version property that changes when an item is added or removed; the alert engine can check this property to see whether it needs to call ToEnumerable() again to get the latest version. Or encapsulate this by caching the array inside the RecipientCollection class and invalidating the cache when an item is added or removed.
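For example, the cached-snapshot variant could be sketched like this, built on the RecipientCollection above; the field name and the null-means-stale convention are my own, and Remove would invalidate in the same way as Add.

private Recipient[] _snapshot;   // cached copy handed out by ToEnumerable()

public void Add(Recipient r)
{
    _lock.AcquireWriterLock(Timeout.Infinite);
    try
    {
        _recipients.Add(r);
        _snapshot = null;        // invalidate the cached copy (do the same in Remove)
    }
    finally
    {
        _lock.ReleaseWriterLock();
    }
}

public IEnumerable<Recipient> ToEnumerable()
{
    _lock.AcquireReaderLock(Timeout.Infinite);
    try
    {
        if (_snapshot == null)
        {
            // Concurrent readers may rebuild the same snapshot redundantly; that is harmless here.
            _snapshot = _recipients.ToArray();
        }
        return _snapshot;
    }
    finally
    {
        _lock.ReleaseReaderLock();
    }
}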
HTH
There is much more to an implementation like this than just the parallel processing aspects, durability probably being paramount among them. Have you considered building this using an existing PubSub technology like say... Azure Topics or NServiceBus?
Your requirements strike me as a good fit for the way standard .NET events are triggered in C#. I don't know offhand whether the VB syntax gets compiled to similar code or not. The standard pattern looks something like:
public event EventHandler Triggered;
protected void OnTriggered()
{
    // capture the list so that you don't see changes while the
    // event is being dispatched.
    EventHandler h = Triggered;
    if (h != null)
        h(this, EventArgs.Empty);
}
Alternatively, you could use an immutable list class to store the recipients. Then when the alert is sent, it will first take the current list and use it as a "snapshot" that cannot be modified by adding and removing while you are sending the alert. For example:
class Alerter
{
    private ImmutableList<Recipient> recipients;

    public void Add(Recipient recipient)
    {
        recipients = recipients.Add(recipient);
    }

    public void Remove(Recipient recipient)
    {
        recipients = recipients.Remove(recipient);
    }

    public void SendAlert()
    {
        // make a local reference to the current list so
        // you are not affected by any calls to Add/Remove
        var current = recipients;
        foreach (var r in current)
        {
            // send alert to r
        }
    }
}
You will have to find an implementation of an ImmutableList, but you should be able to find several without too much work. In the SendAlert method as I wrote it, I probably didn't need to make an explicit local to avoid problems as the foreach loop would have done that itself, but I think the copy makes the intention clearer.
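If a ready-made implementation helps, the System.Collections.Immutable NuGet package ships an ImmutableList<T>, and ImmutableInterlocked makes the read-modify-write swap atomic, so two concurrent Add or Remove calls cannot overwrite each other's assignment. A minimal sketch of that variant (my adaptation, not the answer's original code):

using System.Collections.Immutable;

class Alerter
{
    private ImmutableList<Recipient> recipients = ImmutableList<Recipient>.Empty;

    public void Add(Recipient recipient)
    {
        // Atomically swap in a new list that contains the recipient.
        ImmutableInterlocked.Update(ref recipients, list => list.Add(recipient));
    }

    public void Remove(Recipient recipient)
    {
        ImmutableInterlocked.Update(ref recipients, list => list.Remove(recipient));
    }

    public void SendAlert()
    {
        // The local copy is a snapshot; later Add/Remove calls produce new lists
        // and never mutate the one being enumerated.
        ImmutableList<Recipient> current = recipients;
        foreach (Recipient r in current)
        {
            // send alert to r
        }
    }
}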

Reliable and efficient way to handle Azure Table Batch updates

I have an IEnumerable of entities that I'd like to add to Azure Table storage in the most efficient way possible. Since every batch write has to be directed to the same PartitionKey, and there is a limit of 100 rows per write...
Does anyone want to take a crack at implementing this the "right" way as referenced in the TODO section? I'm not sure why MSFT didn't finish the task here...
Also, I'm not sure whether error handling will complicate this, or what the correct way to implement it would be. Here is the code from the Microsoft patterns & practices team's Windows Azure "Tailspin Toys" demo:
public void Add(IEnumerable<T> objs)
{
    // TODO: Optimize: The Add method that takes an IEnumerable parameter should check the number of items
    // in the batch and the size of the payload before calling the SaveChanges method with the
    // SaveChangesOptions.Batch option. For more information about batches and Windows Azure table storage,
    // see the section, "Transactions in aExpense," in Chapter 5, "Phase 2: Automating Deployment and Using
    // Windows Azure Storage," of the book, Windows Azure Architecture Guide, Part 1: Moving Applications
    // to the Cloud, available at http://msdn.microsoft.com/en-us/library/ff728592.aspx.
    TableServiceContext context = this.CreateContext();

    foreach (var obj in objs)
    {
        context.AddObject(this.tableName, obj);
    }

    var saveChangesOptions = SaveChangesOptions.None;
    if (objs.Distinct(new PartitionKeyComparer()).Count() == 1)
    {
        saveChangesOptions = SaveChangesOptions.Batch;
    }

    context.SaveChanges(saveChangesOptions);
}

private class PartitionKeyComparer : IEqualityComparer<TableServiceEntity>
{
    public bool Equals(TableServiceEntity x, TableServiceEntity y)
    {
        return string.Compare(x.PartitionKey, y.PartitionKey, true, System.Globalization.CultureInfo.InvariantCulture) == 0;
    }

    public int GetHashCode(TableServiceEntity obj)
    {
        return obj.PartitionKey.GetHashCode();
    }
}
Well, we (the patterns & practices team) just optimized for showing other things we considered useful. The code above is not really a "general purpose library", but rather a specific method for the sample that uses it.
At the time we thought that adding that extra error handling would not add much, and we decided to keep it simple, but... we might have been wrong.
Anyway, if you follow the link in the //TODO:, you will find another section of a previous guide we wrote that talks a little bit more about error handling in "complex" storage transactions (not in the "ACID" sense though, as transactions "a la DTC" are not supported in Windows Azure Storage).
Link is this: http://msdn.microsoft.com/en-us/library/ff803365.aspx
The limitations are listed in more detail there:
Only one instance of the entity should be present in the batch
Max 100 entities or 4 MB payload
Same PartitionKey (which is being handled in the code: notice that "batch" is only specified if there's a single Partition key)
etc.
Adding some extra error handling should not overcomplicate things too much, but it depends on the type of app you are building on top of this and on whether you prefer to handle this higher or lower in your app stack. In our example, the app would never expect more than 100 entities anyway, so it would simply bubble the exception up if that situation happened (because it would be truly exceptional). The same goes for the total size. The use cases implemented in the app make it impossible to have the same entity twice in the same collection, so again, that should never happen (and if it did, it would simply throw).
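For completeness, here is a rough, untested sketch of the chunking that the //TODO describes: group the entities by PartitionKey and save each group in batches of at most 100. The method name is mine; it reuses CreateContext and tableName from the sample, assumes T derives from TableServiceEntity (so PartitionKey is available) and that System.Linq is imported, and still leaves out the 4 MB payload check.

public void AddInBatches(IEnumerable<T> objs)
{
    foreach (var partition in objs.GroupBy(o => o.PartitionKey))
    {
        // Split each partition group into chunks of at most 100 entities.
        foreach (var chunk in partition
            .Select((entity, index) => new { entity, index })
            .GroupBy(x => x.index / 100, x => x.entity))
        {
            TableServiceContext context = this.CreateContext();
            foreach (var obj in chunk)
            {
                context.AddObject(this.tableName, obj);
            }

            // Everything in this chunk shares a PartitionKey, so a batch save is valid.
            context.SaveChanges(SaveChangesOptions.Batch);
        }
    }
}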
All "entity group transactions" limitations are documented here: http://msdn.microsoft.com/en-us/library/dd894038.aspx
Let us know how it goes! I'm also interested to know if other pieces of the guide were useful for you.
