hello I want to use stanford parser wuth threads but I dont know how to do that with thread pool. I want that all threads will do this:
LexicalizedParser.apply(Object in)
but I dont want to create all the time new object of LexicalizedParser because it will load
lp = new LexicalizedParser("englishPCFG.ser.gz");
and it will take 2 sec for each obj.
what can I do?
thanks!
Guess it's too late but a thread safe version is there: http://nlp.stanford.edu/software/lex-parser.shtml
You can use ThreadLocal.
It allows you to keep one instance of parser per thread. Thus any created instance of parser will never be used from more than one thread.
Usually it shouldn't create more instances than CPUs*cores you have.
For me it is ~4-5 instances (if I disable Hyper Threading on my quadcore).
P.S. Not related to StanfordNLP. Sometimes poor class implementations contain static fields and modify them in non-thread safe way. General safe parallelization approach for such implementations would be:
move computation part into separate process;
launch (CPUs*cores) number of processes with computations.
use IPC technic for communicating between main/background processes.
Related
I'm familiarizing myself with NSPersistentContainer. I wonder if it's better to spawn an instance of the private context with newBackgroundContext every time I need to insert/fetch some entities in the background or create one private context, keep it and use for all background tasks through the lifetime of the app.
The documentation also offers convenience method performBackgroundTask. Just trying to figure out the best practice here.
I generally recommend one of two approaches. (There are other setups that work, but these are two that I have used, and tested and would recommend.)
The Simple Way
You read from the viewContext and you write to the viewContext and only use the main thread. This is the simplest approach and avoid a lot of the multithread issues that are common with core-data. The problem is that the disk access is happening on the main thread and if you are doing a lot of it it could slow down your app.
This approach is suitable for small lightweight application. Any app that has less than a few thousand total entities and no bulk changes at once would be a good candidate for this. A simple todo list, would be a good example.
The Complex Way
The complex way is to only read from the viewContext on the main thread and do all your writing using performBackgroundTask inside a serial queue. Every block inside the performBackgroundTask refetches any managedObjects that it needs (using objectIds) and all managedObjects that it creates are discarded at the end of the block. Each performBackgroundTask is transactional and saveContext is called at end of the block. A fuller description can be found here: NSPersistentContainer concurrency for saving to core data
This is a robust and functional core-data setup that can manage data at any reasonable scale.
The problem is that you much always make sure that the managedObjects are from the context you expect and are accessed on the correct thread. You also need a serial queue to make sure you don't get write conflicts. And you often need to use fetchedResultsController to make sure entities are not deleted while you are holding pointers to them.
We have a piece of code with a static field val format = DateTimeFormatter.forPattern("yyyy-MM-dd")
Now this instance of formatter will be used by concurrent threads to parse and print date format.parseDateTime("2013-09-24") and format.print(instant).
I learnt that in Scala you can write your code without caring for concurrency, provided that you only use immutable fields, but what about the performance ? Can it become a bottleneck if several threads use the same instance ?
Thanks,
Your question is more related to Java. If the implementation of the forPattern method is thread safe you can share it between many threads without any bottleneck.
Check the javadoc to see if the implementation is thread safe. In your specific case, I will assume that you are using the JodaTime library :
extract from DateTime Javadoc :
DateTimeFormat is thread-safe and immutable, and the formatters it returns are as well.
Has a counter example see SimpleDateFormat javadoc :
Date formats are not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
Using a val just mean that the variable reference will not change after his declaration. see What is the difference between a var and val definition in Scala?
Well looks like you have mis-interpreted it: Concurrency in Scala is very much like Java (or any other language) and nothing much special. Just that it provides alternative libraries and syntactic sugar to get them done with much lesser boiler plate and more importantly do get them done safely (ex: akka).
But the other principles like: dependency on number of cores, thread-pool sizes, context switches etc etc will all have to be handled and taken care of.
Now for the question if a immutable val accessed by multiple threads degrades performance: I dont think there should be any over-head nor do I have data to support. But I think the performance might be good as the processor can cache it and the same object can retrieved faster in another core.
I need to do some network bound calls (e.g., fetch a website) and I don't want it to block the UI. Should I be using NSThread's or python's threading module if I am working in pyobjc? I can't find any information on how to choose one over the other. Note, I don't really care about Python's GIL since my tasks are not CPU bound at all.
It will make no difference, you will gain the same behavior with slightly different interfaces. Use whichever fits best into your system.
Learn to love the run loop. Use Cocoa's URL-loading system (or, if you need plain sockets, NSFileHandle) and let it call you when the response (or failure) comes back. Then you don't have to deal with threads at all (the URL-loading system will use a thread for you).
Pretty much the only time to create your own threads in Cocoa is when you have a large task (>0.1 sec) that you can't break up.
(Someone might say NSOperation, but NSOperationQueue is broken and RAOperationQueue doesn't support concurrent operations. Fine if you already have a bunch of NSOperationQueue code or really want to prepare for working NSOperationQueue, but if you need concurrency now, run loop or threads.)
I'm more fond of the native python threading solution since I could join and reference threads around. AFAIK, NSThreads don't support thread joining and cancelling, and you could get a variety of things done with python threads.
Also, it's a bummer that NSThreads can't have multiple arguments, and though there are workarounds for this (like using NSDictionarys and NSArrays), it's still not as elegant and as simple as invoking a thread with arguments laid out in order / corresponding parameters.
But yeah, if the situation demands you to use NSThreads, there shouldn't be any problem at all. Otherwise, it's cool to stick with native python threads.
I have a different suggestion, mainly because python threading is just plain awful because of the GIL (Global Interpreter Lock), especially when you have more than one cpu core. There is a video presentation that goes into this in excruciating detail, but I cannot find the video right now - it was done by a Google employee.
Anyway, you may want to think about using the subprocess module instead of threading (have a helper program that you can execute, or use another binary on the system. Or use NSThread, it should give you more performance than what you can get with CPython threads.
I normally work on single threaded applications and have generally never really bothered with dealing with threads. My understanding of how things work - which certainly, may be wrong - is that as long as we're always dealing with single threaded code (i.e. no forks or anything like that) it will always be executed in the same thread.
Is this assumption correct? I have a fuzzy idea that UI libraries/frameworks may spawn off threads of their own to handle GUI stuff (which accounts for the fact that the Windows task manager tells me that my 'single threaded' application is actually running on 10 threads) but I'm guessing that this shouldn't affect me?
How does this apply to COM? For instance, if I were to create an instance of a COM component in my code; and that COM component writes some information to a thread-based location (using System.Threading.Thread.GetData for instance) will my application be able to get hold of that information?
So in summary:
In single threaded code, can I be sure that whatever I store in a thread-based location can be retrievable from anywhere else in the code?
If that single threaded code were to create an instance of a COM component which stores some information in a thread-based location, can that be similarly retrievable from anywhere else?
UI usually has the opposite constraint (sadly): it's single threaded and everything must happen on that thread.
The easiest way to check if you are always in the same thread (for, say, a function) is to have an integer variable set at -1, and have a check function like (say you are in C#):
void AssertSingleThread()
{
if (m_ThreadId < 0) m_ThreadId = Thread.CurrentThread.ManagedThreadId;
Debug.Assert(m_ThreadId == Thread.CurrentThread.ManagedThreadId);
}
That said:
I don't understand the question #1, really. Why store in a thread-based location if your purpose is to have a global scope ?
About the second question, most COM code runs on a single thread and, most often, on the thread where your UI message processing lives - this is because most COM code is designed to be compatible with VB6 which is single-thread.
The reason your program has about 10 threads is because both Windows (if you use some of its features like completion ports, or some kind of timers) and the CLR (for example for the GC or, again, some types of timers) may create threads in your process space (technically any program with enough priviledges, can too).
Think about having the model of having a single dataStore class running in your mainThread that all threads can read and write their instance variables to. This will avoid a lot of problems that might arise accessing threads all over the shop.
Simple idea, until you reach the fun part of threading. Concurrency and synchronization; simply, if you have two threads that want to read and write to the same variable inside dataStore at the same time, you have a problem.
Java handles this by allowing you to declare a variable or method synchronized, allowing only one thread access at a time.
I believe some .NET objects have Lock and Synchronized methods defined on them, but I know no more than this.
I am working on a cocoa software and in order to keep the GUI responsive during a massive data import (Core Data) I need to run the import outside the main thread.
Is it safe to access those objects even if I created them in the main thread without using locks if I don't explicitly access those objects while the thread is running.
With Core Data, you should have a separate managed object context to use for your import thread, connected to the same coordinator and persistent store. You cannot simply throw objects created in a context used by the main thread into another thread and expect them to work. Furthermore, you cannot do your own locking for this; you must at minimum lock the managed object context the objects are in, as appropriate. But if those objects are bound to by your views a controls, there are no "hooks" that you can add that locking of the context to.
There's no free lunch.
Ben Trumbull explains some of the reasons why you need to use a separate context, and why "just reading" isn't as simple or as safe as you might think, in this great post from late 2004 on the webobjects-dev list. (The whole thread is great.) He's discussing the Enterprise Objects Framework and WebObjects, but his advice is fully applicable to Core Data as well. Just replace "EC" with "NSManagedObjectContext" and "EOF" with "Core Data" in the meat of his message.
The solution to the problem of sharing data between threads in Core Data, like the Enterprise Objects Framework before it, is "don't." If you've thought about it further and you really, honestly do have to share data between threads, then the solution is to keep independent object graphs in thread-isolated contexts, and use the information in the save notification from one context to tell the other context what to re-fetch. -[NSManagedObjectContext refreshObject:mergeChanges:] is specifically designed to support this use.
I believe that this is not safe to do with NSManagedObjects (or subclasses) that are managed by a CoreData NSManagedObjectContext. In general, CoreData may do many tricky things with the sate of managed objects, including firing faults related to those objects in separate threads. In particular, [NSManagedObject initWithEntity:insertIntoManagedObjectContext:] (the designated initializer for NSManagedObjects as of OS X 10.5), does not guarantee that the returned object is safe to pass to an other thread.
Using CoreData with multiple threads is well documented on Apple's dev site.
The whole point of using locks is to ensure that two threads don't try to access the same resource. If you can guarantee that through some other mechanism, go for it.
Even if it's safe, but it's not the best practice to use shared data between threads without synchronizing the access to those fields. It doesn't matter which thread created the object, but if more than one line of execution (thread/process) is accessing the object at the same time, since it can lead to data inconsistency.
If you're absolutely sure that only one thread will ever access this object, than it'd be safe to not synchronize the access. Even then, I'd rather put synchronization in my code now than wait till later when a change in the application puts a second thread sharing the same data without concern about synchronizing access.
Yes, it's safe. A pretty common pattern is to create an object, then add it to a queue or some other collection. A second "consumer" thread takes items from the queue and does something with them. Here, you'd need to synchronize the queue but not the objects that are added to the queue.
It's NOT a good idea to just synchronize everything and hope for the best. You will need to think very carefully about your design and exactly which threads can act upon your objects.
Two things to consider are:
You must be able to guarantee that the object is fully created and initialised before it is made available to other threads.
There must be some mechanism by which the main (GUI) thread detects that the data has been loaded and all is well. To be thread safe this will inevitably involve locking of some kind.
Yes you can do it, it will be safe
...
until the second programmer comes around and does not understand the same assumptions you have made. That second (or 3rd, 4th, 5th, ...) programmer is likely to start using the object in a non safe way (in the creator thread). The problems caused could be very subtle and difficult to track down. For that reason alone, and because its so tempting to use this object in multiple threads, I would make the object thread safe.
To clarify, (thanks to those who left comments):
By "thread safe" I mean programatically devising a scheme to avoid threading issues. I don't necessarily mean devise a locking scheme around your object. You could find a way in your language to make it illegal (or very hard) to use the object in the creator thread. For example, limiting the scope, in the creator thread, to the block of code that creates the object. Once created, pass the object over to the user thread, making sure that the creator thread no longer has a reference to it.
For example, in C++
void CreateObject()
{
Object* sharedObj = new Object();
PassObjectToUsingThread( sharedObj); // this function would be system dependent
}
Then in your creating thread, you no longer have access to the object after its creation, responsibility is passed to the using thread.