Whenever I'm CPU bound (running on an SSD), I see Mongo using only one CPU on my machine, and I have 8 cores. Is it possible for Mongo to utilize all of them? Preferably in Ruby; if not, I can convert over easily.
With the current JavaScript engine in MongoDB 2.2 (Mozilla's SpiderMonkey), only one thread in the mongod process executes JavaScript at a time, so JS operations, including map/reduce and aggregations, will be locked to a single thread. You can perform concurrent map/reduce by plugging in the Hadoop adapter. I/O operations that do not use JavaScript can run concurrently, subject to the locking rules introduced in v2.2, so parallelism can be achieved in a limited fashion. If you are running mongos (sharding your data) you can achieve somewhat better concurrency, but in general a single mongod process is limited to a single thread.
Related
I already have a streaming pipeline written in Apache Beam. Earlier I was running it on Google Dataflow and it used to run like a charm. Now, due to changing business needs, I need to run it using the Flink runner.
The Beam version currently in use is 2.38.0 and the Flink version is 1.14.5. I validated this and found it is a supported and valid combination of versions.
The pipeline is written with the Apache Beam SDK and uses multiple ParDos and PTransforms. It is somewhat complicated in nature, as it involves a lot of interim operations (catered for by these ParDos and PTransforms) between source and sink. The source in my case is an Azure Service Bus topic, which I am reading using JmsTopicIO reads. Up to here all works fine, i.e. the stream of data enters the pipeline and is processed normally. The problem occurs when load testing is performed: I see many operators going into backpressure and eventually failing to read and process messages from the topic, even though CPU and memory usage of the Job and Task Managers remain well under control.
Actual issue/question: while troubleshooting these performance issues, I observed that Flink is chaining and grouping these ParDos and PTransforms into operators by itself. With my implementation I see that many heavy processing tasks are getting combined into the same operator, which slows down all such operators. Also, the parallelism I have set (20 for now) is at the pipeline level, which means each operator runs with a parallelism of 20.
flinkOptions.setParallelism(20);
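For context, a minimal sketch of how that pipeline-wide setting is typically assembled. FlinkPipelineOptions, FlinkRunner, and setParallelism are from the Beam Flink runner; the surrounding class is illustrative only:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PipelineSetup {
    public static void main(String[] args) {
        // Pipeline-wide options: this parallelism applies to every operator.
        FlinkPipelineOptions options =
                PipelineOptionsFactory.as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class);
        options.setParallelism(20);

        Pipeline pipeline = Pipeline.create(options);
        // ... apply ParDos / PTransforms here, then run the pipeline ...
    }
}
```

Some Beam releases also expose an operator-chaining flag on FlinkPipelineOptions; whether it exists depends on your Beam version, so check the FlinkPipelineOptions Javadoc for your release before relying on it.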
Question 1. Using the Apache Beam SDK or any Flink configuration, is there a way to control or manage how these ParDos/PTransforms are chained and grouped into operators (through code or config)? That way I could distribute the load uniformly myself.
Question 2. With an Apache Beam implementation, how can I set the parallelism of each individual operator (rather than the complete pipeline) based on the load on it? This way I would be able to allocate resources better to heavy computing operators (sets of tasks).
Please suggest answers to the above questions, and also any other pointers I can work on for Flink performance improvements in my deployment. Just for reference, please note my pipeline.
Currently utilising the Google Dataflow with Python for batch processing. This works fine, however, I'm interested in getting a bit more speed out of my Dataflow Jobs without having to deal with Java.
Using the Go SDK, I've implemented a simple pipeline that reads a series of 100–500 MB files from Google Storage (using textio.Read), does some aggregation, and updates CloudSQL with the results. The number of files being read can range from dozens to hundreds.
When I run the pipeline, I can see from the logs that files are being read serially instead of in parallel, and as a result the job takes much longer. The same process executed with the Python SDK triggers autoscaling and runs multiple reads within minutes.
I've tried specifying the number of workers using --num_workers=; however, Dataflow scales the job down to one instance after a few minutes, and from the logs no parallel reads took place while the instance was running.
Something similar happens if I remove the textio.Read and implement a custom DoFn for reading from GCS. The read process is still run serially.
I'm aware the current Go SDK is experimental and lacks many features; however, I haven't found a direct reference to limitations on parallel processing here. Does the current incarnation of the Go SDK support parallel processing on Dataflow?
Thanks in advance
Managed to find an answer for this after actually creating my own IO package for the Go SDK.
SplittableDoFns are not yet available in the Go SDK. This key bit of functionality is what allows the Python and Java SDKs to perform IO operations in parallel and thus much faster than the Go SDK at scale.
Now (Go 1.16) it's built in:
https://pkg.go.dev/google.golang.org/api/dataflow/v1b3
I'm working on a Ruby script that will be making hundreds of network requests (via open-uri) to various APIs and I'd like to do this in parallel since each request is slow, and blocking.
I have been looking at using Thread or Process to achieve this but I'm not sure which method to use.
With regard to network requests, when should I use a Thread over a Process, or does it not matter?
Before going into detail, there is already a library solving your problem. Typhoeus is optimized to run a large number of HTTP requests in parallel and is based on the libcurl library.
Like a modern code version of the mythical beast with 100 serpent heads, Typhoeus runs HTTP requests in parallel while cleanly encapsulating handling logic.
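A minimal sketch of parallel requests with Typhoeus's Hydra interface (Hydra.new, #queue, and #run are the gem's documented API; the URLs are placeholders):

```ruby
require "typhoeus"

# Queue many GET requests on one Hydra; max_concurrency caps how many
# connections run at the same time.
hydra = Typhoeus::Hydra.new(max_concurrency: 20)
urls  = ["https://example.com/a", "https://example.com/b"]  # placeholders

requests = urls.map do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  hydra.queue(request)
  request
end

hydra.run  # blocks until every queued request has finished

requests.each do |request|
  puts "#{request.url} -> #{request.response.code}"
end
```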
Threads run in the same process as your application. Since Ruby 1.9, native threads are used as the underlying implementation. Resources can easily be shared across threads, as they all have access to the mutual state of the application. The problem, however, is that you cannot utilize the multiple cores of your CPU with most Ruby implementations.
MRI uses a Global Interpreter Lock (GIL), a locking mechanism that ensures the mutual state is not corrupted by parallel modifications from different threads. Other Ruby implementations like JRuby, Rubinius or MacRuby offer approaches without a GIL.
Processes run separately from each other. Processes do not share resources, which means every process has its own state. This can be a problem if you want to share data across your requests. A process also allocates its own stack of memory. You could still share data by using a messaging bus like RabbitMQ.
I cannot recommend using either only threads or only processes. If you want to implement this yourself, you should use both: fork a new process for every n requests, and have each process spawn a number of threads that issue the HTTP requests. Why?
If you fork another process for every HTTP request, you end up with too many processes. Although your operating system might be able to handle this, the overhead is still tremendous. Some HTTP requests finish very fast, so why bother with an extra process when you can just run them in another thread?
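A stdlib-only sketch of that hybrid, assuming the counts divide evenly; `fetch` and the URLs are placeholders for the real HTTP call (e.g. via open-uri):

```ruby
# Hybrid approach: a few forked worker processes, each running a small
# group of threads. `fetch` stands in for the real HTTP request.
urls = (1..40).map { |i| "https://api.example.com/item/#{i}" }  # hypothetical

def fetch(url)
  # In real code this would be something like URI.open(url).read
  "fetched #{url}"
end

process_count       = 4  # number of forked workers
threads_per_process = 5  # threads inside each worker

urls.each_slice(urls.size / process_count) do |slice|
  fork do
    slice.each_slice(slice.size / threads_per_process)
         .map { |chunk| Thread.new { chunk.each { |url| fetch(url) } } }
         .each(&:join)
  end
end

Process.waitall  # reap every forked worker before exiting
```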
As I understand it, Ruby 1.9 uses OS threads, but only one thread will actually be running at a time (though one thread may be doing blocking IO while another thread is doing processing). The threading examples I've seen just use Thread.new to launch a new thread. Coming from a Java background, I typically use thread pools so as not to launch too many new threads, since they are "heavyweight."
Is there a thread pool construct built into Ruby? I didn't see one in the default language libraries. Or is there a standard gem that is typically used? Since OS-level threading is a newer feature of Ruby, I don't know how mature the libraries for it are.
You are correct that the default C Ruby interpreter only executes one thread at a time (other C-based dynamic languages such as Python have similar restrictions). Because of this restriction, threading is not all that common in Ruby, and as a result there is no default thread pool library. When there are tasks to be done in parallel, people typically use processes, since processes can scale over multiple servers.
If you do need to use threads, I would recommend you use https://github.com/meh/ruby-threadpool on the JRuby platform, which is a Ruby interpreter running on the JVM. That should be right up your alley, and because it is running on the virtual machine it will have true threading.
The accepted answer is correct, but there are many tasks for which threads are fine; after all, there are reasons they exist. Even though only one thread can run at a time, it can still be considered parallel in many real-life situations.
For example, when we have 100 long-running tasks that each take approximately 10 minutes to complete, using threads in Ruby, even with all those restrictions, with a thread pool of 10 tasks at a time will finish much faster than the 100 * 10 minutes of running without threads. Examples include live capturing of file changes and sending large numbers of web requests (such as status checks).
You can understand how pooling works by reading https://blog.codeship.com/understanding-fundamental-ruby-abstraction-concurrency/. In production code, use https://github.com/meh/ruby-thread#pool.
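For illustration, a minimal pool can be built on the standard library's Queue (SimplePool is a made-up name; the linked gem is more robust). Workers block on `pop` until a job arrives, and `nil` serves as the shutdown signal:

```ruby
# A minimal fixed-size thread pool built on the thread-safe Queue.
class SimplePool
  def initialize(size)
    @queue   = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Loop until the nil shutdown signal is popped.
        while (job = @queue.pop)
          job.call
        end
      end
    end
  end

  def schedule(&block)
    @queue << block
  end

  def shutdown
    @workers.size.times { @queue << nil }  # one stop signal per worker
    @workers.each(&:join)
  end
end

# Usage: run 100 jobs on 10 threads.
results = Queue.new
pool = SimplePool.new(10)
100.times { |i| pool.schedule { results << i * 2 } }
pool.shutdown
```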
How would I determine the current server load? Do I need to use JMX here to get the cpu time, or is there another way to determine that or something similar?
I basically want to have background jobs run only when the server is idle. I will use Quartz to fire the job every 30 minutes, check the server load then proceed if it is low or halt if it is busy.
Once I can determine how to measure the load (cpu time, memory usage), I can measure these at various points to determine how I want to configure the server.
Walter
This is tricky to do in a portable way; it would likely depend considerably on your platform.
An alternative is to configure your Quartz jobs to run in low-priority threads. Quartz allows you to configure the thread factory, and if the server is busy, then the thread should be shuffled to the back of the pack until it can be run without getting in the way.
Also, if the load spikes in the middle of the job, the VM will automatically throttle your batch job until the load drops again. It should be self-regulating, which you wouldn't get with manual introspection of the current load.
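If you go the low-priority route, Quartz's SimpleThreadPool lets you set the worker thread priority in quartz.properties (the threadCount value here is just an example):

```properties
# Run Quartz worker threads at minimum priority (Thread.MIN_PRIORITY = 1)
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 3
org.quartz.threadPool.threadPriority = 1
```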
I think you've answered your own question. If you want a pure Java solution, then the best that you can do is the information returned by the ThreadMXBean.
You can find out how many threads there are, how many processors the host machine has and how much time has been used by each thread, and calculate CPU load from that.
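A sketch of that approach with the platform MXBeans. The 0.7 threshold is an arbitrary example, not a recommendation, and note that getSystemLoadAverage() returns -1 on platforms where it is unavailable:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Read coarse load figures from the platform MXBeans and decide whether
// the server is "idle enough" to start a background job.
public class LoadCheck {
    // Arbitrary example threshold: average load per core.
    static final double MAX_LOAD_PER_CORE = 0.7;

    static boolean idleEnough() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // -1.0 if unavailable
        int cores = os.getAvailableProcessors();
        if (load < 0) {
            return true; // load average unsupported on this platform; default to running
        }
        return load / cores < MAX_LOAD_PER_CORE;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads:  " + threads.getThreadCount());
        System.out.println("cpu time (ns): "
                + threads.getThreadCpuTime(Thread.currentThread().getId()));
        System.out.println("idle enough:   " + idleEnough());
    }
}
```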