We are using the Tika 1.9 library to extract content: the system processes incoming data and extracts its content.
To improve performance, we have 100 threads that extract data with Tika. However, bumping the thread count beyond 100 gives no further performance improvement.
We use the same instance of AutoDetectParser across the threads; could that be a bottleneck?
Also, is there anything in Tika that can be fine-tuned to improve the performance of content extraction?
We also tried the same exercise with Tika 1.15, but again there wasn't any gain in performance.
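In simplified form, each worker does something like the following (just a sketch of the setup described above; the input source, handler, and pool wiring are illustrative, the key points are the shared parser and the per-call handler/metadata):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractionWorker {

        // One parser shared by all threads; Tika parsers are designed to be thread-safe.
        private static final AutoDetectParser PARSER = new AutoDetectParser();

        // Fixed pool of 100 extraction threads.
        private static final ExecutorService POOL = Executors.newFixedThreadPool(100);

        static String extract(Path file) throws Exception {
            // Handler, metadata and context must be fresh for every call.
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(file)) {
                PARSER.parse(in, handler, metadata, new ParseContext());
            }
            return handler.toString();
        }

        public static void main(String[] args) {
            // Incoming files are submitted to the pool, e.g.:
            // POOL.submit(() -> extract(somePath));
        }
    }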
Regards,
Gaurav
I have some questions related to long-running and large batch processing, and I am interested in real experience and numbers.
First: am I right that fault tolerance for long-running tasks is expected to be handled mostly manually via checkpoints? Let long-running tasks here be ones running a day or more, so re-executing them from scratch may be inappropriate.
Second: are there any numbers, benchmarks, or real experience of processing large data sets that don't fit in memory with Ignite? For example, if the available memory is 3, 10, or 100 times smaller than the data set size.
Finally: if pure Ignite doesn't fit such scenarios well, are there any numbers or experience using Ignite as an accelerator for Hadoop/Spark?
Thanks
If it's possible that a node may go down during work, then you should enable native persistence, so that all data written to a cache is periodically written to disk. Here is the documentation on Ignite persistence: https://apacheignite.readme.io/docs/distributed-persistent-store
But you'll have to figure out how to restore your task from the data written to the cache.
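For reference, enabling native persistence looks roughly like this (a minimal sketch assuming Ignite 2.x with the DataStorageConfiguration API; the cache name and checkpoint value are illustrative):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistenceExample {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Enable persistence for the default data region, so cached data
            // is periodically written to disk and survives node restarts.
            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storageCfg);

            Ignite ignite = Ignition.start(cfg);

            // With persistence enabled the cluster starts inactive and must be activated.
            ignite.cluster().active(true);

            // Checkpoint data for a long-running task can then go into an ordinary cache.
            IgniteCache<String, String> checkpoints = ignite.getOrCreateCache("task-checkpoints");
            checkpoints.put("task-1", "last-processed-offset=42");
        }
    }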
I couldn't find any published benchmark results for Ignite itself, only for a product built on top of it, i.e. GridGain: https://www.gridgain.com/resources/benchmarks/gridgain-benchmarks-results
You can configure persistence for Ignite and run the benchmarks yourself. A lot of benchmarks are available in the Ignite repository; you can find them in the yardstick module on GitHub: https://github.com/apache/ignite/tree/master/modules/yardstick/src/main/java/org/apache/ignite/yardstick/cache
Here is the documentation on benchmarking: https://apacheignite.readme.io/docs/perfomance-benchmarking
I need to load petabytes of text data into storage (RAM/SSD) within a second.
Below are some of the questions around this problem:
1) Is it practically/theoretically possible to load petabytes of data in a second?
2) What would be the best design approach to achieve loading of petabyte-scale data in under a second?
3) Is any benchmark approach available?
I am okay with implementing this using any kind of technology, such as Hadoop, Spark, HPCC, etc.
"petabytes .... within a second". seriously? Please check wikipedia Petabyte: it is 1.000.000 GB!
Also check wikipedia Memory bandwidth. Even the fastest RAM cannot handle more than a few 10 GB / s (in practice this is far lower).
Just curious: what is your use-case?
No, it is not technically possible at this time. Not even RAM is fast enough (not to mention the obvious capacity constraints). The fastest SSDs (M.2 drives) give you write speeds of around 1.2 GB/s, and with RAID 0 you might achieve speeds of around 3 GB/s at most. There are also economic constraints, as those drives are by themselves quite expensive. So, to answer your question, those speeds are technically impossible at the current time.
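To put numbers on it: 1 PB is about 1,000,000 GB, so even at a sustained 3 GB/s a single writer would need roughly 1,000,000 / 3 ≈ 333,000 seconds, i.e. close to four days. Getting anywhere near one second would mean spreading the load over hundreds of thousands of such devices writing in parallel.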
From HPCC perspective...
Thor is designed to load data and supports multiple servers. However, the biggest cluster I have heard about is around 4,000 servers. Thor is designed to load a lot of data over a long time (even a week).
Roxie, on the other hand, is designed to serve data quickly, but it is not what you are asking for, nor could it serve petabytes in under a second.
I have been working on aggregation of streaming data, and I found two tools to achieve it: Druid and PipelineDB. I have understood the implementation and architecture of both, but I couldn't figure out a way to benchmark the two. Is there any existing benchmark test that has been done? And if I want to do my own benchmarking, apart from speed and scalability, what other factors do I need to consider? Any ideas, links, and help would be really appreciated. Also, please share your own experience with PipelineDB and Druid.
Thanks
UPD:
After reading the PipelineDB pages, I only wonder why you need to compare such different things.
Druid is quite complex to install and maintain, and it requires several external dependencies (such as ZooKeeper and HDFS/Amazon, which must be maintained too).
For that price you buy the key features of Druid: column-oriented, distributed storage and processing. That also implies horizontal scalability out of the box, and it is completely automatic; you don't even have to think about it.
So if you don't need its distributed nature, I'd say you don't need Druid at all.
FIRST VERSION:
I have no experience with PipelineDB (what is it? Google shows nothing, please share a link), but I have a lot of experience with Druid. So, apart from [query] speed and scalability, I would consider:
- ingestion performance (how many rows per sec/min/hour/... can be inserted? a simple way to measure this is sketched after this list)
- RAM consumption of ingestion (how much RAM does it need to ingest at the target speed?)
- compression level (how much disk space does one hour/day/month/... of data need?)
- fault tolerance (what happens when some of the components fail? is that critical for my business?)
- caching (just keep in mind)
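If you do roll your own benchmark, the ingestion-rate point can be measured with a trivial harness like the one below (a sketch only; Sink.ingestBatch is a placeholder for whatever client call you actually use, e.g. an HTTP post to Druid or an INSERT into PipelineDB over JDBC, not a real library API):

    public class IngestBenchmark {

        // Placeholder for the real client call; supplied by the caller.
        interface Sink {
            void ingestBatch(int rows) throws Exception;
        }

        // Returns the observed ingestion rate in rows per second.
        static double rowsPerSecond(Sink sink, int batches, int batchSize) throws Exception {
            long start = System.nanoTime();
            for (int i = 0; i < batches; i++) {
                sink.ingestBatch(batchSize);
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return (batches * (double) batchSize) / seconds;
        }
    }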
The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point of view, the code is applied to a large dataset of labeled text documents. It goes through 9 iterations of cross-validation to tune some parameters of a multi-class Logistic Regression classifier.
It is expected that this kind of Machine Learning processing will be expensive in terms of time and resources.
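A simplified version of the pipeline looks like this (the real column names, grid values, and fold count differ; this is just to show the shape of the job):

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.classification.LogisticRegression;
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
    import org.apache.spark.ml.feature.HashingTF;
    import org.apache.spark.ml.feature.Tokenizer;
    import org.apache.spark.ml.param.ParamMap;
    import org.apache.spark.ml.tuning.CrossValidator;
    import org.apache.spark.ml.tuning.CrossValidatorModel;
    import org.apache.spark.ml.tuning.ParamGridBuilder;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TextClassificationCV {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("text-cv").getOrCreate();

            // Assumed schema: a string "text" column and a numeric "label" column.
            Dataset<Row> docs = spark.read().parquet("/data/labeled-docs"); // path is illustrative

            Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
            HashingTF tf = new HashingTF().setInputCol("words").setOutputCol("features");
            LogisticRegression lr = new LogisticRegression();
            Pipeline pipeline = new Pipeline()
                    .setStages(new PipelineStage[]{tokenizer, tf, lr});

            // Parameter grid to tune; a 3x3 grid gives 9 candidate models per fold.
            ParamMap[] grid = new ParamGridBuilder()
                    .addGrid(lr.regParam(), new double[]{0.01, 0.1, 1.0})
                    .addGrid(lr.elasticNetParam(), new double[]{0.0, 0.5, 1.0})
                    .build();

            CrossValidator cv = new CrossValidator()
                    .setEstimator(pipeline)
                    .setEvaluator(new MulticlassClassificationEvaluator())
                    .setEstimatorParamMaps(grid)
                    .setNumFolds(3);

            // This fit is the expensive step in question.
            CrossValidatorModel model = cv.fit(docs);
            spark.stop();
        }
    }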
I am now running the code and everything seems to be OK, except that I have no idea whether my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for certain types of processing and computing resources, the processing time should be in the approximate order of...
Is there any method that helps in judging whether my application is running slowly or quickly, or is it purely a matter of experience?
I had the same question, and I didn't find a real answer/tool/way to test how good my performance was just by looking "only inside" my application.
I mean, as far as I know, there's no tool like a speed test, as there is for an internet connection :-)
The only way I found is to rewrite the app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found two main resources very interesting, even if quite old:
1) A sort of 4-point guide to keep in mind when coding:
Understanding the Performance of Spark Applications, Spark Summit 2013
2) A 2-episode article from the Cloudera blog on tuning your jobs as well as possible:
episode1
episode2
Hoping it could help
FF
Your question is pretty generic, so I would also highlight a few generic areas where you can look for performance optimizations:
Scheduling Delays - Are there significant delays in scheduling the tasks? If yes, analyze the reasons (maybe your cluster needs more resources, etc.).
Utilization of Cluster - Are your jobs utilizing the available cluster resources (such as CPU and memory)? If not, again look for the reasons. Maybe creating more partitions helps with faster execution, or maybe significant time is spent in serialization, in which case you can switch to Kryo serialization (see the configuration snippet at the end of this answer).
JVM Tuning - Consider analyzing the GC logs and tune if you find anomalies.
Executor Configuration - Analyze the memory/cores provided to your executors. They should be sufficient to hold the data processed by the task/job.
Your DAG and Driver Configuration - Same as the executors, the driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spent in shuffling and what kind of data locality is used by your tasks.
All of the above are needed for the preliminary investigation, and in some cases they can also improve the performance of your jobs to an extent, but there can be complex issues whose solution depends on the specific case.
Please also see the Spark Tuning Guide.
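As an example for the serialization point above, switching to Kryo is a configuration change along these lines (a sketch; MyRecord is a placeholder for one of your own classes):

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    public class KryoConfigExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("kryo-example")
                    // Use Kryo instead of the default Java serialization.
                    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    // Optional: raise the buffer limit for large records.
                    .set("spark.kryoserializer.buffer.max", "128m");

            // Registering the classes you shuffle most avoids writing full class names.
            conf.registerKryoClasses(new Class<?>[]{MyRecord.class});

            SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
            // ... run the job ...
            spark.stop();
        }

        // Hypothetical record type, stands in for your real data classes.
        public static class MyRecord implements java.io.Serializable {}
    }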
I have a Ruby 1.9 / Rails 3.0.7 application that is using Lucid/Solr to index a large amount of text data (3 GB or so). The data is stored in a MongoDB database and consists mainly of emails.
One issue I'm having is that I'm trying to index the entire data set initially when I set up the application, so I can search it. This is a process that will actually be repeated quite often, so I have to figure out how to index the entire MongoDB database into Solr quickly and efficiently. According to the Solr docs, one of the main ways to expedite the indexing process is to use multiple cores. I ran the indexing on a single-core VM and it took about 1 hour to index the data I have. When I moved it to a 4-core VM and ran it, it also took about 1 hour. I didn't notice any discernible difference between the two.
This leads me to suspect that maybe Ruby 1.9 is NOT capable of using multiple cores properly? I'm using a Linux Ubuntu 10.10 VM.
I've read some posts mentioning that Ruby 1.9 has different multi-core behavior than 1.8, but I admit this is not an area I'm very knowledgeable about.
Does anyone know if Ruby 1.9 is indeed capable of taking advantage of multiple cores for indexing large amounts of data into Solr?
According to this question and this one, it can run on all the cores as long as the thread releases something called the Giant VM Lock.
Since this probably depends on the gems (and thus the C extensions) you're using, I would suggest doing some testing to check that it's actually using all the cores, and if it isn't, maybe move to JRuby, which should use all the cores out of the box.
I know this is not a definitive answer, but I hope it helps you find a solution.