I have a cluster in Databricks. Before importing the data, I want to choose between Python and Scala: which one is better for reading/writing large data from the source?
For the DataFrame API, the performance should be the same. For the RDD API, Scala is going to be faster.
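As a minimal sketch of what that read/write path looks like (assuming a Databricks notebook where the spark session is already provided; the paths and formats below are placeholders):

    import org.apache.spark.sql.SparkSession

    // In a Databricks notebook `spark` already exists; this line is only needed
    // in a standalone application.
    val spark = SparkSession.builder().getOrCreate()

    // Read a large dataset from a placeholder mount point.
    val df = spark.read
      .format("parquet")
      .load("/mnt/source/large_dataset")

    // Write it back out; the heavy lifting runs on the cluster, not in the driver language.
    df.write
      .mode("overwrite")
      .format("parquet")
      .save("/mnt/target/large_dataset")

The PySpark version of the same calls builds an identical logical plan, which is why the DataFrame API performs the same in both languages; the gap only appears when you drop to RDDs and ship custom Python functions to the executors.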
I would choose Scala; my two cents on this subject:
Scala:
supports multiple concurrency primitives
uses the JVM at runtime, which gives it some speed over Python
Python:
does not support true parallel multithreading (the GIL means only one thread runs Python code at a time; real parallelism requires heavyweight process forking)
is interpreted and dynamically typed, which reduces speed
Also I recommend this article: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Which technology is more efficient for analyzing data, Hadoop or Python? And which of the two is faster?
Hadoop deployments today mostly run Spark. If the underlying framework you use to analyse or crunch your data includes Spark, you are good to go with Scala, PySpark, or even R. Using plain Python alone won't give you the benefits of Spark, which speeds up data analysis and the various transformations on Big Data. So whichever language you use, the point is to use Spark.
Scala or PySpark: both expose almost all of Spark's features.
When analyzing data with speed as a criterion, two key factors determine it: how much data you have and where that data is located.
If you have Big Data, consider using Hadoop or Spark to analyze it. This will make things much faster and you will not be limited by load time. If you have only a few gigabytes of data, plain Python may be the better choice, though it can still slow down your machine.
As for where the data is: if your data is on premises, Python is the most straightforward approach. If it is located on a cloud server, then Azure, GCP, and AWS all offer big data tools that make this exploration easier.
So in terms of speed, it really depends on those two constraints. If you have Big Data located in a cloud system, consider using Hadoop to analyze it. If you have only a few gigabytes of data on premises, use Python.
What could be the best way to use Neo4j and Hadoop?
I have to show the output in an admin panel.
My constraints are a large amount of data and query operations.
What I am currently thinking is,
Bring the data into Hadoop, perform ETL operations and write the result back to the system. Turn this into a job and schedule it for repetitive execution. Then use Neo4j on this exported data (see the sketch after the quoted passage below). Is this the right way?
When I searched about this, I found an article:
In the past there were some approaches that used Hadoop to quickly generate Neo4j datastores directly. While this approach is performant, it is also tightly coupled to the store-format of a certain Neo4j version as it has to duplicate the functionality of writing to split-up store-files. With the parallel neo4j-import tool and APIs introduced in Neo4j 2.2, such a solution is no longer needed. The import facilities scale across a large number of CPUs to maximize import performance.
Does this mean that, for large datasets, Neo4j no longer needs Hadoop for data processing?
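Purely as an illustrative sketch of the export step described in the question (written in Spark/Scala only to keep the examples in one language; the table names, column names and paths are hypothetical), the ETL output can be shaped into the CSV layout the parallel import tool expects, using its :ID, :LABEL, :START_ID, :END_ID and :TYPE header conventions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().getOrCreate()

    // Hypothetical output of the ETL step.
    val users       = spark.read.parquet("/etl/output/users")        // columns: userId, name
    val friendships = spark.read.parquet("/etl/output/friendships")  // columns: userId, friendId

    // Node CSV: header names follow the neo4j-import conventions.
    users
      .select(users("userId").as("userId:ID"),
              users("name"),
              lit("User").as(":LABEL"))
      .write.option("header", "true").csv("/export/neo4j/user_nodes")

    // Relationship CSV.
    friendships
      .select(friendships("userId").as(":START_ID"),
              friendships("friendId").as(":END_ID"),
              lit("FRIEND_OF").as(":TYPE"))
      .write.option("header", "true").csv("/export/neo4j/friend_rels")

The resulting CSV files are then fed to the import tool mentioned in the quote, outside of the Hadoop/Spark job itself, which keeps the pipeline decoupled from Neo4j's store format.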
Is Hadoop a proper solution for jobs that are CPU-intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework providing a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of Pi thanks to Hadoop.
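Purely as an illustration of that kind of input-free, CPU-bound job (sketched with Spark and Scala, which a later answer recommends, rather than Hadoop's bundled example, and using a crude Monte Carlo estimate rather than the digit-extraction method Yahoo used):

    import org.apache.spark.sql.SparkSession
    import scala.util.Random

    val spark = SparkSession.builder().appName("MonteCarloPi").getOrCreate()
    val sc = spark.sparkContext

    val samples = 10000000   // tune to taste; there is no input data at all
    val slices  = 100        // number of parallel tasks

    // Each task throws darts at the unit square and counts hits inside the circle.
    val inside = sc.parallelize(1 to samples, slices).map { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * inside / samples}")

The point is only that the work is pure CPU spread over many tasks; the framework never reads a byte of input.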
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that purpose. For instance, it comes with a file system designed to hold huge files. These files are chunked into many blocks spread across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster at your disposal, you may want to take advantage of it.
If that's not the case, and while I can't recommend anything specific since I have no idea what exactly your need is, I think you can find more lightweight frameworks than Hadoop.
Well, a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a scalable language especially designed for this problem - in other words, Scala. Using Scala with Spark is much, much easier and much, much faster than Hadoop.
If you don't have access to a cluster, it can still be worth using Spark anyway so that you can adopt it more easily in future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
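A minimal sketch of that .par suggestion (expensiveScore is just a stand-in for whatever CPU-intensive work you do per record; on Scala 2.13+ the parallel collections live in a separate module):

    // build.sbt (Scala 2.13+): libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.4"
    import scala.collection.parallel.CollectionConverters._

    // Stand-in for a CPU-intensive computation on one record.
    def expensiveScore(record: String): Double =
      (1 to 100000).foldLeft(record.hashCode.toDouble)((acc, i) => math.sqrt(math.abs(acc) + i))

    val records: Vector[String] = (1 to 10000).map(i => "record-" + i).toVector

    // .par runs the map over all cores of the local machine.
    val total = records.par.map(expensiveScore).sum
    println(total)

The same map-over-records shape transfers naturally to Spark later if a cluster becomes available.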
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing problem that we face for data normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming of traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general-purpose business use case. Thus, Hadoop as described in its common form is not targeted at the use case you and we have. But Hadoop does offer the key capability of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a specially built hardware environment, PSIKLOPS, that is powerful for small clusters (1-10 nodes) with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early August 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use for launching multiple concurrent containers of custom Java.
(Even more basic than Difference between Pig and Hive? Why have both?)
I have a data processing pipeline written in several Java map-reduce tasks over Hadoop (my own custom code, derived from Hadoop's Mapper and Reducer). It's a series of basic operations such as join, inverse, sort and group by. My code is involved and not very generic.
What are the pros and cons of continuing this admittedly development-intensive approach vs. migrating everything to Pig/Hive with several UDFs? Which jobs won't I be able to execute? Will I suffer a performance degradation (working with hundreds of TB)? Will I lose the ability to tweak and debug my code during maintenance? Will I be able to keep part of the pipeline as Java map-reduce and use its input/output with my Pig/Hive jobs?
Reference from Twitter: typically a Pig script is 5% of the code of native map/reduce, written in about 5% of the time. However, queries typically take 110-150% of the time a native map/reduce job would have taken to execute. But of course, if there is a routine that is highly performance-sensitive, they still have the option to hand-code the native map/reduce functions directly.
The above reference also talks about pros and cons of Pig over developing applications in MapReduce.
As with any higher-level language or abstraction, Pig/Hive trades some flexibility and performance for developer productivity.
In this paper from 2009 it is stated that Pig runs 1.5 times slower than plain MapReduce. Higher-level tools built on top of Hadoop are expected to be slower than plain MapReduce, but it is also true that getting optimal MapReduce performance requires an advanced user who writes a lot of boilerplate code (e.g. binary comparators).
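To make the boilerplate point concrete, here is a sketch of the kind of raw binary comparator that paper alludes to. It is written in Scala only to keep all examples in one language (it would normally be Java), and IntWritable already ships with such a comparator, so this is illustrative only:

    import org.apache.hadoop.io.{IntWritable, WritableComparator}

    // Compares keys directly on their serialized bytes, so the sort phase never
    // has to deserialize them.
    class RawIntComparator extends WritableComparator(classOf[IntWritable]) {
      override def compare(b1: Array[Byte], s1: Int, l1: Int,
                           b2: Array[Byte], s2: Int, l2: Int): Int = {
        val left  = WritableComparator.readInt(b1, s1)
        val right = WritableComparator.readInt(b2, s2)
        java.lang.Integer.compare(left, right)
      }
    }

    // Wired into a job with something like:
    //   job.setSortComparatorClass(classOf[RawIntComparator])

Multiply that by custom partitioners, grouping comparators and Writable types, and the productivity argument for Pig/Hive becomes clear.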
I find it relevant to mention a new API called Pangool (of which I'm a developer) that aims to replace the plain Hadoop MapReduce API by making a lot of things easier to code and understand (secondary sort, reduce-side joins). Pangool doesn't impose a performance overhead (barely 5% as of its first benchmark) and retains all the flexibility of the original MapReduce API.
After reading this and this paper, I decided I want to implement a distributed volume rendering setup for large datasets on MapReduce as my undergraduate thesis work. Is Hadoop a reasonable choice? Wouldn't it being Java kill some of the performance gains, or make integration with CUDA difficult? Would Phoenix++ be a better tool for the job?
Hadoop also has a C++ API called Hadoop Pipes. Pipes allows you to write Map and Reduce code in C++, and thus interface with any C/C++ libraries you have available. It makes sense that this could enable you to interface with CUDA.
To my understanding, Pipes is only a re-wrapping of the map and reduce code, so all of the network communication and the distributed filesystem are still handled by Java. Hadoop is intended to make parallelization of tasks simple and general, and as such it cannot be the most efficient MapReduce implementation. Your requirements for efficiency versus available programmer time will probably be the deciding factor in choosing between Hadoop and a more efficient, low-level framework.
See the Word Count in Pipes example. There is a real lack of documentation, unfortunately, but having the source available makes things a lot easier.