SequenceFile alternative/extension that allows in-place updates - hadoop

I like the convenience of a database where you can update a row in-place. But Hadoop relies on sequence files that are capable of being consumed in parallel.
I liked the idea of HBase where I can rewrite just one row; as well as being input to a map-reduce job. But HBase is not something a newb must mess with, right? What is a good tool/method for this?

I don't think it's very difficult to learn and use HBase.
Coming to your original question. The reason why we use HBase is same as the reason behind using any other DB, i.e random, real-time read/write access, which HDFS lacks like any other FS. And this is true for any filesystem, not just for HDFS. You could take ext4 & MySQL paradigm as an example.
And when you say re-write in HBase it is actually not update. You either put a new version of a cell or delete a cell and put new data at the same location.
And you can't say that Hadoop relies on sequence files to provide you the parallelism. Parallelism is something which is provided by Hadoop by virtue of its nature, i'e a distributed platform. You can handle almost any kind of file using Hadoop with almost euqal parallelism. The only advantage with sequence files is that they are more suitable for MapReduce processing as they are already in key/vale pairs.
You have to take it with a pinch of salt, but frankly speaking Hadoop doesn't understand update. If you could elaborate your use case a bit more, maybe I could suggest something better.

Related

Calling distcp from Spark

can anyone tell me what's the most robust way to copy files from HDFS to S3 in Pyspark ?
I am looking at 2 options:
I. Call distcp directly as in the following:
distcp_arglist =['/usr/lib/hadoop/bin/hadoop','distcp',
...,
'-overwrite',
src_path, dest_path]
II. Using s3-distcp - which seems a bit more involved.
https://gist.github.com/okomestudio/699edbb8e095f07bafcc
Any suggestions are welcome. Thanks.
I'm going to point you at a little bit of my code, cloudcp
This is a basic proof of concept of implementing distCp in spark
individual files are scheduled via the spark scheduler; not ideal for 0-byte files, but stops the job being held up by a large file off one node
does do locality via a special RDD which works out the location of every row (i.e file) differently (which has to be in the org.apache.spark package for scoped access)
shows how to do FS operations within a spark map
shuffles the input for a bit of randomness
collects results within an RDD
Doesn't do:
* incremental writes (you can't compare checksums between HDFS and S3 anyway, but it could do a check for fs.exists(path) before the copy.
* permissions. S3 doesn't have them
* throttling
* scheduling of the big files first. You ought to.
* recovery of job failure (no incremental, see)
Like I said, PoC to say "we be more agile by using spark for the heavy lifting"
Anyway, take it and play, you can rework it to operate within an existing spark context with ease, as long as you don't mind a bit of scala coding.
Distcp would probably be the way to go as it is well-proven solution for transfering data between the clusters. I guess any possible alternatives would do something similar - create mapreduce jobs for transfering the data. Important point here is how to tune this process for your particular data as it could really depend on many factors like networking or map-reduce settings. I recommend you to read HortonWorks article about how you can tune this process

converting unstructured data to structured data using Hadoop

I want to convert an unstructured data to structured data for easy data analysis, so i want to know if PIG or HIVE is the best. If not which other Hadoop tool can be used and how?
In my experience the most concise, yet statically typed and very flexible, is Scalding. It's robust, concise and functional.
Scalding is an open source Twitter project that sits on top of Cascading. Cascading sits on top of Hadoop. What cascading does is take user defined stages, and magically 'cascade' them down into as few MapReduce phases as possible.
This page pretty much proves Scalding is the best Hadoop API:
https://github.com/twitter/scalding/wiki/Rosetta-Code
Spark (not technically a Hadoop technology, it's actually much much better) now has a magic JsonRDD - you give it a JSON file(s) and it will magically work out the scheme.

Can I do cluster analyses with Hadoop and can I hook Hadoop to MongoDB

I am a newbie to Hadoop, I went through a few blogs and skimmed through a couple of books on a subject. To guide my further studying I need answer to these two questions:
How much I can really do with Map-Reduce? From examples I see I can do min(), max(), sum(), count(). You can probably as easy to do average() and even standard_deviation(), but is that it? What if I want to run a query such that “customers who bought X also bought Y” (sort of join table to itself in SQL terminology). What if I want to do graph analyses or cluster analyses is Haddop’s map-reduce of any help or I am still pretty much on my own?
If I have existing database, let's say it is big (1 petabyte) and distributed, let’s say it is MongoDB with clusters, shards and all that. Can I hook Haddop to my existing MondoDB shards, or do I need to copy my data (and respectively keep it synchronized as it changes). The latter, if that is what I really need to do, sounds like expensive process, is there anything in Hadoop to help me do it.
Detailed elaborative answer or a link to such will be much appreciated.
MapReduce is pretty general structure for computation and while it is not appropriate for every situation, it is flexible in being used for a wide variety of problems. For examples of what you have described, you can refer to Mahout: http://mahout.apache.org/users/clustering/clusteringyourdata.html
Using the mongodb connector you can directly access a mongo database as a mapreduce inputformat without having to sync the data to HDFS: http://docs.mongodb.org/ecosystem/tools/hadoop
Alternatively mongo itself allows you to write queries in mapreduce to be directly executed by the database. I'd recommend aggregating the data as much as possible in this way before interfacing with hadoop/hdfs to reduce the potential amount of data transferred between the two systems.

Pig vs Hive vs Native Map Reduce

I've basic understanding on what Pig, Hive abstractions are. But I don't have a clear idea on the scenarios that require Hive, Pig or native map reduce.
I went through few articles which basically points out that Hive is for structured processing and Pig is for unstructured processing. When do we need native map reduce? Can you point out few scenarios that can't be solved using Pig or Hive but in native map reduce?
Complex branching logic which has a lot of nested if .. else .. structures is easier and quicker to implement in Standard MapReduce, for processing structured data you could use Pangool, it also simplifies things like JOIN. Also Standard MapReduce gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it requires more time to code and introduce changes.
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key), it is simpler to implement things like:
Get top N elements for each group;
Calculate total per each group and than put that total against each row in the group;
Use Bloom filters for JOIN optimisations;
Multiquery support (it is when PIG tries to minimise the number on MapReduce Jobs by doing more stuff in a single Job)
Hive is better suited for ad-hoc queries, but its main advantage is that it has engine that stores and partitions data. But its tables can be read from Pig or Standard MapReduce.
One more thing, Hive and Pig are not well suited to work with hierarchical data.
Short answer - We need MapReduce when we need very deep level and fine grained control on the way we want to process our data. Sometimes, it is not very convenient to express what we need exactly in terms of Pig and Hive queries.
It should not be totally impossible to do, what you can using MapReduce, through Pig or Hive. With the level of flexibility provided by Pig and Hive you can somehow manage to achieve your goal, but it might be not that smooth. You could write UDFs or do something and achieve that.
There is no clear distinction as such among the usage of these tools. It totally depends on your particular use-case. Based on your data and the kind of processing you need to decide which tool fits into your requirements better.
Edit :
Sometime ago I had a use case wherein I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird. Some part of the data was EBCDIC encoded, while rest of the data was in binary format. It was basically a flat binary file with no delimiters like\n or something. I had a tough time finding some way to process these files using Pig or Hive. As a result I had to settle down with MR. Initially it took time, but gradually it became smoother as MR is really swift once you have the basic template ready with you.
So, like I said earlier it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig(just a foreach), but what if you need foreach n?? So, when you need "that" level of control over the way you need to process your data, MR is more suitable.
Another situation might be when you data is hierarchical rather than row-based or if your data is highly unstructured.
Metapatterns problem involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
And sometimes it is very very convenient to accomplish a particular task using some xyz tool as compared to do it using Pig/hive. IMHO, MR turns out to be better in such situations as well. For example if you need to do some statistical analyses on your BigData, R used with Hadoop streaming is probably the best option to go with.
HTH
Mapreduce:
Strengths:
works both on structured and unstructured data.
good for writing complex business logic.
Weakness:
long development type
hard to achieve join functionality
Hive :
Strengths:
less development time.
suitable for adhoc analysis.
easy for joins
Weakness :
not easy for complex business logic.
deals only structured data.
Pig
Strengths :
Structured and unstructured data.
joins are easily written.
Weakness:
new language to learn.
converted into mapreduce.
Hive
Pros:
Sql like
Data-base guys love that.
Good support for structured data.
Currently support database schema and views like structure
Support concurrent multi users, multi session scenarios.
Bigger community support. Hive , Hiver server , Hiver Server2, Impala ,Centry already
Cons:
Performance degrades as data grows bigger not much to do, memory over flow issues. cant do much with it.
Hierarchical data is a challenge.
Un-structured data requires udf like component
Combination of multiple techniques could be a nightmare dynamic portions with UTDF in case of big data etc
Pig:
Pros:
Great script based data flow language.
Cons:
Un-structured data requires udf like component
Not a big community support
MapReudce:
Pros:
Dont agree with "hard to achieve join functionality", if you understand what kind of join you want to implement you can implement with few lines of code.
Most of the times MR yields better performance.
MR support for hierarchical data is great especially implement tree like structures.
Better control at partitioning / indexing the data.
Job chaining.
Cons:
Need to know api very well to get a better performance etc
Code / debug / maintain
Scenarios where Hadoop Map Reduce is preferred to Hive or PIG
When you need definite driver program control
Whenever the job requires implementing a custom Partitioner
If there already exists pre-defined library of Java Mappers or Reducers for a job
If you require good amount of testability when combining lots of large data sets
If the application demands legacy code requirements that command physical structure
If the job requires optimization at a particular stage of processing by making the best use of tricks like in-mapper combining
If the job has some tricky usage of distributed cache (replicated join), cross products, groupings or joins
Pros of Pig/Hive :
Hadoop MapReduce requires more development effort than Pig and Hive.
Pig and Hive coding approaches are slower than a fully tuned Hadoop MapReduce program.
When using Pig and Hive for executing jobs, Hadoop developers need not worry about any version mismatch.
There is very limited possibility for the developer to write java level bugs when coding in Pig or Hive.
Have a look at this post for Pig Vs Hive comparison.
All the things which we can do using PIG and HIVE can be achieved using MR (sometimes it will be time consuming though). PIG and HIVE uses MR/SPARK/TEZ underneath. So all the things which MR can do may or may not be possible in Hive and PIG.
Here is the great comparison.
It specifies all the use case scenarios.

Synchronize data to HBase/HDFS and use it as input to MapReduce job

I would like to synchronize data to a Hadoop filesystem. This data is intended to be used as input for a scheduled MapReduce job.
This example might explain more:
Lets say I have an input stream of documents which contain a bunch of words, these words are needed as input for a MapReduce WordCount job. So, for each document, all words should be parsed out and uploaded to the filesystem. However, if the same document arrives from the input stream again, I only want the changes to be uploaded (or deleted) from the filesystem.
How should the data be stored; should I use HDFS or HBase? The amount of data is not very large, maybe a couple of GB.
Is it possible to start scheduled MapReduce jobs with input from HDFS and/or HBase?
I would first pick the best tool for the job, or do some research to make a reasonable choice. You're asking the question, which is the most important step. Given the amount of data you're planning to process, Hadoop is probably just one option. If this is the first step towards bigger and better things, then that would narrow the field.
I would then start off with the simplest approach that I expect to work, which typically means using the tools I already know. Write code flexibly to make it easier to replace original choices with better ones as you learn more or run into roadblocks. Given what you've stated in your question, I'd start off by using HDFS, using Hadoop command-lines tools to push the data to an HDFS folder (hadoop fs -put ...). Then, I'd write an MR job or jobs to do the processing, running them manually. When it was working I'd probably use cron to handle scheduling of the jobs.
That's a place to start. As you build the process, if you reach a point where HBase seems like a natural fit for what you want to store, then switch over to that. Solve one problem at a time, and that will give you clarity on which tools are the right choice each step of the way. For example, you might get to the scheduling step and know by that time that cron won't do what you need - perhaps your organization has requirements for job scheduling that cron won't fulfil. So, you pick a different tool.

Resources