HDFS Metadata is taking too much space - hadoop

I am trying to migrate data from a SQL database to HBase with Hadoop. The problem is that the database is 70 GB in SQL, but it takes around 400 GB once I have transferred it to Hadoop. Why is that? Is there any way to reduce the space used?
Also, how much disk space would be required for a SQL database of 800 GB?

After a lot of searching, I found that I am storing my data in Hadoop's default format, i.e. text format, which consumes far more space than other storage formats. Manjunath is also correct: reducing the replication factor would reduce the storage space, but it causes problems of its own, since fewer replicas mean less fault tolerance. For further information on this topic, refer to the link below:
http://datametica.com/rcorc-file-format/
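
As a rough back-of-the-envelope check on the numbers in the question, the sketch below separates the two effects mentioned above: replication and per-replica expansion. It assumes that the 400 GB figure is total disk usage at HDFS's default replication factor of 3, and that the 800 GB database would expand at the same ratio; neither is confirmed by the question, and the function names are illustrative only.

```python
# Rough HDFS footprint estimate. All names here are illustrative.

def expansion_per_replica(source_gb, hdfs_gb, replication=3):
    """How much larger a single replica is than the source data."""
    return hdfs_gb / (source_gb * replication)


def estimate_hdfs_gb(source_gb, per_replica_factor, replication=3):
    """Projected raw disk usage across the cluster."""
    return source_gb * per_replica_factor * replication


# Figures from the question: 70 GB in SQL became about 400 GB in Hadoop.
# Assumption: 400 GB already includes 3 replicas.
factor = expansion_per_replica(70, 400)

# Assumption: the 800 GB database expands at the same ratio.
print(round(factor, 2))
print(round(estimate_hdfs_gb(800, factor)))
print(round(estimate_hdfs_gb(800, factor, replication=2)))
```

Under these assumptions each replica is roughly twice the size of the source, so a more compact storage format or compression attacks that factor, while lowering replication trades disk space for fault tolerance.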

Related

hbase skip region server to read rows directly from hfile

I am attempting to dump over 10 billion records into HBase, growing
on average by 10 million per day, and then run a full table scan
over the records. I understand that a full scan over HDFS will be
faster than one over HBase.
HBase is being used to order the disparate data
on HDFS. The application is being built using Spark.
The data is bulk-loaded into HBase. Because of the various 2G limits, the region size was reduced to 1.2G from an initial test of 3G (this still requires more detailed investigation).
Scan cache is 1000 and cache blocks is off.
Total HBase size is in the 6TB range, yielding several thousand regions across 5 region servers (nodes). (The recommendation is low hundreds.)
The Spark job essentially runs across each row and then computes something based on columns within a range.
Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hrs.
To bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. The job completed in about 5.5 hrs.
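
For reference, the region arithmetic behind "several thousand regions" can be sketched as follows. This is only an estimate: it uses decimal units, treats every region as full, and the target of 200 regions per server is an assumed stand-in for "low hundreds". The function names are illustrative.

```python
import math


def region_count(total_mb, region_mb):
    """Number of regions if every region is filled to region_mb."""
    return math.ceil(total_mb / region_mb)


def regions_per_server(total_mb, region_mb, servers):
    """Regions each server hosts, assuming an even spread."""
    return math.ceil(region_count(total_mb, region_mb) / servers)


def region_mb_for_target(total_mb, servers, target_per_server):
    """Region size needed to hit a target region count per server."""
    return math.ceil(total_mb / (servers * target_per_server))


total_mb = 6_000_000  # about 6 TB, decimal units

print(region_count(total_mb, 1200))            # regions at 1.2G each
print(regions_per_server(total_mb, 1200, 5))   # per region server
print(region_mb_for_target(total_mb, 5, 200))  # size for ~200 per server
```

With these inputs the region size needed to reach a few hundred regions per server is well above the 2G limit mentioned above, which is the tension the questions below are about.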
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions and thus the 2G limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still fall into the region size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility that can read a row directly from an HFile (to be specific, from an HFile referenced by a snapshot)?
Are there any other pointers, for instance configuration such as the HDFS block size, that may help to boost performance? The main use case is a full table scan for the most part.
As it turns out, this was actually pretty fast. Performance analysis showed that the problem lay in one of the object representations for an IP address: InetAddress took a significant amount of time to resolve an IP address. We switched to using the raw bytes to extract whatever we needed. This alone made the job finish in about 2.5 hours.
Modelling the problem as a MapReduce job and running it on MR2 with the same change showed that it could finish in about 1 hr 20 minutes.
The iterative nature and smaller memory footprint helped MR2 achieve more parallelism, and hence it was much faster.
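
The idea of working on the raw bytes rather than building an address object per row can be illustrated as below. This is a Python sketch of the general technique, not the original job's code (which ran on the JVM); it assumes IPv4 addresses stored as 4 big-endian bytes, and the function names and the subnet check are hypothetical examples of "whatever we needed".

```python
def ipv4_octets(raw: bytes) -> tuple:
    """Return the four octets of an IPv4 address stored as 4 raw bytes."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes for an IPv4 address")
    return tuple(raw)


def in_subnet(raw: bytes, network: bytes, prefix_len: int) -> bool:
    """Check subnet membership using integer arithmetic only.

    No address object is created and no name lookup can be triggered.
    """
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    addr = int.from_bytes(raw, "big")
    net = int.from_bytes(network, "big")
    return (addr & mask) == (net & mask)


print(ipv4_octets(bytes([192, 168, 0, 1])))
print(in_subnet(bytes([10, 1, 2, 3]), bytes([10, 0, 0, 0]), 8))
```

When a per-row operation runs billions of times, replacing object construction with a few integer operations is the kind of change that can dominate the total runtime.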

To hadoop or not to hadoop

We have data (not a lot at this point) that we want to transform/aggregate/pivot up the wazoo.
I had a look on the web, and all the answers to what I am asking point to Hadoop: scalable, cheap to run (no SQL Server machine and license), fast (if you have a lot of data), and programmable (not little boxes that you drag around).
There is just one problem that I keep coming up against,
namely 'Use Hadoop if you have more than 10 GB of data'.
We don't even have 1 GB of data at this stage. Is it still viable?
My other option is SSIS. We do use SSIS for some of our current ETL, but we don't have the resources for it, and putting a SQL Server in the cloud is just going to cost too much. Don't even get me started on scalability cost and config.
thanks
Your current data volume seems too low to justify moving to Hadoop. Enter the Hadoop ecosystem only if you are dealing with a huge volume of data (TB/year) and you expect the data volume to grow exponentially down the line.
Let me explain why I advise against Hadoop for such a low volume of data.
By default Hadoop stores your files in 128 MB chunks, and when processing it also takes 128 MB chunks at a time (in parallel). If your business requirement involves heavily CPU-intensive processing, you can decrease the input chunk size below 128 MB. But by decreasing the amount of data processed per task, you end up increasing the number of IO seeks (low-level block storage). In the end you might spend more resources on managing the tasks than the actual work takes. Hence, try to avoid distributed computing as a solution for your (low) data volume.
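
The trade-off can be made concrete with a small calculation. The sketch assumes roughly one task per input chunk and a data set of about 1 GB (the upper bound given in the question); the function name is illustrative.

```python
import math


def num_chunks(data_mb, chunk_mb=128):
    """Number of input chunks (and, roughly, tasks) for a data set."""
    return max(1, math.ceil(data_mb / chunk_mb))


# About 1 GB of data at different chunk sizes.
for chunk_mb in (128, 64, 16):
    print(chunk_mb, num_chunks(1024, chunk_mb))
```

At the default chunk size, 1 GB yields only a handful of tasks, so there is little parallelism to gain. Shrinking the chunks multiplies the task count, and each task carries its own scheduling and startup cost.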
As @Makubex has suggested, don't use Hadoop.
SSIS is a good option, as it handles the data in memory, so it performs data aggregations, data type conversions, merging, etc. at a much faster rate than writing to disk using temporary tables in stored procedures.
Hadoop is meant for large amounts of data; I would suggest it only for data in the terabytes. It would be far slower than SSIS (which runs in memory) for small data sets.
Refer: When to use T-SQL or SSIS for ETL

What makes Spark fast if data size exceeds available memory?

Everywhere I try to understand Spark, it says it is fast because it keeps data in memory, as opposed to MapReduce. Let's take this example:
I have a 5-node Spark cluster with 100 GB RAM per node. Let's say I have 500 TB of data to run a Spark job against. The total data that Spark can keep in memory is 100*5 = 500 GB. If it can keep at most 500 GB of data in memory at any point in time, what makes it lightning fast?
Spark isn't magical and can't change fundamental principles of computing. Spark uses memory as a progressive enhancement and will fall back to disk I/O for huge datasets that cannot be kept in memory. In a scenario where tables must be scanned from disk, Spark performance should be comparable to other parallel solutions that scan tables from disk.
Suppose only 0.1% of the 500 TB is "interesting". For instance, in a marketing funnel there are a lot of ad impressions, fewer clicks, even fewer sales, and fewer still repeat sales. A program can filter through a huge dataset and tell Spark to cache in memory a smaller, filtered and corrected dataset needed for further processing. Working from a cached, smaller filtered dataset is much faster than repeatedly scanning the tables from disk and reprocessing the larger raw data.
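
The numbers in this example line up neatly, as the short calculation below shows. It uses decimal units and treats all RAM as available for caching, which is an optimistic simplification: in practice only part of each node's memory can hold cached data. The function name is illustrative.

```python
GB_PER_TB = 1000  # decimal units, for simplicity


def filtered_size_gb(dataset_tb, interesting_fraction):
    """Size of the filtered subset, in GB."""
    return dataset_tb * GB_PER_TB * interesting_fraction


nodes = 5
ram_per_node_gb = 100
cluster_ram_gb = nodes * ram_per_node_gb

# 0.1% of 500 TB, as in the example above.
subset_gb = filtered_size_gb(500, 0.001)

print(cluster_ram_gb)
print(round(subset_gb))
print(round(subset_gb) <= cluster_ram_gb)
```

The raw 500 TB still has to be read from disk once, at disk speed. The benefit appears on every later pass, which can then work against a subset on the order of the cluster's memory rather than the full data set.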

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a plain Java application that reads a file (8 GB in size) from the Hadoop file system and applies some rules to that data. After applying the rules we get a Java HashMap (which is huge) and we keep that data in a cache or buffer. At the same time we fetch data from Hive with a query and build another Java HashMap, which is again huge. We then compare the data in both HashMaps to prepare a final report that checks data accuracy.
Because we are using a plain Java program for this process, we face the following problems.
Processing this much data takes ages. The input file contains tens of millions of records, and we need to apply rules to each row to extract the data, so the job takes days to complete. Hive contains the same amount of data, and the query takes too long to return it.
Since we are keeping the data in a buffer, we are facing memory issues.
Now we are trying to implement the same thing in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in MapReduce?
How can I increase the application performance by using MapReduce?
8 GB is a tiny data set. I can fit 4 of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1+ TB each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization, watch Join Strategies in Hive.
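
The core idea behind a sort-merge approach can be shown in miniature: if both sides are sorted by key, they can be compared in a single streaming pass without holding either one in a hash map. The sketch below is a plain Python illustration of that merge step, not Hive's implementation; it assumes each input yields (key, value) pairs sorted by a unique key, and the function name is hypothetical.

```python
def compare_sorted(left, right):
    """Yield (key, left_value, right_value) for keys that are missing or differ.

    Both inputs must be iterables of (key, value) pairs sorted by unique key.
    Only one record per side is held in memory at a time.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield (l[0], l[1], None)      # key present only on the left
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            yield (r[0], None, r[1])      # key present only on the right
            r = next(right_it, None)
        else:
            if l[1] != r[1]:
                yield (l[0], l[1], r[1])  # same key, different values
            l = next(left_it, None)
            r = next(right_it, None)


file_rows = [(1, "a"), (2, "b"), (4, "d")]
hive_rows = [(1, "a"), (2, "x"), (3, "c")]
for mismatch in compare_sorted(file_rows, hive_rows):
    print(mismatch)
```

Memory use stays constant regardless of input size, which addresses the buffer problem described in the question. The cost moves to sorting, which a relational engine or Hive can do on disk.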

Is Hadoop Suited to Serve 100 byte Records Out of 50GB Dataset?

We have a question on whether Hadoop is suitable for simple tasks that require no running application, but do require very fast reads and writes of small amounts of data.
The requirement is to write messages of roughly 100-200 bytes, with a couple of indexes, at a rate of 30 per second, and at the same time to read (search by those two indexes) at a rate of roughly 10 per second. The read queries must be very fast, 100-200 milliseconds max per query, and return a few matching records.
The total data volume is expected to reach 50-100 GB and is to be kept at that level by removing older records (something like a daily task that deletes records older than 14 days).
As you can see, the total data volume is not really that big, but we are concerned that the search speed of Hadoop may be slower than we need anyway.
Is Hadoop a solution for this?
Thanks
Nik
Hadoop alone is very bad at serving out many small segments of data. However, HBase is an indexed, table-oriented, database-like system meant to run on top of Hadoop. It is excellent at serving out small indexed records. I would research that as a solution.
Another problem to keep an eye on is that importing data into HDFS or HBase is not trivial. It can slow your cluster down quite a bit, so if Hadoop is your choice, you also have to solve how to get those 75 GB into HDFS so Hadoop can touch them.
As Sam noted, HBase is the Hadoop-stack solution that can handle your requirements. However, I wouldn't go with Hadoop if these are your only requirements for the data.
You can go with other NoSQL solutions like MongoDB or CouchDB, or even MySQL or Postgres.
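
To give a feel for how little machinery the stated requirements need, here is a minimal sketch using SQLite from the Python standard library as a stand-in for the relational options named above (it is not MySQL or Postgres, and a production setup would differ). The table layout, column names, and the assumption that a search supplies both index values are all hypothetical.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the message table with one index per search key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " key_a TEXT NOT NULL, key_b TEXT NOT NULL,"
        " created_at REAL NOT NULL, body BLOB NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_a ON messages (key_a)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_b ON messages (key_b)")
    return conn


def write(conn, key_a, key_b, body, now=None):
    """Insert one message; `now` can be overridden for testing."""
    created = time.time() if now is None else now
    conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
                 (key_a, key_b, created, body))
    conn.commit()


def search(conn, key_a, key_b):
    """Return the bodies of messages matching both keys."""
    cur = conn.execute(
        "SELECT body FROM messages WHERE key_a = ? AND key_b = ?",
        (key_a, key_b))
    return [row[0] for row in cur]


def purge(conn, max_age_days=14, now=None):
    """Daily cleanup: delete messages older than max_age_days."""
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400
    cur = conn.execute("DELETE FROM messages WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


conn = open_store()
write(conn, "user-1", "topic-1", b"hello")
print(search(conn, "user-1", "topic-1"))
```

Indexed lookups and a scheduled delete are routine work for any of the databases mentioned, which is why a Hadoop stack is hard to justify for these requirements alone.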
