Replacement for HBase lookups during map in MapReduce - Hadoop

During MapReduce processing, I need to look up HBase multiple times in one map execution. This is becoming a bottleneck because HBase is turning out to be very slow.
The lookups happen several times within one map call; for example, each input line contains multiple employee IDs, and the employee information is stored in HBase.
What could be the alternatives to this? Is HBase supposed to be slow for this kind of processing? Is it better to dump the HBase table to HDFS as text and then do a join instead of lookups?

It's a bit hard to give a perfect answer without knowing exactly what your MR job is doing, but I'd look at using TableInputFormatBase (with MultipleInputs to read the HBase table into your mapper alongside your other data), and then join on employee ID. This may mean that you now need two MR jobs, but it could be quicker than multiple lookups, and should certainly scale better.
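For concreteness, here is a rough, hedged sketch of the second, join-based job, assuming the employee table has already been exported to HDFS as tab-separated "employeeId<TAB>info" lines (for example via a TableMapper job) and that the other dataset has been pre-split so each record starts with a single employee ID. All class names, paths, and field layouts are illustrative assumptions, not details from the original question.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EmployeeJoin {

  // Tags each exported employee record as "E<TAB>info", keyed by employee ID.
  public static class EmployeeMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("E\t" + parts[1]));
      }
    }
  }

  // Tags each line of the main dataset as "L<TAB>line", keyed by employee ID.
  public static class LineMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("L\t" + parts[1]));
      }
    }
  }

  // Joins the two tagged streams on employee ID.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String employeeInfo = null;
      List<String> lines = new ArrayList<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("E".equals(parts[0])) {
          employeeInfo = parts[1];
        } else {
          lines.add(parts[1]);
        }
      }
      for (String line : lines) {
        ctx.write(key, new Text(line + "\t" + (employeeInfo == null ? "" : employeeInfo)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "employee-join");
    job.setJarByClass(EmployeeJoin.class);
    // args[0]: exported employee data, args[1]: main dataset, args[2]: output
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, EmployeeMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, LineMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```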

Related

How do we optimise the Spark job if the base table has 130 billion records

We are joining multiple tables and doing complex transformations and enrichments.
The base table has around 130 billion records. How can we optimise the Spark job when Spark filters all the records, keeps them in memory, and does the enrichments with the other left-outer-joined tables? Currently the Spark job runs for more than 7 hours; can you suggest some techniques?
Here is what you can try:
1. Partition the base tables you want to query. Create the partition on a specific column, such as department or date, that you use during joining. If the underlying table is Hive, you can also try bucketing.
2. Try the optimised join that suits your requirement, such as a sort-merge join or a hash join (see the sketch after this list).
3. File format: use the Parquet file format, as it is much faster than ORC for analytical queries and it also stores data in a columnar format.
4. If your query has multiple steps and some steps are reused, try caching, as Spark supports both memory and disk caching.
5. Tune your Spark jobs by specifying the number of partitions, executors, cores, and driver memory according to the resources available. Check the Spark history UI to understand how the data is distributed. Try various configurations and see what works best for you.
6. Spark might perform poorly if there is large skew in the data. If that is the case, you might need further optimisation to handle it.
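A minimal sketch, in Spark's Java API, of a few of these suggestions: reading Parquet, pushing a filter down early, broadcasting the small dimension table so the huge side is not shuffled, and caching a reused result. The paths, column names ("event_date", "department_id"), and the shuffle-partition value are illustrative assumptions, not settings from the question.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.broadcast;
import static org.apache.spark.sql.functions.col;

public class EnrichmentJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("enrichment")
        // Shuffle parallelism is workload-specific; 2000 is only a placeholder.
        .config("spark.sql.shuffle.partitions", "2000")
        .getOrCreate();

    // Columnar Parquet input; apply the filter as early as possible.
    Dataset<Row> base = spark.read().parquet("/data/base")           // hypothetical path
        .filter(col("event_date").geq("2023-01-01"));                // hypothetical column
    Dataset<Row> dept = spark.read().parquet("/data/departments");   // hypothetical small table

    // Broadcast the small dimension table so the 130B-row side is not shuffled,
    // then drop the duplicated join column coming from the broadcast side.
    Dataset<Row> enriched = base
        .join(broadcast(dept),
              base.col("department_id").equalTo(dept.col("department_id")),
              "left_outer")
        .drop(dept.col("department_id"));

    // Cache only if the result is reused by several downstream steps.
    enriched.persist();

    enriched.write().mode("overwrite").parquet("/data/enriched");    // hypothetical output
    spark.stop();
  }
}
```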
Apart from the techniques mentioned above, you can try the options below as well to optimize your job.
1. You can partition your data after inspecting your data fields. The most common columns used for partitioning are date columns, region IDs, country codes, etc. Once the data is partitioned, you can explain your dataframe with df.explain() and see whether it is using a PartitioningAwareFileIndex (see the sketch after this list).
2. Try tuning the Spark settings and cluster configuration to scale with the input data volume:
   - Try changing spark.sql.files.maxPartitionBytes to 256 MB or 512 MB; we have seen significant performance gains by changing this parameter.
   - Use an appropriate number of executors, cores, and executor memory based on the compute needed.
   - Try analyzing the Spark history to identify the stages and jobs that consume significant time. That is a good place to start debugging your job.
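As a small, hedged illustration of points 1 and 2, the sketch below raises spark.sql.files.maxPartitionBytes and uses df.explain() to check that partition pruning shows up in the physical plan. The path and the "country_code" partition column are assumptions for the example.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class PartitionPruningCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("pruning-check").getOrCreate();

    // 256 MB per input split, as suggested above (value is in bytes).
    spark.conf().set("spark.sql.files.maxPartitionBytes", String.valueOf(256L * 1024 * 1024));

    Dataset<Row> df = spark.read().parquet("/data/base_partitioned")   // hypothetical path
        .filter(col("country_code").equalTo("US"));                    // partition column filter

    // The physical plan should show a partition-aware file index and
    // PartitionFilters if pruning is actually happening.
    df.explain(true);

    spark.stop();
  }
}
```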

Try to confirm my understanding of HBase and MapReduce behavior

I'm trying to do some processing on my HBase dataset, but I'm pretty new to HBase and the Hadoop ecosystem.
I would like to get some feedback from this community to see whether my understanding of HBase and MapReduce operations on it is correct.
Some background here:
1. We have an HBase table that is about 1 TB and exceeds 100 million records.
2. It has 3 region servers, and each region server holds about 80 regions, making 240 regions in total.
3. The records in the table should be fairly uniformly distributed across the regions, from what I know.
What I'm trying to achieve is to filter out rows based on some column values and export those rows to the HDFS filesystem or something like that.
For example, we have a column named "type" that might contain the value 1, 2 or 3. I would like to have 3 distinct HDFS files (or directories, as data on HDFS is partitioned) that hold the records of type 1, 2, and 3 respectively.
From what I can tell, MapReduce seems like a good approach to attack these kinds of problems.
I've done some research and experimentation and could get the result I want, but I'm not sure whether I understand the behavior of HBase's TableMapper and Scan, and that understanding is crucial for our code's performance, as our dataset is really large.
To simplify the issue, I will take the official RowCounter implementation as an example and confirm that my understanding of it is correct.
So my questions about HBase with MapReduce are:
In the simplest form of RowCounter (without any optional arguments), it is actually a full table scan: HBase iterates over all records in the table and emits each row to the map method of RowCounterMapper. Is this correct?
The TableMapper will divide the work based on how many regions the table has. For example, if we have only 1 region in our HBase table, there will be only 1 map task, which effectively runs as a single thread and does not utilize any parallel processing of our Hadoop cluster?
If the above is correct, is it possible to configure HBase to spawn multiple tasks per region? For example, when we run RowCounter on a table that has only 1 region, could it still have 10 or 20 tasks and count the rows in parallel?
Since TableMapper also depends on the Scan operation, I would also like to confirm my understanding of Scan and its performance.
If I use setStartRow / setEndRow to limit the scope of my dataset, then because the rowkey is indexed, it does not hurt performance, as it avoids a full table scan.
In our case, we might need to filter our data based on its modified time, so we might use scan.setTimeRange() to limit the scope of our dataset. My question is: since HBase does not index the timestamp, will this scan become a full table scan and thus have no advantage compared to simply filtering in our MapReduce job itself?
Finally, we have had some discussion on how we should do this export, and we have the following two approaches, but we are not sure which one is better.
Using the MapReduce approach described above. But we are not sure whether the parallelism is bounded by how many regions the table has, i.e., the concurrency never exceeds the region count, and we cannot improve performance unless we increase the number of regions.
Maintaining a rowkey list in a separate place (perhaps on HDFS), using Spark to read the file, and then fetching each record with a simple Get operation. All the concurrency would happen on the Spark/Hadoop side.
I would appreciate suggestions from this community about which solution is better; it would be really helpful. Thanks.
It seems like you have a very small cluster. Scalability also depends on the number of region servers (RS), so merely increasing the number of regions in the table without increasing the number of region servers won't really speed up the job. I think 80 regions/RS for that table is already decent.
I am assuming you are going to use TableInputFormat; it works by running one mapper per region and performs server-side filtering based on the Scan object. I agree that scanning with TableInputFormat is the optimal approach for exporting a large amount of data from HBase, but scalability and performance are not simply proportional to the number of regions. There are many other factors; the number of RS, the RAM and disk on each RS, and how uniformly the data is distributed are some of them.
In general, I would go with #1, since you just need to prepare a Scan object and HBase will take care of the rest.
#2 is more cumbersome, since you need to maintain the rowkey state outside HBase.
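As a hedged sketch of approach #1, the map-only job below runs one mapper per region, applies the "type" filter server-side via the Scan, and writes matching row keys to HDFS. The column family "cf", qualifier "type", table name, and the HBase 2.x CompareOperator API are assumptions; adjust them to your schema and client version.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExportType1 {

  // One mapper runs per region; the filter has already been applied
  // server-side, so only matching rows reach map().
  public static class ExportMapper extends TableMapper<NullWritable, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), new Text(Bytes.toString(row.get())));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "export-type-1");
    job.setJarByClass(ExportType1.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer, larger RPCs per region
    scan.setCacheBlocks(false);  // don't pollute the block cache on a full scan

    SingleColumnValueFilter typeFilter = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("type"),
        CompareOperator.EQUAL, Bytes.toBytes("1"));
    typeFilter.setFilterIfMissing(true);   // skip rows that lack the column
    scan.setFilter(typeFilter);

    TableMapReduceUtil.initTableMapperJob(
        "my_table", scan, ExportMapper.class,   // "my_table" is a placeholder
        NullWritable.class, Text.class, job);

    job.setNumReduceTasks(0);                   // map-only export
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```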

MapReduce with Hive

I have 4 different data sets in the form of 4 CSV files, and the common field among them is ID. I have to implement this using a join. Which would be better for implementing this, MapReduce or Hive? And is it possible to combine both MapReduce and Hive?
Many thanks.
Hive translates Hive queries into a series of MapReduce jobs to emulate the query's behaviour. While Hive is very useful, it is not always efficient to represent your business logic as a Hive query.
If you are fine with some delay and have large data sets to join, you can go with Hive.
If your data sets are small, you can still use MapReduce joins or the distributed cache.
Have a look at the Map Reduce Joins article.
Most of the time, MapReduce gives better performance and control compared to Hive, but the code has to be written with a good understanding of the use case.
Yes, it is possible to combine both MapReduce and Hive.
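As a small illustration of the map-side join / distributed cache option mentioned above (for when one CSV is small enough to fit in memory), here is a hedged mapper sketch. The cache-file name, the ID-in-first-column CSV layout, and the class name are assumptions for the example; the driver is assumed to register the small file with job.addCacheFile().

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<Object, Text, Text, Text> {

  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // The driver is assumed to have called
    // job.addCacheFile(new URI("hdfs:///path/small.csv"));
    // the file is symlinked into the task's working directory under its base name.
    URI[] cached = ctx.getCacheFiles();
    String localName = new File(cached[0].getPath()).getName();
    try (BufferedReader in = new BufferedReader(new FileReader(localName))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split(",", 2);            // ID is assumed to be the first field
        smallTable.put(f[0], f.length > 1 ? f[1] : "");
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] f = value.toString().split(",", 2);
    String match = smallTable.get(f[0]);
    if (match != null) {
      // Emit the joined record: ID as key, big-side fields plus small-side fields.
      ctx.write(new Text(f[0]), new Text((f.length > 1 ? f[1] : "") + "," + match));
    }
  }
}
```

Because the join happens in the mapper, no reducer is needed; the job can run map-only, which avoids the shuffle entirely.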

MapReduce for same task/different data

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., the same schema, but only carries a share of the data (and not the full data!). The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.
By distributing the data across multiple databases, we want to lower the amount of work processed by each database and ultimately speed up the search. In the end, we combine the results of each database to create the final results.
A friend of mine recommended looking at MapReduce (Hadoop). In my opinion, map-reduce only makes sense if the individual workers share the same data but perform different types of work on it (corresponding to multiple instruction, single data).
In our case, however, the workers should perform the same task, but on different data (corresponding to single instruction, multiple data).
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?
Yes.
I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., the "same tables"), but on different segments of that data. The parallel map and reduce tasks are the same task run over different portions of the data. MapReduce is most definitely "single instruction, multiple data" by your definition.
Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:
Note that MapReduce is only really going to do batch analytics for you: things like rollups, counts, and aggregates. You won't be able to retrieve or search effectively with MapReduce. Also, updating data in Hadoop is not the typical way you want to do things; you treat data as more "append only". For any of that, you'll probably want to look at HBase.
Hadoop's file system segments the data for you. From a file system perspective, it looks like files in folders containing CSV (or some other file format). Files get split into blocks, which can then be operated on separately by map tasks. You won't have to manually shard the data like you do now.
Take a look at Hive. It's an abstraction layer on top of MapReduce that translates a light version of SQL into MapReduce under the covers. It should allow you to convert some of your logic a bit more easily.

How to design Hadoop job to match fields from one file to another

I have two files, each containing different data. I would like to do some processing on these files and then merge the data based on matching keys. What is the best way to implement this in Hadoop? I was thinking of somehow creating two mappers that would each process one file and then a reducer to combine the data, but I'm not sure if that is even possible. Does anyone have any suggestions on how I can combine data from two files in Hadoop?
There are many ways to write a map/reduce job (Hive, Pig, Cascading, Java, etc.), but essentially a join is a multi-input job where the mappers emit records in a (key_to_join_by, rest_of_data) format and the reducer does the actual join (unless one of the files is small enough to hold in memory, in which case you can do the join in the mapper).
You can see an example of how to do this in Pig here.
Can you give examples of your files? It is not clear what you are asking. Are you talking about doing joins in Hadoop? If so, you will need two mapper classes. Or you can use Hive, which makes performing joins easier. Please look at this for examples of both possible solutions: Joins in Hadoop
