reading a hadoop.hive.ql.io.HiveSequenceFileOutputFormat hive table in spark - hadoop

I have a Hive table in Hadoop with the output format
hadoop.hive.ql.io.HiveSequenceFileOutputFormat
I am reading this table with Spark SQL:
spark.sql('select * from testtable where y = 2021 and month = 12 and day =12')
The Spark job runs very slowly. I have tried adjusting the number of executors and the memory per executor, but nothing seems to improve performance. I read on a blog that SequenceFiles are not the best choice for Hive tables.
Is there a better way of reading this table?
Thanks in advance for any help.

You should consider partitioning your table by date if you will continue to access it regularly by date. (Lookups on the partition columns will be very fast, at the cost of queries that don't filter on the partitions.)
You should also look into the "small files problem" with Hadoop. You can get a good speedup from making files larger.
You should look at using Parquet or ORC. They compress nicely and often boost performance.
You should also look at computing table statistics on the Hive table, which helps the query planner and can improve performance.
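A rough sketch of how these suggestions could be combined from Spark: rewrite the table once as partitioned Parquet, then collect statistics. The code only builds the Spark SQL statements. The target name testtable_parquet is hypothetical, and it assumes y, month, and day are columns of testtable, as the query in the question suggests.

```python
def build_migration_sql(source, target, partition_cols):
    """Return Spark SQL statements that copy `source` into a partitioned Parquet table."""
    cols = ", ".join(partition_cols)
    create = (
        f"CREATE TABLE {target} USING PARQUET "
        f"PARTITIONED BY ({cols}) AS SELECT * FROM {source}"
    )
    # Table-level statistics for the newly written table.
    analyze = f"ANALYZE TABLE {target} COMPUTE STATISTICS"
    return [create, analyze]


# `testtable_parquet` is a made-up name used only for illustration.
statements = build_migration_sql("testtable", "testtable_parquet", ["y", "month", "day"])
for stmt in statements:
    print(stmt)  # with a live session, each one would be passed to spark.sql(stmt)
```

After the one-time rewrite, the original query would be pointed at the new table, so a filter on y, month, and day only touches the matching partition directories.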

Related

1 Billion records join(Filters) in Spark with Parquet file format vs HadoopText Input format

I am reading a table of 1 billion records from Hive in Spark. The table has date and country columns as partitions. The job runs for a very long time because we apply many transformations to it. If I change the Hive table file format to Parquet, will there be any performance gain? Any suggestions for improving performance?
Changing from ORC to Parquet may not improve performance by itself.
It depends on the type of data you have. If you are working with nested objects, Parquet is usually the better fit; ORC handles them less well.
To get some improvement, I suggest a few steps that can help with your data in Hive.
Check the number of files in Hive.
A common cause of slow Hive queries is the number of files in each partition and the size of those files. If you are using Spark to store the data, check the size of the files and whether they are close to the size of your Hadoop block. If not, try the CONCATENATE command to merge them.
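The file-size check can be sketched as simple arithmetic. The example below assumes a 128 MB block size and treats any file under half a block as "small"; both values are illustrative, and the list of sizes is made up rather than read from a real cluster.

```python
import math


def small_file_report(file_sizes, block_size=128 * 1024 * 1024, threshold=0.5):
    """Summarise how far a partition's files are from the block size."""
    small = [s for s in file_sizes if s < threshold * block_size]
    total = sum(file_sizes)
    return {
        "files": len(file_sizes),
        "small_files": len(small),
        "total_bytes": total,
        # Fewest files needed if each file were about one block in size.
        "ideal_files": max(1, math.ceil(total / block_size)),
    }


MB = 1024 * 1024
sizes = [4 * MB] * 200 + [130 * MB]  # 200 tiny files plus one large file
report = small_file_report(sizes)
print(report)
```

A large gap between files and ideal_files suggests the partition is worth compacting.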
Predicate PushDown
Predicate pushdown is where Hive and ORC files can give you the best query performance. I suggest running an ANALYZE command to force the creation of statistics for your table. This updates the Hive Metastore with relevant information about the data and helps the optimizer, especially when the data is laid out inefficiently.
Ordered Data
If possible, store your data ordered by some column, and filter and do other operations on that column. Your join can also benefit from this.
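Why ordering helps predicate pushdown can be shown with a toy model. Columnar formats keep minimum and maximum values per chunk of rows, and a reader can skip chunks whose range cannot contain the requested value. The sketch below is a simplified simulation in plain Python, not the actual ORC or Parquet reader logic; the chunk size and data are invented.

```python
def chunk_stats(values, chunk_size):
    """Split values into chunks and record (min, max) per chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return [(min(c), max(c)) for c in chunks]


def chunks_to_read(stats, wanted):
    """Count chunks whose [min, max] range could contain `wanted`."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)


# 1000 values in an interleaved order: 0, 500, 1, 501, ...
unsorted_vals = [v for pair in zip(range(500), range(500, 1000)) for v in pair]
sorted_vals = sorted(unsorted_vals)

unsorted_reads = chunks_to_read(chunk_stats(unsorted_vals, 100), 250)
sorted_reads = chunks_to_read(chunk_stats(sorted_vals, 100), 250)
print(unsorted_reads, sorted_reads)
```

With the interleaved layout, six of the ten chunks have a range that covers 250 and must be read; with the sorted layout only one does.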

Hive or Hbase when we need to pull more number of columns?

I have a data structure in Hadoop with 100 columns and a few hundred rows. Most of the time I need to query 65% of the columns. In this case, which is better to use: HBase or Hive? Please advise.
The number of columns you are accessing is NOT, by itself, the criterion for deciding between HBase and Hive.
HIVE (SQL) :
Use Hive when you have warehousing needs, you are good at SQL, and you don't want to write MapReduce jobs. One important point though: Hive queries get converted into corresponding MapReduce jobs under the hood, which run on your cluster and give you the result. Hive does the trick for you. But not every problem can be solved with HiveQL. Sometimes, if you need really fine-grained and complex processing, you might have to fall back on MapReduce.
Hbase (NoSQL database):
You can use Hbase to serve that purpose. If you have some data which you want to access real time, you could store it in Hbase.
An HBase get 'rowkey' lookup is powerful when you know your access pattern.
HBase is a CP system in terms of the CAP theorem.
Consistency:
Every node in the system contains the same data (e.g. replicas are never out of date)
Availability:
Every request to a non-failing node in the system returns a response
Partition Tolerance:
System properties (consistency and/or availability) hold even when the system is partitioned (communication lost) and data is lost (node lost)
It's very difficult to answer the question in one line.
HBase is a NoSQL database: your data needs to be stored denormalized, because HBase is very bad at joining tables.
Hive: You can store data in a similar (normalized) format in Hive, but you would only see benefits when doing batch processing.

Performance Issue in Hadoop,HBase & Hive

I am working on migrating data from a SQL database to Hadoop, using HBase and Hive as well. I have successfully imported my data from the SQL database into Hadoop, HBase and Hive, but the problem is the performance of the system. In the SQL database I was getting results over millions of entries within 5-10 minutes, but it takes around 1 hour to fetch 10 million rows from HBase and Hive. Can anyone help me improve the performance of my Hadoop system?
Data in HBase is only 'indexed' by rowkey. If you're querying in Hive on anything other than rowkey prefixes, you will generally be performing a full table scan.
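The difference between a rowkey-prefix lookup and a filter on anything else can be illustrated with a small in-memory model. This is a conceptual sketch in plain Python, not HBase client code; the rowkeys and the country column are invented for the example.

```python
from bisect import bisect_left

# Rows kept sorted by rowkey, mirroring how HBase orders rows in a table.
rows = sorted([
    ("user1#2021-12-11", {"country": "DE"}),
    ("user1#2021-12-12", {"country": "DE"}),
    ("user2#2021-12-12", {"country": "FR"}),
    ("user3#2021-12-10", {"country": "DE"}),
])
keys = [k for k, _ in rows]


def prefix_scan(prefix):
    """Jump to the first matching key, then read only while the prefix matches."""
    i = bisect_left(keys, prefix)
    matches, touched = [], 0
    while i < len(rows) and keys[i].startswith(prefix):
        matches.append(rows[i])
        touched += 1
        i += 1
    return matches, touched


def filter_scan(column, value):
    """No index on column values: every row has to be examined."""
    matches = [r for r in rows if r[1].get(column) == value]
    return matches, len(rows)


prefix_matches, prefix_touched = prefix_scan("user1#")
filter_matches, filter_touched = filter_scan("country", "DE")
print(prefix_touched, filter_touched)
```

The prefix scan examines only the rows it returns, while the column filter examines the whole table regardless of how many rows match.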
There are some optimizations that can be made with HBase filters e.g., when using a FamilyFilter, you may be able to skip entire regions, but I doubt Hive is doing that.
How to improve performance depends on how your data is shaped and what analysis you need to perform on it. When performing frequent ad-hoc analysis, you may be better served by exporting data from HBase into something like Parquet files on HDFS and running your analysis against those with Hive (or Drill, Spark, Impala, etc.).

Does anybody know how to choose the data model when using impala?

There are several kinds of storage formats, such as Impala internal tables or external table formats like CSV, Parquet, and HBase. We need to guarantee an average insert rate of 50K rows/s, and each row is about 1 KB. Some of the data can also be updated occasionally. We also need to run aggregations on the data.
I think HBase is not a good choice for large aggregation computations when using Impala with an external table. Does anybody have a suggestion?
Thanks, Chen.
I've never worked with Impala, but I can tell you a few things based on my experience with Hive.
HBase will be faster if you have a good key design and a proper schema because, just like Hive, Impala will translate your WHERE into scan filters, so it will depend a lot on the type of queries you run. There are multiple techniques to reduce the amount of data read by a job: from simple ones like providing start and stop rowkeys, time ranges, reading only some families/columns, and the already mentioned filters, to more complex solutions like performing real-time aggregations on your data (*) and keeping them as counters.
Regarding your insert rate, HBase can handle it perfectly well with the proper infrastructure (it is better to use the native HBase Java API). You can also buffer your writes to get even better performance.
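At 50K rows/s and about 1 KB per row, the target is roughly 50 MB/s of writes, so sending rows one at a time is unlikely to keep up. Write buffering can be sketched independently of any client library. BufferedWriter below is a hypothetical helper written for this example, not an HBase API, and flush_fn stands in for whatever batch-write call the real client provides.

```python
class BufferedWriter:
    """Collect rows and hand them to `flush_fn` in batches (hypothetical helper)."""

    def __init__(self, flush_fn, batch_size):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one batch, then start a fresh buffer.
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


batches = []  # stand-in sink: records each batch instead of writing it anywhere
writer = BufferedWriter(batches.append, batch_size=1000)
for i in range(2500):
    writer.write({"rowkey": f"row{i:06d}", "value": i})
writer.flush()  # send the final partial batch
print([len(b) for b in batches])
```

The trade-off is that rows still sitting in the buffer are lost if the process dies before a flush, so the batch size is a balance between throughput and how much data can be at risk.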
*Not sure if Impala supports HBase counters.

Hive performance

I work with Hive and I am new to it. I am facing some performance issues with a Hive query.
The number of mappers allocated to my job is very low even though there are hundreds of mappers available. I have tried setting mapred.map.tasks=200, but the job takes only 20 to 30 mappers. I understand that the number of mappers depends on the input splits. Is there any other option to increase the number of mappers? If not, why was the parameter (mapred.map.tasks) introduced?
Is there any resource that explains how Hive queries correlate to MapReduce jobs, i.e. where the different parts of the query are executed?
For more information about setting map tasks, check this link: http://wiki.apache.org/hadoop/HowManyMapsAndReduces. Basically, mapred.map.tasks is just a hint; it doesn't really control anything usually.
To see how Hive queries are executed, simply preface your query with explain. For example: explain select foo from bar;. If you need even more information, there's also explain extended.
I see this question was asked a long time ago. I'll try to answer it anyway, even though some of the suggestions here were not available when the question was asked.
To optimize Hive performance:
Tune the number of mappers and reducers used by your Hive query. This can be done by tuning the input size for each mapper (mapreduce.input.fileinputformat.split.maxsize) and the input size for each reducer (hive.exec.reducers.bytes.per.reducer).
Bear in mind that "the more the better" is not always true, so tune those numbers to your needs.
Optimize the joins: convert joins to map joins where possible, i.e. when one of the tables is small.
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization)
Partition your table on columns that are often used in conditions (WHERE).
For example, if you frequently run SELECT * from myTable WHERE someColumn = 'someValue', it is recommended to partition your table on the column 'someColumn'.
This lets your query search only the files of the partition someColumn=someValue instead of all the table's files.
Compressing the intermediate results may improve performance in some cases (depending on your hardware configuration, network, and CPU/memory). This can be done by setting hive.exec.compress.intermediate=true and choosing the codec with the property hive.intermediate.compression.codec.
Choose the right compression codec for the output, for example Snappy:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
Options that were not available at the time of the question:
Use an optimized file format to store your table. Instead of Text File or Sequence File, you could use ORC (Hive 0.11+), for example (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC).
Use another engine to execute your queries. Instead of MapReduce, you could use Tez or even Spark. To use Tez, for example:
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
You can decrease 'mapreduce.input.fileinputformat.split.maxsize' to increase the number of mappers (more splits).
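The effect of the split and reducer sizes can be estimated with simple arithmetic. The functions below are a simplified model: they ignore block boundaries, the slack Hadoop allows on the last split, and any combining of small files by the input format. The file sizes and the reducer cap of 1009 are illustrative values.

```python
import math


def estimate_mappers(file_sizes, split_max_size):
    """Rough split count: each file is cut into pieces of at most split_max_size bytes."""
    return sum(max(1, math.ceil(size / split_max_size)) for size in file_sizes)


def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Reducer count derived from input size, capped at max_reducers."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))


MB = 1024 * 1024
files = [1024 * MB] * 5  # five 1 GiB input files

mappers_256 = estimate_mappers(files, 256 * MB)
mappers_64 = estimate_mappers(files, 64 * MB)
reducers = estimate_reducers(sum(files), 256 * MB, 1009)
print(mappers_256, mappers_64, reducers)
```

In this model, lowering the maximum split size from 256 MB to 64 MB raises the estimate from 20 to 80 mappers for the same input, which is the lever described above. Each extra task also adds scheduling overhead, so more splits are not automatically faster.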

Resources