Consistent Hive and Impala Hash?

I am looking for a consistent way to hash something in both the Hive Query Language and the Impala Query Language, where the hashing function produces the same value regardless of whether it is run in Hive or in Impala. To clarify, I want something like some_hive_hash_thing(A) = some_other_impala_hash_thing(A).
For Hive, I know there is hash() which uses MD5 (or any of the commands here).
For Impala, I know there is fnv_hash() which uses the FNV algorithm. I know that Hive and Impala have their own hashing functions, but they are completely different from one another.
Ideally, I am looking for a way to do fnv_hash in Hive, or a way to do MD5 in Impala. Does anyone have any suggestions?

This is a late answer, but let's keep it here for anyone else who may find it helpful.
"A way to do MD5 in Impala": yes, there is. You can use Hive's built-in UDFs in Impala in recent releases (I'm using CDH 5.12 and it works well with Impala 2.9 and Hive 1.1).
You can find the list of built-in functions here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
or you can simply run
SHOW FUNCTIONS;
in your Hive console:
beeline -u jdbc:hive2://localhost:10000
So let's walk through adding the MD5 function from Hive to Impala.
DESCRIBE FUNCTION md5;
This is to make sure the function exists and to check the input and return types; here we see that md5(string) takes a string argument and returns a string.
Next we need to find the hive-exec jar that contains our MD5 class, using the jar command:
/opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5
The jar command is usually in the bin directory under your Java installation, if it's not already configured in your environment variables.
You can find the hive-exec-X-X.jar file in ../lib/hive/lib/; if you can't find it, just use the locate command.
The output is something like:
/opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5
org/apache/hadoop/hive/ql/udf/UDFMd5.class
Save that path for later, but we'll replace the '/' with '.' and remove the '.class', like this:
org.apache.hadoop.hive.ql.udf.UDFMd5
Copy the jar file to a directory accessible by HDFS, and you may rename it for simpler use (I'm going to name it hive-exec.jar).
cp /lib/hive/lib/hive-exec.jar /opt/examples/
chown -R hdfs /opt/examples/
Then create a place to put your jars in HDFS:
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/hive_jars
Copy your jar file to HDFS using:
sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/hive-exec.jar /user/hive/warehouse/hive_jars/
Now you just have to go to impala-shell, connect to a database, and create your function using the HDFS path to the jar and the class path we converted earlier as the symbol.
impala-shell> use udfs;
create function to_md5(string) returns string location '/user/hive/warehouse/hive_jars/hive-exec.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFMd5';
There you go; you can now use it like any Impala function:
select to_md5('test');
+----------------------------------+
| udfs.to_md5('test')              |
+----------------------------------+
| 098f6bcd4621d373cade4e832627b4f6 |
+----------------------------------+
show functions;
Query: show functions
+-------------+----------------------+-------------+---------------+
| return type | signature            | binary type | is persistent |
+-------------+----------------------+-------------+---------------+
| STRING      | to_md5(STRING)       | JAVA        | false         |
+-------------+----------------------+-------------+---------------+
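To confirm the original goal of a consistent hash across both engines, you can compare Hive's built-in md5() (the one described above) with the new Impala UDF. A minimal sketch, assuming the beeline connection string from earlier and the udfs database; adjust both for your cluster:
# Hive: built-in md5()
beeline -u jdbc:hive2://localhost:10000 -e "SELECT md5('test');"
# Impala: the Java UDF we just registered in the udfs database
impala-shell -q "SELECT udfs.to_md5('test');"
Both should print 098f6bcd4621d373cade4e832627b4f6, the same value shown above.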

Related

Search a table in all databases in hive

In Hive, how do we search for a table by name across all databases?
I am a Teradata user. Is there any counterpart in Hive of Teradata's system tables, like dbc.tables and dbc.columns?
You can use the SQL LIKE syntax to search for a table.
Example:
I want to search for a table whose name starts with "Benchmark", but I don't know the rest of it.
Input in HIVE CLI:
show tables like 'ben*'
Output:
+-----------------------+
| tab_name              |
+-----------------------+
| benchmark_core_month  |
| benchmark_core_qtr    |
| benchmark_core_year   |
+-----------------------+
3 rows selected (0.224 seconds)
Or you can try the command below if you are using Beeline:
!tables
Note: this works only with Beeline (the JDBC-based client).
More about beeline: http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/
You can also use HDFS to find a table across all databases.
The default path of Hive databases is:
/apps/hive/warehouse/
So, using HDFS:
hdfs dfs -find /apps/hive/warehouse/ -name 't*'
You should query the metastore.
You can find the connection properties within hive-site.xml:
grep -A1 jdo $HIVE_HOME/conf/hive-site.xml
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
--
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
--
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
--
<name>javax.jdo.option.ConnectionPassword</name>
<value>cloudera</value>
Within the metastore you can use a query similar to the following:
select *
from metastore.DBS as d
join metastore.TBLS as t
  on t.DB_ID = d.DB_ID
where t.TBL_NAME like '% ... put something here ... %'
order by d.NAME, t.TBL_NAME;
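For example, a minimal sketch of running that query straight from the shell with the mysql client, using the connection values above (host, credentials, and database name come from the hive-site.xml shown here and will differ on your cluster; 'benchmark' is just an example pattern):
# query the metastore directly; note there is no space after -p by design
mysql -h 127.0.0.1 -u hive -pcloudera metastore \
  -e "SELECT d.NAME, t.TBL_NAME FROM DBS d JOIN TBLS t ON t.DB_ID = d.DB_ID WHERE t.TBL_NAME LIKE '%benchmark%' ORDER BY d.NAME, t.TBL_NAME;"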
Searching for tables with names containing 'infob' across all Hive databases:
for i in `hive -e "show schemas"`; do echo "Hive DB: $i"; hive -e "use $i; show tables"|grep "infob"; done
Hive stores all its metadata in the Metastore. The Metastore schema can be found here: https://issues.apache.org/jira/secure/attachment/12471108/HiveMetaStore.pdf
It has tables like DBS for databases and TBLS for tables, plus tables for columns. You can use the appropriate joins to find table or column names.
@hisi's answer is elegant. However, it caused an out-of-memory error during GC on our cluster, so here is another, less elegant approach that works for me.
Let foo be the table name to search for. Then:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/foo$'
If you don't remember the exact name of the table but only a substring bar of it, then the command is:
hadoop fs -ls -R -C /apps/hive/warehouse/ 2>/dev/null | grep '/apps/hive/warehouse/[^/]\{1,\}/[^/]\{1,\}$' | grep bar
This is an extension of Mantej Singh's answer: you can use PySpark to find tables across all Hive databases (not just one):
from functools import reduce
from pyspark import SparkContext
from pyspark.sql import DataFrame, HiveContext
sc = SparkContext()
sqlContext = HiveContext(sc)
dbnames = [row.databaseName for row in sqlContext.sql('SHOW DATABASES').collect()]
tnames = []
for dbname in dbnames:
    tnames.append(sqlContext.sql('SHOW TABLES IN {} LIKE "%your_pattern%"'.format(dbname)))
tables = reduce(DataFrame.union, tnames)
tables.show()
The approach is to iterate through the databases, searching for tables with the specified name.
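If you want to run the snippet above outside an interactive shell, a minimal sketch, assuming it is saved to a hypothetical file find_hive_tables.py on a node with Spark configured for the cluster:
# client mode, so tables.show() prints to the console
spark-submit --master yarn find_hive_tables.py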

Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS have no corruption after they are copied. But is checking the checksum necessary?
I read that the client computes checksums before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, checksum generated on the HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content. (The likely explanation is that hdfs dfs -checksum returns an MD5-of-MD5-of-CRC value that also depends on the block size and bytes-per-checksum settings the file was written with, not only on its bytes.)
Checksum for a file can be calculated using hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
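If what you actually want to check is that the content matches (rather than to interpret HDFS's composite checksum format), a minimal sketch reusing the paths from the example above is to stream the HDFS copy through md5sum and compare it with the local file:
# both commands should print the same 32-character digest if the content matches
md5sum /path/in/linux/file1
hdfs dfs -cat hdfs://nn1.example.com/file1 | md5sum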
I wrote a library with which you can calculate the checksum of a local file just the way Hadoop does it for HDFS files, so you can compare the checksums to cross-check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via the API:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(0)
HDFS does CRC checks. For each and every file it creates a .crc file to make sure there is no corruption.
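As a minimal sketch of seeing those checksum files yourself (the paths are hypothetical), hadoop fs -get accepts a -crc flag that also writes the .crc file next to the downloaded copy:
hadoop fs -get -crc /project1/file.txt /tmp/
ls -a /tmp/ | grep crc    # the sidecar is typically named .file.txt.crc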

Collecting Parquet data from HDFS to local file system

Given a Parquet dataset distributed on HDFS (a metadata file plus many .parquet part files), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work: it merges the metadata with the actual parquet files.
There is a way involving the Apache Spark APIs, which provides a solution, but a more efficient method without third-party tools may exist.
spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")
bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath
Since Spark 1.4 it's better to use DataFrame::coalesce(1) instead of DataFrame::repartition(1)
You may use Pig:
A = LOAD '/path/to parquet/files' USING parquet.pig.ParquetLoader as (x,y,z) ;
STORE A INTO 'xyz path' USING PigStorage('|');
You may create an Impala table on top of it, and then use:
impala-shell -e "query" -o <output>
In the same way, you may use MapReduce as well.
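For example, a minimal sketch of the impala-shell route (the table name and output path are hypothetical); -B switches to plain delimited output so the file isn't cluttered with pretty-printed table borders:
impala-shell -B --output_delimiter='\t' -e "SELECT * FROM my_parquet_table" -o /local/path/data.tsv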
You may use parquet-tools:
java -jar parquet-tools.jar merge source/ target/

Verifying checksum for files in HDFS

I'm using WebHDFS to ingest data from the local file system to HDFS. Now I want to ensure the integrity of the files ingested into HDFS.
How can I make sure the transferred files are not corrupted/altered, etc.?
I used the WebHDFS command below to get the checksum of a file:
curl -i -L --negotiate -u: -X GET "http://$hostname:$port/webhdfs/v1/user/path?op=GETFILECHECKSUM"
How should I use the above checksum to ensure the integrity of the ingested files? Please suggest.
Below are the steps I'm following:
>md5sum locale_file
740c461879b484f4f5960aa4f67a145b
>hadoop fs -checksum locale_file
locale_file MD5-of-0MD5-of-512CRC32C 000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe
The checksum of the file on the local system is different from that of the same file on HDFS. I need to compare the checksums; how can I do that?
One way to do that is to calculate the checksum locally and then match it against the Hadoop checksum after you ingest the file.
I wrote a library to calculate the checksum locally for this, in case anybody is interested:
https://github.com/srch07/HDFSChecksumForLocalfile
Try this
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
Refer to the following link for full information:
https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_File_Checksum
It can be done from the console like below
$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b
$ hadoop fs -cat locale_file |md5sum -
740c461879b484f4f5960aa4f67a145b -
You can also verify the local file via code:
import java.io._
import org.apache.commons.codec.digest.DigestUtils
val md5sum = DigestUtils.md5Hex(new FileInputStream("locale_file"))
And for the Hadoop side:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val md5sum = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("locale_file"))).toString

Hive - Possible to get total size of file parts in a directory?

Background:
I have some gzip files in a HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz and so on.
Aim:
I want to build a hive script which produces a table with the columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.
To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. The sum will be the value in Column 2 and the date in Column 1.
Is this possible? Are there any in-built functions or UDFs that could help me with my use case?
Thanks in advance!
A MapReduce job for this doesn't seem efficient, since you don't actually have to load any data. Plus, doing this seems kind of awkward in Hive.
Can you write a bash script or a Python script, or something like that, to parse the output of hadoop fs -ls? I'd imagine something like this (see the sketch after the command):
$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt
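As a minimal sketch of that idea, here is an awk version of the hypothetical datecount.py step. It assumes the usual hadoop fs -ls layout, where column 5 is the size in bytes and column 8 is the path, and that file names start with the yyyy-mm-dd prefix described above:
hadoop fs -ls mydir/*gz | awk '$8 ~ /\.gz$/ {
    n = split($8, parts, "/")          # keep only the file name from the full path
    d = substr(parts[n], 1, 10)        # the yyyy-mm-dd prefix of the file name
    sum[d] += $5                       # column 5 is the file size in bytes
} END {
    for (d in sum) print d "\t" sum[d]
}' | hadoop fs -put - counts.txt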
