How to extract 5k files out of HIVE tables - hadoop

I have a use case where all my 4 TB of data sits in HBase tables that I have integrated with Hive tables.
Now I want to extract 5K files out of the 30 tables that I have created in Hive.
These 5K files will be produced by 5K predefined queries.
Can somebody suggest what approach I should follow for this?
The required time for this is 15 hrs.
Should I write Java code to generate all these files?
File generation is fast. Out of the 5K text files, there are around 50 files that take about 35 minutes each; the rest are created very quickly.
I have to generate a zipped file and send it to the client using FTP.

If I understand your question correctly, you can accomplish your task by first exporting the query results via one of the methods from here: How to export a Hive table into a CSV file?, then compressing the files into a zip archive and FTP'ing them to the client. You can write a shell script to automate the process.
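For illustration, the same automation as a minimal Python sketch. It assumes the hive CLI is available on the node; the queries, file names, and FTP host/credentials are placeholders you would replace with your 5K predefined queries and the client's details:

    import subprocess
    import zipfile
    from ftplib import FTP

    # Hypothetical list of (output_name, query) pairs -- in practice these are
    # the 5K predefined queries mentioned in the question.
    queries = [
        ("report_0001.txt", "SELECT * FROM db.table1 WHERE dt = '2020-01-01'"),
    ]

    # 1. Run each query with the Hive CLI and capture the result as a text file.
    for out_name, query in queries:
        with open(out_name, "w") as out:
            subprocess.run(["hive", "-e", query], stdout=out, check=True)

    # 2. Pack all result files into a single zip archive.
    with zipfile.ZipFile("results.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for out_name, _ in queries:
            zf.write(out_name)

    # 3. Upload the archive to the client's FTP server (placeholder credentials).
    with FTP("ftp.example.com") as ftp:
        ftp.login("user", "password")
        with open("results.zip", "rb") as fh:
            ftp.storbinary("STOR results.zip", fh)

Since only about 50 of the queries are slow, you could also run several hive -e processes in parallel to stay well inside the 15-hour window.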

Related

How to efficiently update Impala tables whose files are modified very frequently

We have a Hadoop-based solution (CDH 5.15) where we are getting new files in HDFS in some directories. On top of those directories we have 4-5 Impala (2.1) tables. The process writing those files to HDFS is Spark Structured Streaming (2.3.1).
Right now, we are running some DDL queries as soon as we get the files written to HDFS:
ALTER TABLE table1 RECOVER PARTITIONS to detect new partitions (and their HDFS directories and files) added to the table.
REFRESH table1 PARTITION (partition1=X, partition2=Y), using all the keys for each partition.
Right now, these DDL statements are taking a bit too long and are getting queued up in our system, hurting the data availability of the system.
So, my question is: Is there a way to do this data incorporation more efficiently?
We have considered:
Using the ALTER TABLE .. RECOVER PARTITIONS, but as per the documentation, it only refreshes new partitions.
Tried to use REFRESH .. PARTITION ... with multiple partitions at once, but the statement syntax does not allow that.
Tried batching the queries, but the Hive JDBC driver does not support batching queries.
Shall we try to do those updates in parallel given that the system is already busy?
Any other way you are aware of?
Thanks!
Victor
Note: The way in which we know which partitions need to be refreshed is by using HDFS events, as with Spark Structured Streaming we don't know exactly when the files are written.
Note #2: The files written to HDFS are sometimes small, so it would be great if those files could be merged at the same time.
Since nobody seems to have an answer to my problem, I would like to share the approach we took to make this processing more efficient; comments are very welcome.
We discovered (the documentation is not very clear on this) that part of the information stored in the Spark "checkpoints" in HDFS is a set of metadata files describing when each Parquet file was written and how big it was:
$hdfs dfs -ls -h hdfs://...../my_spark_job/_spark_metadata
-rw-r--r-- 3 hdfs   68K 2020-02-26 20:49 hdfs://...../my_spark_job/_spark_metadata/3248
-rw-r--r-- 3 hdfs 33.3M 2020-02-26 20:53 hdfs://...../my_spark_job/_spark_metadata/3249.compact
-rw-r--r-- 3 hdfs   68K 2020-02-26 20:54 hdfs://...../my_spark_job/_spark_metadata/3250
...
$hdfs dfs -cat hdfs://...../my_spark_job/_spark_metadata/3250
v1
{"path":"hdfs://.../my_spark_job/../part-00004.c000.snappy.parquet","size":9866555,"isDir":false,"modificationTime":1582750862638,"blockReplication":3,"blockSize":134217728,"action":"add"}
{"path":"hdfs://.../my_spark_job/../part-00004.c001.snappy.parquet","size":526513,"isDir":false,"modificationTime":1582750862834,"blockReplication":3,"blockSize":134217728,"action":"add"}
...
So, what we did was:
Build a Spark Streaming Job polling that _spark_metadata folder.
We use a fileStream since it allows us to define the file filter to use.
Each entry in that stream is one of those JSON lines, which is parsed to extract the file path and size.
Group the files by the parent folder (which maps to each Impala partition) they belong to.
For each folder:
Read a dataframe loading only the targeted Parquet files (to avoid race conditions with the other job writing the files)
Calculate how many blocks to write (using the size field in the JSON and a target block size)
Coalesce the dataframe to the desired number of partitions and write it back to HDFS
Execute the DDL REFRESH TABLE myTable PARTITION ([partition keys derived from the new folder])
Finally, delete the source files
What we achieved is:
Limit the number of DDL statements, by doing one refresh per partition and per batch.
By having batch time and block size configurable, we are able to adapt our product to different deployment scenarios with bigger or smaller datasets.
The solution is quite flexible, since we can assign more or less resources to the Spark Streaming job (executors, cores, memory, etc.) and also we can start/stop it (using its own checkpointing system).
We are also studying the possibility of applying some data repartitioning while doing this process, to get partitions as close as possible to the optimum size.
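For illustration, here is a rough PySpark-flavored sketch of the per-folder compaction and refresh step described above. The run_impala_ddl helper, the target block size, and the output location are hypothetical placeholders, and the original job is driven by a streaming fileStream rather than this standalone function:

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-and-refresh").getOrCreate()

    TARGET_BLOCK_SIZE = 128 * 1024 * 1024  # desired output file size in bytes (placeholder)

    def run_impala_ddl(stmt):
        """Hypothetical helper that sends a DDL statement to Impala
        (e.g. through impyla or impala-shell); not shown here."""
        print("Would execute:", stmt)

    def process_metadata_lines(lines):
        """`lines` are the JSON entries read from _spark_metadata files."""
        # Parse path and size from every "add" entry (the "v1" header line is skipped).
        entries = [json.loads(l) for l in lines if l.startswith("{")]
        files = [(e["path"], e["size"]) for e in entries if e.get("action") == "add"]

        # Group files by their parent folder, which maps to one Impala partition.
        by_folder = {}
        for path, size in files:
            folder = path.rsplit("/", 1)[0]
            by_folder.setdefault(folder, []).append((path, size))

        for folder, items in by_folder.items():
            paths = [p for p, _ in items]
            total_size = sum(s for _, s in items)
            num_blocks = max(1, total_size // TARGET_BLOCK_SIZE)

            # Read only the targeted Parquet files, coalesce and write back to HDFS.
            df = spark.read.parquet(*paths)
            (df.coalesce(num_blocks)
               .write.mode("append")
               .parquet(folder + "_compacted"))  # placeholder output location

            # Refresh the corresponding Impala partition (keys derived from the folder).
            run_impala_ddl("REFRESH myTable PARTITION (...)")  # placeholder keys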

Small files in hadoop

I am trying to combine small files on HDFS. This is simply for historical purposes; if needed, the large file(s) would be disassembled and run through the process to create the data for the Hadoop table. Is there a way to achieve this simply? For example, on day one receive 100 small files and combine them into one file, then on day two add/append more files to the previously created file, etc...
If the files all have the same "schema" (let's say CSV or JSON), then you're welcome to write a very basic Pig / Spark job to read a whole folder of tiny files and write it back out somewhere else, which will very likely merge all the files into larger sizes based on the HDFS block size.
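For example, a minimal PySpark sketch of such a merge job, with placeholder paths and a placeholder output file count:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

    # Read the whole folder of tiny CSV files at once (placeholder path, no header assumed).
    df = spark.read.option("header", "false").csv("hdfs:///data/incoming/2020-01-01/")

    # Write the data back out as a small number of larger files; Spark produces one
    # output file per partition, so coalesce controls the resulting file count.
    df.coalesce(4).write.mode("overwrite").csv("hdfs:///data/merged/2020-01-01/")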
You've also mentioned Hive, so use an external table for the small files, and use a CTAS query to create a separate table, thereby creating a MapReduce job, much the same as Pig would do.
IMO, if possible, the optimal solution is to set up a system "upstream" of Hadoop, which will batch your smaller files into larger files, and then dump them out to HDFS. Apache NiFi is a useful tool for this purpose.

I have a CSV file with locations; I need to move S3 files to new locations

I am interested in loading my data into an AWS Athena DB.
My data is compartmentalized by source_video, and in each one we have 11 CSV files that represent 11 tables referencing this data.
Athena wants to load by table and not by source_video,
so for this I have to move these files to folders based on the table name and not on source_video.
I am fluent in Python and bash,
and I know how to use the AWS CLI.
I wish to know if there is maybe an easier way than running 4 million+ mv commands and executing them in different processes in parallel on several machines.
I have a CSV file that has the locations of files stored as children of the source_video they were created for:
I have 400,000+ source_video locations
I have 11 files in each source_video location
i.e.
+source_video1
- 11 files by type
+source_video2
- 11 files by type
+source_video3
- 11 files by type
.
.
+source_video400,000+
- 11 files by type
I wish to move them to 11 folders (one per file type), with 400,000+ files in each folder.
fields: videoName, CClocation, identityLocation, TAGTAskslocation, M2Location
and other locations ....
Below is an example of 2 rows of data:
pj1/09/11/09/S1/S1_IBM2MP_0353_00070280_DVR1.avi,
S1_IBM2MP_0353_00070280_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCs2extendeddata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_AL1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCsdata.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_Cg1CCfeat.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0353_00070280_DVR1_1021866/S1_IBM2MP_0353_00070280_DVR1.avi_CCsidentitiestaggers.csv
pj1/09/11/09/S1/S1_IBM2MP_0443_00070380_DVR1.avi,
S1_IBM2MP_0443_00070380_DVR1.avi,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentities.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsTAGtasks.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2data.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCs2extendeddata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_AL1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsALGsruns.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCsdata.csv, s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_Cg1CCfeat.csv,
s3:/bucket1/DB2/pj1/09/11/09/S1_IBM2MP_0443_00070380_DVR1_307638/S1_IBM2MP_0443_00070380_DVR1.avi_CCsidentitiestaggers.csv
You are correct. Athena expects all files related to one table to be located in one directory, or in subdirectories of one directory.
Given that you are going to touch so many files, you could choose to process the files rather than simply moving them. For example, you could put the contents of several files into a smaller number of files. You could also consider zipping the files, because this would cost you less to scan (Athena is charged based upon the data read from disk -- zipped files mean less data read and therefore lower cost).
See: Analyzing Data in S3 using Amazon Athena
This type of processing could be done efficiently on an Amazon EMR cluster that runs Hadoop, but some specialist knowledge is required to run Hadoop, so it might be easier to use the coding with which you are familiar (e.g. Python).
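If you do end up reorganizing the objects with Python, here is a hedged boto3 sketch of the copy step. The destination bucket/prefix, the CSV column layout, and the table-name extraction are assumptions based on the example rows, not a definitive implementation:

    import csv
    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    DEST_BUCKET = "bucket1"   # assumed destination bucket
    DEST_PREFIX = "athena/"   # assumed prefix under which the 11 per-table folders live

    def table_name_from_key(key):
        # Assumes the table is encoded in the file suffix, e.g. "..._CCsidentities.csv".
        return key.rsplit(".avi_", 1)[-1].replace(".csv", "")

    def copy_one(s3_url):
        # s3_url looks like "s3://bucket1/DB2/.../file.csv"
        path = s3_url.split(":", 1)[1].lstrip("/")
        bucket, key = path.split("/", 1)
        table = table_name_from_key(key)
        filename = key.rsplit("/", 1)[-1]
        # Server-side copy; no data is downloaded to the machine running the script.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=f"{DEST_PREFIX}{table}/{filename}",
            CopySource={"Bucket": bucket, "Key": key},
        )

    with open("locations.csv") as fh:
        rows = list(csv.reader(fh))

    # Every cell that looks like an S3 URL is one of the 11 per-table files (assumption).
    urls = [cell.strip() for row in rows for cell in row if cell.strip().startswith("s3:")]

    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(copy_one, urls))

Because copy_object is a server-side operation, a single machine with a thread pool can get through millions of objects without moving any data through the client.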

What is the best place to store multiple small files in hadoop

I will be having multiple small text files around 10 KB in size, and I am confused about where to store those files: in HBase or in HDFS. What will be the optimized storage?
Because to store them in HBase I need to parse them first and then save them against some row key.
In HDFS I can directly create a path and save the file at that location.
But whatever I have read till now says that you should not have many small files, and should instead create fewer big files.
But I cannot merge those files, so I can't create a big file out of the small files.
Kindly suggest.
A large number of small files doesn't fit very well with Hadoop, since each file occupies at least one HDFS block and each block requires one Mapper to process it by default.
There are several options/strategies to minimize the impact of small files; all of them require processing the small files at least once and "packaging" them in a better format. If you are planning to read these files several times, pre-processing the small files could make sense, but if you will use those files just once then it doesn't matter.
To process small files my suggestion is to use CombineTextInputFormat (here is an example): https://github.com/lalosam/HadoopInExamples/blob/master/src/main/java/rojosam/hadoop/CombinedInputWordCount/DriverCIPWC.java
CombineTextInputFormat uses one Mapper to process several files, but it could require transferring the files to a different DataNode in order to put them together on the DataNode where the map is running. It can also perform badly with speculative tasks, but you can disable those if your cluster is stable enough.
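If you happen to be on Spark rather than plain MapReduce, the same idea can be expressed through newAPIHadoopFile; a hedged sketch, with placeholder path and split size (the linked example itself is a Java MapReduce driver):

    from pyspark import SparkContext

    sc = SparkContext(appName="combine-small-files")

    # Limit how much data one combined input split (and therefore one task) may hold.
    conf = {"mapreduce.input.fileinputformat.split.maxsize": str(128 * 1024 * 1024)}

    # CombineTextInputFormat packs many small files into few input splits,
    # so far fewer tasks are launched than with the default TextInputFormat.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///data/small_files/",
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=conf,
    )

    # rdd contains (byte offset, line) pairs; keep just the text lines.
    lines = rdd.map(lambda kv: kv[1])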
Alternatives to repackage the small files are:
Create sequence files where each record contains one of the small files. With this option you will keep the original files.
Use IdentityMapper and IdentityReducer where the number of reducers is less than the number of files. This is the easiest approach, but it requires that the lines in the files be uniform and independent (no headers or metadata at the beginning of the files that is required to understand the rest of the file).
Create an external table in Hive and then insert all the records of this table into a new table (INSERT INTO . . . SELECT FROM . . .). This approach has the same limitations as option two and requires using Hive; the advantage is that you don't need to write a MapReduce job.
If you cannot merge the files as in options 2 or 3, my suggestion is to go with option 1; a sketch of that option is shown below.
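A hedged PySpark sketch of option 1, packing a whole folder of small files into a single SequenceFile keyed by file name (the paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="pack-small-files")

    # wholeTextFiles yields (file path, file content) pairs, one per small file.
    pairs = sc.wholeTextFiles("hdfs:///data/small_files/")

    # Store the pairs in a SequenceFile; the original files stay untouched.
    pairs.saveAsSequenceFile("hdfs:///data/packed/small_files_seq")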
You could try using HAR archives: https://hadoop.apache.org/docs/r2.7.2/hadoop-archives/HadoopArchives.html
There's no problem with having many small, different files. If, for example, you have a table in Hive with many very small files in HDFS, that's not optimal; it's better to merge these files into fewer big ones, because a lot of mappers will be created when reading this table. If your files are completely different, like 'apples' and 'employees', and cannot be merged, then just store them as is.

How to index all csv files in a directory with Solr?

Given a directory with hundreds of tab-delimited CSV files, each of which contains no header in the first row, we will specify the column names by other means. These files can be located on a local disk, or on HDFS.
What is the most efficient way to index these files?
If you have a lot of files, I think there are several methods to improve indexing speed:
First, if your data is on a local disk, you can build the index using multithreading, but you need to pay attention that each thread writes its own index to its own output directory. Finally, merge them into a single index to improve search speed.
Second, if your data is on HDFS, I think using Hadoop MapReduce to build the index is very powerful.
In addition, some UDF plugins for Pig or Hive can also build the index easily, but
you need to convert your data into a Hive table or define a Pig schema; this is simple!
Third, in order to better understand the above methods, maybe you can read
How to make indexing faster
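As a concrete starting point for the simple local-disk case, here is a hedged Python sketch that streams the tab-delimited, header-less CSV files to Solr's CSV update handler from a small thread pool; the Solr URL, core name, field names, and input directory are placeholders you would have to adapt:

    import glob
    import requests
    from concurrent.futures import ThreadPoolExecutor

    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # placeholder core
    FIELD_NAMES = "id,title,body"                                 # placeholder column names

    def index_file(path):
        # header=false plus fieldnames handles files with no header row;
        # the tab separator is URL-encoded to %09 by requests.
        params = {
            "separator": "\t",
            "header": "false",
            "fieldnames": FIELD_NAMES,
        }
        with open(path, "rb") as fh:
            resp = requests.post(
                SOLR_UPDATE_URL,
                params=params,
                data=fh,
                headers={"Content-Type": "application/csv"},
            )
        resp.raise_for_status()

    files = glob.glob("/data/csv/*.csv")  # placeholder input directory
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(index_file, files))

    # A single commit at the end instead of one commit per file.
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"}).raise_for_status()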
