My Sqoop import job creates more files than the number of mappers provided in the command - sqoop

I am specifying 50 mappers in the Sqoop command, but the result contains 54 files; my expectation was to get 50 files.
The other issue is that one of the files is corrupted, and that corrupted file reports a record count equal to the total number of records in the remaining files. However, I am not able to view the data in the corrupted Parquet file.
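For reference, the command is of this shape (the connection string, table, and target directory below are placeholders, not the real job):
sqoop import \
    --connect jdbc:mysql://db.example.com/mydb \
    --table orders \
    --as-parquetfile \
    --target-dir /data/orders \
    -m 50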

Related

Vertica performance degradation while loading Parquet files versus delimited files from S3 to Vertica

I have Parquet files for 2 billion records with GZIP compression, and the same data with Snappy compression. I also have delimited files for the same 2 billion records. We have 72 Vertica nodes in AWS prod, and we are seeing a huge performance degradation for Parquet files compared with delimited files when moving data from S3 to Vertica with the COPY command: Parquet takes 7x more time than the delimited files, even though the delimited files are 50x larger than the Parquet ones.
Below are the stats for the test we conducted.
Total file sizes:
Parquet GZIP - 6 GB
Parquet Snappy - 9.2 GB
Delimited - 450 GB
Below are the COPY commands used for both Parquet and delimited files. We did see an improvement of about 2 minutes when we removed NO COMMIT from the COPY query.
Delimited files
COPY schema.table1 ( col1,col2,col3,col4,col5 ) FROM 's3://path_to_delimited_s3/*' DELIMITER E'\001' NULL AS '\N' NO ESCAPE ABORT ON ERROR DIRECT NO COMMIT;
Parquet files
COPY schema.table2 (col1,col2,col3,col4,col5 ) FROM 's3://path_to_parquet_s3/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT;
We are surprised to see this slowdown with Parquet files. Is this expected for a Parquet COPY? Any pointers or thoughts would be really helpful.
Thanks
Without knowing anything more, it's difficult to answer. You should, again, monitor LOAD_STREAMS to find out what's going on.
One reason could be that the various files in s3://path_to_delimited_s3/* for the CSV version are optimally distributed among the nodes of your load process, enhancing the parallelism considerably.
To count the number of parsing threads: while your load is running, find your running entry in LOAD_STREAMS (WHERE is_executing ...), and then join it with LOAD_SOURCES - USING(transaction_id, statement_id).
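For example, a minimal sketch of that check using the vertica-python client (connection values are placeholders; run it from a second session while the COPY is executing):

# Counts the parsing threads of the currently executing load by joining
# LOAD_STREAMS to LOAD_SOURCES, as described above. Sketch only.
import vertica_python

conn_info = {"host": "vertica-node1", "port": 5433,  # placeholders
             "user": "dbadmin", "password": "secret", "database": "mydb"}

query = """
    SELECT COUNT(*) AS parsing_threads
    FROM v_monitor.load_streams ls
    JOIN v_monitor.load_sources src USING (transaction_id, statement_id)
    WHERE ls.is_executing
"""

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    cur.execute(query)
    print(cur.fetchone()[0])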

NiFi job to execute a spark-submit command not giving correct results

I have Spark code that appends data from a Hive table to Parquet files partitioned by date. The code runs absolutely correctly when executed from the spark shell, and the Parquet files show exactly the same number of rows as the Hive table for the corresponding date.
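A simplified PySpark sketch of what the code does (the actual job is compiled into a jar; the table, column, and path names below are placeholders):

from pyspark.sql import SparkSession

# Read one day's data from a Hive table and append it to a
# date-partitioned Parquet output, as described above.
spark = (SparkSession.builder
         .appName("hive-to-parquet-append")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM mydb.events WHERE load_date = '2018-07-31'")
(df.write
   .mode("append")
   .partitionBy("load_date")
   .parquet("/data/warehouse/events_parquet"))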
However, when the same code is packaged into a jar file that is called by a spark-submit command, and the spark-submit command is scheduled via NiFi to execute daily at 9 AM, the number of rows in the Parquet partition files comes out lower. We are on the P_NO_SLA queue, and below are some of the facts and observations we have:
•Data in the source Hive table gets updated by approximately 4 AM.
•Initially our NiFi job was scheduled to start at 4:45 AM, but the number of records did not match. On doing a manual run from the spark shell after 6 AM, the data was an exact match.
•Hence, we rescheduled the job to run at 7 AM. After this, on days when the number of records was small (approx. 20,000 on weekends, compared with 150,000 to >200,000 records on weekdays), the data got updated correctly via the NiFi job; otherwise, a manual run was again done to backfill the missing data.
•We then postponed the job to 9 AM. After doing this, there were 2 days when the number of records matched (between 160,000 and 200,000); however, since Jul-31 the data hasn't matched at all, irrespective of the number of records on any given day, and we are having to do a manual backfill every day.
We are unable to figure out any specific reason why the code runs correctly from the spark shell at any time, but gives incorrect results when NiFi is simply scheduled to execute the spark-submit command that runs the jar file containing the same Spark code.
Please help me understand why this is happening and how I can fix it.
P.S.: I have checked the NiFi log files and could not find any of the scheduled jobs reporting an error.

How to extract 5K files out of Hive tables

I have a use case where all my 4 TB of data is in HBase tables that I query through Hive tables.
Now I want to extract 5K files out of the 30 tables that I have created in Hive.
These 5K files will be created by 5K predefined queries.
Can somebody suggest what approach I should follow for this?
The time allowed for this is 15 hours.
Should I write Java code to generate all these files?
File generation is fast: out of the 5K text files, there are 50 files that take around 35 minutes; all the rest are created very quickly.
I have to generate zipped files and send them to the client using FTP.
If I understand your question correctly, you can accomplish your task by first exporting the query results via one of the methods from here: How to export a Hive table into a CSV file?, then compressing the files into a zip archive, and finally FTP'ing them. You can write a shell script to automate the process.
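In Python, a rough equivalent sketch (assumes the hive CLI is on the PATH; the query, paths, host, and credentials are all placeholders):

import ftplib
import os
import subprocess
import zipfile

def export_zip_and_send(query, csv_path, zip_path):
    # Dump the query result via the hive CLI (tab-separated by default;
    # the linked answer shows how to turn it into proper CSV).
    with open(csv_path, "w") as out:
        subprocess.run(["hive", "-e", query], stdout=out, check=True)

    # Compress the extract into a zip archive.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname=os.path.basename(csv_path))

    # Ship the archive to the client over FTP.
    with ftplib.FTP("ftp.client.example.com", "user", "secret") as ftp:
        with open(zip_path, "rb") as f:
            ftp.storbinary("STOR " + os.path.basename(zip_path), f)

export_zip_and_send("SELECT * FROM mydb.table1", "extract_0001.csv", "extract_0001.zip")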

Hadoop Avro file size concern

I have a cron job that downloads zip files (200 bytes to 1 MB) from a server on the internet every 5 minutes. If I imported the zip files into HDFS as-is, I would run into the infamous Hadoop small-file problem. To avoid the build-up of small files in HDFS, I process the text data in the zip files, convert it into Avro files, and wait 6 hours before adding the Avro files to HDFS. Using this method, I have managed to get Avro files into HDFS with a file size larger than 64 MB. The file sizes range from 50 MB to 400 MB. What I'm concerned about is what happens if the file sizes start getting into the 500 MB range or larger. Will this cause issues with Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using a splittable compression type (sequence file, Snappy, or none at all), you shouldn't face any issues from Hadoop's end.
If you would like your Avro files to be smaller, the easiest way is to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way to ensure more uniformity in file sizes is to keep a running count of the lines seen in the downloaded files and then combine and upload once a certain line threshold has been reached.
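A toy sketch of that second idea, assuming the downloads have already been unpacked to plain text (the threshold, paths, and text output below are stand-ins; a real job would write Avro):

import glob
import os

THRESHOLD = 1_000_000                      # flush after this many records
BATCH_DIR = "/data/incoming"               # unpacked text from the zip downloads
OUT_PATTERN = "/data/staging/combined-{:05d}.txt"

def roll_up():
    lines_in_batch, batch_no = 0, 0
    out = open(OUT_PATTERN.format(batch_no), "w")
    for path in sorted(glob.glob(os.path.join(BATCH_DIR, "*.txt"))):
        with open(path) as src:
            for line in src:
                out.write(line)
                lines_in_batch += 1
        # Once enough lines have accumulated, close this combined file
        # (now ready for upload to HDFS) and start the next one.
        if lines_in_batch >= THRESHOLD:
            out.close()
            batch_no += 1
            lines_in_batch = 0
            out = open(OUT_PATTERN.format(batch_no), "w")
    out.close()

roll_up()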

Concatenate thousands of files using EMR

I currently have a process which reads files from AWS S3 and concatenates them using EMR.
The input files have the following format: 1 header row and 1 data row.
Fields are comma-separated and wrapped in double-quotes.
Example:
"header-field1","header-field2","header-field3",...
"data-field1","data-field2","data-field3",...
The files vary in size between 90 and 200 bytes.
The output file has the following format:
"header-field1","header-field2","header-field3",...
"file1-data-field1","file1-data-field2","file1-data-field3",...
"file2-data-field1","file2-data-field2","file2-data-field3",...
"file3-data-field1","file3-data-field2","file3-data-field3",...
....
My current approach uses a default mapper and a single reducer to concatenate all the data rows and prepend one header row at the top of the final output file.
Because I want a single header row in the final output, I was forced to use only one reducer in my EMR job, which I feel drastically increases the run time.
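A minimal sketch of that single-reducer step as a Hadoop Streaming reducer (not my literal code; it would run with a single reducer, and the header literal is a placeholder):

#!/usr/bin/env python3
# Emit one header row, then every data row; per-file header rows
# arriving through the stream are dropped.
import sys

HEADER = '"header-field1","header-field2","header-field3"'

print(HEADER)
for line in sys.stdin:
    row = line.rstrip("\n").rstrip("\t")  # streaming may append an empty value after a tab
    if row and row != HEADER:
        print(row)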
Early tests ran great with tens of files.
However, I am trying to scale this application to run for thousands of files with the final goal of concatenating 1 million.
My current process for 1000 files is still running after 30+ minutes, which is too long.
Do you have any suggestions on how I can change my application to dramatically improve overall performance?
Thank you.
