Specific Column Dump from Parquet File using Parquet-tools.jar

I want to dump only a specific column to a text file using parquet-tools-1.8.1.jar, but I am not able to do so. I am trying the command below. Please note that my column name contains forward slashes.
parquet-tools-1.8.1.jar dump --column 'dir1/log1/job12121' '/hdfs-path/to/parquet file with space.parquet' > /home/local/parquet/output.text

Run
hadoop jar parquet-tools-1.8.1.jar parquet.tools.Main dump --column 'dir1/log1/job12121' '/hdfs-path/to/parquet file with space.parquet' > /home/local/parquet/output.text

Please use the following:
hadoop jar parquet-tools-1.8.1.jar dump -c dir1 log1 job12121 -m /hdfs-path/to/parquet file with space.parquet >> /home/local/parquet/output.text
Note: no single quotes around the input arguments.
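Since the file name contains spaces and quotes are discouraged above, a variant of the same command with the spaces backslash-escaped may be needed if the shell splits the path into separate arguments (a sketch, not verified against parquet-tools argument parsing):
hadoop jar parquet-tools-1.8.1.jar dump -c dir1 log1 job12121 -m /hdfs-path/to/parquet\ file\ with\ space.parquet >> /home/local/parquet/output.text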

Related

How to add sysdate from bcp

I have a .csv file with the following sample data format:
REFID|PARENTID|QTY|DESCRIPTION|DATE
AA01|1234|1|1st item|null
AA02|12345|2|2nd item|null
AA03|12345|3|3rd item|null
AA04|12345|4|4th item|null
To load the above file into a table, I am using the BCP command below:
/bcp $TABLE_NAME in $FILE_NAME -S $DB_SERVER -t "|" -F 1 -U $DB_USERNAME -d $DB_NAME
What I am trying to get is the output below (adding sysdate instead of null from bcp):
AA01|1234|1|1st item|3/16/2020
AA02|12345|2|2nd item|3/16/2020
AA03|12345|3|3rd item|3/16/2020
AA04|12345|4|4th item|3/16/2020
Update: I was able to exclude the header with @Jamie's answer via the -F option, but I'm still looking for help on inserting the date with bcp. I've tried looking through some old Q&As, but no luck so far.
To exclude a single header record, you can use the -F option. This tells BCP which line in the file is the first line to begin loading from. For your sample, -F2 should work fine. However, your command has other issues; see comments.
There is no way to introduce new data using the BCP command as you stated; BCP cannot introduce a date value while copying data into your table. To accomplish this, I suggest either a default on your date column, or first loading the raw data into a table without the date column and then introducing the date value as you see fit in later processing.
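If preprocessing the file is an option, a minimal shell sketch along those lines: rewrite the literal null in the last field before loading (file names are illustrative; %-m/%-d/%Y assumes GNU date and matches the unpadded 3/16/2020 format above):
# replace the trailing null with today's date, then bulk-load the rewritten file
sed "s,|null$,|$(date +%-m/%-d/%Y)," $FILE_NAME > ${FILE_NAME}.dated
bcp $TABLE_NAME in ${FILE_NAME}.dated -S $DB_SERVER -t "|" -F 2 -U $DB_USERNAME -d $DB_NAME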

Hadoop : Using Pig to add text at the end of every line of a hdfs file

We have files in HDFS with raw logs, where each individual log is one line, as the logs are line-separated.
Our requirement is to add a text (' 12345', for example) at the end of every log in these files, using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks,
AJ
Load the files so that each log entry is loaded into one field, i.e. line:chararray, and use CONCAT to add the text to each line. Store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into its own new file instead of using a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt
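If you need one output file per input file instead, a sketch that loops over the matches (paths illustrative; the grep drops the "Found N items" header from the listing):
for f in $(hdfs dfs -ls /user/hdfs/logs/*.log | awk '{print $NF}' | grep '\.log$'); do
  # append the text to every line of this file and write a sibling output file
  hdfs dfs -cat "$f" | sed 's/$/ 12345/' | hdfs dfs -put - "${f}.new"
done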

how to extract gzipped file in a different directory

I am currently using this command to restore a gzipped file into MySQL:
C:\...directory of gzip.exe >gunzip -c filename.gz | mysql -u.. -p.. -P.. -h dbname
I would like to extract files that are located in a directory different from the one where gzip.exe is located.
How should I modify the command?
Change to the directory containing the .gz file and either 1) specify the path to gunzip.exe or 2) add the directory containing gunzip.exe to your PATH variable.
"C:\path\to\gunzip.exe" -c filename.gz | mysql -u.. -p.. -P.. -h dbname

How do I pipe a file into an encrypted, password protected zip file, then delete the original file, in Windows batch?

I am attempting to export some database data using the BCP Utility.
Here is my batch command so far:
BCP [table] out [file] -c -T -S [server] -t"¶" | 7z.exe a -si [archive name] -sdel
The BCP part works just fine:
BCP [table] out [file] -c -T -S [server] -t"¶"
However, for the 7-Zip part:
7z.exe a -si [archive name] -sdel
It works to a point: the original file is not removed, and I'd also like to encrypt the archive with 128-bit or 256-bit encryption and a password.
Any suggestions?
I found a workaround with a small VB.NET script.
The script takes in a table name, runs BCP into a text file, runs 7-Zip with encryption options (https://sevenzip.osdn.jp/chm/cmdline/switches/method.htm#Zip) and a password, then deletes the original text file.
These commands are run using the Process() object functions.
That way I can loop through the tables I need placed in files easily.
It is not the Windows batch answer I was looking for, but it works.
Any other suggestions are still welcome.
Thanks!
BCP .... | 7z u -sidirData -pMyPassword -mhe outputFile.7z
              ^ ^          ^            ^    ^______________ The file that will be generated
              | |          |            |___________________ Encrypt file names
              | |          |________________________________ Password used for encryption
              | |___________________________________________ Name of stored file
              |_____________________________________________ update/create container file
Note that there are no spaces between the switches and the values
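If piping BCP into 7-Zip keeps failing the delete (-sdel removes named input files after they are added, which stdin input via -si cannot provide), a two-step batch sketch using the zip container with AES-256 (file name and password illustrative):
BCP [table] out data.txt -c -T -S [server] -t"¶"
7z.exe a -tzip -mem=AES256 -pMyPassword archive.zip data.txt -sdel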

Verifying checksum for files in HDFS

I'm using WebHDFS to ingest data from the local file system into HDFS. Now I want to ensure the integrity of the files ingested into HDFS.
How can I make sure the transferred files are not corrupted/altered etc.?
I used the WebHDFS command below to get the checksum of the file:
curl -i -L --negotiate -u: -X GET "http://$hostname:$port/webhdfs/v1/user/path?op=GETFILECHECKSUM"
How should I use the above checksum to ensure the integrity of the ingested files? Please suggest.
Below are the steps I'm following:
>md5sum locale_file
740c461879b484f4f5960aa4f67a145b
>hadoop fs -checksum locale_file
locale_file MD5-of-0MD5-of-512CRC32C 000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe
The checksum of the file on the local system is different from that of the same file on HDFS. I need to compare the checksums; how can I do that?
One way to do that is to calculate the checksum locally and then match it against the Hadoop checksum after you ingest the file.
I wrote a library to calculate the checksum locally for this, in case anybody is interested:
https://github.com/srch07/HDFSChecksumForLocalfile
Try this
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
Refer to the following link for full information:
https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_File_Checksum
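The response is a JSON body with the algorithm name and checksum bytes, roughly of this shape (values reused from the example above):
{"FileChecksum":{"algorithm":"MD5-of-0MD5-of-512CRC32C","bytes":"000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe","length":28}}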
It can be done from the console as shown below:
$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b
$ hadoop fs -cat locale_file |md5sum -
740c461879b484f4f5960aa4f67a145b -
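Scripted, the same comparison can gate an ingest job; a minimal sketch (file name as above):
LOCAL_MD5=$(md5sum locale_file | awk '{print $1}')
HDFS_MD5=$(hadoop fs -cat locale_file | md5sum | awk '{print $1}')
# fail fast on mismatch so a calling job can abort
[ "$LOCAL_MD5" = "$HDFS_MD5" ] || { echo "checksum mismatch" >&2; exit 1; }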
You can also verify the local file via code:
import java.io._
import org.apache.commons.codec.digest.DigestUtils
// hash the file contents rather than the literal path string
val md5sum = DigestUtils.md5Hex(new FileInputStream("locale_file"))
and for the file on HDFS:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
// stream the HDFS file and hash its contents
val conf = new Configuration()
val md5sum = MD5Hash.digest(FileSystem.get(conf).open(new Path("locale_file"))).toString
