Copy large datasets from Hive to local directory - hadoop

I'm trying to copy data from a Hive table to my local directory.
The command I am using is:
nohup hive -e "set hive.cli.print.header=true; set hive.resultset.use.unique.column.names=false; select * from sample_table;" | sed 's/[\t]/|/g' > /home/sample.txt &
The issue is that the file will be around 400 GB and the process takes forever to complete.
Is there a better way to do it, such as compressing the file as it is being generated?
I need the data as a .txt file, but I'm not able to find a quick workaround for this problem.
Any smart ideas would be really helpful.

Have you tried the -getmerge option of the hadoop command? That's typically what I use to merge Hive text tables and export them to a local share drive.
hadoop fs -getmerge ${SOURCE_DIR}/table_name ${DEST_DIR}/table_name.txt
I think the sed command would also be slowing things down significantly. If you do the character replacement in Hive prior to extracting the data, that would be faster than a single-threaded sed command running on your edge node.
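Putting both ideas together, something like the following might work. This is only a sketch: the staging directory is a placeholder, and it assumes a Hive version that supports ROW FORMAT in INSERT OVERWRITE DIRECTORY (0.11+).
# Let Hive write pipe-delimited text files in parallel (no sed needed);
# /tmp/sample_table_export is a placeholder HDFS staging directory.
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/sample_table_export'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
         SELECT * FROM sample_table;"
# Merge the HDFS part files into a single local .txt file.
hadoop fs -getmerge /tmp/sample_table_export /home/sample.txt
Note that, unlike the original hive -e pipeline, this won't emit the header row from hive.cli.print.header=true, so you would have to prepend it yourself if you need it.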

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work between the local file system and HDFS (in either direction), not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce code to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed file system and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being restricted to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
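For example, something along these lines (a rough sketch; the jar, driver class and paths are placeholders, and the -D option only takes effect if the driver uses ToolRunner/GenericOptionsParser):
# Produce a single output file by forcing one reducer
# (on Hadoop 1.x the property is mapred.reduce.tasks instead).
hadoop jar myjob.jar com.example.MyDriver -D mapreduce.job.reduces=1 /input /output
# Or, if a local copy is acceptable after all, merge the part files on the way down.
hadoop fs -getmerge /output /tmp/output_merged.txt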
There is no easy way to do this directly in HDFS, but the trick below works. It is not a scalable solution, but it should be fine as long as the output is not huge (the - tells put to read from stdin):
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename

Scheduled data load into Hadoop

Just wondering what the best way is to bulk load data from various sources into HDFS, mainly from FTP locations / file servers, at scheduled times with a regular frequency.
I know the Sqoop / Oozie combination can be used for RDBMS data. However, I'm wondering what the best way is to load unstructured data into HDFS with a scheduling mechanism.
You can do it with shell scripting; I can guide you with some code:
hadoop fs -cp ftp://uname:password@ftp2.xxxxa.com/filename hdfs://IPofhdfs/user/root/Logs/
Some points:
1. Find the new files in the FTP source folder by comparing the filenames against the HDFS destination.
2. Pass each new filename to the HDFS copy command (see the sketch after the FTP script below).
--- list all the files in the FTP folder and store the listing in AllFiles.txt ---
ftp -in ftp2.xxxx.com << SCRIPTEND
user Luname pass
lcd /home/Analytics/TempFiles
ls > AllFiles.txt
binary
quit
SCRIPTEND
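From there, steps 1 and 2 above could look roughly like this. It is only a sketch: the script name, paths and credentials are placeholders, and it reuses the hadoop fs -cp ftp:// form from the command above.
#!/bin/bash
# Sketch: copy any file listed in AllFiles.txt that is not yet in HDFS.
DEST=hdfs://IPofhdfs/user/root/Logs
while read f; do
    # -test -e returns 0 if the path already exists in HDFS
    if ! hadoop fs -test -e "$DEST/$f"; then
        hadoop fs -cp "ftp://uname:password@ftp2.xxxx.com/$f" "$DEST/"
    fi
done < /home/Analytics/TempFiles/AllFiles.txt
For the scheduling part, a cron entry such as the one below (a placeholder path) would run the whole thing every night at 2am:
0 2 * * * /home/Analytics/scripts/ftp_to_hdfs.sh >> /var/log/ftp_to_hdfs.log 2>&1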
Let me know if you need any more info.

Strange issue running hiveql using -e option from .sh file

I have checked Stack Overflow but could not find any help, which is why I am posting a new question.
The issue is related to executing HiveQL using the -e option from a .sh file.
If I run Hive as $ bin/hive, everything works fine and all databases and tables are displayed properly.
If I run Hive as $ ./hive, $ hive (as set in the PATH variable), or $HIVE_HOME/bin/hive, only the default database is displayed, without any table information.
I am learning Hive and am trying to execute a Hive command using $HIVE_HOME/bin/hive -e from a .sh file, but it always gives 'database not found'.
So I understand it is something related to reading the metadata, but I cannot understand why it behaves this way.
However, hadoop commands work fine from anywhere.
Below is one command I am trying to execute from the .sh file:
$HIVE_HOME/bin/hive -e 'LOAD DATA INPATH hdfs://myhost:8040/user/hduser/sample_table INTO TABLE rajen.sample_table'
Information:
I am using hive-0.13.0 and hadoop-1.2.1.
Can anybody please explain how to solve or work around this issue?
Can you correct the query first? Hive expects the path in a LOAD statement to be enclosed in quotes.
Try this first from the shell:
$HIVE_HOME/bin/hive -e "LOAD DATA INPATH '/user/hduser/sample_table' INTO TABLE rajen.sample_table"
Or put your command in a test.hql file and test it with $ hive -f test.hql:
--test.hql
LOAD DATA INPATH '/user/hduser/sample_table' INTO TABLE rajen.sample_table
I was finally able to fix the issue.
The problem was that I had kept the default embedded Derby setup for the Hive metastore, so wherever I triggered the hive -e command from, it created a new local copy of the metastore_db directory.
So I created the metastore in MySQL, which is global, and now the same metastore database is used no matter where I trigger hive -e from.
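For reference, switching to a MySQL-backed metastore mostly comes down to pointing the JDBC connection properties in hive-site.xml at MySQL instead of the embedded Derby database. A minimal sketch (host, database name and credentials are placeholders, and the MySQL JDBC driver jar has to be on Hive's classpath):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>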

Can I set pig.temp.dir to /user/USERNAME/tmp/pig?

Hive can be configured with
hive.exec.scratchdir=/user/${user.name}/tmp/hive
Can I do something similar with Pig? I have tried modifying the pig.properties file, but nothing seems to work.
pig.temp.dir=/user/${user.name}/tmp/pig <- Doesn't work
pig.temp.dir=/user/`whoami`/tmp/pig <- Doesn't work
pig.temp.dir=/user/${user}/tmp/pig <- Doesn't work
pig.temp.dir=/user/${username}/tmp/pig <- Doesn't work
I could replace the pig command with an alias, but I am hoping to have the change enshrined in the configuration file.
pig -Dpig.temp.dir=/user/`whoami`/tmp/pig
Thanks!
UPDATE: We decided to use /tmp/ for the production system. The reason this was an issue at all is that we are running MapR, which seems to try to put the temp directories into the user directory; it succeeds with Hive but not with Pig.
You can also set the pig temp dir from within a Pig script as follows:
set pig.temp.dir /user/foo/tmp/pig;
For small outputs, I think using the /tmp directory is fine, but for large outputs, I'd recommend users write to their personal directories.
Not a configuration file solution, but you can bake this into the $PIG_HOME/bin/pig script:
PIG_OPTS="$PIG_OPTS -Dpig.temp.dir=/user/`whoami`/tmp/pig"

Hadoop Pig cannot store to an existing folder

I have created a folder to drop the result file from a Pig process using the STORE command. It works the first time, but the second time it complains that the folder already exists. What is the best practice for this situation? Documentation is sparse on this topic.
My next step will be to rename the folder to the original file name, to reduce the impact of this. Any thoughts?
You can execute fs commands from within Pig, and you should be able to delete the directory by issuing an fs -rmr command before running the STORE command:
fs -rmr dir
STORE A into 'dir' using PigStorage();
The only subtlety is that the fs command doesn't expect quotes around the directory name, whereas the STORE command does.
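If I remember correctly, the Grunt shell also has an rmf command, which removes the path without failing when it doesn't exist yet, so the script works on the very first run too (a sketch using the same placeholder directory):
rmf dir
STORE A into 'dir' using PigStorage();
Like fs, rmf takes the path without quotes.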
