I'm attempting to copy a table to a file using hadoop fs -copyToLocal. The command works swimmingly, minus the fact that all my fields are merged together. Is there a way to specify a delimiter?
I have seen the exact same issue where copying Hive tables to the local file system merges all the fields into one giant line, and the '\n' character at the end of each row of the table is not honored.
Your best option is to use a custom SerDe (Serializer/Deserializer) to export the Hive table to CSV, as described here. You can get the source code from GitHub as well.
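For reference, a default Hive text table separates fields with the non-printing ^A (\001) character, which is why the copied file looks like one merged blob. This is not the SerDe route suggested above, just a quick hedged alternative that re-delimits while reading the table's files; the warehouse path and table name below are assumptions:
# Assumed default warehouse location for the table.
hadoop fs -cat /user/hive/warehouse/my_table/* | tr '\001' ',' > my_table.csv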
Are you copying from the Hive table? And are you copying directly from the warehouse directory? Please provide the full command that you are using.
My question is in the title. In addition, the Hive CLI is not an option in my situation; I only have the Hive editor in the Hue platform.
The reason I am not using xlsx is that only 30,000 records can be exported to xlsx.
Refer to the question below; users have suggested a lot of options there.
How to export a Hive table into a CSV file?
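For example, one of the options suggested there can be run straight from the Hue Hive editor: have Hive itself write delimited text to an HDFS directory and download it from there. The directory and table name below are assumptions (wrapped in hive -e here, but the same statement can be pasted into the Hue editor):
# Writes comma-delimited text files under the given HDFS directory.
hive -e "
INSERT OVERWRITE DIRECTORY '/user/me/csv_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_table;
"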
I have several files in my Hadoop cluster, with about 2,000 fields in each file. I need a quick way of cutting specific fields out of each file and creating a new file to sftp to a client.
e.g. I have 20 files with fields 1 to 2000.
From each file I need to pull fields 1, 6, 7, 777, 545, 345, 655, 1004, etc., in that order.
I need to do this every day and have several processes selecting different fields to use.
I'm interested in hearing other people's suggestions for the best technology to use for this:
- Use a Hive query to select all the required fields
- Use MapReduce
- Use Spark to run Hive or MapReduce
- Something else completely different
Thanks,
Red
One approach is to use Apache Pig. The source files can be loaded into Pig, and since you know the positions of the fields to extract, you can use those indexes to pull the fields out of the Pig relations (the loaded files). Indexes start from 0 in Pig.
See the following link for more details about loading files and extracting fields in Apache Pig:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
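A minimal sketch of what that could look like, assuming comma-delimited input; the paths and the pig invocation below are assumptions, and with 0-based indexing field 1 becomes $0, field 6 becomes $5, and so on:
# Hypothetical input/output paths.
pig <<'EOF'
data   = LOAD '/user/me/input/file1.txt' USING PigStorage(',');
subset = FOREACH data GENERATE $0, $5, $6, $776, $544, $344, $654, $1003;
STORE subset INTO '/user/me/output/file1_subset' USING PigStorage(',');
EOF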
Alternatively, you can use awk to slice your files (specify the appropriate delimiter) and pipe the command accordingly.
The syntax may look something like this (note the single quotes around the awk program):
hdfs dfs -cat <filename> | awk -F"," 'BEGIN{OFS=","} {print $1,$6,$7,$777,$545,$345,$655,$1004}' > output_file
I'd use hive's "create external table as select".
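A rough sketch of that idea, assuming the source files are already exposed as a Hive table; the table name, column names, delimiter, and warehouse path below are all assumptions, and note that some Hive versions only allow a plain CREATE TABLE AS SELECT rather than the EXTERNAL variant:
# Hypothetical table and column names.
hive -e "
CREATE TABLE client_extract
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS
SELECT field1, field6, field7, field777
FROM wide_source_table;
"
# The result lands under the warehouse directory (default path assumed here):
hdfs dfs -getmerge /user/hive/warehouse/client_extract client_extract.csv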
I am very new to Hadoop and I have a requirement to scrub a file containing account number, name, and address details. I need to replace the name and address details with other names and addresses that exist in another file.
I am comfortable with either MapReduce or Hive.
Need help on this.
Thank you.
You can write a simple mapper-only job (with the number of reducers set to zero), update the information, and store the output in another location. Verify the output of your job and, if it is as you expect, remove the old files. Remember, HDFS does not support in-place editing or overwriting of files.
Hadoop - MapReduce Tutorial.
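The answer above doesn't include code, so purely as a hedged illustration, here is what a mapper-only run could look like using Hadoop Streaming; the scrub.sh mapper script, the replacements.csv lookup file, the paths, and the streaming jar location are all assumptions, not the poster's actual job:
# Assumed: scrub.sh reads lines on stdin, swaps the name/address fields using
# replacements.csv, and writes scrubbed lines to stdout. The streaming jar path
# varies by distribution.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -files scrub.sh,replacements.csv \
  -mapper scrub.sh \
  -input /data/accounts/raw \
  -output /data/accounts/scrubbed
# Check the output first, then remove the old files (no in-place edits in HDFS):
hadoop fs -cat /data/accounts/scrubbed/part-* | head
hadoop fs -rm -r /data/accounts/raw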
You can also use Hive to accomplish this task.
1. Write a Hive UDF based on your scrubbing logic.
2. Use the above UDF on each column of the Hive table you want to scrub and store the data in a new Hive table (see the sketch after these steps).
3. Remove the old Hive table.
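A sketch of steps 2 and 3, assuming the UDF from step 1 has been packaged as scrub_udf.jar with a class com.example.ScrubUDF; the jar, class, table, and column names are all assumptions:
# Hypothetical jar, class, table, and column names.
hive -e "
ADD JAR scrub_udf.jar;
CREATE TEMPORARY FUNCTION scrub AS 'com.example.ScrubUDF';
CREATE TABLE accounts_scrubbed AS
SELECT account_no, scrub(name) AS name, scrub(address) AS address
FROM accounts;
DROP TABLE accounts;
"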
I have a big table that is generated in Hue with the Pig Editor and contains a few hundred thousand records.
Pig outputs several part files plus separate .pig_header and .pig_schema files.
I need all the part files and the header combined as one complete table in .txt format.
I can do it with the getmerge command:
-- To delete the schema file from the output folder
fs -rm /OUTPUT_folder/.pig_schema
-- To merge all the part files and the header from the output folder and save the result as a .txt file
fs -getmerge /OUTPUT_folder/* /Another_folder/Result.txt
I would like to ask if there is any way in Cloudera to get this complete table without using the getmerge command?
Maybe there is a tool or command in Cloudera that can combine the part files in one step.
And then I just need to open this table, with all the columns and headers in a nicely ordered way; what is best to use for this in Hue?
You could try doing a final GROUP ... ALL and an ORDER BY, followed by a FOREACH ... FLATTEN(); that way all the records go through a single reducer and end up in only one file.
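A minimal sketch of that idea; the relation names, paths, delimiter, and sort field are assumptions:
# GROUP ... ALL and PARALLEL 1 force a single reducer, so only one part file is written.
pig <<'EOF'
data    = LOAD '/OUTPUT_folder/part-*' USING PigStorage('\t');
grouped = GROUP data ALL;
flat    = FOREACH grouped GENERATE FLATTEN(data);
sorted  = ORDER flat BY $0 PARALLEL 1;
STORE sorted INTO '/Another_folder/single_file' USING PigStorage('\t');
EOF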
How to work on a specific part of a CSV file uploaded into HDFS?
I'm new to Hadoop and I have a question: if I export a relational database into CSV files and then upload them into HDFS, how can I work on a specific part (table) of a file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that by 'specific part (table)' you are referring to column data within the table(s). If so, place the individual CSV files into separate paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job with the input path and the field positions. You may consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can then read the specific fields and process the data.
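Purely as a hedged illustration of that layout (the answer above describes a regular MapReduce job; the paths, field numbers, and streaming jar location here are assumptions), the same idea with Hadoop Streaming, which also feeds the mapper one line at a time:
# One directory per table, then a mapper-only job that keeps only a few fields.
hadoop fs -mkdir -p /user/userName/dbName/tables/table1
hadoop fs -put table1.csv /user/userName/dbName/tables/table1/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -input /user/userName/dbName/tables/table1 \
  -output /user/userName/dbName/tables/table1_selected \
  -mapper "cut -d, -f1,3,7"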
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example adding column A to column B and placing the sum into column C by selecting them as Fields.
Use BigTable, which means converting your database to one big table.