I have several files in my hadoop cluster, about 2000 fields in each file. I need a quick way of cutting specific fields out of each file and creating a new file for sftping to a client.
eg. I have 20 files with fields from 1 to 2000
From each file I need to pull fields 1,6,7,777,545,345,655,1004 etc, in that order.
I need to do this every day and have several processes selecting different fields to use.
I'm interested in hearing what other people's suggestions would be for the best technology to do this:
Use a Hive query to select all the required fields
Use MapReduce
Use Spark to run Hive or MapReduce
Something else completely different
Thanks,
Red
One approach is to use Apache Pig. The source files can be loaded into Pig, and since you know the indexes of the fields to extract, you can use those indexes to project them out of the Pig relations (the loaded files). Indexes start from 0 in Pig.
See the following link for more details about loading and extracting fields in Apache Pig:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
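A minimal Pig Latin sketch of that idea, using the field list from the question (the input path, comma delimiter, and output path are assumptions):
-- load the delimited source file
data = LOAD '/data/source_file' USING PigStorage(',');
-- project the required fields by position: $0 is field 1, $5 is field 6, and so on
extracted = FOREACH data GENERATE $0, $5, $6, $776, $544, $344, $654, $1003;
-- write the result back to HDFS, ready for the sftp step
STORE extracted INTO '/data/extracted_output' USING PigStorage(',');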
Alternatively, you can use awk to slice your files (specify the appropriate delimiter) and pipe the command accordingly.
An ideal syntax may somewhat look like:
hdfs dfs -cat <filename> | awk -F"," '{print insert_columns_here}' > output_file
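For the field list in the question, a concrete run might look like this (the input path is hypothetical):
hdfs dfs -cat /data/source_file | awk -F"," -v OFS="," '{print $1,$6,$7,$777,$545,$345,$655,$1004}' > output_file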
I'd use hive's "create external table as select".
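A rough HiveQL sketch of that approach (table names, column names, and paths are assumptions; in practice all ~2000 columns would be declared, and depending on your Hive version you may need a plain CTAS or an INSERT OVERWRITE instead of an external CTAS):
-- external table over the raw files (only three columns shown for brevity)
CREATE EXTERNAL TABLE source_data (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/source_dir';
-- write only the required columns to an HDFS directory ready for the sftp step
-- (the ROW FORMAT clause on INSERT OVERWRITE DIRECTORY needs a reasonably recent Hive)
INSERT OVERWRITE DIRECTORY '/data/extract_for_client'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT col1, col3 FROM source_data;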
I want to list Hadoop files on HDFS under a specific folder which were created on a specific day. Is there a command/option to do this?
Thanks in advance,
Lin
As far as I know, the hadoop command won't support this directly.
You can write a script to achieve this, but that is not a good implementation.
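As a rough example of such a script (the path and date are placeholders), you can filter on the modification-date column that hdfs dfs -ls prints; note this matches the file's modification time rather than a true creation time:
hdfs dfs -ls /path/to/folder | awk '$6 == "2016-05-12"'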
My suggestions:
Organize your files in a way that is more convenient to use. In your case, adding a time partition would be better.
If you want to make data analysis easier, use a database built on HDFS such as Hive. Hive supports partitions and SQL-like query and insert.
More about Hive and Hive partitions:
https://hive.apache.org/
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables
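A small sketch of a date-partitioned Hive table (the table name, columns, and paths are assumptions):
-- external table partitioned by day; each day's files live in their own partition directory
CREATE EXTERNAL TABLE events (id STRING, payload STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events';
-- register one day's folder as a partition, then query just that day
ALTER TABLE events ADD PARTITION (dt='2016-05-12') LOCATION '/data/events/2016-05-12';
SELECT * FROM events WHERE dt = '2016-05-12';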
I have a big table that is generated in Hue with the Pig Editor and contains several hundred thousand records.
Pig produces several part files, plus separate .pig_header and .pig_schema files.
I need to have all the part files and a header as one complete table in .txt format.
I can do it with the getmerge command:
-- To delete schema from output folder
fs -rm /OUTPUT_folder/.pig_schema
--To merge all the part files and header from output folder and to save result in .txt file
fs -getmerge /OUTPUT_folder/* /Another_folder/Result.txt
I would like to ask if there is any way in Cloudera to get this complete table without using the getmerge command?
Maybe there is software in Cloudera, or a command, that combines the part files at once.
And then I just need to open this table with all the columns and headers in a nicely ordered way. What is best to use for this goal in Hue?
You could try to do a final GROUP ... ALL and an ORDER BY, followed by a FOREACH ... FLATTEN(); that way all the records will go through a single reducer and so will end up in only one file.
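A hypothetical Pig sketch of the GROUP ... ALL variant (the relation name and output path are assumptions):
-- group everything into a single bag so all records flow through one reducer
grouped = GROUP final_data ALL;
single = FOREACH grouped GENERATE FLATTEN(final_data);
STORE single INTO '/OUTPUT_folder/single_file' USING PigStorage(',');
-- an ORDER BY with PARALLEL 1 on final_data would have the same single-file effect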
I am new to PIG.
Actually I have a use case in which I have to store data into the same file again and again at regular intervals. But as I went through some tutorials and links, I didn't see anything related to this.
How should I store the data in the same file?
It's impossible. Pig uses Hadoop, and right now there is no "recommended" solution for appending to files.
The other point is that Pig produces a single file only if one mapper or one reducer has been used at the end of the whole data flow.
You can:
1. Give more info about the problem you are trying to solve.
2. Bad solution:
2.1. Process data in your Pig script.
2.2. Load data from the existing file.
2.3. Union the relations, where the first relation keeps the new data and the second relation keeps the data from the existing file.
2.4. Store the union result to a new output.
2.5. Replace the old file with the new one.
3. Good solution:
Create a folder /mydata.
Create partitions inside the folder; they can be /yyyy/MM/dd/HH if you process data each hour.
Use globs to read data:
/mydata/*/*/*/*/*
All files from hour partitions would be read by PIG/HIVE/MR or whatever hadoop tool.
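A minimal Pig sketch of this partition-plus-globs layout (the schema, the staging path, and the example hourly partition are assumptions):
-- each run stores its output into a fresh hourly partition instead of appending
new_data = LOAD '/incoming/latest' USING PigStorage(',');
STORE new_data INTO '/mydata/2013/07/26/14' USING PigStorage(',');
-- downstream jobs read every hourly partition at once via the glob pattern
all_data = LOAD '/mydata/*/*/*/*/*' USING PigStorage(',');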
Make a date folder like /abc/hadoop/20130726/.
Within it, generate output named by timestamp, like /abc/hadoop/20130726/201307265465.gz.
Then use the getmerge command to merge all the data into a single file.
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
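For example, with the folder layout above (the local destination path is just an assumption):
hadoop fs -getmerge /abc/hadoop/20130726 /tmp/20130726_merged.gz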
Hope it will help you.
I'm attempting to copy a table to a file using hadoop fs -copyToLocal. The command works swimmingly, minus the fact that all my fields are merged together. Is there a way to specify a delimiter?
I have seen the exact same issue where copying Hive tables to the local file system joins all the fields together in one giant line, and the '\n' character is not honored at the end of each row in the table.
Your best option is to use a custom SerDe (Serializer and Deserializer) to export the Hive table to CSV, as described here. You can get the source code from GitHub as well.
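If a custom SerDe is more than you need, an alternative sketch (assuming a reasonably recent Hive that supports ROW FORMAT on INSERT OVERWRITE, plus hypothetical names and paths) is to let Hive write a delimited copy for you:
-- write a comma-delimited copy of the table to the local file system
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/my_table_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_table;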
Are you copying from the Hive table? And are you copying directly from the warehouse directory? Please provide the full command that you are using.
I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a csv file and it is working fine.
Now I have a use case where I need to parse a csv file and encode it with the corresponding category.
ex: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38",.............
now this has to be replaced as "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3",.............
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content in-place for the input. Is this achievable with MR?
Basically, I have not seen any file-handling or write-based Hadoop examples anywhere.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits on top of Hadoop and translates your queries into MR jobs.
Its usage is not as serious an infrastructure decision as using HBase.
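A rough sketch of what that could look like for the example above (the table and column names, the paths, and the exact recoding rule are assumptions; the sample rows look like an integer division by 10, so swap in pmod() or whichever rule each category needs):
-- external table over the raw csv, with a layout matching the example rows
CREATE EXTERNAL TABLE raw_rows (c1 INT, c2 STRING, c3 STRING, c4 INT, c5 INT, c6 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw_csv';
-- HDFS files cannot be edited in place, so the recoded rows go to a new directory
-- (ROW FORMAT on INSERT OVERWRITE DIRECTORY needs a reasonably recent Hive)
INSERT OVERWRITE DIRECTORY '/data/recoded_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT CAST(c1 / 10 AS INT), c2, c3, CAST(c4 / 10 AS INT), CAST(c5 / 10 AS INT), CAST(c6 / 10 AS INT)
FROM raw_rows;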