When we run an INSERT INTO command in Hive, the execution creates multiple part files in HDFS.
For example: part-*-*****, or 000000_0, 000001_0, etc., or something else.
Is there a configuration/setting that controls the naming of these part files?
The cluster I work on creates 000000_0, 000001_0, 000000_1, etc. I would like to change this to part- or text- so that it's easier for me to pick these files up and merge them if needed.
If there is a setting that can be set in Hive right before executing the HQL, that would be ideal.
Thanks in advance.
I think you should be able to use:
set mapreduce.output.basename = part-;
This won't work. The only way I have found is with a custom file writer.
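If renaming the files after the INSERT is acceptable, here is a minimal sketch of that workaround using the HDFS FileSystem API. This is only an illustration, not a Hive setting, and the warehouse path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameHivePartFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical table location; point this at your own warehouse directory.
        Path tableDir = new Path("/user/hive/warehouse/mydb.db/mytable");

        int i = 0;
        for (FileStatus status : fs.listStatus(tableDir)) {
            if (!status.isFile()) continue;
            // Rename 000000_0, 000001_0, ... to part-00000, part-00001, ...
            fs.rename(status.getPath(), new Path(tableDir, String.format("part-%05d", i++)));
        }
        fs.close();
    }
}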
As the title suggests, I am just trying to do a simple export of a DataStage job. The issue occurs when we export the XML and begin examining it. For some reason, the wrong information is being pulled from the job and placed in the XML.
As an example the SQL in a transform of the job may be:
SELECT V1,V2,V3 FROM TABLE_1;
Whereas the XML for the same transform may produce:
SELECT V1,Y6,Y9 FROM TABLE_1,TABLE_2;
It makes no sense to me how the export of a job could be different from the actual architecture.
The parameters I am using to export are:
Exclude Read Only Items: No
Include Dependent Items: Yes
Include Source Code with Routines: Yes
Include Source Code with Job Executable: Yes
Include Source Content with Data Quality Specifications: No
What tool are you using to view the XML? Try using something less smart, such as Notepad or WordPad. This will help determine whether the problem lies with your XML viewer.
You might also try exporting in DSX format and examining that output, to see whether the same symptoms are visible there.
Thank you all for the feedback. I realized that the issue wasn't necessarily with the XML. It had to do with numerous factors within our DataStage environment. As mentioned above, the data connections were old and unreliable. For some reason this does not impact our current production refresh, so it's a non-issue.
The other issue was the way the generated SQL and custom SQL options work when creating the XML. In my case, there were times when old code was kept in the system, but the option was switched from custom code to generating SQL based on columns. This led to inconsistent output from my script, so the mini project was scrapped.
Hello, I have a directory with sub-directories named a1, a2, ..., a8, and each of these sub-directories has multiple files like:
bat-a1-0-0
bat-a1-0-1
bat-a1-1-0
bat-a1-1-1
...
bat-a1-31-0
bat-a1-31-1
and for sub-directory a2 it's similar:
bat-a2-0-0
bat-a2-0-1
bat-a2-1-0
bat-a2-1-1
...
bat-a2-31-0
bat-a2-31-1
What I decided to do, in order not to complicate things, is to have multiple LOAD statements, one per directory, and then find a way to UNION them all. But I do not know how to load the files in each of the directories using Apache Pig version 0.10.0-cdh4.2.1, since they do not seem to follow a simple pattern. Need help, thanks.
In fact, this may be simpler than you think. When you load files in Pig, you can simply point at a directory and Pig will recursively load all files under it, even those that are deeply nested.
So the solution is: make sure all your data is under one (or a few) directories, and load those in.
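For example, a rough sketch (the parent path and schema below are made up; adjust the PigStorage delimiter to match your data):

-- Point at the parent directory; Pig picks up the bat-* files in a1 ... a8 underneath it.
all_recs = LOAD '/data/bat' USING PigStorage(',') AS (id:chararray, value:int);

-- If you still want one relation per sub-directory, load them individually and UNION:
a1 = LOAD '/data/bat/a1' USING PigStorage(',') AS (id:chararray, value:int);
a2 = LOAD '/data/bat/a2' USING PigStorage(',') AS (id:chararray, value:int);
combined = UNION a1, a2;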
Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.
Let's say I'm copying hdfs:///foo to hdfs:///bar, and foo contains these files:
hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c
and bar contains these:
hdfs:///bar/a
hdfs:///bar/b
Then after the copy, I'd like bar to contain something like:
hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c
If there is no such option, what might be the most reliable/efficient way to do this? My own home-grown version of distcp could certainly get it done, but that seems like it could be a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.
DistCp does not have that option. If you are using the Java API for it, this can easily be handled by checking whether the destination path exists and changing the path in case it already does. You can check that with a FileSystem object using the method exists(Path p).
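For example, a simple single-process sketch of that idea. It copies sequentially with FileUtil.copy rather than running a distributed copy like DistCp, so treat it only as an illustration of the exists()-and-rename logic; the paths are the ones from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyWithCollisionRename {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("hdfs:///foo");
        Path dstDir = new Path("hdfs:///bar");

        for (FileStatus src : fs.listStatus(srcDir)) {
            String name = src.getPath().getName();
            Path dst = new Path(dstDir, name);
            // On a name collision, append -copy1, -copy2, ... until the name is free.
            for (int n = 1; fs.exists(dst); n++) {
                dst = new Path(dstDir, name + "-copy" + n);
            }
            // false = do not delete the source after copying.
            FileUtil.copy(fs, src.getPath(), fs, dst, false, conf);
        }
        fs.close();
    }
}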
I am trying to figure out the best way to organize a bunch of Ruby scripts to make it easier on the next person. One key thing is that there are multiple constants that need to be used across all scripts. Where should these be stored? Should I keep a separate file for these constants? Should I use YAML? I've never had to create a project with multiple Ruby source files interacting with each other, so I'm not sure what the best approach is here.
Thanks for the help.
I like to use a config.yaml file for all my constants. This makes it easy to set and change variables that are going to be used across different files. Then all you need to do is read in the file and set the variables. You can keep this file anywhere really, so long as anyone using the file has read permissions. All you have to do then is set the file path.
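For example, a minimal sketch of that approach (the file location and keys are hypothetical):

# config.yaml (hypothetical contents):
#   db_host: db.example.com
#   batch_size: 500
require 'yaml'

CONFIG = YAML.load_file(File.expand_path('config.yaml', __dir__))

# Expose the values as constants for the other scripts to require.
DB_HOST    = CONFIG['db_host']
BATCH_SIZE = CONFIG['batch_size']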
Hope this helps.
I like to use a config.yml or settings.yml, but I also allow the variables defined in config.yml to be overridden by ENV variables (which might be overkill in your situation).
It might also be a good idea to set some defaults in your config loading/setting code.
As far as common functions/methods go... common.rb is a pretty good name or maybe shared.rb.
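To sketch the overridable-config idea with defaults (the key and ENV names here are made up):

require 'yaml'

DEFAULTS = { 'db_host' => 'localhost', 'batch_size' => 100 }

file_settings = File.exist?('config.yml') ? YAML.load_file('config.yml') : {}

# Precedence: defaults < config.yml < environment variables.
CONFIG = DEFAULTS.merge(file_settings)
CONFIG['db_host']    = ENV.fetch('DB_HOST', CONFIG['db_host'])
CONFIG['batch_size'] = Integer(ENV.fetch('BATCH_SIZE', CONFIG['batch_size']))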
I am using DMExpress tasks to do transformations on my business data. These business data come in multiple formats/layouts. I need to be able to use a single task for transformations on multiple source layouts. Any DMExpress experts here?
One way I found to do transformations on multiple source layouts with a single task is to use the DMExpress SDK to write the task as a script, rather than building the task in the GUI task editor. The SDK gives a lot more flexibility than the GUI editor.
But if you are bound to the GUI, there is a workaround for this specific purpose. You should define a common name for the source layout. Only the source layout name is bound to the task, not the actual layout definition, so you can alter the layout definition while keeping the layout name constant to get a generic task.
FYI- DMExpress is now called DMX (Syncsort changed the name about a year ago).
Do you have multiple different record types within a single file or is each type of record in a separate file? Your question is not clear on this.
If they are in separate files, this is very easy, but you will need to create a separate DMX task for each file. In each of these tasks, define one of the files as the source and create a record layout that matches the format of that file.
If they are in the SAME file, it is only a little more difficult. You can split them into separate files by creating multiple targets and defining a named condition for each target using the SourceName() function (this function returns the name of the file that the current record came from). Then you can process them as separate files (see above). This works UNLESS you have a parent-child relationship going on between the different types of records in that single file. If that is the case, please post some sample data and I can advise on how to handle it.