How to avoid creation of .crc files when Parquet files are created

I am using the Parquet framework to write Parquet files.
I create the Parquet writer with this constructor:
public class ParquetBaseWriter<T extends HashMap> extends ParquetWriter<T> {

    public ParquetBaseWriter(Path file, HashMap<String, SchemaField> mySchema,
                             CompressionCodecName compressionCodecName, int blockSize,
                             int pageSize) throws IOException {
        super(file, ParquetBaseWriter.<T>writeSupport(mySchema),
              compressionCodecName, blockSize, pageSize, DEFAULT_IS_DICTIONARY_ENABLED, false);
    }
}
Each time a Parquet file is created, a corresponding .crc file also gets created on the disk.
How can I avoid the creation of that .crc file?
Is there a flag or something I have to set?
Thanks

You could look at this Google Groups discussion about the .crc files:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdk-dev/JR45MsLeyTE
TL;DR - the .crc files don't take up any overhead in the NameNode namespace. They are not HDFS data files; they are meta files in the data directories. You will see them on your local filesystem if you use the "file:///" URI.
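If your Parquet files are being written through the local filesystem (the "file:///" case above), one way to suppress the .crc side files, not mentioned in the discussion and offered here only as a sketch to verify against your Hadoop version, is to turn off checksum writing on the FileSystem instance before constructing the writer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path outputFile = new Path("file:///tmp/data.parquet"); // hypothetical output path
FileSystem fs = FileSystem.get(outputFile.toUri(), conf);
fs.setWriteChecksum(false);  // ask the checksummed local FileSystem not to emit .crc side files
fs.setVerifyChecksum(false); // optionally skip checksum verification on reads as well
// ...then construct the ParquetBaseWriter against the same path as before

Because FileSystem instances are cached per URI, the writer created afterwards should pick up the same instance with checksums disabled.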

Related

New Output file for each Item passed into FlatFileItemWriter

I have the following domain object. This is the object being passed from my processor to my writer.
public class DivisionIdPromoCompStartDtEndDtGrouping {
    private int divisionId;
    private Date rpmPromoCompDetailStartDate;
    private Date rpmPromoCompDetailEndDate;
    private List<MasterList> detailRecords = new ArrayList<MasterList>();
}
I would like a new file per DivisionIdPromoCompStartDtEndDtGrouping. Each file would have a line for each of the detailRecords in the list. The output files would all have the same format, just logically separated based on data (divisionId, rpmPromoCompDetailStartDate and rpmPromoCompDetailEndDate).
How can I create a FlatFileItemWriter that outputs a new file for each DivisionIdPromoCompStartDtEndDtGrouping with the content of detailRecords?
I think the answer might be a CompositeItemWriter. Is that right? Could someone help me with an example of this?
Thanks in advance.
You're close. Instead of just a CompositeItemWriter, use a ClassifierCompositeItemWriter. This, coupled with a Classifier implementation that chooses a writer by grouping, will allow you to have one file per group. You can read more about this ItemWriter in the javadoc here: http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/item/support/ClassifierCompositeItemWriter.html
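A rough sketch of that approach follows. The writerForGroup(...) factory is a made-up helper, not Spring Batch API; it would build or look up the FlatFileItemWriter for a given divisionId/start date/end date combination, and depending on your Spring Batch version the Classifier interface lives in org.springframework.classify or org.springframework.batch.classify:

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;
import org.springframework.classify.Classifier;

ClassifierCompositeItemWriter<DivisionIdPromoCompStartDtEndDtGrouping> writer =
        new ClassifierCompositeItemWriter<DivisionIdPromoCompStartDtEndDtGrouping>();
writer.setClassifier(new Classifier<DivisionIdPromoCompStartDtEndDtGrouping,
        ItemWriter<? super DivisionIdPromoCompStartDtEndDtGrouping>>() {
    @Override
    public ItemWriter<? super DivisionIdPromoCompStartDtEndDtGrouping> classify(
            DivisionIdPromoCompStartDtEndDtGrouping group) {
        // pick the delegate writer that belongs to this group's
        // divisionId / start date / end date combination
        return writerForGroup(group);
    }
});
// Each delegate FlatFileItemWriter still has to be opened and closed, e.g. by
// registering it as a stream on the step or handling that inside writerForGroup(...).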
No, the answer is not a composite writer. A composite writer simply forwards all items it receives to all defined child writers.
The problem with FlatFileItemWriter is that you have to open and close it, which is normally handled by the framework itself.
A simple approach would be to implement your own writer and use a FlatFileItemWriter inside its write method.
public class MyWriter implements ItemWriter<..> {

    public void write(List<..> items) {
        for (.. item : items) {
            FlatFileItemWriter fileWriter = new FlatFileItemWriter();
            fileWriter.setResource(...); // unique file name per item
            fileWriter.setLineAggregator(...);
            fileWriter.... ; // do other settings if necessary
            fileWriter.afterPropertiesSet();
            fileWriter.open(new ExecutionContext());
            fileWriter.write(Collections.singletonList(item));
            fileWriter.close();
        }
    }
}
The LineAggregator has to create an appropriate String, including all the line breaks, so that every detail record is written on its own line in the file.
Of course, you don't have to use a FlatFileItemWriter at all; you could simply open a file yourself, use the LineAggregator to create each line, and save the lines to the file.

Write data to local disk in each datanode

I want to store some values from the map task on the local disk of each data node. For example,
public void map (...) {
    // Process
    List<Object> cache = new ArrayList<Object>();
    // Add value to cache
    // Serialize cache to local file on this data node
}
How can I store this cache object on the local disk of each data node? If I store the cache inside the map function as above, the performance will be terrible because of the I/O involved.
I mean, is there any way to wait until the map task on this data node has run completely and then store the cache on the local disk? Or does Hadoop have a feature that solves this issue?
Please see the example below; the created file will be somewhere under the directories used by the NodeManager for containers. This is the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which is under /tmp.
Please also see @Chris Nauroth's answer, which says that this is just for debugging purposes and is not recommended as a permanent production configuration; it clearly describes why.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}
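If the goal is to defer the write until the map task has processed all of its input, one option (not part of the answer above, so treat it as a suggestion rather than the accepted approach) is to accumulate values during map() and serialize them once in the mapper's cleanup() method, which Hadoop calls once per task after the last map() call. A minimal sketch:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachingMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final List<String> cache = new ArrayList<String>();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // normal map work here; just collect whatever needs to be persisted locally
        cache.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // runs once per map task, after the last call to map()
        File localFile = new File("cache-" + context.getTaskAttemptID() + ".ser"); // lands in the container's working directory
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(localFile));
        try {
            out.writeObject(new ArrayList<String>(cache));
        } finally {
            out.close();
        }
    }
}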

Modifying data while importing from Oracle to HBase using Sqoop

I am trying to transfer my data, which is in an Oracle database, to my HBase table using Sqoop. I am able to do that successfully using the Java Sqoop client.
However, in this case I am doing just the transfer and always using hbase_row_key as "COL1,COL2".
What I want to do now is decide on the hbase_row_key before I put the data into the HBase table: it should be "COL1,COL2" if COL2 is present, and if COL2 is absent the hbase_row_key should be "COL1,COL3" (assuming COL3 is always present).
I think using a custom mapper instead of the default mapper should do it, but I am not sure how to do that with Sqoop. How do I make Sqoop use a custom mapper before inserting data into HBase?
Any help in this regard would be highly appreciated.
Thanks again!
Below is my Java Sqoop client code:
import com.cloudera.sqoop.SqoopOptions;
import com.cloudera.sqoop.tool.ImportTool;

public class TestSqoopClient {

    public static void main(String[] args) throws Exception {
        SqoopOptions options = new SqoopOptions();
        options.setConnectString("my_database_connection_string");
        options.setUsername("my_user");
        options.setPassword("my_password");
        options.setNumMappers(2); // Default value is 4
        //options.setSqlQuery("SELECT * FROM user_logs WHERE $CONDITIONS limit 10");
        options.setTableName("my_tablename");
        options.setWhereClause("my_where_condition");
        options.setSplitByCol("my_split_column");

        // HBase options
        options.setHBaseTable("my_hbase_table_name");
        options.setHBaseColFamily("my_column_family");
        options.setCreateHBaseTable(false); // Create HBase table, if it does not exist
        options.setHBaseRowKeyColumn("COL1,COL2");

        int ret = new ImportTool().run(options);
    }
}
Have a look at extending the HBase serialization code, as described at http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html#_hbase_serialization_extensions, by writing a custom PutTransformer.
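A very rough illustration of that extension point follows; the method signature, the column handling, and the property used to register the class are assumptions to verify against your Sqoop and HBase versions:

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.sqoop.hbase.PutTransformer;

public class ConditionalRowKeyTransformer extends PutTransformer {

    @Override
    public List<Put> getPutCommand(Map<String, Object> fields) throws IOException {
        Object col1 = fields.get("COL1");
        Object col2 = fields.get("COL2");
        Object col3 = fields.get("COL3"); // assumed to always be present
        String rowKey = (col2 != null) ? col1 + "," + col2 : col1 + "," + col3;

        Put put = new Put(Bytes.toBytes(rowKey));
        String family = getColumnFamily(); // the column family configured for the import
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            if (field.getValue() != null) {
                put.add(Bytes.toBytes(family),
                        Bytes.toBytes(field.getKey()),
                        Bytes.toBytes(field.getValue().toString()));
            }
        }
        return Collections.singletonList(put);
    }
}

To plug it in, the property "sqoop.hbase.insert.put.transformer.class" (again, verify the exact key for your Sqoop version) would be set to this class name on the job configuration before running the ImportTool.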

Reading and writing multiple files simultaneously using Spring batch

We are developing an application which will read multiple files & write multiple files, i.e. one output file for each input file (the name of the output file must be the same as the input file).
MultiResourceItemReader can read multiple files, but not simultaneously, which is a performance bottleneck for us. Spring Batch provides multithreading support for this, but then many threads may read the same file & try to write it. Since the output file name must be the same as the input file name, we can't use that option either.
Now I am looking at one more possibility: creating 'n' threads to read & write 'n' files. But I am not sure how to integrate this logic with the Spring Batch framework.
Thanks in advance for any help.
Since MultiResourceItemReader doesn't meet your performance needs, you may want to take a closer look at parallel processing, which you already mentioned is a desirable option. I don't think many threads will read the same file and try to write it when running multi-threaded, if configured correctly.
Rather than taking the typical chunk-oriented approach, you could create a tasklet-oriented step that is partitioned (multi-threaded). The tasklet class would be the main driver, delegating calls to a reader and a writer.
The general flow would be something like this:
Retrieve the names of all the files that need to be read in/written out (via some service class) and save them to the execution context within an implementation of Partitioner.
public class FilePartitioner implements Partitioner {

    private ServiceClass service;
    private String directory; // pseudo-ish: maybe you inject the directory you'll be targeting into this class

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, Path> filesToProcess = this.service.getFilesToProcess(directory);
        Map<String, ExecutionContext> execCtxs = new HashMap<>();
        for (Entry<String, Path> entry : filesToProcess.entrySet()) {
            ExecutionContext execCtx = new ExecutionContext();
            execCtx.put("file", entry.getValue());
            execCtxs.put(entry.getKey(), execCtx);
        }
        return execCtxs;
    }

    // injected
    public void setServiceClass(ServiceClass service) {
        this.service = service;
    }
}
a. For the .getFilesToProcess() method you just need something that returns all of the files in the designated directory, because you eventually need to know what is to be read and the name of the file that is to be written. Obviously there are several ways to go about this, such as...
public Map<String, Path> getFilesToProcess(String directory) {
    Map<String, Path> filesToProcess = new HashMap<String, Path>();
    File directoryFile = new File(directory); // where directory is where you intend to read from
    this.generateFileList(filesToProcess, directoryFile, directory);
    return filesToProcess;
}

private void generateFileList(Map<String, Path> fileList, File node, String directory) {
    // traverse directory and get files, adding to file list
    if (node.isFile()) {
        String file = node.getAbsoluteFile().toString()
                .substring(directory.length() + 1, node.toString().length());
        fileList.put(file, node.toPath());
    }
    if (node.isDirectory()) {
        String[] files = node.list();
        for (String filename : files) {
            this.generateFileList(fileList, new File(node, filename), directory);
        }
    }
}
You'll need to create a tasklet, which will pull the file name from the execution context and pass it to some injected class that will read in the file and write it out (custom ItemReaders and ItemWriters may be necessary); a rough sketch follows.
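A minimal sketch of such a tasklet, assuming the "file" key written by the partitioner above and a made-up FileCopyService collaborator that stands in for whatever reads the input file and writes the identically named output file:

import java.nio.file.Path;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.repeat.RepeatStatus;

public class FileProcessingTasklet implements Tasklet {

    private FileCopyService fileCopyService; // injected; hypothetical service doing the actual I/O

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        ExecutionContext stepContext =
                chunkContext.getStepContext().getStepExecution().getExecutionContext();
        Path file = (Path) stepContext.get("file"); // value stored by the Partitioner
        fileCopyService.readAndWrite(file);         // read the input, write the output with the same name
        return RepeatStatus.FINISHED;
    }

    public void setFileCopyService(FileCopyService fileCopyService) {
        this.fileCopyService = fileCopyService;
    }
}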
The rest of the work would be in configuration, which should be fairly straightforward. It is in the configuration of the partitioned step where you can set your grid size, which could even be done dynamically using SpEL if you really intend to create n threads for n files. I would bet that a fixed number of threads running across n files would already show a significant improvement in performance, but you'll be able to determine that for yourself.
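For example, with Java configuration the partitioned step could be wired roughly like this (bean names and the fixed grid size are illustrative only, not prescriptive):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.TaskExecutor;

@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory,
                       Partitioner filePartitioner,
                       Step fileWorkerStep,
                       TaskExecutor taskExecutor) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("fileWorkerStep", filePartitioner) // the Partitioner shown above
            .step(fileWorkerStep)                           // the tasklet step doing the read/write
            .gridSize(10)                                   // fixed partition count; tune or derive dynamically
            .taskExecutor(taskExecutor)                     // runs partitions concurrently
            .build();
}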
Hope this helps.

How to recursively list all the files in a directory in C# and copy all files to another directory

I am trying to write a C# console application that recursively reads through a certain folder.
In these folders are thousands of .jpg images.
The folder structure is very deep in some levels; an example looks like this:
Scc-LocalPhoto/testfiles/1997/JAN-JUN 1997/APRIL 1997/7.4.97 - 11.4.97/FRI11.4.97/
As you can see the folder structure is quite messy, however I do not have control over this.
My task is to read through all the folders, extract the metadata from the images, and store it in an XML file. I then need to copy all the folders in the same layout and paste them into a new folder.
I think I will be able to read through all the directories, extract the metadata from the images, and save it to an XML file.
What I do not know how to do is copy all the folders and images and paste them into a new directory while maintaining the same folder structure.
Does anybody know of an efficient way to perform this task, or is there any project or code available I could use as a starting point?
I am fairly new to C# and writing console apps.
Thanks for your time.
Parsing Directories Recursively
static void ParseDirectories(string root)
{
    ProcessDirectory(new DirectoryInfo(root));

    string[] subDirectories = Directory.GetDirectories(root);

    // No more directories to explore
    if (subDirectories.Length == 0)
        return;

    foreach (string subDirectory in subDirectories)
    {
        ParseDirectories(subDirectory);
    }
}
Processing the Files in a Directory
static void ProcessDirectory(DirectoryInfo directory)
{
    foreach (FileInfo file in directory.EnumerateFiles("*.jpg"))
    {
        // record metadata and do other work on each file here
    }
}
Copying a Directory tree
static void CopyDirectoryTree(DirectoryInfo source, DirectoryInfo dest)
{
    if (!Directory.Exists(dest.FullName))
        Directory.CreateDirectory(dest.FullName);

    bool overwrite = true;

    // Copy files
    foreach (FileInfo file in source.EnumerateFiles())
    {
        file.CopyTo(Path.Combine(dest.ToString(), file.Name), overwrite);
    }

    // Copy sub-directories
    foreach (DirectoryInfo subDirectory in source.GetDirectories())
    {
        DirectoryInfo newDirectory = dest.CreateSubdirectory(subDirectory.Name);
        CopyDirectoryTree(subDirectory, newDirectory);
    }
}
Sample usage
static void Main(string[] args)
{
    // Process each directory
    string initialDirectory = @"C:\path_to_folder";
    ParseDirectories(initialDirectory);

    // Copy directory tree
    string destinationDirectory = @"C:\path_to_new_root_directory";
    CopyDirectoryTree(
        new DirectoryInfo(initialDirectory),
        new DirectoryInfo(destinationDirectory));
}
Hope this helps!
May I suggest the following, which is, in my opinion, a little bit more straightforward
public static void CopyFolderTree(string sourcePath, string targetPath)
{
    var sourceDir = new DirectoryInfo(sourcePath);
    var targetDir = new DirectoryInfo(targetPath);
    targetDir.Create();

    foreach (var file in sourceDir.GetFiles())
        file.CopyTo(Path.Combine(targetPath, file.Name), true);

    foreach (var subfolder in sourceDir.GetDirectories())
        CopyFolderTree(subfolder.FullName, Path.Combine(targetPath, subfolder.Name));
}
