What do the sync() and syncFs() methods of SequenceFile.Writer mean? - hadoop

Environment: Hadoop 0.20.2-cdh3u5
I am trying to upload log data (10 GB) to HDFS with a customized tool that uses SequenceFile.Writer:
SequenceFile.Writer w = SequenceFile.createWriter(
        hdfs,
        conf,
        p,
        LongWritable.class,
        Text.class,
        4096,
        hdfs.getDefaultReplication(),
        hdfs.getDefaultBlockSize(),
        compressionType,
        codec,
        null,
        new Metadata());
During the upload, if the tool crashes without explicitly invoking close(), will the log data that has already been uploaded be lost?
Should I invoke sync() or syncFs() periodically, and what do these two methods mean?

Yes, probably.
sync() creates a sync point. As stated in the book "Hadoop: The Definitive Guide" by Tom White (Cloudera):
a sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is "lost" - for example, after seeking to an arbitrary position in the stream.
Now the implementation of syncFs() is pretty simple:
public void syncFs() throws IOException {
    if (out != null) {
        out.sync(); // flush contents to file system
    }
}
where out is an FSDataOutputStream. Again, the same book states:
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all readers. In the event of a crash (of the client or HDFS), the data will not be lost.
But a footnote warns to look at bug HDFS-200, since the visibility mentioned above was not always honored.
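For illustration, here is a minimal sketch (not the original tool) of a write loop that calls syncFs() periodically; the record source "lines" and the batch size of 1000 are made up for the example:
// Hypothetical: syncFs() flushes buffered data to the datanodes, so a client
// crash loses at most the records appended since the last sync.
long count = 0;
for (String line : lines) {
    w.append(new LongWritable(count), new Text(line));
    if (++count % 1000 == 0) {
        w.syncFs();
    }
}
w.close(); // always close on normal termination
The trade-off is that more frequent sync calls mean less data at risk but more small flushes to HDFS.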

Related

Kafka stream state store rocksdb file size not decreasing on manual deletion of messages

I am using the Processor API to delete messages from a state store. The delete works; I confirmed it with an interactive query against the state store by Kafka key, but it does not reduce the Kafka Streams file size on local disk under the tmp/kafka-streams directory.
@Override
public void init(ProcessorContext processorContext) {
    this.processorContext = processorContext;
    // invoke punctuate every 10 seconds
    processorContext.schedule(Duration.ofSeconds(10), PunctuationType.STREAM_TIME, new Punctuator() {
        @Override
        public void punctuate(long l) {
            processorContext.commit();
        }
    });
    this.statestore = (KeyValueStore<String, GenericRecord>) processorContext.getStateStore(StateStoreEnum.HEADER.getStateStore());
    log.info("Processor initialized");
}
@Override
public void process(String key, GenericRecord value) {
    statestore.all().forEachRemaining(keyValue -> {
        statestore.delete(keyValue.key);
    });
}
kafka streams directory size
2.3M /private/tmp/kafka-streams
3.3M /private/tmp/kafka-streams
Do I need any specific configuration so that it keeps the file size under control? If it doesn't work this way, is it okay to delete the kafka-streams directory? I assume it should be safe, since such a delete would remove the records from both the state store and the changelog topic.
RocksDB does file compaction in the background. Hence, if you need more aggressive compaction, you should pass in a custom RocksDBConfigSetter via the Streams config parameter rocksdb.config.setter. For more details about RocksDB, check out the RocksDB documentation.
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter
However, I would not recommend changing RocksDB configs as long as there is no real issue; you can do more harm than good. Your store size seems quite small, so I don't see a real problem at the moment.
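For illustration only, a minimal RocksDBConfigSetter sketch; the class name and the tuning values are arbitrary examples, not recommendations:
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // Example knobs only; benchmark before using anything like this.
        options.setCompactionStyle(CompactionStyle.LEVEL);
        options.setMaxBackgroundCompactions(2);
    }
}
It would be registered with props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);.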
Btw: if you go to production, you should change the state.dir config to an appropriate directory where the state will not be lost even after a machine restart. If you keep state in the default /tmp location, it is most likely gone after the machine restarts, and an expensive recovery from the changelog topics would be triggered.
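As an illustration (the application id, broker address, and directory path are just examples), the relevant settings could look like this:
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // example id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example broker
        // Keep state outside /tmp so it survives a machine restart.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
        return props;
    }
}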

How does the HDFS Client knows the block size while writing?

The HDFS client is outside the HDFS cluster. When the HDFS client writes a file to Hadoop, it splits the file into blocks and then writes the blocks to the datanodes.
The question here is: how does the HDFS client know the block size? The block size is configured on the NameNode, and the HDFS client has no idea about it, so how can it split the file into blocks?
HDFS is designed in a way where the block size for a particular file is part of the file's metadata.
Let's check what this means.
The client can tell the NameNode that it will put data into HDFS with a particular block size.
The client has its own hdfs-site.xml that can contain this value, and it can also specify it on a per-request basis using the -Ddfs.blocksize parameter.
If the client configuration does not define this parameter, then it defaults to the org.apache.hadoop.hdfs.DFSConfigKeys.DFS_BLOCK_SIZE_DEFAULT value, which is 128 MB.
The NameNode can throw an error for the client if it specifies a block size that is smaller than dfs.namenode.fs-limits.min-block-size (1 MB by default).
There is nothing magical in this; the NameNode knows nothing about the data and lets the client decide the optimal splitting, as well as the replication factor for the blocks of a file.
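As a hedged illustration of such a client-side override (the 64 MB value is arbitrary):
// The client decides the block size for the files it writes.
Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB; must not be below the NameNode's minimum
FileSystem fs = FileSystem.get(conf);
// Files created through fs are now split into 64 MB blocks.
On the command line, something like hadoop fs -Ddfs.blocksize=67108864 -put local.log /logs/ should have the same effect.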
In simple words: when the client is set up, the cluster configuration (including the NameNode URI) is placed on the client, either as part of the deployment or by downloading and replacing it manually. So whenever the client requests something, it goes to the NameNode to fetch the required metadata, or places new data on the DataNodes.
P.S.: Client = EdgeNode
Some more details below (from the Hadoop Definitive Guide 4th edition)
"The client creates the file by calling create() on DistributedFileSystem (step 1 in
Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new
file in the filesystem’s namespace, with no blocks associated with it (step 2). The
namenode performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the namenode
makes a record of the new file; otherwise, file creation fails and the client is thrown an
IOException. The DistributedFileSystem returns an FSDataOutputStream for the client
to start writing data to. Just as in the read case, FSDataOutputStream wraps a
DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue."
Adding more info in response to comment on this post:
Here is a sample client program to copy a file to HDFS (Source-Hadoop Definitive Guide)
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
If you look at the create() method implementation in the FileSystem class, it has getDefaultBlockSize() as one of its arguments, which in turn fetches the value from the configuration, which in turn is provided by the namenode.
This is how the client gets to know the block size configured on the Hadoop cluster.
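For comparison, a hedged sketch of the create() overload that takes the block size explicitly (the buffer size, replication, and block size values below are just examples), reusing the fs and dst variables from the program above:
OutputStream out = fs.create(
        new Path(dst),
        true,                        // overwrite
        4096,                        // buffer size
        fs.getDefaultReplication(),  // replication factor
        128L * 1024 * 1024);         // block size: 128 MB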
Hope this helps

Apache Spark move/rename successfully processed files

I would like to use Spark Streaming (1.1.0-rc2, Java API) to process some files and move/rename them once they have been processed successfully, in order to push them to other jobs.
I thought about using the file path included in the name of the generated RDDs (newAPIHadoopFile), but how can we determine that processing of a file has finished successfully?
I am also not sure this is the right way to achieve it, so any ideas are welcome.
EDIT:
Here is some pseudo code to make it clearer:
logs.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
    @Override
    public Void call(JavaRDD<String> log, Time time) throws Exception {
        String fileName = log.name();
        String newlog = Process(log);
        SaveResultToFile(newlog, time);
        // are we done with the file so we can move it ????
        return null;
    }
});
You aren't guaranteed that the input is backed by an HDFS file. But it doesn't seem like you need that given your question. You create a new file and write something to it. When the write completes, you're done. Move it with other HDFS APIs.
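A hedged sketch of that last step, assuming the input path is known (e.g. from log.name()) and using a made-up target directory:
// Hypothetical: runs after SaveResultToFile(...) returned without throwing.
FileSystem fs = FileSystem.get(new Configuration());
Path src = new Path(fileName);                          // fileName from log.name()
Path dst = new Path("/logs/processed", src.getName());  // example target directory
if (!fs.rename(src, dst)) {
    throw new IOException("Could not move " + src + " to " + dst);
}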

Testing connection to HDFS

In order to test the connection to HDFS from a Java program, is it sufficient to rely on FileSystem.get(configuration), or should additional sanity checks be done (for example, some file-based operations like list, copy, delete)?
FileSystem.get(Configuration) creates a DistributedFileSystem object, which in turn relies on a DFSClient to talk to the NameNode. Buried deep down in the source (1.0.2 is the version I'm looking through) is a call to create an RPC for the NameNode, which in turn creates a proxy for the ClientProtocol interface.
When this proxy is created (org.apache.hadoop.ipc.RPC.getProxy(Class<? extends VersionedProtocol>, long, InetSocketAddress, UserGroupInformation, Configuration, SocketFactory, int)), a call is made to ensure the server and client both talk the same 'version', so this check confirms that a NameNode is running at the configured address:
VersionedProtocol proxy =
    (VersionedProtocol) Proxy.newProxyInstance(
        protocol.getClassLoader(), new Class[] { protocol },
        new Invoker(protocol, addr, ticket, conf, factory, rpcTimeout));
long serverVersion = proxy.getProtocolVersion(protocol.getName(),
    clientVersion);
if (serverVersion == clientVersion) {
    return proxy;
} else {
    throw new VersionMismatch(protocol.getName(), clientVersion,
        serverVersion);
}
Of course, whether the NameNode has sufficient datanodes running to perform some actions (such as create / open files) is not reported by this version match check.
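If you want a cheap sanity check beyond FileSystem.get(), a hedged sketch could issue a single metadata call against the root path (the class name is made up):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try {
            // Forces a real round trip to the NameNode; throws if it is unreachable.
            fs.getFileStatus(new Path("/"));
            System.out.println("HDFS is reachable");
        } finally {
            fs.close();
        }
    }
}
Note that, as the answer above says, this still does not prove that enough datanodes are available for reads and writes.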

multiple input into a Mapper in hadoop

I am trying to send two files to a Hadoop reducer.
I tried DistributedCache, but anything I put with addCacheFile in main doesn't seem to be available via getLocalCacheFiles in the mapper.
Right now I am using FileSystem to read the file, but since I am running locally I am able to just send the file name. I am wondering how to do this if I were running on a real Hadoop system.
Is there any way to send values to the mapper other than the file that it's reading?
I also had a lot of problems with the distributed cache and with sending parameters. The options that worked for me are below:
For distributed cache usage:
For me it was a nightmare to get the URL/path to a file on HDFS in Map or Reduce, but with a symlink it worked.
In the run() method of the job:
DistributedCache.addCacheFile(new URI(file+"#rules.dat"), conf);
DistributedCache.createSymlink(conf);
and then read it in Map or Reduce:
in the class header, before the methods:
public static FSDataInputStream hdfs;
and then in the setup() method of Map or Reduce:
hdfs = FileSystem.get(new Configuration()).open(new Path("rules.dat"));
For parameters:
Send some values to Map or Reduce (could be a filename to open from HDFS):
public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ...
    conf.set("level", otherArgs[2]); // sets variable "level" from the command line; it could be a filename
    ...
}
then in Map or Reduce class just:
int level = Integer.parseInt(conf.get("level")); // this is an int, but you can also read strings, etc.
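For completeness, a minimal sketch (assuming the new org.apache.hadoop.mapreduce API; the class name is made up) of where that conf.get("level") call would live in a Mapper:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LevelMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int level;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Reads the value set with conf.set("level", ...) in the driver.
        level = Integer.parseInt(context.getConfiguration().get("level"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use "level" here ...
    }
}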
If the distributed cache suits your needs, it is the way to go.
getLocalCacheFiles works differently in local mode and in distributed mode (it actually does not work in local mode).
Look into this link: http://developer.yahoo.com/hadoop/tutorial/module5.html
and look for the phrase "As a cautionary note".
