Utility to push data into HDFS - hadoop

I need to build a common utility for Unix/Windows-based systems to push data into a Hadoop system. Users should be able to run that utility from any platform and push data into HDFS.
WebHDFS could be one option, but I'm curious to know if anything else is available.
Any suggestions?

I usually create a Maven project and add this dependency to my pom.xml file:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.1</version>
</dependency>
Then pushing data into HDFS is very easy with the Hadoop Java API; here is a simple example just to see how it works:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Replace the placeholders with your NameNode host and listening port
String namenodeLocation = "hdfs://[your-namenode-ip-address]:[hadoop-listening-port]/";
Configuration configuration = new Configuration();
FileSystem hdfs = FileSystem.get(new URI(namenodeLocation), configuration);
// Create a file on HDFS and write some bytes into it
Path file = new Path(namenodeLocation + "/myWonderful.data");
FSDataOutputStream outStream = hdfs.create(file);
byte[] coolDataToPushToHDFS = new byte[1500];
outStream.write(coolDataToPushToHDFS);
outStream.close();
hdfs.close();
It's a really simple program. I think the steps you have to do are (a minimal sketch of such a utility follows this list):
Let users choose the input/data to push.
Use the Hadoop Java API to send the file/data to your cluster.
Give some feedback to the user.
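To give an idea of how those steps could fit together, here is a minimal sketch of such a command-line utility. The class name, argument handling, and the choice of copyFromLocalFile are my own assumptions, not a finished tool:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical usage: java HdfsPush hdfs://<namenode-host>:<port>/ /local/file.txt /data/file.txt
public class HdfsPush {
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: HdfsPush <namenode-uri> <local-source> <hdfs-destination>");
            System.exit(1);
        }
        // Step 1: the user chose the input on the command line
        FileSystem hdfs = FileSystem.get(new URI(args[0]), new Configuration());
        // Step 2: copyFromLocalFile uploads the local file to the given HDFS path
        hdfs.copyFromLocalFile(new Path(args[1]), new Path(args[2]));
        hdfs.close();
        // Step 3: give some feedback
        System.out.println("Pushed " + args[1] + " to " + args[2]);
    }
}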
You can also append information to an existing file, not only create a new one; a short sketch of that follows.
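A minimal sketch of appending, reusing the hdfs and namenodeLocation variables from the example above and assuming append is enabled on your cluster (the file path is just an example):

// Append to an existing HDFS file instead of creating a new one
Path existing = new Path(namenodeLocation + "/myWonderful.data");
FSDataOutputStream appendStream = hdfs.append(existing);
appendStream.write("some more data".getBytes());
appendStream.close();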
Have a look at the documentation: https://hadoop.apache.org/docs/current/api/

Related

Accessing streaming data from ADLS in PyFlink

I am attempting a PyFlink stream setup from ADLS and currently trying to read a JSON file using the StreamExecutionEnvironment.from_source() method.
Here's what the code looks like:
from flink.plan.Environment import get_environment
from pyflink.datastream.functions import SourceFunction
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import (FileSource, StreamFormat, FileSink,
                                                       OutputFileConfig, RollingPolicy, BucketAssigner)
from pyflink.common import WatermarkStrategy, Encoder, Types
from azure.storage.filedatalake import FileSystemClient

file_system = FileSystemClient.from_connection_string(connection_str, file_system_name="my_fs")

# setting the stream environment object
env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
env.add_jars("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")
env.add_classpaths("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")

file_client = file_system.get_file_client(my_file)
input_path = 'abfss://' + file_client.url[8:]
print('URL is ===== >>>>', file_client.url)

# Source
ds = env.from_source(
    source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                                input_path)
                     .process_static_file_set().build(),
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="file_source"
)

ds.sink_to(
    sink=FileSink.for_row_format(
        base_path=output_path,
        encoder=Encoder.simple_string_encoder())
        .with_bucket_assigner(BucketAssigner.base_path_bucket_assigner())
        .build())

ds.print()
env.execute()
I am getting the error below:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'abfs'. The scheme is directly supported by Flink through the following plugin(s): flink-fs-azure-hadoop. Please ensure that each plugin resides within its own subfolder within the plugins directory. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/ for more information. If you want to use a Hadoop file system for that scheme, please add the scheme to the configuration fs.allowed-fallback-filesystems. For a full list of supported file systems, please see https://nightlies.apache.org/flink/flink-docs-stable/ops/filesystems/.
The jar file has already been added to the plugins folder as described in the documentation:
https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/
The storage account key has also been added to the config.yaml file.
Alternatively, I tried adding the source as a DataStream using:
ds = env.read_text_file(input_path)

Spring kafka not able to read truststore file from classpath

I am building a Kafka consumer app which needs SASL_SSL config. Somehow Apache Kafka is not recognizing a truststore file located on the classpath, and it looks like there is an open request to enhance this in Kafka (KAFKA-7685).
In the meantime, what would be the best way to solve this problem? The same app needs to be deployed to PCF too, so the solution should work both during local Windows-based development and on PCF (Linux).
Any solution would be highly appreciated.
Here is the code which copies the file to the Java temp dir:
String tempDirPath = System.getProperty("java.io.tmpdir");
System.out.println("Temp dir : " + tempDirPath);
File truststoreConf = ResourceUtils.getFile("classpath:Truststore.jks");
File truststoreFile = new File(tempDirPath + truststoreConf.getName());
FileUtils.copyFile(truststoreConf, truststoreFile);
System.setProperty("ssl.truststore.location", truststoreFile.getAbsolutePath());
You could use a ClassPathResource and FileCopyUtils to copy it from the jar to a file in a temporary directory in main() before creating the SpringApplication.
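A minimal sketch of that approach, assuming the truststore is packaged as Truststore.jks at the root of the classpath; how the resulting path is then handed to the Kafka properties (here a made-up TRUSTSTORE_LOCATION system property) is up to your configuration:

import java.io.File;
import java.io.FileOutputStream;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.FileCopyUtils;

@SpringBootApplication
public class Application {
    public static void main(String[] args) throws Exception {
        // Copy the truststore out of the jar into the temp directory before Spring starts
        File target = new File(System.getProperty("java.io.tmpdir"), "Truststore.jks");
        FileCopyUtils.copy(new ClassPathResource("Truststore.jks").getInputStream(),
                new FileOutputStream(target));
        // Hypothetical property that application.yml can reference for ssl.truststore.location
        System.setProperty("TRUSTSTORE_LOCATION", target.getAbsolutePath());
        SpringApplication.run(Application.class, args);
    }
}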
The root cause of this issue was that resource filtering was enabled. Maven corrupts the binary file during resource filtering, so if you have filtering enabled, disable it for the truststore.
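One common way to do that, assuming filtering runs through the standard maven-resources-plugin, is to exclude the .jks extension from filtering; a sketch:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-resources-plugin</artifactId>
    <configuration>
        <nonFilteredFileExtensions>
            <!-- keep binary keystores byte-for-byte intact during filtering -->
            <nonFilteredFileExtension>jks</nonFilteredFileExtension>
        </nonFilteredFileExtensions>
    </configuration>
</plugin>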

How to handle service application failures developed by Hive JDBC?

I'm developing a service application based on the Hive jdbc-examples project on Huawei's big data platform, but the application always fails to run.
To solve this problem, try any one of the following methods:
Method 1
Put the core-site.xml configuration file on the classpath, as in the Hive jdbc-examples sample project.
Method 2
In the code, explicitly load the core-site.xml file:
......
conf = new Configuration();
String userdir = System.getProperty("user.dir") + File.separator + "conf" + File.separator;
conf.addResource(new Path(userdir + "core-site.xml"));
......
Method 3
In the code, set hadoop.security.authentication to kerberos:
......
CONF = new Configuration();
CONF.set("hadoop.security.authentication", "kerberos");
......
See more.

Hadoop Configuration Object read XML

I have an XML file containing some names and values that I want to read in my Spark application. How do I use the Hadoop Configuration to read in these values and use them in my code?
I tried uploading the XML file to HDFS, but I'm not sure what the key is supposed to be when I use conf.get().
Maybe you forgot to include these lines in your code:
val conf = new Configuration()
conf.addResource(new Path(<path-to-file>))
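Regarding the key: conf.get() looks up whatever appears in the <name> element of each <property> entry in the file. A minimal sketch of such a file (the property name my.app.threshold is made up for illustration):

<configuration>
    <property>
        <name>my.app.threshold</name>
        <value>42</value>
    </property>
</configuration>

After addResource, conf.get("my.app.threshold") would return "42".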

Adding some hadoop configuration to a spark application at runtime (through spark-submit)?

I want to send a key-value pair to my Spark application, something like the following:
mapreduce.input.fileinputformat.input.dir.recursive=true
I understand I can do this from the code in the following way:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But I want to be able to send this property through spark-submit at runtime. Would this be possible?
Absolutely!
spark-submit (as well as spark-shell) supports the --conf PROP=VALUE and --properties-file FILE options, which allow you to specify such arbitrary configuration options. You can then get the values you pass by using the SparkConf.get function:
val conf = new SparkConf()
// the value was passed on the command line as "spark.mapreduce.input.fileinputformat.input.dir.recursive"
val mrRecursive =
  conf.get("spark.mapreduce.input.fileinputformat.input.dir.recursive")
// copy it into the Hadoop configuration under the actual Hadoop key (without the "spark." prefix)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", mrRecursive)
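For example, a submit command passing the property could look like this (the application class and jar names are hypothetical):

spark-submit \
  --conf spark.mapreduce.input.fileinputformat.input.dir.recursive=true \
  --class com.example.MyApp \
  my-app.jar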
Spark-submit/spark-shell --help:
--conf PROP=VALUE           Arbitrary Spark configuration property.
--properties-file FILE      Path to a file from which to load extra properties.
                            If not specified, this will look for conf/spark-defaults.conf.
Spark docs regarding [dynamically] loading properties: https://spark.apache.org/docs/latest/configuration.html
Without code modification, the following approach can be used.
Hadoop Configuration reads the file "core-default.xml" during creation; the description is here:
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/conf/Configuration.html
If you put the values in "core-default.xml" and include the directory containing that file in the classpath via the spark-submit "driver-class-path" parameter, it can work.
