How to strip host information from an hdfs path - hadoop

I have an hdfs path hdfs://host1:8899/path/to/file. I want to strip host1 and the port programmatically. As a result, it should be hdfs:/path/to/file. Is there any helper method that can do that?

"Is there any helper method can do that?"
Doesn't really take much to create your own. Just use the basic String class utility functions like split(), indexOf(), substring(), etc.
Something like this would do (with Java, though most languages have those methods):
public class TestPath {

    public static void main(String[] args) throws Exception {
        String path = "hdfs://localhost:9000/path/to/file";
        System.out.println(getPathWithoutHostAndPort(path));
    }

    public static String getPathWithoutHostAndPort(String path) {
        // Split on "//": array[0] is the scheme ("hdfs:"),
        // array[1] is "host:port/path/to/file"
        String[] array = path.split("//");
        int indexOfFirstSlash = array[1].indexOf("/");
        StringBuilder builder = new StringBuilder();
        builder.append(array[0]).append(array[1].substring(indexOfFirstSlash));
        return builder.toString();
    }
}
Result: hdfs:/path/to/file
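Alternatively, java.net.URI already knows how to take such a path apart, which avoids the hand-rolled string slicing. A minimal sketch (the helper name is my own):

import java.net.URI;

public class TestPathUri {

    public static void main(String[] args) {
        System.out.println(stripHostAndPort("hdfs://host1:8899/path/to/file"));
    }

    // Rebuilds the path as "<scheme>:<path>", dropping host and port
    static String stripHostAndPort(String path) {
        URI uri = URI.create(path);
        return uri.getScheme() + ":" + uri.getPath();
    }
}

This also prints hdfs:/path/to/file.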

Related

How to parse a list of lists of Spring properties

I have this Spring Boot application.properties:
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=list1,list2,list3
What I'm trying to do is use the values of the topics property in the topics element of the @KafkaListener annotation.
Using the expression
@KafkaListener(topics={"#{'${topics}'.split(',')}"})
I get list1, list2 and list3 as separate strings.
How can I loop over this list in order to get valueA,valueB,valueC,valueD,valueE?
Edit: I must parse the topics property so that @KafkaListener registers to consume messages from topics valueA, valueB, valueC, etc.
I read that it is possible to call a method in this way:
@KafkaListener(topics="#parse(${topics})")
So, I wrote this method:
public String[] parse(String s) {
    ExpressionParser parser = new SpelExpressionParser();
    return Arrays.stream(s.split(","))
            .map(key -> (String) parser.parseExpression(key).getValue())
            .toArray(String[]::new);
}
But the parse method is not invoked.
So I tried to do it directly in the annotation, like this:
@KafkaListener(topics="#{Arrays.stream('${topics}'.split(',')).map(key->${key}).toArray(String[]::new)}")
But this solution also gives me errors.
Edit 2:
Modifying it this way, the method is invoked:
@KafkaListener(topics="parse()")
@Bean
public String[] parse(String s) {
    ...
}
The problem is how to get the topics property inside the method.
You can't invoke arbitrary methods like that; you need to reference a bean, e.g. #someBean.parse(...); using #parse would require registering a static method as a SpEL function.
However, this works for me and is much simpler:
list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=${list1},${list2},${list3}
and
#KafkaListener(id = "so64390079", topics = "#{'${topics}'.split(',')}")
EDIT
If you can't use placeholders in topics, this works...
@SpringBootApplication
public class So64390079Application {

    public static void main(String[] args) {
        SpringApplication.run(So64390079Application.class, args);
    }

    @KafkaListener(id = "so64390079", topics = "#{#parser.parse('${topics}')}")
    public void listen(String in) {
        System.out.println(in);
    }
}

@Component
class Parser implements EnvironmentAware {

    private Environment environment;

    @Override
    public void setEnvironment(Environment environment) {
        this.environment = environment;
    }

    public String[] parse(String[] topics) {
        StringBuilder sb = new StringBuilder();
        for (String topic : topics) {
            sb.append(this.environment.getProperty(topic));
            sb.append(',');
        }
        return StringUtils.commaDelimitedListToStringArray(sb.toString().substring(0, sb.length() - 1));
    }
}
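Note that with this variant the properties can keep the layout from the question, with topics listing property names rather than placeholders, because #parser.parse resolves each name against the Environment:

list1=valueA,valueB
list2=valueC
list3=valueD,valueE
topics=list1,list2,list3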

Custom FileInputFormat always assigns one file split to one slot

I have been writing protobuf records to our S3 buckets, and I want to use the Flink DataSet API to read them. So I implemented a custom FileInputFormat to achieve this. The code is below.
public class ProtobufInputFormat extends FileInputFormat<StandardLog.Pageview> {

    private transient boolean reachedEnd = false;

    public ProtobufInputFormat() {
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return reachedEnd;
    }

    @Override
    public StandardLog.Pageview nextRecord(StandardLog.Pageview reuse) throws IOException {
        StandardLog.Pageview pageview = StandardLog.Pageview.parseDelimitedFrom(stream);
        if (pageview == null) {
            reachedEnd = true;
        }
        return pageview;
    }

    @Override
    public boolean supportsMultiPaths() {
        return true;
    }
}
public class BatchReadJob {

    public static void main(String... args) throws Exception {
        String readPath1 = args[0];

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ProtobufInputFormat inputFormat = new ProtobufInputFormat();
        inputFormat.setNestedFileEnumeration(true);
        inputFormat.setFilePaths(readPath1);

        DataSet<StandardLog.Pageview> dataSource = env.createInput(inputFormat);

        dataSource.map(new MapFunction<StandardLog.Pageview, String>() {
            @Override
            public String map(StandardLog.Pageview value) throws Exception {
                return value.getId();
            }
        }).writeAsText("s3://xxx", FileSystem.WriteMode.OVERWRITE);

        env.execute();
    }
}
The problem is that Flink always assigns one file split to one parallelism slot. In other words, it always processes the same number of file splits as the parallelism.
I want to know what's the correct way of implementing custom FileInputFormat.
Thanks.
I believe the behavior you're seeing is because ExecutionJobVertex calls the FileInputFormat.createInputSplits() method with a minNumSplits parameter equal to the vertex (data source) parallelism. So if you want different behavior, you'd have to override the createInputSplits() method.
Though you didn't say what behavior you actually wanted. If, for example, you just want one split per file, then you could override the testForUnsplittable() method in your subclass of FileInputFormat to always return true; it should also set the (protected) unsplittable boolean to true.
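A minimal sketch of that override inside your ProtobufInputFormat, relying on the testForUnsplittable(FileStatus) hook and the protected unsplittable field described above:

@Override
protected boolean testForUnsplittable(FileStatus pathFile) {
    // Treat every file as unsplittable so each file yields exactly one split.
    this.unsplittable = true;
    return true;
}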

Add Camel route at runtime using endpoints configured in a property file

I have a Spring application and want to add Camel routes dynamically during application startup. Endpoints are configured in a property file and are loaded at run time.
Using the Java DSL, I am using a for loop to create all the routes:
for (int i = 0; i < allEndPoints; i++) {
    DynamcRouteBuilder route = new DynamcRouteBuilder(context, fromUri, toUri);
    camelContext.addRoutes(route);
}

private class DynamcRouteBuilder extends RouteBuilder {

    private final String from;
    private final String to;

    private DynamcRouteBuilder(CamelContext context, String from, String to) {
        super(context);
        this.from = from;
        this.to = to;
    }

    @Override
    public void configure() throws Exception {
        from(from).to(to);
    }
}
but I get the exception below while creating the first route:
Failed to create route file_routedirect: at: >>> OnException[[class org.apache.camel.component.file.GenericFileOperationFailedException] -> [Log[Exception trapped ${exception.class}], process[Processor#0x0]]] <<< in route: Route(file_routedirect:)[[From[direct:... because of ref must be specified on: process[Processor#0x0]\n\ta
I am not sure what the issue is. Does anyone have a suggestion or a fix for this? Thanks.
Well, to create routes in an iteration it is nice to have some object that holds the different values for one route. Let's call this RouteConfiguration, a simple POJO with String fields for from, to and routeId.
We are using YAML files to configure such things because you have a real List format instead of using "flat lists" in property files (route[0].from, route[0].to).
If you use Spring you can directly transform such a "list of object configurations" into a Collection of objects using @ConfigurationProperties.
When you are able to create such a Collection of value objects, you can simply iterate over it. Here is a strongly simplified example.
@Override
public void configure() {
    createConfiguredRoutes();
}

void createConfiguredRoutes() {
    configuration.getRoutes().forEach(this::addRouteToContext);
}

// Implement the route that is added in an iteration
private void addRouteToContext(final RouteConfiguration routeConfiguration) {
    try {
        this.camelContext.addRoutes(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                from(routeConfiguration.getFrom())
                    .routeId(routeConfiguration.getRouteId())
                    ...
                    .to(routeConfiguration.getTo());
            }
        });
    } catch (Exception e) {
        // addRoutes throws a checked Exception, which a method used as a
        // Consumer in forEach must not propagate
        throw new RuntimeException(e);
    }
}
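For completeness, the value object and its @ConfigurationProperties binding could look roughly like this (a sketch; the camel prefix and the RoutesProperties class name are my own):

// Simple POJO holding the values for one route
public class RouteConfiguration {
    private String from;
    private String to;
    private String routeId;

    public String getFrom() { return from; }
    public void setFrom(String from) { this.from = from; }
    public String getTo() { return to; }
    public void setTo(String to) { this.to = to; }
    public String getRouteId() { return routeId; }
    public void setRouteId(String routeId) { this.routeId = routeId; }
}

// Binds e.g. camel.routes[0].from, camel.routes[0].to, camel.routes[0].route-id
@Component
@ConfigurationProperties(prefix = "camel")
public class RoutesProperties {
    private final List<RouteConfiguration> routes = new ArrayList<>();

    public List<RouteConfiguration> getRoutes() { return routes; }
}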

IronRuby - Wrong number of arguments

I just started using IronRuby. This is my test class:
class Program
{
    static void Main(string[] args)
    {
        var path = @"C:\Users\frays\Desktop\test.rb";
        var engine = Ruby.CreateEngine();
        var scope = engine.Runtime.CreateScope();
        scope.SetVariable("sendNext", new Action<string>(SendNext));
        engine.ExecuteFile(path, scope);
        Console.Read();
    }

    private static void SendNext(string text)
    {
        Console.WriteLine(text);
    }
}
And this is my test script:
sendNext 'heyyy'
However, when trying to run the program it throws an exception saying wrong number of arguments (1 for 0), even though the method definitely takes a string as an argument.
According to Calling IronRuby from C# with a delegate, calling the delegate like a method is not possible, but you can just call its Invoke method:
sendNext.Invoke('heyyy')

How do you hash a string in a PCL project?

We've tried using this library: http://msdn.microsoft.com/en-us/library/system.security.cryptography.hashalgorithm(v=vs.110).aspx
And this code:
public static byte[] GetHash(string inputString)
{
    HashAlgorithm algorithm = SHA1.Create();
    return algorithm.ComputeHash(Encoding.UTF8.GetBytes(inputString));
}

public static string GetHashString(string inputString)
{
    StringBuilder sb = new StringBuilder();
    foreach (byte b in GetHash(inputString))
        sb.Append(b.ToString("X2"));
    return sb.ToString();
}
But the library doesn't seem to be available.
If a certain API is not available in a PCL, you would normally create an interface for it and inject an implementation in the constructor.
In your example, it would be something like this
PCL library project
public interface IHashService
{
    byte[] ComputeHash(byte[] data);
}
Platform specific project
public class Sha1HashService : IHashService
{
    public byte[] ComputeHash(byte[] data)
    {
        using (var algorithm = SHA1.Create())
        {
            return algorithm.ComputeHash(data);
        }
    }
}
It is good practice to avoid static methods and to use dependency injection whenever possible.
Also, you probably want your interface to be generic (taking bytes as an argument rather than a string), for the very same reason: it avoids a dependency on Encoding.UTF8.GetBytes.
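To illustrate the constructor injection, a consumer in the PCL project might look like this (a sketch; the HashStringService class name is my own):

public class HashStringService
{
    private readonly IHashService _hashService;

    // The platform-specific implementation is supplied from outside the PCL
    public HashStringService(IHashService hashService)
    {
        _hashService = hashService;
    }

    public string GetHashString(string inputString)
    {
        var sb = new StringBuilder();
        foreach (byte b in _hashService.ComputeHash(Encoding.UTF8.GetBytes(inputString)))
            sb.Append(b.ToString("X2"));
        return sb.ToString();
    }
}

In the platform-specific project you would then wire it up, e.g. new HashStringService(new Sha1HashService()).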
