I want to:
Read a file,
Find a line that starts with a certain word
Replace that line with a new one
Is there an efficient way of doing this using Java 8 streams?
Try this sample program. It reads a file and looks for a pattern; if the pattern is found, that line is replaced with a new one.
In this class:
- In the method getChangedString, I read each line (sourceFile is the path to the file being read)
- Using map, I check each line
- If I find the matching line, I replace it
- Otherwise, I leave the existing line as it is
- Finally, I return the result as a List
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
public class FileWrite {
    private static String sourceFile = "c://temp//data.js";
    private static String replaceString = "newOrders: [{pv: 7400},{pv: 1398},{pv: 1800},{pv: 3908},{pv: 4800},{pv: 3490},{pv: 4300}";
    public static void main(String[] args) throws IOException {
        Files.write(Paths.get(sourceFile), getChangedString());
    }
    /*
     * Goal of this method is to read each line in the js file.
     * If it finds a line which starts with newOrders,
     * it replaces that line with our value;
     * otherwise it returns the same line.
     */
    private static List<String> getChangedString() throws IOException {
        return Files.lines(Paths.get(sourceFile)) // get each line from the source file
                // for each line, check whether it starts with newOrders; if it does, replace it with our String
                .map(line -> {
                    if (line.startsWith("newOrders:")) {
                        return replaceString;
                    } else {
                        return line;
                    }
                })
                // peek to print values to the console; this can be removed after testing
                .peek(System.out::println)
                // finally put everything in a collection and send it back
                .collect(Collectors.toList());
    }
}
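One small caveat worth noting: the stream returned by Files.lines holds the file open until it is closed, so in longer-lived code it is safer to wrap it in try-with-resources. A minimal sketch of the same method written that way (same sourceFile and replaceString fields as above):
private static List<String> getChangedString() throws IOException {
    // try-with-resources closes the file handle that Files.lines keeps open
    try (java.util.stream.Stream<String> lines = Files.lines(Paths.get(sourceFile))) {
        return lines
                .map(line -> line.startsWith("newOrders:") ? replaceString : line)
                .collect(Collectors.toList());
    }
}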
I would like to extract specific information from 108 XML files. The general source is also an XML file with further URLs as resources.
XML-Source
The static method getURL() extracts the URLs in order to set them as URL paths within a for loop in the main method. The program works, but it takes approx. 5 minutes to get the data from all files. Any ideas how to increase the performance?
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.Namespace;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
public class XmlReader2 {
public static void main(String[] args) throws IOException {
for (int i = 0; i < getURL().size(); i++) {
URL url = new URL(getURL().get(i));
try {
Document doc = new SAXBuilder().build(url);
final String getDeath = String
.format("//ns:teiHeader/ns:profileDesc/ns:particDesc/ns:listPerson/ns:person/ns:death");
XPathExpression<Element> xpath = XPathFactory.instance().compile(getDeath, Filters.element(), null,
Namespace.getNamespace("ns", "http://www.tei-c.org/ns/1.0"));
String test;
for (Element elem : xpath.evaluate(doc)) {
test = elem.getValue();
if (elem.getAttributes().size() != 0) {
test = elem.getAttributes().get(0).getValue();
}
System.out.println(elem.getName() + ": " + test);
}
} catch (org.jdom2.JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public static List<String> getURL() throws IOException {
List<String> urlList = new ArrayList<>();
URL urlSource = new URL("http://www.steinheim-institut.de:80/cgi-bin/epidat?info=resources-mz1");
try {
Document doc = new SAXBuilder().build(urlSource);
final String getURL = String.format("/collection");
XPathExpression<Element> xpath = XPathFactory.instance().compile(getURL, Filters.element());
int i = 0;
for (Element elem : xpath.evaluate(doc)) {
while (i != elem.getChildren().size()) {
String url = elem.getChildren().get(i).getAttributes().get(1).getValue();
// System.out.println(url);
urlList.add(url);
i++;
}
}
} catch (org.jdom2.JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return urlList;
}
}
A delay of this order may be caused by fetching files from the web. Find a tool for monitoring HTTP requests issued from your machine to see what is going on. Look in particular for requests for common W3C files such as the XHTML DTD: because these files are requested so often, W3C deliberately injects a delay into the process to encourage people to use local copies of the files. If it turns out that this is the problem, there are various techniques you can use to access cached local copies.
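If DTD fetches do turn out to be the culprit, one option (a sketch, not something taken from the code above) is to install a SAX EntityResolver on the SAXBuilder so that external entities such as remote DTDs are served from an empty local source instead of the network. This is only appropriate if the documents don't rely on entities defined in those DTDs:
SAXBuilder builder = new SAXBuilder();
// Resolve external entities (e.g. remote DTDs) to an empty local source
// instead of fetching them over HTTP on every parse.
builder.setEntityResolver((publicId, systemId) ->
        new org.xml.sax.InputSource(new java.io.StringReader("")));
Document doc = builder.build(url);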
Having said that, I'm puzzled by the logic of your code. The method getURL() appears to fetch and parse the document at http://www.steinheim-institut.de:80/cgi-bin/epidat?info=resources-mz1 every time it is called, and yet you are calling it within a loop, even using getURL().size() as your terminating condition.
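A minimal sketch of that point: fetch the URL list once, compile the XPath expression once, and reuse both inside the loop (exception handling omitted):
List<String> urls = getURL();   // parse the source document a single time
SAXBuilder builder = new SAXBuilder();
XPathExpression<Element> xpath = XPathFactory.instance().compile(
        "//ns:teiHeader/ns:profileDesc/ns:particDesc/ns:listPerson/ns:person/ns:death",
        Filters.element(), null,
        Namespace.getNamespace("ns", "http://www.tei-c.org/ns/1.0"));
for (String u : urls) {
    Document doc = builder.build(new URL(u));   // one HTTP fetch per document
    for (Element elem : xpath.evaluate(doc)) {
        // ... process elem exactly as in the original loop
    }
}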
The site suggests that I can use several flags:
https://nlp.stanford.edu/software/openie.html
But how do I use them? I tried doing it this way:
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
/**
* A demo illustrating how to call the OpenIE system programmatically.
*/
public class OpenIEDemo {
public static void main(String[] args) throws Exception {
// Create the Stanford CoreNLP pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Annotate an example document.
Annotation doc = new Annotation("Obama was born in Hawaii. He is our president.");
pipeline.annotate(doc);
// Loop over sentences in the document
for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
// Get the OpenIE triples for the sentence
Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
// Print the triples
for (RelationTriple triple : triples) {
System.out.println(triple.confidence + "\t" +
triple.subjectLemmaGloss() + "\t" +
triple.relationLemmaGloss() + "\t" +
triple.objectLemmaGloss());
}
}
}
}
I have added
props.setProperty("openieformat","ollie");
props.setProperty("openieresolve_coref","1");
But it's not working.
For StanfordCoreNLP, flags/properties for individual annotators are set with a property name of the form annotator.flag. And boolean flags have the value "false" or "true". So, what you have is close to right, but needs to be:
props.setProperty("openie.format","ollie");
props.setProperty("openie.resolve_coref","true");
I'm trying to use the Stanford tokenizer with the following example from their website:
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
public class TokenizerDemo {
public static void main(String[] args) throws IOException {
for (String arg : args) {
// option #1: By sentence.
DocumentPreprocessor dp = new DocumentPreprocessor(arg);
for (List sentence : dp) {
System.out.println(sentence);
}
// option #2: By token
PTBTokenizer ptbt = new PTBTokenizer(new FileReader(arg),
new CoreLabelTokenFactory(), "");
for (CoreLabel label; ptbt.hasNext(); ) {
label = ptbt.next();
System.out.println(label);
}
}
}
}
and I get the following error when I try to compile it:
TokenizerDemo.java:24: error: incompatible types: Object cannot be converted to CoreLabel
label = ptbt.next();
Does anyone know what the reason might be? In case you are interested, I'm using Java 1.8 and made sure that CLASSPATH contains the jar file.
Try parameterizing the PTBTokenizer class. For example:
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader(arg),
new CoreLabelTokenFactory(), "");
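With that change, the token loop from the question compiles as-is, because next() now returns CoreLabel instead of Object (same reader and empty options string as in the demo):
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader(arg),
        new CoreLabelTokenFactory(), "");
while (ptbt.hasNext()) {
    CoreLabel label = ptbt.next();   // no cast needed once the tokenizer is parameterized
    System.out.println(label);
}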
I am processing CSV files using FlatFileItemReader.
Sometimes I am getting blank lines within the input file.
When that happens, the whole step stops. I want to skip those lines and proceed normally.
I tried to add an exception handler to the step in order to catch the exception instead of having the whole step stopped:
@Bean
public Step processSnidUploadedFileStep() {
return stepBuilderFactory.get("processSnidFileStep")
.<MyDTO, MyDTO>chunk(numOfProcessingChunksPerFile)
.reader(snidFileReader(OVERRIDDEN_BY_EXPRESSION))
.processor(manualUploadAsyncItemProcessor())
.writer(manualUploadAsyncItemWriter())
.listener(logProcessListener)
.throttleLimit(20)
.taskExecutor(infrastructureConfigurationConfig.taskJobExecutor())
.exceptionHandler((context, throwable) -> logger.error("Skipping record on file. cause="+ ((FlatFileParseException)throwable).getCause()))
.build();
}
Since I am processing in chunks, when a blank line arrives and the exception is caught, the whole chunk is skipped (the chunk might contain valid lines from the CSV file, and they are skipped as well).
Any idea how to do this right when processing file in chunks?
Thanks,
ray.
After editing my code, it is still not skipping:
public Step processSnidUploadedFileStep() {
SimpleStepBuilder<MyDTO, MyDTO> builder = new SimpleStepBuilder<MyDTO, MyDTO>(stepBuilderFactory.get("processSnidFileStep"));
return builder
.<PushItemDTO, PushItemDTO>chunk(numOfProcessingChunksPerFile)
.faultTolerant().skip(FlatFileParseException.class)
.reader(snidFileReader(OVERRIDDEN_BY_EXPRESSION))
.processor(manualUploadAsyncItemProcessor())
.writer(manualUploadAsyncItemWriter())
.listener(logProcessListener)
.throttleLimit(20)
.taskExecutor(infrastructureConfigurationConfig.taskJobExecutor())
.build();
}
We created a custom SimpleRecordSeparatorPolicy which tells the reader to skip blank lines. That way, when we read 100 records of which 3 are blank lines, those are ignored without an exception and 97 records are written.
Here is code:
package com.my.package;
import org.springframework.batch.item.file.separator.SimpleRecordSeparatorPolicy;
public class BlankLineRecordSeparatorPolicy extends SimpleRecordSeparatorPolicy {
    @Override
    public boolean isEndOfRecord(final String line) {
        return line.trim().length() != 0 && super.isEndOfRecord(line);
    }
    @Override
    public String postProcess(final String record) {
        if (record == null || record.trim().length() == 0) {
            return null;
        }
        return super.postProcess(record);
    }
}
And here is reader:
package com.my.package;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.stereotype.Component;
@Component
@StepScope
public class CustomReader extends FlatFileItemReader<CustomClass> {
    @Override
    public void afterPropertiesSet() throws Exception {
        setLineMapper(new DefaultLineMapper<CustomClass>() {
            {
                // configuration of line mapper
            }
        });
        setRecordSeparatorPolicy(new BlankLineRecordSeparatorPolicy());
        super.afterPropertiesSet();
    }
}
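To tie this back to the step from the question, the custom reader is simply used in place of the original reader bean. A sketch, assuming CustomClass here stands in for the question's MyDTO type and the other beans are unchanged:
@Bean
public Step processSnidUploadedFileStep(CustomReader customReader) {
    return stepBuilderFactory.get("processSnidFileStep")
            .<CustomClass, CustomClass>chunk(numOfProcessingChunksPerFile)
            .reader(customReader)   // blank lines are now dropped before line mapping
            .processor(manualUploadAsyncItemProcessor())
            .writer(manualUploadAsyncItemWriter())
            .listener(logProcessListener)
            .taskExecutor(infrastructureConfigurationConfig.taskJobExecutor())
            .build();
}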
I am using hadoop-1.2.1 and trying to run a simple RowCount HBase job using ToolRunner. However, no matter what I seem to try, hadoop cannot find the map class. The jar file is being copied correctly into hdfs, but I can't seem to figure out where it is going wrong. Please help!
Here is the code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HBaseRowCountToolRunnerTest extends Configured implements Tool
{
// What to copy.
public static final String JAR_NAME = "myJar.jar";
public static final String LOCAL_JAR = <path_to_jar> + JAR_NAME;
public static final String REMOTE_JAR = "/tmp/"+JAR_NAME;
public static void main(String[] args) throws Exception
{
Configuration config = HBaseConfiguration.create();
//All connection configs set here -- omitted to post the code
config.set("tmpjars", REMOTE_JAR);
FileSystem dfs = FileSystem.get(config);
System.out.println("pathString = " + (new Path(LOCAL_JAR)).toString() + " \n");
// Copy jar file to remote.
dfs.copyFromLocalFile(new Path(LOCAL_JAR), new Path(REMOTE_JAR));
// Get rid of jar file when we're done.
dfs.deleteOnExit(new Path(REMOTE_JAR));
// Run the job.
System.exit(ToolRunner.run(config, new HBaseRowCountToolRunnerTest(), args));
}
@Override
public int run(String[] args) throws Exception
{
Job job = new RowCountJob(getConf(), "testJob", "myLittleHBaseTable");
return job.waitForCompletion(true) ? 0 : 1;
}
public static class RowCountJob extends Job
{
RowCountJob(Configuration conf, String jobName, String tableName) throws IOException
{
super(conf, RowCountJob.class.getCanonicalName() + "_" + jobName);
setJarByClass(getClass());
Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setFilter(new FirstKeyOnlyFilter());
setOutputFormatClass(NullOutputFormat.class);
TableMapReduceUtil.initTableMapperJob(tableName, scan,
RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, this);
setNumReduceTasks(0);
}
}//end public static class RowCountJob extends Job
//Mapper that runs the count
//TableMapper -- TableMapper<KEYOUT, VALUEOUT> (*OUT by type)
public static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result>
{
//Counter enumeration to count the actual rows
public static enum Counters {ROWS}
/**
* Maps the data.
*
* @param row The current table row key.
* @param values The columns.
* @param context The current context.
* @throws IOException When something is broken with the data.
* @see org.apache.hadoop.mapreduce.Mapper#map(KEYIN, VALUEIN,
* org.apache.hadoop.mapreduce.Mapper.Context)
*/
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException
{
// Count every row containing data times 2, whether it's in qualifiers or values
context.getCounter(Counters.ROWS).increment(2);
}
}//end public static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result>
}//end public class HBaseRowCountToolRunnerTest
Ok- I found a workaround to the problem and thought that I would share for all others having similar issues...
As it turns out, I abandoned the tmpjars configuration option and just copied the jar file directly into the DistributedCache from the code itself. Here is what it looks like:
// Copy jar file to remote.
FileSystem dfs = FileSystem.get(conf);
dfs.copyFromLocalFile(new Path(LOCAL_JAR), new Path(REMOTE_JAR));
// Get rid of jar file when we're done.
dfs.deleteOnExit(new Path(REMOTE_JAR));
//Place it in the distributed cache
DistributedCache.addFileToClassPath(new Path(REMOTE_JAR), conf, dfs);
Perhaps it doesn't solve what is going on with tmpjars, but it does work.
I got the same problem today. Finally, I found it was because I forgot to insert the following statement in the driver class...
job.setJarByClass(HBaseTestDriver.class);
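In the context of the question's code, that call goes wherever the Job is configured, for example in the driver's run() method (a sketch using the class names from the question):
@Override
public int run(String[] args) throws Exception {
    Job job = new RowCountJob(getConf(), "testJob", "myLittleHBaseTable");
    // Point Hadoop at the jar that contains the driver (and the mapper),
    // so the classes can be found on the cluster.
    job.setJarByClass(HBaseRowCountToolRunnerTest.class);
    return job.waitForCompletion(true) ? 0 : 1;
}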