PathPatternParser - Cannot combine patterns: *.csv and *.txt - spring

I have defined an SftpInboundFileSynchronizer. I want to filter remote files so that only csv and txt files are picked up. The file extension array contains the values *.csv and *.txt.
SftpInboundFileSynchronizer fileSynchronizer = new SftpInboundFileSynchronizer(sftpSessionFactory());
PathPatternParser pp = new PathPatternParser();
PathPattern pattern = null;
for (String fileExtension : fileExtensionArray) {
    if (pattern == null) {
        pattern = pp.parse(fileExtension);
    } else {
        pattern.combine(pp.parse(fileExtension));
    }
}
fileSynchronizer.setFilter(new SftpSimplePatternFileListFilter(pattern.getPatternString()));
The above code throws the following exception:
java.lang.IllegalArgumentException: Cannot combine patterns: *.csv and *.txt
How to fix this error?
Thanks

The PathPattern.combine method will, as the name implies, combine the patterns, essentially turning them into an AND. A file name can never end in both .txt and .csv; what you need is an OR, and as far as I know that cannot be expressed with a PathPattern.
However, with a regular expression this is fairly easy to do, and Spring Integration ships with a component that handles it for you: use the SftpRegexPatternFileListFilter instead of the SftpSimplePatternFileListFilter.
SftpRegexPatternFileListFilter filter = new SftpRegexPatternFileListFilter("(?i)^.*\\.(csv|txt)$");
Something like the above expression should work (I'm not a pro on regular expressions, so it might need some tweaking).
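Wired into the synchronizer from the question, that would look roughly like this (a sketch, assuming sftpSessionFactory() is defined as in your code):
SftpInboundFileSynchronizer fileSynchronizer = new SftpInboundFileSynchronizer(sftpSessionFactory());
// accept files ending in .csv or .txt, case-insensitively
fileSynchronizer.setFilter(new SftpRegexPatternFileListFilter("(?i)^.*\\.(csv|txt)$"));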

Related

How to read rules from a file

I am trying to match sentences against rules.
I am able to compile multiple rules and match them against CoreLabel tokens using the following approach:
TokenSequencePattern pattern1 = TokenSequencePattern.compile("([{tag:/NN.*/}])");
TokenSequencePattern pattern2 = TokenSequencePattern.compile("([{tag:/NN.*/}])");
List<TokenSequencePattern> tokenSequencePatterns = new ArrayList<>();
tokenSequencePatterns.add(pattern1);
tokenSequencePatterns.add(pattern2);
MultiPatternMatcher multiMatcher = TokenSequencePattern.getMultiPatternMatcher(tokenSequencePatterns);
List<SequenceMatchResult<CoreMap>> matched=multiMatcher.findNonOverlapping(tokens);
I have many rules inside a file. Is there any way to load the rule file?
I have seen a way to load the rules from a file using the following method:
CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), "en.rules");
List<MatchedExpression> matched = extractor.extractExpressions((CoreMap)sentence);
However, it accepts a CoreMap as its argument, whereas I need to match against CoreLabel tokens.
Please see this comprehensive write-up on TokensRegex:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
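If all you need is to feed the MultiPatternMatcher from a file, one simple approach (a sketch, assuming one TokensRegex pattern per line in a hypothetical rules.txt, and using only the APIs from your snippet) is to compile each line yourself:
List<TokenSequencePattern> patterns = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new FileReader("rules.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
            // compile each non-empty line as a TokensRegex pattern
            patterns.add(TokenSequencePattern.compile(line));
        }
    }
}
MultiPatternMatcher<CoreMap> multiMatcher = TokenSequencePattern.getMultiPatternMatcher(patterns);
List<SequenceMatchResult<CoreMap>> matched = multiMatcher.findNonOverlapping(tokens);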

Parsing several csv files using Spring Batch

I need to parse several csv files from a given folder. As each csv has different columns, there is a separate table in the DB for each csv. I need to know:
Does Spring Batch provide any mechanism that scans through the given folder so that I can pass those files one by one to the reader?
As I am trying to make the reader/writer generic, is it possible to just read the column header of each csv and, based on that, build the tokenizer and the insert query?
Code sample
public ItemReader<Gdp> reader1() {
    FlatFileItemReader<Gdp> reader1 = new FlatFileItemReader<Gdp>();
    reader1.setResource(new ClassPathResource("datagdp.csv"));
    reader1.setLinesToSkip(1);
    reader1.setLineMapper(new DefaultLineMapper<Gdp>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(new String[] { "region", "gdpExpend", "value" });
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Gdp>() {
                {
                    setTargetType(Gdp.class);
                }
            });
        }
    });
    return reader1;
}
Use a MultiResourceItemReader to scan all files.
I think you need a sort of classifier-aware ItemReader as the MultiResourceItemReader.delegate, but Spring Batch doesn't offer one, so you have to write your own.
For ItemProcessor and ItemWriter, Spring Batch does offer classifier-aware implementations (ClassifierCompositeItemProcessor and ClassifierCompositeItemWriter).
Obviously, the more different input files you have, the more XML configuration you must write, but it should be straightforward to do.
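For the folder-scanning part, the MultiResourceItemReader setup might look roughly like this (a sketch, assuming the files sit in a hypothetical csv-input directory and that reader1() from the question is changed to return FlatFileItemReader<Gdp> so it can act as the delegate):
MultiResourceItemReader<Gdp> multiReader = new MultiResourceItemReader<>();
// scan the folder for csv files (getResources throws IOException, so declare or handle it)
Resource[] resources = new PathMatchingResourcePatternResolver().getResources("file:csv-input/*.csv");
multiReader.setResources(resources);
// each matched resource is handed to the delegate reader in turn
multiReader.setDelegate(reader1());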
I suppose you are expecting this kind of implementation:
In the partition step builder, read all the file names, the file headers, and the insert query for the writer, and save them in the ExecutionContext.
In the slave step, for every reader and writer, pass on the ExecutionContext to get the file to read, the header for the tokenizer, and the insert query to be used by that writer.
This resolves your question.
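A minimal sketch of that partitioning idea (assuming a hypothetical csv-input folder; each partition's ExecutionContext carries one file name, and the header/insert query could be stored the same way):
public class CsvFilePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        File[] files = new File("csv-input").listFiles((dir, name) -> name.endsWith(".csv"));
        int i = 0;
        for (File file : files) {
            ExecutionContext context = new ExecutionContext();
            // the slave step reads these keys back to configure its reader/writer
            context.putString("fileName", file.getAbsolutePath());
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}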
Answers to your questions:
I don't know of a specific mechanism in Spring Batch to scan files.
You can use opencsv as a generic CSV reader; it offers a lot of ways to read files.
About OpenCSV:
If you are using a Maven project, import this dependency:
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.0</version>
</dependency>
You can read your files into objects for a specific format, or handle generic headers, like the example below:
private static List<PeopleData> extrairDadosPeople() throws IOException {
    // 'people' is the CSV file (or path) to read, defined elsewhere in the class
    CSVReader readerPeople = new CSVReader(new FileReader(people));
    List<PeopleData> listPeople = new ArrayList<PeopleData>();
    String[] nextLine;
    while ((nextLine = readerPeople.readNext()) != null) {
        PeopleData peopleData = new PeopleData();
        peopleData.setIncludeData(nextLine[0]);
        peopleData.setPartnerCode(Long.valueOf(nextLine[1]));
        listPeople.add(peopleData);
    }
    readerPeople.close();
    return listPeople;
}
There are a lot of other ways to read CSV files using opencsv:
If you want to use an Iterator style pattern, you might do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    System.out.println(nextLine[0] + nextLine[1] + "etc...");
}
Or, if you just want to slurp the whole lot into a List, call readAll()...
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
List<String[]> myEntries = reader.readAll();
which will give you a List of String[] that you can iterate over. If all else fails, check out the Javadocs.
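For example, iterating over those rows could look like this (a trivial sketch):
for (String[] row : myEntries) {
    // each row is one line of the csv, already split into fields
    System.out.println(String.join(", ", row));
}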
If you want to customize quote characters and separators, you'll find constructors that cater for supplying your own separator and quote characters. Say you're using a tab for your separator, you can do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t');
And if you single quoted your escaped characters rather than double quote them, you can use the three arg constructor:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'');
You may also skip the first few lines of the file if you know that the content doesn't start till later in the file. So, for example, you can skip the first two lines by doing:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'', 2);
Can I write csv files with opencsv?
Yes. There is a CSVWriter in the same package that follows the same semantics as the CSVReader. For example, to write a tab separated file:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), '\t');
// feed in your array (or convert your data to an array)
String[] entries = "first#second#third".split("#");
writer.writeNext(entries);
writer.close();
If you'd prefer to use your own quote characters, you may use the three arg version of the constructor, which takes a quote character (or feel free to pass in CSVWriter.NO_QUOTE_CHARACTER).
You can also customise the line terminators used in the generated file (which is handy when you're exporting from your Linux web application to Windows clients). There is a constructor argument for this purpose.
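For instance, combining both of those options might look like this (a sketch; the exact constructor overloads can vary between opencsv versions):
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), '\t',
        CSVWriter.NO_QUOTE_CHARACTER, "\r\n");  // tab-separated, unquoted, Windows line endings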
Can I dump out SQL tables to CSV?
Yes you can. There is a feature on CSVWriter so you can pass writeAll() a ResultSet.
java.sql.ResultSet myResultSet = ....
writer.writeAll(myResultSet, includeHeaders);
Is there a way to bind my CSV file to a list of Javabeans?
Yes there is. There is a set of classes to allow you to bind a CSV file to a list of JavaBeans based on column name, column position, or a custom mapping strategy. You can find the new classes in the com.opencsv.bean package. Here's how you can map to a java bean based on the field positions in your CSV file:
ColumnPositionMappingStrategy strat = new ColumnPositionMappingStrategy();
strat.setType(YourOrderBean.class);
String[] columns = new String[] {"name", "orderNumber", "id"}; // the fields to bind to in your JavaBean
strat.setColumnMapping(columns);
CsvToBean csv = new CsvToBean();
List list = csv.parse(strat, yourReader);
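Since the original question asks about driving everything from the CSV header, the name-based strategy is worth mentioning too (a sketch, assuming a hypothetical YourOrderBean whose field names match the header row; API details vary slightly between opencsv versions):
HeaderColumnNameMappingStrategy<YourOrderBean> strat = new HeaderColumnNameMappingStrategy<>();
strat.setType(YourOrderBean.class);
CsvToBean<YourOrderBean> csv = new CsvToBean<>();
List<YourOrderBean> list = csv.parse(strat, yourReader);  // columns matched by header name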

How do I autoscan local documents to add words to a custom dictionary?

I'd like my dictionary to know more of the words I use - and don't want to manually add all possible words as I end up typing them (I'm a biologist/bioinformatician - there's lots of jargon and specific software and species names). Instead I want to:
Take a directory of existing documents. These are PDFs or Word/LaTeX documents of scientific articles; I guess they could "easily" be converted to plain text.
Pull out all words that are not in the "normal" dictionary.
Add these to my local custom dictionary (on my mac that's ~/Library/Spelling/LocalDictionary). But it would make sense to add them to the libreoffice/word/ispell custom dictionaries as well.
1 and 3 are easy. How can I do 2? Thanks!
As far as I understand, you want to remove duplicates (words that already exist in the system dictionary). You might want to ask first if this is really necessary, though. I guess they won't cause any problems and won't slow spell-checking down excessively, so in my opinion there is no real reason for step 2.
I think you'll have a much harder time with step 1. Extracting plain-text from a PDF may sound easy, but it certainly is not. You'll end up with plenty of unknown symbols. You need to fix split-words at the end of a line and you probably want to exclude equations/links/numbers/etc. before adding all these to your dictionary.
But if you have some tool to get this done and can create a couple of .txt files really containing only the words/sentences you need, then I would go with something like the following python code to "solve" the merge for your local dictionary only. Of course you can also extend this to load the system dictionary (wherever that is?) and merge it the same way I show below.
Please note that I left out any error handling on purpose.
Save as import_to_dict.py, adjust the paths to your requirements and call with python import_to_dict.py
#!/usr/bin/env python
import os, re

# 1 - load existing dictionaries from files (adjust paths here!)
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+'  # add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))

with open(global_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
        # split into word-set (set guarantees no duplicates)
        word_set = set(re.split(reg_exp, words))
        # remove any words already present in either dictionary
        missing_words = (word_set - dictionary) - global_dictionary
        # add missing words to dictionary
        dictionary |= missing_words

# 3 - write dictionary file
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(dictionary))
Here is a basic Java program that will generate a text file containing all of the unique words in a directory of plain text files, one per line.
You can just replace the input directory and output file path strings with correct values for your system and run it.
import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String args[]) throws IOException {
        Hashtable<String, Boolean> dictionary = new Hashtable<String, Boolean>();
        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";
        File[] files = new File(inputDir).listFiles();
        BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));
        for (File file : files) {
            if (file.isFile()) {
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new FileReader(file.getCanonicalPath()));
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            dictionary.put(word, true);
                        }
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }
        Set<String> wordset = dictionary.keySet();
        Iterator<String> iter = wordset.iterator();
        while (iter.hasNext()) {
            out.write(iter.next());
            out.newLine();
        }
        out.close();
    }
}

URL rewriting using Tuckey

My project (we have Spring 3) needs to rewrite URLs from the form
localhost:8888/testing/test.htm?param1=val1&paramN=valN
to
localhost:8888/nottestinganymore/test.htm?param1=val1&paramN=valN
My current rule looks like:
<from>^/testing/(.*/)?([a-z0-9]*.htm.*)$</from>
<to type="passthrough">/nottestinganymore/$2</to>
But my query parameters are being doubled, so I am getting param1=val1,val1 and paramN=valN,valN...please help! This stuff is a huge pain.
To add to the above: we have use-query-string="true" on the project and I doubt I can change that.
The regular expression needs some tweaking. Tuckey uses the Java regular expression engine unless specified otherwise, so the best way to deal with this is to write a small test case that confirms your regular expression is correct. For example, a slightly tweaked version of your regular expression with a test case is shown below.
@Test
public void testRegularExpression() {
    String regexp = "/testing/(.*)([a-z0-9]*.htm.*)$";
    String url = "localhost:8888/testing/test.htm?param1=val1&paramN=valN";
    Pattern pattern = Pattern.compile(regexp);
    Matcher matcher = pattern.matcher(url);
    if (matcher.find()) {
        System.out.println("$1 : " + matcher.group(1));
        System.out.println("$2 : " + matcher.group(2));
    }
}
The above will print the following output:
$1 : test
$2 : .htm?param1=val1&paramN=valN
You can now modify the expression to see which "groups" you want to extract from the URL and then form the target URL.
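Since the doubling suggests the query string is captured in $2 and then appended again because of use-query-string, one thing to try (a sketch; whether it fully fixes your Tuckey rule depends on the rest of your configuration) is a regular expression that stops the capture before the '?':
String regexp = "/testing/(.*/)?([a-z0-9]*\\.htm)";
String url = "localhost:8888/testing/test.htm?param1=val1&paramN=valN";
Matcher matcher = Pattern.compile(regexp).matcher(url);
if (matcher.find()) {
    System.out.println("$2 : " + matcher.group(2)); // prints: $2 : test.htm
}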

How to skip empty lines when copying files with Gradle?

I would like to copy some files in Gradle and the resulting files should not contain any blank lines, i.e., the blank lines are not copied. I assume that can be done with filter(...) and maybe with the TokenFilter from Ant. However, I am not sure what the syntax would look like.
Thanks.
Gradle uses Ant for filtering because of its powerful implementation. For example, you can use the LineContainsRegExp Ant filter to drop any line that is empty or contains only whitespace.
An appropriate regexp is [^ \n\t\r]+ (keep only lines that contain at least one non-whitespace character).
You can use Ant directly from Gradle like this:
task copyTheAntWay {
    ant.copy(file: 'input.txt', tofile: 'output.txt', overwrite: true) {
        filterchain {
            filterreader(classname: 'org.apache.tools.ant.filters.LineContainsRegExp') {
                param(type: 'regexp', value: '[^ \n\t\r]+')
            }
        }
    }
}
or by using the Gradle CopySpec's filter method:
task copyGradlefied(type: Copy) {
    def regexp = new org.apache.tools.ant.types.RegularExpression()
    regexp.pattern = '[^ \n\t\r]+'
    from(projectDir) {
        include 'input.txt'
        filter(org.apache.tools.ant.filters.LineContainsRegExp, regexps: [regexp])
    }
    into "outputDir"
}
