why Array Index Out Of Bound Exception while re arranging doc file paragraph? - for-loop

Here is a code snippet. Its giving arrayindexoutofboundexception. dont know why ?
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xslf.usermodel.XSLFTextParagraph;
public class wordcount
{
public static void main(String[] args) throws Exception
{
File file = new File("E:\\myFiles\\abc.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for (int i = 0; i < fileData.length; i++)
{
// System.out.println(fileData[i].toString());
String[] paraword = fileData[i].toString().split(" ");
// out.println(paraword.length);
if(paraword[i].length() == 0 )
{
System.out.println("\n");
}
else if(paraword[i].length() > 0 && paraword[i].length() < 12)
{
for(int k=0 ; k < paraword[i].length()-1 ; k++)
{
System.out.println(paraword[k].toString());
}
}
else if(paraword[i].length() >= 12 )
{
for(int k=0 ; k < 12 ; k++)
{
System.out.println(paraword[k].toString());
}
}
System.out.println("\n");
}
}
}
This is the image of the abc.doc file
Note : Expected output will be printed on java console.
and the output will contain 12 words in each line. But after executing first line the error occurs.
Any help would be appreciated
TIA

Honestly, I'm not familiar with the apache.org API, but just by looking at your logic it looks like you want to replace every instance of:
paraword[i].length()
with:
paraword.length
Because it looks like you want to check how many words are in the paragraph and not how long the first word of the paragraph is. Correct me if I'm wrong, but I think that will fix you up.

Here is the correct code snippet
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ExtractWordDocument
{
public String myString() throws IOException
{
File file = new File("PATH FOR THE .doc FILE");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
ArrayList<Object> EntireDoc = new ArrayList<>();
for (int i = 0; i < fileData.length; i++)
{
String[] paraword = fileData[i].toString().split("\\s+");
if(paraword.length == 0 )
{EntireDoc.add("\n");}
else if(paraword.length > 0 && paraword.length < 12)
{
for(int k=0 ; k < paraword.length ; k++)
{EntireDoc.add(paraword[k].toString()+" ");}
}
else if(paraword.length > 12 )
{
java.util.List<String> arrAsList = Arrays.asList(paraword);
String formatedString = arrAsList.toString()
.replace(",", "") //remove the commas
.replace("[", "") //remove the right bracket
.replace("]", ""); //remove the left bracket
StringBuilder sb = new StringBuilder(formatedString);
int i1 = 0;
while ((i1 = sb.indexOf(" ", i1 + 75)) != -1)
{sb.replace(i1, i1 + 1, "\n");}
EntireDoc.add(sb.toString());
}
EntireDoc.add("\n");
}
String formatedString = EntireDoc.toString()
.replace(",", "") //remove the commas
.replace("[", "") //remove the right bracket
.replace("]", ""); //remove the left bracket
return formatedString;
}
public static void main(String[] args)
{
try{
System.out.print(new ExtractWordDocument().myString());
}
catch(IOException ioe){System.out.print(ioe);}
}
}
Note : This code will not print 12 words in each line but 75 charecters in each line.

Related

No need to check this! skip

So what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute isSo what I'm trying to do but clearly struggling to execute is
a single line in the text f
import java.util.Scanner;
import java.io.*;
public class hello
{
public static void main(String[] args) throws IOException
{
Scanner Keyboard = new Scanner(System.in);
System.out.print();
String response = Keyboard.nextLine();
File inFile = new File(response);
Scanner route = new Scanner(inFile);
while ()
{
System.out.print(");
String word = Keyboard.next();
String Street = route.next();
String stopNum = route.next();
You are closing your file after you read one "line" (actually, I'm not sure how many lines you're reading - you don't call nextLine). You also aren't parsing the line. Also, I'd prefer a try-with-resources over an explicit close (and many of your variables look like class names). Finally, you need to check if the line matches your criteria. That might be done like,
Scanner keyboard = new Scanner(System.in);
System.out.print("Enter filename >> ");
String response = keyboard.nextLine();
File inFile = new File(response);
System.out.print("Enter tram tracker ID >> ");
String word = keyboard.nextLine(); // <-- read a line. Bad idea to leave trailing
// new lines.
try (Scanner route = new Scanner(inFile)) {
while (route.hasNextLine()) {
String[] line = route.nextLine().split("\\^");
String street = line[0];
String stopNum = line[1];
String trkID = line[2];
String road = line[3];
String suburb = line[4];
if (!trkID.equals(word)) {
continue;
}
System.out.printf("street: %s, stop: %s, id: %s, road: %s, suburb: %s%n",
street, stopNum, trkID, road, suburb);
}
}
Your code print everything in the file.
To print a line with an given ID:
You can first buffer all lines of the file into a ArrayList like this in the main method:
ArrayList<String> lines = new ArrayList<>();
while (route.hasNextLine())
{
lines.add(route.nextLine());
}
Then create a method to find a line with a specific ID:
public static int find(ArrayList information, int ID)
{
String idString = "" + ID;
ListIterator<String> li = information.listIterator();
String currentLine = "";
int index = 0;
while(li.hasNext())
{
currentLine = li.next();
int count = 0;
int index1 = 0;
int index2 = 0;
/*Trying to locate the string between the 2nd and 3rd ^ */
for(int i = 0; i < currentLine.length(); i++)
{
if(currentLine.substring(i, i+1).equals("^"))
{
count++;
if(count == 2)
index1 = i;
else if(count == 3)
{
index2 = i;
break;
}
}
}
if(currentLine.substring(index1+1, index2).equals (idString))
return(index);
index++;
}
//If no such ID found, return -1;
return -1;
}
In the main method:
System.out.println("enter an ID")
int ID = Integer.parseInt(Keyboard.next());
int lineNumber = find(lines, ID);
if(lineNumber == -1)
System.out.println("no information found");
else
System.out.println(lines.get(lineNumber));

How to add tags to a parsed tree that has no tag?

For example, the parsing tree from Stanford Sentiment Treebank
"(2 (2 (2 near) (2 (2 the) (2 end))) (3 (3 (2 takes) (2 (2 on) (2 (2 a) (2 (2 whole) (2 (2 other) (2 meaning)))))) (2 .)))",
where the number is the sentiment label of each node.
I want to add POS tagging information to each node. Such as:
"(NP (ADJP (IN near)) (DT the) (NN end)) "
I have tried to directly parse the sentence, but the resulted tree is different from that in the Sentiment Treebank (may be because of the parsing version or parameters, I have tried to contact to the author but there is no response).
How can I obtain the tagging information?
I think the code in edu.stanford.nlp.sentiment.BuildBinarizedDataset should be helpful. The main() method steps through how these binary trees can be created in Java code.
Some key lines to look out for in the code:
LexicalizedParser parser = LexicalizedParser.loadModel(parserModel);
TreeBinarizer binarizer = TreeBinarizer.simpleTreeBinarizer(parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
...
Tree tree = parser.apply(tokens);
Tree binarized = binarizer.transformTree(tree);
You can access the node tag information from the Tree object. You should look at the javadoc for edu.stanford.nlp.trees.Tree to see how to access this information.
Also in this answer I have some code that shows accessing a Tree:
How to get NN andNNS from a text?
You want to look at the label() of each tree and subtree to get the tag for a node.
Here is the reference on GitHub to BuildBinarizedDataset.java:
https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/BuildBinarizedDataset.java
Please let me know if anything is unclear about this and I can provide further assistance!
First, you need to download the Stanford Parser
Set up
private LexicalizedParser parser;
private TreeBinarizer binarizer;
private CollapseUnaryTransformer transformer;
parser = LexicalizedParser.loadModel(PCFG_PATH);
binarizer = TreeBinarizer.simpleTreeBinarizer(
parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
transformer = new CollapseUnaryTransformer();
Parse
Tree tree = parser.apply(tokens);
Access POSTAG
public String[] constTreePOSTAG(Tree tree) {
Tree binarized = binarizer.transformTree(tree);
Tree collapsedUnary = transformer.transformTree(binarized);
Trees.convertToCoreLabels(collapsedUnary);
collapsedUnary.indexSpans();
List<Tree> leaves = collapsedUnary.getLeaves();
int size = collapsedUnary.size() - leaves.size();
String[] tags = new String[size];
HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();
int idx = leaves.size();
int leafIdx = 0;
for (Tree leaf : leaves) {
Tree cur = leaf.parent(collapsedUnary); // go to preterminal
int curIdx = leafIdx++;
boolean done = false;
while (!done) {
Tree parent = cur.parent(collapsedUnary);
if (parent == null) {
tags[curIdx] = cur.label().toString();
break;
}
int parentIdx;
int parentNumber = parent.nodeNumber(collapsedUnary);
if (!index.containsKey(parentNumber)) {
parentIdx = idx++;
index.put(parentNumber, parentIdx);
} else {
parentIdx = index.get(parentNumber);
done = true;
}
tags[curIdx] = parent.label().toString();
cur = parent;
curIdx = parentIdx;
}
}
return tags;
}
Here is the full source code ConstituencyParse.java that run:
Use param:
java ConstituencyParse -tokpath outputtoken.toks -parentpath outputparent.txt -tagpath outputag.txt < input_sentence_in_text_file_one_sent_per_line.txt
(Note: the source code is adapt from treelstm repo, you also need to replace preprocess-sst.py to call ConstituencyParse.java file below)
import edu.stanford.nlp.process.WordTokenFactory;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.util.StringUtils;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.TreeBinarizer;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Trees;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.StringReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.HashMap;
import java.util.Properties;
import java.util.Scanner;
public class ConstituencyParse {
private boolean tokenize;
private BufferedWriter tokWriter, parentWriter, tagWriter;
private LexicalizedParser parser;
private TreeBinarizer binarizer;
private CollapseUnaryTransformer transformer;
private GrammaticalStructureFactory gsf;
private static final String PCFG_PATH = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
public ConstituencyParse(String tokPath, String parentPath, String tagPath, boolean tokenize) throws IOException {
this.tokenize = tokenize;
if (tokPath != null) {
tokWriter = new BufferedWriter(new FileWriter(tokPath));
}
parentWriter = new BufferedWriter(new FileWriter(parentPath));
tagWriter = new BufferedWriter(new FileWriter(tagPath));
parser = LexicalizedParser.loadModel(PCFG_PATH);
binarizer = TreeBinarizer.simpleTreeBinarizer(
parser.getTLPParams().headFinder(), parser.treebankLanguagePack());
transformer = new CollapseUnaryTransformer();
// set up to produce dependency representations from constituency trees
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
gsf = tlp.grammaticalStructureFactory();
}
public List<HasWord> sentenceToTokens(String line) {
List<HasWord> tokens = new ArrayList<>();
if (tokenize) {
PTBTokenizer<Word> tokenizer = new PTBTokenizer(new StringReader(line), new WordTokenFactory(), "");
for (Word label; tokenizer.hasNext(); ) {
tokens.add(tokenizer.next());
}
} else {
for (String word : line.split(" ")) {
tokens.add(new Word(word));
}
}
return tokens;
}
public Tree parse(List<HasWord> tokens) {
Tree tree = parser.apply(tokens);
return tree;
}
public String[] constTreePOSTAG(Tree tree) {
Tree binarized = binarizer.transformTree(tree);
Tree collapsedUnary = transformer.transformTree(binarized);
Trees.convertToCoreLabels(collapsedUnary);
collapsedUnary.indexSpans();
List<Tree> leaves = collapsedUnary.getLeaves();
int size = collapsedUnary.size() - leaves.size();
String[] tags = new String[size];
HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();
int idx = leaves.size();
int leafIdx = 0;
for (Tree leaf : leaves) {
Tree cur = leaf.parent(collapsedUnary); // go to preterminal
int curIdx = leafIdx++;
boolean done = false;
while (!done) {
Tree parent = cur.parent(collapsedUnary);
if (parent == null) {
tags[curIdx] = cur.label().toString();
break;
}
int parentIdx;
int parentNumber = parent.nodeNumber(collapsedUnary);
if (!index.containsKey(parentNumber)) {
parentIdx = idx++;
index.put(parentNumber, parentIdx);
} else {
parentIdx = index.get(parentNumber);
done = true;
}
tags[curIdx] = parent.label().toString();
cur = parent;
curIdx = parentIdx;
}
}
return tags;
}
public int[] constTreeParents(Tree tree) {
Tree binarized = binarizer.transformTree(tree);
Tree collapsedUnary = transformer.transformTree(binarized);
Trees.convertToCoreLabels(collapsedUnary);
collapsedUnary.indexSpans();
List<Tree> leaves = collapsedUnary.getLeaves();
int size = collapsedUnary.size() - leaves.size();
int[] parents = new int[size];
HashMap<Integer, Integer> index = new HashMap<Integer, Integer>();
int idx = leaves.size();
int leafIdx = 0;
for (Tree leaf : leaves) {
Tree cur = leaf.parent(collapsedUnary); // go to preterminal
int curIdx = leafIdx++;
boolean done = false;
while (!done) {
Tree parent = cur.parent(collapsedUnary);
if (parent == null) {
parents[curIdx] = 0;
break;
}
int parentIdx;
int parentNumber = parent.nodeNumber(collapsedUnary);
if (!index.containsKey(parentNumber)) {
parentIdx = idx++;
index.put(parentNumber, parentIdx);
} else {
parentIdx = index.get(parentNumber);
done = true;
}
parents[curIdx] = parentIdx + 1;
cur = parent;
curIdx = parentIdx;
}
}
return parents;
}
// convert constituency parse to a dependency representation and return the
// parent pointer representation of the tree
public int[] depTreeParents(Tree tree, List<HasWord> tokens) {
GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
Collection<TypedDependency> tdl = gs.typedDependencies();
int len = tokens.size();
int[] parents = new int[len];
for (int i = 0; i < len; i++) {
// if a node has a parent of -1 at the end of parsing, then the node
// has no parent.
parents[i] = -1;
}
for (TypedDependency td : tdl) {
// let root have index 0
int child = td.dep().index();
int parent = td.gov().index();
parents[child - 1] = parent;
}
return parents;
}
public void printTokens(List<HasWord> tokens) throws IOException {
int len = tokens.size();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len - 1; i++) {
if (tokenize) {
sb.append(PTBTokenizer.ptbToken2Text(tokens.get(i).word()));
} else {
sb.append(tokens.get(i).word());
}
sb.append(' ');
}
if (tokenize) {
sb.append(PTBTokenizer.ptbToken2Text(tokens.get(len - 1).word()));
} else {
sb.append(tokens.get(len - 1).word());
}
sb.append('\n');
tokWriter.write(sb.toString());
}
public void printParents(int[] parents) throws IOException {
StringBuilder sb = new StringBuilder();
int size = parents.length;
for (int i = 0; i < size - 1; i++) {
sb.append(parents[i]);
sb.append(' ');
}
sb.append(parents[size - 1]);
sb.append('\n');
parentWriter.write(sb.toString());
}
public void printTags(String[] tags) throws IOException {
StringBuilder sb = new StringBuilder();
int size = tags.length;
for (int i = 0; i < size - 1; i++) {
sb.append(tags[i]);
sb.append(' ');
}
sb.append(tags[size - 1]);
sb.append('\n');
tagWriter.write(sb.toString().toLowerCase());
}
public void close() throws IOException {
if (tokWriter != null) tokWriter.close();
parentWriter.close();
tagWriter.close();
}
public static void main(String[] args) throws Exception {
String TAGGER_MODEL = "stanford-tagger/models/english-left3words-distsim.tagger";
Properties props = StringUtils.argsToProperties(args);
if (!props.containsKey("parentpath")) {
System.err.println(
"usage: java ConstituencyParse -deps - -tokenize - -tokpath <tokpath> -parentpath <parentpath>");
System.exit(1);
}
// whether to tokenize input sentences
boolean tokenize = false;
if (props.containsKey("tokenize")) {
tokenize = true;
}
// whether to produce dependency trees from the constituency parse
boolean deps = false;
if (props.containsKey("deps")) {
deps = true;
}
String tokPath = props.containsKey("tokpath") ? props.getProperty("tokpath") : null;
String parentPath = props.getProperty("parentpath");
String tagPath = props.getProperty("tagpath");
ConstituencyParse processor = new ConstituencyParse(tokPath, parentPath, tagPath, tokenize);
Scanner stdin = new Scanner(System.in);
int count = 0;
long start = System.currentTimeMillis();
while (stdin.hasNextLine() && count < 2) {
String line = stdin.nextLine();
List<HasWord> tokens = processor.sentenceToTokens(line);
//end tagger
Tree parse = processor.parse(tokens);
// produce parent pointer representation
int[] parents = deps ? processor.depTreeParents(parse, tokens)
: processor.constTreeParents(parse);
String[] tags = processor.constTreePOSTAG(parse);
// print
if (tokPath != null) {
processor.printTokens(tokens);
}
processor.printParents(parents);
processor.printTags(tags);
// print tag
StringBuilder sb = new StringBuilder();
int size = tags.length;
for (int i = 0; i < size - 1; i++) {
sb.append(tags[i]);
sb.append(' ');
}
sb.append(tags[size - 1]);
sb.append('\n');
count++;
if (count % 100 == 0) {
double elapsed = (System.currentTimeMillis() - start) / 1000.0;
System.err.printf("Parsed %d lines (%.2fs)\n", count, elapsed);
}
}
long totalTimeMillis = System.currentTimeMillis() - start;
System.err.printf("Done: %d lines in %.2fs (%.1fms per line)\n",
count, totalTimeMillis / 100.0, totalTimeMillis / (double) count);
processor.close();
}
}

How to find the address where width and height are stored inside an mp4 file?

I need to find the addresses where the width and height are stored, but the IUT version of the standard don't give a clear definition of the file format.
What I found so far... :
Both values are stored in "a QuickTime float". I couldn't find the format, but it seems it use two 16-bits integer: a signed one followed by an unsigned one.
Unlike many file format, there are no fixed position, so it is file specific. It depend on the TrackHeaderBox address.
What I desperatly need :
A clear canonical answer describing the places to find only those kind of information. I don't want answers only referring to third party libraries (unless they are written in proper JavaScript). Some pseudo C like structures can help.
There is no fixed position. You need to parse into the file. Please check this Java example.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
public class GetHeight {
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(new File(args[0]));
GetHeight ps = new GetHeight();
ps.find(fis);
}
byte[] lastTkhd;
private void find(InputStream fis) throws IOException {
while (fis.available() > 0) {
byte[] header = new byte[8];
fis.read(header);
long size = readUint32(header, 0);
String type = new String(header, 4, 4, "ISO-8859-1");
if (containers.contains(type)) {
find(fis);
} else {
if (type.equals("tkhd")) {
lastTkhd = new byte[(int) (size - 8)];
fis.read(lastTkhd);
} else {
if (type.equals("hdlr")) {
byte[] hdlr = new byte[(int) (size - 8)];
fis.read(hdlr);
if (hdlr[8] == 0x76 && hdlr[9] == 0x69 && hdlr[10] == 0x64 && hdlr[11] == 0x65) {
System.out.println("Video Track Header identified");
System.out.println("width: " + readFixedPoint1616(lastTkhd, lastTkhd.length - 8));
System.out.println("height: " + readFixedPoint1616(lastTkhd, lastTkhd.length - 4));
System.exit(1);
}
} else {
fis.skip(size - 8);
}
}
}
}
}
public static long readUint32(byte[] b, int s) {
long result = 0;
result |= ((b[s + 0] << 24) & 0xFF000000);
result |= ((b[s + 1] << 16) & 0xFF0000);
result |= ((b[s + 2] << 8) & 0xFF00);
result |= ((b[s + 3]) & 0xFF);
return result;
}
public static double readFixedPoint1616(byte[] b, int s) {
return ((double) readUint32(b, s)) / 65536;
}
List<String> containers = Arrays.asList(
"moov",
"mdia",
"trak"
);
}

XPATH better way to parse xml

What is the better way to parse such xml:
<FindLicensesResponse xmlns="http://abc.com">
<FindLicensesResult>
<Licensies>
<ActivityLicense>
<id>1</id>
<DateIssue>2011-12-29T00:00:00</DateIssue>
<ActivityType xmlns:s01="http://www.w3.org/2001/XMLSchema-instance" s01:type="ActivityType">
<code>somecode1</code>
</ActivityType>
<ActivityTerritory xmlns:s02="http://www.w3.org/2001/XMLSchema-instance" s02:type="Territory">
<code>somecode2</code>
</ActivityTerritory>
<ActivityLicenseAttachments />
</ActivityLicense>
<ActivityLicense>
<id>2</id>
<DateIssue>2011-12-21T00:00:00</DateIssue>
<ActivityType xmlns:s01="http://www.w3.org/2001/XMLSchema-instance" s01:type="ActivityType">
<code>somecode3</code>
</ActivityType>
<ActivityTerritory xmlns:s02="http://www.w3.org/2001/XMLSchema-instance" s02:type="Territory">
<code>somecode4</code>
</ActivityTerritory>
<ActivityLicenseAttachments />
</ActivityLicense>
</Licensies>
</FindLicensesResult>
I need to get values from each ActivityLicense: id, DateIssue and inner ActivityType: code and inner ActivityTerritory: code.
Now I do it like this:
CachedXPathAPI xpathAPI = new CachedXPathAPI();
Element nsctx = result.getSOAPPart().createElementNS(null, "nsctx");
nsctx.setAttributeNS("http://www.w3.org/2000/xmlns/","xmlns:el","http://abc.com");
NodeList activityLicenses = xpathAPI.selectNodeList(result.getSOAPPart(),"//el:ActivityLicense", nsctx);
for (int i = 0; i < activityLicenses.getLength(); i++) {
Node id = xpathAPI.selectSingleNode(activityLicenses.item(i), "//el:id", nsctx);
Node dateIssue = xpathAPI.selectSingleNode(activityLicenses.item(i), "//el:DateIssue",nsctx);
System.out.println("id: " + id.getTextContent());
System.out.println("dateIssue: " + dateIssue.getTextContent());
}
But I can't get values from ActivityType/code and ActivityTerritory/code
check out this solution
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
public class StringTest {
public static void main(String[] args) throws Exception {
String xml = "";
java.util.Scanner sc = new java.util.Scanner(new File("xml.xml"));
while(sc.hasNextLine()){
xml+=sc.nextLine();
}
javax.xml.parsers.DocumentBuilderFactory dbFactory = javax.xml.parsers.DocumentBuilderFactory.newInstance();
javax.xml.parsers.DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
InputStream is = new ByteArrayInputStream(xml.getBytes());
org.w3c.dom.Document doc = dBuilder.parse(is);
doc.getDocumentElement().normalize();
XPath xpath = XPathFactory.newInstance().newXPath();
org.w3c.dom.NodeList nodeList = doc.getElementsByTagName("ActivityLicense");
for(int i=0;i<nodeList.getLength();i++){
org.w3c.dom.Node node = nodeList.item(i);
System.out.println(xpath.evaluate("ActivityTerritory/code/text()", node, XPathConstants.STRING));
}
}
}

Problems during counting strings in the txt file

I am developing a progam which reads a text file and creates a report. The content of the report is the following: the number of every string in file, its "status", and some symbols of every string beginning. It works well with file up to 100 Mb.
But when I run the program with input files which are bigger than 1,5Gb in size and contain more than 100000 lines, I get the following error:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Unknown Source) at
> java.lang.String.<init>(Unknown Source) at
> java.lang.StringBuffer.toString(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:771) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:723) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:745) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1512) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1528) at
> org.apache.commons.io.ReadFileToListSample.main(ReadFileToListSample.java:43)
I increased VM arguments up to -Xms128m -Xmx1600m (in eclipse run configuration) but this did not help. Specialists from OTN forum advised me to read some books and improve my program's performance. Could anybody help me to improve it? Thank you.
code:
import org.apache.commons.io.FileUtils;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.PrintStream;
import java.util.List;
public class ReadFileToList {
public static void main(String[] args) throws FileNotFoundException
{
File file_out = new File ("D:\\Docs\\test_out.txt");
FileOutputStream fos = new FileOutputStream(file_out);
PrintStream ps = new PrintStream (fos);
System.setOut (ps);
// Create a file object
File file = new File("D:\\Docs\\test_in.txt");
FileReader fr = null;
LineNumberReader lnr = null;
try {
// Here we read a file, sample.txt, using FileUtils
// class of commons-io. Using FileUtils.readLines()
// we can read file content line by line and return
// the result as a List of string.
List<String> contents = FileUtils.readLines(file);
//
// Iterate the result to print each line of the file.
fr = new FileReader(file);
lnr = new LineNumberReader(fr);
for (String line : contents)
{
String begin_line = line.substring(0, 38); // return 38 chars from the string
String begin_line_without_null = begin_line.replace("\u0000", " ");
String begin_line_without_null_spaces = begin_line_without_null.replaceAll(" +", " ");
int stringlenght = line.length();
line = lnr.readLine();
int line_num = lnr.getLineNumber();
String status;
// some correct length for if
int c_u_length_f = 12;
int c_ea_length_f = 13;
int c_a_length_f = 2130;
int c_u_length_e = 3430;
int c_ea_length_e = 1331;
int c_a_length_e = 442;
int h_ext = 6;
int t_ext = 6;
if ( stringlenght == c_u_length_f ||
stringlenght == c_ea_length_f ||
stringlenght == c_a_length_f ||
stringlenght == c_u_length_e ||
stringlenght == c_ea_length_e ||
stringlenght == c_a_length_e ||
stringlenght == h_ext ||
stringlenght == t_ext)
status = "ok";
else status = "fail";
System.out.println(+ line_num + stringlenght + status + begin_line_without_null_spaces);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Also specialists from OTN said that this programm opens the input and reading it twice. May be some mistakes in "for statement"? But I can't find it.
Thank you.
You're declaring variables inside the loop and doing a lot of uneeded work, including reading the file twice - not good for peformance either. You can use the line number reader to get the line number and the text and reuse the line variable (declared outside the loop). Here's a shortened version that does what you need. You'll need to complete the validLength method to check all the values since I included only the first couple of tests.
import java.io.*;
public class TestFile {
//a method to determine if the length is valid implemented outside the method that does the reading
private static String validLength(int length) {
if (length == 12 || length == 13 || length == 2130) //you can finish it
return "ok";
return "fail";
}
public static void main(String[] args) {
try {
LineNumberReader lnr = new LineNumberReader(new FileReader(args[0]));
BufferedWriter out = new BufferedWriter(new FileWriter(args[1]));
String line;
int length;
while (null != (line = lnr.readLine())) {
length = line.length();
line = line.substring(0,38);
line = line.replace("\u0000", " ");
line = line.replace("+", " ");
out.write( lnr.getLineNumber() + length + validLength(length) + line);
out.newLine();
}
out.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
}
Call this as java TestFile D:\Docs\test_in.txt D:\Docs\test_in.txt or replace the args[0] and args[1] with the file names if you want to hard code them.

Resources