Problems during counting strings in the txt file - for-loop

I am developing a program which reads a text file and creates a report. The report contains, for every line in the file: its line number, its "status", and the first few characters of the line. It works well with files up to 100 MB.
But when I run the program with input files which are bigger than 1.5 GB and contain more than 100000 lines, I get the following error:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Unknown Source) at
> java.lang.String.<init>(Unknown Source) at
> java.lang.StringBuffer.toString(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> java.io.BufferedReader.readLine(Unknown Source) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:771) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:723) at
> org.apache.commons.io.IOUtils.readLines(IOUtils.java:745) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1512) at
> org.apache.commons.io.FileUtils.readLines(FileUtils.java:1528) at
> org.apache.commons.io.ReadFileToListSample.main(ReadFileToListSample.java:43)
I increased the VM arguments to -Xms128m -Xmx1600m (in the Eclipse run configuration), but this did not help. Specialists on the OTN forum advised me to read some books and improve my program's performance. Could anybody help me improve it? Thank you.
code:
import org.apache.commons.io.FileUtils;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.PrintStream;
import java.util.List;
public class ReadFileToList {
public static void main(String[] args) throws FileNotFoundException
{
File file_out = new File ("D:\\Docs\\test_out.txt");
FileOutputStream fos = new FileOutputStream(file_out);
PrintStream ps = new PrintStream (fos);
System.setOut (ps);
// Create a file object
File file = new File("D:\\Docs\\test_in.txt");
FileReader fr = null;
LineNumberReader lnr = null;
try {
// Here we read a file, sample.txt, using FileUtils
// class of commons-io. Using FileUtils.readLines()
// we can read file content line by line and return
// the result as a List of string.
List<String> contents = FileUtils.readLines(file);
//
// Iterate the result to print each line of the file.
fr = new FileReader(file);
lnr = new LineNumberReader(fr);
for (String line : contents)
{
String begin_line = line.substring(0, 38); // return 38 chars from the string
String begin_line_without_null = begin_line.replace("\u0000", " ");
String begin_line_without_null_spaces = begin_line_without_null.replaceAll(" +", " ");
int stringlenght = line.length();
line = lnr.readLine();
int line_num = lnr.getLineNumber();
String status;
// some correct length for if
int c_u_length_f = 12;
int c_ea_length_f = 13;
int c_a_length_f = 2130;
int c_u_length_e = 3430;
int c_ea_length_e = 1331;
int c_a_length_e = 442;
int h_ext = 6;
int t_ext = 6;
if ( stringlenght == c_u_length_f ||
stringlenght == c_ea_length_f ||
stringlenght == c_a_length_f ||
stringlenght == c_u_length_e ||
stringlenght == c_ea_length_e ||
stringlenght == c_a_length_e ||
stringlenght == h_ext ||
stringlenght == t_ext)
status = "ok";
else status = "fail";
System.out.println(+ line_num + stringlenght + status + begin_line_without_null_spaces);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
The specialists on OTN also said that this program opens the input file and reads it twice. Maybe there is a mistake in the "for" statement? But I can't find it.
Thank you.

You're declaring variables inside the loop and doing a lot of unneeded work, including reading the file twice, which is not good for performance either. More importantly, FileUtils.readLines loads the whole file into a list in memory, which is where the OutOfMemoryError in your stack trace comes from; reading line by line avoids that. You can use the line number reader to get both the line number and the text, and reuse the line variable (declared outside the loop). Here's a shortened version that does what you need. You'll need to complete the validLength method to check all the values, since I included only the first few tests.
import java.io.*;
public class TestFile {
//a method to determine if the length is valid implemented outside the method that does the reading
private static String validLength(int length) {
if (length == 12 || length == 13 || length == 2130) //you can finish it
return "ok";
return "fail";
}
public static void main(String[] args) {
try {
LineNumberReader lnr = new LineNumberReader(new FileReader(args[0]));
BufferedWriter out = new BufferedWriter(new FileWriter(args[1]));
String line;
int length;
while (null != (line = lnr.readLine())) {
length = line.length();
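line = line.substring(0, Math.min(38, length)); // valid lines can be shorter than 38 chars
line = line.replace("\u0000", " ");
line = line.replaceAll(" +", " "); // collapse runs of spaces, as in the question
out.write(lnr.getLineNumber() + " " + length + " " + validLength(length) + " " + line);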
out.newLine();
}
out.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
}
Call this as java TestFile D:\Docs\test_in.txt D:\Docs\test_out.txt, or replace args[0] and args[1] with the file names if you want to hard-code them.
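The answer leaves validLength unfinished. One way it might be completed, using the length constants from the question, is a set lookup. This is only a sketch: the class name LengthCheck is made up for illustration, Set.of assumes Java 9 or later, and 6 is listed once because h_ext and t_ext share that value.

```java
import java.util.Set;

public class LengthCheck {
    // Valid line lengths, copied from the constants in the question.
    // h_ext and t_ext are both 6, so 6 appears once (Set.of rejects duplicates).
    private static final Set<Integer> VALID_LENGTHS =
            Set.of(12, 13, 2130, 3430, 1331, 442, 6);

    // Returns "ok" when the length is one of the expected values, otherwise "fail".
    static String validLength(int length) {
        return VALID_LENGTHS.contains(length) ? "ok" : "fail";
    }

    public static void main(String[] args) {
        System.out.println(validLength(2130));
        System.out.println(validLength(100));
    }
}
```

Keeping the values in one collection means a new valid length is a one-line change rather than another clause in a long if condition.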

Related

Append a String to the end of the existing String with specific position in a text file in Java

Example:
In a text file we have the following topics, each with a description.
#Repeat the annotation
It is the major topic for .....
#Vector analysis
It covers all the aspects of sequential....
#Cloud Computing
Create header accounts for all the users
We have to append new tags to specific topic lines.
For example:
#Repeat the annotation #Maven build
#Cloud Computing #SecondYear
File f = new File("/user/imp/value/GSTR.txt");
FileReader fr = new FileReader(f);
Object fr1;
while((fr1 = fr.read()) != null) {
if(fr1.equals("#Repeat the annotation")) {
FileWriter fw = new FileWriter(f,true);
fw.write("#Maven build");
fw.close();
}
}
#Maven build is getting added at the end of the text file, not at the specific position next to the topic.
The output is written to the file GSTR_modified.txt. The code along with an example input file is also available here. The code in the github repository reads the file "input.txt" and writes to the file "output.txt".
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
public class Main {
public static void main(String[] args) throws IOException {
// Create a list to store the file content.
ArrayList<String> list = new ArrayList<>();
// Store the file content in the list. Each line becomes an element in the list.
try (BufferedReader br = new BufferedReader(new FileReader("/user/imp/value/GSTR.txt"))) {
String line;
while ((line = br.readLine()) != null) {
list.add(line);
}
}
// Iterate the list of lines.
for (int i = 0; i < list.size(); i++) {
// line is the element at the index i.
String line = list.get(i);
// Check if a line contains "#Repeat the annotation"
if (line.contains("#Repeat the annotation")){
// Set the list element at index i to the line itself concatenated with
// the string " #Maven build".
list.set(i,line.concat(" #Maven build"));
}
// Same pattern as above.
if (line.contains("#Cloud Computing")){
list.set(i,line.concat(" #SecondYear"));
}
}
// Write the contents of the list to a file.
FileWriter writer = new FileWriter("GSTR_modified.txt");
for(String str: list) {
// Append newline character \n to each element
// and write it to file.
writer.write(str+"\n");
}
writer.close();
}
}
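The same idea can be sketched with the topic-to-tag pairs held in a map, so that adding another tag does not need another if block. The class name TagAppender and the method appendTags are hypothetical, the sketch assumes the whole file fits in memory, and it matches a topic only when the trimmed line equals it exactly (the code above uses contains instead). File I/O is left out; the lines could come from Files.readAllLines and be written back with Files.write.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TagAppender {
    // Returns a copy of lines in which every line equal to a known topic
    // has that topic's tag appended after a single space.
    static List<String> appendTags(List<String> lines, Map<String, String> tags) {
        List<String> result = new ArrayList<>(lines.size());
        for (String line : lines) {
            String tag = tags.get(line.trim());
            result.add(tag == null ? line : line + " " + tag);
        }
        return result;
    }

    public static void main(String[] args) {
        // Topic -> tag pairs taken from the example in the question.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("#Repeat the annotation", "#Maven build");
        tags.put("#Cloud Computing", "#SecondYear");

        List<String> lines = List.of(
                "#Repeat the annotation",
                "It is the major topic for .....",
                "#Cloud Computing",
                "Create header accounts for all the users");

        appendTags(lines, tags).forEach(System.out::println);
    }
}
```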

No need to check this! skip

So what I'm trying to do but clearly struggling to execute is
a single line in the text f
import java.util.Scanner;
import java.io.*;
public class hello
{
public static void main(String[] args) throws IOException
{
Scanner Keyboard = new Scanner(System.in);
System.out.print();
String response = Keyboard.nextLine();
File inFile = new File(response);
Scanner route = new Scanner(inFile);
while ()
{
System.out.print(");
String word = Keyboard.next();
String Street = route.next();
String stopNum = route.next();
You are closing your file after you read one "line" (actually, I'm not sure how many lines you're reading - you don't call nextLine). You also aren't parsing the line. Also, I'd prefer a try-with-resources over an explicit close (and many of your variables look like class names). Finally, you need to check if the line matches your criteria. That might be done like,
Scanner keyboard = new Scanner(System.in);
System.out.print("Enter filename >> ");
String response = keyboard.nextLine();
File inFile = new File(response);
System.out.print("Enter tram tracker ID >> ");
String word = keyboard.nextLine(); // <-- read a line. Bad idea to leave trailing
// new lines.
try (Scanner route = new Scanner(inFile)) {
while (route.hasNextLine()) {
String[] line = route.nextLine().split("\\^");
String street = line[0];
String stopNum = line[1];
String trkID = line[2];
String road = line[3];
String suburb = line[4];
if (!trkID.equals(word)) {
continue;
}
System.out.printf("street: %s, stop: %s, id: %s, road: %s, suburb: %s%n",
street, stopNum, trkID, road, suburb);
}
}
Your code prints everything in the file.
To print only the line with a given ID:
You can first buffer all lines of the file into an ArrayList like this in the main method:
ArrayList<String> lines = new ArrayList<>();
while (route.hasNextLine())
{
lines.add(route.nextLine());
}
Then create a method to find a line with a specific ID:
public static int find(ArrayList<String> information, int ID)
{
String idString = "" + ID;
ListIterator<String> li = information.listIterator();
String currentLine = "";
int index = 0;
while(li.hasNext())
{
currentLine = li.next();
int count = 0;
int index1 = 0;
int index2 = 0;
/*Trying to locate the string between the 2nd and 3rd ^ */
for(int i = 0; i < currentLine.length(); i++)
{
if(currentLine.substring(i, i+1).equals("^"))
{
count++;
if(count == 2)
index1 = i;
else if(count == 3)
{
index2 = i;
break;
}
}
}
if(currentLine.substring(index1+1, index2).equals (idString))
return(index);
index++;
}
//If no such ID found, return -1;
return -1;
}
In the main method:
System.out.println("enter an ID");
int ID = Integer.parseInt(Keyboard.next());
int lineNumber = find(lines, ID);
if(lineNumber == -1)
System.out.println("no information found");
else
System.out.println(lines.get(lineNumber));
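The manual scan for the second and third ^ can also be written with split, as the first answer does. The following is a rough sketch, assuming the ID is always the third ^-separated field; the class name FindById and the sample rows are invented for illustration and are not from the original file.

```java
import java.util.List;

public class FindById {
    // Returns the index of the first line whose third '^'-separated field
    // equals id, or -1 when no line matches.
    static int find(List<String> lines, String id) {
        for (int i = 0; i < lines.size(); i++) {
            String[] fields = lines.get(i).split("\\^");
            // Skip malformed lines that have fewer than three fields.
            if (fields.length > 2 && fields[2].equals(id)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up rows in the street^stop^id^road^suburb layout.
        List<String> lines = List.of(
                "Example St^1^1001^Example Rd^Northtown",
                "Sample St^2^1002^Sample Rd^Southtown");
        System.out.println(find(lines, "1002"));
        System.out.println(find(lines, "9999"));
    }
}
```

Comparing the ID as a string avoids the Integer.parseInt round trip and keeps leading zeros intact, if the IDs have any.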

How to find the address where width and height are stored inside an mp4 file?

I need to find the addresses where the width and height are stored, but the ITU version of the standard doesn't give a clear definition of the file format.
What I found so far:
Both values are stored in "a QuickTime float". I couldn't find the format, but it seems to use two 16-bit integers: a signed one followed by an unsigned one.
Unlike many file formats, there is no fixed position, so it is file-specific. It depends on the TrackHeaderBox address.
What I desperately need:
A clear canonical answer describing where to find just this kind of information. I don't want answers that only refer to third-party libraries (unless they are written in proper JavaScript). Some pseudo-C-like structures would help.
There is no fixed position. You need to parse into the file. Please check this Java example.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
public class GetHeight {
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(new File(args[0]));
GetHeight ps = new GetHeight();
ps.find(fis);
}
byte[] lastTkhd;
private void find(InputStream fis) throws IOException {
while (fis.available() > 0) {
byte[] header = new byte[8];
fis.read(header);
long size = readUint32(header, 0);
String type = new String(header, 4, 4, "ISO-8859-1");
if (containers.contains(type)) {
find(fis);
} else {
if (type.equals("tkhd")) {
lastTkhd = new byte[(int) (size - 8)];
fis.read(lastTkhd);
} else {
if (type.equals("hdlr")) {
byte[] hdlr = new byte[(int) (size - 8)];
fis.read(hdlr);
if (hdlr[8] == 0x76 && hdlr[9] == 0x69 && hdlr[10] == 0x64 && hdlr[11] == 0x65) {
System.out.println("Video Track Header identified");
System.out.println("width: " + readFixedPoint1616(lastTkhd, lastTkhd.length - 8));
System.out.println("height: " + readFixedPoint1616(lastTkhd, lastTkhd.length - 4));
System.exit(0); // found what we were looking for, so exit with success
}
} else {
fis.skip(size - 8);
}
}
}
}
}
public static long readUint32(byte[] b, int s) {
long result = 0;
// Mask each byte into a long before shifting so a high bit is never sign-extended.
result |= (b[s] & 0xFFL) << 24;
result |= (b[s + 1] & 0xFFL) << 16;
result |= (b[s + 2] & 0xFFL) << 8;
result |= (b[s + 3] & 0xFFL);
return result;
}
public static double readFixedPoint1616(byte[] b, int s) {
return ((double) readUint32(b, s)) / 65536;
}
List<String> containers = Arrays.asList(
"moov",
"mdia",
"trak"
);
}
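The question describes the width and height as two 16-bit integers, and the code above reads them as one 32-bit big-endian value divided by 65536. The small sketch below shows that decoding on its own. The class name FixedPointDemo is hypothetical and the eight bytes are invented for illustration; they are not taken from a real file.

```java
import java.nio.ByteBuffer;

public class FixedPointDemo {
    // Decodes a big-endian 16.16 fixed-point value: the upper 16 bits are the
    // integer part and the lower 16 bits are the fraction in units of 1/65536.
    static double readFixed1616(byte[] b, int offset) {
        // ByteBuffer is big-endian by default; mask to treat the int as unsigned.
        long raw = ByteBuffer.wrap(b, offset, 4).getInt() & 0xFFFFFFFFL;
        return raw / 65536.0;
    }

    public static void main(String[] args) {
        // Invented example of the last 8 bytes of a tkhd payload:
        // 0x0780.0000 is 1920.0 and 0x0438.0000 is 1080.0.
        byte[] tail = {0x07, (byte) 0x80, 0x00, 0x00, 0x04, 0x38, 0x00, 0x00};
        System.out.println("width:  " + readFixed1616(tail, 0));
        System.out.println("height: " + readFixed1616(tail, 4));
    }
}
```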

analyzing zipped or any archive file

I was wondering if anyone can recommend a tool to analyze zipped or any archive file. I do not mean checking what is inside the archive but more about how it was compressed, with what compression method, etc.
Thanks!
For data compressed into a ZIP file, the command-line tool zipinfo is quite helpful, particularly with the '-v' argument (verbose mode). I learned of zipinfo from this zip-related question on SuperUser.
I recently ran into an issue where the zips created by one tool would only open with certain programs and not others. The issue turned out to be that directories didn't have entries in the zip file; they were just implied by the presence of files in them. Also, all the directory separators were backslashes instead of forward slashes.
zipinfo didn't really help with these bits. I needed to see the zip entries, so I ended up writing the following, which allowed me to diff the directory entries against a known good version:
using System;
using System.IO;
using System.Text;
namespace ZipAnalysis
{
class Program
{
static void Main(string[] args)
{
if (args.Length < 1)
{
Console.WriteLine("No filename specified");
Console.WriteLine("Press any key to exit");
Console.ReadKey(true);
return;
}
string fileName = args[0];
if (!File.Exists(fileName))
{
Console.WriteLine($"File not found: {fileName}");
Console.WriteLine("Press any key to exit");
Console.ReadKey(true);
return;
}
using (var file = File.OpenRead(fileName))
{
//First, find the End of central directory record
BinaryReader reader = new BinaryReader(file);
int entryCount = ReadEndOfCentralDirectory(reader);
if (entryCount > 0)
{
ReadCentralDirectory(reader, entryCount);
}
}
Console.WriteLine("Press any key to exit");
Console.ReadKey(true);
}
private static int ReadEndOfCentralDirectory(BinaryReader reader)
{
var b = reader.ReadByte();
int result = 0;
long fileSize = reader.BaseStream.Length;
while (result == 0 && reader.BaseStream.Position < fileSize)
{
while (b != 0x50)
{
if (reader.BaseStream.Position < fileSize)
b = reader.ReadByte();
else
break;
}
if (reader.BaseStream.Position >= fileSize)
{
break;
}
if (reader.ReadByte() == 0x4b && reader.ReadByte() == 0x05 && reader.ReadByte() == 0x06)
{
int diskNumber = reader.ReadInt16();
int centralDirectoryStartDiskNumber = reader.ReadInt16();
int centralDirectoryCount = reader.ReadInt16();
int centralDirectoryTotal = reader.ReadInt16();
result = centralDirectoryTotal;
int centralDirectorySize = reader.ReadInt32();
int centralDirectoryOffset = reader.ReadInt32();
int commentLength = reader.ReadInt16();
string comment = Encoding.ASCII.GetString(reader.ReadBytes(commentLength));
Console.WriteLine("EOCD Found");
Console.WriteLine($"Disk Number: {diskNumber}");
Console.WriteLine($"Central Directory Disk Number: {centralDirectoryStartDiskNumber}");
Console.WriteLine($"Central Directory Count: {centralDirectoryCount}");
Console.WriteLine($"Central Directory Total: {centralDirectoryTotal}");
Console.WriteLine($"Central Directory Size: {centralDirectorySize}");
Console.WriteLine($"Central Directory Offset: {centralDirectoryOffset}");
Console.WriteLine($"Comment: {comment}");
reader.BaseStream.Seek(centralDirectoryOffset, SeekOrigin.Begin);
}
b=0;
}
return result;
}
private static void ReadCentralDirectory(BinaryReader reader, int count)
{
for (int i = 0; i < count; i++)
{
var signature = reader.ReadInt32();
if (signature == 0x02014b50)
{
Console.WriteLine($"Version Made By: {reader.ReadInt16()}");
Console.WriteLine($"Minimum version to extract: {reader.ReadInt16()}");
Console.WriteLine($"Bit Flag: {reader.ReadInt16()}");
Console.WriteLine($"Compression Method: {reader.ReadInt16()}");
Console.WriteLine($"File Last Modification Time: {reader.ReadInt16()}");
Console.WriteLine($"File Last Modification Date: {reader.ReadInt16()}");
Console.WriteLine($"CRC: {reader.ReadInt32()}");
Console.WriteLine($"CompressedSize: {reader.ReadInt32()}");
Console.WriteLine($"UncompressedSize: {reader.ReadInt32()}");
var fileNameLength = reader.ReadInt16();
var extraFieldLength = reader.ReadInt16();
var fileCommentLength = reader.ReadInt16();
Console.WriteLine($"Disk number where file starts: {reader.ReadInt16()}");
Console.WriteLine($"Internal file attributes: {reader.ReadInt16()}");
Console.WriteLine($"External file attributes: {reader.ReadInt32()}");
Console.WriteLine($"Relative offset of local file header: {reader.ReadInt32()}");
string filename = Encoding.ASCII.GetString(reader.ReadBytes(fileNameLength));
string extraField = Encoding.ASCII.GetString(reader.ReadBytes(extraFieldLength));
string fileComment = Encoding.ASCII.GetString(reader.ReadBytes(fileCommentLength));
Console.WriteLine($"Filename: {filename}");
Console.WriteLine($"Extra Field: {extraField}");
Console.WriteLine($"File Comment: {fileComment}");
}
}
}
}
}
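For a quicker check of the same symptoms (missing directory entries, backslashes in names, compression method per entry), a rough Java sketch using the standard java.util.zip.ZipFile class might look like the following. It is not a port of the C# program above: it relies on the library to read the central directory and prints far fewer fields. The class name ZipEntryReport and the throwaway demo archive are made up for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipEntryReport {
    // Produces one line per entry: kind, compression method, name,
    // plus a marker when the name contains a backslash.
    static List<String> describe(File zip) throws IOException {
        List<String> report = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                int m = e.getMethod();
                String method = m == ZipEntry.STORED ? "stored"
                        : m == ZipEntry.DEFLATED ? "deflated" : "method " + m;
                String kind = e.isDirectory() ? "dir " : "file";
                String note = e.getName().indexOf('\\') >= 0 ? "  <-- backslash in name" : "";
                report.add(kind + " " + method + " " + e.getName() + note);
            }
        }
        return report;
    }

    // Builds a tiny temporary archive so the sketch can run without an input file.
    static File writeDemoZip() throws IOException {
        File tmp = File.createTempFile("demo", ".zip");
        tmp.deleteOnExit();
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(tmp))) {
            zos.putNextEntry(new ZipEntry("docs/"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("docs/readme.txt"));
            zos.write("hello".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        File zip = args.length > 0 ? new File(args[0]) : writeDemoZip();
        describe(zip).forEach(System.out::println);
    }
}
```

Running it on a known good archive and on the problematic one, then diffing the two outputs, should show whether directory entries are missing or names use backslashes.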

why Array Index Out Of Bound Exception while re arranging doc file paragraph?

Here is a code snippet. It throws an ArrayIndexOutOfBoundsException and I don't know why.
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xslf.usermodel.XSLFTextParagraph;
public class wordcount
{
public static void main(String[] args) throws Exception
{
File file = new File("E:\\myFiles\\abc.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for (int i = 0; i < fileData.length; i++)
{
// System.out.println(fileData[i].toString());
String[] paraword = fileData[i].toString().split(" ");
// out.println(paraword.length);
if(paraword[i].length() == 0 )
{
System.out.println("\n");
}
else if(paraword[i].length() > 0 && paraword[i].length() < 12)
{
for(int k=0 ; k < paraword[i].length()-1 ; k++)
{
System.out.println(paraword[k].toString());
}
}
else if(paraword[i].length() >= 12 )
{
for(int k=0 ; k < 12 ; k++)
{
System.out.println(paraword[k].toString());
}
}
System.out.println("\n");
}
}
}
This is the image of the abc.doc file
Note: the expected output should be printed on the Java console, with 12 words on each line. But the error occurs after the first line is processed.
Any help would be appreciated
TIA
Honestly, I'm not familiar with the apache.org API, but just by looking at your logic it looks like you want to replace every instance of:
paraword[i].length()
with:
paraword.length
Because it looks like you want to check how many words are in the paragraph and not how long the first word of the paragraph is. Correct me if I'm wrong, but I think that will fix you up.
Here is the correct code snippet
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ExtractWordDocument
{
public String myString() throws IOException
{
File file = new File("PATH FOR THE .doc FILE");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
WordExtractor extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
ArrayList<Object> EntireDoc = new ArrayList<>();
for (int i = 0; i < fileData.length; i++)
{
String[] paraword = fileData[i].toString().split("\\s+");
if(paraword.length == 0 )
{EntireDoc.add("\n");}
else if(paraword.length > 0 && paraword.length < 12)
{
for(int k=0 ; k < paraword.length ; k++)
{EntireDoc.add(paraword[k].toString()+" ");}
}
else if(paraword.length >= 12 )
{
java.util.List<String> arrAsList = Arrays.asList(paraword);
String formatedString = arrAsList.toString()
.replace(",", "") //remove the commas
.replace("[", "") //remove the left bracket
.replace("]", ""); //remove the right bracket
StringBuilder sb = new StringBuilder(formatedString);
int i1 = 0;
while ((i1 = sb.indexOf(" ", i1 + 75)) != -1)
{sb.replace(i1, i1 + 1, "\n");}
EntireDoc.add(sb.toString());
}
EntireDoc.add("\n");
}
String formatedString = EntireDoc.toString()
.replace(",", "") //remove the commas
.replace("[", "") //remove the left bracket
.replace("]", ""); //remove the right bracket
return formatedString;
}
public static void main(String[] args)
{
try{
System.out.print(new ExtractWordDocument().myString());
}
catch(IOException ioe){System.out.print(ioe);}
}
}
Note: this code does not print 12 words on each line; it breaks each line after roughly 75 characters.
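Since the question asked for 12 words per line rather than a character limit, a word-based wrap could be sketched as follows. It works on a plain String, so it could be applied to each element returned by getParagraphText(); the class name WordWrap and the method wrapByWords are hypothetical, and words are assumed to be separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class WordWrap {
    // Splits a paragraph on whitespace and regroups it into lines of at most
    // wordsPerLine words. An empty or blank paragraph yields an empty list.
    static List<String> wrapByWords(String paragraph, int wordsPerLine) {
        List<String> lines = new ArrayList<>();
        String trimmed = paragraph.trim();
        if (trimmed.isEmpty()) {
            return lines;
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder current = new StringBuilder();
        for (int k = 0; k < words.length; k++) {
            // Start a new line every wordsPerLine words.
            if (k > 0 && k % wordsPerLine == 0) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(words[k]);
        }
        lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven eight nine ten "
                + "eleven twelve thirteen fourteen";
        wrapByWords(text, 12).forEach(System.out::println);
    }
}
```

Note that the loop bound is words.length, the number of words, which is the same distinction the first answer draws between paraword.length and paraword[i].length().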

Resources