How does the POI Event API read data from Excel and why does it use less RAM?

I am currently writing my bachelor thesis, in which I use the Apache POI Event API. In short, my work is about a more efficient way to read data from Excel.
Developers keep asking me what exactly "Event API" means. Unfortunately, I cannot find anything on the Apache site about the basic principle.
The following code shows how I use the POI Event API (it is taken from the Apache example for XSSF and SAX):
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.ooxml.util.SAXHelper;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.ParserConfigurationException;
public class ExampleEventUserModel {
public void processOneSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public void processAllSheets(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while(sheets.hasNext()) {
System.out.println("Processing new sheet:\n");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println("");
}
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
XMLReader parser = SAXHelper.newXMLReader();
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if(name.equals("c")) {
// Print the cell reference
System.out.print(attributes.getValue("r") + " - ");
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = sst.getItemAt(idx).getString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
System.out.println(lastContents);
}
}
public void characters(char[] ch, int start, int length) {
lastContents += new String(ch, start, length);
}
}
public static void main(String[] args) throws Exception {
ExampleEventUserModel example = new ExampleEventUserModel();
example.processOneSheet(args[0]);
example.processAllSheets(args[0]);
}
}
Can someone please explain to me how the Event API works? Is it the same as the event-based architecture or is it something else?

A *.xlsx file, which is Excel data stored in Office Open XML format and which Apache POI handles as XSSF, is a ZIP archive containing the data in XML files within a directory structure. So we can unzip the *.xlsx file and read the data directly from the XML files.
There is /xl/sharedStrings.xml, which holds all the string cell values, and /xl/workbook.xml, which describes the workbook structure. The files /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, ... store the sheets' data, and /xl/styles.xml holds the style settings for all cells in the sheets.
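To see this structure without POI, the entries of a ZIP archive can be listed with java.util.zip alone. The following is only a sketch: it builds a tiny stand-in archive in memory whose entry names mirror the parts named above (the XML bodies are placeholders, not a valid workbook) and lists them. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class XlsxPartLister {

    // Returns the entry names of a ZIP archive in the order they are stored.
    static List<String> listParts(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Hypothetical helper: builds a stand-in archive with placeholder content.
    static byte[] fakeXlsx(String... partNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : partNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = fakeXlsx("xl/workbook.xml", "xl/sharedStrings.xml",
                "xl/styles.xml", "xl/worksheets/sheet1.xml");
        listParts(archive).forEach(System.out::println);
    }
}
```

For a real file, the same loop works on a ZipInputStream opened over the *.xlsx file instead of the in-memory bytes.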
By default, when an XSSFWorkbook is created, all those parts of the *.xlsx file become object representations in memory: XSSFWorkbook, XSSFSheet, XSSFRow, XSSFCell, ... and further objects of org.apache.poi.xssf.*.*.
To get an impression of how much memory XSSFSheet, XSSFRow and XSSFCell consume, have a look into the sources. Each of those objects contains multiple Lists and Maps as internal members, and of course multiple methods too. Now imagine a sheet having hundreds of thousands of rows, each containing up to hundreds of cells. Each of those rows and cells will be represented by an XSSFRow or an XSSFCell in memory. This is not a criticism of Apache POI, because those objects are necessary if one needs to work with them. But if the need is really only to get the content out of the Excel sheet, then not all of those objects are necessary. That is the reason for the XSSF and SAX (Event API) approach.
So if the need is only to read data from sheets, one could simply parse the XML of all the /xl/worksheets/sheet[n].xml files without creating memory-consuming objects for each sheet, each row and each cell in those sheets.
Parsing XML in event-based mode means that the parser goes top-down through the XML and calls callback methods that the code has defined whenever it detects the start of an element, the end of an element, or character content within an element. The appropriate callback methods then decide what to do on the start, the end, or the character content of an element. So reading the XML file only means running top-down through the file once and handling the events (start, end, character content of an element) to get all the needed content out of it. Memory consumption is thus reduced to storing the text data taken from the XML.
The XSSF and SAX (Event API) example uses the class SheetHandler, which extends DefaultHandler, for this.
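The callback principle can be tried out without POI and without an Excel file. The sketch below is an assumption-laden miniature, not the POI implementation: it feeds a hand-written fragment that only resembles sheet XML to the JDK's SAX parser, and the names SaxCellDemo, CellHandler and readCells are invented for this example. Note that the handler only ever holds the text of the element currently being read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxCellDemo {

    // Collects "reference=value" pairs; no row or cell objects are built.
    static class CellHandler extends DefaultHandler {
        final List<String> cells = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String currentRef;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals("c")) {
                currentRef = attrs.getValue("r"); // cell reference such as A1
            }
            text.setLength(0); // start buffering text for the new element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("v")) {
                cells.add(currentRef + "=" + text);
            }
        }
    }

    static List<String> readCells(String xml) {
        CellHandler handler = new CellHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the XML", e);
        }
        return handler.cells;
    }

    public static void main(String[] args) {
        String xml = "<sheetData><row r=\"1\">"
                + "<c r=\"A1\"><v>42</v></c><c r=\"B1\"><v>3.5</v></c>"
                + "</row></sheetData>";
        System.out.println(readCells(xml)); // prints [A1=42, B1=3.5]
    }
}
```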
But if we are already at this level, where we get at the underlying XML data and process it, then we could go one step further back too. Plain Java is able to handle ZIP archives and parse XML, so we would not even need additional libraries at all. See how read excel file having more than 100000 row in java? where I have shown this. My code uses the package javax.xml.stream, which also provides an event-based XMLEventReader, but one that is driven by linear code rather than callbacks. Maybe this code is simpler to understand because it is all in one place.
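As a rough illustration of the linear style (this is not the code from the linked answer), the sketch below reads a hand-written fragment that resembles sheet XML with XMLEventReader. The class and method names are hypothetical, and it assumes every c element carries an r attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StaxCellDemo {

    // Pulls events one by one in a loop instead of registering callbacks.
    static List<String> readCells(String xml) {
        List<String> cells = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            String currentRef = null;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // only element starts matter in this sketch
                }
                StartElement start = event.asStartElement();
                String name = start.getName().getLocalPart();
                if (name.equals("c")) {
                    // Assumption: the r attribute is always present.
                    currentRef = start.getAttributeByName(new QName("r")).getValue();
                } else if (name.equals("v")) {
                    // Reads the text up to the matching end element.
                    cells.add(currentRef + "=" + reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse the XML", e);
        }
        return cells;
    }

    public static void main(String[] args) {
        String xml = "<sheetData><row r=\"1\">"
                + "<c r=\"A1\"><v>42</v></c><c r=\"B1\"><v>3.5</v></c>"
                + "</row></sheetData>";
        System.out.println(readCells(xml)); // prints [A1=42, B1=3.5]
    }
}
```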
To detect whether a number format is a date format, and so whether the formatted cell contains a date/time value, one single Apache POI class, org.apache.poi.ss.usermodel.DateUtil, is used. This is done to simplify the code. Of course, we could have coded even this class ourselves.

Related

Getting a FileNotFoundException in VSCode, but not in JGrasp

OK, so this is what's going on. I'm trying to learn how to use VSCode (switching over from jGRASP). I'm trying to run an old school assignment that requires outside .txt files. The .txt files, as well as the other classes I have written, are all in the same folder. When I run this program in jGRASP, it works fine. In VSCode, though, I get an exception. I'm not sure what is going wrong here. Thanks. Here is an example:
import java.io.*;
import java.util.*; // Scanner, List, ArrayList
public class HangmanMain {
public static final String DICTIONARY_FILE = "dictionary.txt";
public static final boolean SHOW_COUNT = true; // show # of choices left
public static void main(String[] args) throws FileNotFoundException {
System.out.println("Welcome to the cse143 hangman game.");
System.out.println();
// open the dictionary file and read dictionary into an ArrayList
Scanner input = new Scanner(new File(DICTIONARY_FILE));
List<String> dictionary = new ArrayList<String>();
while (input.hasNext()) {
dictionary.add(input.next().toLowerCase());
}
// set basic parameters
Scanner console = new Scanner(System.in);
System.out.print("What length word do you want to use? ");
int length = console.nextInt();
System.out.print("How many wrong answers allowed? ");
int max = console.nextInt();
System.out.println();
//The rest of the program is not shown. This was included just so you guys could see a little bit of it.
If you're not using a project, jGRASP makes the working directory for your program the same one that contains the source file. You are creating the file with a relative path, so it is assumed to be in the working directory. You can print new File(DICTIONARY_FILE).getAbsolutePath() to see where VSCode is looking (probably a separate "classes" directory) and move your data file there, or use an absolute path.
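The check described above could look roughly like this. It is a small sketch with made-up class and method names: it prints the working directory and the absolute path that a relative file name resolves to.

```java
import java.io.File;

class WorkingDirCheck {

    // Shows where a relative name ends up; it is resolved against the working directory.
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Looking for: " + resolve("dictionary.txt"));
        System.out.println("Exists: " + new File("dictionary.txt").exists());
    }
}
```

Running this once in each editor shows whether the two tools start the program in different directories.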

get method to display containing folder, size, and time of last modification Java

I have a program I am writing for a class. I have the first part done, but need help with the code for this part:
displaying the containing folder, size, and time of last modification. These are the steps I need help writing.
Here is the challenge:
1. Create a file using any word-processing program or text editor. Write an application that displays the file’s name, containing folder, size, and time of last modification.
Below is my code so far:
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.io.IOException;
import static java.nio.file.AccessMode.*;
public class FileStatistics
{
public static void main(String[] args)
{
Path filePath =
Paths.get("C:\\Users\\John\\Desktop\\N Drive\\St Leo Master folder\\COM-209\\module 6\\sixtestfile.txt");
System.out.println("Path is " + filePath.toString());
try
{
filePath.getFileSystem().provider().checkAccess
(filePath, READ, EXECUTE);
System.out.println("File can be read & executed");
}
catch(IOException e)
{
System.out.println("File cannot be used in this app");
}
}
}
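One possible way to obtain the missing pieces is Files.readAttributes together with Path.getFileName and Path.getParent. The sketch below is only an illustration under assumptions: the class name FileStatisticsSketch and the method describe are invented, and main uses a temporary file instead of the path from the question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

class FileStatisticsSketch {

    // Builds a four-line report: name, containing folder, size, last modification.
    static String describe(Path file) {
        try {
            Path absolute = file.toAbsolutePath();
            BasicFileAttributes attrs =
                    Files.readAttributes(absolute, BasicFileAttributes.class);
            String newline = System.lineSeparator();
            return "Name: " + absolute.getFileName() + newline
                    + "Folder: " + absolute.getParent() + newline
                    + "Size: " + attrs.size() + " bytes" + newline
                    + "Last modified: " + attrs.lastModifiedTime();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file from the question.
        Path demo = Files.createTempFile("sixtestfile", ".txt");
        Files.write(demo, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(describe(demo));
        Files.delete(demo);
    }
}
```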

Android Viewpager to load images from SD Card

Guys, I'm using the following custom code to load 20 images from resources and present them in a ViewPager:
public class CustomPagerAdapter extends PagerAdapter {
int[] mResources = {
R.drawable.slide1,
R.drawable.slide2,
R.drawable.slide3,
R.drawable.slide4,
R.drawable.slide5,
R.drawable.slide6,
R.drawable.slide7,
R.drawable.slide8,
R.drawable.slide9,
R.drawable.slide10,
R.drawable.slide11,
R.drawable.slide12,
R.drawable.slide13,
R.drawable.slide14,
R.drawable.slide15,
R.drawable.slide16,
R.drawable.slide17,
R.drawable.slide18,
R.drawable.slide19,
R.drawable.slide20,
};
Context mContext;
LayoutInflater mLayoutInflater;
public CustomPagerAdapter(Context context) {
mContext = context;
mLayoutInflater = (LayoutInflater) mContext.getSystemService(Context.LAYOUT_INFLATER_SERVICE);
}
@Override
public int getCount() {
return mResources.length;
}
@Override
public boolean isViewFromObject(View view, Object object) {
return view == ((LinearLayout) object);
}
@Override
public Object instantiateItem(ViewGroup container, int position) {
View itemView = mLayoutInflater.inflate(R.layout.pager_item, container, false);
ImageView imageView = (ImageView) itemView.findViewById(R.id.imageView);
imageView.setImageResource(mResources[position]);
container.addView(itemView);
return itemView;
}
@Override
public void destroyItem(ViewGroup container, int position, Object object) {
container.removeView((LinearLayout) object);
}
}
This works fine, but I want to put the JPGs in a directory on the device so that they can be changed without recompiling the app.
I think I need to get the images into the mResources array. I can get the path, but I'm not sure what the code should look like in place of the drawable lines.
I have read articles on here, but none make sense to me. I am really new to this, and the code looks nothing like the code I am using.
Can anyone point me in the right direction?
Any help is greatly appreciated.
Mark
Yes, you can certainly do so. I will try to explain the process step by step.
Step 1
Have a File object pointing to the path, like,
File directory = new File("path-to-directory");
Make sure the path points to the directory containing the images.
Step 2
List all the files inside the directory using listFiles() method, like
File[] allImages = directory.listFiles();
Now you have an array of all the files just like int[] mResources. The only difference being, now you have actual file references, while previously you had resource ids.
Step 3
You can display the images in the ViewPager just as you did previously. But this is a bit tricky: it can take a considerable amount of time and code to get an image properly displayed from a File.
You also need to take care of caching, so that a previously loaded image is served from the cache when it is loaded again.
To do all this, I recommend using the Glide library (recommended by Google).
Setting an image is one line of code:
Glide.with(context).load(file).into(imageView);
That's it. Now you have your images displayed in a ViewPager from a directory in the device.
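Steps 1 and 2 can be sketched in plain Java, independent of Android. The helper below is hypothetical (the names ImageFileLister and listJpegs are not from the original code): it keeps only JPEG files, guards against listFiles() returning null, and sorts the result, since listFiles() gives no ordering guarantee. An adapter could then hold the returned File[] in place of int[] mResources.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Locale;

class ImageFileLister {

    // Returns the .jpg/.jpeg files of a directory, sorted by path.
    static File[] listJpegs(File directory) {
        File[] images = directory.listFiles((dir, name) -> {
            String lower = name.toLowerCase(Locale.ROOT);
            return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
        });
        if (images == null) {
            // listFiles() returns null if the path is not a readable directory.
            return new File[0];
        }
        Arrays.sort(images);
        return images;
    }

    public static void main(String[] args) {
        // Placeholder path, as in Step 1 above.
        File directory = new File("path-to-directory");
        for (File image : listJpegs(directory)) {
            System.out.println(image.getName());
        }
    }
}
```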

Modify file using Files.lines

I'd like to read in a file and replace some text with new text. It would be simple using asm and int 21h but I want to use the new java 8 streams.
Files.write(outf.toPath(),
(Iterable<String>)Files.lines(inf)::iterator,
CREATE, WRITE, TRUNCATE_EXISTING);
Somewhere in there I'd like a lines.replace("/*replace me*/","new Code()\n");. The new lines are because I want to test inserting a block of code somewhere.
Here's a play example, that doesn't work how I want it to, but compiles. I just need a way to intercept the lines from the iterator, and replace certain phrases with code blocks.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.*;
import java.util.Arrays;
import java.util.stream.Stream;
public class FileStreamTest {
public static void main(String[] args) {
String[] ss = new String[]{"hi","pls","help","me"};
Stream<String> stream = Arrays.stream(ss);
try {
Files.write(Paths.get("tmp.txt"),
(Iterable<String>)stream::iterator,
CREATE, WRITE, TRUNCATE_EXISTING);
} catch (IOException ex) {}
//// I'd like to hook this next part into Files.write part./////
//reset stream
stream = Arrays.stream(ss);
Iterable<String> it = stream::iterator;
//I'd like to replace some text before writing to the file
for (String s : it){
System.out.println(s.replace("me", "my\nreal\nname"));
}
}
}
edit: I've gotten this far and it works. I was trying with filter and maybe it isn't really necessary.
Files.write(Paths.get("tmp.txt"),
(Iterable<String>)(stream.map((s) -> {
return s.replace("me", "my\nreal\nname");
}))::iterator,
CREATE, WRITE, TRUNCATE_EXISTING);
The Files.write(..., Iterable, ...) method seems tempting here, but converting the Stream to an Iterable makes this cumbersome. It also "pulls" from the Iterable, which is a bit odd. It would make more sense if the file-writing method could be used as the stream's terminal operation, within something like forEach.
Unfortunately, most things that write throw IOException, which isn't permitted by the Consumer functional interface that forEach expects. But PrintWriter is different. At least, its writing methods don't throw checked exceptions, although opening one can still throw IOException. Here's how it could be used.
Stream<String> stream = ... ;
try (PrintWriter pw = new PrintWriter("output.txt", "UTF-8")) {
stream.map(s -> s.replaceAll("foo", "bar"))
.forEachOrdered(pw::println);
}
Note the use of forEachOrdered, which prints the output lines in the same order in which they were read, which is presumably what you want!
If you're reading lines from an input file, modifying them, and then writing them to an output file, it would be reasonable to put both files within the same try-with-resources statement:
try (Stream<String> input = Files.lines(Paths.get("input.txt"));
PrintWriter output = new PrintWriter("output.txt", "UTF-8"))
{
input.map(s -> s.replaceAll("foo", "bar"))
.forEachOrdered(output::println);
}
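Put together as a complete program, the approach might look like the sketch below. The class name, method name and temporary files are assumptions made for the example. It uses String.replace, which treats the search text literally, whereas replaceAll interprets it as a regular expression. Because PrintWriter does not throw on write errors, the sketch also calls checkError() before finishing.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

class LineReplaceDemo {

    // Streams the input line by line, replaces text, and writes the result in order.
    static void replaceInFile(Path in, Path out, String target, String replacement) {
        try (Stream<String> lines = Files.lines(in, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(
                     Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            lines.map(line -> line.replace(target, replacement))
                 .forEachOrdered(writer::println);
            // PrintWriter swallows IOExceptions, so ask explicitly (this also flushes).
            if (writer.checkError()) {
                throw new IOException("writing to " + out + " failed");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("input", ".txt");
        Path out = Files.createTempFile("output", ".txt");
        Files.write(in, Arrays.asList("hi", "pls", "help", "me"), StandardCharsets.UTF_8);
        replaceInFile(in, out, "me", "my\nreal\nname");
        Files.readAllLines(out, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```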

Visual Studio: Who is writing to console?

OK, here's a good one (I think) - I'm working on an application with lots (far too many) dependency dlls, created by a team of developers. I'm trying to debug just one assembly, but the console output is 'polluted' by the Console.WriteLines and Debug.WriteLines left scattered around the code.
Is there any way I can work out exactly which assembly a given line is coming from, so I can get the author to clean up their source?
UPDATE If you're also experiencing this kind of issue, note that there is another potential source of output messages which is any breakpoints with 'When hit' set to print a message. Having said which, this is a VERY cool feature, which can prevent the kind of problems I was having above.
Yes - replace Console.Out. Use Console.SetOut after creating a TextWriter which not only dumps the requested data to the original console, but also dumps a stack trace (and timestamp, and the requested data) to a file.
Here's some code, adapted from Benjol's answer:
(Note: you will want to adapt this code depending on whether you want a stack trace after each write, or after each writeline. In the code below, each char is followed by a stack trace!)
using System.Diagnostics;
using System.IO;
using System.Text;
public sealed class StackTracingWriter : TextWriter
{
private readonly TextWriter writer;
public StackTracingWriter (string path)
{
writer = new StreamWriter(path) { AutoFlush = true };
}
public override System.Text.Encoding Encoding
{
get { return Encoding.UTF8; }
}
public override void Write(string value)
{
string trace = (new StackTrace(true)).ToString();
writer.Write(value + " - " + trace);
}
public override void Write(char[] buffer, int index, int count)
{
Write(new string(buffer, index, count));
}
public override void Write(char value)
{
// Note that this will create a stack trace for each character!
Write(value.ToString());
}
public override void WriteLine()
{
// This is almost always going to be called in conjunction with
// real text, so don't bother writing a stack trace
writer.WriteLine();
}
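protected override void Dispose(bool disposing)
{
// Only release the wrapped writer on an explicit Dispose, then let the base class clean up
if (disposing)
{
writer.Dispose();
}
base.Dispose(disposing);
}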
}
To use this for logging both Console.WriteLine and Debug.WriteLine to a file, make calls like this as early as possible in your code:
var writer = new StackTracingWriter(@"C:\Temp\ConsoleOut.txt");
Console.SetOut(writer);
Debug.Listeners.Add(new TextWriterTraceListener(writer));
Note that this currently doesn't also write to the original console. To do so, you'd need to have a second TextWriter (for the original console) in StackTracingWriter, and write to both places each time. Debug will however continue to be written to the original console.
Download Reflector and you can open up the mscorlib assembly, add your application's assemblies, then right click on the Console class and click Analyze and you can show all methods that reference the Console class.

Resources