Related
I recently worked on a project in google sheets. While everything is working, changes in the page take a while to process and load. There are a few parts of the project (described below). Is there a way to find out what is causing the biggest load on the project, so that I can work on that area?
Alternatively, if someone has experience with the following types of functions, what do you think is causing the biggest slowdown:
I have a query to lookup and match values. Would this be faster as a vlookup with sort in an arrayformula?
=IFERROR(QUERY(Record!A:C,"Select A where B = '"&B7&"' order by A desc limit 1 label A ''"),"")
I have random number generation through an arrayformula:
=ARRAYFORMULA(IF(ROW(B6:B)=6,"RANDOM",IF(ISBLANK(B6:B),"",RANDBETWEEN(0+0*ROW(B6:B),COUNTA(B6:B)))))
This fills in a cell with a random number if the one next to it has a value. I use this for random sampling in another query later.
I have some conditional formatting based on whether the cell has something in it.
I have some data validation based on a range of cells.
Note: Since my main question is about performance, I didn't think having an example file would be beneficial. It would take me a little to make one so if Ii should, let me know. Also, since other questions deal with scripting performance (like this one and this one) I feel like my question is different.
I suspect the RANDBETWEEN formula is your biggest culprit - basically every time the spreadsheet changes in any way whatsoever, even if you dont actually edit, the numbers all change, so inside an arrayformula, depending on how many rows you have, its always recalculating the rand for every single row
I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused on what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work with cells being the unique key and the value being the object in the cell? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph but I was unsure of how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that its in a browser as compared to a separate application add any difficulties?
Thanks!
For an interview this is a good question. If this was asked as an actual task in your job, then there would be a simple answer of use a third party component, there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively you can go the data route, talk about data structures to hold a spreadsheet, talk exactly about links between cells for formulas, talk about how to detect and deal with circular references, talk about how in a browser you have less control over memory and for very large spreadsheets you could run into problems earlier. You can talk about what is available in JavaScript vs a native language and how this impacts the data structures and calculations. Also along with data, a big important issue with spreadsheets is numerical accuracy and floating point number calculations. Floating point numbers are made to be fast but are not necessarily accurate in extreme levels of precision and this leads to a lot of confusing questions. I believe very recently Excel switched to their own representation of a fixed decimal number as it's now viable to due spreadsheet level calculations without using the built-in floating point calculations. You can also talk about data structures and calculation and how they affect performance. In a browser you don't have threads (yet) so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script. You need to break up the calculation.
Finally you can run form the user experience angle. How is the experience in a browser different from a native application? What are the advantages and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (example, associate your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported).
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer that is specifically looking for the answer they want and in that case you're pretty much out of luck unless you're telepathic.
You can say hopelessly too much about this. I'd probably start with:
If most of the cells are filled, use a simply 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant have each cell be a vertex and each reference to another cell be a directed edge. This would allow you to do checks for circular references fairly easily, and allow for efficiently updating of all cells that need to change.
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are current visible)
How are you sending updates to the server (are you sending every change or keeping a collection of changed cells and only sending updates on save, or are you not storing changes separately and just sending the whole grid across during save)
Auto-save should probably be considered as well
Will you have an "undo", will this only be local, if not, how will you handle this on the server and how will you send through the updates
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
Looking at the CSS cursor property just begs for one to create
a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose built for tabular
data.
Resizing cell height and width is achievable with offsetX and
offsetY.
Storing the data is trivial. It can be Mongo, mySQL, Firebase,
...whatever. On blur, send update.
Javascrip/ECMA is more than capable of delivering all the Excel built-in
functions. Did I mention web workers?
Need to increment letters as in column ID's? I got you covered.
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.
Would OCR Software be able to reliably translate an image such as the following into a list of values?
UPDATE:
In more detail the task is as follows:
We have a client application, where the user can open a report. This report contains a table of values.
But not every report looks the same - different fonts, different spacing, different colors, maybe the report contains many tables with different number of rows/columns...
The user selects an area of the report which contains a table. Using the mouse.
Now we want to convert the selected table into values - using our OCR tool.
At the time when the user selects the rectangular area I can ask for extra information
to help with the OCR process, and ask for confirmation that the values have been correct recognised.
It will initially be an experimental project, and therefore most likely with an OpenSource OCR tool - or at least one that does not cost any money for experimental purposes.
Simple answer is YES, you should just choose right tools.
I don't know if open source can ever get close to 100% accuracy on those images, but based on the answers here probably yes, if you spend some time on training and solve table analisys problem and stuff like that.
When we talk about commertial OCR like ABBYY or other, it will provide you 99%+ accuracy out of the box and it will detect tables automatically. No training, no anything, just works. Drawback is that you have to pay for it $$. Some would object that for open source you pay your time to set it up and mantain - but everyone decides for himself here.
However if we talk about commertial tools, there is more choice actually. And it depends on what you want. Boxed products like FineReader are actually targeting on converting input documents into editable documents like Word or Excell. Since you want actually to get data, not the Word document, you may need to look into different product category - Data Capture, which is essentially OCR plus some additional logic to find necessary data on the page. In case of invoice it could be Company name, Total amount, Due Date, Line items in the table, etc.
Data Capture is complicated subject and requires some learning, but being properly used can give quaranteed accuracy when capturing data from the documents. It is using different rules for data cross-check, database lookups, etc. When necessary it may send datafor manual verification. Enterprises are widely usind Data Capture applicaitons to enter millions of documents every month and heavily rely on data extracted in their every day workflow.
And there are also OCR SDK ofcourse, that will give you API access to recognition results and you will be able to program what to do with the data.
If you describe your task in more detail I can provide you with advice what direction is easier to go.
UPDATE
So what you do is basically Data Capture application, but not fully automated, using so-called "click to index" approach. There is number of applications like that on the market: you scan images and operator clicks on the text on the image (or draws rectangle around it) and then populates fields to database. It is good approach when number of images to process is relatively small, and manual workload is not big enough to justify cost of fully automated application (yes, there are fully automated systems that can do images with different font, spacing, layout, number of rows in the tables and so on).
If you decided to develop stuff and instead of buying, then all you need here is to chose OCR SDK. All UI you are going to write yoursself, right? The big choice is to decide: open source or commercial.
Best Open source is tesseract OCR, as far as I know. It is free, but may have real problems with table analysis, but with manual zoning approach this should not be the problem. As to OCR accuracty - people are often train OCR for font to increase accuracy, but this should not be the case for you, since fonts could be different. So you can just try tesseract out and see what accuracy you will get - this will influence amount of manual work to correct it.
Commertial OCR will give higher accuracy but will cost you money. I think you should anyway take a look to see if it worth it, or tesserack is good enough for you. I think the simplest way would be to download trial version of some box OCR prouct like FineReader. You will get good idea what accuracy would be in OCR SDK then.
If you always have solid borders in your table, you can try this solution:
Locate the horizontal and vertical lines on each page (long runs of
black pixels)
Segment the image into cells using the line coordinates
Clean up each cell (remove borders, threshold to black and white)
Perform OCR on each cell
Assemble results into a 2D array
Else your document have a borderless table, you can try to follow this line:
Optical Character Recognition is pretty amazing stuff, but it isn’t
always perfect. To get the best possible results, it helps to use the
cleanest input you can. In my initial experiments, I found that
performing OCR on the entire document actually worked pretty well as
long as I removed the cell borders (long horizontal and vertical
lines). However, the software compressed all whitespace into a single
empty space. Since my input documents had multiple columns with
several words in each column, the cell boundaries were getting lost.
Retaining the relationship between cells was very important, so one
possible solution was to draw a unique character, like “^” on each
cell boundary – something the OCR would still recognize and that I
could use later to split the resulting strings.
I found all this information in this link, asking Google "OCR to table". The author published a full algorithm using Python and Tesseract, both opensource solutions!
If you want to try the Tesseract power, maybe you should try this site:
http://www.free-ocr.com/
Which OCR you are talking about?
Will you be developing codes based on that OCR or you will be using something off the shelves?
FYI:
Tesseract OCR
it has implemented the document reading executable, so you can feed the whole page in, and it will extract characters for you. It recognizes blank spaces pretty well, it might be able to help with tab-spacing.
I've been OCR'ing scanned documents since '98. This is a recurring problem for scanned docs, specially for those that include rotated and/or skewed pages.
Yes, there are several good commercial systems and some could provide, once well configured, terrific automatic data-mining rate, asking for the operator's help only for those very degraded fields. If I were you, I'd rely on some of them.
If commercial choices threat your budget, OSS can lend a hand. But, "there's no free lunch". So, you'll have to rely on a bunch of tailor-made scripts to scaffold an affordable solution to process your bunch of docs. Fortunately, you are not alone. In fact, past last decades, many people have been dealing with this. So, IMHO, the best and concise answer for this question is provided by this article:
https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-pdfs-using-pdftabextract-to-liberate-tabular-data-from-scanned-documents/
Its reading is worth! The author offers useful tools of his own, but the article's conclusion is very important to give you a good mindset about how to solve this kind of problem.
"There is no silver bullet."
(Fred Brooks, The Mitical Man-Month)
It really depends on implementation.
There are a few parameters that affect the OCR's ability to recognize:
1. How well the OCR is trained - the size and quality of the examples database
2. How well it is trained to detect "garbage" (besides knowing what's a letter, you need to know what is NOT a letter).
3. The OCR's design and type
4. If it's a Nerural Network, the Nerural Network structure affects its ability to learn and "decide".
So, if you're not making one of your own, it's just a matter of testing different kinds until you find one that fits.
You could try other approach. With tesseract (or other OCRS) you can get coordinates for each word. Then you can try to group those words by vercital and horizontal coordinates to get rows/columns. For example to tell a difference between a white space and tab space. It takes some practice to get good results but it is possible. With this method you can detect tables even if the tables use invisible separators - no lines. The word coordinates are solid base for table recog
We also have struggled with the issue of recognizing text within tables. There are two solutions which do it out of the box, ABBYY Recognition Server and ABBYY FlexiCapture. Rec Server is a server-based, high volume OCR tool designed for conversion of large volumes of documents to a searchable format. Although it is available with an API for those types of uses we recommend FlexiCapture. FlexiCapture gives low level control over extraction of data from within table formats including automatic detection of table items on a page. It is available in a full API version without a front end, or the off the shelf version that we market. Reach out to me if you want to know more.
Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.
Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
Use OpenCV to find and extract tables.
Use OpenCV to find and extract each cell from the table.
Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
Use Tesseract to OCR each cell.
Combine the extracted text of each cell into the format you need.
The code for each of these steps is extensive, but if you want to use a python package, it's as simple as the following.
pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png
That package and demo module will turn the following table into CSV output.
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
I am using Bing Maps with Ajax and I have about 80,000 locations to drop pushpins into. The purpose of the feature is to allow a user to search for restaurants in Louisiana and click the pushpin to see the health inspection information.
Obviously it doesn't do much good to have 80,000 pins on the map at one time, but I am struggling to find the best solution to this problem. Another problem is that the distance between these locations is very small (All 80,000 are in Louisiana). I know I could use clustering to keep from cluttering the map, but it seems like that would still cause performance problems.
What I am currently trying to do is to simply not show any pins until a certain zoom level and then only show the pins within the current view. The way I am currently attempting to do that is by using the viewchangeend event to find the zoom level and the boundaries of the map and then querying the database (through a web service) for any points in that range.
It feels like I am going about this the wrong way. Is there a better way to manage this large amount of data? Would it be better to try to load all points initially and then have the data on hand without having to hit my web service every time the map moves. If so, how would I go about it?
I haven't been able to find answers to my questions, which usually means that I am asking the wrong questions. If anyone could help me figure out the right question it would be greatly appreciated.
Well, I've implemented a slightly different approach to this. It was just a fun exercise, but I'm displaying all my data (about 140.000 points) in Bing Maps using the HTML5 canvas.
I previously load all the data to the client. Then, I've optimized the drawing process so much that I've attached it to the "Viewchange" event (which fires all the time during the view change process).
I've blogged about this. You can check it here.
My example does not have interaction on it but could be easily done (should be a nice topic for a blog post). You would have thus to handle the events manually and search for the corresponding points yourself or, if the amount of points to draw and/or the zoom level was below some threshold, show regular pushpins.
Anyway, another option, if you're not restricted to Bing Maps, is to use the likes of Leaflet. It allows you to create a Canvas Layer which is a tile-based layer but rendered in client-side using HTML5 canvas. It opens a new range of possibilities. Check for example this map in GisCloud.
Yet another option, although more suitable to static data, is using a technique called UTFGrid. The lads that developed it can certainly explain it better than me, but it scales for as many points as you want with a fenomenal performance. It consists on having a tile layer with your info, and an accompanying json file with something like an "ascii-art" file describing the features on the tiles. Then, using a library called wax it provides complete mouse-over, mouse-click events on it, without any performance impact whatsoever.
I've also blogged about it.
I think clustering would be your best bet if you can get away with using it. You say that you tried using clustering but it still caused performance problems? I went to test it out with 80000 data points at the V7 Interactive SDK and it seems to perform fine. Test it out yourself by going to the link and change the line in the Load module - clustering tab:
TestDataGenerator.GenerateData(100,dataCallback);
to
TestDataGenerator.GenerateData(80000,dataCallback);
then hit the Run button. The performance seems acceptable to me with that many data points.
I've been experimenting with different ideas of how to store a 2D game world. I'm interested in hearing techniques of storing large quantities of objects while managing the set that's visible ( lets say 100,000 tiles square ). Obviously the techniques can vary based on how the game renders that space.
Lets assume that we're describing a scrolling 2d game world rather than screen based as you could fairly easily do screen based rendering from such a setup while the converse is a bit more messy.
Looking for language agnostic solutions here so it's more helpful to others.
Edit: I think a good answer here would be a general review of the ideas to consider when thinking about this, as some of the responders have attempted, but also begin to explain how different solutions would apply to those scenarios. It's a somewhat complex question, so I would expect a good answer to reflect that.
Quadtrees are a fairly efficient solution for storing data about a large 2-dimensional world and the objects within it.
You might get some ideas on how to implement this from some spatial data structures like range or kd trees.
However, the answer to this question would vary considerably depending exactly on how your game works.
Are we talking a 2D platformer with 10 enemies onscreen, 20 more offscreen but "active", and an unknown number more "inactive"? If so, you can probably store your whole level as an array of "screens" where you manipulate the ones closest to you.
Or do you mean a true 2D game with lots of up/down movement too? You might have to be a bit more careful here.
The platform is also of some importance. If you're implementing a simple platformer for desktop PCs, you probably wouldn't have to worry about performance as much as you would on an embedded device. This is no excuse to be naive about it, but you might not have to be terribly clever either.
This is a somewhat interesting question I think. Presumably someone smarter than I who has experience with implementing platformers has thought these things out already.
Break the world into smaller areas, and deal with them. Any solution to this problem is going to boil down to this concept (such as quadtrees, mentioned in another answer). The differences will be in how they subdivide the world.
How much data is stored per tile? How fast can players move across the world? What's the behavior of NPCs, etc., that are offscreen? Do they just reset when the player comes back (like old Zelda games)? Do they simply resume where they were? Do they do some kind of catch-up script?
How much different rendering data is going to be needed for different areas?
How much of the world can be seen at one time?
All of these questions are going to immpact your solution, as well as the capabilities of your platform. Coming up with a general answer for these without having a reasonable idea of these parameters is going to be a bit difficult.
Assuming that your game will only update what is visible and some area around what is visible, just break the world in "screens" (a "screen" is a rectangular area on the tilemap that can fill the whole screen). Keep in memory the "screens" around the visible area (and some more if you want to update entities which are close to the character - but there is little reason to update an entity that far away) and have the rest on disk with a cache to avoid loading/unloading of commonly visited areas when you move around. Some setup like:
+---+---+---+---+---+---+---+
|FFF|FFF|FFF|FFF|FFF|FFF|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|VVV|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|NNN|NNN|NNN|NNN|NNN|FFF|
+---+---+---+---+---+---+---+
|FFF|FFF|FFF|FFF|FFF|FFF|FFF|
+---+---+---+---+---+---+---+
Where "V" part is the "screen" where the center (hero or whatever) is, the "N" parts are those who are nearby and have active (updating) entities, are checked for collisions, etc and "F" parts are far parts which might get updated infrequently and are prone to be "swapped" out (stored to disk). Of course you might want to use more "N" screens than two :-).
Note btw that since 2D games do not usually hold much data instead of saving the far away parts to disk you might want to just keep them in memory compressed.
You probably want to use a single int or byte array that links to block types. If you need more optimization from there, then you'll want to link to more complicated data structures like oct trees from your array. There is a good discussion on a Java game forum here: http://www.javagaming.org/index.php/topic,20505.30.html text
Anything with links becomes very expensive because the pointer takes up something like 8 bytes each, depending upon the language, so depending upon how populated your world is it can get expensive very quickly (8 pointers 8 bytes each is 64 bytes per item, and a byte array is 1 byte per item). So unless 1/64 of your world is empty, a byte array is going to be a much better option. You're also going to need to spend a lot of time iterating down the tree whenever you're doing a lookup for collision or whatever else - a byte array will be an instantaneous lookup.
Hopefully that's detailed enough for you. :-)