Drupal Performance difference between CCK field and text?

What do you think the performance difference would be?
20,000 nodes
Each node has a Link field. The number of values ranges from 50 to 200. The links will have no title.
OR
20,000 nodes
Each node will have the links in the body field as plain text with filtered HTML, like so:
http://link1.com
http://link2.com
http://link3.com
http://link4.com
http://link5.com
http://link6.com
http://link7.com
http://link8.com
http://link9.com
http://link10.com

It really depends on how and what you are going to use them for. I doubt you are going to display 20,000 nodes at once. It's really hard to say much about performance without a specific use case, and even then you have to take caching and whatnot into consideration as well.
In any case, CCK will probably always be a tiny bit slower, because you are extracting multiple values instead of a single value, which makes the query a tiny bit more complex. I doubt that you will be able to measure it on your Drupal site, though.
Another thing to keep in mind is that using CCK fields will give you added flexibility, as CCK integrates well with Views. So you can easily pull out the links and format them in different ways.
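To make the "tiny bit more complex" point concrete, here is roughly what the two reads look like in SQL. The table and column names follow common Drupal 6 conventions (node_revisions.body for the body, a content_field_link table for a multi-value CCK Link field), but treat them as illustrative rather than exact:

    -- Body-as-text version: one row, one text value per node.
    SELECT nr.body
    FROM node n
    JOIN node_revisions nr ON nr.vid = n.vid
    WHERE n.nid = 123;

    -- CCK multi-value Link field: one row per link (50-200 here), an extra
    -- join, and an ORDER BY on the delta column to keep the links in order.
    SELECT f.delta, f.field_link_url
    FROM node n
    JOIN content_field_link f ON f.vid = n.vid
    WHERE n.nid = 123
    ORDER BY f.delta;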


How to implement a spreadsheet in a browser?

I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused about what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work, with cells being the unique keys and the values being the objects in the cells? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph, but I was unsure of how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that it's in a browser, as compared to a separate application, add any difficulties?
Thanks!
For an interview this is a good question. If this were asked as an actual task in your job, there would be a simple answer: use a third-party component; there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively you can go the data route: talk about data structures to hold a spreadsheet, about the links between cells that formulas create, about how to detect and deal with circular references, and about how in a browser you have less control over memory, so very large spreadsheets could run into problems earlier. You can talk about what is available in JavaScript vs a native language and how this impacts the data structures and calculations.
Along with data, a big issue with spreadsheets is numerical accuracy and floating point calculations. Floating point numbers are made to be fast but are not necessarily accurate at extreme levels of precision, and this leads to a lot of confusing questions. I believe Excel fairly recently switched to its own fixed decimal representation, as it's now viable to do spreadsheet-level calculations without relying on the built-in floating point arithmetic.
You can also talk about how data structures and calculation affect performance. In a browser you don't have threads (yet), so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script; you need to break up the calculation.
Finally, you can come at it from the user experience angle. How is the experience in a browser different from a native application? What are the advantages, and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (for example, associating your spreadsheet app with a file type so a user can double-click a file and have it open in your online spreadsheet app, although I may be wrong about that still being unsupported)?
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer who is looking for one specific answer, and in that case you're pretty much out of luck unless you're telepathic.
There is hopelessly much you could say about this. I'd probably start with:
If most of the cells are filled, use a simple 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant having each cell be a vertex and each reference to another cell be a directed edge. This would allow you to check for circular references fairly easily, and allow for efficient updating of all cells that need to change.
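To make the graph idea concrete: store one edge per "this cell depends on that cell" reference and walk the edges while remembering the path; a circular reference is simply a cell that reappears on its own path. The sketch below expresses that walk as a recursive SQL query over an invented cell_refs table, purely as an illustration of the structure - an in-browser implementation would keep the same adjacency data in a JavaScript map and do the identical walk in memory.

    -- Hypothetical dependency table: one row per edge "cell depends on depends_on".
    CREATE TABLE cell_refs (
        cell       text NOT NULL,   -- e.g. 'C1' for the formula C1 = C2 + C3
        depends_on text NOT NULL    -- e.g. 'C2'
    );

    INSERT INTO cell_refs VALUES
        ('C1', 'C2'), ('C1', 'C3'),   -- C1 = C2 + C3
        ('C5', 'C1'), ('C5', 'C4');   -- C5 = C1 - C4

    -- Follow dependencies from every cell; a row is flagged as a cycle when
    -- the next edge starts at a cell already seen on the current path.
    WITH RECURSIVE walk(cell, depends_on, path, cycle) AS (
        SELECT cell, depends_on, ARRAY[cell], false
        FROM cell_refs
      UNION ALL
        SELECT r.cell, r.depends_on, w.path || r.cell, r.cell = ANY(w.path)
        FROM walk w
        JOIN cell_refs r ON r.cell = w.depends_on
        WHERE NOT w.cycle
    )
    SELECT path FROM walk WHERE cycle;   -- empty result here: no circular references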
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything or just the subset of cells that are currently visible)
How are you sending updates to the server (every change as it happens, a collection of changed cells sent only on save, or no separate change tracking at all, with the whole grid sent across on save)
Auto-save should probably be considered as well
Will you have an "undo", will this only be local, if not, how will you handle this on the server and how will you send through the updates
Is only this one user allowed to work with it at a time (or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table)
Looking at the CSS cursor property just begs for one to create a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose-built for tabular data.
Resizing cell height and width is achievable with offsetX and offsetY.
Storing the data is trivial. It can be Mongo, MySQL, Firebase, ...whatever. On blur, send the update.
JavaScript/ECMAScript is more than capable of delivering all the Excel built-in functions. Did I mention web workers?
Need to increment letters as in column IDs? I got you covered.
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.

Benefits and trade-offs for improving text search on small data in PostgreSQL

I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.
PostgreSQL offers a few options for improving text searching speed. The possible options built into PostgreSQL I've been considering are
1. Don't try to index this at all. Just use ILIKE, LIKE on lower(), or similar. (Without an index?)
2. Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
3. Full text searching. I assume this would involve concatenating for the index also.
Unfortunately, I'm not really familiar with the expected performance of any of these or their benefits and trade-offs, so it's hard to know what I should try first and what I shouldn't even consider. Some things I have read suggest that building the indexes for options 2 and 3 is pretty slow, which conflicts with the fact that I'll have occasional modifications going on. And the mixed language makes full text search seem unattractive, since it appears to be language-based, unless it can handle multiple languages simultaneously. Would I expect that for data this small, a simple ILIKE or maybe a LIKE on lower() is probably fast enough? Or maybe the indexing is fast enough for the low load of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all these to really know for sure what's fastest, but unfortunately, I don't have much time for this project. So what are the benefits and trade offs of these methods? What of these options are not appropriate for solving this type of problem? What are some other types of solutions (including potentially outside the database) worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches turn up Full Text Search for the most part, which I don't even know if it's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should
radiata pinus
find:
pinus radiata
? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%) or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
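Minimal sketches of those three cases, assuming a hypothetical names table with a text column called word:

    -- Left-anchored (staph%): an ordinary b-tree is enough; text_pattern_ops
    -- makes it usable for LIKE when the database locale is not C.
    CREATE INDEX names_word_prefix ON names (word text_pattern_ops);
    SELECT * FROM names WHERE word LIKE 'staph%';

    -- Right-anchored (%ccus): the "clumsy" reverse() trick - index the reversed
    -- value and search with the reversed, now left-anchored, pattern.
    CREATE INDEX names_word_suffix ON names (reverse(word) text_pattern_ops);
    SELECT * FROM names WHERE reverse(word) LIKE 'succ%';   -- i.e. '%ccus' reversed

    -- Infix (%ccus% somewhere inside a word): pg_trgm with a GIN index.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX names_word_trgm ON names USING gin (word gin_trgm_ops);
    SELECT * FROM names WHERE word ILIKE '%ccus%';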
If you're mostly searching for discrete words and word-order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
If you're mostly doing prefix searches of discrete columns then simple LIKE queries on a regular b-tree index per column will be the way to go.
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
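For concreteness, here are minimal sketches of the three options from the question applied to the four-column case; catalog and its column names are invented placeholders:

    -- 1. No index: a plain case-insensitive substring match across the columns.
    SELECT * FROM catalog
    WHERE genus   ILIKE '%pinus%'
       OR species ILIKE '%pinus%'
       OR family  ILIKE '%pinus%'
       OR descr   ILIKE '%pinus%';

    -- 2. pg_trgm on the concatenation; the query has to filter on the same
    --    expression for the index to be usable.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX catalog_trgm ON catalog USING gin
      ((coalesce(genus,'') || ' ' || coalesce(species,'') || ' ' ||
        coalesce(family,'') || ' ' || coalesce(descr,'')) gin_trgm_ops);
    SELECT * FROM catalog
    WHERE (coalesce(genus,'') || ' ' || coalesce(species,'') || ' ' ||
           coalesce(family,'') || ' ' || coalesce(descr,'')) ILIKE '%pinus%';

    -- 3. Full text search; the 'simple' configuration skips language-specific
    --    stemming, which sidesteps the mixed Latin/English concern, but it
    --    matches whole words rather than substrings.
    CREATE INDEX catalog_fts ON catalog USING gin
      (to_tsvector('simple',
         coalesce(genus,'') || ' ' || coalesce(species,'') || ' ' ||
         coalesce(family,'') || ' ' || coalesce(descr,'')));
    SELECT * FROM catalog
    WHERE to_tsvector('simple',
            coalesce(genus,'') || ' ' || coalesce(species,'') || ' ' ||
            coalesce(family,'') || ' ' || coalesce(descr,''))
          @@ plainto_tsquery('simple', 'pinus');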
For 1000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries the table will most probably be completely cached.
Regarding the indexing using pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think that the additional costs of using a trigram index would only show up when you update/insert that table a lot - like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real live. I think somewhere in Depesz's blob there was also an article that compared the insert speed with and without a trigram index, but I can't find it anymore.

Organizing large collection of things - usability and UI point of view

In our application we have a repository that contains things (they are called methods and queries, but this is not particularly relevant for this question). Each thing has a title, description (though some may lack both) and some other data. Users save things to repository and load and use things from repository.
I wonder what is the best way to organize the repository from a usability point of view. There seem to be two major approaches. The first is to put things in folders, subfolders and so on, and have a hierarchical structure similar to a filesystem. The second (which has become fashionable) is to have a flat space and assign zero or more tags to each thing, so that users can view a list of things for a particular tag.
Currently we use flat space, tags and search. It appears to be somewhat unmanageable. I am not sure if switching to folders/subfolders will make it better.
I would like to learn more about the pros and cons of each approach and what properties of the collection and the things themselves suggest using one or another approach or a combination of both. If anybody can point me to some studies or discussions of those, I would really appreciate that.
There is no reason you can't use both methods. To some extent, finding things depends on what the thing is and why it is being looked for. A hierarchical design can work well when somebody knows what they are looking for, and a tag/keyword-based system can work better when the structure is less obvious.
A network structure that links similar things can also be very good, as you can see with the internet or Wikipedia.
I use the law of symmetry to help me in this situation.
First you build the tree-like structure in the back end and then build the tagging system for the front end.
You use both to organize your data collection.
A tag cloud works better than a hierarchy if
the taxonomy is uncertain
("Now is this a small car or a large truck?")
there is no central authority for classification
there is no obvious or natural order between the classes
(cars can be classified by color or by size, there is no obvious rank between color and size)
new categories may be created on the fly
Otherwise, a hierarchy gives more confidence in completeness, as every item has exactly one obviously correct location: did I find all documents about birds? Is there really no document about five-story houses?
Tag clouds need some maintenance, I am not sure if this can be completely user-provided:
Dealing with synonyms, tag synonyms, merging tags, clarifying tags (e.g. is "blue" a feeling or a color?)
Another option is attribute-value pairs. They can be built upon a well-maintained tag cloud, e.g. grouping "red / black / blue" tags under "color". They can also work with continuous values, and search can be extended to similar values when there are not enough exact results (such as age, date, or even multidimensional values like color).
However, this requires knowing ahead of time what search criteria users need. If you need to introduce a new category, you need to re-tag the entire body of documents.
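A minimal sketch of that difference, with invented table and column names - plain tags are free-form labels, while attribute-value pairs add the "which attribute" dimension that makes grouping and range-style searches possible:

    -- Plain tags: one free-form label per row.
    CREATE TABLE thing_tags (
        thing_id integer NOT NULL,
        tag      text    NOT NULL      -- 'blue' ... a feeling or a color?
    );

    -- Attribute-value pairs: every value belongs to a named attribute, so
    -- 'red' / 'black' / 'blue' can all be grouped under 'color'.
    CREATE TABLE thing_attributes (
        thing_id  integer NOT NULL,
        attribute text    NOT NULL,    -- e.g. 'color', 'age'
        value     text    NOT NULL     -- e.g. 'blue', '42'
    );

    -- "Find blue things" without the ambiguity:
    SELECT thing_id
    FROM thing_attributes
    WHERE attribute = 'color' AND value = 'blue';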
See also my request for clarification: what are the problems? Not enough tagging? Tagging too distinct? Users not finding what they are looking for? Users not confident in the search results?

What should I do with an over-bloated select-box/drop-down

All web developers run into this problem when the amount of data in their project grows, and I have yet to see a definitive, intuitive best practice for solving it. When you start a project, you often create forms with select tags to help pick related objects for one-to-many relationships.
For instance, I might have a system with Neighbors, where each Neighbor belongs to a Neighborhood. In version 1 of the application I create an edit form with a drop-down for selecting the Neighborhood, which simply lists the 5 possible neighborhoods in my geographically limited application.
In the beginning, this works great. As long as I have maybe 100 records or less, my select box will load quickly and be fairly easy to use. However, let's say my application takes off and goes national. Instead of 5 neighborhoods I have 10,000. Suddenly my little drop-down takes forever to load, and once it loads, it's hard to find your neighborhood in the massive alphabetically sorted list.
Now, in this particular situation, having hierarchical data and letting users drill down using several dynamically generated drop-downs would probably work okay. However, what is the best solution when the objects/records being selected are not hierarchical in nature? In the past, I've done this with a popup with a search box and a list, but this seems clunky and dated. In today's web 2.0 world, what is a good way to find one object amongst many for one's forms?
I've considered using an Ajaxified search box, but this seems to work best for free text, and falls apart a little when the data to be saved is just a reference to another object or record.
Feel free to cite specific libraries with generic solutions to this problem, or simply share what you have done in your projects in a more general way.
I think an auto-completing text box is a good approach in this case. Here on SO, they also use an auto-completing box for tags where the entry already needs to exist, i.e. not free-text but a selection. (remember that creating new tags requires reputation!)
I personally prefer this anyways, because I can type faster than select something with the mouse, but that is programmer's disease I guess :)
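As a sketch, the lookup behind such an auto-complete box can be as simple as the query below (neighborhoods stands in for whatever table backs the list, and the typed prefix would arrive as a bound parameter in real code). Returning the id alongside the label keeps the value that finally gets saved a plain reference to the other record, which addresses the concern in the question about saving references rather than free text:

    -- 'spri' stands in for whatever the user has typed so far.
    SELECT id, name
    FROM neighborhoods
    WHERE lower(name) LIKE 'spri%'
    ORDER BY name
    LIMIT 10;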
Auto-complete is usually the best solution in my experience for searches, but only where the user is able to provide text tokens easily, either as part of the object name or taxonomy that contains the object (such as a product category, or postcode).
However this doesn't always work, particularly where 'browse' behavior would be more suitable - to give a real example, I once wrote a page for a community site that allowed a user to send a message to their friends. We used auto-complete there, allowing multiple entries separated by commas.
It works great when you know the names of the people you want to send the message to, but we found during user acceptance that most people didn't really know who was on their friend list and couldn't use the page very well - so we added a list popup with friend icons, and that was more successful.
(this was quite some time ago - everyone just copies Facebook now...)
Different methods of organizing large amounts of data:
Hierarchies
Spatial (geography/geometry)
Tags or facets
Different methods of searching large amounts of data:
Filtering (including autocomplete)
Sorting/paging (alphabetically-sorted data can also be paged by first letter)
Drill-down (assuming the data is organized as above)
Free-text search
Hierarchies are easy to understand and (usually) easy to implement. However, they can be difficult to navigate and lead to ambiguities. Spatial visualization is by far the best option if your data is actually spatial or can be represented that way; unfortunately this applies to less than 1% of the data we normally deal with day-to-day. Tags are great, but - as we see here on SO - can often be misused, misunderstood, or otherwise rendered less effective than expected.
If it's possible for you to reorganize your data in some relatively natural way, then that should always be the first step. Whatever best communicates the natural ordering is usually the best answer.
No matter how you organize the data, you'll eventually need to start providing search capabilities, and unlike organization of data, search methods tend to be orthogonal - you can implement more than one. Filtering and sorting/paging are the easiest, and if an autocomplete textbox or paged list (grid) can achieve the desired result, go for that. If you need to provide the ability to search truly massive amounts of data with no coherent organization, then you'll need to provide a full textual search.
If I could point you to some list of "best practices", I would, but HID is rarely so clear-cut. Use the aforementioned options as a starting point and see where that takes you.

Why is pagination so resource-expensive?

It's one of those things that seems to have an odd curve where the more I think about it, the more it makes sense. To a certain extent, of course. And then it doesn't make sense to me at all.
Care to enlighten me?
Because in most cases you've got to sort your results first. For example, when you search on Google, you can view only up to 100 pages of results. They don't bother sorting by page rank beyond 1000 websites for a given keyword (or combination of keywords).
Pagination is fast. Sorting is slow.
Lubos is right: the problem is not the fact that you are paging (which takes a HUGE amount of data off the wire), but that you need to figure out what is actually going onto the page.
The fact that you need to page implies there is a lot of data. A lot of data takes a long time to sort :)
This is a really vague question. We'd need a concrete example to get a better idea of the problem.
This question seems pretty well covered, but I'll add a little something MySQL specific as it catches out a lot of people:
Avoid using SQL_CALC_FOUND_ROWS. Unless the dataset is trivial, counting matches and retrieving x amount of matches in two separate queries is going to be a lot quicker. (If it is trivial, you'll barely notice a difference either way.)
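For reference, a sketch of both forms against a hypothetical articles table (20 rows per page, page 3):

    -- Single-query form with SQL_CALC_FOUND_ROWS - the one to avoid on
    -- non-trivial data sets, since it forces MySQL to find every match:
    SELECT SQL_CALC_FOUND_ROWS id, title
    FROM articles
    WHERE published = 1
    ORDER BY created_at DESC
    LIMIT 20 OFFSET 40;
    SELECT FOUND_ROWS();

    -- Two-query form the answer recommends: a plain count plus a LIMITed page.
    SELECT COUNT(*) FROM articles WHERE published = 1;
    SELECT id, title
    FROM articles
    WHERE published = 1
    ORDER BY created_at DESC
    LIMIT 20 OFFSET 40;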
I thought you meant pagination of the printed page - that's where I cut my teeth. I was going to launch into a great monologue about collecting all the content for the page, positioning (a vast number of rules here; constraint engines are quite helpful) and justification... but apparently you were talking about the process of organizing information on webpages.
For that, I'd guess database hits. Disk access is slow. Once you've got it in memory, sorting is cheap.
Of course sorting on a random query takes some time, but if you're having problems with the same paginated query being used regularly, there's either something wrong with the database setup (improper indexing or none at all, too little memory, etc. - I'm not a DB manager) or you're doing pagination seriously wrong:
Terribly wrong: e.g. doing select * from hugetable where somecondition; into an array, getting the page count from the array length, picking the relevant indexes and discarding the array - then repeating this for each page... That's what I call seriously wrong.
The better solution is two queries: one getting just the count, then another getting the results using limit and offset. (Some proprietary, non-standard SQL server might have a one-query option, I dunno.)
The bad solution might actually work quite okay on small tables (in fact it's not unthinkable that it's faster on very small tables, because the overhead of making two queries is bigger than getting all rows in one query - I'm not saying it is so...) but as soon as the database begins to grow the problems become obvious.
