Does a machine learning algorithm copy the data it learns from?

I am not a programmer, rather a law student, but I am currently researching for a project involving artificial intelligence and copyright law. I am currently looking at whether the learning process of a machine learning algorithm may be copyright infringement if a protected work is used by the algorithm. However, this relies on whether or not the algorithm copies the work or does something else.
Can anyone tell me whether machine learning algorithms typically copy the data (picture/text/video/etc.) they are analysing (even if only briefly), or whether they can obtain the required information from the data through other methods that do not require copying (akin to a human looking at a stop sign and recognising it as a stop sign without necessarily copying the image)?
Apologies for my lack of knowledge and I'm sorry if any of my explanation flies in the face of any established machine learning knowledge. As I said, I am merely a lowly law student.
Thanks in advance!

A few machine learning algorithms actually retain a copy of the training set, for example k-nearest neighbours. See https://en.wikipedia.org/wiki/Instance-based_learning. Not all do this; in fact it is usually regarded as a disadvantage, because the training set can be large.
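To make that concrete, here is a minimal sketch in Python (the class and data are made up for illustration) of an instance-based learner whose "model" is nothing more than a stored copy of the training set:

    # Minimal sketch of instance-based learning: the "model" is just a stored
    # copy of the training set, so learning here really does mean copying.
    import numpy as np

    class OneNearestNeighbour:
        def fit(self, X, y):
            # The training data is kept verbatim inside the model object.
            self.X_ = np.asarray(X, dtype=float)
            self.y_ = np.asarray(y)
            return self

        def predict(self, X):
            X = np.asarray(X, dtype=float)
            # For each query point, return the label of the closest stored example.
            dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
            return self.y_[dists.argmin(axis=1)]

    clf = OneNearestNeighbour().fit([[0, 0], [1, 1], [2, 2]], [0, 1, 1])
    print(clf.predict([[0.1, 0.2], [1.9, 2.1]]))   # -> [0 1]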
Computers are also built around a number of different stores of data of different sizes and speeds. They usually copy the data they are working on into small, fast stores while they are working on it, because the larger stores take much longer to read and write. One of many possible examples of this has been the subject of legal wrangling of which I know little - see e.g. https://law.stackexchange.com/questions/2223/why-does-browser-cache-not-count-as-copyright-infringement and others on browser-cache copyright. If a computer has added two numbers, it will certainly have stored them in its internal memory. It is very likely that it will have stored at least one of them in what are called internal registers - very small, very fast memory intended for holding numbers that are being worked on.
If a computer (or any other piece of electronic equipment) has been used to process classified data, it is usual to treat it as classified from then on, making the worst case assumption that it might have retained some copy of any of the data it has been used to process, even if retrieving that data from it would in practice require a great deal of specialised expertise with specialised equipment.

Typically, no. The first thing a typical ML algorithm does with its input is not to copy or store it, but to compute something based on it and then forget the original. This is a fair description of what neural networks, regression algorithms and statistical methods do. There is no 'eidetic memory' in mainstream ML. I imagine anything doing that would be marketed as a database or a full-text indexing engine or some such.
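As a minimal sketch of that "compute something and forget the original" behaviour (Python/NumPy, chosen purely for illustration): a least-squares fit reduces thousands of training rows to a handful of coefficients, and the original rows are not needed afterwards.

    # Minimal sketch: an ordinary least-squares fit keeps only the coefficients,
    # not the data it was computed from.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))                 # 10,000 training rows
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10_000)

    coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

    # Everything the model "remembers" is these four numbers; the 10,000
    # original rows could be deleted without affecting later predictions.
    print(coef)                                      # roughly [2, -1, 0.5, 0]
    print(coef.nbytes, "bytes of model vs", X.nbytes + y.nbytes, "bytes of data")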
But how will you present your data to an algorithm running on a machine without first copying the data to that machine?

Does a machine learning algorithm copy the data it learns from?
There are many different machine learning algorithms. If you are talking about k nearest neighbor (k-NN) then the answer is simply yes.
However, k-NN is rarely used. Most (all?) other models are not that simple. Usually, a machine learning developer wants the training data to be compressed (a lot, lossily) by the model, for several reasons: (1) the amount of training data is large (many GB); (2) generalization might be better if the training data is compressed; (3) inference on new examples might take a very long time if the data is not compressed. (By "compress", I mean that the relevant information for the task is extracted and irrelevant data is removed - not compression in the usual sense.)
For models other than k-NN, the answer is more complicated. It depends on what you consider a "copy". For example, from artificial neural networks (especially the sub-type of convolutional neural networks, short: CNNs) the training data can partially be restored. These models are state of the art for many (all?) computer vision tasks.
I could not find papers which show that you can (partially) restore / extract training data from CNNs with the focus on possible privacy / copyright problems, but I'm ~70% certain I have read an abstract about this problem. I think I've also heard a talk where a researcher said this was a problem when building a detector for child pornography. However, I don't think that was recorded or anything published about this.
Here are two papers which indicate that restoring training data from CNNs might be possible:
Understanding deep learning requires rethinking generalization
Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images and the Zeiler & Fergus paper

It depends on what you mean by the word "copy". If you run any program, it will copy the data from the hard disk to RAM for processing. I am assuming this is not what you meant.
So if you have the copyrighted data on a particular machine and you run your machine learning algorithms on that data, there is no reason for the algorithm to copy the data out of that machine.
On the other hand, if you use a cloud ML service (AWS/IBM Bluemix/Azure), then you need to upload the data to the cloud before you can run ML algorithms. This would mean you are copying the data.
Hopefully this sheds more light!
Lowly ML student

Some machine learning algorithms do keep a copy of the data set, k-NN for example. Unfortunately, such algorithms are not commonly used in practice because they can't be scaled to large data sets.
Most ML algorithms use the data set to identify a pattern; that's why pattern recognition is another name for machine learning. The pattern is almost always much smaller (in terms of memory, variables, etc.) than the original data set.

Related

number of layers in a convolutional neural network

I am a beginner with convolutional networks. I use DIGITS to implement them and am facing a few doubts.
While trying out a basic image classification problem, how do we decide on the number of layers - how many conv layers, fully connected layers, etc.?
In DIGITS we have 3 standard architectures (from published papers) implemented; for a particular dataset, is there any way to find out which architecture to use, or when we should use our own architecture?
How can the hidden layers be helpful in solving the problem - i.e. what possible decisions can we take by looking at the results in the hidden layers?
Deciding how many layers or neurons are needed, or what the best architecture for a neural network is, has never been clear-cut. The usual procedure is to build a network with some chosen parameters, measure its performance on the training set and the test set (taking care not to bias or overfit the data), and then decide on the best parameters; alternatively, you can use a search procedure such as a genetic algorithm.
In conclusion: either you start from scratch every time and measure the network's performance, or you apply approaches which don't need to start from scratch and can be built up incrementally, by applying transfer learning and fine-tuning to an existing network architecture.
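For concreteness, here is a rough sketch of the trial-and-error procedure just described - assuming scikit-learn rather than the DIGITS/Caffe setup from the question, with illustrative layer sizes:

    # Rough sketch: train one model per candidate architecture and compare
    # held-out accuracy. Assumes scikit-learn; sizes are illustrative only.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    candidates = [(64,), (128,), (64, 64), (128, 64, 32)]   # units per hidden layer
    for layers in candidates:
        clf = MLPClassifier(hidden_layer_sizes=layers, max_iter=500, random_state=0)
        clf.fit(X_tr, y_tr)
        print(layers, "train:", clf.score(X_tr, y_tr), "val:", clf.score(X_val, y_val))
    # Pick the architecture with the best validation score, watching for a large
    # gap between training and validation accuracy (overfitting).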
The core philosophy that makes deep learning so democratic and amazing is simple "Don't be a Hero".
What it means is that in most cases the best deep learning models take millions of data points and weeks to train, something most of us cannot achieve with our low-performance PCs (yes, a single-GPU system is low performance). So why would you want to waste your time building and training NN architectures from scratch? Simple: you don't.
Transfer learning is your solution! Try to find models that were trained on data similar to your problem and fine-tune their pre-trained weights on your data set. By doing this you not only get an already proven NN architecture but also a major head start in training.
The best place to find pre-trained models is the Caffe Model Zoo, so go have a look at it.
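As a hedged sketch of the fine-tuning idea (using PyTorch/torchvision purely for illustration rather than the Caffe Model Zoo mentioned above; the number of classes is made up):

    # Sketch: reuse pre-trained weights and only retrain the last layer.
    # Uses the torchvision >= 0.13 weights API.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pre-trained backbone so its weights keep what they learned.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final classification layer with one sized for your own classes.
    num_classes = 5                                   # illustrative value
    model.fc = nn.Linear(model.fc.in_features, num_classes)

    # Only the new layer's parameters are handed to the optimizer.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    # ...then train as usual on your own (small) labelled data set.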

What is tuning in machine learning? [closed]

I am a novice learner of machine learning and am confused by tuning.
What is the purpose of tuning in machine learning? To select the best parameters for an algorithm?
How does tuning work?
Without getting into a technical demonstration that would seem appropriate for Stack Overflow, here are some general thoughts. Essentially, one can argue that the ultimate goal of machine learning is to make a machine system that can automatically build models from data without requiring tedious and time-consuming human involvement. As you recognize, one of the difficulties is that learning algorithms (e.g. decision trees, random forests, clustering techniques, etc.) require you to set parameters before you use the models (or at least to set constraints on those parameters). How you set those parameters can depend on a whole host of factors. That said, your goal is usually to set those parameters to optimal values that enable you to complete a learning task in the best way possible. Thus, tuning an algorithm or machine learning technique can simply be thought of as the process of optimizing the parameters that impact the model, in order to enable the algorithm to perform at its best (once, of course, you have defined what "best" actually is).
To make it more concrete, here are a few examples. If you take a clustering algorithm like k-means, you will note that you, as the programmer, must specify the number of K's (i.e. centroids) in your model. How do you do this? You tune the model. There are many ways that you can do this. One of them is trying many, many different values of K for the model, and looking at how the inter- and intra-group error changes as you vary the number of K's in your model.
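For illustration, a minimal sketch of that tuning loop - assuming Python with scikit-learn, which this answer does not mention - trying several values of K for k-means and watching the within-cluster error:

    # Sketch of tuning K for k-means by trying many values and watching the
    # within-cluster error ("elbow method"). Assumes scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)   # inertia = sum of squared distances to centroids
    # The value of K where the inertia stops dropping sharply is a common choice.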
As another example, let us consider support vector machine (SVM) classification. SVM classification requires an initial learning phase in which the training data are used to adjust the classification parameters. This really refers to an initial parameter-tuning phase in which you, as the programmer, might try to "tune" the models in order to achieve high-quality results.
Now, you might be thinking that this process can be difficult, and you are right. In fact, because of the difficulty of determining what optimal model parameters are, some researchers use complex learning algorithms before experimenting adequately with simpler alternatives with better tuned parameters.
In the abstract sense of machine learning, tuning is working with / "learning from" variable data based on some parameters which have been identified to affect system performance as evaluated by some appropriate[1] metric. Improved performance reveals which parameter settings are more favorable (tuned) or less favorable (untuned).
Translating this into common sense, tuning is essentially selecting the best parameters for an algorithm to optimize its performance given a working environment such as hardware, specific workloads, etc. And tuning in machine learning is an automated process for doing this.
For example, there is no such thing as a "perfect set" of optimizations for all deployments of an Apache web server. A sysadmin learns from the data "on the job" so to speak and optimizes their own Apache web server configuration as appropriate for its specific environment. Now imagine an automated process for doing this same thing, i.e., a system that can learn from data on its own, which is the definition of machine learning. A system that tunes its own parameters in such a data-based fashion would be an instance of tuning in machine learning.
[1] System performance, as mentioned here, can be many things, and is much more general than the computers themselves. Performance can be measured by minimizing the number of adjustments needed for a self-driving car to parallel park, or the number of false predictions in autocomplete; or it could be maximizing the time an average visitor spends on a website based on advertisement dimensions, or the number of in-app purchases in Candy Crush.
Cleverly defining what "performance" means in a way that is both meaningful and measurable is key in a successful machine learning system.
A little pedantic but just want to clarify that a parameter is something that is internal to the model (you do not set it). What you are referring to is a hyperparameter.
Different machine learning algorithms have a set of hyperparameters that can be adjusted to improve performance (or make it worse). The most common and maybe simplest way to find the best hyperparameter is through what's known as a grid search (searching across a set of values).
Some examples of hyperparameters include the number of trees for a random forest algorithm, or a value for regularization.
Important note: hyperparameters must be tuned on a separate validation set held out from the training data. Lots of people new to machine learning will modify the hyperparameters until they see the best performance on the test dataset; by doing this you are essentially overfitting the hyperparameters to the test set.
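A minimal sketch of that workflow, assuming Python with scikit-learn (the dataset and parameter values are just placeholders): the grid search is cross-validated inside the training data, and the test set is only touched once at the end.

    # Sketch: search hyperparameters with cross-validation on the training data
    # only, and keep the test set untouched until the final evaluation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
        cv=5,                       # cross-validation inside the training data
    )
    grid.fit(X_train, y_train)

    print("best hyperparameters:", grid.best_params_)
    print("test accuracy:", grid.score(X_test, y_test))   # evaluated once, at the end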

What is the difference between Big Data and Data Mining? [closed]

As Wikipedia states
The overall goal of the data mining process is to extract information
from a data set and transform it into an understandable structure for
further use
How is this related with Big Data? Is it correct if I say that Hadoop is doing data mining in a parallel manner?
Big data is everything
Big data is a marketing term, not a technical term. Everything is big data these days. My USB stick is a "personal cloud" now, and my hard drive is big data. Seriously. This is a totally unspecific term that is largely defined by what the marketing departments of various very optimistic companies can sell - and what the C*Os of major companies buy, in order to make magic happen. Update: and by now, the same applies to data science. It's just marketing.
Data mining is the old big data
Actually, data mining was just as overused... it could mean anything such as
collecting data (think NSA)
storing data
machine learning / AI (which predates the term data mining)
non-ML data mining (as in "knowledge discovery", where the term data mining was actually coined; but where the focus is on new knowledge, not on learning of existing knowledge)
business rules and analytics
visualization
anything involving data you want to sell for truckloads of money
It's just that marketing needed a new term. "Business intelligence", "business analytics", ... they still keep on selling the same stuff, it's just rebranded as "big data" now.
Most "big" data mining isn't big
Since most methods - at least those that give interesting results - just don't scale, most of the data that gets "mined" isn't actually big. It's clearly much bigger than 10 years ago, but not big as in exabytes. A survey by KDnuggets had something like 1-10 GB as the average "largest data set analyzed". That is not big data by any data-management standard; it's only large in terms of what can be analyzed using complex methods. (I'm not talking about trivial algorithms such as k-means.)
Most "big data" isn't data mining
Now "Big data" is real. Google has Big data, and CERN also has big data. Most others probably don't. Data starts being big, when you need 1000 computers just to store it.
Big data technologies such as Hadoop are also real. They aren't always used sensibly (don't bother to run hadoop clusters less than 100 nodes - as this point you probably can get much better performance from well-chosen non-clustered machines), but of course people write such software.
But most of what is being done isn't data mining. It's Extract, Transform, Load (ETL), so it is replacing data warehousing. Instead of using a database with structure, indexes and accelerated queries, the data is just dumped into hadoop, and when you have figured out what to do, you re-read all your data and extract the information you really need, tranform it, and load it into your excel spreadsheet. Because after selection, extraction and transformation, usually it's not "big" anymore.
Data quality suffers with size
Many of the marketing promises of big data will not hold. Twitter yields much less insight for most companies than advertised (unless you are a teenie rockstar, that is), and the Twitter user base is heavily biased. Correcting for such a bias is hard, and needs highly experienced statisticians.
Bias in the data is one problem - if you just collect some random data from the internet or an application, it will usually not be representative, in particular not of potential users. Instead, you will be overfitting to the existing heavy users if you don't manage to cancel out these effects.
The other big problem is just noise. You have spam bots, but also other mechanisms (think of Twitter "trending topics", which cause reinforcement of "trends") that make the data much noisier than other sources. Cleaning this data is hard, and not a matter of technology but of statistical domain expertise. For example, Google Flu Trends was repeatedly found to be rather inaccurate. It worked in some of the earlier years (maybe because of overfitting?) but is no longer of good quality.
Unfortunately, a lot of big data users pay too little attention to this; which is probably one of the many reasons why most big data projects seem to fail (the others being incompetent management, inflated and unrealistic expectations, and lack of company culture and skilled people).
Hadoop != data mining
Now for the second part of your question. Hadoop doesn't do data mining. Hadoop manages data storage (via HDFS, a very primitive kind of distributed database) and it schedules computation tasks, allowing you to run the computation on the same machines that store the data. It does not do any complex analysis.
There are some tools that try to bring data mining to Hadoop. In particular, Apache Mahout can be called the official Apache attempt to do data mining on Hadoop. Except that it is mostly a machine learning tool (machine learning != data mining; data mining sometimes uses methods from machine learning). Some parts of Mahout (such as clustering) are far from advanced. The problem is that Hadoop is good for linear problems, but most data mining isn't linear. And non-linear algorithms don't just scale up to large data; you need to carefully develop linear-time approximations and live with losses in accuracy - losses that must be smaller than what you would lose by simply working on smaller data.
A good example of this trade-off problem is k-means. K-means actually is a (mostly) linear problem; so it can be somewhat run on Hadoop. A single iteration is linear, and if you had a good implementation, it would scale well to big data. However, the number of iterations until convergence also grows with data set size, and thus it isn't really linear. However, as this is a statistical method to find "means", the results actually do not improve much with data set size. So while you can run k-means on big data, it does not make a whole lot of sense - you could just take a sample of your data, run a highly-efficient single-node version of k-means, and the results will be just as good. Because the extra data just gives you some extra digits of precision of a value that you do not need to be that precise.
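As a rough, hedged illustration of that sampling argument (assuming Python with scikit-learn; the sizes here are made up and deliberately modest):

    # Sketch: k-means fitted on a modest random sample usually lands on
    # essentially the same centroids as k-means on all the data.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200_000, centers=5, random_state=0)

    rng = np.random.default_rng(0)
    sample = X[rng.choice(len(X), size=10_000, replace=False)]

    km_full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    km_samp = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)

    # Sorting per coordinate is just a crude way to line the centroid lists up
    # for eyeballing; they differ only in the last digits of precision.
    print(np.sort(km_full.cluster_centers_, axis=0))
    print(np.sort(km_samp.cluster_centers_, axis=0))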
Since this applies to quite a lot of problems, actual data mining on Hadoop doesn't seem to kick off. Everybody tries to do it, and a lot of companies sell this stuff. But it doesn't really work much better than the non-big version. But as long as customers want to buy this, companies will sell this functionality. And as long as it gets you a grant, researchers will write papers on this. Whether it works or not. That's life.
There are a few cases where these things work. Google Search is an example, and so is CERN. But also image recognition (though not using Hadoop - clusters of GPUs seem to be the way to go there) has recently benefited from an increase in data size. But in any of these cases, you have rather clean data. Google indexes everything; CERN discards any non-interesting data and only analyzes interesting measurements - there are no spammers feeding their spam into CERN... and in image analysis, you train on preselected relevant images, not on, say, webcams or random images from the internet (and if you do, you treat them as random images, not as representative data).
What is the difference between big data and Hadoop?
A: The difference between big data and the open source software program Hadoop is a distinct and fundamental one. The former is an asset, often a complex and ambiguous one, while the latter is a program that accomplishes a set of goals and objectives for dealing with that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and other software products work to interpret or parse the results of big data searches through specific proprietary algorithms and methods. Hadoop is an open-source program under the Apache license that is maintained by a global community of users. It includes various main components, including a MapReduce set of functions and a Hadoop distributed file system (HDFS).
The idea behind MapReduce is that Hadoop can first map a large data set, and then perform a reduction on that content for specific results. A reduce function can be thought of as a kind of filter for raw data. The HDFS system then acts to distribute data across a network or migrate it as necessary.
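A toy sketch of that map-then-reduce idea in plain Python (this is only an in-memory illustration of the concept, not how you would actually submit a Hadoop job):

    # Toy illustration of map-then-reduce: the map step emits (key, value)
    # pairs, the reduce step combines all values that share a key.
    from collections import defaultdict

    documents = ["big data is big", "hadoop processes big data"]

    # Map: one (word, 1) pair per word occurrence.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/sort: group the pairs by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: collapse each group to a single result (here, a sum).
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}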
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, or data that doesn't fit neatly into a traditional table or respond well to simple queries.
See the article posted at http://www.shareideaonline.com/cs/what-is-the-difference-between-big-data-and-hadoop/
This answer is really intended to add some specificity to the excellent answer from Anony-Mousse.
There's a lot of debate over exactly what Big Data is. Anony-Mousse called out a lot of the issues here around the overuse of terms like analytics, big data, and data mining, but there are a few things I want to provide more detail on.
Big Data
For practical purposes, the best definition I've heard of big data is data that is inconvenient or does not function in a traditional relational database. This could be data of 1PB that cannot be worked with or even just data that is 1GB but has 5,000 columns.
This is a loose and flexible definition. There are always going to be setups or data-management tools which can work around it, but this is where tools like Hadoop, MongoDB, and others can be used more efficiently than prior technology.
What can we do with data that is this inconvenient/large/difficult to work with? It's difficult to simply look at a spreadsheet and to find meaning here, so we often use data mining and machine learning.
Data Mining
This was called out lightly above - my goal here is to be more specific and hopefully to provide more context. Data mining generally applies to somewhat supervised analytic or statistical methods for analysis of data. These may fit into regression, classification, clustering, or collaborative filtering. There's a lot of overlap with machine learning; however, this is still generally driven by a user rather than by unsupervised or automated execution, which defines machine learning fairly well.
Machine Learning
Often, machine learning and data mining are used interchangeably. Machine learning encompasses a lot of the same areas as data mining but also includes AI, computer vision, and other unsupervised tasks. The primary difference, and this is definitely a simplification, is that user input is not only unnecessary but generally unwanted. The goal is for these algorithms or systems to self-optimize and to improve, rather than an iterative cycle of development.
Big Data is a term covering a collection of frameworks and tools that can do remarkable things with very large data sets, including data mining.
Hadoop is a framework which splits very large data sets into blocks (by default 64 MB) and stores them in HDFS (the Hadoop Distributed File System); then, when its execution logic (MapReduce) is given code to process the data stored in HDFS, it reads the data split by split (splits can be configured) and performs the extraction and computation via the Mapper and Reducer processes. In this way you can do ETL, data mining, data computation, and so on.
I would like to conclude that Big Data is a terminology for working with very large data sets, while Hadoop is a framework which does parallel processing very well with its components and services. In that way you can do data mining too.
Big Data is the term people use to say how storage is cheap and easy these days and how data is available to be analyzed.
Data Mining is the process of trying to extract useful information from data.
Usually, data mining is related to big data for two reasons:
when you have lots of data, patterns are not so evident, so someone cannot just inspect the data and say "aha" - they need tools for that;
very often, having lots of data improves the statistical significance of your analysis because your sample is bigger.
Can we say Hadoop is doing data mining in parallel? What is Hadoop? Their site says
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models
So the "parallel" part of your statement is true. The "data mining" part of it is not necessarily. You can just use hadoop to summarize tons of data and this is not necessarily data mining, for example. But for most cases, you can bet people are trying to extract useful info from big data using hadoop, so this is kind of a yes.
I would say that Big Data is a modernized framework for addressing new business needs.
As many people know, Big Data is all about the 3 V's: Volume, Variety and Velocity. It is about leveraging a variety of data (structured and unstructured), using clustering techniques to address the volume issue, and getting results in less time, i.e. velocity.
Data mining, on the other hand, follows the ETL principle, i.e. finding useful information from large datasets using modelling techniques. There are many BI tools available on the market to achieve this.

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and I was amazed by the results. I decided to give it a try, but I have some trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard from them yet, so, I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found on page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But, what item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset, a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then, the error backpropagates, the weights are updated and saved, and the temporary neural network is destroyed.
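For reference, here is a rough sketch (in PyTorch, which is purely my choice and not part of the paper) of the per-item BPTS loop I describe above: shared weights, a network unfolded on the fly to match each item's structure, backpropagation, and then the unfolded network is discarded.

    # Rough sketch (not the paper's algorithm) of a per-item BPTS loop:
    # one shared set of weights, unfolded over each item's structure.
    import torch
    import torch.nn as nn

    class Node:
        def __init__(self, features, children=()):
            self.features = torch.tensor(features, dtype=torch.float32)
            self.children = list(children)

    class RecursiveNet(nn.Module):
        def __init__(self, in_dim, hidden_dim, out_dim):
            super().__init__()
            self.enc = nn.Linear(in_dim + hidden_dim, hidden_dim)  # shared weights
            self.out = nn.Linear(hidden_dim, out_dim)

        def encode(self, node):
            # Frontier-to-root pass: children are encoded first, then combined
            # with the node's own attributes.
            child_sum = sum((self.encode(c) for c in node.children),
                            torch.zeros(self.enc.out_features))
            return torch.tanh(self.enc(torch.cat([node.features, child_sum])))

        def forward(self, root):
            return self.out(self.encode(root))

    model = RecursiveNet(in_dim=4, hidden_dim=8, out_dim=2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # dataset = [(root_node, class_label), ...]  -- structure varies per item
    def train_epoch(dataset):
        for root, label in dataset:
            optimizer.zero_grad()
            logits = model(root).unsqueeze(0)            # unfold + forward pass
            loss = loss_fn(logits, torch.tensor([label]))
            loss.backward()                              # backprop through structure
            optimizer.step()                             # shared weights updated
            # the unfolded computation graph is discarded automatically here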
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I failed to get is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

Artificial Neural Network Question

Generally speaking what do you get out of extending an artificial neural net by adding more nodes to a hidden layer or more hidden layers?
Does it allow for more precision in the mapping, or does it allow for more subtlety in the relationships it can identify, or something else?
There's a very well known result in machine learning that states that a single hidden layer is enough to approximate any smooth, bounded function (the paper was called "Multilayer feedforward networks are universal approximators" and it's now almost 20 years old). There are several things to note, however.
The single hidden layer may need to be arbitrarily wide.
This says nothing about the ease with which an approximation may be found; in general, large networks are hard to train properly and fall victim to overfitting quite frequently (the exception is so-called "convolutional neural networks", which are really only meant for vision problems).
This also says nothing about the efficiency of the representation. Some functions require exponential numbers of hidden units if done with one layer but scale much more nicely with more layers (for more discussion of this read Scaling Learning Algorithms Towards AI)
The problem with deep neural networks is that they're even harder to train. You end up with very very small gradients being backpropagated to the earlier hidden layers and the learning not really going anywhere, especially if weights are initialized to be small (if you initialize them to be of larger magnitude you frequently get stuck in bad local minima). There are some techniques for "pre-training" like the ones discussed in this Google tech talk by Geoff Hinton which attempt to get around this.
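A hedged sketch (PyTorch, chosen just for illustration) of how the two knobs discussed above - units per hidden layer and number of hidden layers - show up when you define a feed-forward network, and how they change the parameter count:

    # Sketch: a single wide hidden layer vs. a deeper, narrower stack.
    # The architectures and sizes are illustrative, not recommendations.
    import torch.nn as nn

    def count_params(model):
        return sum(p.numel() for p in model.parameters())

    wide_shallow = nn.Sequential(          # one hidden layer, many units
        nn.Linear(100, 2000), nn.Tanh(),
        nn.Linear(2000, 10),
    )
    narrow_deep = nn.Sequential(           # several hidden layers, fewer units each
        nn.Linear(100, 150), nn.Tanh(),
        nn.Linear(150, 150), nn.Tanh(),
        nn.Linear(150, 150), nn.Tanh(),
        nn.Linear(150, 10),
    )

    print(count_params(wide_shallow))      # 222,010 parameters
    print(count_params(narrow_deep))       # 61,960 parameters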
This is a very interesting question, but it's not so easy to answer. It depends on the problem you are trying to solve and on which type of neural network you are using. There are several neural network types.
In general it's not so clear that more nodes equals more precision. Research shows that for the most part you only need one hidden layer. The number of nodes should be the minimal number required to solve the problem - if you don't have enough of them, you will not reach a solution.
On the other hand, once you have reached the number of nodes that is sufficient to solve the problem, you can add more and more of them and you will not see any further progress in the result estimation.
That's why there are so many types of neural networks: they try to solve different types of problems. So you have NNs for static problems, for time-related problems, and so on. The number of nodes is not as important as their design.
When you add a hidden layer, you are creating combined features of the input. So, is the problem better tackled by more features of the existing input, or by higher-order features that come from combining existing features? This is the trade-off for a standard feed-forward network.
You have a theoretical reassurance that any function can be represented by a neural network with two hidden layers and non-linear activation.
Also, consider using additional resources for boosting, instead of adding more nodes, if you're not certain of the appropriate topology.
Very rough rules of thumb
generally more elements per layer for bigger input vectors.
more layers may let you model more non-linear systems.
If the kind of network you are using has delays in propagation, more layers may allow modelling of time series. Take care to have time jitter in the delays or it won't work very well. If this is just gobbledegook to you, ignore it.
More layers let you insert recurrent features. This can be very useful for discrimination tasks. Your ANN implementation may not permit this.
HTH
The number of units per hidden layer accounts for the ANN's potential to describe an arbitrarily complex function. Some (complicated) functions may require many hidden nodes, or possibly more than one hidden layer.
When a function can be roughly approximated by a certain number of hidden units, any extra nodes will provide more accuracy... but this is only true if the training samples used are enough to justify the addition - otherwise what will happen is "overconvergence". Overconvergence means that your ANN has lost its generalization ability because it has overemphasized the particular samples.
In general it is best to use the fewest hidden units possible, if the resulting network can give good results. The additional training patterns required to justify more hidden nodes cannot be found easily in most cases, and accuracy is not the NNs' strong point.

Resources