How to calculate maximal parallelism in a DAG?

Given a DAG (directed acyclic graph), how does one calculate the maximal parallelism?
Instantaneous parallelism is the maximum number of processors that can be kept busy at each point in the execution of the algorithm; the maximal parallelism is the highest instantaneous parallelism.
Put another way, given a DAG representing a dependency graph of tasks, what is the minimum number of processors/threads such that no task is ever blocked?
The closest approach I found here is:
1. Apply a topological sort to the DAG.
2. Traverse the nodes in topological order and calculate each node's minimum level:
   no parents: 0
   otherwise: minimum parent level + 1
3. Return the max level width (the maximum number of nodes assigned the same level).
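The steps above can be sketched roughly as follows. This is only a minimal illustration: the function name `max_level_width` and the `parents` dictionary (node -> set of parent nodes) are my own assumptions, and it relies on `graphlib` from the standard library (Python 3.9+).

```python
from collections import Counter
from graphlib import TopologicalSorter


def max_level_width(parents):
    """parents maps each node to the set of its parents (hypothetical input format)."""
    level = {}
    # static_order() yields every node only after all of its parents
    for node in TopologicalSorter(parents).static_order():
        ps = parents.get(node, ())
        # no parents: level 0; otherwise: minimum parent level + 1
        level[node] = 0 if not ps else min(level[p] for p in ps) + 1
    # width of a level = number of nodes assigned to it
    return max(Counter(level.values()).values())


# Diamond-shaped DAG: a -> b, a -> c, b -> d, c -> d
diamond = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(max_level_width(diamond))
```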
This algorithm worked for me on several samples, but it does not work on a tree. E.g.:
      o 0
     / \
  o 1   o 1
       / \
    o 2   o 2
         / \
      o 3   o 3
According to the algorithm above, the max width is 2, but clearly the max parallelism in a tree is the number of leaves, which is 4 in the example above.
A similar approach is partially described here (see slide titled Computing critical path etc., which describes how to calculate earliest start times of nodes and that "maximal...parallelism can easily be computed from this").
Edit 1:
#AliSoltani's solution, which uses BFS to find the length of the critical path and takes that length as the max parallelism degree, is incorrect: it only applies to a subset of examples, mainly trees in which the number of leaves is equal to the length of the longest path. Here's an illustration of a case where this wouldn't work:
Edit 2:
#AliSultani's 2nd solution, which uses BFS to find the level with the maximum number of nodes and takes that count as the max parallelism, is also incorrect, as it doesn't take into account cases where nodes from different levels may run concurrently. See this counterexample:

This problem is reducible to the Maximum Directed Cut problem.
Let's build an auxiliary DAG from the original one.
For every vertex u[i] of the original graph, add vertices v[i] and w[i] to the new graph and connect them with an edge (v[i], w[i]) of cost 1.
For every edge (u[i], u[j]) of the original graph, add an edge (w[i], v[j]) of cost 0 to the new graph.
Now the problem is equivalent to finding the maximum directed cut in the auxiliary graph.
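As a rough illustration, the construction of the auxiliary graph could look like the sketch below. It only builds the graph and does not solve the cut problem itself; the function name `build_auxiliary_graph` and the tuple labels `("v", i)` / `("w", i)` are assumptions made for the example.

```python
def build_auxiliary_graph(vertices, edges):
    """Return the auxiliary DAG as a list of (tail, head, cost) triples."""
    aux = []
    # Each original vertex u[i] becomes an edge (v[i], w[i]) with cost 1
    for u in vertices:
        aux.append((("v", u), ("w", u), 1))
    # Each original edge (u[i], u[j]) becomes an edge (w[i], v[j]) with cost 0
    for u_i, u_j in edges:
        aux.append((("w", u_i), ("v", u_j), 0))
    return aux


# Original DAG with a single edge 1 -> 2
for edge in build_auxiliary_graph([1, 2], [(1, 2)]):
    print(edge)
```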

You should find the critical path length in the DAG. A critical path is a directed
path that has the maximum execution requirement among all paths in the DAG. If the critical path has n nodes, the maximal parallelism is n.
The critical path is the longest path from root to leaf (in the DAG), and to find it you can use the BFS algorithm (Breadth-First Search).
Example 1
BFS on this tree runs in O(|V|+|E|) time. This is an optimal solution for this problem.
Edit: Find maximum degree of concurrency by BFS
You can determine the maximum degree of concurrency by running the breadth-first search algorithm too:
The algorithm starts from the root node and proceeds towards the
leaves level by level.
Before inspecting nodes located on the next level, it explores all of
the nodes belonging to the current level.
Count the number of nodes on each level and update a variable holding
the maximum number of nodes per level.
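A minimal sketch of these level-counting steps, assuming a single root and a `children` dictionary (node -> list of child nodes); both the input format and the name `max_bfs_level_width` are illustrative choices:

```python
from collections import deque


def max_bfs_level_width(children, root):
    """Return the largest number of nodes found on one BFS level."""
    level = {root: 0}   # BFS level of every discovered node
    width = {0: 1}      # number of nodes counted on each level
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in level:  # count each node once, on first discovery
                level[child] = level[node] + 1
                width[level[child]] = width.get(level[child], 0) + 1
                queue.append(child)
    return max(width.values())


# A root with four children
print(max_bfs_level_width({"r": ["a", "b", "c", "d"]}, "r"))
```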
Example 2 (Step by step)
So in this example the maximum degree of concurrency is 4.
Final Edit
With the latest explanation you gave, a maximal independent set of tasks is what you are looking for. To solve this problem, see this article.
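One way to make this concrete, assuming "independent" means that there is no directed path between two tasks in either direction: the largest such set is a maximum antichain of the DAG. By Dilworth's theorem its size equals the minimum number of chains needed to cover all nodes, which is n minus the size of a maximum bipartite matching on the transitive closure. The sketch below follows that route; it is not taken from the linked article, and the function name and the `children` input format are assumptions.

```python
def max_antichain_size(children):
    """children maps each task to the tasks that directly depend on it."""
    nodes = set(children)
    for cs in children.values():
        nodes.update(cs)

    # reach[u] = all tasks reachable from u (transitive closure), via DFS
    reach = {}
    for u in nodes:
        seen, stack = set(), list(children.get(u, ()))
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(children.get(x, ()))
        reach[u] = seen

    match = {}  # right-side task -> left-side task it is matched with

    def try_augment(u, visited):
        # Augmenting-path search (Kuhn's algorithm) over closure edges u -> v
        for v in reach[u]:
            if v not in visited:
                visited.add(v)
                if v not in match or try_augment(match[v], visited):
                    match[v] = u
                    return True
        return False

    matching = sum(try_augment(u, set()) for u in nodes)
    # Minimum chain cover = n - matching = size of the largest antichain
    return len(nodes) - matching


# Tree in which only one child per level branches further (4 leaves)
tree = {"a": ["b", "c"], "c": ["d", "e"], "e": ["f", "g"]}
print(max_antichain_size(tree))
```

The recursive matching is fine for small graphs; for large DAGs an iterative matching algorithm such as Hopcroft-Karp would be a more robust choice.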

I have not tested the algorithm, but my proposal would be the following:
1. Start from the origin node.
2. Select each connected edge. The current concurrency is the number of selected edges. Remember that.
3. Sort the nodes reached by the selected edges by their number of outgoing edges. Ignore all nodes that have incoming edges which have not been selected yet.
4. Start going down the edge to the node with the most outgoing edges.
5. If not at the end node: repeat from step 2.
6. Take the maximum of the current concurrency over all iterations.
Here is an implementation in Python using networkx. The document you have linked does something different. It calculates the number of concurrent tasks when the graph is executed with the timings attached to the nodes (1 for each node in that case). This is an easy task and probably the one the author of the document refers to. My algorithm, however, calculates the theoretical maximum and does not take the running time of each task into account.


