To ensure that policy iteration (performed on an MDP whose parameters have been replaced
by a parameter valuation) converges, and converges to the correct result in the presence
of zero-reward end components, we initialise the policy iteration for Rmin[F] with a
proper scheduler, i.e., one that reaches the goal with probability one (see [BertsekasTsitsiklis91]).
Together with the previous commit, this fixes #4 and #15.
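
For illustration, here is a minimal sketch of one standard way to construct such a
proper scheduler. It is not the actual code of this change: the MDP representation and
the names are hypothetical, and prob1e is assumed to be the precomputed set of states
(including the goal states) from which some scheduler reaches the goal with
probability one.

    import java.util.Arrays;
    import java.util.BitSet;
    import java.util.List;

    final class ProperSchedulerInit {
        /**
         * Constructs a proper scheduler by an attractor-style layering:
         * every state in prob1e gets a choice that stays inside prob1e and
         * moves to an already-settled layer with positive probability.
         * Under such a scheduler the goal is reached with probability one.
         *
         * succ.get(s).get(c) lists the successor states of choice c in state s
         * (only the support of the transition function is needed).
         */
        static int[] properScheduler(List<List<int[]>> succ, BitSet goal, BitSet prob1e) {
            int n = succ.size();
            int[] choice = new int[n];
            Arrays.fill(choice, -1);                  // -1: goal state or no proper choice
            BitSet settled = (BitSet) goal.clone();   // layer 0 is the goal itself
            boolean changed = true;
            while (changed) {
                changed = false;
                for (int s = prob1e.nextSetBit(0); s >= 0; s = prob1e.nextSetBit(s + 1)) {
                    if (settled.get(s)) {
                        continue;
                    }
                    List<int[]> choices = succ.get(s);
                    for (int c = 0; c < choices.size(); c++) {
                        boolean allInProb1e = true;
                        boolean someSettled = false;
                        for (int t : choices.get(c)) {
                            allInProb1e &= prob1e.get(t);
                            someSettled |= settled.get(t);
                        }
                        if (allInProb1e && someSettled) {
                            choice[s] = c;            // progress towards the goal is guaranteed
                            settled.set(s);
                            changed = true;
                            break;
                        }
                    }
                }
            }
            return choice;
        }
    }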
In the state eliminator, i.e., when solving for a DTMC, a state that has probability
zero of reaching the target states (i.e., that cannot reach them) should get
infinite reward.
Previously, the check for this looked at the set returned by collectStatesBackward(),
which always returns the whole state space (it is used to determine the backward
elimination order). Now, we use the variant of collectStatesBackward() that only
returns the states that can actually reach the target set.
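
For illustration, a minimal sketch of that check and of the assignment of infinite
reward, using hypothetical data structures rather than the tool's actual
collectStatesBackward() API:

    import java.util.ArrayDeque;
    import java.util.BitSet;
    import java.util.Deque;
    import java.util.List;

    final class InfiniteRewardStates {
        /** Backward reachability: all states with a path to some target state. */
        static BitSet canReachTarget(List<List<Integer>> pred, BitSet target) {
            BitSet reach = (BitSet) target.clone();
            Deque<Integer> todo = new ArrayDeque<>();
            for (int s = target.nextSetBit(0); s >= 0; s = target.nextSetBit(s + 1)) {
                todo.add(s);
            }
            while (!todo.isEmpty()) {
                int s = todo.poll();
                for (int p : pred.get(s)) {       // pred.get(s): predecessors of state s
                    if (!reach.get(p)) {
                        reach.set(p);
                        todo.add(p);
                    }
                }
            }
            return reach;
        }

        /** States that cannot reach the target get expected reward infinity. */
        static void assignInfinity(double[] rewards, BitSet canReach) {
            for (int s = 0; s < rewards.length; s++) {
                if (!canReach.get(s)) {
                    rewards[s] = Double.POSITIVE_INFINITY;
                }
            }
        }
    }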
Additionally, we now compute the set of infinity states for Rmin/Rmax using
prob0a / prob0e, respectively, and ensure that these states get value infinity.
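
For illustration, a minimal sketch of the two precomputations, again over hypothetical
data structures: prob0a(T) collects the states from which no scheduler reaches T with
positive probability (so Rmin is infinite there), and prob0e(T) collects the states
from which some scheduler avoids T completely (so Rmax is infinite there, under the
usual convention that paths never reaching the target accumulate infinite reward).

    import java.util.BitSet;
    import java.util.List;

    final class Prob0 {
        /** prob0a: complement of the states that can reach the target under some choice. */
        static BitSet prob0a(List<List<int[]>> succ, BitSet target) {
            int n = succ.size();
            BitSet reach = (BitSet) target.clone();
            boolean changed = true;
            while (changed) {
                changed = false;
                for (int s = 0; s < n; s++) {
                    if (reach.get(s)) {
                        continue;
                    }
                    search:
                    for (int[] successors : succ.get(s)) {
                        for (int t : successors) {
                            if (reach.get(t)) {
                                reach.set(s);
                                changed = true;
                                break search;
                            }
                        }
                    }
                }
            }
            BitSet result = new BitSet(n);
            result.set(0, n);
            result.andNot(reach);
            return result;
        }

        /** prob0e: greatest set of non-target states closed under some choice. */
        static BitSet prob0e(List<List<int[]>> succ, BitSet target) {
            int n = succ.size();
            BitSet avoid = new BitSet(n);
            avoid.set(0, n);
            avoid.andNot(target);
            boolean changed = true;
            while (changed) {
                changed = false;
                for (int s = avoid.nextSetBit(0); s >= 0; s = avoid.nextSetBit(s + 1)) {
                    boolean hasClosedChoice = false;
                    for (int[] successors : succ.get(s)) {
                        boolean allAvoid = true;
                        for (int t : successors) {
                            allAvoid &= avoid.get(t);
                        }
                        if (allAvoid) {
                            hasClosedChoice = true;
                            break;
                        }
                    }
                    if (!hasClosedChoice) {
                        avoid.clear(s);   // every choice leaves the candidate set with positive probability
                        changed = true;
                    }
                }
            }
            return avoid;
        }
    }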