Initial guess for Newton Raphson - algorithm

How can I determine the initial guess of the equation Ax+Bsin(x)=C in terms of A,B and C ?
I am trying to solve it using Newton Raphson. A,B and C will be given during runtime.
Is there any other method more efficient than Newton Raphson for this purpose ?

The optimal initial guess is the root itself, so finding an "optimal" guess isn't really valid.
Any guess will give you a valid solution eventually as long as f'(x0) != 0 for any step, which only occurs at the zeroes of cos(x), which are k*pi + pi/2 for any integer k.
I would try x0 = C * pi, just to see if it works.
Your biggest problem, however, would be the periodic nature of your function. Newton's method will be slow (if it even works) for your function as sin(x) will shift x0 back and forth over and over.
Precaution:
In Newton's method, do you notice how f'(xn) is in the denominator? f'(x) approaches 0 infinitely many times. If your f'(x) = 0.0001 (or anywhere close to zero, which has a chance of happening), your xn+1 gets thrown really far away from xn.
Worse yet, this can happen over and over due to f'(x) being a periodic function, which means that Newton's method might never even converge for an arbitrary x0.

The simplest "good" approximation is to just assume that sin(x) is approximately zero, and so set:
x0 = C/A

Well, if A,B and C are real and different from 0, then (B+C)/A is an upper quote to the highest root and (C-B)/A is a lower quote to the lowest root, as -1 <= sin(x) <= 1. You could start with those.

Newton method can work with any guess. the problem is simple,
if there is an equation and I guessed x0=100
and the best close solution for it is x0=2
and I know the answer is 2.34*
by using any guess in the world you will eventually get to 2.34*
the method says to choose a guess because without a valid guess it will take many solutions which aren't comfortable no one wants to repeat the method 20 times
and guessing a solution is not hard
you just find a critical point-
for example, 3 is too big and 2 is too small
so the answer is between 2 and 3
but if instead guessing 2 you guess 50
you will still get to the right solution.
like I said it will just take you much longer
I tested the method by myself
I guessed 1000 to a random equation
and I knew the best guess was 4
the answer was between 4 and 5
I chose 1000 it took me much time
but after a few hours, I got down from 1000 to 4.something
if you somehow can't find a critical point you can actually put a random number equals to x0 and then eventually you will get to the right solution
no matter what number you guessed.

Related

Expectation Maximization coin toss examples

I've been self-studying the Expectation Maximization lately, and grabbed myself some simple examples in the process:
http://cs.dartmouth.edu/~cs104/CS104_11.04.22.pdf
There are 3 coins 0, 1 and 2 with P0, P1 and P2 probability landing on Head when tossed. Toss coin 0, if the result is Head, toss coin 1 three times else toss coin 2 three times. The observed data produced by coin 1 and 2 is like this: HHH, TTT, HHH, TTT, HHH. The hidden data is coin 0's result. Estimate P0, P1 and P2.
http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
There are two coins A and B with PA and PB being the probability landing on Head when tossed. Each round, select one coin at random and toss it 10 times then record the results. The observed data is the toss results provided by these two coins. However, we don't know which coin was selected for a particular round. Estimate PA and PB.
While I can get the calculations, I can't relate the ways they are solved to the original EM theory. Specifically, during the M-Step of both examples, I don't see how they're maximizing anything. It just seems they are recalculating the parameters and somehow, the new parameters are better than the old ones. Moreover, the two E-Steps don't even look similar to each other, not to mention the original theory's E-Step.
So how exactly do these example work?
The second PDF won't download for me, but I also visited the wikipedia page http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm which has more information. http://melodi.ee.washington.edu/people/bilmes/mypapers/em.pdf (which claims to be a gentle introduction) might be worth a look too.
The whole point of the EM algorithm is to find parameters which maximize the likelihood of the observed data. This is the only bullet point on page 8 of the first PDF, the equation for capital Theta subscript ML.
The EM algorithm comes in handy where there is hidden data which would make the problem easy if you knew it. In the three coins example this is the result of tossing coin 0. If you knew the outcome of that you could (of course) produce an estimate for the probability of coin 0 turning up heads. You would also know whether coin 1 or coin 2 was tossed three times in the next stage, which would allow you to make estimates for the probabilities of coin 1 and coin 2 turning up heads. These estimates would be justified by saying that they maximized the likelihood of the observed data, which would include not only the results that you are given, but also the hidden data that you are not - the results from coin 0. For a coin that gets A heads and B tails you find that the maximum likelihood for the probability of A heads is A/(A+B) - it might be worth you working this out in detail, because it is the building block for the M step.
In the EM algorithm you say that although you don't know the hidden data, you come in with probability estimates which allow you to write down a probability distribution for it. For each possible value of the hidden data you could find the parameter values which would optimize the log likelihood of the data including the hidden data, and this almost always turns out to mean calculating some sort of weighted average (if it doesn't the EM step may be too difficult to be practical).
What the EM algorithm asks you to do is to find the parameters maximizing the weighted sum of log likelihoods given by all the possible hidden data values, where the weights are given by the probability of the associated hidden data given the observations using the parameters at the start of the EM step. This is what almost everybody, including the Wikipedia algorithm, calls the Q-function. The proof behind the EM algorithm, given in the Wikipedia article, says that if you change the parameters so as to increase the Q-function (which is only a means to an end), you will also have changed them so as to increase the likelihood of the observed data (which you do care about). What you tend to find in practice is that you can maximize the Q-function using a variation of what you would do if you know the hidden data, but using the probabilities of the hidden data, given the estimates at the start of the EM-step, to weight the observations in some way.
In your example it means totting up the number of heads and tails produced by each coin. In the PDF they work out P(Y=H|X=) = 0.6967. This means that you use weight 0.6967 for the case Y=H, which means that you increment the counts for Y=H by 0.6967 and increment the counts for X=H in coin 1 by 3*0.6967, and you increment the counts for Y=T by 0.3033 and increment the counts for X=H in coin 2 by 3*0.3033. If you have a detailed justification for why A/(A+B) is a maximum likelihood of coin probabilities in the standard case, you should be ready to turn it into a justification for why this weighted updating scheme maximizes the Q-function.
Finally, the log likelihood of the observed data (the thing you are maximizing) gives you a very useful check. It should increase with every EM step, at least until you get so close to convergence that rounding error comes in, in which case you may have a very small decrease, signalling convergence. If it decreases dramatically, you have a bug in your program or your maths.
As luck would have it, I have been struggling with this material recently as well. Here is how I have come to think of it:
Consider a related, but distinct algorithm called the classify-maximize algorithm, which we might use as a solution technique for a mixture model problem. A mixture model problem is one where we have a sequence of data that may be produced by any of N different processes, of which we know the general form (e.g., Gaussian) but we do not know the parameters of the processes (e.g., the means and/or variances) and may not even know the relative likelihood of the processes. (Typically we do at least know the number of the processes. Without that, we are into so-called "non-parametric" territory.) In a sense, the process which generates each data is the "missing" or "hidden" data of the problem.
Now, what this related classify-maximize algorithm does is start with some arbitrary guesses at the process parameters. Each data point is evaluated according to each one of those parameter processes, and a set of probabilities is generated-- the probability that the data point was generated by the first process, the second process, etc, up to the final Nth process. Then each data point is classified according to the most likely process.
At this point, we have our data separated into N different classes. So, for each class of data, we can, with some relatively simple calculus, optimize the parameters of that cluster with a maximum likelihood technique. (If we tried to do this on the whole data set prior to classifying, it is usually analytically intractable.)
Then we update our parameter guesses, re-classify, update our parameters, re-classify, etc, until convergence.
What the expectation-maximization algorithm does is similar, but more general: Instead of a hard classification of data points into class 1, class 2, ... through class N, we are now using a soft classification, where each data point belongs to each process with some probability. (Obviously, the probabilities for each point need to sum to one, so there is some normalization going on.) I think we might also think of this as each process/guess having a certain amount of "explanatory power" for each of the data points.
So now, instead of optimizing the guesses with respect to points that absolutely belong to each class (ignoring the points that absolutely do not), we re-optimize the guesses in the context of those soft classifications, or those explanatory powers. And it so happens that, if you write the expressions in the correct way, what you're maximizing is a function that is an expectation in its form.
With that said, there are some caveats:
1) This sounds easy. It is not, at least to me. The literature is littered with a hodge-podge of special tricks and techniques-- using likelihood expressions instead of probability expressions, transforming to log-likelihoods, using indicator variables, putting them in basis vector form and putting them in the exponents, etc.
These are probably more helpful once you have the general idea, but they can also obfuscate the core ideas.
2) Whatever constraints you have on the problem can be tricky to incorporate into the framework. In particular, if you know the probabilities of each of the processes, you're probably in good shape. If not, you're also estimating those, and the sum of the probabilities of the processes must be one; they must live on a probability simplex. It is not always obvious how to keep those constraints intact.
3) This is a sufficiently general technique that I don't know how I would go about writing code that is general. The applications go far beyond simple clustering and extend to many situations where you are actually missing data, or where the assumption of missing data may help you. There is a fiendish ingenuity at work here, for many applications.
4) This technique is proven to converge, but the convergence is not necessarily to the global maximum; be wary.
I found the following link helpful in coming up with the interpretation above: Statistical learning slides
And the following write-up goes into great detail of some painful mathematical details: Michael Collins' write-up
I wrote the below code in Python which explains the example given in your second example paper by Do and Batzoglou.
I recommend that you read this link first for a clear explanation of how and why the 'weightA' and 'weightB' in the code below are obtained.
Disclaimer : The code does work but I am certain that it is not coded optimally. I am not a Python coder normally and have started using it two weeks ago.
import numpy as np
import math
#### E-M Coin Toss Example as given in the EM tutorial paper by Do and Batzoglou* ####
def get_mn_log_likelihood(obs,probs):
""" Return the (log)likelihood of obs, given the probs"""
# Multinomial Distribution Log PMF
# ln (pdf) = multinomial coeff * product of probabilities
# ln[f(x|n, p)] = [ln(n!) - (ln(x1!)+ln(x2!)+...+ln(xk!))] + [x1*ln(p1)+x2*ln(p2)+...+xk*ln(pk)]
multinomial_coeff_denom= 0
prod_probs = 0
for x in range(0,len(obs)): # loop through state counts in each observation
multinomial_coeff_denom = multinomial_coeff_denom + math.log(math.factorial(obs[x]))
prod_probs = prod_probs + obs[x]*math.log(probs[x])
multinomial_coeff = math.log(math.factorial(sum(obs))) - multinomial_coeff_denom
likelihood = multinomial_coeff + prod_probs
return likelihood
# 1st: Coin B, {HTTTHHTHTH}, 5H,5T
# 2nd: Coin A, {HHHHTHHHHH}, 9H,1T
# 3rd: Coin A, {HTHHHHHTHH}, 8H,2T
# 4th: Coin B, {HTHTTTHHTT}, 4H,6T
# 5th: Coin A, {THHHTHHHTH}, 7H,3T
# so, from MLE: pA(heads) = 0.80 and pB(heads)=0.45
# represent the experiments
head_counts = np.array([5,9,8,4,7])
tail_counts = 10-head_counts
experiments = zip(head_counts,tail_counts)
# initialise the pA(heads) and pB(heads)
pA_heads = np.zeros(100); pA_heads[0] = 0.60
pB_heads = np.zeros(100); pB_heads[0] = 0.50
# E-M begins!
delta = 0.001
j = 0 # iteration counter
improvement = float('inf')
while (improvement>delta):
expectation_A = np.zeros((5,2), dtype=float)
expectation_B = np.zeros((5,2), dtype=float)
for i in range(0,len(experiments)):
e = experiments[i] # i'th experiment
ll_A = get_mn_log_likelihood(e,np.array([pA_heads[j],1-pA_heads[j]])) # loglikelihood of e given coin A
ll_B = get_mn_log_likelihood(e,np.array([pB_heads[j],1-pB_heads[j]])) # loglikelihood of e given coin B
weightA = math.exp(ll_A) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of A proportional to likelihood of A
weightB = math.exp(ll_B) / ( math.exp(ll_A) + math.exp(ll_B) ) # corresponding weight of B proportional to likelihood of B
expectation_A[i] = np.dot(weightA, e)
expectation_B[i] = np.dot(weightB, e)
pA_heads[j+1] = sum(expectation_A)[0] / sum(sum(expectation_A));
pB_heads[j+1] = sum(expectation_B)[0] / sum(sum(expectation_B));
improvement = max( abs(np.array([pA_heads[j+1],pB_heads[j+1]]) - np.array([pA_heads[j],pB_heads[j]]) ))
j = j+1
The key to understanding this is knowing what the auxiliary variables are that make estimation trivial. I will explain the first example quickly, the second follows a similar pattern.
Augment each sequence of heads/tails with two binary variables, which indicate whether coin 1 was used or coin 2. Now our data looks like the following:
c_11 c_12
c_21 c_22
c_31 c_32
...
For each i, either c_i1=1 or c_i2=1, with the other being 0. If we knew the values these variables took in our sample, estimation of parameters would be trivial: p1 would be the proportion of heads in samples where c_i1=1, likewise for c_i2, and \lambda would be the mean of the c_i1s.
However, we don't know the values of these binary variables. So, what we basically do is guess them (in reality, take their expectation), and then update the parameters in our model assuming our guesses were correct. So the E step is to take the expectation of the c_i1s and c_i2s. The M step is to take maximum likelihood estimates of p_1, p_2 and \lambda given these cs.
Does that make a bit more sense? I can write out the updates for the E and M step if you prefer. EM then just guarantees that by following this procedure, likelihood will never decrease as iterations increase.

what is the right algorithm for ... eh, I don't know what it's called

Suppose I have a long and irregular digital signal made up of smaller but irregular signals occurring at various times (and overlapping each other). We will call these shorter signals the "pieces" that make up the larger signal. By "irregular" I mean that it is not a specific frequency or pattern.
Given the long signal I need to find the optimal arrangement of pieces that produce (as closely as possible) the larger signal. I know what the pieces look like but I don't know how many of them exist in the full signal (or how many times any one piece exists in the full signal). What software algorithm would you use to do this optimization? What do I search for on the web to get help on solving this problem?
Here's a stab at it.
This is actually the easier of the deconvolution problems. It is easier in that you may be able to have a unique answer. The harder problem is that you also don't know what the pieces look like. That case is called blind deconvolution. It is a harder problem and is usually iterative and statistical (ML or MAP), and the solution may not be right.
Luckily, your case is easier, but still not so easy because you have multiple pieces :p
I think that it may be commonly called mixture deconvolution?
So let f[t] for t=1,...N be your long signal. Let h1[t]...hn[t] for t=0,1,2,...M be your short signals. Obviously here, N>>M.
So your hypothesis is that:
(1) f[t] = h1[t+a1[1]]+h1[t+a1[2]] + ...
+h2[t+a2[1]]+h2[t+a2[2]] + ...
+....
+hn[t+an[1]]+h2[t+an[2]] + ...
Observe that each row of that equation is actually hj * uj where uj is the sum of shifted Kronecker delta. The * here is convolution.
So now what?
Let Hj be the (maybe transposed depending on how you look at it) Toeplitz matrix generated by hj, then the equation above becomes:
(2) F = H1 U1 + H2 U2 + ... Hn Un
subject to the constraint that uj[k] must be either 0 or 1.
where F is the vector [f[0],...F[N]] and Uj is the vector [uj[0],...uj[N]].
So you can rewrite this as:
(3) F = H * U
where H = [H1 ... Hn] (horizontal concatenation) and U = [U1; ... ;Un] (vertical concatenation).
H is an Nx(nN) matrix. U is an nN vector.
Ok, so the solution space is finite. It is 2^(nN) in size. So you can try all possible combinations to see which one gives you the lowest ||F - H*U||, but that will take too long.
What you can do is solve equation (3) using pseudo-inverse, multi-linear regression (which uses least square, which comes out to pseudo-inverse), or something like this
Is it possible to solve a non-square under/over constrained matrix using Accelerate/LAPACK?
Then move that solution around within the null space of H to get a solution subject to the constraint that uj[k] must be either 0 or 1.
Alternatively, you can use something like Nelder-Mead or Levenberg-Marquardt to find the minimum of:
||F - H U|| + lambda g(U)
where g is a regularization function defined as:
g(U) = ||U - U*||
where U*[j] = 0 if |U[j]|<|U[j]-1|, else 1
Ok, so I have no idea if this will converge. If not, you have to come up with your own regularizer. It's kinda dumb to use a generalized nonlinear optimizer when you have a set of linear equations.
In reality, you're going to have noise and what not, so it actually may not be a bad idea to use something like MAP and apply the small pieces as prior.

Minimize a function

Suppose you are given a function of a single variable and arguments a and b and are asked to find the minimum value that the function takes on the interval [a, b]. (You can assume that the argument is a double, though in my application I may need to use an arbitrary-precision library.)
In general this is a hard problem because functions can be weird. A simple version of this problem would be to minimize the function assuming that it is continuous (no gaps or jumps) and single-peaked (there is a unique minimum; to the left of the minimum the function is decreasing and to the right it is increasing). Is there a good way to solve this easier (but perhaps not easy!) problem?
Assume that the function may be difficult to calculate but not particularly expensive to store an answer that you've computed. (Obviously, it's better if you don't have to make giant arrays of key/value pairs.)
Bonus points for good ideas on improving the algorithm in the fortunate case in which it's nice (e.g.: derivative exists, function is smooth/analytic, derivative can be computed in closed form, derivative can be computed at no cost when the function is evaluated).
The version you describe, with a single minimum, is easy to solve.
The idea is this. Suppose that I have 3 points with a < b < c and f(b) < f(a) and f(b) < f(c). Then the true minimum is between a and c. Furthermore if I pick another point d somewhere in the interval, then I can throw away one of a or d and still have an interval with the true minimum in the middle. My approximations will improve exponentially quickly as I do more iterations.
We don't quite start with this. We start with 2 points, a and b, and know that the answer is somewhere in the middle. Take the mid-point. If f there is below the end points, we're into the case I discussed above. Otherwise it must be below one of the end points, and above the other. We can throw away the higher end point and repeat.
If the function is nice, i.e., single-peaked and strictly monotonic (i.e., strictly decreasing to the left of the minimum and strictly increasing to the right), then you can find the minimum with binary search:
Set x = (b-a)/2
test whether x is to the right of the minimum or to the left
if x is left of the minimum:b = x
if x is right of the minimum:a = x
repeat from start until you get bored
the minimum is at x
To test whether x is left/right of the minimum, invent a small value epsilon and check whether f(x - epsilon) < f(x + epsilon). If it is, the minimum is to the left, otherwise it's to the right. By "until you get bored", I mean: invent another small value delta and stop if fabs(f(x - epsilon) - f(x + epsilon)) < delta.
Note that in the general case where you don't know anything about the behavior of a function f, it's not possible to decide a non-trivial property of f. Well, unless you're willing to try all possible inputs. See Rice's Theorem for details.
The Boost project has an implementation of Brent's algorithm that may be useful.
It seems to assume that the function is continuous, and has no maxima (only a minimum) in the input interval.
Not a direct answer but a pointer to more reading:
scipy.optimize: http://docs.scipy.org/doc/scipy/reference/optimize.html
section e04 of naglib: http://www.nag.co.uk/numeric/cl/nagdoc_cl09/html/genint/libconts.html
For the special case where the function is differentiable twice (and the two derivatives can be calculated easily), one can use Newton's method for optimization, i.e. essentially finding the roots of the first derivative (which is a necessary condition for the minimum).
Concerning the general case, note that the extreme case of 'weird' is a function which is continuous nowhere and for which it is very hard if not impossible to find the minimum (in finite time). So I guess you should try to make at least some assumptions about the function you are trying to minimize.
What you want is to optimize an Unimodal function. The correct algorithm is similar to btilly's but you need extra points.
Take 4 points a < b < c < d.
We want to minimize f in [a,d].
If f(b) < f(c) we know the minimum is in [a, c]
If f(b) > f(c) " " " " is in [b, d]
This can give an algorithm by itself, but there is a nice trick involving the golden ratio that allows you to reuse the intermediate values (in a way you only need to compute f once per iteration instead of twice)
If you have an expression for the function, there are global optimization algorithms based on interval analysis.

Simple algebra question, for a program I'm writing

How would I solve:
-x^3 - x - 4 = 0
You can't use quadratic, because it is to the 3rd power, right?
I know I it should come out to ~1.3788 but I'm not sure how I would derive that.
I started with:
f(x) = x + (4/(x^2 + 1)).
Solving for 0, moving the x over to the other side, multiplying by (x^2 + 1) on both sides, I end up with:
-x(x^2 + 1) = 4,
or
-x^3 - x - 4 = 0.
Finding roots of equations by using Newton's method or Fixed point iteration
Algebraically, you want to use Cardano's method:
http://www.math.ucdavis.edu/~kkreith/tutorials/sample.lesson/cardano.html
Using this method, it's about as easy to solve as the quadratic.
Actually, this is possibly clearer:
http://en.wikipedia.org/wiki/Cubic_function#Summary
Find a root by using newton iteration (see link below). Then divide the polynomial by (x-TheRootYouFound). The result will be a quadratic formula that you can plug into your quadratic root finder of your choice.
About Newton Iteration:
http://en.wikipedia.org/wiki/Newton%27s_method
About Polynomial Division
http://en.wikipedia.org/wiki/Polynomial_long_division
This article may be interesting for you as well. It covers more robust ways to solve your problem at the expense of some additional complexity.
http://en.wikipedia.org/wiki/Root-finding_algorithm
It's a cubic function. You're correct, the quadratic formula does not apply.
You gave one root, but in general there are three.
How did you arrive at that single value? Trial and error? That's legitimate. You don't need to "derive" anything.
x^3 + a2*x^2 + a1*x + a0 = 0 can be written as (x-x1)*(x-x2)*(x-x3) = 0, where x1, x2, and x3 are the three roots. If you know that the root you cited is correct, you can divide it out and leave (x-x2)*(x-x3) = 0, which is a quadratic that you can apply the usual techniques to.
This may not help from a programming viewpoint, but from a math perspective...
Note than in this particular cubic function you need to consider imaginary numbers, because when x = i then you have a denominator that is zero (in your original equation). Also, generally speaking you shouldn't multiply or divide by variables (adding and subtracting is fine though) when you move them to the other side of the equation because you'll generally forget of about the condition where the term you multiplied or divided by is zero. Those answers need to be excluded from the solution set.
x = i is an example of an excluded solution in the above cubic. You need to evaluate your excluded solutions before you manipulate the equation at all.

Using Taylor Series to Avoid Loss of Precision

I'm trying to use Taylor series to develop a numerically sound algorithm for solving a function. I've been at it for quite a while, but haven't had any luck yet. I'm not sure what I'm doing wrong.
The function is
f(x)=1 + x - sin(x)/ln(1+x) x~0
Also: why does loss of precision even occur in this function? when x is close to zero, sin(x)/ln(1+x) isn't even close to being the same number as x. I don't see where significance is even being lost.
In order to solve this, I believe that I will need to use the Taylor expansions for sin(x) and ln(1+x), which are
x - x^3/3! + x^5/5! - x^7/7! + ...
and
x - x^2/2 + x^3/3 - x^4/4 + ...
respectively. I have attempted to use like denominators to combine the x and sin(x)/ln(1+x) components, and even to combine all three, but nothing seems to work out correctly in the end. Any help is appreciated.
The loss of precision can come in because when x ~ 0, ln(1+x) is also close to 0, so you wind up dividing by a very small number. Computers aren't very good at that ;-)
If you use the Taylor series for ln(1+x) directly, it's going to be kind of a pain because you'll wind up dividing by an infinite series of terms. For cases like this, I usually prefer to just compute the Taylor series for the entire function as a whole from the definition:
f(x) = f(0) + f'(0) x + f''(0) x/2 + f'''(0) x/6 + ...
from which you'll get
f(x) = 2 + 3x/2 - x^2/4 - x^3/24 - x^4/240 - 23x^5/1440 + 31x^6/2880 ...
(I cheated and plugged it into Mathematica ;-) Like Steve says, this series doesn't converge all that quickly, although I can't think of a better method at the moment.
EDIT: I think I misread the question - if all you're trying to do is find the zeros of the function, there are definitely better ways than using a Taylor series.
As this is homework, I'm just going to try to give a few pointers in the right direction.
Solution 1
Rather than using the Talyor series approximation, try to simply use a root finding algorithm such as the Newton-Raphson method, linear interpolation, or interval bisection (or combine them even). They are very simple to implement, and with an appropiate choice of starting value(s), the root can converge to a precise value quite quickly.
Solution 2
If you really need to use the Taylor series approximation for whatever reason, then just expand the sin(x), ln(x), and whatever else. (Multiplying through by ln(x) to remove the denominator in your case will work). Then you'll need to use some sort of polynomial equation solver. If you want a reasonable degree of accuracy, you'll need to go beyond the 3rd or 4th powers I'd imagine, which means a simple analytical solution is not going to be easy. However, you may want to look into something like the Durand-Kerner method for solving general polynomials of any order. Still, if you need to use high-order terms this approach is just going to lead to complications, so I would definitely recommend solution 1.
Hope that helps...
I think you need to look at what happens to ln(x+1) as x -->0 and you will see why this function does not behave well near x = 0.
I haven't looked into this that closely, but you should be aware that some taylor series converge very, very slowly.
Just compute the Taylor series of f directly.
Maxima gives me (first 4 terms about x=0):
(%i1) f(x):=1 + x - sin(x)/log(1+x);
- sin(x)
(%o1) f(x) := 1 + x + ----------
log(1 + x)
(%i2) taylor(f(x),x,0,4);
2 3 4
x x x x
(%o2)/T/ - + -- + -- + --- + . . .
2 4 24 240
Method used in question is correct - just make sure your calculator is in radians mode.

Resources