Why do LSTMs solve the vanishing gradient problem when f_t is multiplied many times while backpropagating through time?
I am currently working on an assignment about LSTMs and want readers to understand why we use them at all. I can explain why the vanishing / exploding gradient problem occurs in vanilla RNNs. But after working through the math of BPTT for LSTMs, I am left with a term that multiplies the value of the forget gate many times.
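
To make that term concrete, here is roughly where I think it comes from, assuming the standard cell-state update and treating the gates as constants with respect to $c_{t-1}$:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad \frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t),$$

so backpropagating the cell state over $T - k$ steps gives approximately

$$\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t).$$

Since each forget-gate activation $f_t$ lies in $(0, 1)$, this product looks like it should shrink over long horizons, just like the repeated recurrent-weight factor in a vanilla RNN, which is exactly what confuses me.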