Either the numpy part or the TensorFlow part had to be the source of the problem, so the TensorFlow code and the numpy code were debugged and run separately. It turned out that converting arrays with np.array() was extremely slow.
Solution:
Change numpy version
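A rough way to confirm that the slowdown really comes from the array conversion is to time np.array() in isolation; the array size below is only an illustrative assumption, not the original data.

```python
import time
import numpy as np

# Hypothetical reproduction: time np.array() on a large nested Python list.
data = [[float(i + j) for j in range(1000)] for i in range(1000)]

start = time.time()
arr = np.array(data)  # the conversion suspected to be the bottleneck
print("np.array() took %.3f s" % (time.time() - start))
print(arr.shape)
```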
TensorFlow accelerated optimization methods
1. Stochastic gradient descent (SGD)
Feed the data into the neural network in small batches, one batch per update
W += - Learning rate * dx
Disadvantages: difficult to choose the right learning rate
Slow
Easily converges to a local optimum, and can be trapped in saddle points in some cases
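A minimal numpy sketch of the SGD update rule above; the weights, gradient, and learning rate are placeholder values, not taken from any particular model.

```python
import numpy as np

def sgd_step(W, dx, learning_rate=0.01):
    """Plain SGD: move against the gradient of the current mini-batch."""
    W += -learning_rate * dx
    return W

# Toy usage with a random stand-in for a mini-batch gradient.
W = np.zeros(3)
dx = np.array([0.5, -0.2, 0.1])
W = sgd_step(W, dx)
print(W)
```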
2. momentum
Mimics the notion of momentum in physics, accumulating the previous momentum to replace the true gradient. (Exploits slope inertia)
m = b1 * m - Learning rate * dx
W += m
Features: accelerates SGD in the direction of interest, suppresses oscillations, and thus speeds up convergence
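A corresponding numpy sketch of the momentum update, using the same b1 and learning-rate names as the formulas above; the values are illustrative only.

```python
import numpy as np

def momentum_step(W, m, dx, learning_rate=0.01, b1=0.9):
    """Momentum: accumulate a velocity m and update the weights with it."""
    m = b1 * m - learning_rate * dx
    W += m
    return W, m

W, m = np.zeros(3), np.zeros(3)
dx = np.array([0.5, -0.2, 0.1])
W, m = momentum_step(W, m, dx)
```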
3. Adagrad
Each parameter gets its own learning rate (like walking in shoes that punish overly large steps)
v += dx^2
W += -Learning rate * dx / √v
Characteristics: amplifies the gradient early on, constrains it later, and is well suited to sparse gradients
Disadvantages: still relies on a manually set global learning rate, and in the mid-to-late stages the accumulated squared gradients in the denominator grow ever larger, shrinking the updates and ending training early
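A numpy sketch of the Adagrad update; the small eps term is a common stabilizer against division by zero that the formulas above omit.

```python
import numpy as np

def adagrad_step(W, v, dx, learning_rate=0.01, eps=1e-8):
    """Adagrad: per-parameter step size scaled by accumulated squared gradients."""
    v += dx ** 2
    W += -learning_rate * dx / (np.sqrt(v) + eps)  # eps avoids division by zero
    return W, v

W, v = np.zeros(3), np.zeros(3)
dx = np.array([0.5, -0.2, 0.1])
W, v = adagrad_step(W, v, dx)
```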
4. RMSProp
Combines the exponential-averaging idea behind momentum with Adagrad's per-parameter gradient scaling
v = b1 * v + (1 - b1) * dx^2
W += -Learning rate * dx / √v
Characteristics: still depends on a global learning rate
Good for non-stationary objectives, so it works well for RNNs
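A numpy sketch of the RMSProp update, again with an assumed eps stabilizer added to the denominator.

```python
import numpy as np

def rmsprop_step(W, v, dx, learning_rate=0.001, b1=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients instead of a full sum."""
    v = b1 * v + (1 - b1) * dx ** 2
    W += -learning_rate * dx / (np.sqrt(v) + eps)
    return W, v

W, v = np.zeros(3), np.zeros(3)
dx = np.array([0.5, -0.2, 0.1])
W, v = rmsprop_step(W, v, dx)
```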
5. Adam (fast and good)
m = b1 * m + (1 - b1) * dx
v = b2 * v + (1 - b2) * dx^2
W += -Learning rate * m / √v
Characteristics: combines Adagrad's ability to handle sparse gradients with RMSProp's ability to handle non-stationary objectives
Low memory requirements
Compute different adaptive learning rates for different parameters
Also suitable for most non-convex optimization problems, as well as for large datasets and high-dimensional parameter spaces
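A numpy sketch of the Adam update exactly as written above; note that the full Adam algorithm also applies bias correction to m and v, which this simplified version (and the formulas above) leaves out.

```python
import numpy as np

def adam_step(W, m, v, dx, learning_rate=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-style first moment m plus RMSProp-style second moment v."""
    m = b1 * m + (1 - b1) * dx
    v = b2 * v + (1 - b2) * dx ** 2
    W += -learning_rate * m / (np.sqrt(v) + eps)  # eps is a stabilizer not in the note
    return W, m, v

W, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
dx = np.array([0.5, -0.2, 0.1])
W, m, v = adam_step(W, m, v, dx)
```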
6. Optimizers
TensorFlow provides these update rules as optimizer classes that control how the learning rate is applied during training
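For example, assuming the TensorFlow 2.x Keras API (in TensorFlow 1.x the equivalent classes live under tf.train), an optimizer is chosen and passed to the model like this; the model itself is just a placeholder.

```python
import tensorflow as tf

# Placeholder model; the point is only how an optimizer is selected and attached.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Any of the methods above can be swapped in: SGD(momentum=0.9), Adagrad(), RMSprop(), Adam().
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss="mse")
```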