Reference: /column/c_1170719557072326656
Deconvolution is also known as transposed convolution. If you implement the convolution operation as a matrix multiplication by flattening the convolution kernel into a matrix W, then the transposed convolution left-multiplies by the transpose W^T in the forward pass and left-multiplies by W in backpropagation, which is exactly the reverse of the convolution operation. Note that deconvolution is not the inverse operation of convolution.
[Zhihu question + caffe implementation]
Uses: implementing upsampling; approximately reconstructing the input image; visualizing convolutional layers.
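A minimal sketch (assuming PyTorch; the layer sizes and shapes here are illustrative, not from the original) showing that a transposed convolution upsamples a feature map back to the input's spatial size without being the mathematical inverse of the convolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)                                    # (N, C, H, W)

conv = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)      # 16x16 -> 8x8
deconv = nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2,
                            padding=1, output_padding=1)         # 8x8 -> 16x16

y = conv(x)
x_hat = deconv(y)
print(y.shape, x_hat.shape)   # torch.Size([1, 16, 8, 8]) torch.Size([1, 8, 16, 16])
# x_hat has the same shape as x but is NOT equal to x: transposed convolution
# reverses the shape transformation, not the convolution itself.
```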
A neural network with at least one hidden layer can approximate any continuous function on a closed interval to any specified accuracy, provided that the activation function is properly chosen and the number of neurons is sufficient.
Discriminative models output either the class label directly or the class posterior probability p(y|x).
[ /question/268906476]
[ /p/40024110]
[ /p/159189617]
BN normalizes over the batch dimension, whereas GN splits the channels into groups and computes the mean and variance within each group along the channel direction.
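A minimal sketch (assuming PyTorch; the tensor sizes are made up for illustration) of which axes BN and GN compute their statistics over:

```python
import torch

x = torch.randn(4, 32, 8, 8)                 # (N, C, H, W)

# BatchNorm: one mean/variance per channel, computed over (N, H, W)
bn_mean = x.mean(dim=(0, 2, 3))              # shape (32,)

# GroupNorm with G groups: split C into groups; one mean/variance per sample
# and per group, computed over (C // G, H, W) -- independent of the batch size
G = 8
xg = x.view(4, G, 32 // G, 8, 8)
gn_mean = xg.mean(dim=(2, 3, 4))             # shape (4, 8)
```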
IoU (Intersection over Union) measures detection accuracy as the intersection of a detection with the ground truth divided by their union.
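A minimal sketch (plain Python; the box format (x1, y1, x2, y2) is an assumption) of the IoU computation described above:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 / 7 ≈ 0.1429
```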
Memory usage (GPU and CPU); the model's convergence speed, etc.
The Hessian matrix is n×n, which is very large in high dimensions; both computing and storing it are problematic.
A mini-batch that is too small can lead to slower convergence, and one that is too large can easily fall into sharp minima, which is not good for generalization.
You can think of dropout as an ensemble method: each dropout pass is equivalent to sampling a thinner sub-network from the original network.
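A minimal sketch (assuming PyTorch) of this ensemble view: in training mode each forward pass samples a different thinned sub-network, while in eval mode the full network behaves like an averaged ensemble:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(1, 4)

net.train()
print(net(x), net(x))   # different outputs: different units were dropped each pass

net.eval()
print(net(x), net(x))   # identical outputs: all units kept (activations were scaled in training)
```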
The pooling operation increases the receptive field, but some information is lost. Dilated (atrous) convolution inserts zero-weight entries into the convolution kernel, so each convolution skips some pixels;
dilated convolution increases the receptive field of each point in the convolution output without losing information the way pooling does, and is widely used in image tasks that need global information and in speech/sequence problems with long-range dependencies.
The expression is:
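The expression itself appears to be missing from the source; assuming it refers to the equivalent kernel size of a dilated convolution with kernel size $k$ and dilation rate $d$, the standard form is:

$$k_{\text{eff}} = k + (k - 1)(d - 1)$$

For example, $k = 3$ and $d = 2$ give $k_{\text{eff}} = 5$, i.e., a 5×5 receptive field from a 3×3 kernel.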
The reason for using BN: as the parameters of each layer change during training, the distribution of the inputs to every subsequent layer changes as well, so each layer has to keep adapting to a shifting input distribution; this forces the learning rate to be lowered and the initialization to be chosen carefully. This phenomenon is called internal covariate shift.
If the data were only normalized to zero mean and unit variance, the expressive power of the layer would be reduced (e.g., with a sigmoid activation only its near-linear region would be used); this is why BN adds the learnable scale and shift parameters γ and β after normalization.
The exact process of BN (note that ε is added to the denominator in the third equation):
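A reconstruction from the standard textbook BN formulation, using the usual notation over a mini-batch $\{x_1, \dots, x_m\}$ (the ε mentioned above appears in the denominator of the third equation):

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta$$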
The best explanation is that the 1×1 convolution kernel can decouple the channels, separating cross-channel correlation from spatial correlation. However, because this decoupling is not complete, subsequent group-convolution methods such as MobileNet and ShuffleNet followed.
Since a 1×1 convolution does not change the height or width, the first and most intuitive effect of changing the number of channels is that the amount of data can be increased or decreased: of the height × width × channels dimensions, only the channels dimension changes.
With a 1×1 convolution kernel, the spatial size of the feature map can be kept unchanged (i.e., no loss of resolution) while significantly increasing nonlinearity (through the nonlinear activation function applied afterwards), which allows the network to be made very deep.
Note: each filter produces one feature map after convolution; different filters (different weights and biases) produce different feature maps after convolution, extracting different features and yielding the corresponding specialized neurons.
Example: using a 1×1 convolution kernel to reduce or increase dimensionality is really a linear combination of the information across channels. A 3×3, 64-channel convolution kernel followed by a 1×1, 28-channel convolution kernel becomes in effect a 3×3, 28-channel convolution kernel: the original 64 channels are linearly combined across channels into 28 channels, which is the interaction of information between channels.
Note: the linear combination is performed only in the channel dimension; the sliding window over W and H does not mix spatial positions (the same weights are shared at every spatial location).
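A minimal sketch (assuming PyTorch; layer sizes follow the 64-to-28-channel example above) of a 1×1 convolution used for dimensionality reduction:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

conv3x3 = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # 3 -> 64 channels
conv1x1 = nn.Conv2d(64, 28, kernel_size=1)             # 64 -> 28 channels, H and W unchanged

y = conv1x1(torch.relu(conv3x3(x)))
print(y.shape)                                          # torch.Size([1, 28, 32, 32])

# The 1x1 layer only mixes channels: 64 * 28 weights + 28 biases = 1820 parameters
print(sum(p.numel() for p in conv1x1.parameters()))
```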
This does not mean the model is invalid; there are several possible reasons why a model fails to converge.
A. In practice, we should try to use Adam and avoid SGD
B. For the same initial learning rate, Adam always converges faster than SGD
C. With the same hyperparameter budget, manually tuned SGD usually achieves better results than adaptive learning-rate methods
D. For the same initial learning rate, Adam is more likely to overfit than SGD
Adam is more likely to overfit than SGD.
A. Keeps the receptive field of each layer unchanged while deepening the network, which makes the network more accurate
B. Makes the receptive field of each layer larger, which improves the ability to learn small features
C. Effectively extracts and processes high-level semantic information, which effectively improves the accuracy of the network
D. This structure effectively reduces the number of weights in the network
A. Simple to compute
B. Nonlinear
C. Has a saturation region
D. Differentiable almost everywhere
The ReLU function is not differentiable at 0.
A. Adam converges slower than RMSprop
B. Compared with optimizers such as SGD and RMSprop, Adam has the best convergence
C. For lightweight neural networks, it is better to use Adam than RMSprop
D. Compared with optimizers such as Adam and RMSprop, SGD has the best convergence
SGD usually takes longer to train and tends to get stuck at saddle points, but with a good initialization and learning-rate schedule its results are more reliable. If you care about faster convergence and need to train deeper, more complex networks, an adaptive learning-rate optimizer is recommended.
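A minimal sketch (assuming PyTorch; the model and hyperparameters are placeholders) contrasting the two choices discussed above: an adaptive optimizer for fast convergence versus SGD with momentum plus a learning-rate schedule for more reliable final results:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# Adaptive learning rate: usually converges faster with little tuning
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# SGD: often slower, but with a tuned schedule it can give more reliable results
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(sgd, step_size=30, gamma=0.1)
```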
A. Using ReLU as the activation function effectively prevents gradient explosion
B. Using Sigmoid as the activation function is more prone to gradient vanishing
C. Using Batch Normalization layer effectively prevents gradient explosion
D. Using weight decay on the parameters prevents model overfitting to a certain extent
The given answer is doubtful; arguably both gradient vanishing and gradient explosion can be prevented.
A. SGD
B. FTRL
C. RMSProp
D. L-BFGS
L-BFGS (Limited-memory BFGS) method:
All of the data participates in training, and the algorithm incorporates variance normalization and mean normalization. When training a DNN on a large dataset, the parameter count easily becomes too large. (L-BFGS is an evolved version of Newton's method: it looks for a better optimization direction so as to reduce the number of iterations.) Looking at the L-BFGS procedure, its whole core is how to quickly compute an approximation of the Hessian. The first focus is the approximation: L-BFGS uses only the most recent m updates to approximate the descent direction during the iterations. The second focus is speed: the Hessian matrix never needs to be stored; only a saved sequence of first-order derivatives is required, so the method does not need a large amount of storage and saves computational resources. The third focus is that a rank-two correction is used in the derivation to construct a positive-definite matrix, so even if the resulting direction is not the optimal descent direction, it at least guarantees that the function value decreases.
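A minimal sketch (assuming PyTorch's torch.optim.LBFGS; the toy objective is made up) showing that L-BFGS keeps only a short history of first-order information (history_size) instead of storing a full Hessian:

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.LBFGS([x], lr=0.1, history_size=10, max_iter=20)

def closure():
    # L-BFGS re-evaluates the objective several times per step, so it needs a closure
    opt.zero_grad()
    loss = (x - 3.0) ** 2
    loss.backward()
    return loss

for _ in range(5):
    opt.step(closure)
print(x.item())   # close to 3.0, the minimizer of (x - 3)^2
```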
FTRL (Follow-the-Regularized-Leader) is a common online-learning optimization algorithm for problems with a large number of sparse features, and is suitable for ultra-large-scale data. It is convenient, practical, and effective, and is commonly used to update online CTR prediction models. FTRL performs very well on convex optimization problems with non-smooth regularizers (e.g., the L1 regularizer): it not only controls model sparsity through the L1 regularizer, but also converges quickly.
A. LSTM solves, to a certain extent, the gradient vanishing and gradient explosion problems of traditional RNNs
B. One advantage of CNNs over fully connected networks is their lower model complexity, which mitigates overfitting
C. As long as the parameters are set reasonably, deep learning should perform at least as well as random algorithms
D. Stochastic gradient descent can alleviate the problem of getting stuck at saddle points during network training
In fact, there are many measures and improvements for small-object detection, as follows:
The most common is upsampling, i.e., resizing the network's input image to a larger size;
Use special convolutions such as dilated/atrous convolution to improve the detector's sensitivity to resolution (a minimal sketch follows at the end of this list). Dilated convolution is a convolution idea proposed for image semantic segmentation, where downsampling reduces image resolution and loses information: by inserting holes to enlarge the receptive field, an originally 3×3 convolution kernel obtains a 5×5 (dilation rate = 2) or larger receptive field with the same number of parameters and the same amount of computation, eliminating the need for downsampling (the kernel's receptive field is increased while the number of parameters stays the same);
A more direct approach is to make predictions independently on both shallow and deep feature maps; this is what is often referred to as the scale problem;
Use FPN-style fusion of shallow and deep features, or combine shallow and deep features at the final prediction stage;
The main idea of SNIP (Scale Normalization for Image Pyramids):
During training, and when backpropagating to update the parameters, only consider objects whose scale falls within a specified range, thereby proposing a specialized multi-scale training method.
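As mentioned in the dilated-convolution item above, here is a minimal sketch (assuming PyTorch; channel counts are illustrative) of a 3×3 kernel with dilation rate 2, which keeps the parameter count of a 3×3 kernel while covering a 5×5 receptive field:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# 3x3 kernel, dilation 2: still 9 weights per filter, but a 5x5 receptive field
dconv = nn.Conv2d(16, 16, kernel_size=3, dilation=2, padding=2)

print(dconv(x).shape)                                 # torch.Size([1, 16, 64, 64]) -- resolution preserved
print(sum(p.numel() for p in dconv.parameters()))     # 16*16*3*3 + 16 = 2320 parameters
```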