Danqi Chen and a Tsinghua Special Scholarship alumnus release new work: breaking the training rule set by Google's BERT

2022-05-11

Xiao Xiao | QbitAI

How does a newly minted Sloan Research Fellow celebrate? Perhaps by releasing new research. On the same day the Sloan Fellowships were announced, Danqi Chen's team published its latest work. The team found that the classic "15% masking" rule for pretraining the NLP model BERT can be broken.

"15% masking" means randomly masking 15% of the words in the pretraining corpus and training the model to predict which words were masked. Chen's team found that raising the masking rate to 40% yields better performance than 15%. Beyond that, the paper also proposes new methods to further improve NLP model training at a 40% masking rate.

As one engineer at Hugging Face commented: "The funny thing about BERT is that although it is a pioneering study, its training methods are both wrong and unnecessary."

Gao Tianyu, a co-author of the paper, is also a Tsinghua Special Scholarship recipient who published four papers during his undergraduate studies. So how did the paper reach this conclusion?

"Larger models are better suited to higher masking rates"

Chen's team first verified this conclusion along three axes: masking rate, number of training steps, and model size.

They first trained NLP models with a series of different masking rates. Except on a small number of datasets, the models performed better with a 40% masking rate than with 15% on MNLI, QNLI, QQP, STS-B, SQuAD, and other benchmarks.

To examine how the number of training steps interacts with the masking rate, the authors also recorded model performance at different step counts. The results showed that as training progressed, 40% masking generally continued to outperform 15%.

Beyond that, the authors found that a 40% masking rate is better suited to training larger models: in their experiments, the large model outperformed the medium-sized NLP model at a 40% masking rate. In short, training with only 15% masking is not always best, and larger NLP models benefit more from training at 40%. The team conjectured that a harder pretraining task pushes the model to learn richer features, which is exactly the capacity that large models have.

To explore the mechanism, the authors propose a new way to analyze masking. Specifically, the masking rate is split into two quantities: the corruption rate and the prediction rate. The corruption rate is the proportion of the sentence that is corrupted; the prediction rate is the proportion of tokens the model must predict. For example, the sentence "I like to play basketball" might be corrupted into "I [MASK][MASK][MASK]" before being given to the model, while the model is only asked to predict whether the first [MASK] is "like".

In this way, the corruption rate controls the difficulty of the pretraining task, while the prediction rate controls how much supervision the model receives. Studying the corruption rate (m_corr) and prediction rate (m_pred) separately, the paper finds a new rule: a higher prediction rate leads to a better model, while a higher corruption rate leads to a worse one. This decomposition allows various pretraining tasks to be evaluated more precisely.

Finally, the authors compared several masking strategies under this framework to see which works better at higher masking rates. The results show that as the masking rate increases, simple uniform (random) masking outperforms Span Masking and PMI-Masking — even though many previous NLP models were trained with the more sophisticated Span Masking or PMI-Masking. This suggests that conclusions about pretraining large NLP models cannot be taken for granted, and that the training methods alone deserve further study.

Several authors of the paper are from Danqi Chen's group. Tianyu Gao is currently a second-year doctoral student at Princeton University.
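The uniform masking setup discussed above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's actual code: the helper name `mask_tokens` is an assumption, and real BERT-style pipelines additionally replace some selected tokens with random words or leave them unchanged rather than always using [MASK].

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, masking_rate=0.4, seed=0):
    """Randomly replace a fraction of tokens with [MASK] (uniform masking).

    BERT's original recipe used masking_rate=0.15; the paper argues that
    higher rates such as 0.4 can work better, especially for large models.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * masking_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    labels = {}  # position -> original token the model must predict
    for pos in positions:
        labels[pos] = corrupted[pos]
        corrupted[pos] = MASK
    return corrupted, labels

# 5 tokens at a 40% masking rate -> 2 tokens masked
corrupted, labels = mask_tokens("I like to play basketball".split())
```

The model's pretraining loss is then computed only at the masked positions, using `labels` as the targets.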
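The corruption-rate / prediction-rate split can likewise be illustrated with a small sketch. The function name `corrupt_and_select` and the sampling details are illustrative assumptions; the paper simply denotes the two rates m_corr and m_pred.

```python
import random

MASK = "[MASK]"

def corrupt_and_select(tokens, corruption_rate, prediction_rate, seed=0):
    """Corrupt a fraction m_corr of the tokens, but compute the loss on only
    a fraction m_pred of them (m_pred <= m_corr). This decouples the
    difficulty of the task (corruption) from the amount of supervision
    (prediction)."""
    assert prediction_rate <= corruption_rate
    rng = random.Random(seed)
    n = len(tokens)
    n_corr = max(1, round(n * corruption_rate))
    n_pred = max(1, round(n * prediction_rate))
    corrupted_pos = rng.sample(range(n), n_corr)
    predicted_pos = rng.sample(corrupted_pos, n_pred)
    corrupted = [MASK if i in corrupted_pos else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in predicted_pos}  # loss computed here only
    return corrupted, targets

# "I like to play basketball": corrupt 3 of 5 tokens, predict only 1 of them
corrupted, targets = corrupt_and_select("I like to play basketball".split(), 0.6, 0.2)
```

Under this framing, the paper's observation is that increasing `prediction_rate` helps while increasing `corruption_rate` alone hurts, which is why the two should be measured separately.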
He graduated from Tsinghua University, where he received the Tsinghua Undergraduate Special Scholarship. As an undergraduate, Gao Tianyu did research in Professor Liu Zhiyuan's group, publishing a total of four papers (two at AAAI and two at EMNLP). Co-author Alexander Wettig is a first-year PhD student at Princeton University and a graduate of the University of Cambridge; he is interested in the generalization capabilities of NLP models. Zexuan Zhong is a PhD student at Princeton University. He holds a master's degree from the University of Illinois at Urbana-Champaign, where he was advised by Tao Xie, and a bachelor's degree from the computer science department of Peking University; he also interned at Microsoft Research Asia under Nie Zaiqing. With this finding, many large NLP models may be able to achieve better results through improved training methods.