Text vectorization techniques, which turn text into dense numerical representations, vary widely, having evolved from character bigrams to advanced subword vectorization to combat out-of-vocabulary (OOV) challenges such as adversarial attacks and typos.
This strategy includes subword-level tokenization and the decomposition of unknown words into n-grams so that neural networks can be trained effectively.
Researchers at Google recently announced a new resilient and efficient text vectorizer called RETVec that protects Gmail users from malicious emails and spam.
RETVec
RETVec (Resilient and Efficient Text Vectorizer) is an efficient, multilingual, next-generation text vectorizer with built-in adversarial resilience. It is robust to character-level manipulations such as:
- insertion
- deletion
- typos
- homoglyphs
- LEET substitution
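To make these manipulation classes concrete, here is a small pure-Python sketch that produces one adversarial variant of a word per class. The word and the substitution choices are illustrative examples, not taken from RETVec itself:

```python
# Illustrative examples of the character-level manipulations listed above.
# The word and the substitutions are made up for demonstration purposes.
word = "paypal"

insertion = word[:3] + "x" + word[3:]            # "payxpal"
deletion = word[:2] + word[3:]                   # "papal"
typo = word.replace("y", "t", 1)                 # "patpal" (adjacent-key slip)
homoglyph = word.replace("a", "\u0430")          # Cyrillic 'а' looks like Latin 'a'
leet = word.replace("a", "4").replace("l", "1")  # "p4yp41"

for name, variant in [("insertion", insertion), ("deletion", deletion),
                      ("typo", typo), ("homoglyph", homoglyph), ("leet", leet)]:
    print(f"{name:10s} {variant}  (equal to original: {variant == word})")
```

All five variants read like the original word to a human, yet none compares equal as a string, which is exactly what character-level adversarial attacks exploit.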
The RETVec character encoder consists of two layers, described below:
- an integerizer layer
- a binarizer layer
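The idea behind the two layers can be sketched in a few lines of pure Python: the integerizer maps each character to its Unicode code point, and the binarizer expands each code point into a fixed-width bit vector. The 24-bit width is an assumption chosen here because it covers the full Unicode range (code points stay below 2**24); RETVec's actual encoding layout may differ:

```python
# Sketch of the two-step character-encoding idea described above.
# BITS = 24 is an assumption: Unicode code points go up to 0x10FFFF < 2**24.
BITS = 24

def integerize(text: str) -> list[int]:
    """Integerizer: map each character to its Unicode code point."""
    return [ord(ch) for ch in text]

def binarize(code_points: list[int]) -> list[list[float]]:
    """Binarizer: expand each code point into a fixed-width bit vector (LSB first)."""
    return [[float((cp >> i) & 1) for i in range(BITS)] for cp in code_points]

codes = integerize("héllo")  # works for any UTF-8 text, no vocabulary needed
bits = binarize(codes)
print(codes)                 # [104, 233, 108, 108, 111]
print(len(bits), len(bits[0]))  # 5 characters, 24 bits each
```

Because every possible character maps to a code point, this scheme has no out-of-vocabulary tokens by construction, which is why no lookup table or fixed vocabulary is needed.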
RETVec uses its own character encoder to handle UTF-8 efficiently. It easily supports over 100 languages without lookup tables or fixed vocabularies. And because it is implemented as a layer, it fits seamlessly into any TensorFlow model without additional preprocessing.
On its own, the RETVec binarizer produces a useful but not competitive word representation. The researchers therefore pair it with a small model to increase accuracy, which allows RETVec to outperform other vectorizers.
TensorFlow models can employ RETVec for string vectorization in just one line. Raw strings are handled directly, with preprocessing built in.
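As a sketch of what that one-line integration can look like, the snippet below builds a small Keras classifier over raw strings. The layer and argument names (`RETVecTokenizer`, `model="retvec-v1"`) follow the public retvec package; treat them as assumptions and check the project repository for the current API:

```python
import tensorflow as tf
from tensorflow.keras import layers
from retvec.tf import RETVecTokenizer  # assumed import path from the retvec package

# Raw strings go straight into the model; RETVec handles preprocessing internally.
inputs = layers.Input(shape=(1,), dtype=tf.string)
x = RETVecTokenizer(model="retvec-v1")(inputs)   # the one-line vectorization step
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```

No separate tokenizer artifact or vocabulary file has to be shipped alongside the model, since the vectorization lives inside the graph itself.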
Additionally, the system is fully functional for on-device mobile and web use cases.
Researchers tested RETVec against adversarial content using Google's spam filters. Replacing SentencePiece with RETVec improved spam detection by 38% at a 0.80% false-positive rate and reduced latency by 30%.
This suggests that RETVec is competitive for real-world tasks, increasing confidence in its effectiveness.
How to optimize RETVec to improve multilingual capability, robustness, and model size within large language models (LLMs) is an important open question. For small LLMs, the vocabulary layer can exceed 20% of the parameters, and RETVec eliminates it.
However, using RETVec in a generative model poses a challenge because its 256-dimensional floating-point embeddings do not translate directly to a softmax output. New training methods compatible with text generation are needed.
Experiments with character-by-character decoding and VQ-VAE models have so far yielded inconclusive results. Future work will address these limitations and explore the use of RETVec as a word embedding, replacing word2vec or GloVe, and the use of its character encoder to train a text similarity model.
Installation
You can use pip to install the latest TensorFlow version of RETVec.
RETVec has been tested with TensorFlow 2.6 and later and Python 3.8 and later.
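A minimal install command, assuming the package is published on PyPI under the name retvec (as used by the project's repository):

```shell
pip install retvec
```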