Introduction

The eBay Machine Learning competition is an annual event for college students, spanning from undergraduates to PhD candidates. This year's challenge focused on Named Entity Recognition (NER), also known as Token Classification. For it, eBay assembled a dataset of 10 million German athletic-wear listings and provided participants with 5,000 entries as training data. Due to contract agreements, specific source code won't be shared, but I will discuss the general ideas. In the end, we placed 12th out of 887 teams.


Data

The training data is a TSV file formatted as shown in Figure 1. TSV is a sensible choice here: the titles may contain commas, but they never contain tabs.

Record Number | Title                | Token  | Tag
1             | Air Jordan Nike      | Air    | Produktlinie
1             | Air Jordan Nike      | Jordan |
1             | Air Jordan Nike      | Nike   | Marke
2             | Air Jordan 7000 Nike | Air    | Produktlinie
Figure 1 - Example Training Dataset

The record number only increments when a new title occurs in the dataset. A blank tag means the token is a continuation of the previous one: the two are joined with a space and share the previous token's tag. Consequently, the first two rows of record 1 together represent 'Air Jordan' with the tag 'Produktlinie.' This pattern extends to 'n' consecutive tokens without assigned tags.
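As a minimal sketch of that convention, the snippet below merges blank-tag rows back into (entity, tag) pairs. It assumes the TSV has already been read into per-record lists of (token, tag) tuples; the function name is illustrative.

```python
def merge_blank_tags(rows):
    """Merge tokens with blank tags into the preceding entity.

    rows: list of (token, tag) tuples for a single record, in order.
    Returns a list of (entity_text, tag) pairs.
    """
    entities = []
    for token, tag in rows:
        if tag == "" and entities:
            # Blank tag: this token continues the previous entity.
            prev_text, prev_tag = entities[-1]
            entities[-1] = (prev_text + " " + token, prev_tag)
        else:
            entities.append((token, tag))
    return entities

# Record 1 from Figure 1: 'Air Jordan' -> Produktlinie, 'Nike' -> Marke
print(merge_blank_tags([("Air", "Produktlinie"), ("Jordan", ""), ("Nike", "Marke")]))
```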

Another dataset-specific choice is that all commas, periods, slashes, apostrophes, etc., are treated as space-separated tokens within the title, so individual punctuation marks must be classified as well. Since the test set follows the same convention, this choice requires no special handling beyond keeping the tokenization consistent.

Example Transformation: 'That's too bad, I wish it was cheaper.' -> 'That's too bad , I wish it was cheaper .'

It's important to note that the data isn't 100% accurate due to human labeling, introducing some noise. Consequently, our modeled function may not be completely accurate either.

The test dataset had the same format, except that the 'Token' and 'Tag' columns were absent.

Data Pipeline

Since the data is not publicly available and the pipeline is specific to our needs, I'll give only a brief overview. I first converted the TSV data into a Hugging Face compatible dataset. There are numerous NER tutorials that discuss what these datasets look like; I would suggest looking at conll2003 as the gold-standard example of a Hugging Face NER dataset.
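Since I can't share the actual pipeline, here is only a minimal sketch of how a TSV like this could be turned into a conll2003-style Hugging Face dataset. The column names follow Figure 1, `tag2id` is an assumed label-to-integer mapping, and blank continuation tags would still need to be resolved per the convention described above.

```python
import pandas as pd
from datasets import Dataset

# keep_default_na=False so blank tags are read as "" rather than NaN.
df = pd.read_csv("train.tsv", sep="\t", keep_default_na=False)

records = []
for record_id, group in df.groupby("Record Number", sort=False):
    records.append({
        "id": str(record_id),
        "tokens": group["Token"].tolist(),
        # tag2id: assumed mapping from tag strings to integer class ids.
        "ner_tags": [tag2id[tag] for tag in group["Tag"].tolist()],
    })

dataset = Dataset.from_list(records)
```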

Ablation Outline

To provide a complete context for the ablation progression in this article, I began the competition with no prior NLP expertise. Consequently, it was akin to learning on the job. The subsequent ablations are presented in chronological order.

Initial Models

For the initial model training, I followed a straightforward SpaCy tutorial online, without a proper train/validation split. Instead, I let my local CPU train the model for 1.5 days until completion. The model achieved an F1 score of approximately 0.900 using a basic inference pipeline.
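I won't reproduce the tutorial, but preparing SpaCy training data looked roughly like the sketch below. `titles` is an assumed list of (tokens, tags) pairs built from the TSV, and blank continuation tags are glossed over for brevity.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("de")          # blank German pipeline, tokenizer only
db = DocBin()

for tokens, tags in titles:      # titles: assumed per-listing (tokens, tags) pairs
    doc = nlp.make_doc(" ".join(tokens))
    ents, start = [], 0
    for token, tag in zip(tokens, tags):
        end = start + len(token)
        if tag:                  # skip blank/continuation tags in this simplified sketch
            span = doc.char_span(start, end, label=tag)
            if span is not None:
                ents.append(span)
        start = end + 1          # +1 for the single space between tokens
    doc.ents = ents
    db.add(doc)

db.to_disk("train.spacy")
# Training then runs via the CLI, e.g.: python -m spacy train config.cfg --paths.train train.spacy
```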

While building the inference pipeline, I ran into an issue where SpaCy would not assign a tag to every token in the title, leaving random tokens untagged and visibly hurting the final score. I addressed this with a simple algorithm that recursively broke the text into smaller spans until every token received a tag from the pipeline; tokens that still remained untagged were labeled 'No Tag'.
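I no longer have the exact code, but the idea was roughly the following. `predict_tags` is a hypothetical wrapper around the SpaCy pipeline that returns one tag per token, with `None` for tokens the model skipped.

```python
def tag_recursively(tokens, predict_tags, depth=0, max_depth=5):
    """Assign a tag to every token, re-running the model on smaller spans
    whenever some tokens come back untagged."""
    tags = predict_tags(tokens)
    if all(t is not None for t in tags) or depth >= max_depth or len(tokens) == 1:
        return [t if t is not None else "No Tag" for t in tags]

    # Split the title in half and try each side on its own.
    mid = len(tokens) // 2
    left = tag_recursively(tokens[:mid], predict_tags, depth + 1, max_depth)
    right = tag_recursively(tokens[mid:], predict_tags, depth + 1, max_depth)
    merged = left + right

    # Keep the full-title prediction where it existed; fill gaps from the splits.
    return [orig if orig is not None else split for orig, split in zip(tags, merged)]
```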

Despite my lack of experience in German, I identified several errors in trivial examples generated by this model. To rectify this, I implemented hard-coded rules tailored to handle these specific issues. As a result, the model's performance improved to an F1 score of around 0.905.
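The rules themselves were specific to the errors I saw, so the snippet below is purely illustrative of the shape they took; the brand list is hypothetical.

```python
# Hypothetical lookup table; the real rules targeted errors observed by hand.
KNOWN_BRANDS = {"Nike", "Adidas", "Puma"}

def apply_rules(tokens, tags):
    """Overwrite model predictions with hard-coded corrections."""
    fixed = []
    for token, tag in zip(tokens, tags):
        if token in KNOWN_BRANDS:
            tag = "Marke"        # brand tokens were occasionally mis-tagged
        fixed.append(tag)
    return fixed
```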

Recognizing the limitations of the current model, particularly its inability to fit even simple patterns in the training data, I decided to upgrade to a larger model. This led me to SpaCy Transformers, which can wrap larger transformer models from Hugging Face. However, this required GPU-based training, so I migrated all training to Google Colab and let it run for approximately 3 hours.

Surprisingly, the resulting performance was essentially unchanged, indicating that these model architectures were insufficient to surpass the existing benchmarks of approximately 0.94 at that time.

Hugging Face Models

Seeking significant model improvement, I switched to models available on Hugging Face. For those unfamiliar, Hugging Face is a leading company for Open Source Models housing many cutting-edge models developed by companies such as Google and Facebook.

Unlike the previous SpaCy models, the top-tier models on Hugging Face are notably larger and have undergone extensive fine-tuning and training by these tech giants, resulting in anticipated better performance.

Initially, I conducted a sweep over the models listed below, separated into two categories: those that performed well and those that did not.

Performant Models:

Non-Performant Models:

Following these initial ablations, we conducted a more comprehensive test on the top 2 models. To ensure the validity of these tests, we employed K-Fold Cross-Validation with an 80-20 train/test split.
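A minimal sketch of how the folds could be generated with scikit-learn; `dataset` is the Hugging Face dataset built earlier, and `train_and_evaluate` is a hypothetical helper that fine-tunes one model and returns its F1 on the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 folds ~ an 80/20 split each
scores = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
    train_split = dataset.select(train_idx)
    test_split = dataset.select(test_idx)
    f1 = train_and_evaluate(train_split, test_split)  # hypothetical: fine-tune and score one fold
    scores.append(f1)

print(f"mean F1: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```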

These ablations were executed on standard free Google Colab accounts using a T4 GPU. xlm-roberta-large exhibited superior performance, roughly 0.05 better than gilbert-large. Consequently, we achieved an F1 score of ~0.950 after approximately 50 epochs of training.
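The fine-tuning itself followed the standard Hugging Face token-classification recipe. A minimal sketch, with illustrative hyperparameters and assuming the data has already been tokenized and label-aligned into `tokenized_train` / `tokenized_val`, with `label_list` holding the tag names:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

model_name = "xlm-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

args = TrainingArguments(
    output_dir="xlmr-large-ner",
    learning_rate=2e-5,                 # illustrative values, not the tuned ones
    per_device_train_batch_size=16,
    num_train_epochs=50,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    fp16=True,                          # titles are short, so this fits on a Colab T4
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```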

After hyperparameter tuning, the performance increased further to an F1 score of ~0.964.

SUPER BIG MODELS

Despite these significant improvements, we had not yet surpassed the competition's top scores of approximately 0.945 F1.

To achieve this metric, I opted to conduct ablations on the xlm-roberta-xxl, an 11 billion parameter model requiring 43 GiB. This change pushed my hardware capabilities to the maximum.

After several days of attempting to procure cloud instances from AWS and GCP, I managed to secure a p3.16xlarge instance on AWS. For anyone seeking instances with ample computational power, I recommend submitting AWS quota requests well in advance. As of 2023, obtaining superior GPUs from GCP is nearly impossible. If you prefer Google, consider using Google Colab Pro+ to ensure access to an A100 GPU.

DeepSpeed

To fit our colossal model across the instance's 8 V100 GPUs, we utilized Microsoft's DeepSpeed. DeepSpeed's ZeRO optimization shards the model's parameters, gradients, and optimizer states across the GPUs, which was necessary for the model to fit at all. Implementation merely required a DeepSpeed config file and passing it as a Hugging Face Trainer argument.
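A minimal sketch of that integration, with an illustrative ZeRO stage 3 config; the exact stage and offload settings we used may have differed.

```python
from transformers import TrainingArguments

# Illustrative ZeRO stage 3 config with CPU offload; "auto" lets the Trainer
# fill in values that must match its own arguments.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="xlmr-xxl-ner",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
    deepspeed=ds_config,  # can also be a path to a ds_config.json file
)
# Launched with something like: deepspeed --num_gpus=8 train.py
```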

Debugging with DeepSpeed posed immense challenges. A range of errors emerged, spanning GPU/CPU OOM (Out of Memory), network socket failures, and bus failures. These errors often surfaced as bare exit codes like -7 or -11, with little explanation in the DeepSpeed documentation beyond what could be pieced together from GitHub pull requests.

Training the full set of model parameters was expensive, with a single run costing approximately $250. While full fine-tuning might promise superior performance, our limited budget constrained how much we could explore dropout and other hyperparameters. A single run did not reach a satisfactory score within 25 epochs, with each epoch taking about 40 minutes.

LoRA and QLoRA

To tackle the costly training challenges, I applied LoRA to the transformer model. This approach is far cheaper and should deliver similar results.
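A minimal sketch using the PEFT library, with illustrative rank and alpha values; `model` is the token-classification model loaded as before.

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,       # keeps the token-classification head trainable
    r=64,                               # illustrative rank
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in XLM-R
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only a small fraction of weights are trained
```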

However, incorporating DeepSpeed with LoRA posed significant difficulties. Despite conflicting information online about the compatibility of LoRA with DeepSpeed, I found it unworkable. I encountered errors related to tensor dimensions that were indecipherable, and due to the limited time remaining in the competition, debugging wasn't a viable option. Before giving up on DeepSpeed on the p3.16xlarge instances, I attempted to modify the DeepSpeed source code, but it was ineffective.

To resolve these issues, I transitioned to an A100 40GB GPU using Google Colab Pro+. This cost only $50 and provided ample computing credits for training models.

Using a single GPU provided a seamless experience. LoRA worked effectively with rank matrices below 512, and I could experiment with QLoRA for larger ranks.
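For the larger ranks, a QLoRA-style setup quantizes the frozen base weights to 4-bit so they take up far less GPU memory. A rough sketch, assuming bitsandbytes is installed and with illustrative settings (`label_list` as before):

```python
import torch
from transformers import AutoModelForTokenClassification, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForTokenClassification.from_pretrained(
    "facebook/xlm-roberta-xxl",
    num_labels=len(label_list),
    quantization_config=bnb_config,   # frozen base weights stored in 4-bit
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Illustrative larger rank that would not fit with plain LoRA on this GPU.
model = get_peft_model(base, LoraConfig(task_type=TaskType.TOKEN_CLS, r=512, lora_alpha=1024))
```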

These ablations converged significantly faster to comparable performance. However, despite experimenting with rank, alpha, and dropout, we were unable to surpass our xlm-roberta-large score of 0.9364.

Reflection

While our score didn't show improvement, our larger model had untapped potential. The highest quiz score in the competition reached approximately 0.935. I believe the larger model might have needed additional regularization, data augmentation techniques, or synthetic data to enhance its performance. Moreover, exploring open-source datasets alongside eBay's data could be beneficial for our model training.

In summary, the experience was valuable as it helped me acquire Deep Learning/NLP skills. With an earlier start and more experience, achieving a more competitive score would have been feasible.


If you have any questions, feel free to reach out to me on LinkedIn or email.

Acknowledgement

I would like to thank eBay for hosting this competition and providing me with the opportunity to learn and grow. I would also like to thank Elvin Liu, who helped me throughout the competition. Without his immense contribution, we would have significantly fewer Google accounts.