What training data do open source LLM models need for classification?

Insights from the top 10 papers

Training Data for Open Source LLM Classification

Types of Training Data

General Language Corpora

Large-scale text datasets used to build foundational language understanding:

  • Wikipedia articles
  • Web crawl data
  • Books and literature
  • News articles

Example: The LLaMA models were trained on a mixture of seven publicly available datasets comprising roughly 1.4T tokens (Jiao et al., 2023)

Task-Specific Datasets

Curated datasets for specific classification tasks:

  • Sentiment analysis corpora
  • Topic classification datasets
  • Named entity recognition data
  • Question-answering pairs

Example: The COIG dataset for Chinese instruction-following tasks (Jiao et al., 2023)
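
Public corpora of this kind are typically available through the Hugging Face datasets library; as a minimal illustration (the IMDB sentiment corpus is used here purely as an example and is not one of the cited datasets):

    from datasets import load_dataset

    # Load a public sentiment-analysis corpus; "imdb" is used purely for illustration.
    dataset = load_dataset("imdb")
    example = dataset["train"][0]
    print(example["text"][:80], "-> label:", example["label"])  # 0 = negative, 1 = positive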

Instruction-Tuning Data

Datasets designed to improve a model's ability to follow instructions:

  • Human-written instructions and responses
  • Multi-turn conversations
  • Task-specific prompts and completions

Example: The OpenAssistant dataset, a crowd-sourced conversational dataset for RLHF training (Han et al., 2023)
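
For orientation, instruction-tuning corpora are often stored as simple instruction/input/output records in JSON Lines (Alpaca-style); the record below is an illustrative sketch, not an excerpt from OpenAssistant:

    import json

    # One illustrative record in the common instruction/input/output layout.
    record = {
        "instruction": "Classify the sentiment of the following review.",
        "input": "The battery lasts two days and the screen is gorgeous.",
        "output": "positive",
    }

    # Instruction-tuning sets are commonly stored as JSON Lines, one record per line.
    with open("instructions.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")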

Data Quality and Preprocessing

Data Cleaning

  • Removing irrelevant or low-quality content
  • Handling special characters and formatting
  • Normalizing text (e.g., lowercase, removing extra whitespace)
  • Addressing potential biases or inappropriate content

Note: Careful preprocessing is crucial to avoid introducing biases or errors into the training data.
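
A minimal cleaning sketch in Python, assuming web-crawled input; the specific normalization rules (tag stripping, lowercasing, whitespace collapsing) are illustrative choices and should be adapted to the target task:

    import re
    import unicodedata

    def clean_text(text: str) -> str:
        """Apply basic normalization before adding a document to a training corpus."""
        # Normalize Unicode (e.g., full-width characters, compatibility forms)
        text = unicodedata.normalize("NFKC", text)
        # Strip HTML-like tags left over from web crawls
        text = re.sub(r"<[^>]+>", " ", text)
        # Collapse repeated whitespace, trim, and lowercase
        text = re.sub(r"\s+", " ", text).strip().lower()
        return text

    raw = "  <p>Great movie!!   Would   watch again.</p> "
    print(clean_text(raw))  # -> "great movie!! would watch again."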

Data Augmentation

  • Techniques to increase dataset size and diversity:
    • Synonym replacement
    • Back-translation
    • Text generation using existing LLMs

Example: The Panda LLM project uses up-sampling techniques on the COIG dataset to improve performance (Jiao et al., 2023)
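
As a toy sketch of synonym replacement (the synonym table and replacement probability are assumptions; in practice, synonyms may come from WordNet or be proposed by an existing LLM):

    import random

    # Toy synonym table; in practice this could come from WordNet or a teacher LLM.
    SYNONYMS = {
        "good": ["great", "decent", "solid"],
        "bad": ["poor", "terrible", "weak"],
        "movie": ["film", "picture"],
    }

    def synonym_replace(text: str, p: float = 0.3, seed: int = 0) -> str:
        """Randomly swap words for synonyms to create an augmented copy of a sentence."""
        rng = random.Random(seed)
        out = []
        for word in text.split():
            key = word.lower()
            if key in SYNONYMS and rng.random() < p:
                out.append(rng.choice(SYNONYMS[key]))
            else:
                out.append(word)
        return " ".join(out)

    print(synonym_replace("a good movie with a bad ending"))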

Domain-Specific Considerations

Medical Domain

  • Specialized medical datasets:
    • Medical exam questions (e.g., USMLE)
    • PubMed articles and abstracts
    • Electronic health records (anonymized)
    • Medical forum conversations

Example: MedAlpaca uses various medical datasets for training, including MedQA and PubMed Causal Benchmark (Han et al., 2023)

Multilingual and Cross-lingual Data

  • Datasets in multiple languages
  • Parallel corpora for translation tasks
  • Code-switched data for multilingual models

Example: The Panda LLM project incorporates Chinese-specific datasets to enhance performance on Chinese language tasks (Jiao et al., 2023)

Ethical Considerations and Bias Mitigation

Addressing Hate, Abuse, and Profanity (HAP)

  • Importance of detecting and mitigating HAP content in training data
  • Implementing HAP detectors to create civil and unbiased LLMs
  • Balancing realistic language representation against ethical concerns

Example: Research on efficient models for detecting Hate, Abuse, and Profanity in multiple languages (Tillmann et al., 2024)
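
One common pattern is to run a HAP/toxicity classifier over candidate training text and drop flagged examples. The sketch below uses the Hugging Face pipeline API; the model name and label vocabulary are assumptions and should be replaced with whichever detector is actually deployed:

    from transformers import pipeline

    # Model name and label vocabulary are assumptions; substitute the HAP/toxicity
    # detector actually used in your pipeline.
    hap_detector = pipeline("text-classification", model="unitary/toxic-bert")

    def keep_example(text: str, threshold: float = 0.5) -> bool:
        """Return False when the detector flags the text as HAP above the threshold."""
        result = hap_detector(text[:512])[0]          # truncate very long inputs
        flagged = result["label"].lower() in {"toxic", "hap", "offensive", "abusive"}
        return not (flagged and result["score"] >= threshold)

    corpus = [
        "A neutral sentence about training data.",
        "An abusive comment that should be filtered out.",
    ]
    filtered = [t for t in corpus if keep_example(t)]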

Fairness and Representation

  • Ensuring diverse representation in training data
  • Avoiding reinforcement of stereotypes or biases
  • Considering cultural and regional differences in language use

Note: Careful curation and analysis of training data are essential to promote fairness and reduce biases in the resulting models.

Data Generation Techniques

Synthetic Data Generation

  • Using existing LLMs to generate labeled training data
  • Zero-shot learning via dataset generation
  • Prompting techniques for diverse and high-quality synthetic data

Example: The Fabricator toolkit for generating labeled training data using teacher LLMs (Golde et al., 2023)
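
Fabricator wraps this generate-and-label pattern behind a toolkit interface; as a toolkit-agnostic sketch of the underlying idea (the OpenAI client, model name, label set, and prompt below are assumptions, not Fabricator's API):

    import json
    from openai import OpenAI   # any hosted or local teacher model works; OpenAI is an assumption

    client = OpenAI()
    LABELS = ["positive", "negative", "neutral"]   # assumed label set for illustration

    PROMPT = (
        "Write one short product review and classify its sentiment. "
        f"Respond as JSON with keys 'text' and 'label', where label is one of {LABELS}."
    )

    def generate_example() -> dict:
        """Ask the teacher LLM for a single labeled training example."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",                              # model name is an assumption
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,
        )
        return json.loads(response.choices[0].message.content)  # assumes the reply is valid JSON

    synthetic_dataset = [generate_example() for _ in range(100)]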

Human-in-the-Loop Data Creation

  • Combining automated generation with human verification
  • Iterative refinement of generated data
  • Ensuring data quality and relevance to the target task

Note: This approach can help balance the efficiency of automated generation with the accuracy of human annotation.

Training Strategies

Fine-tuning Approaches

  • Full fine-tuning of all model parameters
  • Parameter-efficient fine-tuning techniques (e.g., LoRA)
  • Instruction tuning for improved task performance

Example: MedAlpaca explores various fine-tuning approaches, including full fine-tuning and LoRA (Han et al., 2023)
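
A rough sketch of parameter-efficient fine-tuning for classification with LoRA via the peft library; a small encoder checkpoint is used so the example stays lightweight, and the label count and hyperparameters are assumptions rather than MedAlpaca's actual configuration (for LLaMA-style models the target modules are typically "q_proj" and "v_proj"):

    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model

    # Base checkpoint, label count, and LoRA hyperparameters are illustrative assumptions.
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=3
    )

    lora_cfg = LoraConfig(
        r=8,                                # rank of the low-rank update matrices
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_lin", "v_lin"],  # attention projections in DistilBERT
        task_type="SEQ_CLS",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()      # only a small fraction of weights are trainable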

Data Sampling and Balancing

  • Techniques to handle class imbalance in classification tasks
  • Curriculum learning approaches
  • Dynamic data sampling during training

Note: Proper data sampling and balancing can significantly impact model performance and generalization.
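
For class imbalance, one standard remedy is to oversample minority classes during batching; below is a small PyTorch sketch with a synthetic label distribution (the 900/100 split is an assumption for illustration):

    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler

    # Toy imbalanced label distribution: class 0 dominates class 1.
    labels = [0] * 900 + [1] * 100
    counts = Counter(labels)

    # Sample each example with probability inversely proportional to its class frequency.
    weights = [1.0 / counts[y] for y in labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

    dataset = list(zip(range(len(labels)), labels))   # stand-in for a real dataset
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)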

Evaluation and Benchmarking

Task-Specific Evaluation Metrics

  • Classification accuracy, F1 score, precision, recall
  • Domain-specific metrics (e.g., BLEU for translation)
  • Human evaluation for subjective tasks

Example: MedAlpaca uses USMLE exam performance as an evaluation metric for medical knowledge (Han et al., 2023)
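
Computing the standard classification metrics with scikit-learn (the toy labels and predictions below are illustrative only):

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    y_true = [0, 1, 1, 2, 2, 2, 0, 1]
    y_pred = [0, 1, 2, 2, 2, 1, 0, 1]

    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")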

Cross-Domain Generalization

  • Evaluating model performance on out-of-domain tasks
  • Zero-shot and few-shot learning capabilities
  • Assessing transfer learning potential

Note: Assessing generalization helps understand the model's robustness and applicability to diverse classification tasks.
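
Zero-shot classification ability can be probed without any task-specific fine-tuning, for example with an NLI-based zero-shot pipeline; the model, input sentence, and candidate labels below are assumptions chosen for illustration:

    from transformers import pipeline

    # NLI-based zero-shot classification; model and label set are illustrative choices.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    result = classifier(
        "The patient reports chest pain radiating to the left arm.",
        candidate_labels=["cardiology", "dermatology", "billing"],
    )
    print(result["labels"][0], round(result["scores"][0], 2))  # top label and its score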

Source Papers (10)
Pandora's White-Box: Increased Training Data Leakage in Open LLMs
Automated, LLM enabled extraction of synthesis details for reticular materials from scientific literature
Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models
MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data
Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data
Mobile-LLaMA: Instruction Fine-Tuning Open-Source LLM for Network Analysis in 5G Networks
Efficient Models for the Detection of Hate, Abuse and Profanity
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
AutoTrain: No-code training for state-of-the-art models
Automatic Instruction Optimization for Open-source LLM Instruction Tuning