# QA Dataset Sampling Tool

A comprehensive tool for sampling QA datasets and generating answers using OpenAI's GPT models. This tool helps you create high-quality question-answering datasets from large-scale collections like MS MARCO.

## Features

- **Smart Sampling**: Intelligently sample queries, documents, and relevance judgments from large datasets
- **Answer Generation**: Automatically generate high-quality answers using OpenAI's GPT models
- **Resume Support**: Continue interrupted answer generation from where it left off
- **Progress Tracking**: Real-time progress updates and statistics
- **Result Visualization**: Easy-to-read display of generated QA pairs with context

## Installation

### Prerequisites

- Python 3.7+
- OpenAI API key

### Install Dependencies

```bash
pip install pandas pyarrow openai
```

### Set Environment Variables

```bash
export OPENAI_API_KEY="your-openai-api-key"
# Optional: Use custom OpenAI endpoint
export OPENAI_BASE_URL="https://api.openai.com/v1"
```

### Prepare Dataset

We provide pre-processed samples from popular QA datasets:

- `MarkrAI/msmarco_sample_autorag`

## Quick Start

### 1. Sample Data from Large Dataset

First, sample a subset of queries, documents, and relevance judgments from your full dataset:

```bash
python dataset/qa_dataset.py sample \
  --queries ~/dataset/mmarco-queries.parquet \
  --corpus ~/dataset/mmarco-corpus.parquet \
  --qrels ~/dataset/mmarco-qrels.parquet \
  --nq 100 \
  --output_dir ./dataset/samples
```

### 2. Generate Answers

Use OpenAI's GPT model to generate answers for the sampled questions:

```bash
python dataset/qa_dataset.py generate \
  --input_dir ./dataset/samples \
  --output_dir ./dataset/samples
```

### 3. View Results

Display the generated QA pairs with their context:

```bash
python dataset/qa_dataset.py show \
  --input_dir ./dataset/samples \
  -n 5
```

## Detailed Usage

### Sample Command

Create a representative sample from your full dataset.

```bash
python dataset/qa_dataset.py sample [OPTIONS]
```

**Required Parameters:**
- `--queries`: Path to queries parquet file (columns: `id`, `text`)
- `--corpus`: Path to corpus parquet file (columns: `id`, `text`)
- `--qrels`: Path to qrels parquet file (columns: `qid`, `pid`)

**Optional Parameters:**
- `--nq`: Number of queries to sample (default: 1000)
- `--output_dir`: Output directory for sampled data (default: ./save)

**Example:**
```bash
python dataset/qa_dataset.py sample \
  --queries data/queries.parquet \
  --corpus data/corpus.parquet \
  --qrels data/qrels.parquet \
  --nq 500 \
  --output_dir ./my_sample
```
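
The sampling step can be pictured roughly as follows. This is a minimal sketch, not the tool's actual implementation: it uses plain Python records instead of pandas DataFrames so that it runs without dependencies, and the function name `sample_subset` and the fixed seed are assumptions made for illustration. It picks `nq` query IDs at random, keeps only the qrels for those queries, and then keeps only the passages those qrels reference.

```python
import random


def sample_subset(queries, corpus, qrels, nq, seed=0):
    """Sample nq queries and keep only the qrels and passages they need.

    queries and corpus are lists of {"id", "text"} records; qrels is a list
    of {"qid", "pid"} records (the column names from the input format).
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    # Only queries with at least one relevance judgment are worth sampling.
    judged = {r["qid"] for r in qrels}
    candidates = [q for q in queries if q["id"] in judged]
    picked = rng.sample(candidates, min(nq, len(candidates)))
    picked_ids = {q["id"] for q in picked}

    # Keep the judgments for the picked queries, then the passages they cite.
    sampled_qrels = [r for r in qrels if r["qid"] in picked_ids]
    needed_pids = {r["pid"] for r in sampled_qrels}
    sampled_corpus = [p for p in corpus if p["id"] in needed_pids]
    return picked, sampled_corpus, sampled_qrels


# Toy data: 10 queries, 20 passages, one relevant passage per query.
queries = [{"id": f"q{i}", "text": f"question {i}"} for i in range(10)]
corpus = [{"id": f"p{i}", "text": f"passage {i}"} for i in range(20)]
qrels = [{"qid": f"q{i}", "pid": f"p{2 * i}"} for i in range(10)]

sq, sc, sr = sample_subset(queries, corpus, qrels, nq=3)
print(len(sq), len(sc), len(sr))
```

Filtering in this order keeps the three outputs consistent with each other: every sampled qrel points at a sampled query and a sampled passage.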

### Generate Command

Generate answers for sampled questions using OpenAI API.

```bash
python dataset/qa_dataset.py generate [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing sampled data (queries.parquet, corpus.parquet, qrels.parquet)

**Optional Parameters:**
- `--output_dir`: Output directory for generated answers (default: ./save)

**Features:**
- **Resume Support**: Automatically continues from where it left off if interrupted
- **Error Handling**: Retries failed API calls up to 3 times
- **Progress Saving**: Saves progress after each successful answer generation

**Example:**
```bash
python dataset/qa_dataset.py generate \
  --input_dir ./my_sample \
  --output_dir ./my_sample
```
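
The three behaviours listed under Features (resume, retry up to 3 times, save after each success) fit together in a loop like the one below. This is a hedged sketch under stated assumptions, not the tool's code: `generate_all`, `answer_fn`, and the JSON progress file are hypothetical names and choices made for illustration (the tool itself writes parquet files), and `flaky_answer` stands in for the real API call.

```python
import json
import os
import tempfile


def generate_all(qids, answer_fn, progress_path, max_retries=3):
    """Generate one answer per query ID, resuming from progress_path if present."""
    done = {}
    if os.path.exists(progress_path):
        with open(progress_path) as f:
            done = json.load(f)  # answers saved by an earlier, interrupted run

    for qid in qids:
        if qid in done:
            continue  # resume support: skip queries that are already answered
        for attempt in range(1, max_retries + 1):
            try:
                done[qid] = answer_fn(qid)
                break
            except Exception as exc:  # a real client would catch narrower errors
                print(f"{qid}: attempt {attempt} failed ({exc})")
        else:
            continue  # every retry failed; leave this query for a later run
        # Progress saving: persist after every successful answer.
        with open(progress_path, "w") as f:
            json.dump(done, f)
    return done


calls = {"n": 0}


def flaky_answer(qid):
    """Stand-in for the API call: fails on the very first call, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated API error")
    return f"answer to {qid}"


progress_file = os.path.join(tempfile.mkdtemp(), "progress.json")
first = generate_all(["q1", "q2"], flaky_answer, progress_file)
# A second run only calls the API for q3; q1 and q2 are read from disk.
second = generate_all(["q1", "q2", "q3"], flaky_answer, progress_file)
print(sorted(second))
```

Writing the progress file after every success costs some extra I/O, but it means an interruption loses at most the answer that was in flight.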

### Show Command

Display generated QA pairs with full context.

```bash
python dataset/qa_dataset.py show [OPTIONS]
```

**Required Parameters:**
- `--input_dir`: Directory containing QA data (queries.parquet, corpus.parquet, qrels.parquet, qas.parquet, answers.parquet)

**Optional Parameters:**
- `-n`: Number of results to display (default: 5)

**Example:**
```bash
python dataset/qa_dataset.py show \
  --input_dir ./my_sample \
  -n 3
```
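
Displaying a QA pair with its context amounts to joining the five files on their ID columns. The sketch below illustrates that join with plain Python records. It is an assumption-laden illustration rather than the tool's implementation: `build_qa_views` is a hypothetical name, and the `id`/`text` columns of `answers` and the `qid`/`aid` columns of `qas` are inferred from the file descriptions in this README.

```python
def build_qa_views(queries, corpus, qrels, qas, answers, n=5):
    """Join the five tables into display-ready records for the first n QA pairs."""
    query_text = {q["id"]: q["text"] for q in queries}
    passage_text = {p["id"]: p["text"] for p in corpus}
    answer_text = {a["id"]: a["text"] for a in answers}  # assumed columns

    # Group relevant passage IDs by query ID.
    context = {}
    for r in qrels:
        context.setdefault(r["qid"], []).append(r["pid"])

    views = []
    for pair in qas[:n]:
        qid, aid = pair["qid"], pair["aid"]
        views.append({
            "question": query_text[qid],
            "context": [passage_text[pid] for pid in context.get(qid, [])],
            "answer": answer_text[aid],
        })
    return views


# Toy data with one query, two relevant passages, and one generated answer.
views = build_qa_views(
    queries=[{"id": "q1", "text": "What is MS MARCO?"}],
    corpus=[{"id": "p1", "text": "MS MARCO is a dataset."},
            {"id": "p2", "text": "It contains real user queries."}],
    qrels=[{"qid": "q1", "pid": "p1"}, {"qid": "q1", "pid": "p2"}],
    qas=[{"qid": "q1", "aid": "a1"}],
    answers=[{"id": "a1", "text": "A large-scale QA dataset."}],
    n=3,
)
for v in views:
    print(v["question"], "->", v["answer"], f"({len(v['context'])} passages)")
```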

## Input Data Format

### Queries File (queries.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique query identifier |
| text | string | The actual question text |

### Corpus File (corpus.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique passage/document identifier |
| text | string | The passage/document content |

### Qrels File (qrels.parquet)
| Column | Type | Description |
|--------|------|-------------|
| qid | string | Query ID (matches queries.id) |
| pid | string | Passage ID (matches corpus.id) |

## Output Files

After running all commands, your output directory will contain:

### Sampled Data
- `queries.parquet`: Sampled queries subset
- `corpus.parquet`: Sampled documents subset
- `qrels.parquet`: Sampled relevance judgments

### Generated Answers
- `answers.parquet`: Generated answers with unique IDs
- `qas.parquet`: Question-answer mapping (qid → aid)

## Advanced Usage

### Custom OpenAI Configuration

You can use different OpenAI models or endpoints:

```bash
# Use GPT-4 Turbo
export OPENAI_API_KEY="your-key"
python dataset/qa_dataset.py generate --input_dir ./samples

# Use Azure OpenAI
export OPENAI_API_KEY="azure-key"
export OPENAI_BASE_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4"
python dataset/qa_dataset.py generate --input_dir ./samples
```
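
For readers wondering how the two environment variables might be consumed, here is a minimal sketch. It is an assumption about a reasonable approach, not a description of what `qa_dataset.py` does: `load_openai_settings` and `DEFAULT_BASE_URL` are hypothetical names, and the fallback URL is simply the one shown in the installation section.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"  # assumed fallback endpoint


def load_openai_settings(env=None):
    """Read the API key and optional base URL from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # An unset or empty OPENAI_BASE_URL falls back to the default endpoint.
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}


# Passing a dict instead of os.environ makes the function easy to try out.
settings = load_openai_settings({"OPENAI_API_KEY": "your-key"})
print(settings["base_url"])
```

With version 1.0 or later of the `openai` Python package, the two values can be passed to the client constructor as `OpenAI(api_key=..., base_url=...)`.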
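
### Large Dataset Sampling

For very large datasets, consider sampling in batches and keeping each batch in its own directory. The required `--queries`, `--corpus`, and `--qrels` arguments are omitted below for brevity, and `--output_dir` is set explicitly for `generate` because it otherwise defaults to `./save`:

```bash
# First batch
python dataset/qa_dataset.py sample --nq 1000 --output_dir ./batch1
python dataset/qa_dataset.py generate --input_dir ./batch1 --output_dir ./batch1

# Second batch
python dataset/qa_dataset.py sample --nq 1000 --output_dir ./batch2
python dataset/qa_dataset.py generate --input_dir ./batch2 --output_dir ./batch2
```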
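
Whether two separate `sample` runs can select the same query depends on how the tool seeds its sampling, which this README does not state. If disjoint batches matter, one option is to shuffle the query IDs once and cut the result into chunks. The sketch below shows that idea; `split_into_batches` is a hypothetical helper and is not part of `qa_dataset.py`.

```python
import random


def split_into_batches(query_ids, batch_size, seed=0):
    """Shuffle the IDs once, then cut them into consecutive disjoint batches."""
    ids = list(query_ids)
    random.Random(seed).shuffle(ids)  # one shuffle, so no ID lands in two batches
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


batches = split_into_batches([f"q{i}" for i in range(2500)], batch_size=1000)
print([len(b) for b in batches])
```

Each batch of IDs could then be used to filter the queries file before running `sample` and `generate` on it.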

## Troubleshooting

### Common Issues

**1. OpenAI API Errors**
- Ensure your API key is set correctly: `echo $OPENAI_API_KEY`
- Check your API quota and billing status
- Verify network connectivity to OpenAI

**2. Memory Issues with Large Datasets**
- Reduce `--nq` parameter for smaller samples
- Ensure sufficient RAM for pandas operations
- Consider using smaller parquet files

**3. File Not Found Errors**
- Verify all input file paths are correct
- Ensure parquet files have correct column names
- Check file permissions

### Debug Mode

Enable verbose output by adding print statements or using Python debugger:

```bash
python -m pdb dataset/qa_dataset.py sample --queries ...
```
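
Several of the issues above come down to a missing file or a misnamed column, so a small pre-flight check before running `generate` can save a debugging session. The sketch below is illustrative only: `preflight` and `REQUIRED_COLUMNS` are hypothetical names, and the column names are passed in as plain lists to keep the example dependency-free. With pyarrow installed they could be read via `pyarrow.parquet.read_schema(path).names`.

```python
import os
import tempfile

# Required columns per file, taken from the Input Data Format section.
REQUIRED_COLUMNS = {
    "queries.parquet": {"id", "text"},
    "corpus.parquet": {"id", "text"},
    "qrels.parquet": {"qid", "pid"},
}


def preflight(input_dir, columns_by_file):
    """Return a list of problems: missing files or missing required columns."""
    problems = []
    for name, required in REQUIRED_COLUMNS.items():
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{path}: file not found")
            continue
        missing = required - set(columns_by_file.get(name, []))
        if missing:
            problems.append(f"{path}: missing columns {sorted(missing)}")
    return problems


# Demo: corpus.parquet is absent and qrels.parquet has a misnamed column.
demo_dir = tempfile.mkdtemp()
for name in ("queries.parquet", "qrels.parquet"):
    open(os.path.join(demo_dir, name), "w").close()  # empty placeholder files

problems = preflight(demo_dir, {
    "queries.parquet": ["id", "text"],
    "qrels.parquet": ["query_id", "pid"],
})
for problem in problems:
    print(problem)
```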

## Example Workflow

```bash
# 1. Setup environment
export OPENAI_API_KEY="sk-..."

# 2. Sample 200 queries from MS MARCO
python dataset/qa_dataset.py sample \
  --queries ~/mmarco/queries.parquet \
  --corpus ~/mmarco/corpus.parquet \
  --qrels ~/mmarco/qrels.parquet \
  --nq 200 \
  --output_dir ./marco_sample

# 3. Generate answers (may take time depending on API rate limits)
python dataset/qa_dataset.py generate \
  --input_dir ./marco_sample \
  --output_dir ./marco_sample

# 4. Review results
python dataset/qa_dataset.py show \
  --input_dir ./marco_sample \
  -n 10
```

## Contributing

Feel free to submit issues and enhancement requests!

## License

MIT License - feel free to use this tool for your research and projects.