# QA Dataset Sampling Tool
A tool for sampling QA datasets and generating answers with OpenAI's GPT models. It helps you build question-answering datasets from large-scale collections such as MS MARCO.
## Features
- **Sampling**: Sample a subset of queries, documents, and relevance judgments from large datasets
- **Answer Generation**: Automatically generate answers for the sampled questions using OpenAI's GPT models
- **Resume Support**: Continue interrupted answer generation from where it left off
- **Progress Tracking**: Real-time progress updates and statistics
- **Result Visualization**: Easy-to-read display of generated QA pairs with context
## Installation
### Prerequisites
- Python 3.7+
- OpenAI API key
### Install Dependencies
```bash
pip install pandas pyarrow openai
```
### Set Environment Variables
```bash
export OPENAI_API_KEY="your-openai-api-key"
# Optional: Use custom OpenAI endpoint
export OPENAI_BASE_URL="https://api.openai.com/v1"
```
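As a quick sanity check before running the tool, the following sketch reads the two variables the way a script might. The helper name `load_openai_settings` is hypothetical, and the fallback to `https://api.openai.com/v1` when `OPENAI_BASE_URL` is unset is an assumption, not confirmed behavior of `qa_dataset.py`.

```python
import os

# Assumed fallback endpoint; the tool's actual default is not documented here.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_openai_settings(env=None):
    """Return the API key and base URL found in the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # An empty or missing OPENAI_BASE_URL falls back to the assumed default.
    base_url = env.get("OPENAI_BASE_URL", "").strip() or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

Calling `load_openai_settings()` with no argument inspects the current shell environment and raises `RuntimeError` if the key is missing.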
### Prepare Dataset
We provide pre-processed samples from popular QA datasets:
- `MarkrAI/msmarco_sample_autorag`
## Quick Start
### 1. Sample Data from Large Dataset
First, sample a subset of queries, documents, and relevance judgments from your full dataset:
```bash
python dataset/qa_dataset.py sample \
--queries ~/dataset/mmarco-queries.parquet \
--corpus ~/dataset/mmarco-corpus.parquet \
--qrels ~/dataset/mmarco-qrels.parquet \
--nq 100 \
--output_dir ./dataset/samples
```
### 2. Generate Answers
Use OpenAI's GPT model to generate answers for the sampled questions:
```bash
python dataset/qa_dataset.py generate \
--input_dir ./dataset/samples \
--output_dir ./dataset/samples
```
### 3. View Results
Display the generated QA pairs with their context:
```bash
python dataset/qa_dataset.py show \
--input_dir ./dataset/samples \
-n 5
```
## Detailed Usage
### Sample Command
Create a representative sample from your full dataset.
```bash
python dataset/qa_dataset.py sample [OPTIONS]
```
**Required Parameters:**
- `--queries`: Path to queries parquet file (columns: `id`, `text`)
- `--corpus`: Path to corpus parquet file (columns: `id`, `text`)
- `--qrels`: Path to qrels parquet file (columns: `qid`, `pid`)
**Optional Parameters:**
- `--nq`: Number of queries to sample (default: 1000)
- `--output_dir`: Output directory for sampled data (default: ./save)
**Example:**
```bash
python dataset/qa_dataset.py sample \
--queries data/queries.parquet \
--corpus data/corpus.parquet \
--qrels data/qrels.parquet \
--nq 500 \
--output_dir ./my_sample
```
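The exact sampling strategy is not described here, so the sketch below shows one plausible reading: pick `nq` queries at random, keep the qrels that point at them, then keep the passages those qrels reference. It works on plain lists of dicts that mirror the parquet columns; the function name `sample_dataset` and the `seed` parameter are hypothetical.

```python
import random


def sample_dataset(queries, corpus, qrels, nq, seed=0):
    """Sample nq queries, then keep only the qrels and passages they reference.

    queries/corpus rows have `id` and `text`; qrels rows have `qid` and `pid`.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    picked = rng.sample(queries, min(nq, len(queries)))
    picked_qids = {q["id"] for q in picked}

    # Keep only judgments that belong to a sampled query.
    kept_qrels = [r for r in qrels if r["qid"] in picked_qids]

    # Keep only passages referenced by the remaining judgments.
    kept_pids = {r["pid"] for r in kept_qrels}
    kept_corpus = [p for p in corpus if p["id"] in kept_pids]
    return picked, kept_corpus, kept_qrels


# Toy data: five queries, each judged relevant to exactly one distinct passage.
queries = [{"id": f"q{i}", "text": f"question {i}"} for i in range(5)]
corpus = [{"id": f"p{i}", "text": f"passage {i}"} for i in range(5)]
qrels = [{"qid": f"q{i}", "pid": f"p{i}"} for i in range(5)]

q, c, r = sample_dataset(queries, corpus, qrels, nq=2)
print(len(q), len(c), len(r))  # 2 2 2
```

Filtering in this order keeps the three outputs consistent with each other: every sampled qrel refers to a sampled query and a retained passage.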
### Generate Command
Generate answers for sampled questions using OpenAI API.
```bash
python dataset/qa_dataset.py generate [OPTIONS]
```
**Required Parameters:**
- `--input_dir`: Directory containing sampled data (queries.parquet, corpus.parquet, qrels.parquet)
**Optional Parameters:**
- `--output_dir`: Output directory for generated answers (default: ./save)
**Features:**
- **Resume Support**: Automatically continues from where it left off if interrupted
- **Error Handling**: Retries failed API calls up to 3 times
- **Progress Saving**: Saves progress after each successful answer generation
**Example:**
```bash
python dataset/qa_dataset.py generate \
--input_dir ./my_sample \
--output_dir ./my_sample
```
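The three behaviors listed under Features (resume, retry, save after each answer) can be sketched as a single loop. This is an illustration under assumptions, not the tool's code: `generate_answer` stands in for the OpenAI call, and progress is kept in a hypothetical JSON file rather than in the parquet outputs.

```python
import json
import os
import time


def generate_with_resume(questions, generate_answer, progress_path,
                         max_retries=3, retry_delay=1.0):
    """Generate one answer per question, skipping questions already answered.

    questions: dict mapping qid -> question text.
    generate_answer: callable taking the question text (stand-in for the API call).
    """
    answers = {}
    if os.path.exists(progress_path):
        with open(progress_path, encoding="utf-8") as f:
            answers = json.load(f)  # resume: reuse answers from an earlier run

    for qid, text in questions.items():
        if qid in answers:
            continue  # already answered in a previous run
        for attempt in range(1, max_retries + 1):
            try:
                answers[qid] = generate_answer(text)
                break
            except Exception as err:  # a real client would catch narrower API errors
                print(f"{qid}: attempt {attempt}/{max_retries} failed: {err}")
                time.sleep(retry_delay)
        else:
            continue  # every retry failed; leave unanswered so a later run tries again
        # Save after each successful answer so an interruption loses nothing.
        with open(progress_path, "w", encoding="utf-8") as f:
            json.dump(answers, f)
    return answers
```

Because the progress file is rewritten after every success, rerunning the function with the same `progress_path` only calls `generate_answer` for questions that are still missing.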
### Show Command
Display generated QA pairs with full context.
```bash
python dataset/qa_dataset.py show [OPTIONS]
```
**Required Parameters:**
- `--input_dir`: Directory containing QA data (queries.parquet, corpus.parquet, qrels.parquet, qas.parquet, answers.parquet)
**Optional Parameters:**
- `-n`: Number of results to display (default: 5)
**Example:**
```bash
python dataset/qa_dataset.py show \
--input_dir ./my_sample \
-n 3
```
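To make "full context" concrete, here is a rough sketch of how the five files could be joined for display. It assumes that `answers.parquet` rows carry `id` and `text` and that `qas.parquet` rows carry `qid` and `aid`; the function name `build_qa_views` is hypothetical and the real output layout may differ.

```python
def build_qa_views(queries, corpus, qrels, qas, answers, n=5):
    """Join the five tables into display records for the first n questions.

    All inputs are lists of dicts that mirror the parquet columns.
    """
    passage_by_id = {p["id"]: p["text"] for p in corpus}
    answer_by_id = {a["id"]: a["text"] for a in answers}
    aid_by_qid = {m["qid"]: m["aid"] for m in qas}

    views = []
    for q in queries[:n]:
        # Context = text of every passage judged relevant to this query.
        context = [
            passage_by_id[r["pid"]]
            for r in qrels
            if r["qid"] == q["id"] and r["pid"] in passage_by_id
        ]
        views.append({
            "question": q["text"],
            "context": context,
            # None when no answer has been generated for this query yet.
            "answer": answer_by_id.get(aid_by_qid.get(q["id"])),
        })
    return views
```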
## Input Data Format
### Queries File (queries.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique query identifier |
| text | string | The actual question text |
### Corpus File (corpus.parquet)
| Column | Type | Description |
|--------|------|-------------|
| id | string | Unique passage/document identifier |
| text | string | The passage/document content |
### Qrels File (qrels.parquet)
| Column | Type | Description |
|--------|------|-------------|
| qid | string | Query ID (matches queries.id) |
| pid | string | Passage ID (matches corpus.id) |
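Before sampling, it can help to confirm that your files follow this schema. The sketch below checks the required columns and that every qrel points at an existing query and passage. It operates on lists of dicts (for example, rows obtained with `pandas.read_parquet(path).to_dict("records")`); the helper name `check_inputs` is hypothetical and is not part of the tool.

```python
def check_inputs(queries, corpus, qrels):
    """Return a list of human-readable problems; an empty list means the data is consistent."""
    problems = []
    tables = (
        ("queries", queries, ("id", "text")),
        ("corpus", corpus, ("id", "text")),
        ("qrels", qrels, ("qid", "pid")),
    )
    # 1. Every row must carry the columns listed in the tables above.
    for name, rows, columns in tables:
        for index, row in enumerate(rows):
            missing = [c for c in columns if c not in row]
            if missing:
                problems.append(f"{name} row {index}: missing {missing}")

    # 2. Every qrel must reference a known query and a known passage.
    query_ids = {q["id"] for q in queries if "id" in q}
    passage_ids = {p["id"] for p in corpus if "id" in p}
    for index, rel in enumerate(qrels):
        if "qid" in rel and rel["qid"] not in query_ids:
            problems.append(f"qrels row {index}: unknown qid {rel['qid']!r}")
        if "pid" in rel and rel["pid"] not in passage_ids:
            problems.append(f"qrels row {index}: unknown pid {rel['pid']!r}")
    return problems
```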
## Output Files
After running all commands, your output directory will contain:
### Sampled Data
- `queries.parquet`: Sampled queries subset
- `corpus.parquet`: Sampled documents subset
- `qrels.parquet`: Sampled relevance judgments
### Generated Answers
- `answers.parquet`: Generated answers with unique IDs
- `qas.parquet`: Question-answer mapping (qid → aid)
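The relationship between the two generated files can be illustrated as follows. This is a sketch under assumptions: the column names `id` and `text` for answers, and the use of UUIDs as answer IDs, are guesses for illustration; only the `qid` → `aid` mapping is stated above.

```python
import uuid


def build_answer_tables(generated):
    """Turn a {qid: answer text} dict into rows for answers and qas.

    Returns (answers_rows, qas_rows) as lists of dicts.
    """
    answers_rows, qas_rows = [], []
    for qid, text in generated.items():
        aid = str(uuid.uuid4())  # assumed ID scheme; any unique string would do
        answers_rows.append({"id": aid, "text": text})
        qas_rows.append({"qid": qid, "aid": aid})
    return answers_rows, qas_rows
```

Keeping the mapping in its own table means an answer can be looked up from a query ID without storing the query ID inside the answer row.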
## Advanced Usage
### Custom OpenAI Configuration
You can use different OpenAI models or endpoints:
```bash
# Use the default OpenAI endpoint
export OPENAI_API_KEY="your-key"
python dataset/qa_dataset.py generate --input_dir ./samples
# Use Azure OpenAI
export OPENAI_API_KEY="azure-key"
export OPENAI_BASE_URL="https://your-resource.openai.azure.com/openai/deployments/gpt-4"
python dataset/qa_dataset.py generate --input_dir ./samples
```
### Large Dataset Sampling
For very large datasets, consider sampling in batches:
```bash
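# First batch
python dataset/qa_dataset.py sample --queries data/queries.parquet --corpus data/corpus.parquet \
--qrels data/qrels.parquet --nq 1000 --output_dir ./batch1
python dataset/qa_dataset.py generate --input_dir ./batch1 --output_dir ./batch1
# Second batch
python dataset/qa_dataset.py sample --queries data/queries.parquet --corpus data/corpus.parquet \
--qrels data/qrels.parquet --nq 1000 --output_dir ./batch2
python dataset/qa_dataset.py generate --input_dir ./batch2 --output_dir ./batch2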
```
## Troubleshooting
### Common Issues
**1. OpenAI API Errors**
- Ensure your API key is set correctly: `echo $OPENAI_API_KEY`
- Check your API quota and billing status
- Verify network connectivity to OpenAI
**2. Memory Issues with Large Datasets**
- Reduce `--nq` parameter for smaller samples
- Ensure sufficient RAM for pandas operations
- Consider using smaller parquet files
**3. File Not Found Errors**
- Verify all input file paths are correct
- Ensure parquet files have correct column names
- Check file permissions
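A small pre-flight check can rule out path problems before a long run. The sketch below lists which of the expected sample files are absent from a directory; the helper name `missing_files` is hypothetical and not part of `qa_dataset.py`.

```python
from pathlib import Path

# Files the generate command expects in --input_dir, per the sections above.
SAMPLE_FILES = ("queries.parquet", "corpus.parquet", "qrels.parquet")


def missing_files(input_dir, expected=SAMPLE_FILES):
    """Return the expected file names that are not present in input_dir."""
    base = Path(input_dir).expanduser()  # expand a leading ~ like the shell would
    return [name for name in expected if not (base / name).is_file()]
```

For example, `missing_files("./my_sample")` returns an empty list when all three sample files are in place.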
### Debug Mode
For more detail on a failure, add print statements or run the script under the Python debugger:
```bash
python -m pdb dataset/qa_dataset.py sample --queries ...
```
## Example Workflow
```bash
# 1. Setup environment
export OPENAI_API_KEY="sk-..."
# 2. Sample 200 queries from MS MARCO
python dataset/qa_dataset.py sample \
--queries ~/mmarco/queries.parquet \
--corpus ~/mmarco/corpus.parquet \
--qrels ~/mmarco/qrels.parquet \
--nq 200 \
--output_dir ./marco_sample
# 3. Generate answers (may take time depending on API rate limits)
python dataset/qa_dataset.py generate \
--input_dir ./marco_sample \
--output_dir ./marco_sample
# 4. Review results
python dataset/qa_dataset.py show \
--input_dir ./marco_sample \
-n 10
```
## Contributing
Feel free to submit issues and enhancement requests!
## License
MIT License - feel free to use this tool for your research and projects.